re.sub()

Python re.sub() Method Explained

The re.sub() method is used for searching and replacing text patterns in strings. It’s one of the most powerful regex methods for text processing.

Syntax

python

re.sub(pattern, repl, string, count=0, flags=0)
  • pattern: The regex pattern to search for
  • repl: The replacement string or function
  • string: The input text to process
  • count: Maximum number of replacements (0 = all)
  • flags: Optional modifiers

Example 1: Basic Text Replacement

python

import re

text = "The color of the sky is blue. My favorite color is blue too."

# Replace all occurrences of 'color' with 'colour' (British English)
result = re.sub(r'color', 'colour', text)
print("Basic replacement:")
print(result)

# Replace only the first occurrence
result_first = re.sub(r'color', 'colour', text, count=1)
print("\nFirst occurrence only:")
print(result_first)

# Case-insensitive replacement
text2 = "Color, COLOR, color, CoLoR"
result_ci = re.sub(r'color', 'colour', text2, flags=re.IGNORECASE)
print("\nCase-insensitive replacement:")
print(result_ci)

Output:

text

Basic replacement:
The colour of the sky is blue. My favourite colour is blue too.

First occurrence only:
The colour of the sky is blue. My favorite color is blue too.

Case-insensitive replacement:
colour, colour, colour, colour

Example 2: Advanced Replacement with Backreferences

python

import re

# Reformat date from MM/DD/YYYY to YYYY-MM-DD
dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
print("Date reformatting:")
print(result)



"""
Explaining re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
This regex pattern is used to reformat dates from MM/DD/YYYY format to YYYY-MM-DD format. Let me break it down:

The Search Pattern: r'(\d{2})/(\d{2})/(\d{4})'
1. (\d{2}) - First Capture Group (Month)
\d{2} = exactly 2 digits

( ) = capture this group

Matches: Month (e.g., "12", "01", "06")

2. / - Literal slash
Matches the slash between month and day

3. (\d{2}) - Second Capture Group (Day)
\d{2} = exactly 2 digits

( ) = capture this group

Matches: Day (e.g., "25", "15", "30")

4. / - Literal slash
Matches the slash between day and year

5. (\d{4}) - Third Capture Group (Year)
\d{4} = exactly 4 digits

( ) = capture this group

Matches: Year (e.g., "2023", "2024")

The Replacement Pattern: r'\3-\1-\2'
This is where the magic happens! The backreferences rearrange the captured groups:

\3 = Third capture group (Year)

- = Literal hyphen

\1 = First capture group (Month)

- = Literal hyphen

\2 = Second capture group (Day)

So: Year + hyphen + Month + hyphen + Day

Example:
python
import re

dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)

print("Before:", dates)
print("After: ", result)
Output:

text
Before: 12/25/2023, 01/15/2024, 06/30/2023
After: 2023-12-25, 2024-01-15, 2023-06-30
Step-by-step Transformation:
For the date "12/25/2023":

Capture groups:

Group 1 (\1) = "12" (Month)

Group 2 (\2) = "25" (Day)

Group 3 (\3) = "2023" (Year)

Replacement: \3-\1-\2 becomes "2023-12-25"

Visual Breakdown:
text
Original: 12 / 25 / 2023
↑ ↑ ↑ ↑ ↑
\1 / \2 / \3

Replacement: \3 - \1 - \2
↓ ↓ ↓ ↓ ↓
2023 - 12 - 25
Key Points:
Capture groups (parentheses) allow you to "remember" parts of the match

Backreferences (\1, \2, \3) let you reuse captured groups in the replacement

Order matters: The replacement pattern \3-\1-\2 changes the order from MM/DD/YYYY to YYYY-MM-DD

ISO 8601 format: This converts to the international standard date format

This pattern is extremely useful for data cleaning and standardizing date formats across different systems!
"""








# Mask credit card numbers (show only last 4 digits)
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
print("\nCredit card masking:")
print(masked)




"""
Explaining Credit Card Masking with re.sub()
This code is used to mask credit card numbers by showing only the last 4 digits while hiding the rest. Let me break it down:

The Code:
python
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
The Search Pattern: r'\d{4}-\d{4}-\d{4}-(\d{4})'
1. \d{4}- × 3
\d{4} = exactly 4 digits

- = literal hyphen

Matches: First three groups of 4 digits + hyphens

Example: "4111-1111-1111-"

2. (\d{4}) - Capture Group (Last 4 digits)
\d{4} = exactly 4 digits

( ) = capture this group

Matches: The last 4 digits of the card number

Example: "1111" (captured as group \1)

The Replacement Pattern: r'XXXX-XXXX-XXXX-\1'
This replaces the matched card number with a masked version:

XXXX-XXXX-XXXX- = Literal text to hide the first 12 digits

\1 = First capture group (the last 4 digits)

Example:
python
import re

cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)

print("Original:", cards)
print("Masked: ", masked)
Output:

text
Original: Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004
Masked: Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004
Step-by-step Transformation:
For the card number "4111-1111-1111-1111":

Pattern matches: "4111-1111-1111-1111"

Capture group: (\d{4}) captures "1111" (last 4 digits)

Replacement: XXXX-XXXX-XXXX-\1 becomes "XXXX-XXXX-XXXX-1111"

Visual Breakdown:
text
Original: 4111 - 1111 - 1111 - 1111
├──┼┼ ┼──┼┼ ┼──┼┼ ┼──┤
│ ││ │ ││ │ ││ │ └── Captured (\1)
│ ││ │ ││ │ ││ └── Not captured
│ ││ │ ││ │ │└── Not captured
│ ││ │ ││ │ └── Not captured
└──┴┴ ┴──┴┴ ┴──┴┴ ┴── All matched by pattern

Replacement: XXXX - XXXX - XXXX - \1
↓↓↓↓ ↓↓↓↓ ↓↓↓↓ ↓↓↓↓
XXXX XXXX XXXX 1111
Why This is Useful:
Security: Protects sensitive credit card information

PCI Compliance: Payment Card Industry standards require masking card numbers

User Experience: Shows enough digits for identification while protecting privacy

Data Display: Safe for logging, debugging, or displaying in user interfaces

Alternative Pattern:
You could also use this more flexible pattern that handles cards with or without hyphens:

python
masked = re.sub(r'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
This pattern is essential for any application that handles payment information while maintaining security standards!
"""












# Capitalize first letter of each word
text = "hello world! this is python programming."
capitalized = re.sub(r'\b(\w)', lambda match: match.group(1).upper(), text)
print("\nCapitalize words:")
print(capitalized)

Output:

text

Date reformatting:
2023-12-25, 2024-01-15, 2023-06-30

Credit card masking:
Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004

Capitalize words:
Hello World! This Is Python Programming.

Example 3: Using Functions for Dynamic Replacement

python

import re

# Increment all numbers in text by 1
text = "I have 5 apples, 3 oranges, and 10 bananas."

def increment_number(match):
    number = int(match.group())
    return str(number + 1)

result = re.sub(r'\d+', increment_number, text)
print("Increment numbers:")
print(result)

# Convert temperatures from Fahrenheit to Celsius
temp_text = "Temperatures: 32°F, 68°F, 100°F, 212°F"

def f_to_c(match):
    f_temp = int(match.group(1))
    c_temp = (f_temp - 32) * 5/9
    return f"{c_temp:.1f}°C"

result = re.sub(r'(\d+)°F', f_to_c, temp_text)
print("\nTemperature conversion:")
print(result)

# Censor bad words
text = "This is a damn bad example with crap words."
bad_words = ['damn', 'crap', 'bad']

def censor_word(match):
    word = match.group()
    return word[0] + '*' * (len(word) - 1)

result = re.sub(r'\b(' + '|'.join(bad_words) + r')\b', censor_word, text, flags=re.IGNORECASE)
print("\nWord censoring:")
print(result)

Output:

text

Increment numbers:
I have 6 apples, 4 oranges, and 11 bananas.

Temperature conversion:
Temperatures: 0.0°C, 20.0°C, 37.8°C, 100.0°C

Word censoring:
This is a d***n b** example with c*** words.

Key Features of re.sub():

  1. Pattern matching: Use regex patterns for complex search criteria
  2. Backreferences: Use \1\2, etc., to reference captured groups
  3. Function replacement: Use functions for dynamic replacements
  4. Count control: Limit number of replacements with count parameter
  5. Flags support: Use flags like re.IGNORECASEre.MULTILINE

Common Use Cases:

  • Data cleaning and formatting
  • Text normalization
  • Pattern-based text transformation
  • Data masking and anonymization
  • Template processing

The re.sub() method is incredibly versatile for any text processing task that requires pattern-based search and replace operations!

Similar Posts

  • Demo And course Content

    What is Python? Python is a high-level, interpreted, and general-purpose programming language known for its simplicity and readability. It supports multiple programming paradigms, including: Python’s design philosophy emphasizes code readability (using indentation instead of braces) and developer productivity. History of Python Fun Fact: Python is named after Monty Python’s Flying Circus (a British comedy show), not the snake! 🐍🎭 Top Career Paths After Learning Core Python 🐍…

  • Static Methods

    The primary use of a static method in Python classes is to define a function that logically belongs to the class but doesn’t need access to the instance’s data (like self) or the class’s state (like cls). They are essentially regular functions that are grouped within a class namespace. Key Characteristics and Use Cases General…

  • Positional-Only Arguments in Python

    Positional-Only Arguments in Python Positional-only arguments are function parameters that must be passed by position (order) and cannot be passed by keyword name. Syntax Use the / symbol in the function definition to indicate that all parameters before it are positional-only: python def function_name(param1, param2, /, param3, param4): # function body Simple Examples Example 1: Basic Positional-Only Arguments python def calculate_area(length,…

  • Predefined Character Classes

    Predefined Character Classes Pattern Description Equivalent . Matches any character except newline \d Matches any digit [0-9] \D Matches any non-digit [^0-9] \w Matches any word character [a-zA-Z0-9_] \W Matches any non-word character [^a-zA-Z0-9_] \s Matches any whitespace character [ \t\n\r\f\v] \S Matches any non-whitespace character [^ \t\n\r\f\v] 1. Literal Character a Matches: The exact character…

  • Python Variables: A Complete Guide with Interview Q&A

    Here’s a detailed set of notes on Python variables that you can use to explain the concept to your students. These notes are structured to make it easy for beginners to understand. Python Variables: Notes for Students 1. What is a Variable? 2. Rules for Naming Variables Python has specific rules for naming variables: 3….

Leave a Reply

Your email address will not be published. Required fields are marked *