Python re.sub() Method Explained

The re.sub() method is used for searching and replacing text patterns in strings. It’s one of the most powerful regex methods for text processing.

Syntax

python

re.sub(pattern, repl, string, count=0, flags=0)

pattern: The regex pattern to search for
repl: The replacement string or function
string: The input text to process
count: Maximum number of replacements (0 = all)
flags: Optional modifiers

Example 1: Basic Text Replacement

python

import re

text = "The color of the sky is blue. My favorite color is blue too."

# Replace all occurrences of 'color' with 'colour' (British English)
result = re.sub(r'color', 'colour', text)
print("Basic replacement:")
print(result)

# Replace only the first occurrence
result_first = re.sub(r'color', 'colour', text, count=1)
print("\nFirst occurrence only:")
print(result_first)

# Case-insensitive replacement
text2 = "Color, COLOR, color, CoLoR"
result_ci = re.sub(r'color', 'colour', text2, flags=re.IGNORECASE)
print("\nCase-insensitive replacement:")
print(result_ci)

Output:

text

Basic replacement:
The colour of the sky is blue. My favourite colour is blue too.

First occurrence only:
The colour of the sky is blue. My favorite color is blue too.

Case-insensitive replacement:
colour, colour, colour, colour

Example 2: Advanced Replacement with Backreferences

python

import re

# Reformat date from MM/DD/YYYY to YYYY-MM-DD
dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
print("Date reformatting:")
print(result)



"""
Explaining re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
This regex pattern is used to reformat dates from MM/DD/YYYY format to YYYY-MM-DD format. Let me break it down:

The Search Pattern: r'(\d{2})/(\d{2})/(\d{4})'
1. (\d{2}) - First Capture Group (Month)
\d{2} = exactly 2 digits

( ) = capture this group

Matches: Month (e.g., "12", "01", "06")

2. / - Literal slash
Matches the slash between month and day

3. (\d{2}) - Second Capture Group (Day)
\d{2} = exactly 2 digits

( ) = capture this group

Matches: Day (e.g., "25", "15", "30")

4. / - Literal slash
Matches the slash between day and year

5. (\d{4}) - Third Capture Group (Year)
\d{4} = exactly 4 digits

( ) = capture this group

Matches: Year (e.g., "2023", "2024")

The Replacement Pattern: r'\3-\1-\2'
This is where the magic happens! The backreferences rearrange the captured groups:

\3 = Third capture group (Year)

- = Literal hyphen

\1 = First capture group (Month)

- = Literal hyphen

\2 = Second capture group (Day)

So: Year + hyphen + Month + hyphen + Day

Example:
python
import re

dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)

print("Before:", dates)
print("After: ", result)
Output:

text
Before: 12/25/2023, 01/15/2024, 06/30/2023
After:  2023-12-25, 2024-01-15, 2023-06-30
Step-by-step Transformation:
For the date "12/25/2023":

Capture groups:

Group 1 (\1) = "12" (Month)

Group 2 (\2) = "25" (Day)

Group 3 (\3) = "2023" (Year)

Replacement: \3-\1-\2 becomes "2023-12-25"

Visual Breakdown:
text
Original:  12   /  25   /  2023
           ↑    ↑   ↑    ↑   ↑
           \1   /  \2   /  \3

Replacement: \3    -   \1    -   \2
             ↓     ↓   ↓     ↓   ↓
            2023   -   12    -   25
Key Points:
Capture groups (parentheses) allow you to "remember" parts of the match

Backreferences (\1, \2, \3) let you reuse captured groups in the replacement

Order matters: The replacement pattern \3-\1-\2 changes the order from MM/DD/YYYY to YYYY-MM-DD

ISO 8601 format: This converts to the international standard date format

This pattern is extremely useful for data cleaning and standardizing date formats across different systems!
"""








# Mask credit card numbers (show only last 4 digits)
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
print("\nCredit card masking:")
print(masked)




"""
Explaining Credit Card Masking with re.sub()
This code is used to mask credit card numbers by showing only the last 4 digits while hiding the rest. Let me break it down:

The Code:
python
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
The Search Pattern: r'\d{4}-\d{4}-\d{4}-(\d{4})'
1. \d{4}- × 3
\d{4} = exactly 4 digits

- = literal hyphen

Matches: First three groups of 4 digits + hyphens

Example: "4111-1111-1111-"

2. (\d{4}) - Capture Group (Last 4 digits)
\d{4} = exactly 4 digits

( ) = capture this group

Matches: The last 4 digits of the card number

Example: "1111" (captured as group \1)

The Replacement Pattern: r'XXXX-XXXX-XXXX-\1'
This replaces the matched card number with a masked version:

XXXX-XXXX-XXXX- = Literal text to hide the first 12 digits

\1 = First capture group (the last 4 digits)

Example:
python
import re

cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)

print("Original:", cards)
print("Masked:  ", masked)
Output:

text
Original: Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004
Masked:   Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004
Step-by-step Transformation:
For the card number "4111-1111-1111-1111":

Pattern matches: "4111-1111-1111-1111"

Capture group: (\d{4}) captures "1111" (last 4 digits)

Replacement: XXXX-XXXX-XXXX-\1 becomes "XXXX-XXXX-XXXX-1111"

Visual Breakdown:
text
Original:  4111 - 1111 - 1111 - 1111
           ├──┼┼ ┼──┼┼ ┼──┼┼ ┼──┤
           │  ││ │  ││ │  ││ │  └── Captured (\1)
           │  ││ │  ││ │  ││ └── Not captured
           │  ││ │  ││ │  │└── Not captured  
           │  ││ │  ││ │  └── Not captured
           └──┴┴ ┴──┴┴ ┴──┴┴ ┴── All matched by pattern

Replacement: XXXX - XXXX - XXXX - \1
             ↓↓↓↓   ↓↓↓↓   ↓↓↓↓   ↓↓↓↓
             XXXX   XXXX   XXXX   1111
Why This is Useful:
Security: Protects sensitive credit card information

PCI Compliance: Payment Card Industry standards require masking card numbers

User Experience: Shows enough digits for identification while protecting privacy

Data Display: Safe for logging, debugging, or displaying in user interfaces

Alternative Pattern:
You could also use this more flexible pattern that handles cards with or without hyphens:

python
masked = re.sub(r'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
This pattern is essential for any application that handles payment information while maintaining security standards!
"""












# Capitalize first letter of each word
text = "hello world! this is python programming."
capitalized = re.sub(r'\b(\w)', lambda match: match.group(1).upper(), text)
print("\nCapitalize words:")
print(capitalized)

Output:

text

Date reformatting:
2023-12-25, 2024-01-15, 2023-06-30

Credit card masking:
Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004

Capitalize words:
Hello World! This Is Python Programming.

Example 3: Using Functions for Dynamic Replacement

python

import re

# Increment all numbers in text by 1
text = "I have 5 apples, 3 oranges, and 10 bananas."

def increment_number(match):
    number = int(match.group())
    return str(number + 1)

result = re.sub(r'\d+', increment_number, text)
print("Increment numbers:")
print(result)

# Convert temperatures from Fahrenheit to Celsius
temp_text = "Temperatures: 32°F, 68°F, 100°F, 212°F"

def f_to_c(match):
    f_temp = int(match.group(1))
    c_temp = (f_temp - 32) * 5/9
    return f"{c_temp:.1f}°C"

result = re.sub(r'(\d+)°F', f_to_c, temp_text)
print("\nTemperature conversion:")
print(result)

# Censor bad words
text = "This is a damn bad example with crap words."
bad_words = ['damn', 'crap', 'bad']

def censor_word(match):
    word = match.group()
    return word[0] + '*' * (len(word) - 1)

result = re.sub(r'\b(' + '|'.join(bad_words) + r')\b', censor_word, text, flags=re.IGNORECASE)
print("\nWord censoring:")
print(result)

Output:

text

Increment numbers:
I have 6 apples, 4 oranges, and 11 bananas.

Temperature conversion:
Temperatures: 0.0°C, 20.0°C, 37.8°C, 100.0°C

Word censoring:
This is a d***n b** example with c*** words.

Key Features of re.sub():

Pattern matching: Use regex patterns for complex search criteria
Backreferences: Use \1, \2, etc., to reference captured groups
Function replacement: Use functions for dynamic replacements
Count control: Limit number of replacements with count parameter
Flags support: Use flags like re.IGNORECASE, re.MULTILINE

Common Use Cases:

Data cleaning and formatting
Text normalization
Pattern-based text transformation
Data masking and anonymization
Template processing

The re.sub() method is incredibly versatile for any text processing task that requires pattern-based search and replace operations!

re.sub()

Python re.sub() Method Explained

Syntax

Example 1: Basic Text Replacement

Example 2: Advanced Replacement with Backreferences

Example 3: Using Functions for Dynamic Replacement

Key Features of re.sub():

Common Use Cases:

pop(), remove(), clear(), and del

Special Sequences in Python Regex

difference between positional and keyword arguments

Negative slicing in Python lists

Dot (.) ,Caret (^),Dollar Sign ($), Asterisk (*) ,Plus Sign (+) Metacharacters

Functions as Objects

Leave a Reply Cancel reply

Python re.sub() Method Explained

Syntax

Example 1: Basic Text Replacement

Example 2: Advanced Replacement with Backreferences

Example 3: Using Functions for Dynamic Replacement

Key Features of re.sub():

Common Use Cases:

Similar Posts

Leave a Reply Cancel reply