re.sub()

Python re.sub() Method Explained

The re.sub() method is used for searching and replacing text patterns in strings. It’s one of the most powerful regex methods for text processing.

Syntax

python

re.sub(pattern, repl, string, count=0, flags=0)
  • pattern: The regex pattern to search for
  • repl: The replacement string or function
  • string: The input text to process
  • count: Maximum number of replacements (0 = all)
  • flags: Optional modifiers

Example 1: Basic Text Replacement

python

import re

text = "The color of the sky is blue. My favorite color is blue too."

# Replace all occurrences of 'color' with 'colour' (British English)
result = re.sub(r'color', 'colour', text)
print("Basic replacement:")
print(result)

# Replace only the first occurrence
result_first = re.sub(r'color', 'colour', text, count=1)
print("\nFirst occurrence only:")
print(result_first)

# Case-insensitive replacement
text2 = "Color, COLOR, color, CoLoR"
result_ci = re.sub(r'color', 'colour', text2, flags=re.IGNORECASE)
print("\nCase-insensitive replacement:")
print(result_ci)

Output:

text

Basic replacement:
The colour of the sky is blue. My favourite colour is blue too.

First occurrence only:
The colour of the sky is blue. My favorite color is blue too.

Case-insensitive replacement:
colour, colour, colour, colour

Example 2: Advanced Replacement with Backreferences

python

import re

# Reformat date from MM/DD/YYYY to YYYY-MM-DD
dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
print("Date reformatting:")
print(result)



"""
Explaining re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
This regex pattern is used to reformat dates from MM/DD/YYYY format to YYYY-MM-DD format. Let me break it down:

The Search Pattern: r'(\d{2})/(\d{2})/(\d{4})'
1. (\d{2}) - First Capture Group (Month)
\d{2} = exactly 2 digits

( ) = capture this group

Matches: Month (e.g., "12", "01", "06")

2. / - Literal slash
Matches the slash between month and day

3. (\d{2}) - Second Capture Group (Day)
\d{2} = exactly 2 digits

( ) = capture this group

Matches: Day (e.g., "25", "15", "30")

4. / - Literal slash
Matches the slash between day and year

5. (\d{4}) - Third Capture Group (Year)
\d{4} = exactly 4 digits

( ) = capture this group

Matches: Year (e.g., "2023", "2024")

The Replacement Pattern: r'\3-\1-\2'
This is where the magic happens! The backreferences rearrange the captured groups:

\3 = Third capture group (Year)

- = Literal hyphen

\1 = First capture group (Month)

- = Literal hyphen

\2 = Second capture group (Day)

So: Year + hyphen + Month + hyphen + Day

Example:
python
import re

dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)

print("Before:", dates)
print("After: ", result)
Output:

text
Before: 12/25/2023, 01/15/2024, 06/30/2023
After: 2023-12-25, 2024-01-15, 2023-06-30
Step-by-step Transformation:
For the date "12/25/2023":

Capture groups:

Group 1 (\1) = "12" (Month)

Group 2 (\2) = "25" (Day)

Group 3 (\3) = "2023" (Year)

Replacement: \3-\1-\2 becomes "2023-12-25"

Visual Breakdown:
text
Original: 12 / 25 / 2023
↑ ↑ ↑ ↑ ↑
\1 / \2 / \3

Replacement: \3 - \1 - \2
↓ ↓ ↓ ↓ ↓
2023 - 12 - 25
Key Points:
Capture groups (parentheses) allow you to "remember" parts of the match

Backreferences (\1, \2, \3) let you reuse captured groups in the replacement

Order matters: The replacement pattern \3-\1-\2 changes the order from MM/DD/YYYY to YYYY-MM-DD

ISO 8601 format: This converts to the international standard date format

This pattern is extremely useful for data cleaning and standardizing date formats across different systems!
"""








# Mask credit card numbers (show only last 4 digits)
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
print("\nCredit card masking:")
print(masked)




"""
Explaining Credit Card Masking with re.sub()
This code is used to mask credit card numbers by showing only the last 4 digits while hiding the rest. Let me break it down:

The Code:
python
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
The Search Pattern: r'\d{4}-\d{4}-\d{4}-(\d{4})'
1. \d{4}- × 3
\d{4} = exactly 4 digits

- = literal hyphen

Matches: First three groups of 4 digits + hyphens

Example: "4111-1111-1111-"

2. (\d{4}) - Capture Group (Last 4 digits)
\d{4} = exactly 4 digits

( ) = capture this group

Matches: The last 4 digits of the card number

Example: "1111" (captured as group \1)

The Replacement Pattern: r'XXXX-XXXX-XXXX-\1'
This replaces the matched card number with a masked version:

XXXX-XXXX-XXXX- = Literal text to hide the first 12 digits

\1 = First capture group (the last 4 digits)

Example:
python
import re

cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)

print("Original:", cards)
print("Masked: ", masked)
Output:

text
Original: Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004
Masked: Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004
Step-by-step Transformation:
For the card number "4111-1111-1111-1111":

Pattern matches: "4111-1111-1111-1111"

Capture group: (\d{4}) captures "1111" (last 4 digits)

Replacement: XXXX-XXXX-XXXX-\1 becomes "XXXX-XXXX-XXXX-1111"

Visual Breakdown:
text
Original: 4111 - 1111 - 1111 - 1111
├──┼┼ ┼──┼┼ ┼──┼┼ ┼──┤
│ ││ │ ││ │ ││ │ └── Captured (\1)
│ ││ │ ││ │ ││ └── Not captured
│ ││ │ ││ │ │└── Not captured
│ ││ │ ││ │ └── Not captured
└──┴┴ ┴──┴┴ ┴──┴┴ ┴── All matched by pattern

Replacement: XXXX - XXXX - XXXX - \1
↓↓↓↓ ↓↓↓↓ ↓↓↓↓ ↓↓↓↓
XXXX XXXX XXXX 1111
Why This is Useful:
Security: Protects sensitive credit card information

PCI Compliance: Payment Card Industry standards require masking card numbers

User Experience: Shows enough digits for identification while protecting privacy

Data Display: Safe for logging, debugging, or displaying in user interfaces

Alternative Pattern:
You could also use this more flexible pattern that handles cards with or without hyphens:

python
masked = re.sub(r'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
This pattern is essential for any application that handles payment information while maintaining security standards!
"""












# Capitalize first letter of each word
text = "hello world! this is python programming."
capitalized = re.sub(r'\b(\w)', lambda match: match.group(1).upper(), text)
print("\nCapitalize words:")
print(capitalized)

Output:

text

Date reformatting:
2023-12-25, 2024-01-15, 2023-06-30

Credit card masking:
Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004

Capitalize words:
Hello World! This Is Python Programming.

Example 3: Using Functions for Dynamic Replacement

python

import re

# Increment all numbers in text by 1
text = "I have 5 apples, 3 oranges, and 10 bananas."

def increment_number(match):
    number = int(match.group())
    return str(number + 1)

result = re.sub(r'\d+', increment_number, text)
print("Increment numbers:")
print(result)

# Convert temperatures from Fahrenheit to Celsius
temp_text = "Temperatures: 32°F, 68°F, 100°F, 212°F"

def f_to_c(match):
    f_temp = int(match.group(1))
    c_temp = (f_temp - 32) * 5/9
    return f"{c_temp:.1f}°C"

result = re.sub(r'(\d+)°F', f_to_c, temp_text)
print("\nTemperature conversion:")
print(result)

# Censor bad words
text = "This is a damn bad example with crap words."
bad_words = ['damn', 'crap', 'bad']

def censor_word(match):
    word = match.group()
    return word[0] + '*' * (len(word) - 1)

result = re.sub(r'\b(' + '|'.join(bad_words) + r')\b', censor_word, text, flags=re.IGNORECASE)
print("\nWord censoring:")
print(result)

Output:

text

Increment numbers:
I have 6 apples, 4 oranges, and 11 bananas.

Temperature conversion:
Temperatures: 0.0°C, 20.0°C, 37.8°C, 100.0°C

Word censoring:
This is a d***n b** example with c*** words.

Key Features of re.sub():

  1. Pattern matching: Use regex patterns for complex search criteria
  2. Backreferences: Use \1\2, etc., to reference captured groups
  3. Function replacement: Use functions for dynamic replacements
  4. Count control: Limit number of replacements with count parameter
  5. Flags support: Use flags like re.IGNORECASEre.MULTILINE

Common Use Cases:

  • Data cleaning and formatting
  • Text normalization
  • Pattern-based text transformation
  • Data masking and anonymization
  • Template processing

The re.sub() method is incredibly versatile for any text processing task that requires pattern-based search and replace operations!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *