re.sub()
Python re.sub() Method Explained
The re.sub() method is used for searching and replacing text patterns in strings. It’s one of the most powerful regex methods for text processing.
Syntax
python
re.sub(pattern, repl, string, count=0, flags=0)
- pattern: The regex pattern to search for
- repl: The replacement string or function
- string: The input text to process
- count: Maximum number of replacements (0 = all)
- flags: Optional modifiers
Example 1: Basic Text Replacement
python
import re
text = "The color of the sky is blue. My favorite color is blue too."
# Replace all occurrences of 'color' with 'colour' (British English)
result = re.sub(r'color', 'colour', text)
print("Basic replacement:")
print(result)
# Replace only the first occurrence
result_first = re.sub(r'color', 'colour', text, count=1)
print("\nFirst occurrence only:")
print(result_first)
# Case-insensitive replacement
text2 = "Color, COLOR, color, CoLoR"
result_ci = re.sub(r'color', 'colour', text2, flags=re.IGNORECASE)
print("\nCase-insensitive replacement:")
print(result_ci)
Output:
text
Basic replacement: The colour of the sky is blue. My favourite colour is blue too. First occurrence only: The colour of the sky is blue. My favorite color is blue too. Case-insensitive replacement: colour, colour, colour, colour
Example 2: Advanced Replacement with Backreferences
python
import re
# Reformat date from MM/DD/YYYY to YYYY-MM-DD
dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
print("Date reformatting:")
print(result)
"""
Explaining re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
This regex pattern is used to reformat dates from MM/DD/YYYY format to YYYY-MM-DD format. Let me break it down:
The Search Pattern: r'(\d{2})/(\d{2})/(\d{4})'
1. (\d{2}) - First Capture Group (Month)
\d{2} = exactly 2 digits
( ) = capture this group
Matches: Month (e.g., "12", "01", "06")
2. / - Literal slash
Matches the slash between month and day
3. (\d{2}) - Second Capture Group (Day)
\d{2} = exactly 2 digits
( ) = capture this group
Matches: Day (e.g., "25", "15", "30")
4. / - Literal slash
Matches the slash between day and year
5. (\d{4}) - Third Capture Group (Year)
\d{4} = exactly 4 digits
( ) = capture this group
Matches: Year (e.g., "2023", "2024")
The Replacement Pattern: r'\3-\1-\2'
This is where the magic happens! The backreferences rearrange the captured groups:
\3 = Third capture group (Year)
- = Literal hyphen
\1 = First capture group (Month)
- = Literal hyphen
\2 = Second capture group (Day)
So: Year + hyphen + Month + hyphen + Day
Example:
python
import re
dates = "12/25/2023, 01/15/2024, 06/30/2023"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', dates)
print("Before:", dates)
print("After: ", result)
Output:
text
Before: 12/25/2023, 01/15/2024, 06/30/2023
After: 2023-12-25, 2024-01-15, 2023-06-30
Step-by-step Transformation:
For the date "12/25/2023":
Capture groups:
Group 1 (\1) = "12" (Month)
Group 2 (\2) = "25" (Day)
Group 3 (\3) = "2023" (Year)
Replacement: \3-\1-\2 becomes "2023-12-25"
Visual Breakdown:
text
Original: 12 / 25 / 2023
↑ ↑ ↑ ↑ ↑
\1 / \2 / \3
Replacement: \3 - \1 - \2
↓ ↓ ↓ ↓ ↓
2023 - 12 - 25
Key Points:
Capture groups (parentheses) allow you to "remember" parts of the match
Backreferences (\1, \2, \3) let you reuse captured groups in the replacement
Order matters: The replacement pattern \3-\1-\2 changes the order from MM/DD/YYYY to YYYY-MM-DD
ISO 8601 format: This converts to the international standard date format
This pattern is extremely useful for data cleaning and standardizing date formats across different systems!
"""
# Mask credit card numbers (show only last 4 digits)
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
print("\nCredit card masking:")
print(masked)
"""
Explaining Credit Card Masking with re.sub()
This code is used to mask credit card numbers by showing only the last 4 digits while hiding the rest. Let me break it down:
The Code:
python
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
The Search Pattern: r'\d{4}-\d{4}-\d{4}-(\d{4})'
1. \d{4}- × 3
\d{4} = exactly 4 digits
- = literal hyphen
Matches: First three groups of 4 digits + hyphens
Example: "4111-1111-1111-"
2. (\d{4}) - Capture Group (Last 4 digits)
\d{4} = exactly 4 digits
( ) = capture this group
Matches: The last 4 digits of the card number
Example: "1111" (captured as group \1)
The Replacement Pattern: r'XXXX-XXXX-XXXX-\1'
This replaces the matched card number with a masked version:
XXXX-XXXX-XXXX- = Literal text to hide the first 12 digits
\1 = First capture group (the last 4 digits)
Example:
python
import re
cards = "Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004"
masked = re.sub(r'\d{4}-\d{4}-\d{4}-(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
print("Original:", cards)
print("Masked: ", masked)
Output:
text
Original: Visa: 4111-1111-1111-1111, MasterCard: 5500-0000-0000-0004
Masked: Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004
Step-by-step Transformation:
For the card number "4111-1111-1111-1111":
Pattern matches: "4111-1111-1111-1111"
Capture group: (\d{4}) captures "1111" (last 4 digits)
Replacement: XXXX-XXXX-XXXX-\1 becomes "XXXX-XXXX-XXXX-1111"
Visual Breakdown:
text
Original: 4111 - 1111 - 1111 - 1111
├──┼┼ ┼──┼┼ ┼──┼┼ ┼──┤
│ ││ │ ││ │ ││ │ └── Captured (\1)
│ ││ │ ││ │ ││ └── Not captured
│ ││ │ ││ │ │└── Not captured
│ ││ │ ││ │ └── Not captured
└──┴┴ ┴──┴┴ ┴──┴┴ ┴── All matched by pattern
Replacement: XXXX - XXXX - XXXX - \1
↓↓↓↓ ↓↓↓↓ ↓↓↓↓ ↓↓↓↓
XXXX XXXX XXXX 1111
Why This is Useful:
Security: Protects sensitive credit card information
PCI Compliance: Payment Card Industry standards require masking card numbers
User Experience: Shows enough digits for identification while protecting privacy
Data Display: Safe for logging, debugging, or displaying in user interfaces
Alternative Pattern:
You could also use this more flexible pattern that handles cards with or without hyphens:
python
masked = re.sub(r'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?(\d{4})', r'XXXX-XXXX-XXXX-\1', cards)
This pattern is essential for any application that handles payment information while maintaining security standards!
"""
# Capitalize first letter of each word
text = "hello world! this is python programming."
capitalized = re.sub(r'\b(\w)', lambda match: match.group(1).upper(), text)
print("\nCapitalize words:")
print(capitalized)
Output:
text
Date reformatting: 2023-12-25, 2024-01-15, 2023-06-30 Credit card masking: Visa: XXXX-XXXX-XXXX-1111, MasterCard: XXXX-XXXX-XXXX-0004 Capitalize words: Hello World! This Is Python Programming.
Example 3: Using Functions for Dynamic Replacement
python
import re
# Increment all numbers in text by 1
text = "I have 5 apples, 3 oranges, and 10 bananas."
def increment_number(match):
number = int(match.group())
return str(number + 1)
result = re.sub(r'\d+', increment_number, text)
print("Increment numbers:")
print(result)
# Convert temperatures from Fahrenheit to Celsius
temp_text = "Temperatures: 32°F, 68°F, 100°F, 212°F"
def f_to_c(match):
f_temp = int(match.group(1))
c_temp = (f_temp - 32) * 5/9
return f"{c_temp:.1f}°C"
result = re.sub(r'(\d+)°F', f_to_c, temp_text)
print("\nTemperature conversion:")
print(result)
# Censor bad words
text = "This is a damn bad example with crap words."
bad_words = ['damn', 'crap', 'bad']
def censor_word(match):
word = match.group()
return word[0] + '*' * (len(word) - 1)
result = re.sub(r'\b(' + '|'.join(bad_words) + r')\b', censor_word, text, flags=re.IGNORECASE)
print("\nWord censoring:")
print(result)
Output:
text
Increment numbers: I have 6 apples, 4 oranges, and 11 bananas. Temperature conversion: Temperatures: 0.0°C, 20.0°C, 37.8°C, 100.0°C Word censoring: This is a d***n b** example with c*** words.
Key Features of re.sub():
- Pattern matching: Use regex patterns for complex search criteria
- Backreferences: Use
\1,\2, etc., to reference captured groups - Function replacement: Use functions for dynamic replacements
- Count control: Limit number of replacements with
countparameter - Flags support: Use flags like
re.IGNORECASE,re.MULTILINE
Common Use Cases:
- Data cleaning and formatting
- Text normalization
- Pattern-based text transformation
- Data masking and anonymization
- Template processing
The re.sub() method is incredibly versatile for any text processing task that requires pattern-based search and replace operations!