Question Mark (?), Greedy vs. Non-Greedy, Backslash (\), and Square Brackets [] Metacharacters
The Question Mark (?) in Python Regex
The question mark ? in Python’s regular expressions has two main uses:
1. Making a Character or Group Optional (0 or 1 occurrence)
This is the most common use – it makes the preceding character or group optional.
Examples:
Example 1: Optional ‘u’ and optional plural ‘s’
python
import re
pattern = r"colou?rs?"  # 'u' and the trailing 's' are each optional
text = "color and colours"
matches = re.findall(pattern, text)
print(matches) # Output: ['color', 'colours']
Example 2: Optional country code in phone numbers
python
import re
pattern = r"(\+1-)?\d{3}-\d{3}-\d{4}" # +1- is optional
text = "123-456-7890 and +1-987-654-3210"
matches = re.findall(pattern, text)
print(matches) # Output: ['', '+1-'] (findall returns the captured group, not the whole match)
Example 3: Optional file extension
python
import re
pattern = r"file(\.txt)?"  # the '.txt' extension is optional
text = "file and file.txt"
matches = re.findall(pattern, text)
print(matches) # Output: ['', '.txt'] (the captured group for each match)
2. Making Quantifiers Non-Greedy (Lazy Matching)
When used after quantifiers like *, +, or {}, ? makes them non-greedy (match as little as possible).
Examples:
Example 4: Greedy vs Non-greedy matching
python
import re
text = "<div>Hello</div><div>World</div>"
# Greedy matching (default)
greedy_match = re.search(r"<div>.*</div>", text)
print("Greedy:", greedy_match.group()) # Matches entire string
# Non-greedy matching (with ?)
non_greedy = re.search(r"<div>.*?</div>", text)
print("Non-greedy:", non_greedy.group()) # Matches only first <div>
Example 5: Extracting content between quotes
python
import re
text = '"Hello" and "World"'
# Greedy - matches everything between first and last quote
greedy = re.findall(r'"(.*)"', text)
print("Greedy:", greedy) # Output: ['Hello" and "World']
# Non-greedy - matches each quoted section separately
non_greedy = re.findall(r'"(.*?)"', text)
print("Non-greedy:", non_greedy) # Output: ['Hello', 'World']
Example 6: Extracting HTML tags content
python
import re
html = "<p>First</p><p>Second</p><p>Third</p>"
# Non-greedy extraction
matches = re.findall(r"<p>(.*?)</p>", html)
print(matches) # Output: ['First', 'Second', 'Third']
Key Points:
- ? after a character makes it optional (0 or 1 occurrence)
- ??, *?, +?, {m,n}? make quantifiers non-greedy
- Non-greedy matching stops at the first possible match rather than the longest possible match
- Use parentheses ( )? to make groups of characters optional
The question mark is one of the most versatile metacharacters in regex, essential for creating flexible patterns and controlling matching behavior.
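When an optional group is wrapped in plain parentheses, re.findall returns only what the group captured, which is why Example 2 prints ['', '+1-']. The sketch below reuses that example's text and compares it with a non-capturing group, (?:...), which is one way to get the whole match back; the variable names are illustrative only.

```python
import re

text = "123-456-7890 and +1-987-654-3210"

# Capturing group: findall returns only the text captured by ( ... )
captured = re.findall(r"(\+1-)?\d{3}-\d{3}-\d{4}", text)

# Non-capturing group (?: ... ): findall returns the whole match
whole = re.findall(r"(?:\+1-)?\d{3}-\d{3}-\d{4}", text)

print(captured)  # ['', '+1-']
print(whole)     # ['123-456-7890', '+1-987-654-3210']
```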
Greedy vs. Non-Greedy Metacharacters in Python Regex
Understanding the Difference
In regular expressions, greedy quantifiers try to match as much as possible, while non-greedy (or lazy) quantifiers try to match as little as possible.
Quantifiers That Can Be Greedy or Non-Greedy
- * – 0 or more occurrences
- + – 1 or more occurrences
- ? – 0 or 1 occurrence
- {m,n} – between m and n occurrences
To make them non-greedy, simply add a ? after them.
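As a small illustration of the list above, the following sketch runs each quantifier in its greedy and lazy form against the same string. The test string "aaaa" and the names patterns and results are assumptions made for this example.

```python
import re

text = "aaaa"
patterns = [r"a*", r"a*?", r"a+", r"a+?", r"a?", r"a??", r"a{2,3}", r"a{2,3}?"]

# re.match anchors at the start, so each result shows how much the quantifier consumed
results = {p: re.match(p, text).group() for p in patterns}

for p, matched in results.items():
    print(f"{p:8} -> {matched!r}")
# a*      -> 'aaaa'    a*?     -> ''
# a+      -> 'aaaa'    a+?     -> 'a'
# a?      -> 'a'       a??     -> ''
# a{2,3}  -> 'aaa'     a{2,3}? -> 'aa'
```

Each lazy form takes the minimum the quantifier allows because nothing after it in the pattern forces it to take more.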
Examples
Example 1: Basic Text Extraction
python
import re
text = "Hello <div>Content</div> World <div>More content</div> End"
# Greedy matching - matches the LONGEST possible string
greedy_match = re.search(r'<div>.*</div>', text)
print("Greedy:", greedy_match.group())
# Output: <div>Content</div> World <div>More content</div>
# Non-greedy matching - matches the SHORTEST possible string
non_greedy_match = re.search(r'<div>.*?</div>', text)
print("Non-greedy:", non_greedy_match.group())
# Output: <div>Content</div>
Example 2: Extracting Multiple Matches
python
import re
text = "Item: Apple, Item: Banana, Item: Cherry,"  # trailing comma so every item ends with ','
# Greedy - finds one long match
greedy_matches = re.findall(r'Item: .*,', text)
print("Greedy matches:", greedy_matches)
# Output: ['Item: Apple, Item: Banana, Item: Cherry,']
# Non-greedy - finds each item separately
non_greedy_matches = re.findall(r'Item: .*?,', text)
print("Non-greedy matches:", non_greedy_matches)
# Output: ['Item: Apple,', 'Item: Banana,', 'Item: Cherry,']
Example 3: HTML Tag Extraction
python
import re
html = "<p>First paragraph</p><p>Second paragraph</p><p>Third paragraph</p>"
# Greedy - matches everything between first <p> and last </p>
greedy = re.findall(r'<p>.*</p>', html)
print("Greedy:", greedy)
# Output: ['<p>First paragraph</p><p>Second paragraph</p><p>Third paragraph</p>']
# Non-greedy - matches each paragraph individually
non_greedy = re.findall(r'<p>.*?</p>', html)
print("Non-greedy:", non_greedy)
# Output: ['<p>First paragraph</p>', '<p>Second paragraph</p>', '<p>Third paragraph</p>']
Example 4: Email Extraction from Text
python
import re
text = "Emails: john@example.com, jane@test.org, and bob@mail.net are all valid."
# Greedy - the trailing .* swallows the rest of the line
greedy_emails = re.findall(r'\w+@\w+\.\w+.*', text)
print("Greedy emails:", greedy_emails)
# Output: ['john@example.com, jane@test.org, and bob@mail.net are all valid.']
# Non-greedy - a trailing .*? matches as little as possible (here, nothing), so each email is separate
non_greedy_emails = re.findall(r'\w+@\w+\.\w+.*?', text)
print("Non-greedy emails:", non_greedy_emails)
# Output: ['john@example.com', 'jane@test.org', 'bob@mail.net']
When to Use Each Approach
- Use greedy matching when you want to capture the largest possible match
- Use non-greedy matching when you want to capture the smallest possible matches
Practical Tip
In most cases, you’ll want to use non-greedy matching (.*?) when extracting multiple items from text, as it gives you more precise control over what gets matched.
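A lazy .*? is not the only way to keep matches short. When the closing delimiter is a single character, a negated character class can often do the same job. The sketch below reuses the quoted-text example from earlier and assumes the quoted content never contains a quote character itself.

```python
import re

text = '"Hello" and "World"'

# Lazy quantifier: stop at the first closing quote
lazy = re.findall(r'"(.*?)"', text)

# Negated class: match any run of characters that are not a quote
negated = re.findall(r'"([^"]*)"', text)

print(lazy)     # ['Hello', 'World']
print(negated)  # ['Hello', 'World']
```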
The Backslash (\) in Python Regex
The backslash \ has two main purposes in regular expressions:
1. Escaping Special Characters
Turns special regex characters into literal characters.
2. Creating Special Sequences
Creates special matching patterns like \d, \w, etc.
Example 1: Escaping Special Characters
python
import re
text = "The price is $100.50 (including tax)"
pattern = r"\$100\.50" # Escape $ and .
match = re.search(pattern, text)
print("Match:", match.group()) # Output: $100.50
Explanation: Without \, $ and . would have special meanings in regex.
Example 2: Matching Parentheses
python
import re
text = "Call me at (555) 123-4567"
pattern = r"\(\d{3}\)" # Escape parentheses
match = re.search(pattern, text)
print("Match:", match.group()) # Output: (555)
Explanation: \( and \) match literal parentheses instead of creating a group.
Example 3: Matching a Literal Backslash
python
import re
text = "The path is C:\\Windows\\System32"
pattern = r"\\" # Match a literal backslash
matches = re.findall(pattern, text)
print("Backslashes found:", matches) # Output: ['\\', '\\']
print("Count:", len(matches)) # Output: 2
Explanation: \\ matches a single literal backslash character.
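The examples here write patterns as raw strings (the r"..." prefix). A brief sketch of why: without the prefix, Python's own string escaping is applied first, so the same two-character pattern has to be typed with four backslashes. The variable names and the sample path are illustrative.

```python
import re

raw_pattern = r"\\"     # raw string: exactly two characters, backslash + backslash
plain_pattern = "\\\\"  # regular string: four typed, two left after Python unescapes

print(raw_pattern == plain_pattern)  # True
print(len(raw_pattern))              # 2

path = r"C:\Windows\System32"
parts = re.split(raw_pattern, path)  # split on each literal backslash
print(parts)  # ['C:', 'Windows', 'System32']
```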
Example 4: Using Special Sequences
python
import re
text = "Room 25A has 3 windows and 2 doors"
pattern = r"\d+" # \d matches any digit
matches = re.findall(pattern, text)
print("Numbers found:", matches) # Output: ['25', '3', '2']
Explanation: \d is a special sequence that matches any digit (0-9; for str patterns it also matches other Unicode decimal digits unless the re.ASCII flag is set).
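For str patterns in Python 3, \d is not limited to 0-9: it also matches decimal digits from other scripts. The sketch below shows the difference using the re.ASCII flag and an explicit [0-9] class; the sample text with Arabic-Indic digits is made up for the illustration.

```python
import re

# "\u0664\u0662" is the number 42 written with Arabic-Indic digits
text = "42 and \u0664\u0662"

unicode_digits = re.findall(r"\d+", text)                # Unicode-aware by default
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)  # restrict \d to 0-9
class_digits = re.findall(r"[0-9]+", text)               # explicit ASCII range

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
print(class_digits)         # ['42']
```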
Example 5: Matching Word Characters
python
import re
text = "User_id: john_doe123, Email: test@example.com"
pattern = r"\w+" # \w matches word characters (a-z, A-Z, 0-9, _)
matches = re.findall(pattern, text)
print("Word characters:", matches)
# Output: ['User_id', 'john_doe123', 'Email', 'test', 'example', 'com']
Explanation: \w matches alphanumeric characters and underscores.
Common Special Sequences with Backslash:
| Sequence | Meaning | Example |
|---|---|---|
| \d | Any digit (0-9) | \d+ matches “123” |
| \D | Any NON-digit | \D+ matches “abc” |
| \w | Word character (a-z, A-Z, 0-9, _) | \w+ matches “hello_123” |
| \W | NON-word character | \W+ matches “!@#” |
| \s | Whitespace (space, tab, newline) | \s+ matches “ ” |
| \S | NON-whitespace | \S+ matches “hello” |
| \b | Word boundary | \bword\b matches “word” but not “password” |
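The earlier examples cover \d and \w. As a quick sketch, the remaining sequences from the table can be tried out like this; the sample strings and variable names are invented for the illustration.

```python
import re

text = "id 42: ok!"

non_digits = re.findall(r"\D+", "abc123def")  # runs of non-digits
non_word = re.findall(r"\W+", text)           # runs of non-word characters
fields = re.split(r"\s+", "a  b\tc\nd")       # split on any whitespace run
non_space = re.findall(r"\S+", text)          # runs of non-whitespace
whole_words = re.findall(r"\bword\b", "word password sword word")

print(non_digits)   # ['abc', 'def']
print(non_word)     # [' ', ': ', '!']
print(fields)       # ['a', 'b', 'c', 'd']
print(non_space)    # ['id', '42:', 'ok!']
print(whole_words)  # ['word', 'word']
```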
Key Points:
- Use \ to escape special characters: \., \$, \?, etc.
- Use \\ to match a literal backslash
- Special sequences like \d, \w provide shortcuts for common patterns
- The backslash changes the meaning of the character that follows it
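When the text to match comes from a variable, escaping each special character by hand is error-prone. The standard library's re.escape adds the backslashes for you. The sketch below assumes a made-up literal and sentence.

```python
import re

literal = "$100.50 (approx.)"
text = "Total: $100.50 (approx.) per unit"

# Used as-is, $ is an end-of-string anchor and ( ) form a group, so nothing matches
unescaped = re.search(literal, text)

# re.escape backslash-escapes the special characters so the text is matched literally
escaped = re.search(re.escape(literal), text)

print(unescaped)        # None
print(escaped.group())  # $100.50 (approx.)
```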
Metacharacters – Square Brackets [] with 10 Basic Examples
Square Brackets [] in Python Regex
Square brackets [] are used to create character classes – they match any ONE character from the specified set.
Basic Examples
Example 1: Match any vowel
python
import re
text = "The quick brown fox jumps"
pattern = r"[aeiou]" # Match any vowel
matches = re.findall(pattern, text)
print("Vowels:", matches) # Output: ['e', 'u', 'i', 'o', 'o', 'u']
Example 2: Match any digit
python
import re
text = "Room 25B, Floor 3, Building 42"
pattern = r"[0123456789]" # Match any digit
matches = re.findall(pattern, text)
print("Digits:", matches) # Output: ['2', '5', '3', '4', '2']
Example 3: Match uppercase letters
python
import re
text = "Hello World from Python 3.9"
pattern = r"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]" # Match any uppercase letter
matches = re.findall(pattern, text)
print("Uppercase:", matches) # Output: ['H', 'W', 'P']
Using Ranges
Example 4: Digit range (0-9)
python
import re
text = "Prices: $10, $25, $100"
pattern = r"[0-9]" # Match any digit from 0 to 9
matches = re.findall(pattern, text)
print("All digits:", matches) # Output: ['1', '0', '2', '5', '1', '0', '0']
Example 5: Letter range (a-z)
python
import re
text = "Hello World 123"
pattern = r"[a-z]" # Match any lowercase letter
matches = re.findall(pattern, text)
print("Lowercase letters:", matches) # Output: ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
Example 6: Multiple ranges
python
import re
text = "UserID: JohnDoe25 (Active)"
pattern = r"[A-Za-z0-9]" # Match any alphanumeric character
matches = re.findall(pattern, text)
print("Alphanumeric:", matches)
# Output: ['U', 's', 'e', 'r', 'I', 'D', 'J', 'o', 'h', 'n', 'D', 'o', 'e', '2', '5', 'A', 'c', 't', 'i', 'v', 'e']
Special Cases
Example 7: Match specific symbols
python
import re
text = "Hello! How are you? I'm fine, thanks."
pattern = r"[!?,.]" # Match any of these punctuation marks
matches = re.findall(pattern, text)
print("Punctuation:", matches) # Output: ['!', '?', ',', '.']
Example 8: Excluding characters (using ^)
python
import re
text = "Hello123 World!"
pattern = r"[^0-9]" # Match anything EXCEPT digits
matches = re.findall(pattern, text)
print("Non-digits:", "".join(matches)) # Output: "Hello World!"
Example 9: Match hexadecimal characters
python
import re
text = "Hex: A1B2C3, FF00FF, 123ABC"
pattern = r"[0-9A-Fa-f]" # Match hexadecimal characters
matches = re.findall(pattern, text)
print("Hex chars:", matches)
# Output: ['e', 'A', '1', 'B', '2', 'C', '3', 'F', 'F', '0', '0', 'F', 'F', '1', '2', '3', 'A', 'B', 'C']
# Note: the 'e' in "Hex" also matches because it falls in the a-f range
Example 10: Complex character class
python
import re
text = "Email: user@example.com, Phone: (555) 123-4567"
pattern = r"[a-zA-Z0-9@._()-]" # Match email/phone related characters
matches = re.findall(pattern, text)
print("Email/phone chars:", "".join(matches))
# Output: "Emailuser@example.comPhone(555)123-4567"
Key Points:
- Single character: [abc] matches one character that is either ‘a’, ‘b’, or ‘c’
- Ranges: Use a hyphen for ranges: [a-z], [0-9], [A-Z]
- Multiple ranges: Combine ranges: [a-zA-Z0-9]
- Negation: Use ^ at the start to exclude: [^0-9] = not a digit
- Special characters: Inside brackets, most special characters lose their special meaning
- Escaping: To match a literal -, ^, ], or \ inside brackets, escape it: [\-\^\\\]]
python
import re
# Match a literal hyphen inside a character class
text = "A-B-C 123"
matches = re.findall(r"[\-A-C]", text)  # Match hyphen or A-C
print(matches) # Output: ['A', '-', 'B', '-', 'C']
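Escaping is the most explicit option, but position inside the brackets also matters: a hyphen placed first or last, or a caret placed anywhere but first, is read literally. The following sketch compares the variants on a made-up string.

```python
import re

text = "A-B-C 2^3"

escaped = re.findall(r"[\-A-C]", text)       # hyphen escaped
hyphen_first = re.findall(r"[-A-C]", text)   # hyphen first: literal
hyphen_last = re.findall(r"[A-C-]", text)    # hyphen last: literal
caret_later = re.findall(r"[0-9^]", text)    # caret not first: literal

print(escaped)       # ['A', '-', 'B', '-', 'C']
print(hyphen_first)  # ['A', '-', 'B', '-', 'C']
print(hyphen_last)   # ['A', '-', 'B', '-', 'C']
print(caret_later)   # ['2', '^', '3']
```

Escaping tends to be easier to read and survives later edits to the class, so it is a reasonable default.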
Square Brackets [] Examples
python
import re
string = "The Euro STOXX 600 index, which tracks all stock markets across Europe including the FTSE, fell by 11.48% – the worst day since it launched in 1998. The panic selling prompted by the coronavirus has wiped £2.7tn off the value of STOXX 600 shares since its all-time peak on 19 February."
# Example 1: Find specific letters [wxkq]
result = re.findall(r"[wxkq]", string)
print("1. Letters w, x, k, q:", result)
# Output: ['x', 'w', 'k', 'k', 'k', 'w', 'w', 'k']
# Matches all occurrences of w, x, k, q in the text
# Example 2: Find letters between a-d [a-d]
result = re.findall(r"[a-d]", string)
print("2. Letters a-d:", result)
# Output: ['d', 'c', 'a', 'c', 'a', 'c', 'a', 'a', 'c', 'c', 'd', 'b', 'd', 'a', 'c', 'a', 'c', 'd', 'a', 'c', 'd', 'b', 'c', 'a', 'a', 'd', 'a', 'a', 'c', 'a', 'a', 'b', 'a']
# Matches all a, b, c, d letters in the text
# Example 3: Find uppercase letters between S-W [S-W]
result = re.findall(r"[S-W]", string)
print("3. Uppercase S-W:", result)
# Output: ['T', 'S', 'T', 'T', 'S', 'T', 'S', 'T']
# Matches uppercase letters from S to W (S, T, U, V, W)
# Example 4: Find digits between 0-5 [0-5]
result = re.findall(r"[0-5]", string)
print("4. Digits 0-5:", result)
# Output: ['0', '0', '1', '1', '4', '1', '2', '0', '0', '1']
# Matches digits 0, 1, 2, 3, 4, 5 from numbers like 600, 11.48, 1998, etc.
# Example 5: Find letter pairs where first is a-f, second is c-w [a-f][c-w]
result = re.findall(r"[a-f][c-w]", string)
print("5. Letter pairs a-f followed by c-w:", result)
# Output: ['de', 'ch', 'ac', 'al', 'ck', 'ar', 'et', 'ac', 'cl', 'di', 'fe', 'ce', 'au', 'ch', 'ed', 'an', 'el', 'ed', 'co', 'av', 'as', 'ed', 'ff', 'al', 'ar', 'es', 'ce', 'al', 'ak', 'br', 'ar']
# Matches pairs like "de" in "index", "ch" in "which", etc.
# Example 6: Find digit pairs where first is 0-5, second is 7-9 [0-5][7-9]
result = re.findall(r"[0-5][7-9]", string)
print("6. Digit pairs 0-5 followed by 7-9:", result)
# Output: ['48', '19', '19']
# Matches "48" from 11.48%, "19" from 1998, and "19" from 19 February
# Example 7: Find digit followed by lowercase letter [0-9][a-z]
result = re.findall(r"[0-9][a-z]", string)
print("7. Digit followed by lowercase letter:", result)
# Output: ['7t']
# Matches "7t" from £2.7tn (digit 7 followed by letter t)
# Example 8: Find everything EXCEPT the letter X [^X]
result = re.findall(r"[^X]", string)
print("8. Everything except 'X':", "".join(result)[:100] + "...")
# Returns all characters except the letter X (very long output)
# Example 9: Special characters are literal inside brackets [(.+?)]
result = re.findall(r"[(.+?)]", string)
print("9. Literal ( . + ? ) characters:", result)
# Output: ['.', '.', '.', '.']
# Inside brackets, (, ., +, ? and ) are literal and need no escaping; this text contains only dots
# Example 10: Find everything EXCEPT [, digits 0-5, and ] [^[0-5\]]
result = re.findall(r"[^[0-5\]]", string)
print("10. Everything except [, 0-5 digits and ]:", "".join(result)[:100] + "...")
# Returns all characters except the opening bracket [, digits 0-5, and the closing bracket ]
Key Insights from These Examples:
- Single character matching: [abc] matches any one character from the set
- Ranges: [a-d] matches a, b, c, or d
- Multiple characters: [a-f][c-w] matches two-character sequences
- Negation: [^X] matches everything EXCEPT X
- Special characters: Inside brackets, most special characters lose their meaning
- Escaping: For a literal ], -, or ^ inside brackets, escape it with \
These examples show how square brackets allow flexible pattern matching for specific character sets or ranges!
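The examples above match one character at a time. As a closing sketch, character classes can be combined with the quantifiers covered earlier to pull out whole tokens. The shortened excerpt and the variable names below are assumptions made for this illustration.

```python
import re

excerpt = "The Euro STOXX 600 index fell by 11.48% since it launched in 1998."

whole_numbers = re.findall(r"[0-9]+", excerpt)       # runs of digits
decimals = re.findall(r"[0-9]+[.][0-9]+", excerpt)   # [.] is a literal dot
acronyms = re.findall(r"[A-Z]{2,}", excerpt)         # two or more capitals in a row

print(whole_numbers)  # ['600', '11', '48', '1998']
print(decimals)       # ['11.48']
print(acronyms)       # ['STOXX']
```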