positive lookahead assertion
A positive lookahead assertion in Python’s re module is a zero-width assertion that checks if the pattern that follows it is present, without including that pattern in the overall match. It is written as (?=...).
The key is that it’s a “lookahead”—the regex engine looks ahead in the string to see if the pattern inside the parentheses is there. If it is, the match succeeds, but the lookahead part itself is not consumed or returned as part of the match.
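To see the zero-width behavior concretely, a minimal sketch can inspect the span of the match object. The sample string "foobar" and the pattern are illustrative choices, not part of the examples in this article.

```python
import re

sample = "foobar"  # illustrative input

# "foo" is matched only because "bar" follows it.
m = re.search(r'foo(?=bar)', sample)

matched_text = m.group()      # the text actually consumed by the match
matched_span = m.span()       # (start, end) indices of the match
remainder = sample[m.end():]  # what is left in the string after the match

print(matched_text)  # foo
print(matched_span)  # (0, 3)
print(remainder)     # bar
```

The match ends at index 3, so "bar" was checked but is still available to whatever comes next in the pattern or to the next search.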
Example: Finding Numbers Followed by a Specific Word
Let’s say you want to find all numbers in a string that are followed by the word “dollars”, but you only want to match the numbers, not the word “dollars” itself.
Python
import re
text = "The price is 500 dollars, but the tax is 50 dollars."
# This pattern looks for one or more digits (\d+), followed by a positive lookahead
# that checks for a space and the word "dollars".
pattern = r'\d+(?= dollars)'
matches = re.findall(pattern, text)
print(matches)
- Output:
['500', '50']
How it Works: Step-by-Step
- \d+: This part of the pattern matches one or more digits (500, 50).
- (?= dollars): This is the positive lookahead assertion.
  - The regex engine, after matching \d+, “looks ahead” to see if the next characters are a space followed by the word “dollars”.
  - Since it finds “ dollars” after 500 and 50, the assertion is true.
  - Crucially, the (?= dollars) part itself is not included in the final match. It just verifies a condition.
Without the positive lookahead, using a simple \d+ dollars pattern would match 500 dollars and 50 dollars, which is not what we wanted.
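As a rough comparison, the sketch below runs the consuming pattern, a capture-group alternative, and the lookahead version on the same text. The capture-group variant and the re.sub call are additions for illustration only. The substitution shows one case where the difference matters, since only the consumed text gets replaced.

```python
import re

text = "The price is 500 dollars, but the tax is 50 dollars."

# Consuming pattern: " dollars" becomes part of every match.
consuming = re.findall(r'\d+ dollars', text)

# Capture group: findall returns only the group, but " dollars" is still consumed.
captured = re.findall(r'(\d+) dollars', text)

# Lookahead: " dollars" is checked but never consumed.
lookahead = re.findall(r'\d+(?= dollars)', text)

# With the lookahead, a substitution replaces the digits and leaves " dollars" alone.
masked = re.sub(r'\d+(?= dollars)', 'N', text)

print(consuming)  # ['500 dollars', '50 dollars']
print(captured)   # ['500', '50']
print(lookahead)  # ['500', '50']
print(masked)     # The price is N dollars, but the tax is N dollars.
```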
Why use it?
Positive lookahead assertions are useful for:
- Constraining a match: Ensuring a pattern is present under specific conditions without including that condition in the match result.
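- Overlapping matches: Because a lookahead consumes no characters, the next match attempt can start inside text the lookahead has already inspected. This makes it possible to find overlapping matches that a purely consuming pattern would skip.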
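The overlapping case can be sketched as follows. The input string "ababa" and the idea of placing a capture group inside the lookahead are illustrative choices, not taken from the examples in this article.

```python
import re

text = "ababa"  # illustrative input containing two overlapping "aba" substrings

# Consuming pattern: after matching "aba" at index 0, the search resumes at index 3,
# so the occurrence starting at index 2 is skipped.
plain = re.findall(r'aba', text)

# Lookahead with a capture group: each match is empty (zero-width), so the engine
# advances one character at a time and the group records "aba" wherever it follows.
overlapping = re.findall(r'(?=(aba))', text)

print(plain)        # ['aba']
print(overlapping)  # ['aba', 'aba']
```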
Example: Finding Street Names
Let’s say you want to find all street names in an address string that are followed by “Street” or “St.”, but you only want to match the name of the street itself.
Python
import re
address = "123 Main Street, 45 Elm St., and 67 Oak Avenue."
# This pattern finds any word (\w+) that is immediately followed by a space
# and either "Street" or "St."
pattern = r'\w+(?= Street| St\.)'
matches = re.findall(pattern, address)
print(matches)
- Output:
['Main', 'Elm']
How it Works
- \w+: This part matches one or more word characters (letters, digits, or underscore).
- (?= Street| St\.): This is the positive lookahead assertion.
  - It looks ahead to see if the matched word is followed by a space and then either the literal string “Street” or “St.”.
  - The | acts as an “OR” condition, checking for either option.
  - The \. is an escaped period, as an unescaped period is a special character in regex that matches any character.
  - The assertion is successful for “Main” (which is followed by “ Street”) and “Elm” (followed by “ St.”).
- The street names “Main” and “Elm” are returned in the list matches, but the “ Street” and “ St.” parts are not included, because they were only part of the lookahead check.
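One possible variation, shown here only as a sketch, factors the shared leading space out of the alternation with a non-capturing group (?:...). It should behave the same as the original pattern on this address string; whether it reads better is a matter of taste.

```python
import re

address = "123 Main Street, 45 Elm St., and 67 Oak Avenue."

# Original form: the space is repeated in each alternative.
original = re.findall(r'\w+(?= Street| St\.)', address)

# Variation: one space, then a non-capturing group holding the alternatives.
# Using (?:...) rather than (...) keeps findall returning the whole match.
factored = re.findall(r'\w+(?= (?:Street|St\.))', address)

print(original)  # ['Main', 'Elm']
print(factored)  # ['Main', 'Elm']
```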
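1. Finding a word that isn’t followed by certain suffixes
A negative lookahead, written (?!...), is the inverse of a positive lookahead: it succeeds only when the pattern inside it does not match at the current position. Imagine you want to find all occurrences of “play” that are not followed by “ing” or “ed” (i.e., the base form “play” rather than “playing” or “played”).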
Python
import re
text = "Let's go play a game. We played cards. He is playing football."
# This pattern matches the word "play" only if it's not followed by "ing" or "ed"
pattern = r'play(?!ing|ed)'
matches = re.findall(pattern, text)
print(matches)
- Output:
['play']
How it Works
- The pattern first matches the literal word play.
- The negative lookahead (?!ing|ed) then checks what immediately follows.
- It finds “play” in “Let’s go play a game.” and checks what follows. Since “ a” is not “ing” or “ed”, the match is successful.
- It finds “play” in “We played cards.” and checks what follows. The ed is present, so the negative lookahead fails, and “play” is not matched.
- It finds “play” in “He is playing football.” and checks what follows. The ing is present, so the negative lookahead fails again.
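One caveat worth sketching: the negative lookahead only rules out the listed suffixes, and the pattern has no word boundaries, so it can still match inside other words. The word list below is hypothetical and only meant to illustrate this.

```python
import re

pattern = r'play(?!ing|ed)'

# Hypothetical test words, not taken from the example text.
words = ["play", "played", "playing", "player", "display"]

# Keep every word in which the pattern matches somewhere.
matched = [w for w in words if re.search(pattern, w)]

print(matched)  # ['play', 'player', 'display']
```

"player" matches because "er" is neither "ing" nor "ed", and "display" matches because nothing in the pattern requires "play" to start a word. If that is not wanted, adding \b anchors or more alternatives to the lookahead may help, depending on the text being searched.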
2. Matching filenames without a specific extension
Let’s say you have a string of filenames and you want to match the .txt extension only when it is not immediately followed by .bak.
Python
import re
filenames = "notes.txt, draft.txt.bak, final_report.txt, backup.txt.bak"
# This pattern looks for a .txt extension that is NOT followed by .bak
# Both periods are escaped (\.) so they match a literal "." rather than any character.
pattern = r'\.txt(?!\.bak)'
matches = re.findall(pattern, filenames)
print(matches)
- Output:
['.txt', '.txt']
How it Works
- The pattern looks for the literal string .txt.
- The negative lookahead (?!\.bak) checks whether .bak follows immediately.
- In notes.txt, .txt is found. The negative lookahead does not find .bak after it, so .txt is matched.
- In draft.txt.bak, .txt is found, but the negative lookahead finds .bak, so the match fails.
- In final_report.txt, .txt is found, and since .bak isn’t after it, .txt is matched.
- In backup.txt.bak, .txt is found, but .bak follows, so the match fails.
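Inside a lookahead, an unescaped period still matches any character, so (?!.bak) and (?!\.bak) are not equivalent. The sketch below uses a made-up filename, "data.txt_bak", to show a case where the two differ.

```python
import re

# Hypothetical filename where ".txt" is followed by "_bak" rather than ".bak".
name = "data.txt_bak"

# Unescaped dot: ".bak" matches "_bak" too, so the negative lookahead rejects the match.
unescaped = re.findall(r'\.txt(?!.bak)', name)

# Escaped dot: only a literal ".bak" blocks the match, so ".txt" is found here.
escaped = re.findall(r'\.txt(?!\.bak)', name)

print(unescaped)  # []
print(escaped)    # ['.txt']
```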