Positive Lookahead Assertion

A positive lookahead assertion in Python’s re module is a zero-width assertion: it checks whether the text immediately after the current position matches a given pattern, without including that text in the overall match. It is written as (?=...).

The key is that it is a “lookahead”: the regex engine looks ahead in the string to see whether the pattern inside the parentheses matches there. If it does, the match succeeds, but the text checked by the lookahead is not consumed or returned as part of the match.
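
The “not consumed” behavior can be seen in a minimal sketch. The string foobar and the variable names are illustrative assumptions, not part of the examples in this post. The match stops before the text the lookahead inspected, and that text is still available to the rest of the pattern.

```python
import re

# "foobar" is an illustrative string, not taken from this post's examples.
sample = "foobar"

# The lookahead checks for "bar" but does not consume it.
m = re.search(r'foo(?=bar)', sample)
matched_text = m.group()   # only the "foo" part
match_span = m.span()      # the span ends where "bar" begins

# Because "bar" was not consumed, a later part of the pattern can still match it.
m2 = re.search(r'foo(?=bar)bar', sample)
full_text = m2.group()

print(matched_text, match_span, full_text)
```

Here the first search returns only foo with span (0, 3), while the second pattern matches all of foobar.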

Example: Finding Numbers Followed by a Specific Word

Let’s say you want to find all numbers in a string that are followed by the word “dollars”, but you only want to match the numbers, not the word “dollars” itself.

Python

import re

text = "The price is 500 dollars, but the tax is 50 dollars."

# This pattern looks for one or more digits (\d+), followed by a positive lookahead
# that checks for a space and the word "dollars".
pattern = r'\d+(?= dollars)'

matches = re.findall(pattern, text)

print(matches)

Output:
- ['500', '50']

How it Works: Step-by-Step

\d+: This part of the pattern matches one or more digits (500, 50).
(?= dollars): This is the positive lookahead assertion.
- The regex engine, after matching \d+, “looks ahead” to see if the next characters are a space followed by the word “dollars”.
- Since it finds “ dollars” after both 500 and 50, the assertion succeeds in each case.
- Crucially, the (?= dollars) part itself is not included in the final match. It just verifies a condition.

Without the positive lookahead, using a simple \d+ dollars pattern would match 500 dollars and 50 dollars, which is not what we wanted.
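
The difference between the two patterns can be checked side by side. This is a small sketch on the same text; the variable names with_lookahead and without_lookahead are only illustrative.

```python
import re

text = "The price is 500 dollars, but the tax is 50 dollars."

# With the lookahead, " dollars" is checked but not returned.
with_lookahead = re.findall(r'\d+(?= dollars)', text)

# Without it, " dollars" is consumed and becomes part of each match.
without_lookahead = re.findall(r'\d+ dollars', text)

print(with_lookahead)     # ['500', '50']
print(without_lookahead)  # ['500 dollars', '50 dollars']
```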

Why use it?

Positive lookahead assertions are useful for:

Constraining a match: Ensuring a pattern is present under specific conditions without including that condition in the match result.
Overlapping matches: Because a lookahead consumes no characters, it can be used to report matches that share text, which ordinary consuming patterns cannot do.
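
As a rough illustration of the overlapping case, the sketch below finds every pair of adjacent digits in a made-up string. Putting a capturing group inside the lookahead is one common way to do this; the input and the variable names are assumptions chosen for the example.

```python
import re

digits = "12345"  # illustrative input

# A consuming pattern moves past each match, so pairs never share a digit.
non_overlapping = re.findall(r'\d\d', digits)

# Inside a lookahead nothing is consumed, so the engine advances one
# character at a time; the capturing group reports each pair it saw.
overlapping = re.findall(r'(?=(\d\d))', digits)

print(non_overlapping)  # ['12', '34']
print(overlapping)      # ['12', '23', '34', '45']
```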

Let’s say you want to find all street names in an address string that are followed by “Street” or “St.”, but you only want to match the name of the street itself.

Python

import re

address = "123 Main Street, 45 Elm St., and 67 Oak Avenue."

# This pattern finds any word (\w+) that is immediately followed by a space
# and either "Street" or "St."
pattern = r'\w+(?= Street| St\.)'

matches = re.findall(pattern, address)

print(matches)

Output:
- ['Main', 'Elm']

How it Works

\w+: This part matches one or more word characters (letters, numbers, or underscore).
(?= Street| St\.): This is the positive lookahead assertion.
- It looks ahead to see if the matched word is followed by a space and then either the literal string “Street” or “St.”.
- The | acts as an “OR” condition, checking for either option.
- The \. is an escaped period, as a period is a special character in regex that matches any character.
- The assertion is successful for “Main” (which is followed by “ Street”) and “Elm” (followed by “ St.”).
The street names “Main” and “Elm” are returned in the list matches, but the “ Street” and “ St.” parts are not included, because they were only checked by the lookahead.
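
The alternation inside the lookahead repeats the leading space in both branches. A possible variation, shown here only as a sketch, factors the space out with a non-capturing group (?:...). On this address string both forms return the same list; the names original and factored are illustrative.

```python
import re

address = "123 Main Street, 45 Elm St., and 67 Oak Avenue."

# The pattern as written above: the space appears in each alternative.
original = re.findall(r'\w+(?= Street| St\.)', address)

# Same check with the space factored out. (?:...) groups without capturing,
# so findall still returns the whole match rather than a group.
factored = re.findall(r'\w+(?= (?:Street|St\.))', address)

print(original)
print(factored)
```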

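1. Finding a word that is not followed by certain endings

The counterpart of the positive lookahead is the negative lookahead, written (?!...), which succeeds only when the pattern inside it does not match at the current position. Imagine you want to find all occurrences of “play” that are not followed by “ing” or “ed” (i.e., the root word “play” when it is not in a continuous or past tense form).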

Python

import re

text = "Let's go play a game. We played cards. He is playing football."

# This pattern matches the word "play" only if it's not followed by "ing" or "ed"
pattern = r'play(?!ing|ed)'

matches = re.findall(pattern, text)

print(matches)

Output:
- ['play']

How it Works

The pattern first matches the literal word play.
The negative lookahead (?!ing|ed) then checks what immediately follows.
It finds “play” in “Let’s go play a game.” and checks what follows. Since “ a” is not “ing” or “ed”, the match is successful.
It finds “play” in “We played cards.” and checks what follows. The ed is present, so the negative lookahead fails, and “play” is not matched.
It finds “play” in “He is playing football.” and checks what follows. The ing is present, so the negative lookahead fails again.
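
The three checks described above can be traced with a short sketch. It lists every occurrence of “play” together with the characters that follow it, then the occurrences that pass the negative lookahead. The names candidates and accepted are illustrative.

```python
import re

text = "Let's go play a game. We played cards. He is playing football."

# Every occurrence of "play", paired with the three characters after it.
candidates = [(m.start(), text[m.end():m.end() + 3])
              for m in re.finditer(r'play', text)]

# Start positions of the occurrences that pass the negative lookahead.
accepted = [m.start() for m in re.finditer(r'play(?!ing|ed)', text)]

for start, following in candidates:
    status = "matched" if start in accepted else "rejected"
    print(start, repr(following), status)
```

Only the first occurrence, the one followed by a space, is reported as matched.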

2. Matching filenames without a specific extension

Let’s say you have a list of filenames and you want to find every .txt extension that is not immediately followed by .bak.

Python

import re

filenames = "notes.txt, draft.txt.bak, final_report.txt, backup.txt.bak"

# This pattern looks for a .txt extension that is NOT followed by .bak
# Each \. is an escaped period, so it matches a literal dot.
pattern = r'\.txt(?!\.bak)'

matches = re.findall(pattern, filenames)

print(matches)

Output:
- ['.txt', '.txt']

How it Works

The pattern looks for the literal string .txt.
The negative lookahead (?!\.bak) checks whether .bak follows immediately.
In notes.txt, .txt is found and .bak does not follow, so .txt is matched.
In draft.txt.bak, .txt is found, but .bak follows, so the negative lookahead fails and nothing is matched.
In final_report.txt, .txt is found and .bak does not follow, so .txt is matched.
In backup.txt.bak, .txt is found, but .bak follows, so the lookahead rejects the match.
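
A detail that is easy to miss in a pattern like this is the period inside the lookahead. An unescaped period matches any character, so the lookahead would also reject names where something other than a dot precedes bak. The sketch below compares the two forms on a hypothetical filename, log.txt_bak, which is not in the list above.

```python
import re

# Hypothetical filename chosen to expose the difference; not from the list above.
name = "log.txt_bak"

# Unescaped period in the lookahead: "." matches "_", so "_bak" counts as
# a forbidden suffix and the match is rejected.
unescaped = re.findall(r'\.txt(?!.bak)', name)

# Escaped period: only a literal ".bak" is forbidden, so ".txt" is matched.
escaped = re.findall(r'\.txt(?!\.bak)', name)

print(unescaped)  # []
print(escaped)    # ['.txt']
```

On the four filenames in the example both forms happen to give the same result, which is why the difference only shows up on inputs like this one.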
