positive lookbehind assertion
A positive lookbehind assertion in Python’s re module is a zero-width assertion that succeeds only if the text immediately before the current position matches a given pattern, without including that text in the overall match. It is the mirror image of a lookahead, which checks the text that follows. It is written as (?<=...).
The key constraint for lookbehind assertions in Python is that the pattern inside the parentheses must match text of a fixed length. For example, (?<=abc) is valid, and (?<=a|b|c) is valid because every alternative is exactly one character long. However, (?<=a|bc) is not valid because a and bc have different lengths, and neither is (?<=a+), because a+ can match any number of characters. In those cases re.compile raises an error ("look-behind requires fixed-width pattern").
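The fixed-width rule can be checked directly by asking the re module to compile a few patterns. The following is a minimal sketch; the helper name compiles is made up for this illustration and is not part of the re module.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if re.compile accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_string = compiles(r'(?<=abc)\d')        # one fixed-width string
equal_alternatives = compiles(r'(?<=a|b|c)\d')  # every alternative is 1 character
mixed_alternatives = compiles(r'(?<=a|bc)\d')   # 1 character vs. 2 characters
open_quantifier = compiles(r'(?<=a+)\d')        # a+ has no fixed width

print(fixed_string, equal_alternatives, mixed_alternatives, open_quantifier)
# True True False False
```

The first two patterns compile, while the last two are rejected at compile time, before any text is searched.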
Example: Finding Numbers After a Specific Word
Let’s say you want to find all numbers in a string that are preceded by the word “cost:”, but you only want to match the numbers, not the word “cost:”.
Python
import re
text = "The total cost: 50. The final price: 20."
# This pattern looks for one or more digits (\d+), preceded by a positive lookbehind
# that checks for the literal string "cost: ".
pattern = r'(?<=cost: )\d+'
matches = re.findall(pattern, text)
print(matches)
- Output:
['50']
How it Works: Step-by-Step
- \d+: This part of the pattern looks for one or more digits (50 and 20 are the candidates).
- (?<=cost: ): This is the positive lookbehind assertion.
  - At each position where a match could start, the regex engine “looks behind” to check whether the preceding characters are “cost: ”.
  - “cost: ” appears directly before 50, so the assertion succeeds there and \d+ matches 50. Before 20 the preceding text is “price: ”, so the assertion fails.
  - The lookbehind part itself is not included in the final match. It just verifies a condition.
Without the positive lookbehind, a simple cost: \d+ pattern would match cost: 50, which is not what was intended.
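To make the difference concrete, the sketch below runs three patterns over the same sample text. The capturing-group variant is included only as a common alternative for comparison; the variable names are illustrative.

```python
import re

text = "The total cost: 50. The final price: 20."

# No lookbehind: the prefix is consumed and becomes part of the match.
with_prefix = re.findall(r'cost: \d+', text)

# Lookbehind: the prefix is checked but not consumed.
with_lookbehind = re.findall(r'(?<=cost: )\d+', text)

# Alternative: a capturing group; findall returns only the group's text.
with_group = re.findall(r'cost: (\d+)', text)

print(with_prefix)      # ['cost: 50']
print(with_lookbehind)  # ['50']
print(with_group)       # ['50']
```

With re.findall, the group and the lookbehind give the same list. The lookbehind mainly matters when the match itself must exclude the prefix, for example with re.sub or when reading match.group(0).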
Why use it?
Positive lookbehind assertions are useful for:
- Targeted Matching: Finding a specific pattern only if it’s in a certain context.
- Excluding Preceding Characters: Matching a string without including the characters that come before it.
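One place where excluding the preceding characters pays off is substitution. The sketch below reuses the earlier sample text and a made-up replacement value of 75. Because the lookbehind consumes nothing, only the digits are replaced and the "cost: " prefix stays in place.

```python
import re

text = "The total cost: 50. The final price: 20."

# Replace only the number that follows "cost: "; the prefix is left untouched.
updated = re.sub(r'(?<=cost: )\d+', '75', text)

print(updated)
# The total cost: 75. The final price: 20.
```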
1. Extracting currency values
Let’s say you have a string with different currencies and you only want to extract the dollar amounts.
Python
import re
text = "The cost is $50 and €10. The total is $200 and £5."
# This pattern matches any number (\d+) that is preceded by a dollar sign ($)
# Note that we escape the dollar sign with a backslash since it's a special character in regex.
pattern = r'(?<=\$)\d+'
matches = re.findall(pattern, text)
print(matches)
- Output:
['50', '200']
How it Works
- The pattern \d+ looks for one or more digits.
- The lookbehind (?<=\$) checks to make sure the dollar sign $ immediately precedes the digits.
- The digits 50 and 200 meet this condition and are returned in the list. The $ symbol is not part of the match itself. The numbers 10 and 5 are ignored because they are not preceded by a dollar sign.
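The same idea can be stretched to amounts with cents. The following sketch assumes prices are written with an optional two-digit decimal part; both the sample string and that format are assumptions made for illustration.

```python
import re

text = "Items: $19.99, €5.00, $7."

# Digits preceded by "$", optionally followed by a dot and exactly two digits.
# (?:...) is a non-capturing group, so findall still returns the whole match.
pattern = r'(?<=\$)\d+(?:\.\d{2})?'

amounts = re.findall(pattern, text)
print(amounts)  # ['19.99', '7']
```

The trailing period after $7 is not picked up because it is not followed by two digits, and €5.00 is skipped because no dollar sign precedes it.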
2. Getting names after a title
Imagine you have a list of people’s names with titles, and you only want to extract the names that are preceded by the title “Mr.”
Python
import re
names = "Mr. John Smith, Ms. Jane Doe, Mr. Peter Jones"
# This pattern looks for a word (\w+) that is preceded by "Mr. "
# We are specific here with the space after "Mr." to avoid matching other words.
pattern = r'(?<=Mr\. )\w+'
matches = re.findall(pattern, names)
print(matches)
- Output:
['John', 'Peter']
How it Works
- The pattern \w+ looks for one or more word characters. Because \w does not match a space, only the first name after the title is captured.
- The lookbehind (?<=Mr\. ) checks that the preceding text is the literal string “Mr. ”. Note the backslash to escape the period (.), which is a special regex character.
- The lookbehind finds “Mr. ” before “John” and “Peter”, but not before “Jane”, so only “John” and “Peter” are returned as matches.
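If titles of different lengths need to be accepted, a single lookbehind will not work in Python's re module, because “Mr. ” and “Mrs. ” differ in width. One possible workaround, sketched below with a made-up sample string, is to give each title its own fixed-width lookbehind and combine them with alternation.

```python
import re

names = "Mr. John Smith, Mrs. Jane Doe, Dr. Peter Jones"

# "Mr. " (4 characters) and "Mrs. " (5 characters) cannot share one lookbehind,
# so each title gets its own fixed-width lookbehind inside a non-capturing group.
pattern = r'(?:(?<=Mr\. )|(?<=Mrs\. ))\w+'

matches = re.findall(pattern, names)
print(matches)  # ['John', 'Jane']
```

“Peter” is not returned because “Dr. ” is not one of the listed titles.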