positive lookahead assertion

A positive lookahead assertion in Python’s re module is a zero-width assertion that checks if the pattern that follows it is present, without including that pattern in the overall match. It is written as (?=...).

The key is that it’s a “lookahead”โ€”the regex engine looks ahead in the string to see if the pattern inside the parentheses is there. If it is, the match succeeds, but the lookahead part itself is not consumed or returned as part of the match.

Example: Finding Words Followed by a Specific Word

Let’s say you want to find all numbers in a string that are followed by the word “dollars”, but you only want to match the numbers, not the word “dollars” itself.

Python

import re

text = "The price is 500 dollars, but the tax is 50 dollars."

# This pattern looks for one or more digits (\d+), followed by a positive lookahead
# that checks for a space and the word "dollars".
pattern = r'\d+(?= dollars)'

matches = re.findall(pattern, text)

print(matches)
  • Output:
    • ['500', '50']

How it Works: Step-by-Step

  1. \d+: This part of the pattern matches one or more digits (500, 50).
  2. (?= dollars): This is the positive lookahead assertion.
    • The regex engine, after matching \d+, “looks ahead” to see if the next characters are a space followed by the word “dollars”.
    • Since it finds ” dollars” after 500 and 50, the assertion is True.
    • Crucially, the (?= dollars) part itself is not included in the final match. It just verifies a condition.

Without the positive lookahead, using a simple \d+ dollars pattern would match 500 dollars and 50 dollars, which is not what we wanted.

Why use it?

Positive lookahead assertions are useful for:

  • Constraining a match: Ensuring a pattern is present under specific conditions without including that condition in the match result.
  • Overlapping matches: They can be used to find overlapping matches that might not be possible with standard regex patterns.

Let’s say you want to find all street names in an address string that are followed by “Street” or “St.”, but you only want to match the name of the street itself.

Python

import re

address = "123 Main Street, 45 Elm St., and 67 Oak Avenue."

# This pattern finds any word (\w+) that is immediately followed by a space
# and either "Street" or "St."
pattern = r'\w+(?= Street| St\.)'

matches = re.findall(pattern, address)

print(matches)
  • Output:
    • ['Main', 'Elm']

How it Works

  • \w+: This part matches one or more word characters (letters, numbers, or underscore).
  • (?= Street| St\.): This is the positive lookahead assertion.
    • It looks ahead to see if the matched word is followed by a space and then either the literal string “Street” or “St.”.
    • The | acts as an “OR” condition, checking for either option.
    • The \. is an escaped period, as a period is a special character in regex that matches any character.
    • The assertion is successful for “Main” (which is followed by ” Street”) and “Elm” (followed by ” St.”).
  • The street names “Main” and “Elm” are returned in the list matches, but the “Street” or ” St.” parts are not included, because they were only part of the lookahead check.

1. Finding words that aren’t verbs

Imagine you want to find all occurrences of the word “play” that are not followed by “ing” or “ed” (i.e., you want to find the root word “play” when it’s not a verb in a continuous or past tense).

Python

import re

text = "Let's go play a game. We played cards. He is playing football."

# This pattern matches the word "play" only if it's not followed by "ing" or "ed"
pattern = r'play(?!ing|ed)'

matches = re.findall(pattern, text)

print(matches)
  • Output:
    • ['play']

How it Works

  • The pattern first matches the literal word play.
  • The negative lookahead (?!ing|ed) then checks what immediately follows.
  • It finds “play” in “Let’s go play a game.” and checks what follows. Since ” a” is not “ing” or “ed”, the match is successful.
  • It finds “play” in “We played cards.” and checks what follows. The ed is present, so the negative lookahead fails, and “play” is not matched.
  • It finds “play” in “He is playing football.” and checks what follows. The ing is present, so the negative lookahead fails again.

2. Matching filenames without a specific extension

Let’s say you have a list of filenames and you want to find all .txt files that are not followed by a .bak extension.

Python

import re

filenames = "notes.txt, draft.txt.bak, final_report.txt, backup.txt.bak"

# This pattern looks for a .txt extension that is NOT followed by .bak
# The \. is an escaped period.
pattern = r'\.txt(?!.bak)'

matches = re.findall(pattern, filenames)

print(matches)
  • Output:
    • ['.txt', '.txt']

How it Works

  • The pattern looks for the literal string .txt.
  • The negative lookahead (?!.bak) checks if .bak follows.
  • The first . in .txt is found in notes.txt. The negative lookahead checks for .bak and doesn’t find it, so .txt is matched.
  • The first . in draft.txt.bak is found. The negative lookahead finds .bak and fails the match.
  • The . in final_report.txt is found, and since .bak isn’t after it, .txt is matched.
  • The . in backup.txt.bak is found, but the lookahead fails the match.

Similar Posts

  • Combined Character Classes

    Combined Character Classes Explained with Examples 1. [a-zA-Z0-9_] – Word characters (same as \w) Description: Matches any letter (lowercase or uppercase), any digit, or underscore Example 1: Extract all word characters from text python import re text = “User_name123! Email: test@example.com” result = re.findall(r'[a-zA-Z0-9_]’, text) print(result) # [‘U’, ‘s’, ‘e’, ‘r’, ‘_’, ‘n’, ‘a’, ‘m’, ‘e’, ‘1’, ‘2’,…

  • Various types of data types in python

    ๐Ÿ”ข Numeric Types Used for storing numbers: ๐Ÿ”ก Text Type Used for textual data: ๐Ÿ“ฆ Sequence Types Ordered collections: ๐Ÿ”‘ Mapping Type Used for key-value pairs: ๐Ÿงฎ Set Types Collections of unique elements: โœ… Boolean Type ๐Ÿช™ Binary Types Used to handle binary data: Want to explore examples of each, or dive deeper into when…

  • re.I, re.S, re.X

    Python re Flags: re.I, re.S, re.X Explained Flags modify how regular expressions work. They’re used as optional parameters in re functions like re.search(), re.findall(), etc. 1. re.I or re.IGNORECASE Purpose: Makes the pattern matching case-insensitive Without re.I (Case-sensitive): python import re text = “Hello WORLD hello World” # Case-sensitive search matches = re.findall(r’hello’, text) print(“Case-sensitive:”, matches) # Output: [‘hello’] # Only finds lowercase…

  • group() and groups()

    Python re group() and groups() Methods Explained The group() and groups() methods are used with match objects to extract captured groups from regex patterns. They work on the result of re.search(), re.match(), or re.finditer(). group() Method groups() Method Example 1: Basic Group Extraction python import retext = “John Doe, age 30, email: john.doe@email.com”# Pattern with multiple capture groupspattern = r'(\w+)\s+(\w+),\s+age\s+(\d+),\s+email:\s+([\w.]+@[\w.]+)’///The Pattern: r'(\w+)\s+(\w+),\s+age\s+(\d+),\s+email:\s+([\w.]+@[\w.]+)’Breakdown by Capture…

  • Dynamically Typed vs. Statically Typed Languages ๐Ÿ”„โ†”๏ธ

    Dynamically Typed vs. Statically Typed Languages ๐Ÿ”„โ†”๏ธ Dynamically Typed Languages ๐Ÿš€ Python Pros: Cons: Statically Typed Languages ๐Ÿ”’ Java Pros: Cons: Key Differences ๐Ÿ†š Feature Dynamically Typed Statically Typed Type Checking Runtime Compile-time Variable Types Can change during execution Fixed after declaration Error Detection Runtime exceptions Compile-time failures Speed Slower (runtime checks) Faster (optimized binaries)…

  • What is general-purpose programming language

    A general-purpose programming language is a language designed to be used for a wide variety of tasks and applications, rather than being specialized for a particular domain. They are versatile tools that can be used to build anything from web applications and mobile apps to desktop software, games, and even operating systems. Here’s a breakdown…

Leave a Reply

Your email address will not be published. Required fields are marked *