re Programs
# find titles
import re
def extract_book_titles(text):
    """
    Extracts book titles from the given text using a regular expression.

    Args:
        text: A string containing the bookshelf data.

    Returns:
        A list of book titles.
    """
    # The regular expression to find book titles.
    # It looks for a semicolon, then captures everything up to the next semicolon.
    regex = r';\s*(.*?);'
    # Find all non-overlapping matches in the text.
    titles = re.findall(regex, text)
    return titles
# The full content of the 'bookshelf.txt' file provided by the user.
file_content = """
Terry-Thomas;Filling the Gap;1959
Harpo Marx;Harpo Speaks;1961
Charlie Chaplin;My Autobiography;1964
Moe Howard;Moe Howard and the Three Stooges, AKA I Came, I Stooged, I Conquered (released posthumously);1974
Sid Caesar;Where Have I Been?;1982
Bill Cosby;Fatherhood;1986
Mel Blanc;That's NOT All, Folks;1988
Gilda Radner;It's Always Something;1989
Richard Pryor;Pryor Convictions;1995
Damon Wayans;Bootleg;1996
Stephen Fry;Moab Is My Washpot;1997
Jenny McCarthy;Jen-X: My Open Book;1997
Chris Rock;Rock This;1997
Sandra Bernhard;Confessions of a Pretty Lady;1998
Danny Bonaduce;Random Acts of Badness;2001
Fran Drescher;Cancer Schmancer;2002
Alan Thicke;How Men Have Babies: a New Father's Survival Guide;2003
Rodney Dangerfield;It's Not Easy Being Me: a Lifetime of No Respect But Plenty of Sex and Drugs;2004
Tom Green;Hollywood Causes Cancer;2004
Rik Mayall;Bigger Than Hitler & Better Than Christ;2005
Tommy Chong;The I Chong: Meditations From the Joint;2006
Alan Thicke;How to Raise Kids Who Won't Hate You;2006
Steve Martin;Born Standing Up;2007
Denis Leary;Why We Suck;2008
Stephen Fry;Ernie: The Autobiography;2009
Frankie Boyle;My Shit Life So Far;2009
Craig Ferguson;American on Purpose;2009
Todd Bridges;Killing Willis;2010
Kevin Smith;Tough Sh*t: Life Advice from a Fat, Lazy Slob Who Still Made Good;2012
Jimmie Walker;Dyn-o-mite!;2012
Andrew Dice Clay;The Filthy Truth;2014
John Cleese;So, Anyway...;2014
Cheech Marin;Cheech Is Not My Real Name...But Don't Call Me Chong!;2017
Eric Idle;Always Look On the Bright Side Of Life;2018
"""
# Call the function and print the results.
book_titles = extract_book_titles(file_content)
for title in book_titles:
    print(title)
The regular expression r';\s*(.*?);' is used to find and extract text that is located between two semicolons.
r'': The r prefix in front of the string denotes a “raw string” in Python. This tells the interpreter to treat backslashes as literal characters instead of escape sequences. This is a common practice when writing regular expressions to avoid issues with characters like \n or \t.
;: This matches a literal semicolon. The pattern begins by looking for a semicolon character.
\s*: This part matches any whitespace character (such as spaces, tabs, or newlines) that appears zero or more times. It accounts for potential spaces between the semicolon and the text you want to capture.
(.*?): This is the core part of the expression.
    (): The parentheses create a capturing group. This means that whatever text is matched within these parentheses will be “captured” or “extracted” as a separate result.
    .*: The dot . matches any character except a newline. The asterisk * means it will match the preceding element (in this case, any character) zero or more times.
    ?: The question mark ? makes the * non-greedy (or lazy). Instead of matching as many characters as possible until the end of the line, it matches as few characters as possible until it finds the next part of the pattern, which is the final semicolon.
;: This matches the literal closing semicolon, marking the end of the text to be captured.
In summary, this expression finds a semicolon, then non-greedily captures all characters up to the next semicolon. This is an effective way to extract the middle value from a semicolon-separated string.
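The effect of the ? in (.*?) is easiest to see on a string with more than two semicolons. The sketch below uses a made-up sample string (it is not from bookshelf.txt) and compares the lazy pattern with a greedy variant. The variable names are illustrative only.

```python
import re

# Hypothetical sample with four semicolons (not taken from bookshelf.txt).
sample = "A;B;C;D;E"

# Lazy: the capture stops at the nearest following semicolon.
lazy = re.findall(r";\s*(.*?);", sample)

# Greedy: the capture runs to the last semicolon that still lets the pattern match.
greedy = re.findall(r";\s*(.*);", sample)

print(lazy)    # ['B', 'D']
print(greedy)  # ['B;C;D']
```

Note that the lazy version returns 'B' and 'D' but not 'C'. re.findall returns non-overlapping matches, so the semicolon that closes one match cannot also open the next. With three fields per record, that is why only the middle field (the title) is captured.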
Title 1 to 25 chars
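import re

# Read the whole bookshelf file into one string; the with block closes the file.
with open(r"E:\bookshelf.txt") as f:
    string = f.read()

# Capture titles that are 1 to 25 characters long.
result = re.findall(r".+?;(.{1,25});.+?", string)
print(result)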
['Filling the Gap', 'Harpo Speaks', 'My Autobiography', 'Where Have I Been?', 'Fatherhood', "That's NOT All, Folks", "It's Always Something", 'Pryor Convictions', 'Bootleg', 'Moab Is My Washpot', 'Jen-X: My Open Book', 'Rock This', 'Random Acts of Badness', 'Cancer Schmancer', 'Hollywood Causes Cancer', 'Born Standing Up', 'Why We Suck', 'Ernie: The Autobiography', 'My Shit Life So Far', 'American on Purpose', 'Killing Willis', 'Dyn-o-mite!', 'The Filthy Truth', 'So, Anyway...']
The regular expression r".+?;(.{1,25});.+?" is designed to extract a specific piece of text between two semicolons.
r: This indicates a raw string in Python, which prevents backslashes from being interpreted as escape sequences.
.+?: This matches one or more (+) of any character (.) in a non-greedy (?) way. It will match the first part of the line (the author) up to the first semicolon.
;: This matches the literal semicolon that separates the first part of the string from the text you want to capture.
( and ): These create a capturing group, which saves the text that matches the pattern inside.
.{1,25}: This is the core of the extraction. It matches any character (.) repeated at least once but no more than 25 times. This sets a length constraint on the captured text.
;: This matches the literal semicolon that follows the captured text.
.+?: This again matches one or more (+) of any character (.) in a non-greedy (?) way. Because it is the last element of the pattern, nothing forces it to expand, so it stops after a single character (the first digit of the year) rather than continuing to the end of the line. Its only effect is to require at least one character after the second semicolon.
In the context of the provided text file, this regular expression would attempt to capture book titles that are between 1 and 25 characters long. For example, it would match “Bootleg” from Damon Wayans;Bootleg;1996 but would fail to match “Moe Howard and the Three Stooges, AKA I Came, I Stooged, I Conquered (released posthumously)” because that title is longer than 25 characters.
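As a quick check of the length constraint, the following sketch runs the same pattern on three records copied from the bookshelf data. It also prints the whole matched text to show how little the trailing .+? consumes. The names sample, short_titles and whole_match are illustrative, not part of the original program.

```python
import re

# Three records copied from the bookshelf data above.
sample = (
    "Damon Wayans;Bootleg;1996\n"
    "Moe Howard;Moe Howard and the Three Stooges, AKA I Came, I Stooged, "
    "I Conquered (released posthumously);1974\n"
    "Sandra Bernhard;Confessions of a Pretty Lady;1998\n"
)

pattern = r".+?;(.{1,25});.+?"

# Only the title of 25 characters or fewer is captured.
short_titles = re.findall(pattern, sample)
print(short_titles)  # ['Bootleg']

# group(0) is the whole matched text, not just the captured title.
whole_match = re.search(pattern, sample).group(0)
print(whole_match)  # Damon Wayans;Bootleg;1
```

The second print shows that the match ends after the first digit of the year: a lazy quantifier at the very end of a pattern takes the minimum it is allowed, which here is one character.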
Find authors of books released in 2000 to 2099
>>> result = re.findall(r"(.+?);.+?;20[0-9][0-9]", string)
>>> result
['Danny Bonaduce', 'Fran Drescher', 'Alan Thicke', 'Rodney Dangerfield', 'Tom Green', 'Rik Mayall', 'Tommy Chong', 'Alan Thicke', 'Steve Martin', 'Denis Leary', 'Stephen Fry', 'Frankie Boyle', 'Craig Ferguson', 'Todd Bridges', 'Kevin Smith', 'Jimmie Walker', 'Andrew Dice Clay', 'John Cleese', 'Cheech Marin', 'Eric Idle']
The regular expression r"(.+?);.+?;20[0-9][0-9]" is designed to extract a specific piece of information from the provided text, which follows the pattern Author;Title;Year.
r: This indicates a raw string in Python, which prevents backslashes from being treated as escape characters.
( and ): These parentheses create a capturing group. The text matched by the pattern inside will be saved as a result.
.+?: This part matches one or more (+) of any character (.) in a non-greedy (?) way. Inside the group, it will match the author’s name at the beginning of the line up to the first semicolon.
;: This matches the first literal semicolon.
.+?: This again matches one or more of any character in a non-greedy way. It matches the book title up to the next semicolon, but because it is outside the parentheses the title is not captured.
;: This matches the second literal semicolon.
20[0-9][0-9]: This matches the publication year. It specifically looks for a number that starts with “20” followed by any two digits from 0 to 9.
In the context of the provided text, this regular expression captures the author’s name for any book whose year is between 2000 and 2099. It matches authors like Danny Bonaduce and Fran Drescher, but does not match authors like Terry-Thomas or Harpo Marx, whose books were published in the 1900s. Because the last two digits are unrestricted, books from 2010 onward (for example Kevin Smith, 2012) are included as well.
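If only the years 2000 to 2009 are wanted, one possible adjustment is to fix the third digit of the year to 0. The sketch below compares both patterns on four records copied from the bookshelf data. The names sample, century and decade are illustrative.

```python
import re

# Four records copied from the bookshelf data above.
sample = (
    "Richard Pryor;Pryor Convictions;1995\n"
    "Danny Bonaduce;Random Acts of Badness;2001\n"
    "Stephen Fry;Ernie: The Autobiography;2009\n"
    "Jimmie Walker;Dyn-o-mite!;2012\n"
)

# Any year from 2000 to 2099.
century = re.findall(r"(.+?);.+?;20[0-9][0-9]", sample)

# Only 2000 to 2009: the third digit is fixed to 0.
decade = re.findall(r"(.+?);.+?;200[0-9]", sample)

print(century)  # ['Danny Bonaduce', 'Stephen Fry', 'Jimmie Walker']
print(decade)   # ['Danny Bonaduce', 'Stephen Fry']
```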