A Python module for text analysis and string matching
See https://docs.python.org/3.4/library/re.html#module-re for documentation
See http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/ for a Regex cheat sheet
import re
re.findall('is','This is an arbitrary sentence.')
re.findall('..is.','This is an arbitrary sentence.')
Why didn't the second 'is' show up? Because the two matches would overlap, and findall returns only non-overlapping matches. (Try adding two spaces between 'This' and 'is'.)
Now let's grab strings that look like 'w___s', where the gap is 1, 2, or 3 characters long.
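The non-overlap rule can be checked directly. The sketch below runs the same pattern on the original sentence and on a copy with two extra spaces; the variable names are illustrative only.

```python
import re

pattern = '..is.'
original = 'This is an arbitrary sentence.'
spaced = 'This   is an arbitrary sentence.'  # two extra spaces after 'This'

# findall scans left to right and never reuses characters from an earlier match.
one_match = re.findall(pattern, original)
two_matches = re.findall(pattern, spaced)

print(one_match)    # ['This ']
print(two_matches)  # ['This ', '  is ']
```

With the extra spaces, the second match no longer needs any character already consumed by the first one.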
re.findall('w.{1,3}s','This is an arbitrary sentence which is full of a lot more words.')
re.findall('w.{1,6}s','This is an arbitrary sentence which is full of a lot more words words.')
re.findall('.el+','She sells sea shells by the sea shore. She likes to sell them to elfs.')
Let's require the strings to contain 'el' as the second and third letters and end in an 's'
re.findall('.el+s','She sells sea shells by the sea shore. She likes to sell them to elfs.')
Let's require the strings to start with 'e', contain at least 3 characters, and end in an 's'
re.findall('e.+s','She sells sea shells by the sea shore. She likes to sell them to elfs.')
Wait! That's almost everything! (1 big string)
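Matching is "greedy": by default a quantifier such as '+' grabs the longest substring that still lets the pattern match, and no other conflicting (overlapping) strings are returned. Adding '?' after the quantifier (as in '.+?') makes it non-greedy, so the shortest match is returned instead.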
re.findall('e.+?s','She sells sea shells by the sea shore. She likes to sell them to elfs.')
For example, let's find the shortest strings that start with 'e' and end in either 'l' or 's'.
re.findall('e.+?[ls]','She sells sea shells by the sea shore. She likes to sell them to elfs.')
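To make the greedy versus non-greedy difference easier to see, here is a minimal sketch on a shortened sentence (the shortened string is chosen only for illustration).

```python
import re

s = 'She sells sea shells'

greedy = re.findall('e.+s', s)  # '.+' grabs as much as it can
lazy = re.findall('e.+?s', s)   # '.+?' stops at the first 's' it can reach

print(greedy)  # ['e sells sea shells']
print(lazy)    # ['e s', 'ells', 'ea s', 'ells']
```

The greedy version returns one long string, while the non-greedy version returns several short ones from the same text.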
Now let's require the second letter to be 'e' and the second-to-last letter to be either 'l' or 's' by adding a '.' at the end
re.findall('.e[ls].','She sells sea shells by the sea shore. She likes to sell them to elfs.')
What does the following identify?
re.findall('.e[a-z].','She sells sea shells by the sea shore. She likes to sell them to elfs.')
emailmatcher = r'[a-z]+@[a-z]+\.[a-z]+' # Since '.' means "any character", '\.' is used to indicate a literal period. The r prefix keeps the backslash as-is.
# Find a run of lowercase letters, then '@', then another run, then a literal '.', then another run
re.findall(emailmatcher,'My address is danet@buffalo.edu')
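The effect of escaping the period can be seen by comparing the two patterns side by side. This is a small sketch; the second address, 'danet@buffalo-edu', is a made-up string used only to show the difference.

```python
import re

text = 'Write to danet@buffalo.edu, not danet@buffalo-edu'

loose = re.findall('[a-z]+@[a-z]+.[a-z]+', text)     # '.' matches any character
strict = re.findall(r'[a-z]+@[a-z]+\.[a-z]+', text)  # '\.' matches only a period

print(loose)   # ['danet@buffalo.edu', 'danet@buffalo-edu']
print(strict)  # ['danet@buffalo.edu']
```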
What if there is a digit in the email address?
emailmatcher = r'[a-z]+@[a-z]+\.[a-z]+'
re.findall(emailmatcher,'My address is danet5@buffalo.edu')
Spend a few minutes trying to modify the above to allow capital letters, numbers, and '_' before the '@' symbol
emailmatcher =
re.findall(emailmatcher,'My addresses are danet5@buffalo.edu and danet5@Buffalo.com')
#add '-' too
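One possible answer to the exercise, offered as a sketch rather than the intended solution. It assumes capital letters should also be allowed in the domain name, since the test string contains 'Buffalo.com'.

```python
import re

# Assumption: letters of either case, digits, '_' and '-' before the '@',
# and letters of either case in the domain name.
emailmatcher = r'[A-Za-z0-9_-]+@[A-Za-z]+\.[a-z]+'
found = re.findall(emailmatcher, 'My addresses are danet5@buffalo.edu and danet5@Buffalo.com')

print(found)  # ['danet5@buffalo.edu', 'danet5@Buffalo.com']
```

Placing '-' last inside the brackets keeps it a literal hyphen instead of a range marker.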
s = 'clear color coooool'
myre = 'c[aeiou]*l'
re.findall(myre,s)
Let's find edge, but not ledge, ledger, or edges.
s = 'edge, ledge, ledger, and edges'
myre = '\\bedge\\b'
re.findall(myre,s)
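To see what the word boundary '\b' contributes, the sketch below compares the pattern with and without it. The raw string r'\bedge\b' is equivalent to the doubled-backslash form '\\bedge\\b'.

```python
import re

s = 'edge, ledge, ledger, and edges'

anywhere = re.findall('edge', s)         # also matches inside longer words
whole_word = re.findall(r'\bedge\b', s)  # only the standalone word

print(anywhere)    # ['edge', 'edge', 'edge', 'edge']
print(whole_word)  # ['edge']
```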
Modify the code to identify:
(a) edge and edges, but not the others
(b) edge and ledger, but not the others
myre =
re.findall(myre,s)
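One possible pair of answers, given as a sketch and not necessarily the intended solution. Here '?' makes the preceding 's' optional and '|' means "or"; the names re_a and re_b are illustrative.

```python
import re

s = 'edge, ledge, ledger, and edges'

re_a = r'\bedges?\b'            # (a) edge and edges
re_b = r'\bedge\b|\bledger\b'   # (b) edge and ledger

answer_a = re.findall(re_a, s)
answer_b = re.findall(re_b, s)

print(answer_a)  # ['edge', 'edges']
print(answer_b)  # ['edge', 'ledger']
```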
Find websites on a webpage by looking for strings:
'http:// some string . html'
s = 'blah <a href=http://foo.com/blah.html'
myre = r'http://.+\.html'
re.findall(myre,s)
If you don't want the 'http://' part, wrap the part you do want in parentheses '()'. When the pattern contains a group, findall returns the group's text instead of the whole match.
s = 'blah <a href=http://foo.com/blah.html'
myre = r'http://(.+\.html)'
re.findall(myre,s)
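With more than one group, findall returns a tuple per match. The pattern below, which splits the host from the page name, is a hypothetical extension of the example above.

```python
import re

s = 'blah <a href=http://foo.com/blah.html'

# Two groups: everything up to the last '/', then the page ending in '.html'.
parts = re.findall(r'http://(.+)/(.+\.html)', s)

print(parts)  # [('foo.com', 'blah.html')]
```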
For heavy-duty applications, it is faster to pre-compile the regex:
myre = re.compile(r'http://(.+\.html)')
s = 'blah <a href=http://foo.com/blah.html'
re.findall(myre,s)
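A compiled pattern is most useful when it is reused many times. A minimal sketch, assuming a small list of page snippets (the snippets and the name link_re are made up for illustration):

```python
import re

link_re = re.compile(r'http://(.+\.html)')  # compiled once, reused below

pages = [
    'blah <a href=http://foo.com/blah.html',
    'no link here',
    '<a href=http://bar.org/index.html',
]

# A compiled pattern object exposes findall as a method.
results = [link_re.findall(page) for page in pages]

print(results)  # [['foo.com/blah.html'], [], ['bar.org/index.html']]
```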
Grab the full date 'Tue, Feb 13, 2017' from the string below
s = 'Today is Tue, Feb 13, 2017, I think.'
myre =
re.findall(myre,s)
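One possible pattern for this exercise, as a sketch. It assumes the date always has the shape "three-letter day, three-letter month, day number, four-digit year" and checks only that shape, not whether the names are real.

```python
import re

s = 'Today is Tue, Feb 13, 2017, I think.'

# Capitalized 3-letter word, comma, capitalized 3-letter word,
# 1-2 digits, comma, 4 digits.
myre = r'[A-Z][a-z]{2}, [A-Z][a-z]{2} [0-9]{1,2}, [0-9]{4}'
dates = re.findall(myre, s)

print(dates)  # ['Tue, Feb 13, 2017']
```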
But this can pick up matching strings that make no sense:
s = 'Today is Nod, Boo 13, 2017, I think.'
re.findall(myre,s)
So let's restrict the possible strings to specific day and month names
myre = 'Mon|Tue'
s = 'Today is Tue, Feb 13, 2017, I think. Or is it Monday?'
re.findall(myre,s)
myre = '\\bMon\\b|\\bTue\\b'
s = 'Today is Tue, Feb 13, 2017, I think. Or is it Monday?'
re.findall(myre,s)
myre = '(?:Mon|Tue|Wed|Thu|Fri),' # non-capturing group
s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday?'
re.findall(myre,s)
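The difference between a capturing group '(...)' and a non-capturing group '(?:...)' shows up in what findall returns. A short sketch using the same string:

```python
import re

s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday?'

capturing = re.findall('(Mon|Tue|Wed|Thu|Fri),', s)        # returns the group only
non_capturing = re.findall('(?:Mon|Tue|Wed|Thu|Fri),', s)  # returns the whole match

print(capturing)      # ['Thu']
print(non_capturing)  # ['Thu,']
```

'Monday?' is skipped in both cases because 'Mon' is not followed by a comma.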
Combine this with our previous code to identify the full date in this string
s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday? No, today is Nod, Boo 13, 2017.'
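One way the pieces might be combined, as a sketch only. It assumes the first day name is spelled 'Thu', and the lists of day and month abbreviations are illustrative additions, not part of the original exercise.

```python
import re

# Assumption: the day name in the string is spelled 'Thu'.
s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday? No, today is Nod, Boo 13, 2017.'

# Hypothetical lists of allowed abbreviations, wrapped in non-capturing groups.
days = '(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)'
months = '(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
myre = days + ', ' + months + ' [0-9]{1,2}, [0-9]{4}'

full_dates = re.findall(myre, s)

print(full_dates)  # ['Thu, Feb 13, 2017']
```

The nonsense date 'Nod, Boo 13, 2017' is no longer matched because neither name appears in the allowed lists.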