Regular expressions¶

A Python module for text analysis and string matching

See https://docs.python.org/3.4/library/re.html#module-re for documentation

See http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/ for a Regex cheat sheet

import re

re.findall('is','This is an arbitrary sentence.')

['is', 'is']

Note that 'is' shows up twice: once inside 'This' and once as the word 'is'
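To see where each match sits in the sentence, one option is re.finditer, which yields match objects instead of plain strings. This is only a small sketch; the variable names are made up for illustration.

```python
import re

sentence = 'This is an arbitrary sentence.'

# Each match object knows where it starts, so we can pair position and text.
positions = [(m.start(), m.group()) for m in re.finditer('is', sentence)]
print(positions)  # [(2, 'is'), (5, 'is')]
```

The first hit starts at index 2 (inside 'This') and the second at index 5 (the word 'is').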

Know the Special Characters!¶

Use the '.' to indicate any character (except newline)¶

re.findall('..is.','This is an arbitrary sentence.')

['This ']

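Why didn't the second 'is' show up? Because the two matches would overlap: the two characters before the second 'is' were already consumed by the first match. (Try adding two extra spaces between 'This' and 'is'.)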
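The experiment suggested above can be sketched as follows. The second sentence is a made-up variant with three spaces between 'This' and 'is', so that two unused characters precede the second 'is'.

```python
import re

one_space = 'This is an arbitrary sentence.'
three_spaces = 'This   is an arbitrary sentence.'

# One space: the second 'is' would need characters the first match already used.
matches_one = re.findall('..is.', one_space)

# Three spaces: two fresh characters are available before the second 'is'.
matches_three = re.findall('..is.', three_spaces)

print(matches_one)    # ['This ']
print(matches_three)  # ['This ', '  is ']
```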

Now let's grab strings that look like 'w___s', where the gap is 1, 2, or 3 characters long

re.findall('w.{1,3}s','This is an arbitrary sentence whichs is full of a lot more words.')

['words']

re.findall('w.{1,6}s','This is an arbitrary sentence which is full of a lot more words words.')

['which is', 'words', 'words']
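To check the bounds of '{1,3}' in isolation, here is a small sketch that tests whole words one at a time with re.fullmatch. The word list is invented for illustration, with gaps of 0 to 4 characters between 'w' and 's'.

```python
import re

# Gap lengths between 'w' and 's': 0, 1, 2, 3, 4
candidates = ['ws', 'was', 'wins', 'wakes', 'wagers']

# re.fullmatch succeeds only if the entire word fits the pattern.
matched = [word for word in candidates if re.fullmatch('w.{1,3}s', word)]
print(matched)  # ['was', 'wins', 'wakes']
```

Only the words with a gap of 1, 2, or 3 characters pass.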

Use '+' for one or more repetitions. Let's find strings that include 'el'¶

re.findall('.el+','She sells sea shells by the sea shore. She likes to sell them to elfs.')

['sell', 'hell', 'sell', ' el']

Let's require the strings to contain 'el' as the second and third letters and end in an 's'

re.findall('.el+s','She sells sea shells by the sea shore. She likes to sell them to elfs.')

['sells', 'hells']

Let's require the strings to start with 'e', contain at least 3 characters, and end in an 's'

re.findall('e.+s','She sells sea shells by the sea shore. She likes to sell them to elfs.')

['e sells sea shells by the sea shore. She likes to sell them to elfs']

Wait! That's almost everything! (1 big string)

Matching is "greedy" by default: starting from the first position where a match is possible, it returns the longest matching substring, and no other strings that would overlap it.

Adding '?' after a quantifier such as '+' changes it to "lazy" matching, which returns the shortest matching substring from each starting point.¶

re.findall('e.+?s','She sells sea shells by the sea shore. She likes to sell them to elfs.')

['e s',
 'ells',
 'ea s',
 'ells',
 'e s',
 'ea s',
 'e. She likes',
 'ell them to elfs']
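A shorter side-by-side comparison may make the difference easier to see. This sketch uses a trimmed version of the sentence and a pattern that starts and ends with a lowercase 's'.

```python
import re

text = 'She sells sea shells'

# Greedy: '.+' grabs as much as it can, then backs up to the last 's'.
greedy = re.findall('s.+s', text)

# Lazy: '.+?' grabs as little as it can, stopping at the first 's' it meets.
lazy = re.findall('s.+?s', text)

print(greedy)  # ['sells sea shells']
print(lazy)    # ['sells', 'sea s']
```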

Use the square brackets '[]' to enclose multiple options.¶

For example, let's find the shortest strings that start with 'e' and end in either 'l' or 's'.

re.findall('e.+?[ls]','She sells sea shells by the sea shore. She likes to sell them to elfs.')

['e s',
 'ell',
 'ea s',
 'ell',
 'e s',
 'ea s',
 'e. She l',
 'es to s',
 'ell',
 'em to el']

Now let's require the second letter to be 'e' and the second-to-last letter to be either 'l' or 's' by adding a '.' at the end

re.findall('.e[ls].','She sells sea shells by the sea shore. She likes to sell them to elfs.')

['sell', 'hell', 'kes ', 'sell', ' elf']

What does the following identify?

re.findall('.e[a-z].','She sells sea shells by the sea shore. She likes to sell them to elfs.')

['sell', 'sea ', 'hell', 'sea ', 'kes ', 'sell', 'hem ', ' elf']
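As a rough comparison of a short list of options against a range, the sketch below runs both kinds of bracket expression on a trimmed sentence. '[ls]' accepts only 'l' or 's', while '[a-z]' accepts any lowercase letter.

```python
import re

text = 'She sells sea shells.'

# Third character must be 'l' or 's'.
either = re.findall('.e[ls]', text)

# Third character may be any lowercase letter, so 'sea' now qualifies too.
any_lower = re.findall('.e[a-z]', text)

print(either)     # ['sel', 'hel']
print(any_lower)  # ['sel', 'sea', 'hel']
```

Neither pattern picks up 'he ' from 'She', because a space is not in either set.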

Let's use our knowledge to look for email addresses!¶

emailmatcher = r'[a-z]+@[a-z]+\.[a-z]+' # Since '.' matches any character, '\.' is used to indicate a literal period.

# Find one or more lowercase letters, then '@', then more lowercase letters, then '\.', then more lowercase letters

re.findall(emailmatcher,'My address is danet@buffalo.edu')

['danet@buffalo.edu']

What if there is a digit in the email address?

emailmatcher = r'[a-z]+@[a-z]+\.[a-z]+'
re.findall(emailmatcher,'My address is danet5@buffalo.edu')

[]

Spend a few minutes trying to modify the above to allow capital letters, numbers, and '_' before the '@' symbol

emailmatcher = 
re.findall(emailmatcher,'My addresses are danet5@buffalo.edu and danet5@Buffalo.com') 

#add '-' too

['danet5@buffalo.edu']
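One possible answer to the exercise, offered as a sketch rather than the only solution: widen the bracket before the '@' to cover capitals, digits, and '_'. The part after the '@' is left as lowercase only, which is why the 'Buffalo.com' address is still skipped. The hyphen variant and its test string are made up for illustration.

```python
import re

# Capitals, digits, and '_' are allowed before the '@' only.
emailmatcher = r'[A-Za-z0-9_]+@[a-z]+\.[a-z]+'
text = 'My addresses are danet5@buffalo.edu and danet5@Buffalo.com'
found = re.findall(emailmatcher, text)
print(found)  # ['danet5@buffalo.edu']

# To add '-', put it last inside the brackets so it is not read as a range.
with_hyphen = r'[A-Za-z0-9_-]+@[a-z]+\.[a-z]+'
found_hyphen = re.findall(with_hyphen, 'Write to dan-et5@buffalo.edu')
print(found_hyphen)  # ['dan-et5@buffalo.edu']
```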

The asterisk '*' after a character (or a bracketed set of characters) allows it to be repeated 0 or more times¶

s = 'clear color coooool'
myre = 'c[aeiou]*l'
re.findall(myre,s)

['cl', 'col', 'coooool']
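For contrast, here is the same search with '+' in place of '*'. Since '+' needs at least one vowel, the 'cl' in 'clear' should drop out.

```python
import re

s = 'clear color coooool'

# '*' allows zero vowels between 'c' and 'l'.
zero_or_more = re.findall('c[aeiou]*l', s)

# '+' requires at least one vowel between 'c' and 'l'.
one_or_more = re.findall('c[aeiou]+l', s)

print(zero_or_more)  # ['cl', 'col', 'coooool']
print(one_or_more)   # ['col', 'coooool']
```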

'\b' indicates a word boundary (the position between a word character and a non-word character), and allows you to identify the ends of words¶

Let's find edge, but not ledge, ledger, or edges.

s = 'edge, ledge, ledger, and edges'
myre = '\\bedge\\b'
re.findall(myre,s)

['edge']
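Two details about the pattern above are worth a quick check. In an ordinary Python string '\b' means backspace, which is why the backslash is doubled; a raw string (r'...') is an equivalent spelling. Also, the boundary itself takes up no characters in the match.

```python
import re

s = 'edge, ledge, ledger, and edges'

# The doubled-backslash form and the raw-string form are the same pattern.
assert '\\bedge\\b' == r'\bedge\b'

# The match spans exactly the 4 letters of 'edge'; the boundaries add nothing.
spans = [m.span() for m in re.finditer(r'\bedge\b', s)]
print(spans)  # [(0, 4)]
```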

Modify the code to identify:

(a) edge and edges, but not the others

(b) edge and ledge, but not the others

myre = 

re.findall(myre,s)

['edge', 'ledge']
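Here is one way these exercises might be approached, as a sketch only. It uses '?' directly after a letter, which makes that letter optional. The second pattern reproduces the output shown above.

```python
import re

s = 'edge, ledge, ledger, and edges'

# (a) an optional trailing 's' picks up 'edge' and 'edges'
pattern_a = r'\bedges?\b'
result_a = re.findall(pattern_a, s)
print(result_a)  # ['edge', 'edges']

# An optional leading 'l' picks up 'edge' and 'ledge'
pattern_b = r'\bl?edge\b'
result_b = re.findall(pattern_b, s)
print(result_b)  # ['edge', 'ledge']
```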

Find websites on a webpage by looking for strings:

'http:// some string . html'

s = 'blah <a href=http://foo.com/blah.html'
myre = r'http://.+\.html'
re.findall(myre,s)

['http://foo.com/blah.html']

If you don't want the 'http://' part, then use the parentheses '()' for the part you want

s = 'blah <a href=http://foo.com/blah.html'
myre = r'http://(.+\.html)'
re.findall(myre,s)

['foo.com/blah.html']
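If a pattern has more than one pair of parentheses, re.findall returns a tuple per match, with one entry per group. The two-group pattern below is a made-up variation that splits the host from the page name.

```python
import re

s = 'blah <a href=http://foo.com/blah.html'

# First group: everything up to the last '/'; second group: the page name.
parts = re.findall(r'http://(.+)/(.+\.html)', s)
print(parts)  # [('foo.com', 'blah.html')]
```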

For heavy-duty applications, it is faster to pre-compile the regex:

myre = re.compile(r'http://(.+\.html)')
s = 'blah <a href=http://foo.com/blah.html'
myre.findall(s)

['foo.com/blah.html']
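A compiled pattern pays off mostly when the same regex is applied to many strings. The sketch below reuses one compiled object across a small list of made-up page snippets. Note that the re module also keeps a cache of recently used patterns, so the gain for a handful of calls is likely modest.

```python
import re

link_re = re.compile(r'http://(.+\.html)')

# Hypothetical snippets, just to show the pattern being reused.
pages = [
    'blah <a href=http://foo.com/blah.html',
    'no link here',
    '<a href=http://bar.org/index.html',
]

results = [link_re.findall(page) for page in pages]
print(results)  # [['foo.com/blah.html'], [], ['bar.org/index.html']]
```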

Grab the full date 'Tue, Feb 13, 2017' from the string below

s = 'Today is Tue, Feb 13, 2017, I think.'


myre = 
re.findall(myre,s)

['Tue, Feb 13, 2017']
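One possible pattern for this exercise, given as a sketch: a capital letter plus two lowercase letters for the day, the same for the month, then a 1 or 2 digit day number and a 4 digit year. The '{2}' and '{4}' forms are the exact-count version of the '{1,3}' syntax used earlier.

```python
import re

s = 'Today is Tue, Feb 13, 2017, I think.'

# Day name, month name, day number, year, separated as in the sample string.
myre = r'[A-Z][a-z]{2}, [A-Z][a-z]{2} [0-9]{1,2}, [0-9]{4}'
found = re.findall(myre, s)
print(found)  # ['Tue, Feb 13, 2017']
```

This pattern only checks the shape of the text, so any three-letter capitalized "words" in those positions would pass.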

But this can pick up matching strings that make no sense:

s = 'Today is Nod, Boo 13, 2017, I think.'
re.findall(myre,s)

['Nod, Boo 13, 2017']

So let's restrict the possible strings to specific day and month names

myre = 'Mon|Tue'
s = 'Today is Tue, Feb 13, 2017, I think. Or is it Monday?'
re.findall(myre,s)

['Tue', 'Mon']

myre = '\\bMon\\b|\\bTue\\b'
s = 'Today is Tue, Feb 13, 2017, I think. Or is it Monday?'
re.findall(myre,s)

['Tue']

myre = '(?:Mon|Tue|Wed|Thu|Fri),'  # non-capturing group
s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday?'
re.findall(myre,s)

['Thu,']
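The '?:' matters because re.findall returns only the group contents when a pattern contains a capturing group. A quick comparison on the same string:

```python
import re

s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday?'

# Capturing group: findall returns just what is inside the parentheses.
capturing = re.findall('(Mon|Tue|Wed|Thu|Fri),', s)

# Non-capturing group: findall returns the whole match, comma included.
non_capturing = re.findall('(?:Mon|Tue|Wed|Thu|Fri),', s)

print(capturing)      # ['Thu']
print(non_capturing)  # ['Thu,']
```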

Combine this with our previous code to identify the full date in this string

s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday? No, today is Nod, Boo 13, 2017.'
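A possible way to combine the pieces, as a sketch under a few assumptions: the first day name in the string is taken to be 'Thu', the day list is extended with 'Sat' and 'Sun', and the month list is written out in full. None of these choices come from the exercise itself.

```python
import re

# Assumed lists of valid abbreviations, each in a non-capturing group.
days = '(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)'
months = '(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'

# Day name, month name, 1-2 digit day, 4 digit year.
myre = days + ', ' + months + ' [0-9]{1,2}, [0-9]{4}'

s = 'Today is Thu, Feb 13, 2017, I think. Or is it Monday? No, today is Nod, Boo 13, 2017.'
dates = re.findall(myre, s)
print(dates)  # ['Thu, Feb 13, 2017']
```

The nonsense date is rejected because 'Nod' and 'Boo' are not in either list, and 'Monday' is skipped because it is not followed by a comma and a month.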