Quiz 1 - Regular Expressions

Thursday, Feb 15, 2016

In all your answers, explain what you are thinking so that I can give you partial credit in case you don't get it exactly right. In questions 2, 3, and 4, be careful to use a backslash to “escape” characters that you mean literally but that have special meanings in regex.

1. What is the output of the following and why?

In [1]:
import re 
some_string = '3009 39 11a493 9 -1.23ax9 -663.9 345.069 20'
myre = '3.{1,2}9'
re.findall(myre,some_string)
Out[1]:
['3009', '3 9', '3ax9', '3.9']

Comments:

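It returns substrings that start with 3 and end in 9, with one or two characters of any kind (except a newline) in between, so each match is 3 or 4 characters long. A bare '39' does not match because nothing sits between the 3 and the 9.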
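The length claim above can be checked directly. This is a small sketch that reuses the quiz string; the extra '39' test string is added here only for illustration.

```python
import re

some_string = '3009 39 11a493 9 -1.23ax9 -663.9 345.069 20'

# '3', then one or two of any character except a newline, then '9'.
matches = re.findall(r'3.{1,2}9', some_string)
lengths = [len(m) for m in matches]
print(matches)  # ['3009', '3 9', '3ax9', '3.9']
print(lengths)  # [4, 3, 4, 3]

# A bare '39' has nothing between the 3 and the 9, so it is not matched.
bare = re.findall(r'3.{1,2}9', '39')
print(bare)     # []
```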

2. Compose a regular expression (myre = ) that will return words (i.e., not partial words or strings) that satisfy the following 3 criteria: (i) starts with either 's' or 'a'; (ii) ends with 't'; and (iii) contains 2 or more letters. Explain your answer. Your answer should work for the following code:

In [2]:
myre = '\\b[sa][a-z]*t\\b'
some_string = 'It should find only act, at, and sat but not cat or saturn.'
re.findall(myre,some_string)
Out[2]:
['act', 'at', 'sat']

Comments:

  • [sa] ensures the word starts with s or a
  • [a-z]* allows any number of lowercase letters in between, so the match contains only letters and is a single word
  • t followed by \b ensures the word ends in 't'
  • Using + instead of * misses the word 'at'
  • Leaving out the \b is incorrect since the regex then picks up partial words, e.g. 'skat' from 'skater'.
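The effect of each choice in the comments above can be seen by running the variants side by side. This is a sketch on the quiz string; the word 'skater' is used only as an extra illustration.

```python
import re

some_string = 'It should find only act, at, and sat but not cat or saturn.'

# The answer as given: word boundaries on both sides, * for the middle.
with_boundaries = re.findall(r'\b[sa][a-z]*t\b', some_string)

# Variant 1: + requires at least one middle letter, so 'at' is lost.
with_plus = re.findall(r'\b[sa][a-z]+t\b', some_string)

# Variant 2: no \b, so fragments of 'cat' and 'saturn' are returned too.
without_boundaries = re.findall(r'[sa][a-z]*t', some_string)

print(with_boundaries)     # ['act', 'at', 'sat']
print(with_plus)           # ['act', 'sat']
print(without_boundaries)  # ['act', 'at', 'sat', 'at', 'sat']

# Without \b, a partial word is pulled out of 'skater'.
partial = re.findall(r'[sa][a-z]*t', 'skater')
print(partial)             # ['skat']
```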

3. Compose a regular expression that will match any of the following: 'Spring', 'Fall', 'Summer', 'Winter', but not 'Christmas' or 'Thisword'.

In [7]:
some_string = 'My favorite season is Summer, but Winter is good too. Christmas is fun.'
myre = '\\b[A-Z][a-z]{2,4}[glr]\\b'
re.findall(myre, some_string)
Out[7]:
['Summer', 'Winter']
In [8]:
myre = '(\\bSpring\\b|\\bSummer\\b|\\bFall\\b|\\bWinter\\b)'
re.findall( myre, some_string )
Out[8]:
['Summer', 'Winter']

Comments:

There are many possibilities here. The second option above is probably better: if the intent is to match only the seasons, we might as well list them, since there are only 4. The first option matches by shape (a capital letter, 2 to 4 lowercase letters, then g, l, or r), so it also matches words that are not seasons, such as 'Later' or 'Bring'.
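To compare the two options on text that contains non-season words, here is a short sketch. The test sentence is made up for illustration, and the list pattern is written with a non-capturing group (?:...), which is an equivalent way to spell the alternation.

```python
import re

pattern_shape = r'\b[A-Z][a-z]{2,4}[glr]\b'
pattern_list = r'\b(?:Spring|Summer|Fall|Winter)\b'

# Hypothetical test sentence, not part of the quiz.
text = 'Later in Spring we will Bring Peter to the Fall fair.'

by_shape = re.findall(pattern_shape, text)
by_list = re.findall(pattern_list, text)
print(by_shape)  # ['Later', 'Spring', 'Bring', 'Peter', 'Fall']
print(by_list)   # ['Spring', 'Fall']

# Both options reject the two words named in the question.
rejected = 'Christmas Thisword'
print(re.findall(pattern_shape, rejected))  # []
print(re.findall(pattern_list, rejected))   # []
```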

4. Using parentheses '()' to form a capturing group, compose a regular expression that will extract and return the name of an image (including .jpg) from the following HTML code:

<img src="picture1.jpg" alt="Mountain View">
Here is some random sentence.
<img src="picture2.jpg" width=400>
In [5]:
some_string = '<img src="picture1.jpg" alt="Mountain View">\nHere is some random sentence.\n<img src="picture2.jpg" width=400>'
myre = '<img src="(.+?)"'
re.findall(myre,some_string)
Out[5]:
['picture1.jpg', 'picture2.jpg']

Comments:

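  • The regex must start with <img src="
  • You can use * or + here.
  • The parentheses should enclose only .*? or .+? so that just the file name is returned
  • You must use the ? to make the match non-greedy, so it stops at the first closing quote and returns the shortest string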
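The role of the ? is easiest to see by running the same pattern with and without it. This is a sketch on the quiz string; the negated-class variant at the end is a common alternative shown only for comparison, not part of the expected answer.

```python
import re

some_string = ('<img src="picture1.jpg" alt="Mountain View">\n'
               'Here is some random sentence.\n'
               '<img src="picture2.jpg" width=400>')

# Non-greedy: the group stops at the first closing quote.
lazy = re.findall(r'<img src="(.+?)"', some_string)

# Greedy: the group runs to the last quote on the same line.
greedy = re.findall(r'<img src="(.+)"', some_string)

# Alternative: match anything that is not a quote.
negated = re.findall(r'<img src="([^"]+)"', some_string)

print(lazy)     # ['picture1.jpg', 'picture2.jpg']
print(greedy)   # ['picture1.jpg" alt="Mountain View', 'picture2.jpg']
print(negated)  # ['picture1.jpg', 'picture2.jpg']
```

The greedy version only goes wrong on the first tag because that line contains more than one pair of quotes; the second tag has a single quoted attribute, so both versions agree there.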