Advanced Text Mining

We first study the history of first names in the US; you will extend this analysis for Project 1.

Make sure to download and unzip the file

http://www.acsu.buffalo.edu/~danet/Sp18/MTH448/class3/class3_files/names.zip

This national data can also be obtained from

https://www.ssa.gov/oact/babynames/limits.html

Look at one file and notice that each line has the form name,gender,count.
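For example, one can print the first few lines of a single year's file to check this format (a minimal sketch; it assumes the zip was unpacked into class3_files/names/, matching the paths used below):

# peek at the first few lines of one year's file to confirm the name,gender,count format
# (assumes the zip was unpacked into class3_files/names/, as in the cells below)
f = open('class3_files/names/yob1880.txt')
for line in f.read().split('\n')[:5]:
    print(line)
f.close()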

Since the files follow a common naming pattern (yob<year>.txt), we can load them all with a short script using glob and the * wildcard.

In [14]:
from glob import glob
files = sorted( glob('class3_files/names/yob*.txt') )
files[:5]
Out[14]:
['class3_files/names/yob1880.txt',
 'class3_files/names/yob1881.txt',
 'class3_files/names/yob1882.txt',
 'class3_files/names/yob1883.txt',
 'class3_files/names/yob1884.txt']
In [15]:
from glob import glob
files = sorted( glob('class3_files/names/yob*.txt') )
number_names = []
count = 0
for file in files:
    f = open(file)
    s = f.read()
    number_names.append(len(s))    # len(s) is the number of characters in the file, a rough proxy for how many names it contains
    print(file,number_names[count])
    f.close()
    count += 1
    
class3_files/names/yob1880.txt 22933
class3_files/names/yob1881.txt 22130
class3_files/names/yob1882.txt 24432
class3_files/names/yob1883.txt 23918
class3_files/names/yob1884.txt 26373
class3_files/names/yob1885.txt 26331
class3_files/names/yob1886.txt 27430
class3_files/names/yob1887.txt 27158
class3_files/names/yob1888.txt 30413
class3_files/names/yob1889.txt 29707
class3_files/names/yob1890.txt 30926
class3_files/names/yob1891.txt 30526
class3_files/names/yob1892.txt 33621
class3_files/names/yob1893.txt 32602
class3_files/names/yob1894.txt 33876
class3_files/names/yob1895.txt 35183
class3_files/names/yob1896.txt 35656
class3_files/names/yob1897.txt 34908
class3_files/names/yob1898.txt 37660
class3_files/names/yob1899.txt 35099
class3_files/names/yob1900.txt 43129
class3_files/names/yob1901.txt 36431
class3_files/names/yob1902.txt 38922
class3_files/names/yob1903.txt 39290
class3_files/names/yob1904.txt 41248
class3_files/names/yob1905.txt 42348
class3_files/names/yob1906.txt 42212
class3_files/names/yob1907.txt 45881
class3_files/names/yob1908.txt 46823
class3_files/names/yob1909.txt 49274
class3_files/names/yob1910.txt 54081
class3_files/names/yob1911.txt 56894
class3_files/names/yob1912.txt 74378
class3_files/names/yob1913.txt 81737
class3_files/names/yob1914.txt 93567
class3_files/names/yob1915.txt 110085
class3_files/names/yob1916.txt 114180
class3_files/names/yob1917.txt 116880
class3_files/names/yob1918.txt 122678
class3_files/names/yob1919.txt 122343
class3_files/names/yob1920.txt 126988
class3_files/names/yob1921.txt 128324
class3_files/names/yob1922.txt 127389
class3_files/names/yob1923.txt 126014
class3_files/names/yob1924.txt 128740
class3_files/names/yob1925.txt 126164
class3_files/names/yob1926.txt 123843
class3_files/names/yob1927.txt 123498
class3_files/names/yob1928.txt 120548
class3_files/names/yob1929.txt 116586
class3_files/names/yob1930.txt 116132
class3_files/names/yob1931.txt 110296
class3_files/names/yob1932.txt 111409
class3_files/names/yob1933.txt 106863
class3_files/names/yob1934.txt 108888
class3_files/names/yob1935.txt 107266
class3_files/names/yob1936.txt 105432
class3_files/names/yob1937.txt 106395
class3_files/names/yob1938.txt 107306
class3_files/names/yob1939.txt 105900
class3_files/names/yob1940.txt 106691
class3_files/names/yob1941.txt 108135
class3_files/names/yob1942.txt 112224
class3_files/names/yob1943.txt 112088
class3_files/names/yob1944.txt 108983
class3_files/names/yob1945.txt 107501
class3_files/names/yob1946.txt 115740
class3_files/names/yob1947.txt 123693
class3_files/names/yob1948.txt 122178
class3_files/names/yob1949.txt 122545
class3_files/names/yob1950.txt 122946
class3_files/names/yob1951.txt 125018
class3_files/names/yob1952.txt 127059
class3_files/names/yob1953.txt 129236
class3_files/names/yob1954.txt 130716
class3_files/names/yob1955.txt 132456
class3_files/names/yob1956.txt 135174
class3_files/names/yob1957.txt 137576
class3_files/names/yob1958.txt 137175
class3_files/names/yob1959.txt 140219
class3_files/names/yob1960.txt 142147
class3_files/names/yob1961.txt 145407
class3_files/names/yob1962.txt 145729
class3_files/names/yob1963.txt 146607
class3_files/names/yob1964.txt 148144
class3_files/names/yob1965.txt 143048
class3_files/names/yob1966.txt 145236
class3_files/names/yob1967.txt 148253
class3_files/names/yob1968.txt 154485
class3_files/names/yob1969.txt 164234
class3_files/names/yob1970.txt 176810
class3_files/names/yob1971.txt 182845
class3_files/names/yob1972.txt 184059
class3_files/names/yob1973.txt 187304
class3_files/names/yob1974.txt 194203
class3_files/names/yob1975.txt 202218
class3_files/names/yob1976.txt 207554
class3_files/names/yob1977.txt 216830
class3_files/names/yob1978.txt 217718
class3_files/names/yob1979.txt 227448
class3_files/names/yob1980.txt 231875
class3_files/names/yob1981.txt 232375
class3_files/names/yob1982.txt 235104
class3_files/names/yob1983.txt 231499
class3_files/names/yob1984.txt 233195
class3_files/names/yob1985.txt 240190
class3_files/names/yob1986.txt 247319
class3_files/names/yob1987.txt 256946
class3_files/names/yob1988.txt 268648
class3_files/names/yob1989.txt 285743
class3_files/names/yob1990.txt 297345
class3_files/names/yob1991.txt 302058
class3_files/names/yob1992.txt 306076
class3_files/names/yob1993.txt 311910
class3_files/names/yob1994.txt 312204
class3_files/names/yob1995.txt 313055
class3_files/names/yob1996.txt 316642
class3_files/names/yob1997.txt 323119
class3_files/names/yob1998.txt 333650
class3_files/names/yob1999.txt 340796
class3_files/names/yob2000.txt 355234
class3_files/names/yob2001.txt 360803
class3_files/names/yob2002.txt 364204
class3_files/names/yob2003.txt 371373
class3_files/names/yob2004.txt 381412
class3_files/names/yob2005.txt 387591
class3_files/names/yob2006.txt 406174
class3_files/names/yob2007.txt 416261
class3_files/names/yob2008.txt 417613
class3_files/names/yob2009.txt 413181
class3_files/names/yob2010.txt 405415
class3_files/names/yob2011.txt 403051
class3_files/names/yob2012.txt 400908
class3_files/names/yob2013.txt 395142
class3_files/names/yob2014.txt 394396
class3_files/names/yob2015.txt 392318
class3_files/names/yob2016.txt 389747
In [16]:
%pylab inline

plt.plot(number_names)
Populating the interactive namespace from numpy and matplotlib
/Users/drt/anaconda3/lib/python3.6/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"
Out[16]:
[<matplotlib.lines.Line2D at 0x10f8b5860>]
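Note that len(s) is the number of characters in each file, so the plot above really tracks file size. If you want the actual number of name,gender,count records per year, one rough sketch (same directory layout as above) is to count non-empty lines instead:

# count the number of name records (non-empty lines) per year instead of characters
from glob import glob
files = sorted( glob('class3_files/names/yob*.txt') )
records_per_year = []
for file in files:
    f = open(file)
    lines = f.read().split('\n')
    records_per_year.append( sum(1 for line in lines if len(line) > 0) )
    f.close()
plt.plot(records_per_year)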

Read all the data into memory

In [17]:
from glob import glob
from numpy import *
files = sorted( glob('class3_files/names/yob*.txt') )
nyears = len(files)
def year(filename): return int(filename[-8:-4])
firstyear = year(files[0])   # 1880

d = {}              # dictionary mapping each name to a 2 x nyears array of counts
gd = {'F':0,'M':1}  # map gender code to row index: row 0 = female, row 1 = male
for file in files:
    f = open(file) 
    lines = f.read().split('\n') # separate the long string into a list
    for line in lines:
        if len(line)==0: 
            continue # ignore empty lines

        # each line has the form 'name,gender,count', so split it into variables
        name,gender,count = line.split(',') # the delimiter is ','
        count = int(count) # convert the count string to an int: 1,2,3,4...
        
        # if it's a new name, add it to the dictionary
        if name not in d:
            d[name] = zeros((2,nyears),dtype=int)

        # record the count for this name, gender and year
        d[name][ gd[gender], year(file)-firstyear] = count
    f.close()
    
#d['Edward']
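Each value in d is a 2 x nyears integer array, with row 0 holding female counts and row 1 holding male counts (matching gd). A small usage sketch, just to show how to query it:

# total number of recorded babies named Edward, by gender, over all years
name = 'Edward'
female_total, male_total = d[name].sum(axis=1)   # row 0 = female, row 1 = male
print(name, '- female total:', female_total, ', male total:', male_total)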
In [18]:
#plot name popularity for the name Edward, for both genders

name = 'Edward'
plot( range(firstyear,year(files[-1])+1)   ,d[name][1] ,'b')  # males
plot( range(firstyear,year(files[-1])+1)   ,d[name][0] ,'r'); # females

The gender predominance of some names has flipped over time; 'Leslie' is a good example.

In [19]:
#plot name popularity for the name Leslie, for both genders

name = 'Leslie'
plot( range(firstyear,year(files[-1])+1)   ,d[name][0] ,'r')  # females
plot( range(firstyear,year(files[-1])+1)   ,d[name][1] ,'b'); # males

For Project 1, extend this analysis in some way; one possible direction is sketched below.
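For example, here is a rough sketch of how one might detect names whose predominant gender flipped between the first and last ten years of the data. It assumes the dictionary d built above; the ten-year windows and the minimum count of 100 are arbitrary choices, not part of the course material.

# find names whose predominant gender differs between the first and last decade of the data
flipped = []
for name, counts in d.items():
    early = counts[:, :10].sum(axis=1)    # [female, male] totals over the first ten years
    late  = counts[:, -10:].sum(axis=1)   # [female, male] totals over the last ten years
    if early.sum() < 100 or late.sum() < 100:
        continue                          # skip names too rare to call either way
    if (early[0] > early[1]) != (late[0] > late[1]):
        flipped.append(name)
print(len(flipped), 'candidate names, e.g.', flipped[:10])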

Exercise 3: Political Transcripts

The goal is to study and compare word usage by the two presidential candidates, Clinton and Trump.

Step 0: Download the file


http://www.acsu.buffalo.edu/~danet/Sp18/MTH448/class2/class2_files/political_transcript.txt


Step 1: Open file and preprocess text


In [20]:
f = open("class3_files/The_first_Trump-Clinton_presidential_debate_transcript_annotated.txt")
s = f.read()
f.close()

# we want to remove the following characters and strings
punc = ',.;:!?"'
otherbadwords = ['--','(APPLAUSE)','(inaudible)','(LAUGHTER)','(CROSSTALK)']

# first remove punctuation
for p in punc: s = s.replace(p,'')

# next remove formatting words that aren't helpful
for w in otherbadwords: s = s.replace(w,'')
    
# identify the 3 speakers: Holt is the debate moderator
speakers = ['HOLT','CLINTON','TRUMP']

# restore colons after speaker names since these were just removed
for sp in speakers: s = s.replace(sp,sp+':')  

#make all lower case
s = s.lower()

tags = [ sp.lower()+':' for sp in speakers ]
print('These strings identify who is speaking:')
print(tags)

# the text is one long string, break it up into a list of words for easier analysis
words = s.split()
#words[:50]
These strings identify who is speaking:
['holt:', 'clinton:', 'trump:']

Step 2: Separate words into 3 lists, 1 per speaker

In [21]:
h = []#words used by holt
c = []#words used by clinton
t = []#words used by trump

# let 'current' point to the current speaker's word list, and update it as we go through the text

for w in words:  # consider each word
    if  w == tags[0]: # i.e., if w is 'holt:'
        current = h 
    elif w == tags[1]: # i.e., if w is 'clinton:'
        current = c
    elif w == tags[2]: # i.e., if w is 'trump:'
        current = t

    # if the word is not a speaker tag, add it to the current speaker's word list
    else: current.append(w) 
h[:5],c[:5],t[:5]        
Out[21]:
(['good', 'evening', 'from', 'hofstra', 'university'],
 ['how', 'are', 'you', 'donald', 'well'],
 ['thank', 'you', 'lester', 'our', 'jobs'])
In [22]:
print('The total number of words spoken by Holt, Clinton and Trump are')#repeats are counted 
len(h),len(c),len(t)
The total number of words spoken by Holt, Clinton and Trump are
Out[22]:
(1939, 6342, 8562)
In [23]:
len(set(h)),len(set(c)),len(set(t))  # counts of distinct words spoken by each
Out[23]:
(563, 1385, 1291)

Step 3: Compare the speakers' usage of each word

Consider all the words, and compute how many times each speaker uses each word.

In [24]:
d = {} # dictionary mapping each word to its [holt, clinton, trump] usage counts

for w in h:#holt
    if w not in d: 
        d[w] = [1,0,0]
    else:          
        d[w][0] += 1
        
        
for w in c:#clinton
    if w not in d: 
        d[w] = [0,1,0]
    else:          
        d[w][1] += 1
        
        
for w in t:#trump
    if w not in d: 
        d[w] = [0,0,1]
    else:          
        d[w][2] += 1

d['you']
Out[24]:
[65, 76, 206]
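As an aside, the same word-usage profiles can be built more compactly with collections.Counter. This is just an alternative sketch (it assumes the h, c, t lists from Step 2), not the approach used in the rest of the notebook:

# alternative construction of the word-usage profiles using Counter
from collections import Counter
ch, cc, ct = Counter(h), Counter(c), Counter(t)
d2 = { w: [ch[w], cc[w], ct[w]] for w in set(h) | set(c) | set(t) }
print(d2['you'])   # should match d['you'] above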

Now we want to sort the words by speaker usage. By default, sorted sorts the (word, counts) pairs alphabetically by word.

In [25]:
dl = list(d.items())
dl[:15]

#sorted(list(d.items()))
sorted( list(d.items()) )[:20]


#sorted( list(d.items()),reverse=True )[:20]
Out[25]:
[('$13', [0, 1, 1]),
 ('$14', [0, 1, 0]),
 ('$150', [0, 0, 1]),
 ('$17', [0, 0, 1]),
 ('$20', [0, 0, 6]),
 ('$200', [0, 0, 1]),
 ('$25', [0, 0, 2]),
 ('$39', [0, 0, 1]),
 ('$4', [0, 1, 0]),
 ('$400', [0, 0, 1]),
 ('$5', [0, 3, 1]),
 ('$6', [0, 0, 2]),
 ('$650', [0, 1, 3]),
 ('$694', [0, 0, 2]),
 ('$800', [0, 0, 1]),
 ("'04", [0, 0, 1]),
 ("'13", [1, 0, 0]),
 ("'14", [1, 0, 0]),
 ("'15", [1, 0, 0]),
 ('10', [1, 2, 4])]

Note: The lambda construct

It requires less code, since it lets you define a small throwaway function inline without a separate def.

In [26]:
def f(x):
    return x*2
print(f(7))

f2 = lambda x: x*2
print(f2(7))
14
14
In [27]:
f('ho')
Out[27]:
'hoho'
In [28]:
def f(x):
    return x[1][2]

sorted(dl,key=f,reverse=True)[:10]  # sort by Trump frequency


#less code using lambda construction
sorted(dl,key=lambda x:x[1][2],reverse=True)[:10]  # sort by Trump frequency
Out[28]:
[('the', [95, 253, 295]),
 ('and', [44, 206, 289]),
 ('to', [83, 240, 258]),
 ('i', [16, 141, 240]),
 ('you', [65, 76, 206]),
 ('a', [27, 122, 172]),
 ('of', [39, 135, 171]),
 ('that', [22, 147, 167]),
 ('have', [27, 84, 147]),
 ('we', [22, 131, 127])]
In [29]:
sorted(dl,key=lambda x:x[1][1],reverse=True)[:10]  # sort by Clinton frequency
Out[29]:
[('the', [95, 253, 295]),
 ('to', [83, 240, 258]),
 ('and', [44, 206, 289]),
 ('that', [22, 147, 167]),
 ('i', [16, 141, 240]),
 ('of', [39, 135, 171]),
 ('we', [22, 131, 127]),
 ('a', [27, 122, 172]),
 ('in', [24, 104, 110]),
 ('have', [27, 84, 147])]

Let's visualize some of the results.

In [30]:
x = [w[1][1] for w in dl]  # Clinton 
y = [w[1][2] for w in dl]  # Trump

#scatter plot
plot(x,y,'o',alpha=.4)  # alpha controls transparency: 0 = fully transparent, 1 = opaque
Out[30]:
[<matplotlib.lines.Line2D at 0x110b9d860>]
In [31]:
plot(x,y,'.',alpha=.5)

#zoom in 
xlim(0,45)
ylim(0,45)
Out[31]:
(0, 45)
In [32]:
# words used by Trump far more often than by Clinton
# (the +1 in the denominator avoids division by zero and damps very rare words)
trumphi = sorted(dl,key=lambda x:x[1][2]/(x[1][1]+1),reverse=True)[:10]
trumphi
Out[32]:
[('clinton', [24, 0, 22]),
 ('leaving', [0, 0, 15]),
 ('agree', [0, 0, 14]),
 ('wrong', [0, 0, 13]),
 ("i'll", [2, 0, 12]),
 ('tremendous', [0, 0, 11]),
 ('politicians', [0, 0, 10]),
 ('she', [2, 3, 33]),
 ("they're", [0, 4, 41]),
 ('hillary', [3, 0, 8])]
In [33]:
# words used by Clinton far more often than by Trump
clintonhi = sorted(dl,key=lambda x:x[1][1]/(x[1][2]+1),reverse=True)[:10]
clintonhi
Out[33]:
[('donald', [3, 26, 1]),
 ('american', [8, 11, 0]),
 ('information', [0, 9, 0]),
 ('proposed', [0, 7, 0]),
 ('justice', [0, 7, 0]),
 ('everyone', [2, 6, 0]),
 ('national', [0, 6, 0]),
 ('part', [0, 6, 0]),
 ('both', [2, 5, 0]),
 ('incomes', [1, 5, 0])]
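
One further refinement, sketched here as a suggestion rather than part of the original analysis, is to require a minimum combined usage before ranking, so that words spoken only a handful of times do not dominate the lists (the threshold of 10 is an arbitrary choice):

# keep only words Clinton and Trump used at least 10 times combined,
# then rank by the same usage ratios as above
common = [ item for item in dl if item[1][1] + item[1][2] >= 10 ]
trumphi10   = sorted(common, key=lambda x: x[1][2]/(x[1][1]+1), reverse=True)[:10]
clintonhi10 = sorted(common, key=lambda x: x[1][1]/(x[1][2]+1), reverse=True)[:10]
print([ w for w, counts in trumphi10 ])
print([ w for w, counts in clintonhi10 ])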