Gender Bias in the U.S. Name Data

In [14]:
# Load the data (code from last class)

from glob import glob
from numpy import *

files = sorted(glob('names/yob*.txt'))
nyears = len(files)

def year(filename):
    # file names end in 'yobYYYY.txt'; pull out the four-digit year
    return int(filename[-8:-4])

firstyear = year(files[0])
d = {}                   # name -> 2 x nyears array of yearly counts
gd = {'F': 0, 'M': 1}    # row 0 holds female counts, row 1 male counts
for file in files:
    with open(file) as f:
        for line in f.read().split('\n'):
            if len(line) == 0: continue
            name, gender, count = line.split(',')
            count = int(count)
            if name not in d:
                d[name] = zeros((2, nyears), dtype=int)
            d[name][gd[gender], year(file) - firstyear] = count
d['Edward']
Out[14]:
array([[    0,     0,     5,     7,     9,     5,    11,    11,     9,
           12,    13,     6,     9,    10,    13,    13,     9,    10,
            9,     9,     8,     0,    12,    10,     7,    11,     9,
            9,     7,    12,    18,    11,    27,    42,    48,    53,
           43,    61,    63,    65,    64,    80,    93,    79,    97,
          106,   113,   112,   138,   132,   102,   101,    86,    60,
           66,    69,    64,    59,    57,    50,    45,    47,    70,
           55,    49,    44,    52,    55,    51,    50,    61,    43,
           54,    50,    44,    52,    69,    58,    65,    63,    70,
           67,    68,    65,    60,    85,    63,    61,    68,    69,
           62,    68,    78,    58,    57,    58,    50,    46,    53,
           66,    61,    46,    52,    33,    43,    44,    40,    40,
           40,    25,    19,    19,    18,    18,    18,     9,    10,
            5,     8,     0,    11,     8,     6,     5,    10,     0,
            0,     7,     0,     0,     0,     0,     0,     0,     5,
            0,     0],
       [ 2364,  2177,  2477,  2250,  2439,  2220,  2312,  2125,  2470,
         2299,  2282,  1989,  2416,  2309,  2179,  2203,  2296,  2121,
         2337,  1901,  2720,  1917,  2294,  2268,  2334,  2366,  2398,
         2576,  2707,  2935,  3408,  4164,  7936,  9474, 12318, 15889,
        17005, 17502, 19490, 18534, 20098, 20815, 20421, 20600, 21127,
        20093, 19375, 19114, 18487, 17225, 17345, 15644, 15184, 13791,
        13920, 13837, 14195, 14926, 14547, 14428, 14401, 15577, 17465,
        17712, 16473, 15868, 18578, 20550, 18977, 19171, 18717, 19912,
        19516, 18972, 19521, 19258, 19417, 18696, 17374, 16907, 16586,
        15877, 15523, 15316, 15604, 14417, 13276, 12661, 12292, 12460,
        12302, 11062,  9302,  8372,  7827,  7392,  7054,  6847,  6486,
         6878,  6762,  6657,  6291,  5997,  5912,  5869,  5776,  5902,
         5798,  5848,  5741,  5571,  5230,  4788,  4522,  4142,  4071,
         3913,  3575,  3593,  3480,  3372,  3240,  3108,  3147,  2970,
         2868,  2821,  2787,  2981,  2902,  2663,  2591,  2703,  2580,
         2592,  2491]])
In [15]:
#prepare for plotting
%pylab inline
Populating the interactive namespace from numpy and matplotlib
/Users/drt/anaconda3/lib/python3.6/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"

Let's plot the "gender bias" of the name Leslie over time, i.e. the ratio of its frequency among females to its frequency among males:

In [16]:
name='Leslie'
plot( range(firstyear,year(files[-1])+1),d[name][0]/d[name][1] ,'g');
title('Ratio of Females-to-Males for Name: '+ name)
xlabel('Year')
ylabel('Ratio')
Out[16]:
Text(0,0.5,'Ratio')

The plot above is not satisfactory: on a linear scale, the male-dominant part of the history (ratios between 0 and 1) is squashed into invisibility near the horizontal axis.

The picture becomes more symmetrical if we plot the ratio on a log scale:

In [17]:
semilogy( range(firstyear,year(files[-1])+1)   ,d[name][0]/d[name][1] ,'g');
title('Ratio of Females-to-Males for Name: '+ name)
xlabel('Year')
ylabel('Ratio (log scale)')
Out[17]:
Text(0,0.5,'Ratio (log scale)')

Now we can see the detail at both extremes!
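
To see why the log scale treats both extremes evenly, here is a minimal sketch with made-up counts (the arrays `female` and `male` below are hypothetical, not taken from the name data). A name that is 100 times more common among males and one that is 100 times more common among females end up equally far from the neutral ratio of 1 once we take the log:

```python
import numpy as np

# Hypothetical female and male counts for one name in three different years
female = np.array([1, 10, 100])
male = np.array([100, 10, 1])

ratio = female / male        # linear scale: male-dominant years are crammed into (0, 1)
log_ratio = np.log10(ratio)  # log scale: swapping female and male only flips the sign

print(ratio)
print(log_ratio)
```

On the linear scale the three ratios are 0.01, 1 and 100, so the first two are nearly indistinguishable next to the third. Their base-10 logs are -2, 0 and 2, which sit symmetrically around zero.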

Next, let's do it for all names in the database (the cell below is capped at the first 1000 names by max_count):

In [13]:
figure(figsize=(15,6))
max_count = 1000
count = 0
for name in d:
    if count < max_count and d[name][0].sum()>0 and d[name][1].sum()>0: # only names with at least one F and one M birth on record
        semilogy( range(firstyear,year(files[-1])+1), d[name][0]/d[name][1], 'g', alpha=0.1)
    count = count+1  # incremented for every name, so only the first max_count names in d are considered
title('Ratio of Females-to-Males for All Names')
xlabel('Year')
ylabel('Ratio (log scale)')
/Users/drt/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: divide by zero encountered in true_divide
  
/Users/drt/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in true_divide
  
Out[13]:
Text(0,0.5,'Ratio (log scale)')
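
The RuntimeWarnings above appear because many names have a male count of zero in some years, so the ratio is a division by zero (or 0/0). One possible way to avoid them, sketched here under the assumption that we are happy to leave those years out of the plot, is to compute the ratio only where both counts are positive and store NaN elsewhere. The helper name `safe_ratio` is hypothetical and not part of the notebook:

```python
import numpy as np

def safe_ratio(female, male):
    """Return female/male as floats, with NaN wherever either count is zero."""
    female = np.asarray(female, dtype=float)
    male = np.asarray(male, dtype=float)
    ratio = np.full(female.shape, np.nan)
    # A zero male count would divide by zero; a zero female count gives a
    # ratio of 0, which has no position on a log axis. Skip both cases.
    valid = (female > 0) & (male > 0)
    ratio[valid] = female[valid] / male[valid]
    return ratio

# Made-up counts for four years
counts_f = [0, 5, 20, 0]
counts_m = [10, 5, 0, 0]
print(safe_ratio(counts_f, counts_m))
```

Matplotlib does not draw NaN points, so passing `safe_ratio(d[name][0], d[name][1])` to `semilogy` should simply leave gaps in the curve for those years instead of triggering the warnings.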
In [ ]: