Intro to Pandas 1

https://pandas.pydata.org/

In [13]:
import pandas

Download the following datasets:

airport data (see here for info)

fruit sales data

SPARCS data (see here for info)

and download this script

Ways to load data into pandas

Load from clip board

In [15]:
df = pandas.read_clipboard()
df
Out[15]:
import pandas

Load from csv

In [18]:
ddf = pandas.read_csv('airports.csv')
ddf.head()
Out[18]:
id ident type name latitude_deg longitude_deg elevation_ft continent iso_country iso_region municipality scheduled_service gps_code iata_code local_code home_link wikipedia_link keywords
0 6523 00A heliport Total Rf Heliport 40.070801 -74.933601 11.0 NaN US US-PA Bensalem no 00A NaN 00A NaN NaN NaN
1 323361 00AA small_airport Aero B Ranch Airport 38.704022 -101.473911 3435.0 NaN US US-KS Leoti no 00AA NaN 00AA NaN NaN NaN
2 6524 00AK small_airport Lowell Field 59.949200 -151.695999 450.0 NaN US US-AK Anchor Point no 00AK NaN 00AK NaN NaN NaN
3 6525 00AL small_airport Epps Airpark 34.864799 -86.770302 820.0 NaN US US-AL Harvest no 00AL NaN 00AL NaN NaN NaN
4 6526 00AR closed Newport Hospital & Clinic Heliport 35.608700 -91.254898 237.0 NaN US US-AR Newport no NaN NaN NaN NaN NaN 00AR

From an excel file

Download foo excel spreadsheet from here

df = pandas.read_excel('foo.xlsx') df

fillna()

Exercise - fix airport spreadsheet by reaplacing NA Exercise: fix the airports.csv dataframe by replacing all the nulls by "NA". (Pandas interprets NA for North America as "Not Available" :-/ )

In [19]:
ddf = ddf.fillna('NA')
ddf.head()
Out[19]:
id ident type name latitude_deg longitude_deg elevation_ft continent iso_country iso_region municipality scheduled_service gps_code iata_code local_code home_link wikipedia_link keywords
0 6523 00A heliport Total Rf Heliport 40.070801 -74.933601 11 NA US US-PA Bensalem no 00A NA 00A NA NA NA
1 323361 00AA small_airport Aero B Ranch Airport 38.704022 -101.473911 3435 NA US US-KS Leoti no 00AA NA 00AA NA NA NA
2 6524 00AK small_airport Lowell Field 59.949200 -151.695999 450 NA US US-AK Anchor Point no 00AK NA 00AK NA NA NA
3 6525 00AL small_airport Epps Airpark 34.864799 -86.770302 820 NA US US-AL Harvest no 00AL NA 00AL NA NA NA
4 6526 00AR closed Newport Hospital & Clinic Heliport 35.608700 -91.254898 237 NA US US-AR Newport no NA NA NA NA NA 00AR

Load from a Numpy array

In [21]:
from numpy import *
a = vander([3,4,5,6,7])
df = pandas.DataFrame(a)
df
Out[21]:
0 1 2 3 4
0 81 27 9 3 1
1 256 64 16 4 1
2 625 125 25 5 1
3 1296 216 36 6 1
4 2401 343 49 7 1

Columns and index

On construction of a dataframe, pandas will provide labels for the rows and columns, as seen above.

But we can change them if we like:

In [22]:
df.columns=['aa','b','c','d','z']
df
Out[22]:
aa b c d z
0 81 27 9 3 1
1 256 64 16 4 1
2 625 125 25 5 1
3 1296 216 36 6 1
4 2401 343 49 7 1

(and df.index = whatever, for the rows).

Accessing columns

Access a column like this:

In [23]:
df['aa']
Out[23]:
0      81
1     256
2     625
3    1296
4    2401
Name: aa, dtype: int64

Or a subset of the columns:

In [24]:
df[['aa','z']]
Out[24]:
aa z
0 81 1
1 256 1
2 625 1
3 1296 1
4 2401 1

The index of the dataframe labels the rows:

In [25]:
df.index
Out[25]:
RangeIndex(start=0, stop=5, step=1)
In [26]:
df.index=['Maggie','Edward','Sanjeevani','Michael','Robert']
df
Out[26]:
aa b c d z
Maggie 81 27 9 3 1
Edward 256 64 16 4 1
Sanjeevani 625 125 25 5 1
Michael 1296 216 36 6 1
Robert 2401 343 49 7 1

Accessing rows: loc, iloc, ix

loc, iloc, ix are three accessors

loc provides access by labels:

In [28]:
df.loc['Sanjeevani']
Out[28]:
aa    625
b     125
c      25
d       5
z       1
Name: Sanjeevani, dtype: int64

iloc provides access by row number:

In [29]:
df.iloc[2]
Out[29]:
aa    625
b     125
c      25
d       5
z       1
Name: Sanjeevani, dtype: int64

Labels could be integers:

In [30]:
df.index=['Maggie',4,'Sanjeevani','Michael','Robert']
df
Out[30]:
aa b c d z
Maggie 81 27 9 3 1
4 256 64 16 4 1
Sanjeevani 625 125 25 5 1
Michael 1296 216 36 6 1
Robert 2401 343 49 7 1

Then

In [31]:
df.loc[4]  # gives row with label 4
Out[31]:
aa    256
b      64
c      16
d       4
z       1
Name: 4, dtype: int64

ix provides access by either label or row number. If a row has an integer label, i, and we ask ix for row i, do we get row with label i? Or row number i?

In [32]:
df.ix[4]
/Users/drt/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.
Out[32]:
aa    2401
b      343
c       49
d        7
z        1
Name: Robert, dtype: int64

Answer row number i.

These indexers also support slicing. Beware that unlike every other start:stop slicing in Python, "stop" is included:

In [33]:
df.loc['Maggie':'Michael']
Out[33]:
aa b c d z
Maggie 81 27 9 3 1
4 256 64 16 4 1
Sanjeevani 625 125 25 5 1
Michael 1296 216 36 6 1

Examples of Pandas

fruit_sales

airports

SPARC