
Pandas Essentials Cheat Sheet (DRAFT)

Everything you need to know about the Pandas library

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Introduction

Pandas is a package built on top of NumPy that provides efficient implementations of many features:
- DataFrames
- Series
- Data Alignment
- Handling Missing Data
- Grouping and Aggregation
- Data Input and Output
- Handling Time Series
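A minimal setup sketch of the two core structures (the values and labels below are illustrative, not from the sheet):
import numpy as np
import pandas as pd
ser = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])  # 1-D labelled array
df = pd.DataFrame({'x': ser, 'y': ser * 2})  # 2-D table of aligned columns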

Pandas General Methods

Accessing values
pd_ds.values
DataFrame.values
Series.values
Accessing Indices
pd_ds.index
DataFrame.index
Series.index
Accessing specific element
pd_ds[idx]
DataFrame[1]
Series[1]
Accessing range of elements
pd_ds[start : end]
DataFrame[1:4]
Series[2:5]
Implicit Indexing
df.iloc[rows, cols]
data.iloc[1:3]
# last index is not included
Explicit Indexing
df.loc[rows, cols]
data.loc['California' : 'Texas']
# last index is included
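A short sketch contrasting implicit (iloc) and explicit (loc) indexing; the Series below is made up for illustration:
import pandas as pd
data = pd.Series([10, 20, 30], index=['California', 'Texas', 'Florida'])
data.iloc[0:2]  # positional slice: end index excluded
data.loc['California':'Texas']  # label slice: end label included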

Pandas Series

Creating Series with lists
pd.Series([values], index=list)
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
Creating Series with dictionaries
pd.Series({index: value})
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
Slicing Series
Series[from_idx : to_idx]
population['Texas':'Florida']
Slicing Indices with Dictionary Series
pd.Series({index: value}, index=[])
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
# only returns the entries for index labels 3 and 2, respectively
A Pandas Series index can be a list of strings or a list of integers (or any desired type), unlike NumPy arrays.
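A small sketch of a Series with an explicit string index (the values are illustrative):
import pandas as pd
grades = pd.Series([90, 85, 70], index=['math', 'physics', 'history'])
grades['physics']  # access by explicit label
grades.iloc[1]  # access by implicit integer position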

Pandas DataFrames

Creating DataFrame
pd.DataFrame({col_name: iterable})
pd.DataFrame({'population': population, 'area': area})
Adding Column names
pd.DataFrame(dict, columns=[list_of_col_names])
pd.DataFrame(population, columns=['population'])
Specifying Columns and Index at Creation
pd.DataFrame(dict, columns=[], index=[])
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
Reading CSV Files
pd.read_csv(source, index_col=col)
pd.read_csv("data/president_heights.csv", index_col="order")
Saving DF to CSV
dataframe.to_csv(destination)
df.to_csv("data/president_heights_copy.csv")
Reading Excel Files
pd.read_excel(source)
pd.read_excel("data/president_heights.xlsx")
Saving DF to Excel
dataframe.to_excel(destination)
df.to_excel("data/president_heights_copy.xlsx")
Access DataFrame Columns
dataframe.columns
df.columns
Transposing DataFrames
dataframe.T
df.T
Subsetting Using loc
dataframe.loc[condition, cols]
data.loc[data.density > 100, ['pop', 'density']]
Masking
dataframe[mask]
data[data.density > 100]
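A minimal masking sketch (the DataFrame below is made up for illustration):
import pandas as pd
data = pd.DataFrame({'pop': [38332521, 26448193], 'density': [239.0, 38.0]}, index=['California', 'Texas'])
data[data.density > 100]  # boolean mask keeps matching rows
data.loc[data.density > 100, ['pop', 'density']]  # mask rows and select columns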

Pandas Index

Creating Index
pd.Index(list)
pd.Index([2, 3, 5, 7, 11])
Accessing Index
Index[idx]
ind[1]
Slicing Index
Index[from : to : step]
ind[ : : 2]
Intersection Between Indices
index_1.intersection(index_2)
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB)
Union Between Indices
index_1.union(index_2)
indA.union(indB)
Symmetric Difference
index_1.symmetric_difference(index_2)
indA.symmetric_difference(indB)
The Index object has many of the attributes familiar from NumPy arrays, such as:
ind.size, ind.shape, ind.ndim, ind.dtype
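A quick sketch of the set operations above and their results (the indices are illustrative):
import pandas as pd
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB)  # Index([3, 5, 7])
indA.union(indB)  # Index([1, 2, 3, 5, 7, 9, 11])
indA.symmetric_difference(indB)  # Index([1, 2, 9, 11])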

Pandas Universal Functions

+
add()
-
sub(), subtract()
*
mul(), multiply()
/
truediv(), div(), divide()
//
floordiv()
%
mod()
**
pow()
These universal functions are used in the following forms:
- data_struct.uf(data_struct_2)
- data_struct.uf()
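A sketch of index alignment when using these methods (the Series are illustrative):
import pandas as pd
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A.add(B)  # NaN where the indices do not overlap
A.add(B, fill_value=0)  # treat missing entries as 0 before adding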

Datatype Conversions (NaN or None)

Float
No change
Object
No change
Integer
Upcast to float64
Boolean
Upcast to object
These are the data type conversions that occur when missing values are introduced.
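A short illustration of the upcasting rules above:
import numpy as np
import pandas as pd
pd.Series([1, 2, 3]).dtype  # int64: no missing values
pd.Series([1, np.nan, 3]).dtype  # float64: integers are upcast when NaN appears
pd.Series([True, None, False]).dtype  # object: booleans are upcast when None appears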

Operating On Missing Values

Nullability Check
data_struct.isnull()
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
Non-Nullability Check
data_struct.notnull()
data.notnull()
Slicing Non-Null Values
data_struct[data_struct.notnull()]
data[data.notnull()]
Dropping Null Values
data_struct.dropna(axis=0/1, how='any'/'all', thresh=n)
data.dropna(axis=0, thresh=2)
# thresh=2 means each kept row must have at least 2 non-null values
Filling Missing Values
data_struct.fillna(value, method='ffill'/'bfill', axis=0/1)
df.fillna(method='ffill', axis=1)
Filling Using A Function (Interpolation)
data_struct.interpolate(method='linear'/'polynomial'...)
df.interpolate()
# the method is linear by default
When working with missing-value methods, axis=0 means rows and axis=1 means columns.
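A minimal sketch of dropping and filling nulls (the DataFrame is made up):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})
df.dropna()  # drop rows containing any null
df.dropna(axis=1, how='all')  # drop columns that are entirely null
df.fillna(0)  # replace nulls with a constant
df.fillna(method='ffill')  # propagate the last valid value forward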

Pandas Multi-Indexing

Creating Multi-Index From Tuples
pd.MultiIndex.from_tuples(list_of_tuples)
index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010)])
Creating Multi-Index From Arrays
pd.MultiIndex.from_arrays(list_of_arrays)
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
Creating Multi-Index From Product
pd.MultiIndex.from_product([index1_list, index2_list])
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
Creating Multi-Index From DataFrame Values
pd.MultiIndex.from_frame(dataframe)
df = pd.DataFrame([['a', 'b'], [1, 2]])
pd.MultiIndex.from_frame(df)
Applying Multi-Index
data_struct.reindex(index)
pop = pop.reindex(index)
Setting Index From Columns
data_struct.set_index([cols])
pop_flat.set_index(['population'])
Accessing Multi-Indexed Data Structures
data_struct[first_index, second_index, ..., col]
pop[:, 2010]
# selects all values of the first index level and only 2010 from the second
Unstacking
data_struct.unstack()
pop.unstack()
# converts the values of the innermost index level (the second, if there are two) into columns
Stacking
data_struct.stack()
pop.stack()
# converts columns into an additional index level
Naming Multi-Indexes
data_struct.index.names = list
pop.index.names = ['state', 'year']
Swapping Multi-Indexes
data_struct.swaplevel(0, 1)
pop_df = pop_df.swaplevel(0, 1)
Dropping Multi-Indexes
data_struct.droplevel(level=index)
pop_df.droplevel(level=0)
Multi-Index In Columns
pd.DataFrame(data, index=multi_index_rows, columns=multi_index_cols)
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
health_data = pd.DataFrame(data, index=index, columns=columns)
Slicing Using Multi-Index Column Values
dataframe[multi_ind_col_value]
health_data['Guido']
Slicing Multi-Index Cols & Rows Using IndexSlice
idx = pd.IndexSlice
df.loc[idx[index_row1, index_row2], idx[index_col1, index_col2]]
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
Resetting Multi-Index to Cols
data_struct.reset_index()
pop.reset_index(name='population')
Sorting Multi-Index
data_struct.sort_index()
data.sort_index()
It is good practice to sort the index after swapping multi-index levels.
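A compact multi-index sketch (the population numbers are illustrative):
import pandas as pd
index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]], names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)
pop['California', 2010]  # single element
pop[:, 2010]  # all states for the year 2010
pop.unstack()  # the year level becomes columns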

Concatenation, Merging and Joins

Concatenation
pd.concat([data_struct, data_struct2], ignore_index=True/False)
pd.concat([ser1, ser2])
Adding MultiIndex Keys
pd.concat([data_struct, data_struct2], keys=['a', 'b'])
pd.concat([x, y], keys=['x', 'y'])
Concatenation with Joins
pd.concat([data_struct, data_struct2], join='outer'/'inner')
pd.concat([df5, df6], join='inner')
# keeps only the columns shared by the two DataFrames (their intersection)
Merging
pd.merge(data_struct, data_struct2)
df3 = pd.merge(df1, df2)
Merging on Columns
pd.merge(data_struct, data_struct2, on='col_name')
pd.merge(df1, df2, on='employee')
Specific Merging
pd.merge(data_struct, data_struct2, left_on='col_name', right_on='col_name')
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
# when using left_on and right_on, drop one of the columns to avoid redundancy
Joining (a merge that defaults to joining on indices)
data_struct.join(data_struct2)
df1a.join(df2a)
Merging on Indices
pd.merge(data_struct, data_struct2, left_index=True, right_index=True)
pd.merge(df1a, df3, left_index=True, right_on='name')
Merging with methods
pd.merge(data_struct, data_struct2, how='inner'/'outer'/'left'/'right')
pd.merge(df6, df7, how='inner')
Merging Conflicting Col Names
pd.merge(data_struct, data_struct2, suffixes=['_suff1', '_suff2'])
pd.merge(df8, df9, on="name", suffixes=["_Sem1", "_Sem_2"])
Note that when adding multi-index keys in a concatenation, the number of keys should equal the number of data structures being concatenated.
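A short merge and concat sketch (the DataFrames are made up for illustration):
import pandas as pd
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'], 'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'], 'hire_date': [2004, 2008, 2012]})
pd.merge(df1, df2, on='employee')  # one-to-one merge on a shared column
pd.concat([df1, df2], keys=['x', 'y'])  # concatenation tagged with multi-index keys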

Advanced Group By Methods

Aggregation using a list (General)
df.groupby('col').aggregate([list_of_methods])
df.groupby('key').aggregate(['min', np.median, max])
Aggregation using a dict (Specific)
df.groupby('col').aggregate({'col': 'method', 'col2': 'method'})
df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})
Filtering
df.groupby('col').filter(func)
def filter_func(x):
    return x['data2'].std() > 4
df.groupby('key').filter(filter_func)
Transformation
df.groupby('col').transform(lambda_func)
df.groupby('key').transform(lambda x: x - x.mean())
Apply
df.groupby('col').apply(user_func)
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x
df.groupby('key').apply(norm_by_data2)
Grouping By Custom Mapping
df.groupby(mapping).method
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()
aggregate(): allows using more than one function with a groupby, including different functions for different columns
filter(): allows user-defined filter functions, which return a boolean, to be applied with a groupby to decide which groups to keep
transform(): returns a new, transformed version of the columns with the same shape as the input, often using lambda functions
apply(): allows applying arbitrary user-defined functions to each group of a groupby
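A compact groupby sketch tying these methods together (the data is illustrative):
import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data1': range(6), 'data2': [5, 0, 3, 3, 7, 9]})
df.groupby('key').aggregate(['min', 'median', 'max'])  # several functions at once
df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})  # different functions per column
df.groupby('key').filter(lambda x: x['data2'].std() > 4)  # keep only groups passing the test
df.groupby('key').transform(lambda x: x - x.mean())  # same shape as the input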