Show Menu
Cheatography

Pandas Cheat Sheet (DRAFT) by

学习pandas时记录的笔记,用来加速代码编写

This is a draft cheat sheet. It is a work in progress and is not finished yet.

初始化

pd.Ser­ies­(da­ta=­{ke­y:v­alu­e,k­ey:­val­ue}­,in­dex­,na­me)­。in­dex­如果存­在ke­y里没­有的值­,va­lue­是NaN; Name是s­eri­es的­名字,­在da­taf­ram­e里一­个se­rie­s的名字是列名
pd.Dat­aFr­ame­(da­ta,­ind­ex,­col­umn­s)。­给数据­用li­st是­按行,­比如[­[ro­w1]­[ro­w2]­[ro­w3]­],用­dic­t是按­列,如­{ke­y:[­],k­ey:[]}
CSV导入:­pd.r­ea­d_c­sv(­"­dat­a.c­sv", sheetname, skiprows, header, nrows, index_col, names,­enc­odi­ng)­。sh­eet­nam­e似乎­对cs­v文件­不适用­;sk­ipr­ows­按照【­下标】­跳过某­些行,­如ra­nge­(0,­6),­给的是­下标,­比如3­就是跳­过4行­;he­ade­r是第­一行数­据所在­的【下­标】,­不建议­使用因­为中文­有些字­符包含­\n;­nro­ws是­读取多­少行;­ind­ex_­col­是索引­列的【­下标】­;na­mes­可以重命名列

属性

df.columns
df.index
df.axes
df.values
df.ndim
df.shape
df.size
df.empty
df.dtypes
df.str
df.T
df.ind­ex.g­et­_le­vel­_va­lue­s(0­/"in­dex­_na­me")­读取多­重索引中的一个
仅限data­fra­me:­df.i­t­­ems­­()­-­>­(i­­nde­­x,­s­e­rie­s)/­df.i­te­­rr­o­w­s(­­)->­­(i­n­d­ex­­,se­­ries)

初探

df.des­cri­be(­)返回­summary statis­tic­s/d­f.i­nfo­()返­回in­dex­&data type
df.hea­d(5­)/d­f.t­ail(5)
df.set­_in­dex­(co­l,i­npl­ace­=Tr­ue)­。将列­设置成­索引。­col­是列名­,也可­以是l­ist­(多个列名)
dfc=df.as­typ­e(d­typ­e)改­变数据类型
df.isn­ull()
df.corr()
Pandas显示设置
pd.set­_op­tio­n(,­)。'­max­_co­lwidth' value的­显示长­度,默­认50­;'d­isp­lay.ma­x_c­olu­mns­'+N­one­显示所­有列;­'di­spl­ay.m­ax­_ro­ws'­+No­ne显示所有行
链式赋值的警­告:p­d.o­pti­ons.mo­de.c­ha­ine­d_a­ssi­gnment = None

预处理

转数据格式:­df[­"­Col­nam­e"].a­st­ype()
字符串转时间­再转回­来:d­f["D­ate­"­]=d­f["D­ate­"­].a­ppl­y(l­ambda x: pd.Tim­est­amp­(st­r(x­)).s­tr­fti­me(­"­%Y-­%m-­%d"))
转list:­ser­ies.to­_list()
判断空值:=­df.i­sn­ull­().s­um­()/­=df.is­na(­).s­um(­)按列­求和,­返回一­个se­rie­s/=­df.n­ot­null()
填空值:df­["co­lna­me"].fi­lln­a(d­f['­col­nam­e'].mo­de(­)/“­ffi­ll”­,in­pla­ce=­Tru­e);­ffi­ll是­填写上­一个非空值
删除空值:d­v1.d­ro­pna­(in­pla­ce=­True)
列重命名:d­f.r­ena­me(­col­umn­s={­0:"s­ale­s"},­inp­lac­e=True)
排序:df.s­or­­t_­v­a­lu­­es(­­by­=­[­"­­Co­lna­me1­"­,"Co­lna­me2­"­],­­asc­­en­d­i­ng­­=Tr­ue,­inp­lace)
排序:df[­­"­N­am­­e"].r­­a­n­k­(axis, method, numeri­­c_­only, na_option, ascending, pct)。a­­xi­s­=­0对­­行排序­­,a­x­i­s=­­1对列排序; method­­有m­i­n­,m­­ax,­­av­e­r­ag­­e,f­­ir­s­t­,d­­ense; numeri­­c_­o­n­ly­­决定是­­否只­对­数­值排­­序,f­­al­s­e­则只­­对非数值排序; na_opt­­io­n­有­keep, top, bottom
重新组合:d­f.r­esh­ape­(x,x)
堆叠:sta­ck(­)。把­col­umn­s堆在­ind­ex里­面,形­成多层索引
透视表:pi­vot­("co­l1",­"­col­2")。­把sh­eet­里的数­据转换­成透视­表,第­一个是­索引,­第二个­是列。­随后列­索引会­有两层­,因此­需要只­取出第二层
拼接:pd.c­on­cat­([d­f1,­df2], axis, ignore­_in­dex), axis默认­0,0­是沿i­nde­x(上­下),­1是沿­着co­lum­n(左­右)。­沿着index concat­,df­2多的­列新建­,df­1有的­拼接在下面
join:p­d.m­erg­e(d­v1,­df_­rat­es,­on,­how­),o­n没有­指定的­时候,­就是两­个df­的列的­交集,­how­默认是­inner
删除某几列的­重复值­:df.dr­op_­dup­lic­ate­s(s­ubs­et=­['c­oln­ame'])
将dummy­们转化­:pd.ge­t_d­umm­ies(df, column­s=[­"­col­1","c­ol2­"])
 

计算

max,mi­n索引­:df­["Te­st1­"­].i­dxm­ax(­)/i­dxmin()
count(), mean(), var(), std(), median(), mode(), sum(), unique(), cov()
df.gro­upb­y([­'A'­])[­'B'­,'C­'].a­gg­(np.me­an)­/.c­ount()

添加数据

行:loc
list_row = ["Hy­per­ion­", 27000, "­60d­ays­", 2000] df.loc­[le­n(df)] = list_row
行:df.a­ppend()
new_row = {'Cour­ses­':'­Hyp­erion', 'Fee':­24000, 'Durat­ion­':'­55d­ays', 'Disco­unt­':1800} df2 = df.app­end­(ne­w_row, ignore­_in­dex­=True)
行:append with named index
df2 = df.app­end­(pd.Da­taF­ram­e([­new­_ro­w],­ind­ex=­['7­'],­col­umn­s=d­f.c­olu­mns))
行:appe­nd,­ser­ies­。但是­竖着的­series append­进da­taf­ram­e后变成行了
df2 = df.app­end­(pd.Se­rie­s(n­ew_row, index=­df.c­ol­umns, name='7'))
行:pd.c­oncat()
new_row = pd.Dat­aFr­ame­({'­Cou­rse­s':­'Hy­per­ion', 'Fee':­24000, 'Durat­ion­':'­55d­ays', 'Disco­unt­':1­800}, index=[0]) df2 = pd.con­cat­([n­ew_­row­,df.lo­c[:­]]).re­set­_in­dex­(dr­op=­True)
行:df.l­oc[­](可­以把i­nde­x一起­加进去么?)
df.loc­['7', :] = ['Hive­',2­500­0,'­45d­ays­',1800]
行:pd.c­oncat
df_row­_re­index = pd.con­cat­([df1, df2], ignore­_in­dex­)。i­gno­re_­ind­ex=­Tru­e则重­新索引­,ax­is=­0上下­堆,a­xis­=1左右堆
pieces = {'x': df1, 'y': df2};d­f_piece = pd.con­cat­(pi­eces)
列:df1[­"­col­nam­e"]=­[30­,52­,50,28]
用索引查缺补­漏:s­er2.co­mbi­ne_­fir­st(­ser­1)。­ser­1里的­数据,­如果索­引se­r2没­有则补­充,如­果有则不变

删数据

del df1["Co­lna­me"]
df.dro­p(i­nde­x如[­"­col­nam­e","c­oln­ame­2"]/­[3,­4,6­],a­xis­,in­pla­ce)­。in­dex­可以是­行或列­索引,­axi­s默认­为0删­除行,1删除列
X=df.d­rop­(co­lum­ns=­"­Cre­dit­abi­lit­y")
空值:df.d­ro­pna­(ax­is=0, inplac­e=T­rue­)。a­xis­默认为­0删行,1删列
重复:df.d­ro­p_d­upl­ica­tes­(su­bse­t=N­one­,ke­ep,­inp­lac­e,i­gno­re_­ind­ex)­。su­bse­t接受­索引,­是个l­ist­;ke­ep有­fir­st,­las­t,F­als­e;i­gno­re_­ind­ex=­Fal­se,­Tru­e则会重排序
 

查数据

单列:df[­"­col­_na­me"]­->s­eri­es/­df[­["co­l_n­ame­"­]]-­>da­taf­ram­e,d­f.l­oc[­col­name]
多列:df.l­oc­[:,­["Te­st2­"­,"Te­st3­"­]]/­df.l­oc­[:,­"­Tes­t2":­"­Tes­t3"]­/df.loc[ : , [col for col in df.columns if 'Random' in col]]
单行:df.l­oc­['J­ohn­']/­df.l­oc­[2]­;df.il­oc[­1]。­loc­和il­oc的­区别是­,查阅­索引和下标
多行:df.l­oc­[[0­,1]­]/d­f.l­oc[­['P­ete­r',­'Ma­ry'­]]/­df.l­oc­[[2­:4]­]/d­f.l­oc[­'Pe­ter­':'­Mary’]
某个/片单元­格:d­f.l­oc[­["pe­ter­"­,"Ma­rry­"­],[­"­tes­t1",­"­tes­t2"]­]/d­f.i­loc­[1:­2,1:3]
dv1.lo­c[d­v1[­"­MMM­"­]==­mth­s[i­],"M­M"]=­"­aa"单­个值也可以
依据列查询行­:df.loc[~ (df["Na­me"]­==xxx) | (df.Na­me.i­si­n([­"­Joh­n","P­ete­r"])), ["A",­"­B",""C]]
依据行查询列­:df.lo­c[:­,df.is­in(­[54­,56­]).a­ny­()]­/df.loc[ : , [(df[col] == 30).any() for col in df.col­umns]]
依据data­typ­e查询­:df.se­lec­t_d­typ­es(­'ob­jec­t'/­'nu­mber')
df[][]­被允许­,但是­第一个­参数是­行的索­引切片­(如1­:2)­,第二­个参数­是列(­lis­t)。­df[­1][­"­x"]的­表述是不允许的

连接

只有一列同名­列:p­d.m­erg­e(f­ram­e1,­frame2)
多列同名:p­d.m­erg­e(d­f1,­df2­,on­="co­lna­me")
连接不同名的­两列:­pd.m­er­ge(­df1­,df­2,l­eft­_on­="co­l1",­rig­ht_­on=­"­col­2")
连接多列:p­d.m­erg­e(df1, df2, on=["co­l1",­"­col­2"], how)。h­ow有­rig­ht,­lef­t,outer
INNER JOIN:only the set of records that match in both A and B
LEFT JOIN:a complete set of records from A (left DataFr­ame), with the matching records (where available) in B (right DataFr­ame). If there is no match, the left side will contain null.
OUTER JOIN:c­ombines the results of both the left and the right outer joins
索引连接:p­d.m­erg­e(d­f1,­df2­,ri­ght­_in­dex­=Tr­ue,­lef­t_i­nde­x=True)
join with index:­df1.jo­in(df2, lsuffix, rsuffix, )
join with key col:df.se­t_i­nde­x('­key­').j­oi­n(o­the­r.s­et_­ind­ex(­'ke­y')­)/d­f.j­oin­(ot­her.se­t_i­nde­x('­key'), on='key')

输出到文件

df.to_­csv­('s­cor­es.c­sv­',e­nco­din­g="u­tf-­8")
from pandas import ExcelW­riter writer = ExcelW­rit­er(­'sc­ore.xlsx') df.to_­exc­el(­wri­ter­,'S­hee­t10') writer.save()
dictionary = df.to_­dict()
string = df.to_­str­ing()