
Session

Interpreter
%jdbc(hive)
%livy2.pyspark
Create Session
 
"from pyspar­k_llap import HiveWa­reh­ous­eSe­ssion hive = HiveWa­reh­ous­eSe­ssi­on.s­es­sio­n(s­par­k).b­ui­ld(­)"
Connect to a database
use ampb_sandbox
hive.setDatabase("dl_prod_sandbox_agi")
List Databases
show databases
hive.execute("show databases").show(truncate=False)
List Tables
show tables
hive.execute("show tables").show(truncate=False)
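
Putting the steps above together, a minimal session sketch for a %livy2.pyspark paragraph, assuming the Hive Warehouse Connector (pyspark_llap) is on the classpath and Zeppelin already provides spark:

# Build a Hive Warehouse Connector session from the Zeppelin-provided spark.
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()

# Point the session at a database, then inspect what is available.
hive.setDatabase("dl_prod_sandbox_agi")
hive.execute("show databases").show(truncate=False)
hive.execute("show tables").show(truncate=False)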

Data wrangling

function
sql equivalent
pyspark df
Selecting columns
select arbetsgivaravgift_organisationsnummer_id as orgnr, arbetsgivaravgift_redovisad_period as period, arbetsgivaravgift_inbetalat_belopp as inbetalt_belopp from df_agavgift
df_agavgift.select(col("arbetsgivaravgift_organisationsnummer_id").alias("orgnr"), col("arbetsgivaravgift_redovisad_period").alias("period"), col("arbetsgivaravgift_inbetalat_belopp").alias("inbetalt_belopp"))
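
A runnable version of the selection above; note that col must be imported from pyspark.sql.functions first (df_agavgift is assumed to be the dataframe from this example):

# col() lives in pyspark.sql.functions and must be imported before use.
from pyspark.sql.functions import col

df_selected = df_agavgift.select(
    col("arbetsgivaravgift_organisationsnummer_id").alias("orgnr"),
    col("arbetsgivaravgift_redovisad_period").alias("period"),
    col("arbetsgivaravgift_inbetalat_belopp").alias("inbetalt_belopp"),
)
df_selected.show(truncate=False)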
   

Livy2.pyspark

Transfer data from Hive to a PySpark dataframe
pyspark_df = hive.table("hive_table")
Transfer data from PySpark to Hive
pyspark_df.createOrReplaceTempView("hive_table")
Remove a temp table
spark.catalog.dropTempView("hive_table")
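
A short round-trip sketch of the calls above (hive_table and the view name are placeholders):

# Read a Hive table into a Spark dataframe through the HWC session.
pyspark_df = hive.table("hive_table")

# Register the dataframe as a temporary view so Spark SQL can query it.
pyspark_df.createOrReplaceTempView("hive_table_view")
spark.sql("select count(*) from hive_table_view").show()

# Drop the temporary view when it is no longer needed.
spark.catalog.dropTempView("hive_table_view")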

Combining data

inner join
df1.join(df2, df1.name == df2.name)
left join
df1.join(df2, df1.name == df2.name, how='left')
right join
df1.join(df2, df1.name == df2.name, how='right')
joining on multiple columns
df1.join(df2, (df1.name == df2.name) & (df1.id == df2.id))
In PySpark, start by setting an alias for each table you want to join:
df1 = TableA.alias('df1')
df2 = TableB.alias('df2')
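
A self-contained sketch of the alias-then-join pattern; TableA, TableB, and the column names here are made-up placeholders:

# Toy dataframes standing in for TableA and TableB (placeholder data).
TableA = spark.createDataFrame([("anna", 1), ("bert", 2)], ["name", "id"])
TableB = spark.createDataFrame([("anna", 10), ("carl", 20)], ["name", "score"])

df1 = TableA.alias('df1')
df2 = TableB.alias('df2')

# Inner join on name; the aliases let you qualify columns that exist
# on both sides and would otherwise be ambiguous.
joined = df1.join(df2, df1.name == df2.name, how='inner')
joined.select('df1.name', 'df1.id', 'df2.score').show()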


Useful links:
http://www.learnbymarketing.com/1100/pyspark-joins-by-example/
http://www.learnbymarketing.com/618/pyspark-rdd-basics-examples/