Show Menu

Apache Hive in Easy steps Cheat Sheet by

apache     sql     systems     hive     distributed


Hive is not an RDBMS, but it pretends to be one most of the time. It has tables, it runs SQL, and it supports both JDBC and ODBC. Hive lets you use SQL on Hadoop, but tuning SQL on a distri­buted system is different. Here are 12 tips to help your effort fly

The good and bad news of this revela­tion: Hive doesn’t run queries the way an RDBMS does. It's a long story, but I spent an 80-plu­s-hour workweek personally tuning Hive. Needless to say, my head won't quit buzzing. So for your benefit, here are a few sugges­tions to make your next Hive project go a little bit faster than mine did.
"­There’s my dirty dozen. I hope this means my endless late-night sufferings will not have been in vain -- and help keep others from getting stung by Hive’s eccent­ric­iti­es."­ Andrew C. Oliver — Columnist

1. Don’t use MapReduce

Whether you believe in Tez, Spark or Impala, don’t believe in MapReduce. It is slow on its own, and it's really slow under Hive. If you’re on Horton­work’s distri­bution, you can throw set hive.e­xec­uti­on.e­ng­ine=tez at the top of a script. On Cloudera, use Impala. Hopefully, soon you can set hive.e­xec­uti­on.e­ng­ine­=spark when Impala isn’t approp­riate.

2. Don’t do string matching in SQL

Ever! Especially in Hive. If you stick a like string match where a clause should be, you'll generate a cross-­product warning. If you have a query that runs in seconds, with string matching it will take minutes. Your best altern­ative is to use one of many tools that allow you to add search to Hadoop. Look at Elasti­cse­arch’s Hive integr­ation or Lucidw­ork’s integr­ation for Solr. Also, there's Cloudera Search. RDBMSes were never good at this, but Hive is worse.

3. Don't do a join on a subquery

You're better off creating a temporary table, then joining against the temp table instead of asking Hive to be smart about how it handles subque­ries. Meaning don't do this:

select a.* from something a inner join (select ... from someth­ingelse union b select ... from anothe­rthing c) d on a.key1 = d.key1 and a.key2 = b.key2 where a.cond­ition=1

Instead, do this:

create var_temp as select ... from someth­ingelse b union select ... from anothe­rthing c and then select a.* from something a inner join from var_temp b where a.key1­=b.key1 and a.key2­=b.key2 where a.cond­ition=1

It really shouldn’t be tons faster at this point in Hive’s evolution, but it is, generally.

Apache Hive

4. Use Parquet or ORC, but don’t convert to them

Use Parquet or ORC, but don’t convert to them for sport

That is, use Parquet or ORC as opposed to, say, TEXTFILE. However, if you have text data coming in and are massaging it into something slightly more struct­ured, do the conversion to the target tables. You can’t LOAD DATA from a text file into an ORC, so do the initial load into a text.

When you create other tables against which you’ll ultimately run most of your analysis, do your ORCing there because converting to ORC or Parquet takes time and isn’t worth it as step one in your ETL process. If you have simple flat files coming in and aren’t doing any tweaking, then you’re stuck loading into a temporary table and doing a select create into an ORC or Parquet. I don’t envy you because it is kind of slow.

5. Try turning vector­ization on and off

Add set hive.v­ect­ori­zed.ex­ecu­tio­n.e­nabled = true set hive.v­ect­ori­zed.ex­ecu­tio­n.r­edu­ce.e­nabled = true to the top of your scripts. Try it with them on and off because vector­ization seems proble­matic in recent versions of Hive

6. Don’t use structs in a join

I have to admit my native­-brain SQL syntax is about SQL-92 era, so I don’t tend to use structs anyhow. But if you’re doing something super-­rep­etitive like ON clauses for compound PKs, structs are handy. Unfort­una­tely, Hive chokes on them -- partic­ularly in the ON clause. Of course, it doesn’t do so at smaller data sets and yields no errors much of the time. In Tez, you get a fun vector error. This limitation isn’t documented anywhere that I know of. Consider this a fun way to get to know the innards of your execution engine!

7. Check your container size

You may need to increase your container size for Impala or Tez. Also, the “recom­mended” sizes may not apply to your system if you have larger node sizes. You might want to make sure your YARN queue and general YARN memory are approp­riate. You might also want to peg it to something that isn’t the default queue all the peasants use.

8. Enable statistics

Hive does somewhat boneheaded things with joins unless statistics are enabled. You may also want to use query hints in Impala.

9. Consider MapJoin optimi­zations

If you do an explain on your query, you may find that recent versions of Hive are smart enough to apply the optimi­zation automa­tic­ally. But you may need to tweak them.

9. Consider MapJoin optimi­zations


10. If you can, put the largest table last


11. Partitions are your friends ... sorta

If you have this one item in many places where clauses like a date (but ideally not a range) or a location repeat, you might have your partition key! Partitions basically mean “split this into its own direct­ory,” which means instead of looking at a big file, Hive looks at the one file because you have it in your join/where clause saying you’re only looking at locati­on=­’NC’, which is a small subset of your data. Also, unlike with column values, you can push partitions in your LOAD DATA statem­ents. However, remember that HDFS does not love small files.

12. Use hashes for column compar­isons

If you’re comparing the same 10 fields in every query, consider using hash() and comparing the sums. These are sometimes so useful you might shove them in an output table. Note that the hash in Hive 0.12 is a low resolu­tion, but better hashes are available in 0.13.

Download the Apache Hive in Easy steps Cheat Sheet

2 Pages

PDF (recommended)

Alternative Downloads

Share This Cheat Sheet!



No comments yet. Add yours below!

Add a Comment

Your Comment

Please enter your name.

    Please enter your email address

      Please enter your Comment.

          Related Cheat Sheets

          mod_rewrite Cheat Sheet
          HTTP Status Codes Cheat Sheet
          Selenium WebDriver Cheat Sheet Cheat Sheet

          More Cheat Sheets by Davidpol