Useful Hive settings

I have found that it is surprisingly hard to find documentation for all of Hive's set commands. There are some listed here. I am going to use this blog to record a few Hive parameters that I was not able to find easily but that are very useful.

DIRECTORY RECURSIVENESS 

The first two parameters work in conjunction. They can be set at the session level or passed in at the beginning of a Hive query; they can also be set permanently in hive-site.xml.

set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;

When both are set to true, Hive tables can be pointed at a higher-level directory. Consider a scenario where the directory structure is as follows:

/iislogs/02b016a5d19748e5a7e711cc0ef94395/WebRole.Web/WebRole.Web_IN_0/Momentum/
W3SVC472036332/u_ex12041323.log

In the above example the GUID-like directory name is auto-generated and changes frequently. In this case it helps to point your table at the first level of the directory, “iislogs”, which is constant, and let Hive do the recursive work for you. See the table DDL in the next example.

If you are using partitioning on a table, I do not suggest using this option, as Hive may simply look into the top-level directory and scan all of the data instead of performing partition elimination.
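For a single run, the two settings can also be passed on the command line instead of being set in the session or in hive-site.xml. A sketch (the script file name here is illustrative, not from the original post):

```shell
# Pass the recursion settings for one invocation only
# (iislogs_query.hql is a hypothetical script name).
hive --hiveconf mapred.input.dir.recursive=true \
     --hiveconf hive.mapred.supports.subdirectories=true \
     -f iislogs_query.hql
```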

SKIP HEADER ROWS

Sometimes when handling log files you will see header rows at the beginning of the data file. This is what an IIS log file looks like (real IPs are obfuscated):

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-04-13 23:00:02
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2012-04-13 23:00:02 W3SVC472036332 RD00155D3AA932 10.200.255.18 GET /blko/get/stats – 20000 – 10.200.216.12 HTTP/1.1 – – – 10.195.255.2:20000 200 0 0 336 82 296
2012-04-13 23:02:22 W3SVC472036332 RD00155D3AA932 10.200.255.18 GET /opt/get/stats – 20000 – 10.200.216.12 HTTP/1.1 – – – 10.195.255.2:20000 200 0 0 336 82 265

It has four header lines that you do not want to include in your Hive query results. You have a few choices for skipping header lines, two of which are Pig and Hive. In Hive, you can set a table property that skips the header lines:

CREATE EXTERNAL TABLE iislogs (
  date STRING,
  time STRING,
  s_sitename STRING,
  s_computername STRING,
  s_ip STRING,
  cs_method STRING,
  cs_uri_stem STRING,
  cs_uri_query STRING,
  s_port STRING,
  cs_username STRING,
  c_ip STRING,
  cs_version STRING,
  cs_user_agent STRING,
  cs_cookie STRING,
  cs_referer STRING,
  cs_host STRING,
  sc_status INT,
  sc_substatus STRING,
  sc_win32_status STRING,
  sc_bytes INT,
  cs_bytes INT,
  time_taken INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/iislogs'
TBLPROPERTIES ("skip.header.line.count"="4");
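The same property can also be applied to a table that already exists; `ALTER TABLE ... SET TBLPROPERTIES` is standard Hive DDL (a sketch, reusing the table above):

```sql
-- Apply or change the header-skip count on an existing table
ALTER TABLE iislogs SET TBLPROPERTIES ("skip.header.line.count"="4");
```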

PRINT HEADER

Sometimes you want to see the column headers for a Hive query you are about to run. To accomplish this, use the following set command before your query to print column headers to stdout.

set hive.cli.print.header=true;
select l_returnflag, count(*) as count from lineitem group by l_returnflag;

Result without header:

A       1478493
N       3043852
R       1478870

Result with header:

l_returnflag    count
A       1478493
N       3043852
R       1478870

TPCH schema for Hadoop – Hive

Recently I had some fun installing Hadoop on my virtual cluster environment. After the installation I started thinking about moving some data over to Hadoop and then creating some Hive tables – after all, I play the part of a data architect in real life :). The following is what I did to load TPCH data that was residing on my local D: drive:

hadoop fs -put D:\tpch_1gb\data\nation.tbl /tpch_1gb/nation/nation.tbl
hadoop fs -put D:\tpch_1gb\data\orders.tbl /tpch_1gb/orders/orders.tbl
hadoop fs -put D:\tpch_1gb\data\part.tbl /tpch_1gb/part/part.tbl
hadoop fs -put D:\tpch_1gb\data\partsupp.tbl /tpch_1gb/partsupp/partsupp.tbl
hadoop fs -put D:\tpch_1gb\data\region.tbl /tpch_1gb/region/region.tbl
hadoop fs -put D:\tpch_1gb\data\supplier.tbl /tpch_1gb/supplier/supplier.tbl
hadoop fs -put D:\tpch_1gb\data\lineitem.tbl /tpch_1gb/lineitem/lineitem.tbl
hadoop fs -put D:\tpch_1gb\data\customer.tbl /tpch_1gb/customer/customer.tbl
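Before creating the tables, it is worth confirming that each file landed where the DDL will expect it. A quick sanity check (assuming the same paths as above):

```shell
# List one of the target directories to verify the upload
hadoop fs -ls /tpch_1gb/lineitem
```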


Note that I am creating these tables as “external tables” to avoid keeping another copy of the data in my HDFS space. When you create external tables, Hive acts purely as a metadata repository, forgoing the need to copy the data into its default warehouse directory. DDL for TPCH:

CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
  l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber BIGINT,
  l_quantity FLOAT, l_extendedprice FLOAT, l_discount FLOAT, l_tax FLOAT,
  l_returnflag STRING, l_linestatus STRING, l_shipdate STRING, l_commitdate STRING,
  l_receiptdate STRING, l_shipinstruct STRING, l_shipmode STRING, l_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/lineitem';

CREATE EXTERNAL TABLE IF NOT EXISTS customer (
  c_custkey BIGINT, c_name STRING, c_address STRING, c_nationkey INT,
  c_phone STRING, c_acctbal FLOAT, c_mktsegment STRING, c_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/customer';

CREATE EXTERNAL TABLE IF NOT EXISTS nation (
  n_nationkey INT, n_name STRING, n_regionkey INT, n_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/nation';

CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus STRING, o_totalprice FLOAT,
  o_orderdate STRING, o_orderpriority STRING, o_clerk STRING,
  o_shippriority INT, o_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/orders';

CREATE EXTERNAL TABLE IF NOT EXISTS part (
  p_partkey BIGINT, p_name STRING, p_mfgr STRING, p_brand STRING, p_type STRING,
  p_size INT, p_container STRING, p_retailprice FLOAT, p_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/part';

CREATE EXTERNAL TABLE IF NOT EXISTS partsupp (
  ps_partkey BIGINT, ps_suppkey BIGINT, ps_availqty INT,
  ps_supplycost FLOAT, ps_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/partsupp';

CREATE EXTERNAL TABLE IF NOT EXISTS region (
  r_regionkey INT, r_name STRING, r_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/region';

CREATE EXTERNAL TABLE IF NOT EXISTS supplier (
  s_suppkey BIGINT, s_name STRING, s_address STRING, s_nationkey INT,
  s_phone STRING, s_acctbal FLOAT, s_comment STRING)
COMMENT 'CREATED BY MURSHEDZ'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE LOCATION '/tpch_1gb/supplier';

Now you can run TPCH queries against these tables (with some modifications). Queries to come in a later post.
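As a taste, here is a sketch of a TPC-H Q1-style pricing summary against the lineitem table above (the shipdate cutoff and column subset are illustrative, not the official benchmark parameter substitution):

```sql
-- TPC-H Q1-style aggregation (sketch; cutoff date is illustrative)
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)                         AS sum_qty,
       SUM(l_extendedprice)                    AS sum_base_price,
       SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       AVG(l_quantity)                         AS avg_qty,
       COUNT(*)                                AS count_order
FROM lineitem
WHERE l_shipdate <= '1998-09-02'
GROUP BY l_returnflag, l_linestatus;
```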