April | 2014 | Murshed Zaman

Useful Hive settings

April 18, 2014 by MurshedAzureCAT Leave a comment

I have found that it is hard to find information on all the set commands for Hive. There are some listed here. I am going to use this blog to save some hive parameters that I wasn’t able to find easily but are very useful.

DIRECTORY RECURSIVENESS

Two of these parameters work in conjunction. These can be session level or passed in at the beginning of a Hive query. They can also be set permanently in hive-site.xml.

Set mapred.input.dir.recursive=true;
Set hive.mapred.supports.subdirectories=true;

When set to true Hive tables can be pointed to the higher level directory. Think of a scenario where the directory structure is as following:

/iislogs/02b016a5d19748e5a7e711cc0ef94395/WebRole.Web/WebRole.Web_IN_0/Momentum/
W3SVC472036332/u_ex12041323.log

In the above example the GUID like directory name is auto generated and changes frequently. In this case, it helps just to point your table to the first level of the directory “iislogs” which is constant and let hive do the recursive work for you. See table DDL in the next example.

If you are using partitioning on a table I don’t suggest using this option as Hive may just look into the top level directory and scan all data instead of partition elimination.

SKIP HEADER ROWS

Sometimes when handling log files you will see header rows at the beginning of the data file. This is how a iislog file looks like (real IPs are obfuscated):

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-04-13 23:00:02
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2012-04-13 23:00:02 W3SVC472036332 RD00155D3AA932 10.200.255.18 GET /blko/get/stats – 20000 – 10.200.216.12 HTTP/1.1 – – – 10.195.255.2:20000 200 0 0 336 82 296
2012-04-13 23:02:22 W3SVC472036332 RD00155D3AA932 10.200.255.18 GET /opt/get/stats – 20000 – 10.200.216.12 HTTP/1.1 – – – 10.195.255.2:20000 200 0 0 336 82 265

It has 4 lines of headers that you do not want to include in your Hive query. To skip header lines from your tables you have choices and two of them are using PIG or Hive. In Hive you can set a table property that will allow you to skip the header lines.

CREATE EXTERNAL TABLE iislogs (
date STRING,
time STRING,
s_sitename STRING,
s_computername STRING,
s_ip STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
s_port STRING,
cs_username STRING,
c_ip STRING,
cs_version STRING,
cs_user_agent STRING,
cs_cookie STRING,
cs_referer STRING,
cs_host STRING,
sc_status INT,
sc_substatus STRING,
sc_win32_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘ ‘ STORED AS TEXTFILE
LOCATION ‘./iislogs’
tblproperties ("skip.header.line.count"="4");

PRINT HEADER

Sometimes you want to see the header of columns for a Hive query that you want to run. To accomplish this you can use the following set command before your query to show column headers in STDOUT.

Set hive.cli.print.header=true;
select l_returnflag, count(*) as count from lineitem group by l_returnflag;

Result without header:

A       1478493
N       3043852
R       1478870

Result with header:

l_returnflag    count
A       1478493
N       3043852
R       1478870

Filed under Hadoop Tagged with HDInsight; Hadoop, hive

Murshed Zaman

Useful Hive settings

Follow me on Twitter

Recent Posts

Archives

Categories