Things you need to know about Hive

Data Analysis Skills 09

BW L.
Data Engineering Insight

--

This post is part of the Hadoop Series.

Data in Hadoop is often organized with Hive using HDFS as the storage layer. Each Hive table is stored at an HDFS location, which can be found using Hive SQL: describe formatted db.tbl;
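For example, in a Hive session (the Location value below is only illustrative; the actual path depends on your cluster and warehouse configuration):

describe formatted db.tbl;
-- Look for the "Location:" row in the output, for example:
-- Location:   hdfs://nameservice1/apps/hive/warehouse/db.db/tbl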

As we briefly touched on in the HDFS post, the size, number of files, and storage format of a Hive table will have an impact on the performance of processes that use the table.

Here are some basic principles to keep in mind when working with Hive:

  • Fewer files per table is better.
  • The default storage format should not be plaintext. Think about the downstream tools that will use your table and choose a proper storage format (ORC, Parquet, etc.). If the table will only be used by Hive, ORC is the choice. If Spark or H2O will be using the table, Parquet might be better.
  • Automate the consolidation of small files in new Hive tables by setting default Hive properties (see the sketch after this list).
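For the last point, here is a minimal sketch of the session-level merge settings involved; the property names are standard Hive settings, but the byte thresholds are only illustrative and should be tuned for your cluster:

set hive.merge.mapfiles=true;                 -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;              -- merge small files produced by MapReduce jobs
set hive.merge.tezfiles=true;                 -- merge small files produced by Tez jobs
set hive.merge.smallfiles.avgsize=134217728;  -- trigger a merge when the average output file is below ~128 MB
set hive.merge.size.per.task=268435456;       -- aim for ~256 MB per merged file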

Factors that impact the execution of Hive code

The execution of Hive code eventually becomes a series of mapper and reducer tasks. Although the MapReduce and Tez engines might create completely different execution plans because of engine-specific optimizations, we are not covering their differences in this post.

Mapper tasks are normally used to read the source tables of the query. The number of mappers for a table is determined by the number of files/blocks of the table. Therefore, a Hive table with thousands of files will have more mapper tasks than another Hive table with the same data but fewer files.

The last step of a Hive execution plan is a reducer stage that writes the final results to HDFS. The number of reducers at this stage determines how many files the final table will have.
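As a hedged illustration, the reducer-related properties listed at the end of this post are the usual levers here; the byte size, table names, and column name below are placeholders:

-- Fewer, larger reducers mean fewer (larger) output files for the final table.
-- Rough rule: number of reducers ≈ total input bytes / bytes.per.reducer, capped by hive.exec.reducers.max.
set hive.exec.reducers.bytes.per.reducer=1073741824;  -- ~1 GB of input per reducer (illustrative)
insert overwrite table db.tbl_out
select col1, count(*) from db.tbl_in group by col1;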

Size of a Hive table and number of files

The size and the number of files of a Hive table together impact the performance of Hive and Spark code that uses the table. You should always be aware of these two metrics for the tables you are working on.

  • The size of a Hive table affects the IO (read and write) cost whenever the table is read for further processing or joined with other tables.
  • The number of files of a Hive table decides the number of mappers the initial Hive job will have. Too many files will cause the Hive job to try to spin up the same number of mappers, wasting memory and potentially saturating the queue.

Below is a shell script that prints the HDFS path, total size, and number of files of a Hive table.

#!/bin/bash
# Usage: ./get_table_size.sh <db> <table>
db=$1
tbl=$2
# The awk below assumes beeline-style, pipe-delimited output; adjust the field if your hive CLI prints tab-separated output.
loc=$(hive -e "describe formatted $db.$tbl" 2>/dev/null | grep Location | awk -F'|' '{print $3}' | tr -d ' ')
size=$(hadoop fs -du -s -h "$loc" | awk '{print $1, $2}')     # keep just the human-readable size
num_files=$(hadoop fs -ls "$loc" | grep -v '^Found' | wc -l)  # drop the "Found N items" header line
echo "Table $db.$tbl"
echo -e "Size\tHDFS Path\tNum Files"
echo -e "$size\t$loc\t$num_files"

Save it as get_table_size.sh and make it executable

chmod +x get_table_size.sh

and run it to get the HDFS location, size, and number of files for the Hive table db.tbl:

./get_table_size.sh db tbl

Hive table storage format

Hive tables might be stored as plaintext, ORC, Parquet, Avro, etc. Each format has its pros and cons. Below are some cases when each format might be used.

  • Plaintext. Data is stored as text files in CSV or TSV format, with no compression. This is good for the staging step of an ingestion pipeline if further processing needs to be done before the data becomes a Hive table, but it is not recommended as a long-term storage format since it wastes storage.
  • ORC gives a huge saving in disk space for most data, especially sparse tables. Unfortunately it does not carry the original column names and data types, so it is not convenient to read directly from HDFS.
  • Parquet has compression and comes with column information. I would recommend using Parquet when you create the table and know that Spark or H2O will be using it (see the example after this list).
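For example, the format is declared with a STORED AS clause when the table is created (the table and column names below are placeholders):

CREATE TABLE db.events_parquet (
  event_id string,
  event_ts string
)
STORED AS PARQUET;  -- or STORED AS ORC if the table will only be read by Hive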

For an existing Hive table that is stored as plaintext, you may want to convert it to another format based on your needs.

Below is a simple script to convert an existing table into another format.

#!/bin/bash
# Usage: ./table_format.sh <db> <table> <ORC|PARQUET>
# Count the rows of a table; the sed filters assume beeline-style output
# (they strip the "0: jdbc" prompt lines and the _c0 header line).
check_count(){
hive --outputformat=dsv -e "select count(*) from $1.$2" 2>/dev/null | sed '/0: jdbc/d' | sed '/_c0/d'
}
file=convert_format.hql   # temporary HQL file generated below
db=$1
table=$2
format=$3 # ORC | PARQUET
to_drop="${table}_drop"
stage="${table}_stage"
echo "checking count of $db.$table"
count=$(check_count $db $table)
# Rebuild the table in the requested format via a staging table, then swap the names.
echo "drop table if exists $db.$stage;
drop table if exists $db.$to_drop;
create table $db.$stage stored as $format as select * from $db.$table;
alter table $db.$table rename to $db.$to_drop;
alter table $db.$stage rename to $db.$table;" > $file
hive -f $file
echo "checking count of new table $db.$table"
new_count=$(check_count $db $table)
echo "==== Original count: $count, new count: $new_count"
echo "==== Original table is renamed to $db.$to_drop.
Remember to
drop table $db.$to_drop;"

Save this script as a shell script file, table_format.sh, and make it executable with chmod +x table_format.sh. Then run the command below to convert the Hive table db.tbl to ORC format. If you prefer Parquet, just replace ORC with PARQUET.

./table_format.sh db tbl ORC

The script creates a new table with the same data, stored in the format given by the third argument. The original table is renamed to ${table}_drop, where table is the original table name. You can drop it once you are sure the new table matches the original.

Note that the script does a sanity check on the row counts of the original and new tables. For large tables this will take time because Hive essentially scans both tables from beginning to end. You can comment out the related lines (the lines that call the check_count function) if you want.

Create a Hive external table from data in your home folder

You should always put the data of a Hive external table in an HDFS folder, even if there is only one file. Create a folder in HDFS first, then upload the data to the newly created folder:

hadoop fs -mkdir -p data/table_name
hadoop fs -put data.csv data/table_name

Now you are ready to create the external table at the HDFS location:
/user/<user name>/data/table_name.

Run the Hive SQL statements below to create the table:

USE db_name;
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
col1 string,
col2 string )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/<user name>/data/table_name';

/user/<user name>/data/table_name is the absolute path of the HDFS folder you just created.

Pay attention to whether data.csv has a header line. The Hive statements above assume there is no header line, so remember to remove the header before uploading to HDFS.
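One way to drop the header before the upload is a quick shell one-liner (data_noheader.csv is just an example name):

tail -n +2 data.csv > data_noheader.csv   # keep everything from line 2 onward
hadoop fs -put data_noheader.csv data/table_name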

If the data has only one file and you don't want to remove the header before uploading to HDFS, you can skip the header by adding the line below after the LOCATION line.

TBLPROPERTIES ("skip.header.line.count"="1");
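Note that TBLPROPERTIES must come before the closing semicolon, so the end of the CREATE statement becomes:

STORED AS TEXTFILE
LOCATION '/user/<user name>/data/table_name'
TBLPROPERTIES ("skip.header.line.count"="1");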

Optimize Hive performance by setting Hive properties

The behavior of Hive can be customized with Hive properties. To set a Hive property, use the format set <property>=<value>; in Hive SQL.

Below are some properties that might be useful; a worked example with illustrative values follows the list.

  • Default storage format for a new table:
set hive.default.fileformat.managed=ORC;
set hive.default.fileformat=ORC;
  • Mapper
set tez.grouping.min-size=<byte size>;
set mapreduce.input.fileinputformat.split.maxsize=<byte size>;
  • Reducer
set hive.exec.reducers.bytes.per.reducer=<byte size>;
set hive.exec.reducers.max=<number>;
set hive.tez.auto.reducer.parallelism=true;
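As a concrete (purely illustrative) starting point, a session might set these together; the right values depend entirely on your data and cluster:

set tez.grouping.min-size=134217728;                          -- ~128 MB minimum per Tez input split
set mapreduce.input.fileinputformat.split.maxsize=268435456;  -- ~256 MB maximum per MapReduce split
set hive.exec.reducers.bytes.per.reducer=268435456;           -- ~256 MB of data per reducer
set hive.exec.reducers.max=200;                               -- cap the number of reducers
set hive.tez.auto.reducer.parallelism=true;                   -- let Tez adjust the reducer count at runtime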

Of course there are tons of properties that can be adjusted. Refer to Apache's Hive properties page for a complete list.

What’s Next

Continue to the next posts:
