Category Archives: Hive

Hive – testing queries with dummy data

If your query looks like “SELECT * FROM TABLE1;”, you want to test the input from “TABLE1” with your dummy dataset. If you have multiple subqueries using a base table, this comes in very handy.

-- Creating a single dummy row:
SELECT * FROM (
-- This is our dummy row, which is a replacement of… Read More »
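The excerpt cuts off mid-example, but the idea can be sketched. A hedged sketch (not the post's exact code, and the column names are illustrative): replace the base table with an inline subquery of literal rows, so the rest of the query runs unchanged against known input.

-- Requires a Hive version that supports FROM-less SELECT (0.13+);
-- on older versions, select the literals from any one-row table instead.
SELECT *
FROM (
    SELECT 1 AS id, 'alice' AS name
    UNION ALL
    SELECT 2 AS id, 'bob' AS name
) TABLE1;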

Hive – Optimization

To set user timezone:
SET mapreduce.map.java.opts="-Duser.timezone=UTC";
SET mapreduce.reduce.java.opts="-Duser.timezone=UTC";

Compress results
-- Determines whether the output of the final map/reduce job in a query is compressed or not.
SET hive.exec.compress.output=true;
-- Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not.
SET hive.exec.compress.intermediate=true;

Avro settings – Compression
-- Supported codecs… Read More »
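The Avro portion is truncated; as an assumption about where it is heading (these are the standard Avro codec settings, not confirmed from the post), compressed Avro output can be enabled like this:

-- Assumption: codec names are the standard Avro options (snappy, deflate).
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;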

Hive – Best Practices

Testing with Dummy data – Check here. Beeline doesn't honor tabs; if you are using an editor, you can replace tabs with spaces to maintain the structure and still use Beeline effectively. Ex:
CREATE TABLE IF NOT EXISTS default.test1 (id<tab>INT,name STRING); -- this will fail
Hive will throw an error saying "Error: Error while compiling… Read More »
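A quick sketch of the workaround described above: the same DDL compiles once the literal tab is replaced with a space.

-- Fails when a literal tab separates "id" and "INT":
-- CREATE TABLE IF NOT EXISTS default.test1 (id<tab>INT,name STRING);
-- Works with a plain space:
CREATE TABLE IF NOT EXISTS default.test1 (id INT, name STRING);

If the script came out of an editor full of tabs, a bulk replace such as sed 's/\t/ /g' script.hql is one way to clean it before feeding it to Beeline.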

Hive – big data – big problems

2017-07-26 00:32:04,676 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOfRange(Arrays.java:3664)
	at java.lang.String.<init>(String.java:207)
	at java.lang.String.substring(String.java:1933)
	at java.io.File.getName(File.java:456)
	at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:243)
	at java.io.File.isDirectory(File.java:849)
	at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList(ProcfsBasedProcessTree.java:511)
	at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:210)
	at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:894)
	at org.apache.hadoop.mapred.Task.updateCounters(Task.java:1045)
	at org.apache.hadoop.mapred.Task.access$500(Task.java:82)
	at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:782)
	at java.lang.Thread.run(Thread.java:745)
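The excerpt shows only the stack trace, not the post's resolution. As a labeled assumption rather than the post's confirmed fix, a common first response to GC overhead limit exceeded in a MapReduce task is to give the task JVMs more heap:

-- Assumption: the values below are illustrative; tune them to your cluster.
SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts="-Xmx3276m";
SET mapreduce.reduce.memory.mb=4096;
SET mapreduce.reduce.java.opts="-Xmx3276m";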

Hive UDFs – Simple and Generic UDFs

Hive UDFs: These are regular user-defined functions that operate row-wise and output one result per row, such as most built-in mathematical and string functions. Ex:
SELECT LOWER(str) FROM table_name;
SELECT CONCAT(column1, column2) AS x FROM table_name;
There are two ways of writing UDFs:
Simple – extend the UDF class
Generic – extend the GenericUDF class
In… Read More »
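Whichever style the UDF is written in, wiring it into a Hive session looks the same. A minimal sketch, where the jar path, function name, and class name are placeholders rather than anything from the post:

-- Placeholders: adjust the jar path and fully qualified class name.
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
SELECT my_lower('HELLO');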

Hive UDF with testNG test case – concatenate two strings

Hive UDF class

package org.puneetha.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.hadoop.hive.ql.exec.Description;

/***
 * @author Puneetha
 */
@Description(name = "udf_concat"
    , value = "_FUNC_(STRING, STRING) - RETURN_TYPE(STRING)\n"
        + "Description: Concatenate two strings, separated by spaces"
    , extended = "Example:\n"
        + " > SELECT udf_concat('hello','world') FROM src;\n"
        +… Read More »
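The excerpt stops before the test class appears. As a hedged sketch only, a testNG case for such a UDF might look like the following; UdfConcat and its evaluate() signature are assumptions based on the @Description above, not the post's actual code:

package org.puneetha.hive.udf;

import org.apache.hadoop.io.Text;
import org.testng.Assert;
import org.testng.annotations.Test;

public class UdfConcatTest {
    @Test
    public void testConcatTwoStrings() {
        // UdfConcat is a hypothetical class name; the post's real name is not
        // visible in the excerpt. A simple UDF exposes an evaluate() method.
        UdfConcat udf = new UdfConcat();
        Text result = udf.evaluate(new Text("hello"), new Text("world"));
        // The @Description says the two strings are separated by a space.
        Assert.assertEquals(result.toString(), "hello world");
    }
}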

Hive Commands

Run a Hive one-shot command in the background:
$nohup hive -f sample.hql > output1.out 2>&1 &
$nohup hive --database "default" -e "select * from tablename;" > output1.out 2>&1 &
Replace the delimiter in Hive output from the default delimiter to the character you wish (in this example I am replacing it with a comma (,)):
hive --database "database_name" -f… Read More »
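The delimiter example is truncated; a hedged shell sketch of the general approach (the Hive CLI writes tab-separated rows to stdout, so a pipe through sed can rewrite the delimiter; the table and file names are placeholders):

$hive --database "default" -e "select * from tablename;" | sed 's/\t/,/g' > output1.csv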

Query Hive table from Pig – using HCatalog

Querying a Hive table from Pig using HCatalog:

$cat hcatScript.pig
A = LOAD 'db1.tablename' USING org.apache.hcatalog.pig.HCatLoader();
B = LIMIT A 10;
dump B;

$pig -useHCatalog -f hcatScript.pig

Comment below if you find this blog useful.
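One caveat worth adding: after HCatalog was folded into the Hive project (around Hive 0.12), the loader moved packages, so on newer clusters the LOAD line becomes:

A = LOAD 'db1.tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();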