Hive – big data – big problems

2017-07-26 00:32:04,676 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.String.substring(String.java:1933)
    at java.io.File.getName(File.java:456)
    at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:243)
    at java.io.File.isDirectory(File.java:849)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList(ProcfsBasedProcessTree.java:511)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:210)
    at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:894)
    at org.apache.hadoop.mapred.Task.updateCounters(Task.java:1045)
    at org.apache.hadoop.mapred.Task.access$500(Task.java:82)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:782)
    at java.lang.Thread.run(Thread.java:745)

Tracking YARN logs

Create a script to get YARN logs:

    $ vim hadoop_logs.sh

    #!/bin/bash
    # Fetch YARN logs for a whole application (1 argument)
    # or for a single container on a given node (3 arguments).
    APPLICATION_ID=$1
    CONTAINER_ID=$2
    NODE_ADDRESS=$3

    if [ $# -eq 1 ]; then
        yarn logs -applicationId "${APPLICATION_ID}"
    elif [ $# -eq 3 ]; then
        yarn logs -applicationId "${APPLICATION_ID}" -containerId "${CONTAINER_ID}" -nodeAddress "${NODE_ADDRESS}"
    else
        echo "you must specify 1 or 3 arguments <hlogs applicationId containerId nodeAddress>"
    fi

Create a… Read More »
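The argument handling in the script above can also be sketched in Python. This is only an illustration and not part of the original post: the function name `build_yarn_logs_command` and the sample IDs in the usage note are hypothetical, and the sketch only builds the argument list, using the same `yarn logs` flags as the shell script.

```python
from typing import List


def build_yarn_logs_command(args: List[str]) -> List[str]:
    """Build the `yarn logs` argument list from 1 or 3 positional arguments."""
    if len(args) == 1:
        # Only an application ID: fetch logs for the whole application.
        (application_id,) = args
        return ["yarn", "logs", "-applicationId", application_id]
    if len(args) == 3:
        # Application ID, container ID and node address: fetch one container's logs.
        application_id, container_id, node_address = args
        return [
            "yarn", "logs",
            "-applicationId", application_id,
            "-containerId", container_id,
            "-nodeAddress", node_address,
        ]
    raise ValueError(
        "you must specify 1 or 3 arguments: applicationId [containerId nodeAddress]"
    )
```

On a host where the `yarn` CLI is on the PATH, the returned list could be passed to `subprocess.run(cmd, check=True)`, for example with a made-up ID such as `application_1500000000000_0001`.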

Search for a pattern in HDFS files – python script

Problem: Search for a pattern in HDFS files and return the names of the files that contain it. For example, below are our input files:

    $ vim log1.out
    [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
    [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
    [Wed Oct 11… Read More »
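Since the excerpt stops before the script itself, here is a minimal sketch of one way to approach the problem; it is not the post's script. The function names `files_matching` and `read_hdfs_lines` are hypothetical, and the reader helper assumes the `hdfs` CLI is on the PATH and that the files are plain text.

```python
import re
import subprocess
from typing import Dict, Iterable, List


def files_matching(pattern: str, files: Dict[str, Iterable[str]]) -> List[str]:
    """Return the names of files with at least one line matching the regex."""
    regex = re.compile(pattern)
    return [
        name
        for name, lines in files.items()
        if any(regex.search(line) for line in lines)
    ]


def read_hdfs_lines(path: str) -> List[str]:
    """Read one HDFS text file by shelling out to `hdfs dfs -cat`."""
    result = subprocess.run(
        ["hdfs", "dfs", "-cat", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Keeping the matching logic separate from the HDFS access means it can be tried on in-memory lines first, and then fed with `{path: read_hdfs_lines(path) for path in paths}` on a cluster node.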

Spark quick commands – Scala

Save a DataFrame to HDFS with a custom delimiter in Spark:

    import spark.sql

    val df = sql(""" select * from test_db.test_table1 """)

    // Write pipe-delimited CSV, partitioned by year and month, replacing any existing output
    df.write
      .format("csv")
      .partitionBy("year", "month")
      .mode("overwrite")
      .option("delimiter", "|")
      .save("/user/cloudera/project/workspace/test/test_table1")
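To make the effect of `partitionBy` and the `delimiter` option more concrete, here is a rough plain-Python model of the resulting layout. It is not Spark code: the function name `partitioned_csv` and the sample column names are made up, and real Spark output places generated part files inside each partition directory. What it illustrates is that the partition columns end up in the directory names rather than in the file contents.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Sequence


def partitioned_csv(
    rows: List[Dict[str, object]],
    partition_cols: Sequence[str],
    delimiter: str = "|",
) -> Dict[str, str]:
    """Group rows by partition columns; return {relative_dir: delimited text}."""
    buckets = defaultdict(list)
    for row in rows:
        # Partition values become directory names such as year=2017/month=7.
        directory = "/".join(f"{col}={row[col]}" for col in partition_cols)
        # The remaining columns are what gets written into the data files.
        buckets[directory].append(
            [value for key, value in row.items() if key not in partition_cols]
        )
    output = {}
    for directory, records in buckets.items():
        buffer = io.StringIO()
        csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerows(records)
        output[directory] = buffer.getvalue()
    return output
```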

Hive UDFs – Simple and Generic UDFs

Hive UDFs: These are regular user-defined functions that operate row-wise and produce one result per input row, like most built-in mathematical and string functions.

Ex:
    SELECT LOWER(str) FROM table_name;
    SELECT CONCAT(column1,column2) AS x FROM table_name;

There are 2 ways of writing UDFs:
    Simple – extend the UDF class
    Generic – extend the GenericUDF class

In… Read More »
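The "one result per row" idea can be modelled in a few lines of plain Python. This is a toy illustration rather than Hive code: `lower`, `concat` and `select` are hypothetical stand-ins for `LOWER`, `CONCAT` and a `SELECT` over a table, with `None` playing the role of NULL (as far as I know, both built-ins return NULL when given a NULL argument).

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


def lower(value: Optional[str]) -> Optional[str]:
    """Stand-in for LOWER(str): NULL in, NULL out."""
    return None if value is None else value.lower()


def concat(*values: Optional[str]) -> Optional[str]:
    """Stand-in for CONCAT(a, b, ...): any NULL argument yields NULL."""
    if any(value is None for value in values):
        return None
    return "".join(values)


def select(rows: List[Row], expr: Callable[[Row], Optional[str]]) -> List[Optional[str]]:
    """Apply a row-wise expression: exactly one output value per input row."""
    return [expr(row) for row in rows]
```

The output list always has the same length as the input, which is the property that distinguishes this kind of UDF from aggregate or table-generating functions.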

Hive Beeline cheatsheet

Beeline Shell Commands

Command | Description | Example
!help | Print a summary of command usage |
!quit | Exits the Beeline client. |
!history | Display the command history |
!run <sql_query_file> | Run SQL query from file | !run /user/dummy_local_user/myquery1.sql
set | Prints a list of configuration variables that are overridden by the user or Hive. |
set -v | Prints all Hadoop and Hive configuration… Read More »

Pig UDF with TestNG test case – concatenate two strings

Pig UDF class

    package org.puneetha.pig.udf;

    import java.io.IOException;

    import org.apache.log4j.Logger;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    /***
     *
     *
     * @author Puneetha
     *
     */
    public final class ConcatStrPig extends EvalFunc<String> {
        // Name the logger after this class; getStackTrace()[0] would typically resolve to java.lang.Thread
        private static final Logger logger = Logger.getLogger(ConcatStrPig.class);

        @Override
        public String exec(final Tuple input) throws IOException {
            logger.debug("Tuple=" + input.toString());
            String separator = " ";
            StringBuilder result = new… Read More »

Category: Pig

Hive UDF with TestNG test case – concatenate two strings

Hive UDF class

    package org.puneetha.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.UDFType;
    import org.apache.hadoop.io.Text;
    import org.apache.log4j.Logger;
    import org.apache.hadoop.hive.ql.exec.Description;

    /***
     *
     *
     * @author Puneetha
     *
     */
    @Description(name = "udf_concat"
        , value = "_FUNC_(STRING, STRING) - RETURN_TYPE(STRING)\n"
            + "Description: Concatenate two strings, separated by spaces"
        , extended = "Example:\n"
            + " > SELECT udf_concat('hello','world') FROM src;\n"
            +… Read More »
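The behaviour described in the `@Description` text (two strings joined by a space) can be written down as a small Python reference model with test-style checks, which may help when deciding what the TestNG cases should cover. This is a sketch, not the post's implementation: the excerpt is cut off before the method body, so the NULL handling here (a NULL argument gives a NULL result) is an assumption.

```python
from typing import Optional


def udf_concat(
    first: Optional[str],
    second: Optional[str],
    separator: str = " ",
) -> Optional[str]:
    """Reference model: join two strings with a single-space separator."""
    # Assumption: a NULL (None) argument yields NULL; the truncated source does not show this.
    if first is None or second is None:
        return None
    return f"{first}{separator}{second}"


def run_checks() -> None:
    """Test-style checks mirroring the example in the description text."""
    assert udf_concat("hello", "world") == "hello world"
    assert udf_concat("", "world") == " world"
    assert udf_concat(None, "world") is None


run_checks()
```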