Featured Article

Configure Hadoop Security with Cloudera Manager 5 or later – using Kerberos

If you are using a Cloudera Manager version earlier than 5, check out the other blog here. Kerberos is a network authentication protocol created by MIT that uses symmetric-key cryptography to authenticate users to network services, which means passwords are never actually sent over the network. Rather than authenticating each user to each network service separately as… Read More »
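For reference, a minimal sketch of how a client typically obtains a Kerberos ticket from a keytab before talking to Hadoop services, which is why the password itself never crosses the network. The keytab path and principal below are hypothetical placeholders, not values from the post:

# Sketch: obtain a Kerberos ticket-granting ticket (TGT) from a keytab,
# then verify the credential cache. Paths and principal are hypothetical.
import subprocess

KEYTAB = "/home/etl_user/etl_user.keytab"    # hypothetical keytab location
PRINCIPAL = "etl_user@EXAMPLE.COM"           # hypothetical Kerberos principal

def kinit(keytab: str, principal: str) -> None:
    """Request a TGT from the KDC using the keytab (no password typed or sent)."""
    subprocess.run(["kinit", "-kt", keytab, principal], check=True)

def show_tickets() -> None:
    """List the tickets currently held in the credential cache."""
    subprocess.run(["klist"], check=True)

if __name__ == "__main__":
    kinit(KEYTAB, PRINCIPAL)
    show_tickets()
    # After this, commands such as `hdfs dfs -ls /` run as the authenticated principal.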

Pyspark – getting started – useful stuff

Example to create dataframe from pyspark import SparkConf, SparkContext from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() sc = spark.sparkContext def create_dataframe(): """ Example to create dataframe """ headers = ("id" , "name") data = [ (1, "puneetha") ,(2, "bhoomika") ] df = spark.createDataFrame(data, headers) df.show(1, False) # Output: # |id |name | # +---+--------+ #… Read More »
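The excerpt above is cut off, so here is a self-contained version of the same createDataFrame example; the remainder of the expected output is filled in as an assumption of what df.show(1, False) prints for this data:

# Self-contained version of the createDataFrame example from the excerpt above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def create_dataframe():
    """Example to create a dataframe from a list of tuples."""
    headers = ("id", "name")
    data = [(1, "puneetha"), (2, "bhoomika")]
    df = spark.createDataFrame(data, headers)
    df.show(1, False)
    # Expected output (first row only, not truncated):
    # +---+--------+
    # |id |name    |
    # +---+--------+
    # |1  |puneetha|
    # +---+--------+
    # only showing top 1 row

if __name__ == "__main__":
    create_dataframe()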

sqoop queries – examples

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It is an open source Apache project that exchanges data between a database and HDFS. It can import all tables, single tables, or even partial tables with free-form SQL queries into HDFS. Data can be imported in… Read More »
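As a rough illustration of the kinds of imports described above, here is a sketch that drives sqoop import from Python; the connection string, credentials, table names, and target paths are hypothetical placeholders:

# Sketch: whole-table and free-form-query Sqoop imports driven from Python.
import subprocess

def sqoop_import_table():
    """Import a whole table from a relational database into HDFS."""
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",      # hypothetical database
        "--username", "etl_user",
        "--password-file", "/user/etl_user/.db_password",   # keep the password off the command line
        "--table", "orders",
        "--target-dir", "/user/cloudera/sqoop/orders",
        "--num-mappers", "4",
    ], check=True)

def sqoop_import_free_form():
    """Import a partial table using a free-form SQL query ($CONDITIONS is required by Sqoop)."""
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl_user/.db_password",
        "--query", "SELECT id, name, amount FROM orders WHERE amount > 100 AND $CONDITIONS",
        "--split-by", "id",
        "--target-dir", "/user/cloudera/sqoop/orders_over_100",
    ], check=True)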

Hive – testing queries with dummy data

If your query looks like “SELECT * FROM TABLE1;” and you want to test the input from “TABLE1” with your own dummy dataset, this comes in very handy, especially if you have multiple subqueries using a base table. -- Creating a single dummy row: SELECT * FROM ( -- This is our dummy row, which is a replacement of… Read More »
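A complete version of the dummy-row substitution might look like the sketch below. It is run here through Spark SQL so the example is self-contained; the post applies the same idea directly in Hive. Table and column names are illustrative:

# Sketch of the "dummy data" trick: swap the real TABLE1 for an inline subquery
# of hand-written rows, so the rest of the query can be tested without real tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = """
SELECT *
FROM (
  -- This is our dummy row, a stand-in for the real TABLE1
  SELECT 1 AS id, 'puneetha' AS name
  UNION ALL
  SELECT 2 AS id, 'bhoomika' AS name
) TABLE1
WHERE id = 1
"""

spark.sql(query).show()
# Returns the single dummy row with id = 1 and name = 'puneetha'.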

Hive – Optimization

To set user timezone: SET mapreduce.map.java.opts="-Duser.timezone=UTC"; SET mapreduce.reduce.java.opts="-Duser.timezone=UTC"; Compress results -- Determines whether the output of the final map/reduce job in a query is compressed or not. SET hive.exec.compress.output=true; -- Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. SET hive.exec.compress.intermediate=true; Avro settings – Compression -- Supported codecs… Read More »
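One way to apply such per-session settings, sketched here as an assumption rather than taken from the post, is to prefix the SET statements to the query and submit the whole script through beeline; the JDBC URL and query are hypothetical:

# Sketch: run a Hive query with session-level settings prepended, via beeline.
import subprocess

SETTINGS = [
    'SET mapreduce.map.java.opts="-Duser.timezone=UTC";',
    'SET mapreduce.reduce.java.opts="-Duser.timezone=UTC";',
    "SET hive.exec.compress.output=true;",
    "SET hive.exec.compress.intermediate=true;",
]

QUERY = "SELECT COUNT(*) FROM default.test1;"   # hypothetical query

def run_with_settings():
    """Submit the SET statements and the query as one script through beeline."""
    script = "\n".join(SETTINGS + [QUERY])
    subprocess.run(
        ["beeline", "-u", "jdbc:hive2://hiveserver2:10000/default", "-e", script],
        check=True,
    )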

Hive – Best Practices

Testing with Dummy data – Check here. Beeline doesn't honor tabs; if you are using any editor, you can replace tabs with spaces to maintain the structure and still use beeline effectively. Ex: CREATE TABLE IF NOT EXISTS default.test1 (id<tab>INT,name STRING); -- this will fail; Hive will throw an error saying "Error: Error while compiling… Read More »
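A small sketch of the tab-replacement idea, assuming the DDL lives in an .hql file and is submitted through beeline; the file name and JDBC URL are hypothetical:

# Sketch: strip tabs from an HQL file before handing the statement to beeline,
# so the DDL keeps its structure without tripping on tab characters.
import subprocess

def submit_without_tabs(hql_path: str, jdbc_url: str) -> None:
    """Read an HQL file, swap tabs for spaces, and submit it via beeline -e."""
    with open(hql_path) as f:
        statement = f.read().replace("\t", " ")
    subprocess.run(["beeline", "-u", jdbc_url, "-e", statement], check=True)

# Example usage (hypothetical file and URL):
# submit_without_tabs("create_test1.hql", "jdbc:hive2://hiveserver2:10000/default")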

Hive – big data – big problems

2017-07-26 00:32:04,676 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.String.substring(String.java:1933)
    at java.io.File.getName(File.java:456)
    at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:243)
    at java.io.File.isDirectory(File.java:849)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.getProcessList(ProcfsBasedProcessTree.java:511)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:210)
    at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:894)
    at org.apache.hadoop.mapred.Task.updateCounters(Task.java:1045)
    at org.apache.hadoop.mapred.Task.access$500(Task.java:82)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:782)
    at java.lang.Thread.run(Thread.java:745)

Tracking YARN logs

Create script to get yarn logs $ vim hadoop_logs.sh #!/bin/bash APPLICATION_ID=$1 CONTAINER_ID=$2 NODE_ADDRESS=$3 if [ $# -eq 1 ]; then yarn logs -applicationId ${APPLICATION_ID} elif [ $# -eq 3 ]; then yarn logs -applicationId ${APPLICATION_ID} -containerId ${CONTAINER_ID} -nodeAddress ${NODE_ADDRESS} else echo "you must specify 1 or 3 arguments <hlogs applicationId containerId nodeAddress>" fi Create a… Read More »

Search for a pattern in HDFS files – python script

Problem: Search a pattern in HDFS files and return the filename which contains this pattern. For example, below are our input files: $vim log1.out [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test [Wed Oct 11… Read More »
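One possible shape for such a script, sketched as an assumption (the post's actual implementation may differ), is to list the directory with hdfs dfs -ls, stream each file with hdfs dfs -cat, and collect the paths whose contents contain the pattern; the directory and pattern below are hypothetical:

# Sketch: report which HDFS files under a directory contain a given pattern.
import subprocess

def hdfs_list_files(directory: str):
    """Return the file paths directly under an HDFS directory."""
    out = subprocess.run(["hdfs", "dfs", "-ls", directory],
                         capture_output=True, text=True, check=True).stdout
    # File lines start with a permission string like "-rw-r--r--"; the path is the last column.
    return [line.split()[-1] for line in out.splitlines() if line.startswith("-")]

def files_containing(pattern: str, directory: str):
    """Return the HDFS files under `directory` whose contents contain `pattern`."""
    matches = []
    for path in hdfs_list_files(directory):
        content = subprocess.run(["hdfs", "dfs", "-cat", path],
                                 capture_output=True, text=True).stdout
        if pattern in content:
            matches.append(path)
    return matches

# Example usage (hypothetical path):
# print(files_containing("client denied by server configuration", "/user/cloudera/logs"))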

Spark quick commands – Scala

Save file to HDFS with custom delimiter in Spark:
import spark.sql
val df = sql(""" select * from test_db.test_table1 """)
df.write.format("csv")
  .partitionBy("year", "month")
  .mode("overwrite")
  .option("delimiter", "|")
  .save("/user/cloudera/project/workspace/test/test_table1")