Featured Article

Configure Hadoop Security with Cloudera Manager 5 or later – using Kerberos

If you are using a Cloudera Manager version older than 5, check out the other blog here. Kerberos is a network authentication protocol created by MIT that uses symmetric-key cryptography to authenticate users to network services, which means passwords are never actually sent over the network. Rather than authenticating each user to each network service separately as… Read More »

PIG – general stuff

Adding a jar

REGISTER /local/path/to/myjar_name.jar

Set queue name

Specify below in the pig script:
SET mapreduce.job.queuename 'my_queuename';
(or) specify while running the PIG script:
$ pig -Dmapreduce.job.queuename=my_queuename -f my_script.pig

Set job name

Specify below in the pig script:
SET mapreduce.job.name 'Testing HCatalog';
(or) specify while running the PIG script:
$ pig -Dmapreduce.job.name="Testing HCatalog" -f my_script.pig
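A minimal end-to-end sketch combining the settings above in one script (the jar path, queue name, and input path are hypothetical):

-- hypothetical my_script.pig: register a jar, pin queue and job name, then load and dump
REGISTER /local/path/to/myjar_name.jar;
SET mapreduce.job.queuename 'my_queuename';
SET mapreduce.job.name 'Testing HCatalog';
data = LOAD '/user/hypothetical/input' USING PigStorage(',') AS (id:int, name:chararray);
DUMP data;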


Hive – Timezone problem

Timezone problem – Any function which triggers a MapReduce job causes this problem, since it takes the local timezone of the machine where the mapper/reducer runs. In our case, let's say our servers are in the German timezone, i.e. CET.

-- With original settings
SET system:user.country;
+-------------------------+--+
|           set           |
+-------------------------+--+
| system:user.country=GB  |
+-------------------------+--+
-- Original… Read More »
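A quick way to observe the effect, assuming a Hive version that allows SELECT without a FROM clause: from_unixtime renders the epoch in the JVM's local timezone, so the result shifts with wherever the task happens to run.

-- run before and after fixing the timezone settings and compare
SELECT from_unixtime(unix_timestamp());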

Search for a file in HDFS using Solr Find tool

HdfsFindTool is essentially the HDFS version of the Linux file system find command. The command walks one or more HDFS directory trees, finds all HDFS files that match the specified expression, and applies selected actions to them. By default, it prints the list of matching HDFS file paths to stdout, one path per line. Search… Read More »
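HdfsFindTool ships with Cloudera Search; a hedged invocation sketch (the jar location is the usual CDH parcel path, and the directory and pattern are hypothetical):

$ hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool -find /user/hypothetical/logs -type f -name '*.csv'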

Solr Installation and create new collection – standalone

Note: I am running this on Windows.

Download Solr

Download Solr from here. I have downloaded solr-7.0.1: http://mirrors.whoishostingthis.com/apache/lucene/solr/7.0.1/solr-7.0.1.zip
For this example, we will extract it to the folder C:\Users\Public\hadoop_ecosystem\solr-7.0.1

Start Solr

Open a command prompt and type the commands below:
> c:
> cd C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin
C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin> solr start -p 8983

Output:
Waiting up to 30 to see… Read More »
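Once Solr is up, a core/collection can be created from the same bin directory; a minimal sketch (in standalone mode this creates a core; the name my_collection is hypothetical):

C:\Users\Public\hadoop_ecosystem\solr-7.0.1\bin> solr create -c my_collection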

PySpark – dev set up – Eclipse – Windows

For our example purposes, we will set up Spark in the location: C:\Users\Public\Spark_Dev_set_up
Note: I am running Eclipse Neon

Prerequisites:
Python 3.5
JRE 8
JDK 1.8
Eclipse plugins: PyDev

Steps to set up:
Download from here: https://spark.apache.org/downloads.html
1. Choose a Spark release: 2.1.0
2. Choose a package type: Pre-built for Apache Hadoop 2.6
3. Download below… Read More »
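When the pieces are in place, a quick smoke test from PyDev confirms the set-up; a minimal sketch, assuming SPARK_HOME points at the extracted Spark 2.1.0 and the PyDev interpreter's PYTHONPATH includes %SPARK_HOME%\python plus the bundled py4j zip:

# smoke test: build a local SparkSession and count a tiny range
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("setup_check").getOrCreate()
print(spark.range(5).count())  # prints 5 if everything is wired up
spark.stop()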

Pyspark – getting started – useful stuff

Example to create dataframe:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def create_dataframe():
    """ Example to create dataframe """
    headers = ("id", "name")
    data = [(1, "puneetha"), (2, "bhoomika")]
    df = spark.createDataFrame(data, headers)
    df.show(1, False)
    # Output:
    # |id |name |
    # +---+--------+
    # …

Read More »

sqoop queries – examples

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Open source Apache project that exchanges data between a database and HDFS
Can import all tables, single tables, or even partial tables with free-form SQL queries into HDFS
Data can be imported in… Read More »
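A hedged example of a single-table import (the connection string, credentials, and paths are hypothetical):

$ sqoop import \
    --connect jdbc:mysql://dbhost/mydb \
    --username myuser -P \
    --table mytable \
    --target-dir /user/myuser/mytable \
    -m 1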

Hive – testing queries with dummy data

Say your query looks like "SELECT * FROM TABLE1;" and you want to test the input from "TABLE1" with a dummy dataset. If you have multiple subqueries using a base table, this comes in very handy.

-- Creating single dummy row:
SELECT * FROM (
-- This is our dummy row, which is a replacement of… Read More »
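The excerpt is cut off, but the idea can be sketched end to end; a minimal example, assuming TABLE1 has columns id and name (both hypothetical). Aliasing the subquery as TABLE1 means the outer query runs unchanged against the dummy row:

-- hypothetical columns; the alias stands in for the real table
SELECT * FROM (
  SELECT 1 AS id, 'dummy_name' AS name
) TABLE1;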

Hive – Optimization

To set user timezone:
SET mapreduce.map.java.opts="-Duser.timezone=UTC";
SET mapreduce.reduce.java.opts="-Duser.timezone=UTC";

Compress results
-- Determines whether the output of the final map/reduce job in a query is compressed or not.
SET hive.exec.compress.output=true;
-- Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not.
SET hive.exec.compress.intermediate=true;

Avro settings – Compression
-- Supported codecs… Read More »
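The excerpt cuts off at the codec list; a hedged sketch of the Avro compression settings this section appears to build toward (the codec name is an assumption, snappy and deflate are the commonly supported choices):

-- assumed continuation: compress Avro output with snappy
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;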

Hive – Best Practices

Testing with dummy data – check here. Beeline doesn't honor tabs; if you are using an editor, you can replace tabs with spaces to maintain the structure and still use Beeline effectively. Ex:

CREATE TABLE IF NOT EXISTS default.test1 (id<tab>INT,name STRING); -- this will fail

Hive will throw an error saying "Error: Error while compiling… Read More »
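For contrast, the same statement with the tab replaced by a space compiles cleanly:

CREATE TABLE IF NOT EXISTS default.test1 (id INT, name STRING);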