Hive – Optimization

August 8, 2017
  1. To set user timezone:
    SET mapreduce.map.java.opts="-Duser.timezone=UTC";
    SET mapreduce.reduce.java.opts="-Duser.timezone=UTC";
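With the JVM timezone forced to UTC, time functions that depend on the local zone behave identically on every node. A minimal sketch of the effect (the table `events` and its epoch-seconds column `ts` are hypothetical):

```sql
-- Hypothetical table `events` with a BIGINT epoch-seconds column `ts`.
-- from_unixtime() renders the epoch in the JVM's timezone, so forcing
-- -Duser.timezone=UTC makes this output consistent across all tasks.
SELECT from_unixtime(ts, 'yyyy-MM-dd HH:mm:ss') AS event_time_utc
FROM events;
```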
    
  2. Compress results
    -- Determines whether the output of the final map/reduce job in a query is compressed or not.
    SET hive.exec.compress.output=true;
        
    -- Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not.
    SET hive.exec.compress.intermediate=true;
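These flags only turn compression on; the codec itself comes from the underlying Hadoop properties. A hedged sketch of a common pairing, assuming MRv2 property names and a cluster with the Snappy native libraries installed:

```sql
-- Fast codec for intermediate (shuffle) data.
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Higher-ratio codec for the final job output.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
```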
    
  3. Avro settings – Compression
    -- Supported codecs are snappy and deflate.
    SET avro.output.codec=snappy;
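The codec takes effect when writing into Avro-backed tables. A minimal sketch (the table name is hypothetical; `STORED AS AVRO` requires Hive 0.14+):

```sql
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

-- Hypothetical Avro-backed table; rows inserted now are Snappy-compressed.
CREATE TABLE logs_avro (id BIGINT, msg STRING)
STORED AS AVRO;
```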
    
  4. Parquet settings – Compression
    SET parquet.compression=snappy;
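The compression can also be pinned per table instead of per session. A sketch, with a hypothetical table name:

```sql
-- Per-table alternative to the session-level SET; the TBLPROPERTIES key
-- is read by Hive's Parquet writer when this table is written.
CREATE TABLE logs_parquet (id BIGINT, msg STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');
```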
  5. Hive – Hint – STREAMTABLE
      
    -- In every map/reduce stage of the join, the table to be streamed can be specified via a hint.  
    SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

    All the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables b and c are buffered in the memory in the reducers. Then for each row retrieved from a, the join is computed with the buffered rows. If the STREAMTABLE hint is omitted, Hive streams the rightmost table in the join.
    Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

  6. Joins – Mapjoin
    Mapjoin – Hive can automatically convert a common join into a map join based on the input file size. To prevent map joins for large tables:

    SET hive.auto.convert.join=false;
    
    Alternatively, force a map join by specifying the "MAPJOIN" hint:
    SELECT /*+ MAPJOIN(s) */ idOne, idTwo, value 
        FROM bigTable  b
        JOIN smallTableOne  s
        ON (b.idOne = s.idTwo) 
    
    -- Optimize Chains of Map Joins
    select /*+ MAPJOIN(time_dim, date_dim) */ count(*) from
    store_sales 
    join time_dim on (ss_sold_time_sk = t_time_sk) 
    join date_dim on (ss_sold_date_sk = d_date_sk)
    where t_hour = 8 and d_year = 2002
    ;

    Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

  7. Sort Merge bucket map join
    SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    SET hive.optimize.bucketmapjoin=true; 
    SET hive.optimize.bucketmapjoin.sortedmerge=true; 
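A sort-merge bucket join only applies when both tables are bucketed and sorted on the join key, with equal (or multiple) bucket counts. A hedged sketch with hypothetical table and column names:

```sql
-- Both tables bucketed AND sorted on the join key, same bucket count,
-- so the join can stream matching buckets instead of shuffling.
CREATE TABLE orders_b (order_id BIGINT, amount DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 32 BUCKETS;

CREATE TABLE payments_b (order_id BIGINT, paid DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 32 BUCKETS;

SELECT o.order_id, o.amount, p.paid
FROM orders_b o
JOIN payments_b p ON (o.order_id = p.order_id);
```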
    
  8. Bucket mapjoin
    -- Enforce bucketing - while inserting
    SET hive.enforce.bucketing=true;
    
    -- Enforce sorting in each bucket - while inserting
    SET hive.enforce.sorting=true;
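With these flags set, an ordinary INSERT honors the bucketing and sorting declared in the table's DDL. A sketch, assuming hypothetical table names:

```sql
-- Hypothetical bucketed, sorted target table.
CREATE TABLE users_b (user_id BIGINT, name STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 16 BUCKETS;

SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Hive now launches 16 reducers automatically so the written data
-- really is bucketed and sorted as declared.
INSERT OVERWRITE TABLE users_b
SELECT user_id, name FROM users_raw;
```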
    
  9. Sort memory – The total amount of buffer memory to use while sorting files, in megabytes. By default, each merge stream gets 1 MB, which should minimize seeks.
    SET io.sort.mb=800;
    Note: a common rule of thumb is io.sort.mb ~ 10 * io.sort.factor
    
  10. Memory
    -- Shuffle memory
    SET mapreduce.reduce.shuffle.memory.limit.percent=0.65;
    
    -- Map memory
    SET mapreduce.map.java.opts=-Xmx8192m;
    
    -- Reduce memory
    SET mapreduce.reduce.java.opts=-Xmx12288m;
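On YARN, the JVM heaps above must fit inside their containers; a common rule of thumb is to size the container roughly 20–25% larger than -Xmx to leave headroom for off-heap memory. A hedged sketch (exact ratios depend on the cluster):

```sql
-- Container sizes (MB) sized above the JVM heaps set earlier.
SET mapreduce.map.memory.mb=10240;     -- map heap is -Xmx8192m
SET mapreduce.reduce.memory.mb=15360;  -- reduce heap is -Xmx12288m
```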
    
  11. Execute jobs in parallel – Whether to execute jobs in parallel. Applies to MapReduce jobs that can run in parallel, for example jobs processing different source tables before a join. As of Hive 0.14, also applies to move tasks that can run in parallel, for example moving files to insert targets during multi-insert.
    SET hive.exec.parallel=true;
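A query whose branches have no dependency on each other benefits directly. A sketch with hypothetical table names:

```sql
SET hive.exec.parallel=true;
-- Optional cap on how many stages run at once.
SET hive.exec.parallel.thread.number=8;

-- The two sides of this UNION ALL scan different tables and do not
-- depend on each other, so their MapReduce jobs can run in parallel.
SELECT id, 'web' AS src FROM web_logs
UNION ALL
SELECT id, 'app' AS src FROM app_logs;
```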
  12. Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.
    SET mapreduce.job.reduce.slowstart.completedmaps=0.7;
  13. Data skew – Skew join optimization – If your data has a highly uneven distribution, data skew can occur, in which a small number of compute nodes must handle the bulk of the computation.
    SET hive.optimize.skewjoin=true;
    -- Syntax: SET hive.skewjoin.key=<skew_key_threshold_number>;
    SET hive.skewjoin.key=100000;
    -- Skew data could also happen in GROUP BY data too.
    -- Hive will first trigger an additional MapReduce job whose map output will randomly distribute to the reducer to avoid data skew.
    SET hive.groupby.skewindata=true;
  14. Avoid shunting of results in Hue
    -- File format to use for a query's intermediate results. This takes care of results shunting in Hue
    SET hive.query.result.fileformat=SequenceFile;
    
