sqoop queries – examples

By | August 8, 2017
  • Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Open source Apache project that exchanges data between a database and HDFS
  • Can import all tables, single tables or even partial tables with free form SQL queries into HDFS
  • Data can be imported in a variety of formats
  • also be used to export data from HDFS back into standard relational databases
  • Sqoop is a client side application that imports data using Hadoop MapReduce
  • An import involves three main steps
    • Examine the table details
    • Create and submit job to cluster
    • Fetch records from table and write this data to HDFS
    • Under the hood the dataset being transferred is sliced up into different partitions using multiple connections to the database

    We have few dataset in Mysql in cloudera quickstart VM. I thought its a good source to get few examples of sqoop command around this data and import it to Hive.

    Login to mysql instance in cloudera quickstart VM with below credentials:
    username: root
    password: cloudera

    $ mysql -uroot -pcloudera
    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | cm                 |
    | firehose           |
    | hue                |
    | metastore          |
    | mysql              |
    | nav                |
    | navms              |
    | oozie              |
    | retail_db          |
    | rman               |
    | sentry             |
    +--------------------+
    12 rows in set (0.19 sec)
    
    mysql> USE retail_db;
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A
    
    Database changed
    mysql> show tables;
    +---------------------+
    | Tables_in_retail_db |
    +---------------------+
    | categories          |
    | customers           |
    | departments         |
    | order_items         |
    | orders              |
    | products            |
    +---------------------+
    6 rows in set (0.21 sec)
    
    

    For our sqoop import, we will import tables “customers” and “categories”

    -- In Hive create a database with name 'sqoop_db' to import all this data. This can be any database, 'sqoop_db' is just an example
    
    -- Creates database with name 'sqoop_db'
    CREATE DATABASE IF NOT EXISTS sqoop_db
    COMMENT "This is the database used for sqoop import"
    LOCATION "/user/cloudera/.sqoop_db"
    ;
    

    We will import from Mysql table “retail_db.customers” to Hive “sqoop_db.customers” table

    $ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
    --username "root" --password "cloudera" \
    --mapreduce-job-name "This is our first sqoop import" \
    --query "SELECT * FROM customers WHERE \$CONDITIONS" \
    --hive-import \
    --create-hive-table \
    --fields-terminated-by ',' \
    --hive-table sqoop_db.customers \
    --split-by "customer_id" \
    --target-dir "/user/cloudera/customers"
    

    We will import from Mysql table “retail_db.categories” to Hive “sqoop_db.categories” table

    # Specify limit/Boundary condition on how many rows of data to import
    $sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
    --username "root" --password "cloudera" \
    --mapreduce-job-name "This is our second sqoop import - with boundary condition" \
    --query "SELECT * FROM categories WHERE \$CONDITIONS" \
    --boundary-query "SELECT 10, 1000 FROM categories" \
    --hive-import \
    --create-hive-table \
    --fields-terminated-by ',' \
    --hive-table sqoop_db.categories \
    --split-by "customer_id" \
    --target-dir "/user/cloudera/categories"
    

    Note: Can exceed the number of connections, if you dont specify this in production set up
    -m,–num-mappers Use n map tasks to import in parallel

    Check whether all tables are are imported to Hive database:

    ## Login to hive
    $ beeline -u jdbc:hive2://quickstart.cloudera:10000/default -n cloudera -p cloudera
    beeline> USE sqoop_db;
    beeline> SHOW TABLES;
    

Leave a Reply

Your email address will not be published. Required fields are marked *