Search for a file in HDFS using Solr Find tool

October 22, 2017
HdfsFindTool is essentially the HDFS version of the Linux file system find command. The command walks one or more HDFS directory trees, finds all HDFS files that match the specified expression, and applies selected actions to them. By default, it prints the list of matching HDFS file paths to stdout, one path per line.
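
For readers who know the Linux original, the mapping is close to one-to-one. The sketch below runs a local `find` of the kind HdfsFindTool mirrors, with the HDFS analogue shown in a comment (the jar path is the CDH default used throughout this post and may differ on your cluster):

```shell
# Create a small local tree and search it with the Linux `find`
# that HdfsFindTool mirrors.
demo_dir=$(mktemp -d)
touch "${demo_dir}/a.txt" "${demo_dir}/b.log"
find "${demo_dir}" -type f -name '*.txt'

# The HDFS analogue (requires a cluster; jar path assumes CDH defaults):
#   hadoop jar /usr/lib/solr/contrib/mr/search-mr-job.jar \
#     org.apache.solr.hadoop.HdfsFindTool -find /user/cloudera -type f -name '*.txt'
```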

  1. Search for exact filenames
    For our example, let's assume our HDFS folder “/user/cloudera/test1/test1_child” contains the following files:

    $ hadoop fs -ls /user/cloudera/test1/test1_child
    Found 4 items
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:42 /user/cloudera/test1/test1_child
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:44 /user/cloudera/test1/test1_child/input1.txt
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:46 /user/cloudera/test1/test1_child/input2.txt
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:43 /user/cloudera/test1/test1_child/input3.txt
    

    We want to search for a file named “input1.txt”.

    ## Filename which matches the pattern
    $ export SOLR_JAR=/usr/lib/solr/contrib/mr/search-mr-job.jar;
    $ hdfs_find_location="/user/cloudera/test1/";
    $ search_file_name_pattern="input1.txt"
    
    $ for var in `hadoop jar ${SOLR_JAR} org.apache.solr.hadoop.HdfsFindTool \
    -find ${hdfs_find_location} -type f  -name ${search_file_name_pattern} `; do
    echo ${var}
    done
    
    Output:
    hdfs://quickstart.cloudera:8020/user/cloudera/test1/test1_child/input1.txt
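
A note on the loop style: iterating over backtick output splits on whitespace, which is fine for typical HDFS paths but breaks on paths containing spaces. A `while read` pipeline is a more robust sketch; here the hadoop invocation is replaced by a stub of ours (`mock_hdfs_find`) so the pattern can be tried without a cluster:

```shell
# Stub standing in for `hadoop jar ${SOLR_JAR} org.apache.solr.hadoop.HdfsFindTool -find ...`;
# replace it with the real invocation on a cluster.
mock_hdfs_find() {
  printf '%s\n' "hdfs://quickstart.cloudera:8020/user/cloudera/test1/test1_child/input1.txt"
}

# Read one path per line; IFS= and -r preserve each line exactly as printed.
mock_hdfs_find | while IFS= read -r var; do
  echo "${var}"
done
```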
    

  2. Search for files that do not match a pattern (one or more patterns)
    For our example, let's assume our HDFS folder “/user/cloudera/sample1” contains the following files:

    $ hadoop fs -ls /user/cloudera/sample1
    Found 4 items
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:42 /user/cloudera/sample1/_SUCCESS
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:44 /user/cloudera/sample1/input1.tsv
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:46 /user/cloudera/sample1/input2.tsv
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 06:43 /user/cloudera/sample1/logs.tsv
    

    We want to search for files that match neither “_SUCCESS” nor “logs.tsv”.

    ## Filename which does not match the pattern
    $ export SOLR_JAR=/usr/lib/solr/contrib/mr/search-mr-job.jar;
    $ hdfs_find_location="/user/cloudera/sample1"
    
    $ for var in `hadoop jar ${SOLR_JAR} org.apache.solr.hadoop.HdfsFindTool \
    -find ${hdfs_find_location} \
    -type f ! -name "_SUCCESS" -a ! -name "logs.tsv" \
    `; do
    echo ${var}
    done
    
    Output:
    hdfs://quickstart.cloudera:8020/user/cloudera/sample1
    hdfs://quickstart.cloudera:8020/user/cloudera/sample1/input1.tsv
    hdfs://quickstart.cloudera:8020/user/cloudera/sample1/input2.tsv
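
Note that the results come back as fully qualified `hdfs://host:port/...` URIs (and that the search root itself is also printed). If a later step needs the bare HDFS path, shell parameter expansion can strip the scheme and authority; a sketch, pure shell so it runs without a cluster:

```shell
# Strip the hdfs://host:port prefix from a fully qualified URI.
full_uri="hdfs://quickstart.cloudera:8020/user/cloudera/sample1/input1.tsv"

# ${full_uri#hdfs://*/} removes the shortest leading match of hdfs://<authority>/,
# leaving the path without its leading slash, which we add back.
bare_path="/${full_uri#hdfs://*/}"
echo "${bare_path}"
# → /user/cloudera/sample1/input1.tsv
```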
    

  3. Perform a set of operations on matched files, excluding the parent directory
    For our example, let's assume our HDFS folder “/user/cloudera/sample2” contains the following files.
    We want to find the files that do not match “part-m-*” and copy them to a new location.

    $ hadoop fs -ls /user/cloudera/sample2
    Found 5 items
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 10:19 /user/cloudera/sample2/corrupted_file
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 10:15 /user/cloudera/sample2/part-m-00000
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 10:16 /user/cloudera/sample2/part-m-00001
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 10:17 /user/cloudera/sample2/part-r-00000
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 10:18 /user/cloudera/sample2/part-r-00001
    

    # Script
    $ export SOLR_JAR=/usr/lib/solr/contrib/mr/search-mr-job.jar;
    
    $ hdfs_find_location="/user/cloudera/sample2"
    $ source_parent_directory="`basename ${hdfs_find_location}`";
    $ hdfs_destination=/user/cloudera/sample3/
    
    $ echo "This is a parent: ${source_parent_directory}"; 
    $ echo "This is the destination: ${hdfs_destination}"; 
    
    $ for source_to_be_copied in `hadoop jar ${SOLR_JAR} org.apache.solr.hadoop.HdfsFindTool \
    -find ${hdfs_find_location} -type f  ! -name 'part-m-*'` ; do
    ### Place your command below
    result_basename=`basename ${source_to_be_copied}`;
    if [ "${result_basename}" == "${source_parent_directory}" ];
    then
        echo "This is a parent. Will not be copied: ${source_to_be_copied}";
    else
        # Execute any command here.
        # In this example, we copy the files that did not match the pattern to a new location.
        echo "Copying ${source_to_be_copied}";
        hadoop fs -cp ${source_to_be_copied} ${hdfs_destination}
    fi
    done
    
    Output:
    This is a parent. Will not be copied: hdfs://quickstart.cloudera:8020/user/cloudera/sample2
    Copying hdfs://quickstart.cloudera:8020/user/cloudera/sample2/corrupted_file
    Copying hdfs://quickstart.cloudera:8020/user/cloudera/sample2/part-r-00000
    Copying hdfs://quickstart.cloudera:8020/user/cloudera/sample2/part-r-00001
    
    # Checking destination location for copied files
    $ hadoop fs -ls /user/cloudera/sample3/
    Found 3 items
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 11:57 /user/cloudera/sample3/corrupted_file
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 11:59 /user/cloudera/sample3/part-r-00000
    -rw-r--r--   1 cloudera cloudera          0 2017-10-22 12:01 /user/cloudera/sample3/part-r-00001
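
The parent-skipping check above is plain shell, so it can be factored into a small function and exercised without a cluster. This is a hypothetical refactor (the function name is ours, not part of HdfsFindTool):

```shell
# Returns success (0) if the path should be copied, i.e. its basename
# differs from the basename of the search root (which HdfsFindTool also
# prints among the results).
should_copy() {
  candidate="$1"
  search_root="$2"
  [ "$(basename "${candidate}")" != "$(basename "${search_root}")" ]
}

root="/user/cloudera/sample2"
if should_copy "hdfs://quickstart.cloudera:8020/user/cloudera/sample2" "${root}"; then
  echo "copying sample2"
else
  echo "skipping parent"
fi
if should_copy "hdfs://quickstart.cloudera:8020/user/cloudera/sample2/part-r-00000" "${root}"; then
  echo "copying part-r-00000"
fi
```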
    

  4. Get command help

    $ export SOLR_JAR=/usr/lib/solr/contrib/mr/search-mr-job.jar;
    $ hadoop jar ${SOLR_JAR} org.apache.solr.hadoop.HdfsFindTool -help
    
    Output:
    Usage: hadoop fs [generic options]
    	[-find <path> ... <expression> ...]
    	[-help [cmd ...]]
    	[-usage [cmd ...]]
    
    -find <path> ... <expression> ... :
      Finds all files that match the specified expression and applies selected actions
      to them.
      
      The following primary expressions are recognised:
        -atime n
        -amin n
          Evaluates as true if the file access time subtracted from
          the start time is n days (or minutes if -amin is used).
      
        -blocks n
          Evaluates to true if the number of file blocks is n.
      
        -class classname [args ...]
          Executes the named expression class.
      
        -depth
          Always evaluates to true. Causes directory contents to be
          evaluated before the directory itself.
      
        -empty
          Evaluates as true if the file is empty or directory has no
          contents.
      
        -exec command [argument ...]
        -ok command [argument ...]
          Executes the specified Hadoop shell command with the given
          arguments. If the string {} is given as an argument then
          is replaced by the current path name.  If a {} argument is
          followed by a + character then multiple paths will be
          batched up and passed to a single execution of the command.
          A maximum of 500 paths will be passed to a single
          command. The expression evaluates to true if the command
          returns success and false if it fails.
          If -ok is specified then confirmation of each command shall be
          prompted for on STDERR prior to execution.  If the response is
          'y' or 'yes' then the command shall be executed else the command
          shall not be invoked and the expression shall return false.
      
        -group groupname
          Evaluates as true if the file belongs to the specified
          group.
      
        -mtime n
        -mmin n
          Evaluates as true if the file modification time subtracted
          from the start time is n days (or minutes if -mmin is used)
      
        -name pattern
        -iname pattern
          Evaluates as true if the basename of the file matches the
          pattern using standard file system globbing.
          If -iname is used then the match is case insensitive.
      
        -newer file
          Evaluates as true if the modification time of the current
          file is more recent than the modification time of the
          specified file.
      
        -nogroup
          Evaluates as true if the file does not have a valid group.
      
        -nouser
          Evaluates as true if the file does not have a valid owner.
      
        -perm [-]mode
        -perm [-]onum
          Evaluates as true if the file permissions match that
          specified. If the hyphen is specified then the expression
          shall evaluate as true if at least the bits specified
          match, otherwise an exact match is required.
          The mode may be specified using either symbolic notation,
          eg 'u=rwx,g+x+w' or as an octal number.
      
        -print
        -print0
          Always evaluates to true. Causes the current pathname to be
          written to standard output. If the -print0 expression is
          used then an ASCII NULL character is appended.
      
        -prune
          Always evaluates to true. Causes the find command to not
          descend any further down this directory tree. Does not
          have any affect if the -depth expression is specified.
      
        -replicas n
          Evaluates to true if the number of file replicas is n.
      
        -size n[c]
          Evaluates to true if the file size in 512 byte blocks is n.
          If n is followed by the character 'c' then the size is in bytes.
      
        -type filetype
          Evaluates to true if the file type matches that specified.
          The following file type values are supported:
          'd' (directory), 'l' (symbolic link), 'f' (regular file).
      
        -user username
          Evaluates as true if the owner of the file matches the
          specified user.
      
      The following operators are recognised:
        expression -a expression
        expression -and expression
        expression expression
          Logical AND operator for joining two expressions. Returns
          true if both child expressions return true. Implied by the
          juxtaposition of two expressions and so does not need to be
          explicitly specified. The second expression will not be
          applied if the first fails.
      
        ! expression
        -not expression
          Evaluates as true if the expression evaluates as false and
          vice-versa.
      
        expression -o expression
        expression -or expression
          Logical OR operator for joining two expressions. Returns
          true if one of the child expressions returns true. The
          second expression will not be applied if the first returns
          true.
    
    -help [cmd ...] :
      Displays help for given command or all commands if none is specified.
    
    -usage [cmd ...] :
      Displays the usage for given command or all commands if none is specified.
    
    Generic options supported are
    -conf <configuration file>     specify an application configuration file
    -D <property=value>            use value for given property
    -fs <local|namenode:port>      specify a namenode
    -jt <local|resourcemanager:port>    specify a ResourceManager
    -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
    
    The general command line syntax is
    bin/hadoop command [genericOptions] [commandOptions]
    
