ImportTSV Data from HDFS into HBase

Is it really that hard to insert data into HBase by writing scripts? For every record you would have to write a near-identical put statement to get the data into HBase, even though the same data is already present in HDFS.
But what if, by writing only a few lines, you could have that data copied into HBase? Working with HBase then becomes a lot more fun, and you get analytical results much faster than with traditional approaches. In this blog, you will see a utility that saves us from writing multiple scripts to insert data into HBase. HBase ships with a number of utilities to make our work easier, and the one we are about to see is ImportTsv: a utility that loads data in TSV format into HBase. ImportTsv takes data from HDFS and writes it into HBase via Puts. Below is the syntax used to load data via Puts (i.e., non-bulk loading):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>

In this blog, we will practice with a small sample dataset to see how data in HDFS is loaded into HBase.

Steps to Practical Execution

Before starting the TSV import, all the Hadoop and HBase daemons must be running. If Hadoop is not running, run Hadoop-X/sbin/start-all.sh and then Hadoop-X/sbin/mr-historyserver-daemon.sh. Likewise, if HMaster is not running, run Hbase/bin/start-hbase.sh.
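To make the -Dimporttsv.columns=a,b,c mapping concrete, here is a minimal sketch using only standard shell tools; the file name and values are invented for illustration:

```shell
# One TSV record with three tab-separated fields. With
# -Dimporttsv.columns=a,b,c, field 1 goes to column 'a',
# field 2 to 'b', and field 3 to 'c'.
printf '1\tAmit\t5\n' > demo.tsv

# cut splits on tabs by default, just as ImportTsv does.
cut -f1 demo.tsv   # field 1 -> column a
cut -f2 demo.tsv   # field 2 -> column b
cut -f3 demo.tsv   # field 3 -> column c
```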

Now our system is ready.

Step1:

Open the HBase shell from the terminal:

hbase shell

Step2:

Create the table with the two column families we will load data into:

create 'bulktable', 'cf1', 'cf2'

Step3:

Come out of the HBase shell to the terminal and make a directory for HBase data on the local drive (use your own path if you prefer):

mkdir -p hbase

Now move to the directory where we will keep our data:

cd hbase
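As a working sample, you can create a small tab-separated file here; the records below are invented values arranged as row key, name, and experience, matching the columns we will map shortly:

```shell
# Write three sample records; \t produces real tab characters.
printf '1\tAmit\t5\n2\tRahul\t3\n3\tPriya\t7\n' > bulk_data.tsv

# Sanity check: every line must have exactly 3 tab-separated fields,
# or ImportTsv would treat it as a bad line.
awk -F'\t' 'NF != 3 { exit 1 }' bulk_data.tsv && echo 'bulk_data.tsv looks good'
```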

Step4:

Copy the data file from the local directory into HDFS:

hdfs dfs -mkdir -p /hbase
hdfs dfs -put bulk_data.tsv /hbase

Step5:

Once the data is present in HDFS, we run the following command in the terminal, passing the <tablename> and the <path of data in HDFS> as arguments.

Command:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:exp bulktable /hbase/bulk_data.tsv

Observe that the map phase completes 100%, although we get an error afterward. For now, ignore the error message, since our task is to map the data into the HBase table. Now let us check whether we actually got the data into HBase by using the command below:

scan 'bulktable'

We see all the data present in the table, confirming that our mapping of the tab-separated values was successful.
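As a quick cross-check (a sketch assuming a local copy of the same file), the number of rows that scan reports should equal the number of distinct row keys in the source file:

```shell
# Hypothetical local copy of the file that was pushed to HDFS.
printf '1\tAmit\t5\n2\tRahul\t3\n3\tPriya\t7\n' > bulk_data.tsv

# Count distinct row keys (field 1). With unique keys this equals
# the line count, and scan should report the same number of rows.
cut -f1 bulk_data.tsv | sort -u | wc -l
```

If row keys repeat, HBase keeps only the last record written for each key, so the scan count will be lower than the line count.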

Running ImportTsv with no arguments prints brief usage information:

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family or a columnfamily:qualifier. The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record. You must specify exactly one column to be the row key, and you must specify a column name for every column that exists in the input data.
By default, importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:

-Dimporttsv.bulk.output=/path/for/output

Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:

-Dimporttsv.skip.bad.lines=false – fail if encountering an invalid line
'-Dimporttsv.separator=|' – e.g., separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong – use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper – a user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
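Since the column specification must name every field in the input, a quick pre-flight check can save a failed job. This is a sketch using made-up file contents and the column spec from this post:

```shell
# Column spec used earlier in this post.
COLUMNS='HBASE_ROW_KEY,cf1:name,cf2:exp'
printf '1\tAmit\t5\n2\tRahul\t3\n' > bulk_data.tsv

# Count the comma-separated names in the spec.
expected=$(printf '%s\n' "$COLUMNS" | awk -F',' '{ print NF }')

# Fail if any line's tab-separated field count differs from the spec.
awk -F'\t' -v n="$expected" 'NF != n { bad = 1 } END { exit bad }' bulk_data.tsv \
  && echo "field count matches column spec ($expected)"
```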
Hope this post helped you import tab-separated values data. For any queries, feel free to comment below.

Hi,
If the table gets a new row (5 Jyothi 3), how do I import only this row into HBase, rather than importing the whole table again?
I know that even if you import whole table, hbase will not create any duplicates. But, I want to import only a single/updated row.
Thanks

excellent demonstration. Just one problem in my case: importTSV command selects first column as the row_key which is not unique in my case and only last record is being updated by hbase. How can I select Nth column in my tsv as row_key?
