I'm starting with HBase and testing it for our needs. I have set up a Hadoop cluster of three machines and an HBase cluster on top of the same three machines: one master, two slaves.

I am testing the import of a 5 GB CSV file with the ImportTsv tool. I import the file into HDFS and use the ImportTsv tool to import it into HBase.

Right now it takes a little over an hour to complete. It creates around 2 million entries in one table with a single family. If I use bulk uploading it goes down to 20 minutes.

My Hadoop has 21 map tasks, but they all seem to be taking a very long time to finish; many tasks end up timing out.

I am wondering what I have missed in my configuration. I have followed the prerequisites in the documentation, but I am really unsure what is causing this slowdown. If I apply the wordcount example to the same file it takes only minutes to complete, so I am guessing the issue lies in my HBase configuration.


hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication. The actual number of replications
    can be specified when the file is created. The default is used if
    replication is not specified at create time.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/runner/app/hadoop/dfs/data</value>
    <description>Determines where on the local filesystem a DFS data node
    should store its blocks.</description>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>14</value>
    <description>The maximum number of map tasks that will be run
    simultaneously by a task tracker.</description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>14</value>
    <description>The maximum number of reduce tasks that will be run
    simultaneously by a task tracker.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <description>Java opts for the task tracker child processes. The
    following symbol, if present, will be interpolated: @taskid@ is
    replaced by the current TaskID. Any other occurrences of '@' will go
    unchanged. For example, to enable verbose GC logging to a file named
    for the taskid in /tmp and to set the heap maximum to a gigabyte,
    pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc

    The configuration variable mapred.child.ulimit can be used to control
    the maximum virtual memory of the child processes.</description>
  </property>
</configuration>

core-site.xml

Thanks, checking the schema itself is still interesting (cf. the link sent). As well, with 3 machines and a replication factor of 3, all the machines are used during a write. As HBase writes all entries into a write-ahead log for safety, the number of writes is also doubled. So maybe your machine is just dying under the load. Anyway, here your cluster is going at the speed of the least powerful machine, and this machine has a workload multiplied by 6 compared to a single-machine config (i.e. just writing a file locally).
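The x6 back-of-the-envelope above can be sketched as follows (a toy model only, assuming each entry is written twice — once to the WAL, once to the store — and each write is replicated by HDFS):

```python
# Rough write-amplification estimate for the setup described above:
# every entry is written to the WAL and to the memstore/store-file path
# (x2), and HDFS replicates each block `replication` times.
def write_amplification(replication: int, wal_enabled: bool) -> int:
    writes_per_entry = 2 if wal_enabled else 1  # WAL + store file
    return writes_per_entry * replication

print(write_amplification(3, True))  # 6, the "multiplied by 6" above
print(write_amplification(2, True))  # 4, after lowering dfs.replication to 2
```

This is only the amplification factor, not a performance prediction, but it shows why every write touches all three machines in this configuration.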

The hdfs-site.xml and mapred-site.xml quoted in the reply match the files shown above. The remaining files from the quoted message:

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/runner/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The uri's
    scheme determines the config property (fs.SCHEME.impl) naming the
    FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>

For HBase, hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:54310/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed Zookeeper;
    true: fully-distributed with unmanaged Zookeeper Quorum (see ...)
    </description>
  </property>
</configuration>

You will want to make sure your table is pre-split. Also, Import does puts, so you will want to make sure you are not flushing and blocking, by raising your memstore, HLog, and blocking count. This can greatly improve your write speeds. I usually do a 256 MB memstore (you can lower it later if it is not a heavy-writes table), a 512 MB HLog (same thing, you can lower it back to default), and then raise the store file blocking count to about 100.
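For the pre-split advice above, here is a sketch of how evenly spaced split keys could be generated, similar in spirit to what HBase's RegionSplitter does with HexStringSplit (the key width and region count are illustrative assumptions; real split keys must match your table's actual row-key format):

```python
# Generate num_regions - 1 evenly spaced hex split keys dividing the
# row-key space into num_regions regions. Keys here are fixed-width hex
# strings; adapt to your own row-key encoding.
def hex_split_keys(num_regions: int, width: int = 8) -> list:
    space = 16 ** width
    step = space // num_regions
    return ["%0*x" % (width, i * step) for i in range(1, num_regions)]

# Splitting into 4 regions yields 3 boundary keys.
print(hex_split_keys(4))  # ['40000000', '80000000', 'c0000000']
```

These keys would then be passed to the table-creation call (e.g. the SPLITS option in the hbase shell) so writes spread across region servers from the start instead of hammering one region.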


Hi. Using the ImportTSV tool you are trying to bulk load your data. Can you see and tell how many mappers and reducers there were? Out of the total time, what is the time taken by the mapper phase and by the reducer phase? Seems like an MR-related issue (maybe some conf issue). In this bulk-load case most of the work is done by the MR job. It will read the raw data, convert it into Puts, and write to HFiles. The MR output is HFiles itself. The next part in ImportTSV will just put the HFiles under the table region store. There won't be WAL usage in this bulk load.
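Conceptually, the mapper in this flow turns each TSV line into a row key plus column values before they are written out as HFile key-values. A minimal sketch of that parsing step (the separator and column mapping are assumptions, mirroring ImportTsv's -Dimporttsv.columns option, e.g. HBASE_ROW_KEY,f:a,f:b):

```python
# Parse one TSV line into (rowkey, {family:qualifier -> value}), the
# shape ImportTsv's mapper produces before emitting KeyValues.
def parse_tsv_line(line, columns, sep="\t"):
    fields = line.rstrip("\n").split(sep)
    rowkey, cells = None, {}
    for col, val in zip(columns, fields):
        if col == "HBASE_ROW_KEY":
            rowkey = val
        else:
            cells[col] = val
    return rowkey, cells

row, cells = parse_tsv_line("row1\tv1\tv2", ["HBASE_ROW_KEY", "f:a", "f:b"])
print(row, cells)  # row1 {'f:a': 'v1', 'f:b': 'v2'}
```

The real tool does this at scale and sorts the resulting key-values into HFiles, which is why the heavy lifting happens entirely inside the MR job rather than on the region servers.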



As Kevin suggested, we can make use of the load that goes through the WAL and memstore. Or the second option will be to use the output of mappers to create HFiles directly.

Regards,
Ram

Hi Anil. In the case of bulk loading it is not like data is put into HBase one by one. The MR job will create an output that is an HFile: it will create the KVs and write them to a file in order, as an HFile would look. Then the file is loaded into HBase; only for this final step is the HBase RS used. So there is no point in a WAL there. Am I making it clear for you? The data is already present in the form of raw data in some txt or csv file :)

That's a very interesting fact. You made it clear, but my custom bulk loader generates a unique ID for every row in the map phase. So, not all my data is in the csv or text file. Is there a way that I can explicitly turn on WAL for bulk loading?

Anil, when you do ImportTSV, only the data that is present in the TSV file will be parsed and loaded into HBase. How are you planning to generate the unique ID? Your use case seems to be that your data is in a CSV file but the unique ID that you need is not part of the TSV, and now you need the rows to be loaded into HBase through the WAL.

I would suggest that you first do a load of the existing TSV file into one HTable. Then from that table you can do a bulk load into another table using your custom mapper. There you can apply the logic of generating a unique ID for every row that comes out of the loaded table. That way the data is inserted into the new table through normal puts, which will use the WAL and memstore.

> Is there a way that i can explicitly turn on WAL for bulk loading?

No. How do you generate the unique ID? Remember that the initial steps won't need the HBase cluster at all. MR generates the HFiles, and the output will be in a file only; mappers also write their output to files. The only thing is that if some mapper crashes, the MR framework will run that mapper again on the same data set; will the unique ID then be different? I think you need not worry about data loss on the HBase side, so the WAL is not required.

Yes, the unique ID is not part of the csv file. In my bulk loader I use a combination of nodeId + processId + counter as the unique ID for each row. I have to use the unique ID since the remaining part of the rowkey is not unique.
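A sketch of that ID scheme (the field widths and exact composition are assumptions; the thread only says nodeId + processId + counter). Note that, per the discussion above, a re-run mapper would restart its counter, so the scheme is only safe if duplicates are tolerable or a re-run reproduces the same IDs:

```python
import itertools

# Compose a per-row unique ID from nodeId + processId + a running counter,
# as described above. Widths and separators are illustrative assumptions.
def make_id_generator(node_id: int, process_id: int):
    counter = itertools.count(1)
    return lambda: "%04d-%05d-%08d" % (node_id, process_id, next(counter))

next_id = make_id_generator(node_id=7, process_id=31245)
print(next_id())  # 0007-31245-00000001
print(next_id())  # 0007-31245-00000002
```

Because the ID depends on process state (the counter), a speculative or restarted map attempt would emit different IDs for the same input rows, which is exactly the determinism concern raised above.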

I think there are two approaches to solve this problem:
1. Generate HFiles through MR and then do an incremental load. I am fine with this approach as we will have the entire trace of the data in HFiles.
2. Use prePut observers? I am already using the prePut hook for some other purpose.

I think, as per your explanation of the need for a unique ID, it is okay; no need to worry about data loss. As long as you can make sure you produce a unique ID, things are fine. MR will make sure it runs the job on the whole data, and the output is persisted in files. Yes, these files are HFile(s) only. Then finally the HBase cluster is used for loading the HFiles into the region stores. Bulk loading huge data this way will be much, much faster than normal put()s.

Yeah, we never used the HBase client API (puts) for loading a batch of millions of records. Can you tell me where, by default, the output HFile(s) from the MR job are stored in HDFS?

I have taken my replication down to 2, but if I am not mistaken replication also has the benefit of making the cluster more fault tolerant by duplicating info on different nodes, so that if one goes down data is not necessarily lost. In that case I would like to keep it at least at 2.

I have set dfs.replication to 2 but the process time has not changed at all. How could I change my configuration to avoid this hotspot issue you talked about?

As Kevin has advised I have also upped:
hbase.hstore.blockingStoreFiles to 100
hbase.hregion.memstore.block.multiplier to 7
hbase.hregion.memstore.flush.size to 256 MB
hbase.regionserver.optionallogflushinterval to 30s

However, the ImportTsv map phase is still at around 1 minute per 1% of map tasks, so over an hour total.

Currently I have 42 running map tasks and an average of 28 tasks/node; a lot of my map tasks end up in "failed to report status for 601 seconds".

When I check the map job details there are 80 tasks to complete. As I drill down on the different map tasks in the task detail, they all take a very long time (26 minutes) to complete. A lot of them fail as well. The fail info is "failed to report status for 601 seconds", so a timeout. It does feel like an M/R-related issue.

I have tried running the hadoop wordcount example on the same 5 GB HDFS file. The point was to get a feel for something hadoop-only with no hbase involved. The process took a couple of minutes.

I guess something in the ImportTsv call through hbase hangs up the map tasks. I don't really know where to look anymore to understand. Any idea of where, how, or what to look for would be appreciated, as would any idea of a different configuration I could try.

As I have written in a reply above, but that is kind of lost in the thread:

I have set dfs.replication to 2 but the process time has not changed at all. How could I change my configuration to avoid this hotspot issue you have talked about?

As Kevin has advised I have also upped:
hbase.hstore.blockingStoreFiles to 100
hbase.hregion.memstore.block.multiplier to 7
hbase.hregion.memstore.flush.size to 256 MB
hbase.regionserver.optionallogflushinterval to 30s

These changes did not bring any significant evolution in speed.

My cluster is 3 ubuntu machines: 2 cores, 4 threads, 3.4+ GHz, with 16 GB RAM.

Thanks for everyone's help.

> Yeah, we never used HBase client api (puts) for loading a batch of millions of records. Can you tell me by default where the o/p HFile(s) from MR job are stored in HDFS?

Hi Anil,
The output HFiles are stored in the path created for the corresponding HBase table: /table_name/store_name/region_name/file_name. The location will be the same as is used when a normal flush through HBase happens.

Still looking into the issue. I have tried different tests and the results are surprising. If I set mapred.tasktracker.map.tasks.maximum: 28, I get a total of 84 task slots on my cluster and the process takes about 1h15min, each task taking up to 1h10min, the whole file being cut down into 80 tasks.

If I set mapred.tasktracker.map.tasks.maximum: 3, I get a total of 6 task slots on my cluster and the process takes about the same amount of time, 1h20, still cutting the whole file down into 80 tasks, but now of course each individual task only takes a couple of minutes.

It's like the overall ImportTsv must take an hour and something, and the duration of the map tasks varies accordingly.

There is definitely something I am doing wrong.

How many hard drives do your slaves have? RPM of those? How many mappers are run concurrently on a node? Did you turn off speculative execution? Have a look at disk I/O to see whether that is the bottleneck or not.

MR is disk-I/O bound, so if you only have one disk on a slave and you are running 5 mappers concurrently, the job will slow down.
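The constant ~1-hour total regardless of slot count is consistent with an aggregate-throughput bottleneck: if the disks (or region servers) can absorb only a fixed number of bytes per second, total job time is roughly data size over aggregate bandwidth, independent of how many mappers share that bandwidth. A toy model (all figures are illustrative assumptions, not measurements from this cluster):

```python
# Toy model: when a job is bound by aggregate disk/RS throughput, wall-clock
# time depends only on data size / total bandwidth; concurrent mappers just
# split the same bandwidth among themselves.
def job_time_seconds(data_bytes, aggregate_bw_bytes_per_s, num_mappers):
    per_mapper_bw = aggregate_bw_bytes_per_s / num_mappers
    per_mapper_data = data_bytes / num_mappers
    return per_mapper_data / per_mapper_bw  # == data_bytes / aggregate_bw

GB = 1024 ** 3
t3 = job_time_seconds(5 * GB, 1.5e6, 3)    # 3 concurrent mappers
t28 = job_time_seconds(5 * GB, 1.5e6, 28)  # 28 concurrent mappers
print(abs(t3 - t28) < 1e-6)                # same total time either way
```

This matches the observation in the thread: with more slots each task takes longer, with fewer slots each task is quick but there are more waves, and the total stays around an hour either way.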

Nick,

What versions are you using:
HDFS
HBase
OS

I have one hard drive per slave. I have tested with 3 concurrent mappers and 28 concurrent mappers per slave, and both times the total time was about 1 hour; the only difference was the time each map took, respectively 40min and 1h10min. I have turned off speculative execution.

I'll run a process tomorrow and look at disk I/O to check whether it is the bottleneck.

But the test I ran this afternoon with 3 or 28 max map tasks per node makes me doubt that. When I run 28 maps per node I can load the whole file into the available map slots in one pass, so all maps take 1h to complete and the whole process takes 1h and some minutes.

When I run with 3 maps per node, the whole file is imported through 7 full passes of the available map slots. In this case each map takes around 8-9 minutes to complete. So 7 passes times 9 minutes: the process takes about 1 hour to complete, same as before.

This situation I don't understand, and it leads me to believe I have missed a step somewhere.



I just went through the same exercise. There are many ways to get this to go faster, but eventually I decided that bulk loading is the best solution, as run times scaled with the number of machines in my cluster when I used that approach.

One thing you can try is to turn off hbase's write-ahead log (WAL). But be aware that regionserver failure will cause data loss if you do this.



> Hi Nicolas,
>
> As per my experience you won't get good performance if you run 3 map tasks
> simultaneously on one hard drive. That seems like a lot of I/O on one disk.
>
> HBase performs well when you have at least 5 nodes in the cluster, so
> running HBase on 3 nodes is not something you would do in prod.
>
> Thanks,
> Anil
>
> On Thu, Oct 25, 2012 at 8:57 AM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:
> > [quoted text snipped]
>
> --
> Thanks & Regards,
> Anil Gupta

> As per Anoop and Ram, the WAL is not used with bulk loading, so turning off
> the WAL won't have any impact on performance.
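To make the two modes being compared in this thread concrete, here is a sketch of the importTsv invocations; the table name, column mapping, and paths are hypothetical, and the exact jar name depends on the HBase version installed:

```shell
# Mode 1: importtsv writes Puts directly to the table through the region
# servers (the slow path in this thread; the WAL applies here).
hadoop jar "$HBASE_HOME"/hbase-*.jar importtsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,f:c1 \
  mytable /user/nick/input.csv

# Mode 2: importtsv writes HFiles to HDFS instead, and completebulkload
# then moves them into the table in one shot (the ~20 minute path;
# the WAL is never involved).
hadoop jar "$HBASE_HOME"/hbase-*.jar importtsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,f:c1 \
  -Dimporttsv.bulk.output=/tmp/hfiles \
  mytable /user/nick/input.csv
hadoop jar "$HBASE_HOME"/hbase-*.jar completebulkload /tmp/hfiles mytable
```

Mode 2 is why disabling the WAL makes no difference for bulk loads: the data never passes through the region servers' write path at all.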

This is if HFileOutputFormat is being used. There is also a TableOutputFormat, which can likewise be used as the OutputFormat for the MR job, and there writing to the WAL is applicable. Instead of writing to HFiles and uploading them in one shot, TableOutputFormat puts data into the HTable by calling the put() method.
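A hedged sketch of how the two output formats Anoop contrasts are wired into an MR job driver. The job and table names are illustrative; `Job.getInstance` is the Hadoop 2.x form (1.x used `new Job(conf, name)`), and in later HBase releases HFileOutputFormat was superseded by HFileOutputFormat2:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class OutputFormatChoice {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Option A: TableOutputFormat. Every record the job emits becomes a
        // Put against the live table, so region server load and the WAL
        // setting both matter for throughput.
        Job putJob = Job.getInstance(conf, "load-via-puts");
        putJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "mytable");
        putJob.setOutputFormatClass(TableOutputFormat.class);

        // Option B: HFileOutputFormat. The job writes sorted HFiles to HDFS
        // (typically set up via HFileOutputFormat.configureIncrementalLoad),
        // and completebulkload later adopts them into the table; the WAL and
        // the put() path are bypassed entirely.
    }
}
```

This is the distinction behind the earlier point: WAL tuning only affects Option A.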

