After reading the documentation and scouring the mailing list archives, I understand there is no real support for fast row counting in HBase unless you build some sort of tracking logic into your code. In our case, we do not have such logic, and have massive amounts of data already persisted. I am running into the issue of very long execution of the RowCounter MapReduce job against very large tables (multi-billion for many is our estimate). I understand why this issue exists and am slowly accepting it, but I am hoping I can solicit some possible ideas to help speed things up a little.

My current task is to provide total row counts on about 600 tables, some extremely large, some not so much. Currently, I have a process that executes the MapReduce job in process like so:
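
        Job job = RowCounter.createSubmittableJob(
                        ConfigManager.getConfiguration(), new String[]{tableName});
        boolean waitForCompletion = job.waitForCompletion(true);
        Counters counters = job.getCounters();
        Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
        return rowCounter.getValue();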

At the moment, each MapReduce job is executed in serial order, counting one table at a time. For the current implementation of this whole process, my rough timing calculations indicate that fully counting all the rows of these 600 tables will take anywhere from 11 to 22 days. This is not what I consider a desirable timeframe.

I have considered three alternative approaches to speed things up.

First, since the application is not heavily CPU bound, I could use a ThreadPool and execute multiple MapReduce jobs at the same time, looking at different tables. I have never done this, so I am unsure if this would cause any unanticipated side effects (a rough sketch of this option follows the third option below).

Second, I could distribute the processes. I could find as many machines as can successfully talk to the desired cluster properly, give them a subset of tables to work on, and then combine the results in a post-processing step.

Third, I could combine both of the above approaches and run a distributed set of multithreaded processes to execute the MapReduce jobs in parallel.
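
A rough sketch of what I mean by the first option, assuming the RowCounter snippet above is wrapped in a countRows(tableName) helper (a hypothetical name, not an existing API):

        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        public class ParallelRowCounts {

            // Submits one RowCounter MapReduce job per table, with at most
            // 'parallelism' jobs in flight at a time.
            public static Map<String, Long> countAll(List<String> tableNames,
                    int parallelism) throws Exception {
                ExecutorService pool = Executors.newFixedThreadPool(parallelism);
                try {
                    Map<String, Future<Long>> pending =
                            new LinkedHashMap<String, Future<Long>>();
                    for (final String table : tableNames) {
                        pending.put(table, pool.submit(new Callable<Long>() {
                            public Long call() throws Exception {
                                return countRows(table);
                            }
                        }));
                    }
                    Map<String, Long> counts = new LinkedHashMap<String, Long>();
                    for (Map.Entry<String, Future<Long>> e : pending.entrySet()) {
                        counts.put(e.getKey(), e.getValue().get()); // blocks per job
                    }
                    return counts;
                } finally {
                    pool.shutdown();
                }
            }

            // Hypothetical wrapper around the RowCounter snippet shown earlier.
            private static long countRows(String tableName) throws Exception {
                throw new UnsupportedOperationException("wire in RowCounter here");
            }
        }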

Although it seems to have been asked and answered many times, I will ask once again. Without changing our current configurations or restarting the clusters, is there a faster approach to obtaining row counts? FYI, my cache size for the Scan is set to 1000. I have experimented with different numbers, but nothing made a noticeable difference. Any advice or feedback would be greatly appreciated!

Thanks,
Birch

It is hitting a production cluster, but I am not really sure how to calculate the load placed on the cluster.

On Sep 20, 2013, at 3:19 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> How many nodes do you have in your cluster ?
>
> When counting rows, what other load would be placed on the cluster ?
>
> What is the HBase version you're currently using / planning to use ?
>
> Thanks

Hi James,

do you need that many tables? "Table" in HBase should have been called "KeySpace" instead. 600 is a lot.

But anyway... Did you enable scanner caching for your M/R job? (If you didn't, every next() will be a roundtrip to the RegionServer, and you end up measuring your network's RTT.) Are you IO bound? Lastly, instead of doing it as M/R (which has to bring all the data back to the mapper just to count the returned rows), you could use a coprocessor, which does the counting on the server (or use Phoenix; search back in the archives for an example that James Taylor gave for row counting).

-- Lars
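
For example, something like this on the Scan handed to the table mapper (a sketch against the 0.94-era API; the mapper class and its output types are placeholders):

        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
        import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

        Scan scan = new Scan();
        scan.setCaching(1000);       // rows per RPC instead of one next() per row
        scan.setCacheBlocks(false);  // a one-off full scan just churns the block cache
        scan.setFilter(new FirstKeyOnlyFilter());
        TableMapReduceUtil.initTableMapperJob(tableName, scan,
                MyCountMapper.class, // placeholder mapper
                ImmutableBytesWritable.class, Result.class, job);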

Agree with your first statement. I am in no way saying HBase is being used properly as a store. I am only saying my task is to determine the row counts as accurately as possible for the data and setup we currently have.

I set the scan caching to 1000. I tried 10000, but did not see much of a performance increase.

I will look further into coprocessors. Since I am relatively new to the technology, can someone provide a quick answer to this? Will using a coprocessor require me to change and restart our cluster? I am assuming it is possibly a configuration thing? If so, I will have to see if that is an option. If the answer is no, great. If yes, and it is an option for me, I will definitely take a look at this approach.

> How long does it take for RowCounter Job for largest table to finish on your cluster?
>
> Just curious.
>
> On your options:
>
> 1. Not worth it probably - you may overload your cluster
> 2. Not sure this one differs from 1. Looks the same to me but more complex.
> 3. The same as 1 and 2
>
> Counting rows in efficient way can be done if you sacrifice some accuracy:
>
> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>
> Yeah, you will need coprocessors for that.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
>
> ________________________________________
> From: James Birchfield
> Sent: Friday, September 20, 2013 3:50 PM
> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>
> Hadoop 2.0.0-cdh4.3.1
>
> HBase 0.94.6-cdh4.3.1
>
> 110 servers, 0 dead, 238.2364 average load
>
> Some other info, not sure if it helps or not.
>
> Configured Capacity: 1295277834158080 (1.15 PB)
> Present Capacity: 1224692609430678 (1.09 PB)
> DFS Remaining: 624376503857152 (567.87 TB)
> DFS Used: 600316105573526 (545.98 TB)
> DFS Used%: 49.02%
> Under replicated blocks: 0
> Blocks with corrupt replicas: 1
> Missing blocks: 0

If your cells are extremely small, try setting the caching even higher than 10k. You want to strike a balance between MBs of response size and number of calls needed, leaning towards larger response sizes as far as your system can handle (account for RS max memory, and memory available to your mappers).

You could use the KeyOnlyFilter to further limit the sizes of responses transferred.
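
For instance, stacked with the FirstKeyOnlyFilter already in use (a sketch; `scan` is the Scan handed to the job):

        import java.util.Arrays;
        import org.apache.hadoop.hbase.filter.Filter;
        import org.apache.hadoop.hbase.filter.FilterList;
        import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
        import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

        // FirstKeyOnlyFilter limits each row to its first KV; KeyOnlyFilter
        // strips the value bytes, so each row costs roughly one key on the wire.
        scan.setFilter(new FilterList(Arrays.<Filter>asList(
                new FirstKeyOnlyFilter(), new KeyOnlyFilter())));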

Another thing that may help would be increasing your block size. This would speed up sequential read but slow down random access. It would be a matter of making the config change and then running a major compaction to re-write the data.
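
A rough sketch of that change through the Java admin API (0.94-era; the family name and new size are placeholders, and note this takes the table offline briefly):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.HColumnDescriptor;
        import org.apache.hadoop.hbase.client.HBaseAdmin;
        import org.apache.hadoop.hbase.util.Bytes;

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HColumnDescriptor family = admin.getTableDescriptor(Bytes.toBytes(tableName))
                .getFamily(Bytes.toBytes("cf")); // "cf" is a placeholder family name
        family.setBlocksize(256 * 1024);         // up from the 64k default
        admin.disableTable(tableName);           // schema change with the table offline
        admin.modifyColumn(tableName, family);
        admin.enableTable(tableName);
        admin.majorCompact(tableName);           // re-write the data at the new size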

We constantly run multiple MR jobs (often on the order of 10's) against the same hbase cluster and don't often see issues. They are not full table scans, but they do often overlap. I think it would be worth at least attempting to run multiple jobs at once.

On Fri, Sep 20, 2013 at 8:09 PM, James Birchfield <[EMAIL PROTECTED]> wrote:

> I did not implement accurate timing, but the current table being counted
> has been running for about 10 hours, and the log is estimating the map
> portion at 10%
>
> 2013-09-20 23:40:24,099 INFO [main] Job : map 10% reduce 0%
>
> So a loooong time. Like I mentioned, we have billions, if not trillions
> of rows potentially.
>
> Thanks for the feedback on the approaches I mentioned. I was not sure if
> they would have any effect overall.
>
> I will look further into coprocessors.
>
> Thanks!
> Birch

Thanks for the info.

Right now the MapReduce Scan uses the FirstKeyOnlyFilter. From what I have read in the javadoc, FirstKeyFilter *should* be faster since it only grabs the first KV pair.

I will play around with setting the caching size to a much higher number and see how it performs. I do not think I have too much wiggle room to modify our cluster configurations, but will see what I can do.

> If your cells are extremely small try setting the caching even higher than> 10k. You want to strike a balance between MBs of response size and number> of calls needed, leaning towards larger response sizes as far as your> system can handle (account for RS max memory, and memory available to your> mappers).> > You could use the KeyOnlyFilter to further limit the sizes of responses> transferred.> > Another thing that may help would be increasing your block size. This> would speed up sequential read but slow down random access. It would be a> matter of making the config change and then running a major compaction to> re-write the data.> > We constantly run multiple MR jobs (often on the order of 10's) against the> same hbase cluster and don't often see issues. They are not full table> scans, but they do often overlap. I think it would be worth at least> attempting to run multiple jobs at once.> > > > > On Fri, Sep 20, 2013 at 8:09 PM, James Birchfield <> [EMAIL PROTECTED]> wrote:> >> I did not implement accurate timing, but the current table being counted>> has been running for about 10 hours, and the log is estimating the map>> portion at 10%>> >> 2013-09-20 23:40:24,099 INFO [main] Job : map>> 10% reduce 0%>> >> So a loooong time. Like I mentioned, we have billions, if not trillions>> of rows potentially.>> >> Thanks for the feedback on the approaches I mentioned. I was not sure if>> they would have any effect overall.>> >> I will look further into coprocessors.>> >> Thanks!>> Birch>> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[EMAIL PROTECTED]>>> wrote:>> >>> How long does it take for RowCounter Job for largest table to finish on>> your cluster?>>> >>> Just curious.>>> >>> On your options:>>> >>> 1. Not worth it probably - you may overload your cluster>>> 2. Not sure this one differs from 1. Looks the same to me but more>> complex.>>> 3. The same as 1 and 2>>> >>> Counting rows in efficient way can be done if you sacrifice some>> accuracy :>>> >>> >> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html>>> >>> Yeah, you will need coprocessors for that.>>> >>> Best regards,>>> Vladimir Rodionov>>> Principal Platform Engineer>>> Carrier IQ, www.carrieriq.com>>> e-mail: [EMAIL PROTECTED]>>> >>> ________________________________________>>> From: James Birchfield [[EMAIL PROTECTED]]>>> Sent: Friday, September 20, 2013 3:50 PM>>> To: [EMAIL PROTECTED]>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help>>> >>> Hadoop 2.0.0-cdh4.3.1>>> >>> HBase 0.94.6-cdh4.3.1>>> >>> 110 servers, 0 dead, 238.2364 average load>>> >>> Some other info, not sure if it helps or not.>>> >>> Configured Capacity: 1295277834158080 (1.15 PB)>>> Present Capacity: 1224692609430678 (1.09 PB)>>> DFS Remaining: 624376503857152 (567.87 TB)>>> DFS Used: 600316105573526 (545.98 TB)>>> DFS Used%: 49.02%>>> Under replicated blocks: 0>>> Blocks with corrupt replicas: 1>>> Missing blocks: 0>>> >>> It is hitting a production cluster, but I am not really sure how to>> calculate the load placed on the cluster.>>> On Sep 20, 2013, at 3:19 PM, Ted Yu <[EMAIL PROTECTED]> wrote:>>> >>>> How many nodes do you have in your cluster ?>>>> >>>> When counting rows, what other load would be placed on the cluster ?>>>> >>>> What is the HBase version you're currently using / planning to use ?

From your numbers below you have about 26k regions, thus each region is about 545tb/26k = 20gb. Good.

How many mappers are you running? And just to rule out the obvious, the M/R is running on the cluster and not locally, right? (It will default to a local runner when it cannot use the M/R cluster.)

Some back of the envelope calculations tell me that, assuming 1ge network cards, the best you can expect for 110 machines to map through this data is about 10h (so way faster than what you see): 545tb / (110 * 1/8 gb/s) ~ 40ks ~ 11h.

We should really add a rowcounting coprocessor to HBase and allow using it via M/R.

-- Lars

> bq. FirstKeyFilter *should* be faster since it only grabs the first KV pair.
>
> Minor correction: FirstKeyFilter above should be FirstKeyOnlyFilter

So this is where my inexperience is probably going to come glaring through, and maybe it is the root of all this. I am not running the MapReduce job on a node in the cluster. It is running on a development server that connects remotely to the cluster. Furthermore, I am not executing the MapReduce job from the command line using the CLI as seen in many of the examples. I am executing them in process, from a stand-alone Java program I have written. It is simple in nature: it creates an HBaseAdmin connection, lists the tables and looks up the column families, closes the admin connection, then loops over the table list and runs the following code:
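
        Job job = RowCounter.createSubmittableJob(
                        ConfigManager.getConfiguration(), new String[]{tableName});
        boolean waitForCompletion = job.waitForCompletion(true);
        Counters counters = job.getCounters();
        Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
        return rowCounter.getValue();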

This makes me think I am not taking advantage of the cluster effectively, if at all. I do not mind at all running the MapReduce job using the hbase/hadoop CLI; I can script that as well. I just thought this would work decently enough.

It does seem like it will be possible to use the Aggregation coprocessor as suggested a little earlier in this thread, and it may speed things up as well. But either way, I need to understand whether I am losing significant performance running in the manner I am, which at this point it sounds like I probably am.

Thanks Ted.

That was the direction I have been working towards as I am learning today. Much appreciation for all the replies to this thread.

Whether I keep the MapReduce job or utilize the Aggregation coprocessor (which, it turns out, should be possible for me here), I need to make sure I am running the client in an efficient manner. Lars may have hit upon the core problem. I am not running the MapReduce job on the cluster, but rather from a stand-alone remote Java client executing the job in process. This may very well turn out to be the number one issue. I would love it if this turns out to be true. It would make this a great learning lesson for me as a relative newcomer to working with HBase, and potentially allow me to finish this initial task much quicker than I was thinking.

So assuming the MapReduce jobs need to be run on the cluster instead of locally, does a coprocessor endpoint client need to be run the same way, or is it safe to run it on a remote machine, since the work gets distributed out to the region servers? Just wondering if I would run into the same issues if what I said above holds true.

> In 0.94, we have AggregateImplementation, an endpoint coprocessor, which
> implements getRowNum().
>
> Example is in AggregationClient.java
>
> Cheers
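
For reference, a minimal sketch of the client side described above (0.94 API; it assumes the AggregateImplementation coprocessor is already loaded on the region servers, e.g. via hbase.coprocessor.region.classes, which is itself a cluster config change):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
        import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
        import org.apache.hadoop.hbase.util.Bytes;

        public class CoprocessorRowCount {
            public static void main(String[] args) throws Throwable {
                Configuration conf = HBaseConfiguration.create();
                AggregationClient aggregationClient = new AggregationClient(conf);
                // An empty Scan counts the whole table; the counting runs on the
                // region servers, so only per-region totals cross the network.
                long rows = aggregationClient.rowCount(
                        Bytes.toBytes(args[0]), new LongColumnInterpreter(), new Scan());
                System.out.println(args[0] + ": " + rows);
            }
        }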

> Please take a look at the javadoc for
> src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
>
> As long as the machine can reach your HBase cluster, you should be able to
> run AggregationClient and utilize the AggregateImplementation endpoint in
> the region servers.
>
> Cheers

I could be wrong, but based on the info in your most recent emails and the logs therein as well, I believe you may be running this job as a single process.

Do you actually have a full hadoop setup running, with a jobtracker and tasktrackers? In the absence of proper configuration, the hadoop code will simply launch a local, single-process job. The LocalJobRunner referenced in your logs points to that.

If this is the case you are likely only running a single mapper and reducer, or at most running a few mappers at once in threads in your local process. Either way this would obviously greatly limit the throughput.

If you have a full hadoop set-up, make sure the client (dev machine) youare running this job from has access to a mapred-site.xml and hdfs-site.xmlconfiguration file, or at the very least set the mapred.job.tracker valuemanually in your job configuration before submitting.
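
To make that concrete, the programmatic override would look something like the sketch below; the hostnames and ports here are placeholders for your cluster, not real values:

    Configuration conf = HBaseConfiguration.create();
    // Without these (or the equivalent mapred-site.xml / hdfs-site.xml on the
    // classpath), Hadoop silently falls back to the LocalJobRunner.
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
    Job job = RowCounter.createSubmittableJob(conf, new String[] { tableName });
    job.waitForCompletion(true);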

> Excellent! Will do!
>
> Birch
> On Sep 20, 2013, at 6:32 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
>> Please take a look at the javadoc
>> for src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
>>
>> As long as the machine can reach your HBase cluster, you should be able to
>> run AggregationClient and utilize the AggregateImplementation endpoint in
>> the region servers.
>> [...]

Ok, weird. Those classes do not show up through normal navigation from that link; however, the documentation does exist if I google for it directly. Maybe the javadocs need to be regenerated??? Dunno, but I will check it out.

Birch

On Sep 20, 2013, at 6:32 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Please take a look at the javadoc
> for src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
>
> As long as the machine can reach your HBase cluster, you should be able to
> run AggregationClient and utilize the AggregateImplementation endpoint in
> the region servers.
>
> Cheers
> [...]

Yes, we have a fully set up cluster, complete with everything you pointed out. But I believe, now that it has been pointed out to me in this thread and your reply, that it is exactly as you and Lars say. I am running the MapReduce job in process from a standalone Java process, and I believe it is not taking advantage of that infrastructure.

So I will pull this all out of the process, and run it on the cluster using the example I have read about.
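
Concretely, the invocation I will be scripting is roughly the stock one (run from a machine that has the cluster's configuration files in place; the jar path and table name below are placeholders):

    hadoop jar $HBASE_HOME/hbase-<version>.jar rowcounter <tablename>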

It is most likely just my ignorance leading to the root cause of this problem. All the help is very much appreciated.

> I could be wrong, but based on the info in your most recent emails and the
> logs therein as well, I believe you may be running this job as a single
> process.
> [...]
> Let me know if I'm totally off base here.
>
> On Fri, Sep 20, 2013 at 9:34 PM, James Birchfield <[EMAIL PROTECTED]> wrote:
> [...]

So, just to clarify where I am at this point: I have learned that I was absolutely not taking advantage of the cluster doing it the way I was. Some quick tests running the 'correct' way, from the command line using the built-in RowCounter MapReduce job, run orders of magnitude faster than what I was seeing.

So, my apologies for seeking help for a problem when I didn't fully understand the technology and the proper use of it. However, I am very glad that this community was able to point this out and clue me in. For that I am very, very appreciative.

I will rework my logic to use this technique, probably creating a customized RowCounter MapReduce implementation that can count multiple tables at once instead of having to issue 600 individual requests. A rough sketch of what I have in mind follows.
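
To be clear about the direction, here is an untested sketch of the idea, assuming a 0.94.5+ cluster where the multi-table input support (HBASE-3996) is available. The class and variable names are mine, and with hundreds of tables the Hadoop counter limit would need to be raised:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class MultiTableRowCounter {

        // One counter per table, so a single job yields all the counts.
        public static class CountingMapper
                extends TableMapper<ImmutableBytesWritable, Result> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value,
                    Context context) {
                byte[] table = ((TableSplit) context.getInputSplit()).getTableName();
                context.getCounter("RowCounts", Bytes.toString(table)).increment(1);
            }
        }

        public static Job createJob(Configuration conf, List<String> tableNames)
                throws Exception {
            List<Scan> scans = new ArrayList<Scan>();
            for (String tableName : tableNames) {
                Scan scan = new Scan();
                scan.setCaching(1000);
                scan.setCacheBlocks(false); // don't pollute the block cache
                scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME,
                        Bytes.toBytes(tableName));
                scans.add(scan);
            }
            Job job = new Job(conf, "multiTableRowCounter");
            job.setJarByClass(MultiTableRowCounter.class);
            TableMapReduceUtil.initTableMapperJob(scans, CountingMapper.class,
                    ImmutableBytesWritable.class, Result.class, job);
            job.setOutputFormatClass(NullOutputFormat.class);
            job.setNumReduceTasks(0); // map-only; counts come from the counters
            return job;
        }
    }

Whether one big job like this behaves better than 600 small ones is an open question, since the scans still have to touch every region of every table.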

> Yes, we have a fully setup cluster complete with all you pointed out. But I
> believe, now that it has been pointed out to me in this thread and your
> reply, it is exactly as you and Lars say. I am running the MapReduce in
> process from a standalone java process, and I believe it is not taking
> advantage of that infrastructure.
> [...]
> On Sep 20, 2013, at 6:46 PM, Bryan Beaudreault <[EMAIL PROTECTED]> wrote:
> [...]

> Ted,
>
> My apologies if I am being thick, but I am looking at the API docs
> here: http://hbase.apache.org/apidocs/index.html and I do not see that
> package. And the coprocessor package only contains an exception.
>
> Ok, weird. Those classes do not show up through normal navigation
> from that link, however, the documentation does exist if I google for it
> directly. Maybe the javadocs need to be regenerated??? Dunno, but I will
> check it out.
>
> Birch
>
> On Sep 20, 2013, at 6:32 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> [...]

Thanks. I have been taking a look this evening. We enabled the Aggregation coprocessor and the AggregationClient works great. I still have to execute it with the 'hadoop jar' command, but I can live with that. When I try to run it in process, it just hangs. I am not going to fight it though.

The only thing I dislike about the AggregationClient is that it requires a column family. I was hoping to get a row count in a completely generic way, without having any information about a table's column families. The provided implementation requires exactly one. I was hoping there was some sort of default column family always present on a table, but it does not appear so. I will look at the provided coprocessor implementation to see why it is required and whether it could be made optional, and if so, what the performance penalty would be. In the meantime, I am just using the first column family returned from a query to the admin client for a table. Seems to work fine.
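
For reference, the client call I ended up with looks roughly like the sketch below. The LongColumnInterpreter comes from the AggregationClient example code; the variable names and the first-family lookup are my own workaround, so treat this as illustrative rather than canonical:

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // No generic "any family" option, so grab the first family the table defines.
    HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes(tableName));
    byte[] family = desc.getColumnFamilies()[0].getName();
    admin.close();

    Scan scan = new Scan();
    scan.addFamily(family); // rowCount() requires exactly one column family

    AggregationClient aggregationClient = new AggregationClient(conf);
    long rowCount = aggregationClient.rowCount(Bytes.toBytes(tableName),
            new LongColumnInterpreter(), scan);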

> HBase is open source. You can check out the source code and look at the
> implementation.
>
> $ svn info
> Path: .
> URL: http://svn.apache.org/repos/asf/hbase/branches/0.94
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 1525061
> [...]

Thanks for the feedback.

I logged HBASE-9605 for relaxation of this requirement for the row count aggregate.

On Fri, Sep 20, 2013 at 8:46 PM, James Birchfield <[EMAIL PROTECTED]> wrote:

> Thanks. I have been taking a look this evening. We enabled the
> Aggregation coprocessor and the AggregationClient works great. I still
> have to execute it with the 'hadoop jar' command, but I can live with
> that. When I try to run it in process, it just hangs. I am not going to
> fight it though.
> [...]

Sweet! Thanks a lot Ted. Like I said, I haven't looked at the code to try to determine any potential side effects of not requiring it. But if it isn't detrimental to the speed, it would be nice to have it optional for cases where you really just don't care about, or even know, the column family makeup of the table. Perhaps this is specific to my particular usage, but an observation nonetheless.

Birch
On Sep 20, 2013, at 9:11 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Thanks for the feedback.
>
> I logged HBASE-9605 for relaxation of this requirement for the row count
> aggregate.
> [...]

Hey, we all start somewhere. I did the "LocalJobRunner" thing many times and wondered why it was so slow, until I realized I hadn't set up my client correctly. The LocalJobRunner runs the M/R job on the client machine. This is really just for testing, and it is terribly slow.

From later emails in this thread I gather you managed to run this as an actual M/R job on the cluster? (By the way, you do not need to start the job on a machine in the cluster; just configure your client correctly to ship the job to the M/R cluster.)

Was that still too slow? I would love to get my hands on some numbers. If you have trillions of rows and can run this job with a few mappers per machine, those would be good numbers to publish here. In any case, let us know how it goes.

-- Lars

btw. my calculations were assuming that network IO is the bottleneck. For larger jobs (such as yours) it's typically either that or disk IO.

Thanks Lars. I like your time calculations much better than mine.

So this is where my inexperience is probably going to come glaring through, and maybe the root of all this. I am not running the MapReduce job on a node in the cluster. It is running on a development server that connects remotely to the cluster. Furthermore, I am not executing the MapReduce job from the command line using the CLI as seen in many of the examples. I am executing it in-process from a stand-alone Java process I have written. It is simple in nature: it creates an HBaseAdmin connection, lists the tables and looks up the column families, closes the admin connection, then loops over the table list and runs the following code:

public class RowCounterRunner {

    public static long countRows(String tableName) throws Exception {

        Job job = RowCounter.createSubmittableJob(
                ConfigManager.getConfiguration(), new String[]{tableName});
        boolean waitForCompletion = job.waitForCompletion(true);
        Counters counters = job.getCounters();
        Counter findCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
        long value2 = findCounter.getValue();
        return value2;

    }
}

I sort of stumbled on to this implementation as a fairly easy way to automate the process. So based on your comments, and the fact that I see this in my log:

2013-09-20 23:41:05,556 INFO [LocalJobRunner Map Task Executor #0] LocalJobRunner : map
makes me think I am not taking advantage of the cluster effectively, if at all. I do not mind at all running the MapReduce job using the hbase/hadoop CLI; I can script that as well. I just thought this would work decently enough.

It does seem like it will be possible to use the Aggregation coprocessor as suggested a little earlier in this thread. It may speed things up as well. But either way, I need to understand whether I am losing significant performance running in the manner I am, which at this point it sounds like I probably am.

> From your numbers below you have about 26k regions, thus each region is
> about 545tb/26k = 20gb. Good.
>
> How many mappers are you running?
> And just to rule out the obvious, the M/R is running on the cluster and
> not locally, right? (it will default to a local runner when it cannot use
> the M/R cluster).
> [...]


The Aggregation coprocessor works well for smaller datasets, or in case you are computing it on a range of a table.

During its development phase, I used to do row counts of 1m, 10m rows (spanning across about 25 regions for the test table). In its current form, I would avoid using it for tables bigger than that.

In case you are scanning a huge data set (which you are doing), there is a chance your request will fail: you get a SocketTimeoutException if your request is under processing for more than the (default) client rpc timeout of 60 sec, for example. Or, it may block other requests if all the regionserver handlers get busy processing your request.
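
If you do want to experiment with it on larger tables anyway, the client rpc timeout can be raised in the configuration you hand to the client. A minimal sketch, with an arbitrary 10-minute value:

    Configuration conf = HBaseConfiguration.create();
    // hbase.rpc.timeout is the client rpc timeout mentioned above (default 60000 ms)
    conf.setLong("hbase.rpc.timeout", 600000L);
    AggregationClient aggregationClient = new AggregationClient(conf);

Keep in mind the handler-blocking concern still applies; a longer timeout only hides it.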

I would use the rowcounter mapreduce job (co-locating it with the HBase cluster) in order to get the result within a decent processing time.

> Just wanted to follow up here with a little update. We enabled the
> Aggregation coprocessor on our dev cluster. Here are the quick timing
> stats.
>
> Tables: 565
> Total Rows: 2,749,015,957
> Total Time (to count): 52m:33s
>
> Will be interesting to see how this fares against our production clusters
> with a lot more data.
>
> Thanks again for all of your help!
> Birch
> On Sep 20, 2013, at 10:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> [...]
