I am deriving the time complexities for an algorithm I implemented in Hadoop using Accumulo and need to know the time complexity of bulk inserting m records evenly distributed across p nodes into an empty table with p tablet servers. Assuming B is the bandwidth of the network, would the communication complexity be O(m/B) and the computation complexity O(m/p * log(m/p))? If the table contained n records, would the values be O(m/B) and O(m/p * log(m/p) + n/p)?

Are you referring to "bulk inserts" as importing a pre-sorted rfile of Key/Values or using a BatchWriter?

On 10/18/12 10:49 AM, Jeff Kubina wrote:
> I am deriving the time complexities for an algorithm I implemented in
> Hadoop using Accumulo and need to know the time complexity of bulk
> inserting m records evenly distributed across p nodes into an empty
> table with p tablet servers. Assuming B is the bandwidth of the
> network, would the communication complexity be O(m/B) and the
> computation complexity O(m/p * log(m/p))? If the table contained n
> records, would the values be O(m/B) and O(m/p * log(m/p) + n/p)?

BatchWriter, but I would be interested in the answer assuming a pre-sorted rfile.

On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <[EMAIL PROTECTED]> wrote:
> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
> Key/Values or using a BatchWriter?

Adding a sorted file to accumulo (bulk loading) is essentially constant in the normal case. It is O(n) + O(p) for the worst case, where the index must be read and the file assigned to every tablet server. In this case, the (slow) RPCs will dominate over the (fast) read of the index, except for very small clusters or very large indexes.
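A toy cost model (Python for illustration only; this is not Accumulo code, and the constants are made up) can show why the RPC term dominates the index read in the worst case described above:

```python
# Hypothetical cost model for bulk-loading one pre-sorted file.
# Reading an index entry is cheap; an assignment RPC is comparatively
# expensive, so servers_assigned (up to p in the worst case) dominates.
def bulk_load_cost(index_entries, servers_assigned,
                   rpc_cost=1000, index_entry_cost=1):
    # O(n) index read + O(p) assignment RPCs
    return index_entries * index_entry_cost + servers_assigned * rpc_cost

normal = bulk_load_cost(10_000, 3)    # file overlaps only a few tablets
worst = bulk_load_cost(10_000, 500)   # file assigned to every server
print(normal, worst)  # -> 13000 510000
```

Only with a very small cluster or a very large index does the index-read term catch up to the RPC term, matching Eric's caveat.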

Inserting with the BatchWriter is eventually dominated by compactions, which is a merge sort, or O(n log n).

-Eric
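The compaction step Eric describes — merging several sorted runs into one — can be sketched with a k-way merge (Python here purely as a model; Accumulo's actual compactions operate on rfiles in Java):

```python
import heapq

# Each minor compaction flushes a sorted run; a major compaction merges
# several runs into one, analogous to a merge-sort pass.
runs = [sorted([9, 3, 7]), sorted([1, 8, 2]), sorted([5, 4, 6])]

# A k-way merge over n total keys costs O(n log k) per pass; repeatedly
# merging down to a single sorted file gives the O(n log n) bound.
merged = list(heapq.merge(*runs))
print(merged)  # -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```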

On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <[EMAIL PROTECTED]> wrote:
> BatchWriter, but I would be interested in the answer assuming a
> pre-sorted rfile.

> Adding a sorted file to accumulo (bulk loading) is essentially
> constant in the normal case. It is O(n) + O(p) for the worst case,
> where the index must be read and the file assigned to every tablet
> server. In this case, the (slow) RPCs will dominate over the (fast)
> read of the index, except for very small clusters or very large
> indexes.
>
> Inserting with the BatchWriter is eventually dominated by compactions,
> which is a merge sort, or O(n log n).
>
> -Eric

For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) * p), where n is the size of the file, P is the total number of tablets (proportional to tablet servers), and p is the number of tablets that get assigned that file?
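The log(P) factor comes from binary-searching the sorted tablet split points to find which tablets a file's key range overlaps. A minimal sketch (Python; the split-point data and function name are hypothetical, not Accumulo API):

```python
import bisect

# Sorted end-row split points defining P = 6 tablets (hypothetical data).
split_points = ["d", "h", "m", "r", "w"]

def tablets_overlapping(first_key, last_key):
    """Binary-search the split points (O(log P) each) to find the
    indexes of the p tablets a file's [first_key, last_key] range
    overlaps; each of those p tablets gets the file assigned."""
    lo = bisect.bisect_left(split_points, first_key)
    hi = bisect.bisect_left(split_points, last_key)
    return list(range(lo, hi + 1))

print(tablets_overlapping("e", "n"))  # -> [1, 2, 3]
```

In the worst case the file's range spans every tablet, recovering the O(p) assignment cost from Eric's answer.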

For the BatchWriter case, there's a client-side lookup/binning that takes O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p))) for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no?
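The client-side binning step can be modeled as follows (Python sketch under assumed names; the real BatchWriter bins Mutations against tablet metadata in Java):

```python
import bisect
from collections import defaultdict

# p = 4 tablets defined by 3 sorted split points (hypothetical data).
split_points = ["g", "n", "t"]

def bin_mutations(rows):
    """Bin each row into its tablet via an O(log p) binary search,
    then sort each of the p bins: O(n/p * log(n/p)) per partition."""
    bins = defaultdict(list)
    for row in rows:                               # n entries total
        i = bisect.bisect_left(split_points, row)  # O(log p) lookup
        bins[i].append(row)
    for b in bins.values():                        # p partitions
        b.sort()                                   # O(n/p * log(n/p))
    return dict(bins)

bins = bin_mutations(["q", "a", "z", "h", "b"])
print(bins)
```

Summing the per-partition sorts over p partitions gives the O(n log n) aggregate Adam states.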

> @eric, assuming the records are evenly distributed and network bandwidth
> is not an issue, shouldn't that be O(n/p) + O(p) and O(n/p * log(n/p))?
>
> On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>> Adding a sorted file to accumulo (bulk loading) is essentially
>> constant in the normal case. It is O(n) + O(p) for the worst case,
>> where the index must be read and the file assigned to every tablet
>> server. In this case, the (slow) RPCs will dominate over the (fast)
>> read of the index, except for very small clusters or very large
>> indexes.
>>
>> Inserting with the BatchWriter is eventually dominated by compactions,
>> which is a merge sort, or O(n log n).
>>
>> -Eric


Adam Fuchs 2012-10-24, 21:50

