Justin,

If I understand your requirement correctly, you need to pull data from multiple RDBMS sources and join it, possibly with some further custom operations on top. You don't need to write custom MapReduce code for this unless you really have to. You can achieve the same in two easy steps:

- Import the data from the RDBMS into Hive using Sqoop (import)
- Use Hive to do the join and processing on this data

> I would like join some db tables, possibly from different databases, in a
> MR job.
>
> I would essentially like to use MultipleInputs, but that seems file
> oriented. I need a different mapper for each db table.
>
> Suggestions?
>
> Thanks!
>
> Justin Vincent

Hi Justin,

Just to add to my response: if you need to fetch data from an RDBMS in your mapper using custom MapReduce code, you can use DBInputFormat in your mapper class with MultipleInputs. Be careful with the number of mappers for your application, since databases are constrained by a limit on maximum simultaneous connections. Also make sure the same query is not executed n times in n mappers all fetching the same data; that would just waste network bandwidth. Sqoop + Hive would be my recommendation and is a good combination for such use cases. If you have Pig expertise, you can also look into Pig instead of Hive.
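The two cautions above (a cap on simultaneous connections, and no two mappers fetching the same rows) can be illustrated with a small planning routine. This is only a sketch in plain Java, not a reproduction of how DBInputFormat works internally. The class `RowRangePlanner`, its methods, and the `customers` table in the usage example are hypothetical names, and the `LIMIT ... OFFSET` clause assumes a database that accepts that syntax (for example MySQL or PostgreSQL).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: gives each mapper a distinct, non-overlapping slice of rows. */
class RowRangePlanner {

    /** One mapper's share of the table: rows [offset, offset + length). */
    static final class RowRange {
        public final long offset;
        public final long length;

        RowRange(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        /** Appends LIMIT/OFFSET to a query that already has a stable ORDER BY. */
        public String applyTo(String orderedQuery) {
            return orderedQuery + " LIMIT " + length + " OFFSET " + offset;
        }
    }

    /**
     * Splits totalRows into at most min(requestedMappers, maxDbConnections) ranges,
     * so the job never opens more connections than the database allows.
     */
    public static List<RowRange> plan(long totalRows, int requestedMappers, int maxDbConnections) {
        if (totalRows < 0 || requestedMappers < 1 || maxDbConnections < 1) {
            throw new IllegalArgumentException("row count must be >= 0, mapper and connection counts >= 1");
        }
        int mappers = Math.min(requestedMappers, maxDbConnections);
        long chunk = totalRows / mappers;
        long remainder = totalRows % mappers;

        List<RowRange> ranges = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < mappers; i++) {
            // The first `remainder` mappers take one extra row each.
            long length = chunk + (i < remainder ? 1 : 0);
            if (length == 0) {
                continue; // fewer rows than mappers: skip empty ranges
            }
            ranges.add(new RowRange(offset, length));
            offset += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 1000 rows, 8 mappers requested, but the database allows only 4 connections.
        for (RowRange range : plan(1000, 8, 4)) {
            System.out.println(range.applyTo("SELECT id, name FROM customers ORDER BY id"));
        }
    }
}
```

The stable `ORDER BY` matters here: without it, the database may return rows in a different order for each mapper's query, and the slices could overlap or miss rows.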

Hope it helps!...

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:

> Justin
> If I get your requirement right you need to get in data from
> multiple rdbms sources and do a join on the same, also may be some more
> custom operations on top of this. For this you don't need to go in for
> writing your custom mapreduce code unless it is that required. You can
> achieve the same in two easy steps
> - Import data from RDBMS into Hive using SQOOP (Import)
> - Use hive to do some join and processing on this data
>
> Hope it helps!..
>
> Regards
> Bejoy.K.S
>
> On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent <[EMAIL PROTECTED]> wrote:
>
>> I would like join some db tables, possibly from different databases, in a
>> MR job.
>>
>> I would essentially like to use MultipleInputs, but that seems file
>> oriented. I need a different mapper for each db table.
>>
>> Suggestions?
>>
>> Thanks!
>>
>> Justin Vincent

Thanks Bejoy,

I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a Path parameter. Are these paths just ignored here?

On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:

> Hi Justin,
> Just to add on to my response. If you need to fetch data from
> rdbms on your mapper using your custom mapreduce code you can use the
> DBInputFormat in your mapper class with MultipleInputs. You have to be
> careful in using the number of mappers for your application as dbs would be
> constrained with a limit on maximum simultaneous connections. Also you need
> to ensure that that the same Query is not executed n number of times in n
> mappers all fetching the same data, It'd be just wastage of network. Sqoop
> + Hive would be my recommendation and a good combination for such use
> cases. If you have Pig competency you can also look into pig instead of
> hive.
>
> Hope it helps!...
>
> Regards
> Bejoy.K.S

MultipleInputs takes multiple Paths (files), not a DB, as input. As mentioned earlier, export the tables into HDFS using either Sqoop or the native DB export tool, and then do the processing. Sqoop can be configured to use the native DB export tool where one is supported.
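Once each table has been exported to its own file, the processing step is typically a reduce-side join: a separate mapper per file tags each record with its table, and the reducer pairs up records that share a key. The sketch below simulates that flow in plain Java with in-memory lists standing in for the exported files, so it runs without Hadoop. The class `TaggedJoinSketch`, the table names `customers` and `orders`, and the comma-separated record layouts are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of a reduce-side join over two exported tables. */
class TaggedJoinSketch {

    /** A value tagged with the table it came from, as a per-table mapper would emit it. */
    static final class Tagged {
        public final String source;
        public final String payload;

        Tagged(String source, String payload) {
            this.source = source;
            this.payload = payload;
        }
    }

    /** Mapper for the customers export: "customerId,name" becomes (customerId, name). */
    public static Map.Entry<String, Tagged> mapCustomer(String line) {
        String[] f = line.split(",");
        return new SimpleEntry<>(f[0], new Tagged("customers", f[1]));
    }

    /** Mapper for the orders export: "orderId,customerId,amount" becomes (customerId, "orderId,amount"). */
    public static Map.Entry<String, Tagged> mapOrder(String line) {
        String[] f = line.split(",");
        return new SimpleEntry<>(f[1], new Tagged("orders", f[0] + "," + f[2]));
    }

    /** Reducer: separates the values by tag, then emits every customer/order pair for the key. */
    public static List<String> reduce(String key, List<Tagged> values) {
        List<String> customers = new ArrayList<>();
        List<String> orders = new ArrayList<>();
        for (Tagged t : values) {
            if (t.source.equals("customers")) {
                customers.add(t.payload);
            } else {
                orders.add(t.payload);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String c : customers) {
            for (String o : orders) {
                joined.add(key + "," + c + "," + o);
            }
        }
        return joined;
    }

    /** Runs map, a sorted group-by-key "shuffle", and reduce over both inputs (inner join). */
    public static List<String> join(List<String> customerLines, List<String> orderLines) {
        Map<String, List<Tagged>> grouped = new TreeMap<>();
        List<Map.Entry<String, Tagged>> emitted = new ArrayList<>();
        for (String line : customerLines) {
            emitted.add(mapCustomer(line));
        }
        for (String line : orderLines) {
            emitted.add(mapOrder(line));
        }
        for (Map.Entry<String, Tagged> e : emitted) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : grouped.entrySet()) {
            result.addAll(reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> customers = List.of("1,alice", "2,bob");
        List<String> orders = List.of("100,1,9.50", "101,1,3.00", "102,3,7.25");
        join(customers, orders).forEach(System.out::println);
    }
}
```

In a real job, the two `map*` methods would correspond to the two mapper classes registered against the two exported paths, and the grouping step is what the framework's shuffle does.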

> Thanks Bejoy,
> I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
> Path parameter. Are these paths just ignored here?

If it is not feasible for you to do as Praveen suggested, here are some options.

1. You can write a customized InputFormat that creates different connections for different data sources and returns splits from those data source tables. Internally, you can use DBInputFormat for each data source in your customized InputFormat, if possible.

2. If your mapper input is not the same for the two data sources, you can write one mapper that internally delegates to the mapper corresponding to each source, based on the input split (you can refer to MultipleInputs for this).
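The two options above fit together: the input side produces splits that remember which source they came from, and a single mapper dispatches each split to the logic for that source. The sketch below shows the shape of that idea in plain Java, with in-memory lists standing in for database tables so that it runs without Hadoop or a database. `DelegatingMapperSketch`, `SourceSplit`, and the source names are hypothetical; in Hadoop itself, MultipleInputs relies on a comparable tagged-split and delegating-mapper arrangement for file inputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java sketch: source-tagged splits plus one mapper that delegates per source. */
class DelegatingMapperSketch {

    /** A split that remembers which data source produced it. */
    static final class SourceSplit {
        public final String sourceName;
        public final List<String> rows;

        SourceSplit(String sourceName, List<String> rows) {
            this.sourceName = sourceName;
            this.rows = rows;
        }
    }

    /** Option 1: a single "input format" that returns splits from every configured source. */
    public static List<SourceSplit> getSplits(Map<String, List<String>> sources, int rowsPerSplit) {
        if (rowsPerSplit < 1) {
            throw new IllegalArgumentException("rowsPerSplit must be >= 1");
        }
        List<SourceSplit> splits = new ArrayList<>();
        for (Map.Entry<String, List<String>> source : sources.entrySet()) {
            List<String> rows = source.getValue();
            for (int start = 0; start < rows.size(); start += rowsPerSplit) {
                int end = Math.min(start + rowsPerSplit, rows.size());
                splits.add(new SourceSplit(source.getKey(), new ArrayList<>(rows.subList(start, end))));
            }
        }
        return splits;
    }

    /** Option 2: a single mapper that picks the per-source logic from the split's tag. */
    public static List<String> map(SourceSplit split, Map<String, Function<String, String>> delegates) {
        Function<String, String> delegate = delegates.get(split.sourceName);
        if (delegate == null) {
            throw new IllegalStateException("no mapper registered for source " + split.sourceName);
        }
        List<String> output = new ArrayList<>();
        for (String row : split.rows) {
            output.add(delegate.apply(row));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sources = new LinkedHashMap<>();
        sources.put("users", List.of("1,alice", "2,bob", "3,carol"));
        sources.put("orders", List.of("100,1,9.50"));

        Map<String, Function<String, String>> delegates = new LinkedHashMap<>();
        delegates.put("users", row -> "USER\t" + row);
        delegates.put("orders", row -> "ORDER\t" + row);

        for (SourceSplit split : getSplits(sources, 2)) {
            System.out.println(split.sourceName + " -> " + map(split, delegates));
        }
    }
}
```

In a real custom InputFormat, the equivalent of `getSplits` would open a connection per data source, and the split class would need to be serializable so the source tag reaches the task that runs the mapper.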

MultipleInputs doesn't support DBInputFormat; it supports only input formats that use a file path as the input path.

If you explain your use case in more detail, I may be able to help you better.

