We are new to Hadoop and Hive. We are trying to use Hive to run analytical queries, and we are using Sqoop to import data into Hive. In our RDBMS the data is updated very frequently, and this needs to be reflected in Hive. Hive does not support update/delete, but there are many workarounds to do this task.

What we have in mind is importing all the tables into Hive as is, then building the required tables for reporting.
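For illustration, a reporting table built from a table imported as-is might look like this in HiveQL (the `orders` table and its columns are hypothetical):

```sql
-- Hypothetical: a daily reporting table derived from an "orders" table
-- that was imported unchanged from MySQL via Sqoop.
CREATE TABLE daily_sales_report AS
SELECT
  to_date(order_date) AS report_day,
  COUNT(*)            AS num_orders,
  SUM(total_amount)   AS revenue
FROM orders
GROUP BY to_date(order_date);
```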

My questions are:

1. What is the best way to reflect MySQL updates into Hive with minimal resources?
2. Is Sqoop the right tool to do the ETL?
3. Is Hive the right tool for this kind of queries, or should we search for alternatives?

Any hint will be useful, thanks in advance.

--
Ibrahim

Hello Ibrahim,

A quick question: are you planning to replace your SQL DB with Hive? If that is the case, I would not suggest doing that. Both are meant for entirely different purposes. Hive is for batch processing and not for real-time systems, so if your requirements involve real-time things, you need to think before moving ahead.

Yes, Sqoop is 'the' tool. It is primarily meant for this purpose.

Best Regards,
Tariq


Thanks Mohammad. No, we do not have any plans to replace our RDBMS with Hive. Hadoop/Hive will be used for data warehousing and batch processing; as I said, we want to use Hive for analytical queries.

--
Ibrahim



Cool. Then go ahead :)

Just in case you need something in realtime, you can have a look at Impala. (I know nobody likes to get preached, but just in case ;) )

Best Regards,
Tariq

You can use Apache Oozie to schedule your imports.

Alternatively, you can have an additional column in your SQL table, say LastUpdatedTime or something. As soon as there is a change in this column you can start the import from this point. This way you don't have to import everything every time there is a change in your table. You just have to move only the most recent data, say only the 'delta' amount of data.
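As a sketch of the delta idea (table and column names are made up), the source-side query a scheduled import job would run might be:

```sql
-- Run against MySQL by the scheduled import job: fetch only rows that
-- changed since the last successful import. The cutoff timestamp would
-- be tracked by the job itself (e.g. by Oozie or the import script).
SELECT *
FROM orders
WHERE last_updated_time > '2012-12-23 00:00:00';
```

Sqoop has an incremental import mode built around exactly this pattern (`--incremental lastmodified --check-column last_updated_time --last-value ...`), so the bookkeeping of the cutoff value does not have to be hand-rolled.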

> My question was how to reflect MySQL updates to hadoop/hive, this is our
> problem now. -- Ibrahim

This is already done, but Hive supports neither update nor deletion of data, so when I import the records after a specific "last_update_time", Hive will append them, not replace.

--
Ibrahim


If it were me, I would find a way to identify the partitions that have modified data and then re-load a subset of the partitions (only the ones with changes) on a regular basis. Instead of updating/deleting data, you'll be re-loading specific partitions as an all-or-nothing action.

---
Jeremiah Peschka
Founder, Brent Ozar Unlimited
Microsoft SQL Server MVP
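A minimal sketch of that partition-overwrite approach, assuming an orders table partitioned by day (all names hypothetical):

```sql
-- Re-load only the partitions known to contain modified rows.
-- INSERT OVERWRITE replaces the partition's contents wholesale:
-- an all-or-nothing reload instead of row-level updates.
INSERT OVERWRITE TABLE orders_dw PARTITION (dt='2012-12-24')
SELECT order_id, customer_id, total_amount, last_updated_time
FROM orders_staging
WHERE to_date(last_updated_time) = '2012-12-24';
```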



I was actually trying to answer your actual questions. What are you currently doing to tackle this update problem, and what kind of tweak are you looking for? There is no direct solution to achieve this out-of-the-box, as you have said.


What if you have many columns that need to be updated? A simple example: confirmation date, payment status(es) + status update time, delivery, etc. On what basis will you set your partition, and how will the old data be removed? The updated data will be reloaded into another partition if I partition by payment status, for example.

--
Ibrahim


You can only do the last_update idea if this is an insert-only dataset.

If your table takes updates you need a different strategy:

1) full dumps every interval, or
2) using a storage handler like HBase or Cassandra that takes update operations.
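For reference, the second option means defining a Hive table over HBase via the storage handler; roughly like this (column mapping and names are illustrative):

```sql
-- Hive table backed by HBase. Rows written to HBase (which supports
-- updates by row key) are visible to Hive queries over this table.
CREATE TABLE orders_hbase (
  order_id       STRING,
  payment_status STRING,
  total_amount   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,f:payment_status,f:total_amount"
)
TBLPROPERTIES ("hbase.table.name" = "orders");
```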


Edward, can you explain more please? Are you suggesting that I should use HBase for such tasks instead of Hive?

--
Ibrahim


Edward Capriolo wrote:

You can only do the last_update idea if this is an insert-only dataset.

If your table takes updates, you need a different strategy:
1) Full dumps every interval.
2) Using a storage handler like HBase or Cassandra that takes update operations.

> On Mon, Dec 24, 2012 at 9:22 AM, Jeremiah Peschka wrote:
>
> If it were me, I would find a way to identify the partitions that have modified data and then re-load a subset of the partitions (only the ones with changes) on a regular basis. Instead of updating/deleting data, you'll be re-loading specific partitions as an all-or-nothing action.

>> On Mon, Dec 24, 2012, Ibrahim Yakti wrote:
>>
>> This is already done, but Hive does not support update nor deletion of data, so when I import the records after a specific "last_update_time", Hive will append them, not replace them.

>>> On Mon, Dec 24, 2012 at 5:03 PM, Mohammad Tariq wrote:
>>>
>>> You can use Apache Oozie to schedule your imports.
>>>
>>> Alternatively, you can have an additional column in your SQL table, say LastUpdatedTime or something. As soon as there is a change in this column you can start the import from that point. This way you don't have to import everything every time there is a change in your table; you just have to move the most recent data, the 'delta'.

>>>> On Mon, Dec 24, 2012 at 7:08 PM, Ibrahim Yakti wrote:
>>>>
>>>> My question was how to reflect MySQL updates to Hadoop/Hive; this is our problem now.
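For the insert-only case Edward mentions, the last_update idea maps to Sqoop's append-mode incremental import. A minimal sketch, where the connection string, table, and last value are placeholders:

```shell
# Import only rows whose id is greater than the last value already imported;
# suitable only when existing rows are never updated, just appended.
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --table order_events \
  --incremental append \
  --check-column id \
  --last-value 1500000 \
  --hive-import --hive-table order_events
```

Running this as a saved `sqoop job` makes Sqoop track the last value automatically between runs.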

Edward Capriolo wrote:

Hive cannot easily handle updates. The most creative way I have seen this done was someone who managed to capture all the updates and then used union queries to rewrite the same Hive table with the newest values.

original + union delta + column with latest timestamp = new original

But that is a lot of processing, especially when you may not have many updates. Hive has storage handlers that let you lay a table over HBase and Cassandra data. Store your data in those systems, which take updates, then use Hive to query them.
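A sketch of the union-rewrite Edward describes, assuming hypothetical `orders` (current snapshot) and `orders_delta` (newly imported updates) tables keyed by `id` with a `last_update_time` column; the result goes to a new table rather than overwriting the one being read, since Hive cannot safely read and overwrite the same table in one statement:

```sql
-- new original = old original + delta, keeping the newest row per key
CREATE TABLE orders_merged AS
SELECT u.id, u.status, u.amount, u.last_update_time
FROM (
  SELECT * FROM orders
  UNION ALL
  SELECT * FROM orders_delta
) u
JOIN (
  -- for each key, find the latest timestamp across old and new rows
  SELECT id, MAX(last_update_time) AS max_ts
  FROM (
    SELECT * FROM orders
    UNION ALL
    SELECT * FROM orders_delta
  ) a
  GROUP BY id
) latest
ON u.id = latest.id AND u.last_update_time = latest.max_ts;
```

After the rebuild, `orders_merged` can be swapped in for `orders` (drop and rename), which is the "rewrote the same Hive table" step.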

> On Mon, Dec 24, 2012, Ibrahim Yakti wrote:
>
> Edward, can you explain more please? Are you suggesting that I should use HBase for such tasks instead of Hive?


Dean Wampler wrote:

This is not as hard as it sounds. The hardest part is setting up the incremental query against your MySQL database. Then you can write the results to new files in the HDFS directory for the table, and Hive will see them immediately. Yes, even though Hive doesn't support updates, it doesn't care how many files are in the directory. The trick is to avoid lots of little files.

As others have suggested, you should consider partitioning the data, perhaps by time. Say you import about a few HDFS blocks' worth of data each day; then use year/month/day partitioning to speed up your Hive queries. You'll need to add the partitions to the table as you go, but you can add those once a month, for example, for all partitions. Hive doesn't care if the partition directories don't exist yet or are empty. I also recommend using an external table, which gives you more flexibility on directory layout, etc.
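As a sketch of the external, time-partitioned layout Dean describes (the path, table, and column names are illustrative, not from the thread):

```sql
-- External table over a directory tree like /data/orders/y=2012/m=12/d=24/
CREATE EXTERNAL TABLE orders (
  id     BIGINT,
  status STRING,
  amount DOUBLE
)
PARTITIONED BY (y INT, m INT, d INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/orders';

-- Partitions can be registered ahead of time; Hive does not mind if the
-- directories are still empty or do not exist yet.
ALTER TABLE orders ADD IF NOT EXISTS PARTITION (y=2012, m=12, d=24);
```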

Sqoop might be the easiest tool for importing the data, as it will even generate a Hive table schema from the original MySQL table. However, that feature may not be useful in this case, as you already have the table.

I think Oozie is horribly complex to use and overkill for this purpose. A simple bash script triggered periodically by cron is all you need. If you aren't using a partitioned table, you have a single Sqoop command to run. If you have partitioned data, you'll also need a Hive statement in the script to create the partition, unless you do those in batch once a month, etc.
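A minimal cron-driven script along the lines Dean suggests. The connection details, table, check column, and last-value bookkeeping are placeholders; `--incremental lastmodified` is Sqoop's mode for tables whose existing rows are updated in place:

```shell
#!/bin/bash
# Nightly incremental import from MySQL into the current day's HDFS partition.
y=$(date +%Y); m=$(date +%m); d=$(date +%d)

sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --table orders \
  --incremental lastmodified \
  --check-column last_update_time \
  --last-value "$(cat /var/lib/etl/orders.lastvalue)" \
  --target-dir "/data/orders/y=$y/m=$m/d=$d"

# Register the new partition so Hive queries can see the files.
hive -e "ALTER TABLE orders ADD IF NOT EXISTS PARTITION (y=$y, m=$m, d=$d);"
```

A saved `sqoop job` can track `--last-value` for you instead of the file shown here.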


Ibrahim Yakti wrote:

Thanks Dean for the great reply. Setting up the incremental import should be easy, but if I partition my data, how will Hive get me only the updated rows, considering that a row may have multiple fields that are updated over time? And how will I manage the tables that are based on multiple sources? Also, do you recommend importing the data to HDFS instead of into Hive directly? Won't we have a lot of duplicated records then?

Regarding automation, we were thinking to use the sqoop-job command or cron, as you suggested.

So, the suggested flow is as follows:

MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in Hive as external table ---(Transform Data & Join Tables)---> Save it in Hive tables for reporting.

Correct?


Dean Wampler wrote:

Looks good, but a few suggestions. If you can eliminate duplicates, etc., as you ingest the data into HDFS, that would eliminate a cleansing step. Note that if the target directory in HDFS IS the specified location for an external Hive table/partition, then there will be no separate step to "load in Hive as external table". It's already there!

Your "transform data" step is a common pattern: stage "raw" data into a location, then use Hive (or Pig) to transform it into the final form and INSERT INTO the final Hive table.
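The stage-then-transform step Dean mentions might look like this, where the staged `orders` table, the `customers` table, and the join are purely illustrative:

```sql
-- Transform raw staged data and load one day's slice into the reporting table.
INSERT OVERWRITE TABLE orders_report PARTITION (y=2012, m=12, d=24)
SELECT o.id, c.country, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.y = 2012 AND o.m = 12 AND o.d = 24;
```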




Kshiva Kps wrote:

Hi,

Is there any Hive editor in which we can write 100 to 150 Hive scripts? I believe it is not easy to do all the scripts in CLI mode. Something like an IDE for Java or TOAD for SQL. Please advise, many thanks.

Mohammad Tariq wrote:

Have a look at Beeswax.

BTW, do you have access to Google at your station? Same question on the Pig mailing list as well, that too twice.

Ibrahim Yakti wrote:

Mohammad, I am not sure if the answers & the link were meant for me or for Kshiva's question.

If I have partitioned my data based on status, for example, then when I run the update query it will add the updated data to a new partition ("success" or "shipped", for example) and it will keep the old data ("confirmed" or "paid", for example), right?

> Also, have a look at this:
> http://www.catb.org/~esr/faqs/smart-questions.html
>
> Best Regards,
> Tariq
> +91-9741563634
> https://mtariq.jux.com/
>
> On Tue, Dec 25, 2012 at 11:26 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>> Have a look at Beeswax.
>>
>> BTW, do you have access to Google at your station? Same question on the Pig mailing list as well, that too twice.
>>
>> On Tue, Dec 25, 2012 at 11:20 AM, Kshiva Kps <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> Are there any Hive editors in which we can write 100 to 150 Hive scripts? I believe it is not easy to do all the scripts in CLI mode, like an IDE for Java or TOAD for SQL. Please advise, many thanks.
>>>
>>> On Mon, Dec 24, 2012 at 8:21 PM, Dean Wampler <[EMAIL PROTECTED]> wrote:
>>>> This is not as hard as it sounds. The hardest part is setting up the incremental query against your MySQL database. Then you can write the results to new files in the HDFS directory for the table, and Hive will see them immediately. Yes, even though Hive doesn't support updates, it doesn't care how many files are in the directory. The trick is to avoid lots of little files.
>>>>
>>>> As others have suggested, you should consider partitioning the data, perhaps by time. Say you import a few HDFS blocks' worth of data each day; then use year/month/day partitioning to speed up your Hive queries. You'll need to add the partitions to the table as you go, but you can add those once a month, for example, for all partitions. Hive doesn't care if the partition directories don't exist yet or are empty. I also recommend using an external table, which gives you more flexibility on directory layout, etc.
>>>>
>>>> Sqoop might be the easiest tool for importing the data, as it will even generate a Hive table schema from the original MySQL table. However, that feature may not be useful in this case, as you already have the table.
>>>>
>>>> I think Oozie is horribly complex to use and overkill for this purpose. A simple bash script triggered periodically by cron is all you need. If you aren't using a partitioned table, you have a single Sqoop command to run. If you have partitioned data, you'll also need a Hive statement in the script to create the partition, unless you do those in batch once a month, etc.
>>>>
>>>> Hope this helps,
>>>> dean
>>>>
>>>> --
>>>> *Dean Wampler, Ph.D.*
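Dean's cron-plus-Sqoop approach above can be sketched roughly as follows. This is only a dry run under assumed names: the database host (`dbhost/shop`), table (`orders`), timestamp column (`updated_at`), state file, and HDFS paths are all hypothetical, and the script just prints the two commands it would run instead of executing them.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a daily incremental import: Sqoop pulls rows modified
# since the last run into a dated HDFS directory, then a Hive statement
# registers the matching partition. All names/paths are hypothetical.
set -euo pipefail

Y=$(date +%Y); M=$(date +%m); D=$(date +%d)
TARGET="/user/hive/warehouse/orders/${Y}/${M}/${D}"

# High-water mark saved by the previous run (fallback for the first run).
STATE_FILE="${STATE_FILE:-/tmp/orders.last}"
LAST_VALUE=$(cat "$STATE_FILE" 2>/dev/null || echo "1970-01-01 00:00:00")

SQOOP_CMD="sqoop import \
  --connect jdbc:mysql://dbhost/shop --username etl -P \
  --table orders \
  --incremental lastmodified --check-column updated_at \
  --last-value '${LAST_VALUE}' \
  --target-dir ${TARGET}"

# Hive doesn't mind if the partition directory is still empty at this point.
HIVE_CMD="hive -e \"ALTER TABLE orders ADD IF NOT EXISTS \
  PARTITION (y=${Y}, m=${M}, d=${D}) LOCATION '${TARGET}'\""

echo "$SQOOP_CMD"
echo "$HIVE_CMD"
```

A cron entry pointing at this script (with the `echo`s replaced by `eval`, and the new `--last-value` written back to the state file afterwards) would complete the loop Dean describes.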

After more reading, a suggested scenario looks like:

MySQL ---(Extract / Load)---> HDFS ---> Load into HBase ---> Read as external table in Hive ---(Transform Data & Join Tables)---> Use Hive for joins & queries ---> Update HBase as needed & reload in Hive.

What do you think, please?

--
Ibrahim

On Wed, Dec 26, 2012 at 9:27 AM, Ibrahim Yakti <[EMAIL PROTECTED]> wrote:
> Mohammad, I am not sure if the answers & the link were for me or for Kshiva's question.
>
> If I have partitioned my data based on status, for example, when I run the update query it will add the updated data to a new partition (success or shipped, for example) and it will keep the old data (confirmed or paid, for example), right?
>
> --
> Ibrahim

Hello Ibrahim,

Sorry for the late response. Those replies were for Kshiva. I saw his question (exactly the same as this one) multiple times on the Pig mailing list as well, so I just thought of giving him some pointers on how to use the list. I should have specified it properly. Apologies for creating the nuisance.

Coming back to the actual point: yes, the flow is fine. Normally people do it like this. But I was looking for some alternate way, so that we don't have to go through this long process for the updates. I'll let you know once I find something useful. But till now I haven't found anything better than whatever Dean sir has suggested. Please do let me know if you find something before me.

Many thanks.

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/

> After more reading, a suggested scenario looks like:
>
> MySQL ---(Extract / Load)---> HDFS ---> Load into HBase ---> Read as external table in Hive ---(Transform Data & Join Tables)---> Use Hive for joins & queries ---> Update HBase as needed & reload in Hive.
>
> What do you think, please?
>
> --
> Ibrahim
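The "read as external table in Hive" step of this proposed flow could look roughly like the following, assuming Hive's HBase storage handler is available on the cluster. The table name (`orders_hbase`), HBase table (`orders`), column family (`d`), and column mapping are all hypothetical, and the script only prints the DDL it would submit rather than running it.

```shell
#!/usr/bin/env bash
# Sketch of exposing an HBase table to Hive as an external table via
# the HBaseStorageHandler. Names and the column mapping are hypothetical;
# this dry run only prints the `hive -e` command it would execute.
set -euo pipefail

DDL=$(cat <<'EOF'
CREATE EXTERNAL TABLE orders_hbase (
  order_id   STRING,
  status     STRING,
  total      DOUBLE,
  updated_at STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,d:status,d:total,d:updated_at'
)
TBLPROPERTIES ('hbase.table.name' = 'orders')
EOF
)

echo "hive -e \"${DDL}\""
```

With a table like this, updates written to HBase become visible to Hive queries without rewriting HDFS files, which is the attraction of this flow; the trade-off is that scans through the storage handler are slower than reading native Hive files.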

Thanks Mohammad, I will be waiting ... meanwhile, it seems I will get into HBase and give it a try, unless someone advises something better/easier.

--
Ibrahim

On Wed, Dec 26, 2012 at 5:52 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hello Ibrahim,
>
> Sorry for the late response. Those replies were for Kshiva. I saw his question (exactly the same as this one) multiple times on the Pig mailing list as well, so I just thought of giving him some pointers on how to use the list. I should have specified it properly. Apologies for creating the nuisance.
>
> Coming back to the actual point: yes, the flow is fine. Normally people do it like this. But I was looking for some alternate way, so that we don't have to go through this long process for the updates. I'll let you know once I find something useful. But till now I haven't found anything better than whatever Dean sir has suggested. Please do let me know if you find something before me.
>
> Many thanks.
>
> Best Regards,
> Tariq
> +91-9741563634
> https://mtariq.jux.com/

