One of the greatest pains I face when debugging Pig code is that the iteration cycles are really long. The applications we use Pig for typically deal with large datasets, and if a Pig script involves many JOIN/GENERATE/FILTER steps, every step takes a lot of time. Yet every time I fix one step, I have to rerun the script from the start, which is wasteful.

What I do so far to avoid re-running already-debugged steps is to manually divide my script into many small scripts and save the last variable of each one to HDFS. Once a small script is debugged, I load the saved variable in the next small script.

After all the small scripts are done, I manually connect them back into the original big script.

Is there a way to automate this? For example, add a marker around a particular step that tells Pig to save the result and not execute any of the following steps. Then, when we move on to the next step, Pig knows where to pick up the last-saved data.

Writing a preprocessor to do the above is not trivial, so I can't whip something up immediately, because it needs to figure out the schemas of the variables that propagate through the steps.

Thanks
Yang


Ideally, you would first test with a small amount of data in local mode, and then run the script on the big data to verify correctness.

If that is not possible, you can store a relation at any point of your script with a STORE statement so that you do not lose intermediate results. You can then remove the STOREs after debugging.

Best Regards, Ruslan

On Fri, Oct 19, 2012 at 1:18 PM, Jagat Singh <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I understand the pain :)
>
> Have you seen PigUnit and Penny?
>
> http://pig.apache.org/docs/r0.10.0/test.html
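A minimal sketch of the temporary STORE checkpoint described in the reply above. All paths, relation names, and schemas here are hypothetical and only illustrate the idea; they are not taken from the original script.

```pig
-- Hypothetical inputs; try a small sample first with:  pig -x local debug_step.pig
events = LOAD 'sample/events.tsv' USING PigStorage('\t')
         AS (user_id:chararray, action:chararray, ts:long);
users  = LOAD 'sample/users.tsv' USING PigStorage('\t')
         AS (user_id:chararray, country:chararray);

clicks = FILTER events BY action == 'click';
joined = JOIN clicks BY user_id, users BY user_id;

-- Temporary checkpoint: keep the expensive join result so that later
-- steps can be debugged from it. Remove this STORE once the step is verified.
STORE joined INTO 'tmp/debug/joined' USING PigStorage('\t');

-- In a follow-up script, the later steps can start from the checkpoint
-- instead of recomputing the join (the schema must be restated by hand):
-- joined = LOAD 'tmp/debug/joined' USING PigStorage('\t')
--          AS (click_user:chararray, action:chararray, ts:long,
--              user_id:chararray, country:chararray);
```

The hand-written schema on the reload is exactly the part the original question calls hard to automate.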

But manually adding and removing the STORE and LOAD commands is difficult and, more importantly, it risks introducing bugs during the code change. The best scenario would be to put a "marker" in the script so that certain variables are either stored, or LOADed instead of being recomputed.


I am using PigUnit, but it's somewhat limited. It can run only in local mode, so I can't find issues that only show up with fairly large test data. You also have to create small snippets of code that you cut out manually from your original code, so after you have tested a snippet, you have to copy-paste it back into the production code, which introduces possible copy-paste errors. Compared to JUnit in Java, this is really very crude: in Java, you have a class, and you can do JUnit testing on individual methods of the class, instead of having to copy-paste and create a special "test version" of that class.

Overall, I feel that testability is an area where Pig could spend a lot more effort, and it would greatly benefit its wider adoption. Some other tools (Cascading, Cascalog, etc.) advertise testability as one of their important features.

Let me check out Penny... thanks


As for:
> the best scenario is to put a "marker" so that certain variables are stored or
> skipped computation but instead LOADed

I remember there was some discussion on this in the past. Actually, this is not trivial. What would it do if you changed a UDF's internal code, for example? How would it know that it should reprocess instead of load? As far as I remember, some other problems were mentioned as well.

Ruslan


Some testing tips:

1) Parametrize your load/store statements so that if you have to run in Hadoop mode, it's easy to switch to debug inputs/outputs (and debug input/output loaders and storers). It's vastly preferable to test in local mode when possible, since the iterations are so much faster.

2) It's a good thing that PigUnit makes you test small pieces of code! Factor out macros so that you can create unit tests; don't copy and paste code, use macros and the IMPORT statement.

3) Try using mock.Storage (see https://issues.apache.org/jira/browse/PIG-2650) to automatically create inputs and examine outputs in your unit tests, if you are on Pig 0.11.

D
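Tip 1 above might look like the following sketch. The parameter names INPUT and OUTPUT, the paths, and the schema are assumptions made for illustration; `%default` and `-param` are standard Pig parameter substitution.

```pig
-- Defaults point at small debug data, so a plain local run needs no arguments.
-- All paths are hypothetical.
%default INPUT  'sample/events.tsv'
%default OUTPUT 'tmp/debug/clicks_per_user'

events  = LOAD '$INPUT' USING PigStorage('\t')
          AS (user_id:chararray, action:chararray, ts:long);
clicks  = FILTER events BY action == 'click';
grouped = GROUP clicks BY user_id;
counts  = FOREACH grouped GENERATE group AS user_id, COUNT(clicks) AS n_clicks;

STORE counts INTO '$OUTPUT' USING PigStorage('\t');
```

With a script like this saved as, say, clicks.pig, `pig -x local clicks.pig` would iterate on the sample, while `pig -param INPUT=/data/events -param OUTPUT=/out/clicks_per_user clicks.pig` would run the same logic against the full data without editing the script.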

OK, I found this practice to be useful: I divide my code into sections, each section implemented as a macro.

Then I debug each macro separately. At the end of each macro, I manually write its output vars into tmp storage. Then, for each macro, I write a corresponding "***_fake.pig" macro, which has the same signature but populates the same return vars by loading them from the tmp storage.

Then, after I am done with one section, I swap out the IMPORT statement to import the ***_fake.pig script instead, so that the same computation is not done again.

On Tue, Oct 23, 2012 at 11:11 AM, Yang <[EMAIL PROTECTED]> wrote:

> nice, thanks
>
> macros and mock.Storage() are both new to me, I believe it will help a lot
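The macro-swap practice described above could be sketched roughly as follows. The block shows three separate files together, marked by comments; the file names, the macro name count_by_user, the paths, and the schemas are all hypothetical, and the thread does not show the actual macros.

```pig
-- ===== sections/count_by_user.pig (real section) =====
-- Computes the result and checkpoints it so that the fake macro can replay it.
DEFINE count_by_user(events) RETURNS counts {
    grouped = GROUP $events BY user_id;
    $counts = FOREACH grouped GENERATE group AS user_id, COUNT($events) AS n_events;
    STORE $counts INTO 'tmp/debug/count_by_user' USING PigStorage('\t');
};

-- ===== sections/count_by_user_fake.pig (same signature, no computation) =====
-- The input relation is ignored; the return var is loaded from tmp storage.
DEFINE count_by_user(events) RETURNS counts {
    $counts = LOAD 'tmp/debug/count_by_user' USING PigStorage('\t')
              AS (user_id:chararray, n_events:long);
};

-- ===== main.pig (driver) =====
-- While debugging this section:
IMPORT 'sections/count_by_user.pig';
-- Once it is verified, replace the line above with:
-- IMPORT 'sections/count_by_user_fake.pig';

events = LOAD 'data/events.tsv' USING PigStorage('\t')
         AS (user_id:chararray, action:chararray, ts:long);
counts = count_by_user(events);
top_users = ORDER counts BY n_events DESC;
STORE top_users INTO 'out/top_users' USING PigStorage('\t');
```

Two caveats apply to a setup like this. The checkpoint directory has to be removed before the real macro is run again, since Pig refuses to write into an existing output location. And, as noted earlier in the thread, nothing invalidates the checkpoint automatically: after changing a UDF or any upstream logic, the real macro has to be re-imported by hand.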
