I have been thinking about using Pig statistics for estimating the # of reducers. Though the current heuristic approach works fine, it requires an admin or the programmer to understand what the best number would be for the job. We are seeing a large number of jobs over-utilizing resources, and there is obviously no default number that works well for all kinds of Pig scripts. A few non-technical users find it difficult to estimate the best # for their jobs. It would be great if we could use stats from previous runs of a job to set the number of reducers for future runs.

This would be a nice feature for jobs running in production, where the job or the dataset size does not fluctuate a huge deal.

1. Set a config param in the script:
   set script.unique.id prashant.1111222111.demo_script
2. If the above is not set, we fall back on the current implementation.
3. If the above is set:
   - At the end of the job, persist PigStats (namely Reduce Shuffle Bytes) to the FS (hdfs, local, s3...). The file would be "${script.unique.id}_YYYYMMDDHHmmss" - let's call this stats_on_hdfs.
   - Read "stats_on_hdfs" for previous runs and, based on the number of such stats to read (set via script.reducer.estimation.num.stats), calculate an average number of reducers for the current run.
   - If no stats_on_hdfs exists, we fall back on the current implementation.
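To make step 3 concrete, here is a rough sketch of what I have in mind. Everything below is hypothetical - ReducerStatsStore and its methods are made-up names, not existing Pig APIs; the only real Pig setting referenced is pig.exec.reducers.bytes.per.reducer (the bytes-per-reducer target the caller would pass in):

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: persists one stat per run and averages the
// last N runs to size the reduce phase of the next run.
public class ReducerStatsStore {

    private final Path statsDir;   // e.g. /pig/stats/, holding one file per run
    private final FileSystem fs;

    public ReducerStatsStore(Configuration conf, Path statsDir) throws IOException {
        this.fs = statsDir.getFileSystem(conf);
        this.statsDir = statsDir;
    }

    // At the end of a run, write reduce shuffle bytes to
    // "${script.unique.id}_YYYYMMDDHHmmss".
    public void persist(String scriptId, long reduceShuffleBytes) throws IOException {
        String ts = new java.text.SimpleDateFormat("yyyyMMddHHmmss")
                .format(new java.util.Date());
        Path file = new Path(statsDir, scriptId + "_" + ts);
        try (FSDataOutputStream out = fs.create(file, false)) {
            out.writeLong(reduceShuffleBytes);
        }
    }

    // Read the last numStats runs and derive a reducer count from the
    // average shuffle bytes and a bytes-per-reducer target (e.g. the
    // value of pig.exec.reducers.bytes.per.reducer).
    public int estimateReducers(int numStats, long bytesPerReducer) throws IOException {
        if (!fs.exists(statsDir)) {
            return -1; // no history: caller falls back to the current implementation
        }
        FileStatus[] files = fs.listStatus(statsDir);
        if (files == null || files.length == 0) {
            return -1;
        }
        // Newest first, based on the timestamp suffix in the file name.
        Arrays.sort(files,
                Comparator.comparing((FileStatus f) -> f.getPath().getName()).reversed());

        long total = 0;
        int used = 0;
        for (FileStatus f : files) {
            if (used == numStats) break;
            try (FSDataInputStream in = fs.open(f.getPath())) {
                total += in.readLong();
            }
            used++;
        }
        long avgShuffleBytes = total / used;
        return (int) Math.max(1, avgShuffleBytes / bytesPerReducer);
    }
}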

It would be advisable not to retain these stats for too long, and Pig can make sure to clean up old files that are no longer required.
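For the cleanup, something like this (again hypothetical, an extra method on the ReducerStatsStore sketch above; maxAgeMs is illustrative) could run at the end of each job:

public void expireOldStats(long maxAgeMs) throws IOException {
    // Drop stats files older than maxAgeMs so history doesn't grow unbounded.
    long cutoff = System.currentTimeMillis() - maxAgeMs;
    if (!fs.exists(statsDir)) {
        return;
    }
    for (FileStatus f : fs.listStatus(statsDir)) {
        if (f.getModificationTime() < cutoff) {
            fs.delete(f.getPath(), false); // stats are flat files, non-recursive delete
        }
    }
}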

We do basically what you're describing. Each of our scripts has a logical name which defines the workflow. For each job in the workflow we persist the job stats, counters and conf in HBase via an implementation of PigProgressNotificationListener. We can then correlate jobs in a run of the workflow together based on the pig.script.start.time and pig.job.start.time properties. We use the logical plan script signature to determine whether the script version has changed.
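Roughly, the shape of the listener looks like this (sketch only - the table name, row key and columns here are illustrative, not our actual schema, and the remaining callbacks of the interface are elided):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.pig.tools.pigstats.JobStats;
import org.apache.pig.tools.pigstats.PigProgressNotificationListener;

// Abstract so the other PigProgressNotificationListener callbacks
// (launchStartedNotification, jobStartedNotification, ...) can be
// omitted from this sketch; a real listener must implement them all.
public abstract class HBaseStatsListener implements PigProgressNotificationListener {

    private final HTable table;

    public HBaseStatsListener(Configuration conf) throws IOException {
        this.table = new HTable(conf, "pig_job_stats"); // assumed table name
    }

    @Override
    public void jobFinishedNotification(String scriptId, JobStats stats) {
        try {
            // Row key groups jobs of the same workflow run; the workflow
            // name and start-time properties would come from the job conf.
            String rowKey = scriptId + "/" + stats.getJobId();
            Put put = new Put(Bytes.toBytes(rowKey));
            put.add(Bytes.toBytes("stats"), Bytes.toBytes("maps"),
                    Bytes.toBytes(stats.getNumberMaps()));
            put.add(Bytes.toBytes("stats"), Bytes.toBytes("reduces"),
                    Bytes.toBytes(stats.getNumberReduces()));
            table.put(put);
        } catch (IOException e) {
            // Stats persistence is best-effort; never fail the Pig job for it.
        }
    }
}

A listener like this can be registered with the pig.notification.listener property.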

During job execution we query the service in an impl of PigReducerEstimator for matching workflows.

One simple estimation algo is to multiply Pig's default estimated reducers (which are based on mapper input bytes) by the ratio of mapper output bytes over mapper input bytes of previous runs. The same could also be done with slot time, but we haven't tried that yet.
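As a sketch, the adjustment itself is tiny. The inputs are assumed to be looked up by the surrounding PigReducerEstimator impl: defaultEstimate from Pig's input-size-based heuristic, and the byte counts averaged over matching historical runs:

// Sketch of the ratio adjustment described above; all inputs supplied by caller.
public final class RatioReducerEstimate {

    private RatioReducerEstimate() {}

    public static int adjust(int defaultEstimate,
                             long histMapOutputBytes,
                             long histMapInputBytes) {
        if (histMapInputBytes <= 0) {
            return defaultEstimate; // no usable history: keep Pig's estimate
        }
        // Scale the input-size-based estimate by how much the mappers
        // expanded or shrank the data in previous runs.
        double ratio = (double) histMapOutputBytes / (double) histMapInputBytes;
        return Math.max(1, (int) Math.round(defaultEstimate * ratio));
    }
}

For example, if Pig's heuristic suggests 100 reducers but previous runs show the mappers emitting only a tenth of their input bytes, this scales the estimate down to 10.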

We plan to open source parts of this at some point.

Awesome! It would be good to have a flat-file based impl as there will probably be a lot of Pig users who don't have an HBase instance set up for stats persistence. Let me know if I can help in any way.

Is there a timeframe you are looking at for open-sourcing this?


You could store stats for the last X runs and take an average for the next run.


On Dec 6, 2012, at 8:09 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> How would flat files work? The data needs to be updated by every pig run.
