Add a storeless version of Percentile

Details

Description

The Percentile class can handle only in-memory data.
It would be interesting to use an on-line algorithm to estimate quantiles as a storeless statistic.
An example of such an algorithm is the exponentially weighted stochastic approximation described in the 2000 paper by Fei Chen, Diane Lambert, and José C. Pinheiro, "Incremental Quantile Estimation for Massive Tracking", which can be retrieved from CiteSeerX at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580.
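A minimal sketch of the storeless idea, in the spirit of (but much simpler than) the CLP scheme: keep a single running estimate and nudge it up or down on each observation. The class name, the constant step size, and the update rule below are illustrative; this is not the paper's exponentially weighted algorithm.

```java
import java.util.Random;

/**
 * Toy storeless quantile estimator using a constant-step stochastic
 * approximation. Illustrative only: the CLP paper uses an exponentially
 * weighted scheme with adaptive steps, which this sketch does not reproduce.
 */
public class StorelessQuantile {
    private final double p;      // target quantile in [0, 1]
    private final double step;   // constant learning rate
    private double q;            // current estimate
    private long n;              // observations seen

    public StorelessQuantile(double p, double step) {
        this.p = p;
        this.step = step;
    }

    /** O(1) update per observation; no data is stored. */
    public void increment(double x) {
        if (n++ == 0) {
            q = x;               // initialize at the first observation
            return;
        }
        // Move up with step*p when x is above the estimate, down with
        // step*(1-p) otherwise; the zero-drift point is F(q) = p, i.e.
        // the estimate equilibrates near the p-quantile.
        q += (x > q) ? step * p : -step * (1 - p);
    }

    public double getResult() {
        return q;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        StorelessQuantile median = new StorelessQuantile(0.5, 0.01);
        for (int i = 0; i < 100000; i++) {
            median.increment(rng.nextDouble());   // Uniform(0,1): true median 0.5
        }
        System.out.println("median estimate = " + median.getResult());
    }
}
```

With a constant step the estimate fluctuates around the true quantile instead of converging, which is why real algorithms (CLP, Robbins-Monro variants) shrink or adapt the step over time.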


Phil Steitz
added a comment - 13/Jun/13 17:57 Another algorithm to consider is P-square: http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
Also, as a special case, I have a Remedian implementation, following http://web.ipac.caltech.edu/staff/fmasci/home/statistics_refs/Remedian.pdf, that I was going to propose for inclusion in [math]. I still have some work to do on the tests, but I could do that fairly quickly if others are OK adding it. I understand if consensus is to just implement a general solution.
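The Remedian is simple enough to sketch directly: k buffers of odd size b, where each full buffer forwards its median to the next level, giving O(b*k) memory for up to b^k observations. The sketch below is a simplified illustration (the paper's weighted-median treatment of partially filled buffers is omitted, and all names are made up):

```java
import java.util.Arrays;

/**
 * Toy Remedian: k buffers of odd size b; each full buffer forwards its
 * median to the next level. Simplified: the final estimate is the median
 * of the deepest non-empty buffer, ignoring the weighted-median refinement
 * the paper describes for partially filled buffers.
 */
public class Remedian {
    private final int b;
    private final double[][] buf;
    private final int[] fill;

    public Remedian(int b, int k) {
        this.b = b;
        this.buf = new double[k][b];
        this.fill = new int[k];
    }

    public void increment(double x) {
        insert(0, x);
    }

    private void insert(int level, double x) {
        buf[level][fill[level]++] = x;
        if (fill[level] == b) {
            double m = medianOf(buf[level], b);
            fill[level] = 0;
            if (level + 1 == buf.length) {
                throw new IllegalStateException("capacity b^k exceeded");
            }
            insert(level + 1, m);   // forward the batch median upward
        }
    }

    public double getResult() {
        for (int level = buf.length - 1; level >= 0; level--) {
            if (fill[level] > 0) {
                return medianOf(buf[level], fill[level]);
            }
        }
        throw new IllegalStateException("no data");
    }

    private static double medianOf(double[] a, int n) {
        double[] c = Arrays.copyOf(a, n);
        Arrays.sort(c);
        return c[n / 2];
    }

    public static void main(String[] args) {
        Remedian r = new Remedian(11, 4);   // up to 11^4 = 14641 values
        for (int i = 0; i < 14640; i++) {
            r.increment(i);                 // 0..14639, true median 7319.5
        }
        System.out.println("remedian = " + r.getResult());
    }
}
```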


Ajo Fod
added a comment - 17/Jun/13 18:03 Any solution is better than the current situation. However, there are two desirable features.
Reducibility: Since large datasets are typically associated with parallel environments, the ideal algorithm would be map/reduce-able.
Order sensitivity: What if one passes the quantiles of, say, an exponential distribution to the estimator in sorted order rather than randomly? How big would the quantile error be? Is it significantly larger than in the iid case?
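On the second point, ordering can matter a great deal for naive streaming estimators. The toy demo below (the constant-step update rule is illustrative, not any specific published algorithm) feeds the same uniform sample to a simple storeless median estimator twice, once shuffled and once sorted; the sorted pass ends up chasing the moving front of the stream and lands far from the true median.

```java
import java.util.Arrays;
import java.util.Random;

/**
 * Demonstrates the ordered-input concern: a simple constant-step storeless
 * median estimator is accurate on shuffled (i.i.d.-like) input but fails
 * badly when the same values arrive sorted, because the estimate tracks
 * the moving front of the stream instead of the overall median.
 */
public class OrderSensitivityDemo {

    /** Constant-step stochastic-approximation median estimate (illustrative). */
    public static double estimateMedian(double[] data, double step) {
        double q = data[0];
        for (int i = 1; i < data.length; i++) {
            q += (data[i] > q) ? step * 0.5 : -step * 0.5;
        }
        return q;
    }

    public static void main(String[] args) {
        int n = 20000;
        Random rng = new Random(123);
        double[] shuffled = new double[n];
        for (int i = 0; i < n; i++) shuffled[i] = rng.nextDouble();
        double[] sorted = shuffled.clone();
        Arrays.sort(sorted);

        double errShuffled = Math.abs(estimateMedian(shuffled, 0.01) - 0.5);
        double errSorted = Math.abs(estimateMedian(sorted, 0.01) - 0.5);
        System.out.println("shuffled error = " + errShuffled);
        System.out.println("sorted error   = " + errSorted);
    }
}
```

On the sorted pass the estimate ends up near the maximum rather than the median, so the error is on the order of half the range.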

Ajo Fod
added a comment - 21/Jun/13 20:34 Just in case anyone is interested: the Colt implementation of the class here:
http://acs.lbl.gov/software/colt/api/hep/aida/bin/QuantileBin1D.html
is based on the paper by Manku: http://www.cs.umd.edu/~samir/498/manku.pdf


Ted Dunning
added a comment - 23/Mar/14 16:38 The t-digest is also available.
You can read more at
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true
This paper contains accuracy comparisons against the DataFu StreamingQuantile implementation of GK01 and against the Q-digest algorithm. T-digest dominates both in terms of space and accuracy. I haven't tested for speed yet since the t-digest implementation is improving rapidly. T-digest is also a very simple algorithm conceptually, which helps with accurate implementation. T-digest is particularly good at maintaining accuracy for extreme quantiles and for highly skewed distributions. GK01 does well on skewed distributions, but maintains constant absolute accuracy rather than relative accuracy. Q-digest depends on integer representations and thus fails badly on sufficiently skewed distributions.
The library involved is available via Maven and is Apache-licensed. Apache Commons Math has a "no dependency" policy, which might mean that pulling in the code would be a better option than simply linking to this.
The state of the art before t-digest is generally considered to be either Greenwald and Khanna's algorithm GK01 (what DataFu uses) or the Q-digest (what stream-lib used before they adopted t-digest). References are in the paper above.
In case it isn't obvious, source code is available on github at
https://github.com/tdunning/t-digest
The p^2 algorithm is actually quite old and is far from the state of the art.


Venkatesha Murthy TS
added a comment - 23/Mar/14 18:24 Here is the issue I created for p^2 (https://issues.apache.org/jira/browse/MATH-1112).
Hi all,
Thanks for mentioning this old issue. By the way, would it help to have a common strategy interface to accommodate these various algorithms?
Either way, I feel we should have this ability soon.
Please let me know if I could still submit my implementation for review.
thanks
vmurthy
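For readers following along, the P^2 (P-square) algorithm maintains five marker heights approximating the minimum, the p/2-, p-, and (1+p)/2-quantiles, and the maximum, and adjusts them with parabolic interpolation in O(1) per observation. Below is a minimal illustrative implementation of the published algorithm (names and structure are mine; this is not the patch under review):

```java
import java.util.Arrays;
import java.util.Random;

/**
 * Minimal P-square sketch for a single quantile p: five marker heights,
 * actual and desired marker positions, parabolic adjustment with a linear
 * fallback. Illustrative implementation of the published algorithm.
 */
public class PSquareSketch {
    private final double p;
    private final double[] q = new double[5];   // marker heights
    private final double[] n = new double[5];   // actual marker positions
    private final double[] np = new double[5];  // desired marker positions
    private final double[] dn = new double[5];  // desired-position increments
    private final double[] initial = new double[5];
    private long count;

    public PSquareSketch(double p) {
        this.p = p;
        dn[1] = p / 2; dn[2] = p; dn[3] = (1 + p) / 2; dn[4] = 1;
    }

    public void increment(double x) {
        if (count < 5) {                        // bootstrap on the first five values
            initial[(int) count++] = x;
            if (count == 5) {
                Arrays.sort(initial);
                for (int i = 0; i < 5; i++) { q[i] = initial[i]; n[i] = i + 1; }
                np[0] = 1; np[1] = 1 + 2 * p; np[2] = 1 + 4 * p;
                np[3] = 3 + 2 * p; np[4] = 5;
            }
            return;
        }
        count++;
        int k;                                  // cell containing x
        if (x < q[0])       { q[0] = x; k = 0; }
        else if (x >= q[4]) { q[4] = x; k = 3; }
        else { k = 0; while (x >= q[k + 1]) k++; }
        for (int i = k + 1; i < 5; i++) n[i]++;
        for (int i = 0; i < 5; i++) np[i] += dn[i];
        for (int i = 1; i <= 3; i++) {          // adjust interior markers
            double d = np[i] - n[i];
            if ((d >= 1 && n[i + 1] - n[i] > 1) || (d <= -1 && n[i - 1] - n[i] < -1)) {
                int s = d >= 0 ? 1 : -1;
                double qp = parabolic(i, s);
                q[i] = (q[i - 1] < qp && qp < q[i + 1]) ? qp : linear(i, s);
                n[i] += s;
            }
        }
    }

    private double parabolic(int i, int d) {
        return q[i] + d / (n[i + 1] - n[i - 1])
            * ((n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
             + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]));
    }

    private double linear(int i, int d) {
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
    }

    public double getResult() {
        if (count < 5) {                        // fall back to exact for tiny samples
            double[] c = Arrays.copyOf(initial, (int) count);
            Arrays.sort(c);
            return c[(int) Math.max(0, Math.ceil(count * p) - 1)];
        }
        return q[2];                            // middle marker estimates the p-quantile
    }

    public static void main(String[] args) {
        Random rng = new Random(5);
        PSquareSketch median = new PSquareSketch(0.5);
        for (int i = 0; i < 50000; i++) median.increment(rng.nextDouble());
        System.out.println("P2 median estimate = " + median.getResult());
    }
}
```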

Ted Dunning
added a comment - 23/Mar/14 19:17
If you already have an implementation, feel free to post it.
Best, however, would be to have a stand-alone implementation that does accuracy and speed comparison in the style of the tests in the t-digest code.
Perhaps the shortest path for you would be to fork t-digest, put your implementation into those tests and we can look at accuracy and speed.

Venkatesha Murthy TS
added a comment - 13/Apr/14 06:04 I have placed a first snapshot into Sonatype and added tests similar to those available for Percentile.
Please guide me on the next steps.
https://issues.sonatype.org/browse/OSSRH-9364

Venkatesha Murthy TS
added a comment - 13/Apr/14 19:54 Please ignore files such as build, unversioned_files, and the file named patch itself in the attachment; the only files of interest are the Java files (one in src/main and another in src/test).


Phil Steitz
added a comment - 19/Apr/14 16:22 Thanks for the patch, Murthy!
Sorry for the delay reviewing. I am not finished reviewing the implementation, but need to confirm a couple of things before plowing ahead with it:
0) You are OK removing the @author tag. We don't use those in Apache Commons.
1) The code in the patch is a "clean room" implementation based on the articles, not a port / translation from another source
2) You can contribute this code under the terms of the Apache Contributor's License Agreement (https://www.apache.org/licenses/icla.txt)
CheckStyle and Findbugs complain about quite a few things in the source. These will need to be fixed or explained prior to commit. If you run "mvn site" from trunk you can see the reports. An updated patch with this stuff cleaned up would save us some time.


Ted Dunning
added a comment - 19/Apr/14 21:00 I just read the proposed code.
The tests are very good when it comes to checking for boundary conditions and illegal input, but they do not test several functional aspects.
In particular, it would be good to have tests for the following conditions:
a) extreme quantiles such as the 99.999%-ile or the 0.00001%-ile
b) highly skewed input distributions such as Gamma(0.1, 0.1), especially with low quantiles like the 0.1%-ile
This algorithm is essentially identical to the one that Mahout used to use. We dropped it in favor of t-digest because it was only able to compute one percentile per set of markers and because accuracy for skewed distributions was relatively marginal.
The t-digest and Greenwald Khanna algorithms (let's call it GK01) both handle skewed distributions much more accurately than the Chen, Lambert, Pinheiro (let's call it CLP) algorithm. A particular virtue of GK01 and t-digest is that they allow any quantile to be computed without having to pre-specify which quantiles you want. It is common to need many quantiles from the same set of data and some applications require arbitrary quantiles. GK01 and t-digest also allow inverse CDF calculation (where you get to find out the quantile value for a particular x). GK01 is the algorithm used by Sawzall and (I think) Dremel. The t-digest algorithm has been adopted by Mahout, Streamlib and Elastic Search.
The particular virtue of t-digest over GK01 is that it is several orders of magnitude more accurate for extreme quantiles. There are three well-tested implementations of t-digest available in both source and as a maven artifact. All the code is ASL 2 licensed.
See https://github.com/tdunning/t-digest for more information.
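The accuracy-at-the-extremes behavior described here comes from the t-digest size rule: a centroid sitting at quantile q is allowed to absorb only on the order of 4\*delta\*n\*q(1-q) points, so bins shrink to single points near q = 0 and q = 1. The toy below applies that bound to pre-sorted data in a single batch pass; it is not the streaming t-digest algorithm, and all names are made up.

```java
import java.util.Arrays;

/**
 * Toy, batch-only illustration of the t-digest centroid size rule: a bin
 * covering quantile q may hold at most about 4 * delta * n * q(1-q) points,
 * so bins collapse to single points near the extremes. Not the streaming
 * t-digest algorithm, just its size bound applied to pre-sorted data.
 */
public class SizeBoundDemo {
    public final double[] means;
    public final long[] weights;
    public final int total;

    public SizeBoundDemo(double[] data, double delta) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double[] m = new double[n];
        long[] w = new long[n];
        int bins = 0;
        int i = 0;
        while (i < n) {
            int start = i;
            double sum = 0.0;
            while (i < n) {
                double q = (i + 0.5) / n;   // quantile of the next point
                long bound = Math.max(1L, (long) (4.0 * delta * n * q * (1.0 - q)));
                if (i - start >= bound) break;   // bin is full for this q
                sum += sorted[i++];
            }
            m[bins] = sum / (i - start);
            w[bins++] = i - start;
        }
        means = Arrays.copyOf(m, bins);
        weights = Arrays.copyOf(w, bins);
        total = n;
    }

    /** Piecewise-constant inverse CDF over the bin means. */
    public double quantile(double q) {
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (int j = 0; j < means.length; j++) {
            seen += weights[j];
            if (seen >= target) return means[j];
        }
        return means[means.length - 1];
    }

    public static void main(String[] args) {
        int n = 10000;
        double[] data = new double[n];
        for (int i = 0; i < n; i++) data[i] = i / (double) (n - 1);
        SizeBoundDemo d = new SizeBoundDemo(data, 0.01);
        System.out.println("bins     = " + d.means.length);
        System.out.println("q(0.5)   ~ " + d.quantile(0.5));
        System.out.println("q(0.999) ~ " + d.quantile(0.999));
    }
}
```

Because the bound collapses to 1 near q = 0 and q = 1, extreme quantiles are answered from single retained points, while mid-range bins stay large to keep memory bounded.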


Venkatesha Murthy TS
added a comment - 19/Apr/14 22:27 Ted,
I understand the point of view; however, just to clarify:
a) While I have modeled this implementation on the existing Percentile class and as a StorelessUnivariateStatistic, I agree it is approximate for smaller sets, though it should improve for larger ones. I could add a comment that this is an approximate technique.
b) GK, t-digest, Q-digest, BinMedian, BinApprox, etc. could all be implementation choices offered to the user, rather than sticking to any particular algorithm.
Please let me know.
Phil,
I will also clear the CheckStyle and FindBugs issues and resubmit with tests added.


Ted Dunning
added a comment - 20/Apr/14 03:05 Murthy,
I think that you didn't get the point.
All of these algorithms (except those retaining all of the data) are approximate. That isn't news, nor does it differentiate the algorithms.
The questions are:
a) is only a single quantile computed and is that quantile flexible?
b) how accurate is the quantile calculation?
c) how does the accuracy vary as a function of quantile, especially near 0 or 1?
d) how long does it take to insert a new point?
My assertion is that t-digest dominates all of these other options on all of these questions except possibly for (d) where it roughly matches them. I am not sure I see the point of implementing these other algorithms if they provide no advantage.
To be specific:
GK - relatively fast, and allows computation of multiple quantiles. Memory usage for a large number of quantiles is large, and accuracy is lower than t-digest when controlling for memory usage. Accuracy at quantiles near 0 or 1 is several orders of magnitude worse. GK is not particularly susceptible to skewed distributions.
Q-digest - only works on an integer domain, which makes it work poorly for skewed distributions. It has accuracy comparable to GK when quantization is not a factor. Memory usage is quite comparable to t-digest, but accuracy is noticeably worse, and near q = 0 or 1 it is several orders of magnitude worse.
BinMedian and BinApprox - only compute the median, and fail entirely for skewed distributions where the median is outside the range [mean - sd, mean + sd]. Accuracy is comparable to the other algorithms, but since no quantiles other than the median are computed, they don't even compete.


Ted Dunning
added a comment - 20/Apr/14 03:09 As a clarification, the median and the mean are never separated by more than a single standard deviation. The failure of binapprox on skewed data occurs because essentially all of the data can be in the range around the mean. A secondary failure is on sorted data.
For example, for Gamma(0.1, 0.1), the mean is 1, the sd is 3.16, and the median is 0.006. About 93% of the distribution is in the range [mean-sd, mean+sd]
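The claim that mean and median are never more than one standard deviation apart follows from |mean - median| = |E[X - m]| <= E|X - m| <= E|X - mean| <= sd, since the median minimizes E|X - c|. The Gamma(0.1, 0.1) numbers can be checked by Monte Carlo; the sketch below uses the standard Marsaglia-Tsang gamma sampler with the usual shape-boosting trick for shape < 1 (sample size and seed are arbitrary choices, not from the thread).

```java
import java.util.Arrays;
import java.util.Random;

/**
 * Monte Carlo check of the Gamma(shape 0.1, rate 0.1) example: mean 1,
 * sd ~3.16, median ~0.006, and roughly 93% of the mass inside
 * [mean - sd, mean + sd].
 */
public class SkewDemo {

    /** Marsaglia-Tsang sampler for Gamma(shape, 1). */
    public static double sampleGamma(Random rng, double shape) {
        if (shape < 1) {   // boost: Gamma(a) = Gamma(a + 1) * U^(1/a)
            return sampleGamma(rng, shape + 1)
                 * Math.pow(rng.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x, v;
            do {
                x = rng.nextGaussian();
                v = 1.0 + c * x;
            } while (v <= 0);
            v = v * v * v;
            double u = rng.nextDouble();
            if (u < 1.0 - 0.0331 * x * x * x * x) return d * v;
            if (Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) return d * v;
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(2014);
        int n = 200000;
        double shape = 0.1, rate = 0.1;
        double mean = shape / rate;                 // = 1
        double sd = Math.sqrt(shape) / rate;        // = sqrt(10) ~ 3.162
        double[] xs = new double[n];
        double sum = 0;
        int inOneSd = 0;
        for (int i = 0; i < n; i++) {
            xs[i] = sampleGamma(rng, shape) / rate; // rate-parameterized sample
            sum += xs[i];
            if (xs[i] >= mean - sd && xs[i] <= mean + sd) inOneSd++;
        }
        Arrays.sort(xs);
        System.out.println("empirical mean   = " + sum / n);
        System.out.println("empirical median = " + xs[n / 2]);
        System.out.println("within 1 sd      = " + (double) inOneSd / n);
    }
}
```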

Venkatesha Murthy TS
added a comment - 21/Apr/14 05:04 Thanks for the details.
So, are there any plans to get t-digest behind the Apache math3 interfaces, so that it can be substituted for, say, Percentile, or used as a StorelessUnivariateStatistic?


Ted Dunning
added a comment - 21/Apr/14 05:58 There are no plans as far as I know.
You are welcome to jump on it if you like. I am happy to help with advice, but beyond what I have already done in deriving the algorithm and writing the t-digest reference implementation, I have to bow out of real coding for this.


Venkatesha Murthy TS
added a comment - 22/Apr/14 15:51 Hi Phil,
With Ted commenting on the accuracy near the extremes, and t-digest not yet available behind the Math3 interfaces, does it still hold good if I resubmit the patch for the p^2 algorithm with all the necessary changes?
Please let me know.
Also, if there are other problems in Commons that I could take a look at, please point me to them.
Thanks for your help, advice, and review.
Regards
Murthy


Phil Steitz
added a comment - 22/Apr/14 18:07 We are a "do-ocracy" here. Those who actually do the work determine what happens to the code. Ted has a cool new algorithm and implementation that he is willing to grant, but he is not willing to do the work to get it ready to commit. You have implemented a standard algorithm that is close to ready to commit. Complete the cleanup and unless others have technical issues with the code, I will commit it.

Ted Dunning
added a comment - 22/Apr/14 18:15 Venkat,
Phil's comment is exactly correct. I am merely offering code you can use if you like. I am, sadly, not offering to do the work to integrate t-digest into commons math.
This means that you are the one to choose how you would like to go forward.

Please find the revised patch, named psquare-patch. In it I have tried to address the following:
a) better code coverage (>93%) for my class;
b) the PMD, FindBugs, and CheckStyle report issues.


Gilles
added a comment - 16/May/14 11:20 Thanks for your effort up to now. There are, however, some remarks about the coding style, to try to ensure consistency throughout the Commons Math codebase:
You should not use an underscore as a suffix or prefix (for method or variable names)
For constants, use the "final" keyword in the declaration
Do not overuse <p> in comments (especially in protected and private classes)
No initial uppercase letter after @return
Do not use protected fields (i.e. create setters instead)
Conform to the natural rules for capitalization in comments (e.g. no uppercase for the second word in a sentence)
Remove code that was commented-out (and indicated as unnecessary)
Do not start a @param or @throws with the word "is" (or "passed in")
Do not use a single uppercase letter as a variable name
Prefer (moderately) long names for instance fields and methods wherever it would make the code more self-documenting (cf. "ns", "nps", "qs")
Avoid consecutive uppercase letters (e.g. "processADataPoint" should rather be "processDataPoint").
(Personal preference) The first line of a Javadoc comment should not be broken after "@param x"; it is harder to read and needlessly increases the height of the text.
Indicate the purpose of "package" visibility when it is used instead of "private" or "protected".
Fully comment public fields, parameters, constructors and methods: see e.g. the constructor of "Marker", where the reader is referred to the documentation of a private field (which would not appear in the generated HTML)
I can't comment much about the algorithmic side.
A few points I noticed:
The input (quantile) is allowed to be either in [0, 1] or in (1, 100]; this is error-prone and this kind of flexibility is not to be handled at the math level (IMHO). The input should be a primitive "double", not a "Number".
At line 1206, the error message is abused, as it should refer to an index.
At line 1211, use "NullArgumentException" in place of "MathIllegalArgumentException".
At line 1214, do not combine different types of failures, and use "DimensionMismatchException" in place of "MathIllegalArgumentException".


Venkatesha Murthy TS
added a comment - 21/May/14 18:10 I just wanted to add a few points here:
First off, many thanks for the good set of review comments; I have tried to incorporate almost all of them.
The new patch to be used for review is 418-psquare-patch.
a) The earlier reason for using variable names such as n, np, etc. was to stay as close as possible to the variables used in the paper. However, I have now switched to self-describing long names in the new patch.
b) Is there a way to put these rules into CheckStyle or similar tooling so they are automated? It is hard to remember to apply all of them manually every time. Please let me know if such a configuration is available somewhere so I can use it.
c) Do I need to remove the earlier patch file? Having two patch files may confuse. Please let me know.
thanks again
venkat.


Gilles
added a comment - 23/May/14 14:17 There are still things to fix.
Usage of CM exceptions:
throw new NullArgumentException(LocalizedFormats.NULL_NOT_ALLOWED, "estimator" )
must rather be
throw new NullArgumentException();
Indeed the message is the default one, and it takes no argument (hence the string "estimator" is useless).
You use "package" visibility in places where it must be "private" (e.g. "postConstruct").
I don't get why "Marker", "Markers", "PSquareInterpolatorEvaluator", ... are "protected". It is not helpful to indicate that it is for extensibility (as this is obvious); rather you should indicate what customization may make sense. Until an actual need arises (TBD on the ML), all the implementation details should be inside a "black-box" (to allow modifications without breaking compatibility).
IMHO, formatting should be taken care of in a separate class. The "toString" method should not do fancy formatting. (But this issue should probably be raised on the ML.)
No comments in all uppercase, please. This is akin to shouting.
What is the purpose of "toMap"?
Keyword "final" is missing in several places (see e.g. method "estimator(PSquareEstimator)", at line 1035).
Is it really necessary to override "equals" (see e.g. at line 1066)? Since you make strict comparisons of double values, it is unlikely that instances are ever going to be equal...
In Javadoc comment of public methods, do not use a @link to a "private" field or method.
Missing comment at lines 669 and 1368. Yes, everything must be documented.
Citing the paper more than once is redundant. The "href" link should appear in the top class's Javadoc; then it is sufficient (and more readable) to use a short-hand like "P-square".
Do you run the "site" target of maven and make sure that the reports are clean? (Among other things, this runs "CheckStyle" and "FindBugs".)
A general remark: when introducing a new functionality, you should first focus on the correctness of the implementation's main purpose, leaving out everything that is not strictly necessary (cf. formatting, provision for extensibility, ...). The longer the code, the longer it will take to review... Whereas once a solid base is committed, anyone can easily polish it, adding bells and whistles.


Venkatesha Murthy TS
added a comment - 30/May/14 04:04 I have incorporated almost all comments. Please find the latest patch, dated 30-May-2014, named 418-psquare-patch.
Please suggest further changes; I am greatly indebted for the review comments and hope there will be far fewer going forward.
One question I have, though, concerns the major algorithm part, which now sits in private static nested classes: I provide a creation method and expose the object through a minimal protected interface. I was forced to do this in order to get test coverage. Please tell me if this is OK; I have made sure it does not impact the existing public API.
Summary:
-------------
a) Modified the code to use appropriate exceptions, such as NullArgumentException, where required.
b) Turned all inner classes into private static ones, keeping only a thin protected interface. Even this does not impact the existing public surface of PSquaredPercentile.
c) Cleared the FindBugs, CheckStyle, and PMD reports, with about 93% JaCoCo code coverage.
Thanks a ton. Please let me know if we could have a quick chat so that I can quickly understand the context and rationale of any further comments.
Regards
Venkat.

Yes i have run mvn site target and taken care of pmd, checkstyle, findbug and jacoco code coverage

Venkatesha Murthy TS
added a comment - 30/May/14 12:58 Sorry, I meant to ask the question as follows:
The context is: I have modeled the algorithm into two nested static classes (Markers and Marker), as I could not reasonably move all of the logic into one big main class (the main class is PSquarePercentile). Since the bulk of the logic needs to be tested and covered, I am forced to expose it through a protected inner interface (PSquareMarkers, which Markers implements). This interface is instantiated via a creation method (newMarkers) that returns a Markers instance. PSquarePercentile itself uses newMarkers to instantiate Markers, while the class per se does not allow external setting of Markers.
So my thinking has been to model Marker and Markers as separate, clearly delineated entities/classes (in fact private static, as they are really independent of the main class), yet allow sufficient hooks for testing (for instance, to reach more than 90% coverage) by exposing them through one minimal interface (PSquareMarkers). Even so, external setting of markers into PSquarePercentile is not allowed, to preserve compatibility and safety. I am treating it as a black box: no extensibility for the moment, except that clients can look at an interface. In fact, an earlier interface, PSquareEstimator, has been subsumed into the Marker class, so essentially everything lives in these two nested classes.
I am most willing to hear solutions to my problem here; please suggest any other simplifications I could make. I just wanted to share what has been on my mind while doing this. I do sometimes think that if I merged all the nested classes into the main class, it would become one big class (akin to the God-class anti-pattern), which is why I am at a crossroads, unable to decide.
Yes, I have run the mvn site target and taken care of PMD, Checkstyle, FindBugs and JaCoCo code coverage.
I have also:
a) removed the toMap() method,
b) removed all the private javadoc @link parameters,
c) removed all-caps comments,
d) removed upper-case starts of second words in javadoc,
e) turned almost all non-interface methods private,
f) turned all nested classes private static.
Gilles, Phil: please let me know if I could contact you over a messenger such as Yahoo! etc.
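For readers unfamiliar with what the Markers/Marker classes actually compute, a minimal sketch of the P-Square update step (after Jain and Chlamtac's 1985 algorithm, referenced earlier in this issue) follows. The class and method names here are illustrative only, not the actual Commons Math API:

```java
import java.util.Arrays;

// Minimal sketch of the P-Square (P^2) incremental quantile estimator:
// five markers track the min, the p/2-, p- and (1+p)/2-quantiles, and the max.
public class PSquareSketch {
    private final double[] q = new double[5];   // marker heights
    private final double[] n = new double[5];   // actual marker positions
    private final double[] np = new double[5];  // desired marker positions
    private final double[] dn = new double[5];  // desired position increments
    private int count;

    public PSquareSketch(double p) {            // p is the quantile in (0, 1)
        dn[1] = p / 2;
        dn[2] = p;
        dn[3] = (1 + p) / 2;
        dn[4] = 1;
    }

    public void increment(double x) {
        if (count < 5) {                        // bootstrap on the first five values
            q[count++] = x;
            if (count == 5) {
                Arrays.sort(q);
                for (int i = 0; i < 5; i++) {
                    n[i] = i + 1;
                    np[i] = 1 + 4 * dn[i];
                }
            }
            return;
        }
        count++;
        int k;                                  // cell index: q[k] <= x < q[k+1]
        if (x < q[0]) {
            q[0] = x;
            k = 0;
        } else if (x >= q[4]) {
            q[4] = x;
            k = 3;
        } else {
            k = 0;
            while (x >= q[k + 1]) {
                k++;
            }
        }
        for (int i = k + 1; i < 5; i++) {       // markers above the cell shift right
            n[i]++;
        }
        for (int i = 0; i < 5; i++) {
            np[i] += dn[i];
        }
        for (int i = 1; i <= 3; i++) {          // adjust the three middle markers
            double d = np[i] - n[i];
            if ((d >= 1 && n[i + 1] - n[i] > 1) || (d <= -1 && n[i - 1] - n[i] < -1)) {
                int s = d >= 0 ? 1 : -1;
                double parabolic = q[i] + s / (n[i + 1] - n[i - 1])
                        * ((n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
                         + (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]));
                if (q[i - 1] < parabolic && parabolic < q[i + 1]) {
                    q[i] = parabolic;           // parabolic (P^2) prediction
                } else {                        // otherwise fall back to linear
                    q[i] = q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i]);
                }
                n[i] += s;
            }
        }
    }

    /** Current estimate of the p-quantile (assumes at least five values seen). */
    public double getResult() {
        return q[2];
    }
}
```

The middle marker `q[2]` converges to the target quantile in O(1) memory, which is exactly why the marker bookkeeping dominates the implementation and invites its own classes.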


Phil Steitz
added a comment - 02/Jun/14 04:11 The last patch looks good to me, modulo a few typos in the javadoc that I can fix in the commit and converting the random data generation to use a fixed seed (we like to do that to avoid random test failures).
I will wait a day or two to see if there is other feedback; otherwise will commit. Thanks for the patch!
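The fixed-seed convention Phil mentions is easy to illustrate. Commons Math tests typically use the project's own generators (e.g. Well19937c), but the idea is the same with the standard library, as in this sketch:

```java
import java.util.Random;

public class FixedSeedDemo {
    // Generate n test values from a fixed seed so every test run
    // sees the identical data and failures are reproducible.
    static double[] testData(int n, long seed) {
        Random rng = new Random(seed);  // fixed seed, not new Random()
        double[] data = new double[n];
        for (int i = 0; i < n; i++) {
            data[i] = rng.nextDouble();
        }
        return data;
    }
}
```

Two calls with the same seed return identical arrays, so an assertion that passed once keeps passing on every build.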

Venkatesha Murthy TS
added a comment - 20/Jun/14 13:38 Hi
Just checking whether there is anything else I need to do here. If things look OK, please let me know when I can check back on the status of the commit.
Thanks
Venkat.


Venkatesha Murthy TS
added a comment - 23/Jun/14 18:19 Hi Phil,
Thanks for committing.
A small change, however, and I have attached the patch against today's sources.
a) The public double quantile() method is a simple getter that just returns the quantile set during construction. So it is not the result; basically it is the p/100 value, which lies within [0.0, 1.0].
b) Also, in the test method testDistribution() of PSquarePercentileTest.java, the fixed-seed random generator needs to be used as well.
Regards
Venkatesh
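The getter semantics being discussed — quantile() returning the p/100 value fixed at construction rather than the running estimate — can be mirrored in a tiny self-contained stub (illustrative only; it is not the actual Commons Math PSquarePercentile class):

```java
// Illustrative stub of the described getter semantics: the constructor takes
// a percentile p in (0, 100], and quantile() returns p/100, not the estimate.
public class QuantileHolder {
    private final double quantile;  // p/100, fixed at construction

    public QuantileHolder(double percentile) {
        this.quantile = percentile / 100.0;
    }

    /** Returns the quantile set at construction, in [0.0, 1.0]. */
    public double quantile() {
        return quantile;
    }
}
```

So a caller constructing the estimator for the median would see quantile() report 0.5 regardless of how much data has been fed in.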


Venkatesha Murthy TS
added a comment - 29/Jun/14 19:14 Hi Phil,
Thanks for taking the patch.
Somehow it initially seemed as though the getResult value was being stored in quantile and returned.
Well, after re-reading the javadoc, it does convey that it is just a getter. I was only trying to make it a bit more explicit by saying that @return returns the desired input quantile set during construction.
I guess we can live without that explicit comment.