Summary:
This paper describes the design and implementation of a group of classes representing Statistical Distributions. The classes are intended to fill a void in the existing ITK Statistics Framework.
Open Science:
The paper fully adheres to the principles and practices of Open Science. It provides a clear description of the problem, it provides the source code of the implementation, as well as tests that verify the correct behavior of the implementation. The material is suficient for any reader to replicate with minimal effort the work described by the author.
Reproducibility:
This reviewer downloaded the source code, made a code review of it, and build the code and its tests. The code was built without any problem under cygwing with gcc 3.4. No modifications were needed in the code or the CMake configuration files. All the test passed without any problem.
A plot of the distributions would have been a nice addendum to the tests, but not totally necessary given that the distributions were already sampled in multiple points during the testing phase. The test verify the precision of the computation, which provides a lot of confidence on the correct implementation of the computation.
Use of Open Source Software:
The author based his implementation on the existing Statistical Framework of the Insight Toolkit. The source code of his contribution is also available in open source and it is intended to be included in ITK.
The final computation of some distributions is using the netlib library. The author did a very nice job on analyzing the issues that this library have with multi-threading, due to the fact that the library was translated from Fortran into C. This section of the paper should be highlighted, and in fact, given its relevance for the rest of the Insight Toolkit, it could probably be a paper in itself. It was particularly interesting the discussion about the computation of mathematicl precision on every platform.
Open Source Contributions:
The author provides the source code of his contribution. It already follows most of the coding style of the Insight Toolkit.
The code is usable without any changes. Although the reviewer has some reservations on the naming of methods (see below in Coding Style).
The paper presents a clear and didactic description of how the classes are intended to be used.
This reviewer was able to use the code in a matter of minutes (almost seconds).
Code Quality:
Coding Style:
The Methods have acronyms. This contradicts the standards coding guidelines.
Methods such as EvaluatePDF() should be named EvaluateProbabilityDensityFunction().
The code seems to be portable although this reviewer only tried it on Cygwin with gcc 3.4, on a single processor machine.
Applicability to other problems:
This work is quite important since it fills a fundamental void in the Statistical Framework of the Insight Toolkit. Given than these classes provide basic functionalities for computing statistical distributions, they will certainly be used all accross the toolkit, and in many medical image applications.
Suggestions for future work:
The Distributions provided by the author are intended for single value distributions. However, the rest of the Statistical framework in ITK is based on the concept of multi-variate statistics. It seems to be convenient to create variationos of these distributions that are applicable to single-value and to multi-valued situations. This is already proposed by the author in the Doxygen documentation of the classes.
Requests for additional information from authors:
[Did you find that information was missing from the paper? Maybe parameters for running the tests? Maybe some images were missing? Would you like to get more details on how the diagrams, or plots were generated?]
Additional Comments:
A) The following statement in the Doxygen documentation of the ProbabilityDistribution is unclear:
" ProbabilityDistribution also defines an abstract interface for setting parameters of distribution (mean/variance for a Gaussian, degrees of freedom for Student-t, etc.). "
It leads to believe that the ProbabilityDistribution is offering the union of the methods available in the subclasses. Which is not the case.
B) It is questionable whether the methods in the base class : ProbabilityDistribution should return ad-hoc default values, or instead they should be abstract. For example:
* HasMean
* HasVariance
* GetMean
* GetVariance
C) Operator assignement and Copy constructor declarations are missing from all the ProbabilityDistribution class. They should be declared private, and should not be implemented. This is part of the API of any class using SmartPointers.
D) It is questionable whether the parametric distributions should have the parameters to be member variables. The approach of requiring to pass the parameters along with the X value to be computed do not allow to use a generic ProbabilityDensity function and use polymorphism. In the current implementation, this would only be possible by using the "cannonical" parameters. Since polymorphism is not available in this current implementation, there is no use either for making these classes derive from the itk::Object and to use SmartPointers. Without polymorphism there is not much advantage in using this classes via pointers, instead they should simply be created as Functors that can be fully inlined at compilation time.
E) The naming of the functions "PDF()" and "EvaluatePDF()" appears to be redundant and confusing. A better naming could be to replace PDF() with
EvaluateCannonicalProbabilityDensityFunction()
or
EvaluateNormalizedProbabilityDensithyFunction()
while EvaluatePDF() should be named:
EvaluateProbabilityDensityFunction()
In any case it seems that this class is providing services with two different paradigms. In one, it is being used as a mathematical calculator, when the static methods are used. On the other, it is used as a modular component that would be plugged into a framework. It seems that those two roles should be provided by two independent classes, and that the second role could actually delegate the computation to the first role.
F) The constructor of the TDistribution declare a number of constants. They could take advantage of the "const" keyword in order to clarify their intent.
Comment by Jim Miller: Polymorphism

I think PDF, CDF are common enough to warrent abbreviation. We have a few accepted abbreviations in ITK already.
HasMean()/GetMean()/etc. could be made abstract. I considered having GetMean() default to return quiet_NaN().
For the rest of the discussion, let's outline some goals of the Distribution library:
1. Allow an algorithm developer to interact with a distribution in its natural parameterization. That is, if I know my distribution is parameterized by two parameters (mean, variance), I should be able to use an API that has mean/variance specified. This could be via SetMean()/SetVariance() methods or by a Evaluate*(x, mean, variance) method. If my distribution is parameterized by degrees of freedom and not by mean and variance, then I should be able use an API that references degrees of freedom, Evaluate*(x, degreesOfFreedom).
2. Allow distributions to be evaluated without an instance. This is critical for supporting algorithms that need to evaluate distributions within inner loops (where each evaluation may have different parameterizations). The PDF() methods allow the instance to be evaluated without an instance, while the EvaluatePDF() methods are meant to be evaluated from an instance. These names were chosen (after much debate) because the former mimics other statistical packages and the latter mimics ITK.
3. Allow distributions to be evaluated with an instance but without having to specify the parameters on each evaluation. This avoids the overhead of passing parameters (even by reference) when the distribution parameters are know a priori. It also allows the parameters for a distribution to be preconfigured, so an algorithm that needs to evaluate a pdf may know nothing about the distributions parameterization.
4. The simplicity of univariate distributions should not be compromised in order to support multivariate distributions.
Where possible, the Distribution library attempts to provide an API familar to users of other statistical packages.
To make the distributions truly polymorphic, the distributions could have SetParameters(p)/GetParameters() methods that take/return a vector for doubles. (I really do not want to allow the users to prescribe floats inplace of doubles). The parameterizations of the distributions could be stored in a parameter vector. Methods like TDistribution::GetDegreesOfFreedom() would know what index into that parameter vector to access. Since we are doing this earlier in the design than the Transform hierachy, separate ivars could be avoided and the parameter vector could be the first class representation or the parameters.
The API could be become:
EvaluatePDF(x) - evaluate the pdf at the cached parameter settings
EvaluatePDF(x, p) - evaluate the pdf using the parameters p
and distributions could be free to provide methods like
EvaluatePDF(x, mean, variance) - for distributions parameterized by a mean and variance
operator=() and copy constructors with missing implementations was indeed an oversight.