[Statistics] Convention when outside support?


Gilles Sadowski-2
Hello.

For all implemented distributions, what convention should be adopted
when methods
 * density(x)
 * logDensity(x)
 * cumulativeProbability(x)
are called with "x" out of the "support" bounds?

Currently some (but not all[1]) are documented to return "NaN".
An alternative could be to throw an exception.

Regards,
Gilles

[1] https://issues.apache.org/jira/projects/MATH/issues/MATH-1503

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [Statistics] Convention when outside support?

Fran Lattanzio
Hi,

I was involved in a similar debate on a different project, and we came to the conclusion that (double -> double) methods in Java should return NaN in the case of invalid arguments, rather than throw Exceptions.

Our reasoning was by analogy with how IEEE 754 floating-point exceptions are handled by Java. Obviously, the definition of a floating-point exception is quite different from a Java exception. But anyway, our question was, how should raising an exception in the floating-point world map to throwing an exception in Java? The core Java libraries effectively behave as if all floating-point traps are disabled*: Overflow results in an infinity, underflow in subnormal/zero, square root of negative returns NaN, etc.

Based on this, we decided that returning NaN is the “best” behavior, since this is what the IEEE spec says to do when the invalid-operation trap is disabled.
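
As a minimal sketch of the default behaviour described above, the following only uses java.lang.Math and plain double arithmetic; the class name is made up for illustration. None of these operations throws.

```java
public class FloatingPointDefaults {
    public static void main(String[] args) {
        // Invalid operation: the result is NaN, no exception is thrown.
        System.out.println(Math.sqrt(-1.0));        // NaN
        // Overflow: the result is an infinity.
        System.out.println(Double.MAX_VALUE * 2.0); // Infinity
        // Underflow: the result is subnormal or zero.
        System.out.println(Double.MIN_VALUE / 2.0); // 0.0
    }
}
```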

Fran.

* = We did discuss having a kind of floating-point signal policy that would change the behavior from returning a default value to throwing a (Java) exception when these floating-point exceptions were detected. But this would be a complex implementation problem, not least because incorporating this into existing numerical libraries would be difficult to impossible.






Re: [Statistics] Convention when outside support?

Alex Herbert
On 29/11/2019 16:48, Gilles Sadowski wrote:

> Hello.
>
> For all implemented distributions, what convention should be adopted
> when methods
>   * density(x)
>   * logDensity(x)
>   * cumulativeProbability(x)
> are called with "x" out of the "support" bounds?
>
> Currently some (but not all[1]) are documented to return "NaN".
> An alternative could be to throw an exception.

The convention in the java.lang.Math class is to return NaN for things
that do not make sense, e.g.

Math.log(-1)
Math.asin(4)

This leaves it as the responsibility of the caller to know when it may
be possible to pass in a bad value and so check the results.

It unfortunately leaves open the issue that not everyone will do that
and so their program can be brought to a stop by presence of NaN values
that may have appeared some way further back in the computation.

Throwing an exception seems to be the only way to preserve the stack
trace of where the computation went wrong.

So either case has merit.
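
To illustrate the propagation problem, here is a small sketch. The density method is a hypothetical stand-in (not the library's code) that returns NaN outside [0, 1]; the point is only that the NaN surfaces in the final sum, with nothing to indicate which call produced it.

```java
public class NaNPropagation {
    /** Hypothetical stand-in: uniform density on [0, 1], NaN outside the support. */
    static double density(double x) {
        return (x >= 0 && x <= 1) ? 1.0 : Double.NaN;
    }

    /** Sums log-densities; a single NaN poisons the whole total. */
    static double logLikelihood(double[] data) {
        double sum = 0;
        for (double x : data) {
            sum += Math.log(density(x));
        }
        return sum;
    }

    public static void main(String[] args) {
        double total = logLikelihood(new double[] {0.2, 1.5, 0.7});
        // The NaN only becomes visible here, far from the call that produced it.
        System.out.println(Double.isNaN(total)); // true
    }
}
```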

What do other languages do? A few seem to return 0 for out of support.

I had a look at Python. Here there is not much consistency using scipy:

 >>> import math
 >>> from scipy.stats import gamma
 >>> gamma.pdf(0.5, 1.99)
0.3066586069413397
 >>> gamma.pdf(-0.5, 1.99)
0.0
 >>> gamma.logpdf(-0.5, 1.99)
-inf
 >>> math.log(0)
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
ValueError: math domain error

So scipy returns 0 for the density function when outside support. It
returns -inf for the log density there, but Python's math.log raises an
exception for the log of zero.

In R the behaviour is the same as scipy, except that log(0) also
returns -Inf instead of raising an error.

 > dgamma(0, 2)
[1] 0
 > dgamma(-1, 2)
[1] 0
 > dgamma(-1, 2, log=TRUE)
[1] -Inf
 > log(0)
[1] -Inf

So returning 0 is another option. However this cannot distinguish a
valid return of 0 from an error.

Note that if the return type were not double, throwing an exception
would be the primary choice for signalling an error, as other numeric
types have no NaN. However, there are documented cases in the JDK where
a computation that does not make sense still avoids throwing, e.g.
Math.abs(int) for Integer.MIN_VALUE, which returns a negative value.
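
A short sketch of that JDK behaviour, next to Math.negateExact (Java 8+), which takes the opposite approach and throws on overflow. The class name is made up for illustration.

```java
public class IntegerEdgeCases {
    public static void main(String[] args) {
        // -Integer.MIN_VALUE does not fit in an int, so abs overflows silently.
        System.out.println(Math.abs(Integer.MIN_VALUE)); // -2147483648
        // The "exact" methods throw instead of returning a wrapped value.
        try {
            Math.negateExact(Integer.MIN_VALUE);
        } catch (ArithmeticException e) {
            System.out.println("negateExact threw: " + e.getMessage());
        }
    }
}
```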

I'm not a fan of static properties to configure the behaviour either
way. I don't think using zero is a good idea as it cannot signal
something is wrong.

I would favour one of the following:

- Provide alternative methods to return NaN or throw
- Always return NaN (which seems more Java conventional) and provide a
wrapper distribution that can wrap calls to density, logDensity and
cumulativeProbability and throw an exception if the underlying
distribution returns NaN.
- Always throw (which forces users to safe usage) and provide a wrapper
distribution that can wrap calls to density, logDensity and
cumulativeProbability and return NaN or zero if the underlying
distribution throws.

Consider the inconsistency: creating a distribution with a bad value
throws an exception, but calling a method on a distribution with a bad
value returns NaN. Given that, throwing an exception seems to me the
more sensible approach. A wrapper that guards against exceptions could
be user-configurable to return NaN or zero.
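
The second option in the list above (always return NaN, plus a throwing wrapper) could be sketched roughly as follows. Everything here is an assumption for illustration: the Distribution interface, UnitUniform and ThrowingDistribution are hypothetical names, not the library's API.

```java
public class ThrowingWrapperSketch {
    /** Hypothetical minimal interface; not the actual library API. */
    public interface Distribution {
        double density(double x);
        double logDensity(double x);
        double cumulativeProbability(double x);
    }

    /** Hypothetical uniform distribution on [0, 1] that returns NaN outside its support. */
    public static class UnitUniform implements Distribution {
        private static boolean inside(double x) { return x >= 0 && x <= 1; }
        @Override public double density(double x) { return inside(x) ? 1.0 : Double.NaN; }
        @Override public double logDensity(double x) { return inside(x) ? 0.0 : Double.NaN; }
        @Override public double cumulativeProbability(double x) { return inside(x) ? x : Double.NaN; }
    }

    /** Delegates every call and throws if the delegate returns NaN. */
    public static class ThrowingDistribution implements Distribution {
        private final Distribution delegate;

        public ThrowingDistribution(Distribution delegate) { this.delegate = delegate; }

        private static double check(double x, double result) {
            if (Double.isNaN(result)) {
                throw new IllegalArgumentException("No valid result for x = " + x);
            }
            return result;
        }

        @Override public double density(double x) { return check(x, delegate.density(x)); }
        @Override public double logDensity(double x) { return check(x, delegate.logDensity(x)); }
        @Override public double cumulativeProbability(double x) {
            return check(x, delegate.cumulativeProbability(x));
        }
    }

    public static void main(String[] args) {
        Distribution strict = new ThrowingDistribution(new UnitUniform());
        System.out.println(strict.density(0.5)); // 1.0
        try {
            strict.density(2.0);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The third option would be the mirror image: the delegate throws and the wrapper catches the exception and returns NaN or zero.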

Alex

Re: [Statistics] Convention when outside support?

Gilles Sadowski-2
Hi.

On Fri, 29 Nov 2019 at 18:41, Alex Herbert <[hidden email]> wrote:

>
> On 29/11/2019 16:48, Gilles Sadowski wrote:
> > Hello.
> >
> > For all implemented distributions, what convention should be adopted
> > when methods
> >   * density(x)
> >   * logDensity(x)
> >   * cumulativeProbability(x)
> > are called with "x" out of the "support" bounds?
> >
> > Currently some (but not all[1]) are documented to return "NaN".
> > An alternative could be to throw an exception.
>
> The convention in the java.lang.Math class is to return NaN for things
> that do not make sense, e.g.
>
> Math.log(-1)
> Math.asin(4)

But are we in the same kind of (wrong) usage when considering
the argument to the above methods?
I mean: If we ask the question of "What is the density at x?", is
it really an error to reply "0" when outside the domain?

> When considering the situation where you can create a distribution with
> a bad value and you get an exception, but you can use a distribution
> with a bad value and you get NaN it seems to me that throwing an
> exception may be the more sensible approach. A wrapper to guard
> exceptions can be user configurable to return NaN or zero.

Instantiating and raising an exception is (relatively) costly.
So if the "return NaN" feature is used in a use-case where performance
matters, the wrapper would spoil the intended purpose.
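
For what it is worth, much of the cost of an exception is commonly attributed to capturing the stack trace. A possible mitigation, sketched below under that assumption, is an exception type that disables the trace through the protected four-argument RuntimeException constructor. OutOfSupportException and the density method are hypothetical names, and the trade-off is that the trace (the main argument for throwing) is lost.

```java
public class CheapException {
    /** Hypothetical exception type that skips stack-trace capture. */
    static final class OutOfSupportException extends RuntimeException {
        private static final long serialVersionUID = 1L;

        OutOfSupportException(String message) {
            // cause = null, enableSuppression = false, writableStackTrace = false
            super(message, null, false, false);
        }
    }

    /** Hypothetical uniform density on [0, 1] that throws outside the support. */
    static double density(double x) {
        if (!(x >= 0 && x <= 1)) {
            throw new OutOfSupportException("x is outside [0, 1]: " + x);
        }
        return 1.0;
    }

    public static void main(String[] args) {
        try {
            density(2.0);
        } catch (OutOfSupportException e) {
            // With writableStackTrace = false the trace is an empty array.
            System.out.println(e.getMessage() + ", frames: " + e.getStackTrace().length);
        }
    }
}
```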

Gilles


Re: [Statistics] Convention when outside support?

Alex Herbert


> On 29 Nov 2019, at 18:24, Gilles Sadowski <[hidden email]> wrote:
>
> Hi.
>
> On Fri, 29 Nov 2019 at 18:41, Alex Herbert <[hidden email]> wrote:
>>
>> On 29/11/2019 16:48, Gilles Sadowski wrote:
>>> Hello.
>>>
>>> For all implemented distributions, what convention should be adopted
>>> when methods
>>>  * density(x)
>>>  * logDensity(x)
>>>  * cumulativeProbability(x)
>>> are called with "x" out of the "support" bounds?
>>>
>>> Currently some (but not all[1]) are documented to return "NaN".
>>> An alternative could be to throw an exception.
>>
>> The convention in the java.lang.Math class is to return NaN for things
>> that do not make sense, e.g.
>>
>> Math.log(-1)
>> Math.asin(4)
>
> But are we in the same kind of (wrong) usage when considering
> the argument to the above methods?
> I mean: If we ask the question of "What is the density at x?", is
> it really an error to reply "0" when outside the domain?

In the case of probabilities then returning 0 does not seem wrong.

It would be akin to the use of the instanceof operator where you wish to do something based on whether the object is of the correct type. Here you wish to have a probability for a value. If the value is not in the support then it has no probability: you return zero and the caller can do any computation they want based on it having no probability.

As I mentioned, popular R and Python implementations return zero for out-of-domain cases. So this behaviour would not be unprecedented.

I previously checked the gamma distribution. The same is true for others I’ve just checked, e.g. a Binomial in R:

 > dbinom(-1, size=12, prob=0.2)
[1] 0
 > dbinom(44, size=12, prob=0.2)
[1] 0

Or scipy:

 >>> from scipy.stats import binom
 >>> n, p = 12, 0.2
 >>> binom.pmf(-1, n, p)
0.0
 >>> binom.pmf(44, n, p)
0.0
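
In Java, the same zero-outside-support convention might look like the sketch below. It uses a hypothetical uniform distribution on [lo, hi] written as static methods (assuming lo < hi); none of this is the library's code. Note that Math.log(0.0) returns -Infinity in Java, so the log density matches the R and scipy results without special handling.

```java
public class ZeroOutsideSupport {
    /** Density of a hypothetical uniform distribution on [lo, hi]; 0 outside the support. */
    static double density(double x, double lo, double hi) {
        return (x >= lo && x <= hi) ? 1.0 / (hi - lo) : 0.0;
    }

    /** Log density; Math.log(0.0) is -Infinity, so nothing special is needed outside. */
    static double logDensity(double x, double lo, double hi) {
        return Math.log(density(x, lo, hi));
    }

    /** CDF: 0 below the support, 1 above it. NaN input is not treated specially here. */
    static double cumulativeProbability(double x, double lo, double hi) {
        if (x <= lo) {
            return 0.0;
        }
        if (x >= hi) {
            return 1.0;
        }
        return (x - lo) / (hi - lo);
    }

    public static void main(String[] args) {
        System.out.println(density(-1, 0, 2));               // 0.0
        System.out.println(logDensity(-1, 0, 2));            // -Infinity
        System.out.println(cumulativeProbability(3, 0, 2));  // 1.0
    }
}
```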

>
> Instantiating and raising an exception is (relatively) costly.
> So if the "return NaN" feature is used in a use-case where performance
> matters, the wrapper would spoil the intended purpose.

Yes. On reflection, the default should be to return a standard answer for invalid arguments, with a wrapper provided that throws if the argument is out of bounds. Providing a wrapper at least acknowledges that this is something people should consider when using the distribution classes: do they want a zero for out-of-domain arguments, or do they want an exception?
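
A rough sketch of such an opt-in wrapper, checking the argument against the support bounds before delegating. The requireSupport helper and the example density are hypothetical; only java.util.function.DoubleUnaryOperator is a real JDK type.

```java
import java.util.function.DoubleUnaryOperator;

public class SupportGuard {
    /** Wraps a function so that arguments outside [lo, hi] (or NaN) are rejected. */
    static DoubleUnaryOperator requireSupport(DoubleUnaryOperator f, double lo, double hi) {
        return x -> {
            if (!(x >= lo && x <= hi)) {
                throw new IllegalArgumentException(
                    "x = " + x + " is outside the support [" + lo + ", " + hi + "]");
            }
            return f.applyAsDouble(x);
        };
    }

    public static void main(String[] args) {
        // Default behaviour: a hypothetical uniform density on [0, 1], zero outside.
        DoubleUnaryOperator density = x -> (x >= 0 && x <= 1) ? 1.0 : 0.0;
        // Opt-in strict behaviour for callers who prefer an exception.
        DoubleUnaryOperator strict = requireSupport(density, 0, 1);

        System.out.println(density.applyAsDouble(2.0)); // 0.0
        try {
            strict.applyAsDouble(2.0);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The default path pays nothing for the check, which addresses the performance concern; only callers who ask for the strict variant pay for the bounds test and any exception.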
