running average of a rate

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

running average of a rate

Benson Margulies
Please excuse the following ignorant question.

I want to maintain summary statistics of a rate. At each 'event', I
know the number of characters and the time it took to process them,
and I want to maintain summary statistics for the rate of
chars/second. I imagine that I'm missing something basic, but I don't
see how to do this.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: running average of a rate

Ted Dunning
For the overall average you need is the total time and total characters.  Do
the division at presentation time.

If you want a decaying moving average, then a variant on Welford's method is
useful for an on-line estimate.  I just had such a discussion in the hbase
group.  For that you need state, however, and atomic updates.  A useful
approximation might be possible if you can report many moving averages which
can themselves be averaged.

Can you say a bit more about your context?  Is this in hadoop?  Did you want
to maintain the average across the cluster?  Did you want real averages or a
moving average?  Do you have constant sampling rate?

On Mon, Mar 14, 2011 at 7:33 AM, Benson Margulies <[hidden email]>wrote:

> Please excuse the following ignorant question.
>
> I want to maintain summary statistics of a rate. At each 'event', I
> know the number of characters and the time it took to process them,
> and I want to maintain summary statistics for the rate of
> chars/second. I imagine that I'm missing something basic, but I don't
> see how to do this.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

[math] Re: running average of a rate

Luc Maisonobe
In reply to this post by Benson Margulies
Le 14/03/2011 15:33, Benson Margulies a écrit :
> Please excuse the following ignorant question.
>
> I want to maintain summary statistics of a rate. At each 'event', I
> know the number of characters and the time it took to process them,
> and I want to maintain summary statistics for the rate of
> chars/second. I imagine that I'm missing something basic, but I don't
> see how to do this.

You should define some windows width, either in terms of a time span
(all events in the last n seconds) or in terms of number of events (last
n events).

In [math], we do not provide (yet) anything for maintaining such a data
structure, you'll have to maintain the events in this slot by yourself,
with something similar to a FIFO.

When you have your data available, each time a new event is added or
removed from the ones that belong to the window, you can fetch compute
the statistics you want on this data (min, max, mean, median, standard
deviation ...) and wait for next addition/removel to recompute it again.

Another thing we discussed some months ago (but did not implement yet)
is a way to compute an approximation of percentiles in a flow of data
without storing them. There is an interesting algorithm for it that was
developed for the needs of telecommunication companies, I think it may
be of interest to you. This would provide results like : currently 95%
of the characters are processed in n milliseconds. would you be
interested in us implementing this feature ?

Luc

>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Re: running average of a rate

Ted Dunning
On Mon, Mar 14, 2011 at 12:34 PM, Luc Maisonobe <[hidden email]>wrote:

> Le 14/03/2011 15:33, Benson Margulies a écrit :
> > Please excuse the following ignorant question.
> >
> > I want to maintain summary statistics of a rate. At each 'event', I
> > know the number of characters and the time it took to process them,
> > and I want to maintain summary statistics for the rate of
> > chars/second. I imagine that I'm missing something basic, but I don't
> > see how to do this.
>
> You should define some windows width, either in terms of a time span
> (all events in the last n seconds) or in terms of number of events (last
> n events).
>
> In [math], we do not provide (yet) anything for maintaining such a data
> structure, you'll have to maintain the events in this slot by yourself,
> with something similar to a FIFO.
>

The problem for Benson's case is that the FIFO is a global structure.
 Computing the global moving
average is a bit trickier.


> Another thing we discussed some months ago (but did not implement yet)
> is a way to compute an approximation of percentiles in a flow of data
> without storing them. There is an interesting algorithm for it that was
> developed for the needs of telecommunication companies, I think it may
> be of interest to you. This would provide results like : currently 95%
> of the characters are processed in n milliseconds. would you be
> interested in us implementing this feature ?
>

I would recommend stealing it instead from the OnlineSummarizer in Mahout
which computes rank
statistics as well as average and standard deviation in a completely on-line
way.  The only rank statistics
computed with now are the quartiles, but this can easily be changed to
compute 1, 5, 95 and 99%-iles.  This
is done in a completely on-line way (O(1) memory and time per sample) and is
roughly as accurate for
computing quantiles of a distribution as sorting all the samples and
observing the empirical quantiles.

The current version simplifies the published algorithm slightly.  As a
result, some extremely skewed distributions
can have quartiles that are in error.  For instance, if I remember
accurately, the unit gamma distribution with
shape = 0.001 has a minimum of 0 and a median of 10^-50 while the Mahout
code only gives an estimate of the
median of 10^-20.  Relative to the distance between the min to the median,
this is an unfortunately large error.
Relative to the distance from the mean to the min or relative to the
inter-quartile distance, this error is trivial.

I know of no practical use cases for such an extreme distributions except as
very conservative scale preserving
priors in some Bayesian computations.  Even in these cases, estimation of
the quantiles is not useful since only
the posterior is of interest and it invariably is less extreme if any data
is considered.
Reply | Threaded
Open this post in threaded view
|

Re: [math] Re: running average of a rate

Phil Steitz
In reply to this post by Luc Maisonobe
On 3/14/11 12:34 PM, Luc Maisonobe wrote:

> Le 14/03/2011 15:33, Benson Margulies a écrit :
>> Please excuse the following ignorant question.
>>
>> I want to maintain summary statistics of a rate. At each 'event', I
>> know the number of characters and the time it took to process them,
>> and I want to maintain summary statistics for the rate of
>> chars/second. I imagine that I'm missing something basic, but I don't
>> see how to do this.
> You should define some windows width, either in terms of a time span
> (all events in the last n seconds) or in terms of number of events (last
> n events).
>
> In [math], we do not provide (yet) anything for maintaining such a data
> structure, you'll have to maintain the events in this slot by yourself,
> with something similar to a FIFO.

I am not sure I understand what the problem is exactly, but if what
you need is simply "rolling" statistics, where a dataset of 0,...,n
values are maintained with the newest values replacing the oldest,
we do in fact support that in
o.a.c.math.descriptive.DescriptiveStatistics.

Phil

> When you have your data available, each time a new event is added or
> removed from the ones that belong to the window, you can fetch compute
> the statistics you want on this data (min, max, mean, median, standard
> deviation ...) and wait for next addition/removel to recompute it again.
>
> Another thing we discussed some months ago (but did not implement yet)
> is a way to compute an approximation of percentiles in a flow of data
> without storing them. There is an interesting algorithm for it that was
> developed for the needs of telecommunication companies, I think it may
> be of interest to you. This would provide results like : currently 95%
> of the characters are processed in n milliseconds. would you be
> interested in us implementing this feature ?
>
> Luc
>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Re: running average of a rate

Ted Dunning
Phil,

A big part of the problem as I understand it is that these rates are needed
across a potentially quite large distributed system.

On Mon, Mar 14, 2011 at 1:42 PM, Phil Steitz <[hidden email]> wrote:

> On 3/14/11 12:34 PM, Luc Maisonobe wrote:
> > Le 14/03/2011 15:33, Benson Margulies a écrit :
> >> Please excuse the following ignorant question.
> >>
> >> I want to maintain summary statistics of a rate. At each 'event', I
> >> know the number of characters and the time it took to process them,
> >> and I want to maintain summary statistics for the rate of
> >> chars/second. I imagine that I'm missing something basic, but I don't
> >> see how to do this.
> > You should define some windows width, either in terms of a time span
> > (all events in the last n seconds) or in terms of number of events (last
> > n events).
> >
> > In [math], we do not provide (yet) anything for maintaining such a data
> > structure, you'll have to maintain the events in this slot by yourself,
> > with something similar to a FIFO.
>
> I am not sure I understand what the problem is exactly, but if what
> you need is simply "rolling" statistics, where a dataset of 0,...,n
> values are maintained with the newest values replacing the oldest,
> we do in fact support that in
> o.a.c.math.descriptive.DescriptiveStatistics.
>
> Phil
> > When you have your data available, each time a new event is added or
> > removed from the ones that belong to the window, you can fetch compute
> > the statistics you want on this data (min, max, mean, median, standard
> > deviation ...) and wait for next addition/removel to recompute it again.
> >
> > Another thing we discussed some months ago (but did not implement yet)
> > is a way to compute an approximation of percentiles in a flow of data
> > without storing them. There is an interesting algorithm for it that was
> > developed for the needs of telecommunication companies, I think it may
> > be of interest to you. This would provide results like : currently 95%
> > of the characters are processed in n milliseconds. would you be
> > interested in us implementing this feature ?
> >
> > Luc
> >
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [math] Re: running average of a rate

sebb-2-2
In reply to this post by Phil Steitz
On 14 March 2011 20:42, Phil Steitz <[hidden email]> wrote:

> On 3/14/11 12:34 PM, Luc Maisonobe wrote:
>> Le 14/03/2011 15:33, Benson Margulies a écrit :
>>> Please excuse the following ignorant question.
>>>
>>> I want to maintain summary statistics of a rate. At each 'event', I
>>> know the number of characters and the time it took to process them,
>>> and I want to maintain summary statistics for the rate of
>>> chars/second. I imagine that I'm missing something basic, but I don't
>>> see how to do this.
>> You should define some windows width, either in terms of a time span
>> (all events in the last n seconds) or in terms of number of events (last
>> n events).
>>
>> In [math], we do not provide (yet) anything for maintaining such a data
>> structure, you'll have to maintain the events in this slot by yourself,
>> with something similar to a FIFO.
>
> I am not sure I understand what the problem is exactly, but if what
> you need is simply "rolling" statistics, where a dataset of 0,...,n
> values are maintained with the newest values replacing the oldest,
> we do in fact support that in
> o.a.c.math.descriptive.DescriptiveStatistics.

In JMeter we needed to display long running percentiles without using
excess memory, and someone came up with the idea of using buckets for
ranges of values. So instead of keeping details on each sample elapsed
time, we increment the count for the appropriate bucket.

If the range of values is too large to use a single bucket for each
value, each bucket can represent a range of values.
These ranges can potentially be non-uniform though that does
complicate the calculations.

JMeter actually uses a TreeMap for the values and counts - the values
need to be sorted in order to calculate percentiles.

Depending on the data-set, it might be possible to used fixed arrays
instead of the TreeMap.

> Phil
>> When you have your data available, each time a new event is added or
>> removed from the ones that belong to the window, you can fetch compute
>> the statistics you want on this data (min, max, mean, median, standard
>> deviation ...) and wait for next addition/removel to recompute it again.
>>
>> Another thing we discussed some months ago (but did not implement yet)
>> is a way to compute an approximation of percentiles in a flow of data
>> without storing them. There is an interesting algorithm for it that was
>> developed for the needs of telecommunication companies, I think it may
>> be of interest to you. This would provide results like : currently 95%
>> of the characters are processed in n milliseconds. would you be
>> interested in us implementing this feature ?
>>
>> Luc

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Re: running average of a rate

Ted Dunning
This approach is fine for relatively well-behaved distributions.  Anything
more skewed than, say, an exponential or as long tailed as a t(3)
distribution is likely to have troubles with this approach.

See
http://search-lucene.com/jd/mahout/math/org/apache/mahout/math/stats/OnlineSummarizer.htmlfor
the alternative I have been suggesting.  It can keep accurate
estimates
of any quantile that you like.

On Mon, Mar 14, 2011 at 5:17 PM, sebb <[hidden email]> wrote:

>
> In JMeter we needed to display long running percentiles without using
> excess memory, and someone came up with the idea of using buckets for
> ranges of values. So instead of keeping details on each sample elapsed
> time, we increment the count for the appropriate bucket.
>
> If the range of values is too large to use a single bucket for each
> value, each bucket can represent a range of values.
> These ranges can potentially be non-uniform though that does
> complicate the calculations.
>
> JMeter actually uses a TreeMap for the values and counts - the values
> need to be sorted in order to calculate percentiles.
>
> Depending on the data-set, it might be possible to used fixed arrays
> instead of the TreeMap.