- 1 Introduction
- 2 Descriptive statistics
- 3 Kernel estimation of probability density functions
- 3.1 Kernel estimation of the density function
- 3.2 Kernel estimation of the cumulative probability density
- 3.3 Choosing the bandwidth
- 3.4 Usage
- 3.4.1 Estimating the probability density function
- 3.4.2 Estimating the cumulative probability density function
- 3.4.3 Estimating the complement of the cumulative probability density function
- 3.4.4 Calculating the optimal bandwidth
- 3.5 Examples
- 4 Installation instructions

Statistics for Python was released under the Python License.

Michiel de Hoon

Center for Computational Biology and Bioinformatics, Columbia University.

Peter M. Neely: “Comparison of several algorithms for computation of means, standard deviations and correlation coefficients.” Communications of the ACM 9(7): 496-499 (1966).

The arithmetic mean is defined as $$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}.$$ The statistical median of the data $x_{i}$ is defined as $$\tilde{x}=\begin{cases}x'_{(n+1)/2} & \text{if } n \text{ is odd}\\ \frac{1}{2}\left(x'_{n/2}+x'_{1+n/2}\right) & \text{if } n \text{ is even,}\end{cases}$$ where the array $x'$ is obtained by sorting the array $x$. The variance of a random variable $x$ is defined as $$\sigma_{x}^{2}=E\left[\left(x-E\left(x\right)\right)^{2}\right].$$ Given $n$ data points $x_{i}$, the unbiased estimate of the variance is $$\hat{\sigma}_{x}^{2}=\frac{1}{n-1}\left[\left(\sum_{i=1}^{n}x_{i}^{2}\right)-\frac{1}{n}\left(\sum_{i=1}^{n}x_{i}\right)^{2}\right].$$ For the maximum-likelihood estimate of the variance, which is a biased estimate, we divide by $n$ instead of $n-1$. The variance in this library is implemented using the algorithm proposed by Welford (1962), which avoids the round-off errors associated with a direct calculation of the variance.
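Welford's update can be sketched as follows. This is a minimal re-implementation for illustration only; `welford_variance` is a hypothetical name, not part of this library's API.

```python
def welford_variance(data, mode="Unbiased"):
    """Compute the variance of `data` in a single pass using Welford's
    algorithm, which avoids the round-off errors of the naive
    sum-of-squares formula."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if mode == "Unbiased":
        return m2 / (n - 1)
    return m2 / n  # maximum-likelihood (biased) estimate
```

For the data `[1.0, 2.0, 3.0, 4.0]`, the unbiased estimate is $5/3$ and the maximum-likelihood estimate is $5/4$; both formulas agree with the definitions above.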

- Pearson correlation
- Spearman rank correlation
- Intraclass correlation

`mean`

returns the arithmetic mean of an array of data. `>>> mean(x)`

Arguments:

- `x`: A one-dimensional array containing the data for which to calculate the mean.

Return value:

- The arithmetic mean of the data `x`.

`median`

returns the median of an array of data. `>>> median(x)`

Arguments:

- `x`: A one-dimensional array containing the data for which to calculate the median.

Return value:

- The median of the data `x`.
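The case distinction in the median formula translates directly into code. This is a sketch of the definition, not the library's implementation; note the conversion from the formula's 1-based indices to Python's 0-based indexing.

```python
def median(x):
    """Median of a one-dimensional sequence, following the definition above."""
    xs = sorted(x)  # the sorted array x' in the formula
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]  # x'_{(n+1)/2}, shifted to 0-based
    return 0.5 * (xs[n // 2 - 1] + xs[n // 2])  # average of x'_{n/2}, x'_{1+n/2}
```

For an odd-length array such as `[3, 1, 2]` this returns the middle sorted value, 2; for `[4, 1, 3, 2]` it returns the average of the two middle values, 2.5.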

`variance`

calculates the variance of a one-dimensional array of data. `>>> variance(x, mode = "Unbiased")`

Arguments:

- `x`: A one-dimensional array containing the data for which to calculate the variance;
- `mode`: For `mode` equal to `"Unbiased"` (the default value), the function `variance` returns the unbiased estimate of the variance. For `mode` equal to `"ML"`, the function returns the maximum-likelihood estimate of the variance, which is a biased estimate.

Return value:

- The variance of the data `x`.

`covariance`

calculates the covariance matrix of an array of data. `>>> covariance(x, y = None, mode = "Unbiased")`

Arguments:

- `x`: Either a one-dimensional array or a two-dimensional array containing the data for which to calculate the covariance. If `x` is a one-dimensional array and `y==None`, then this function returns the variance of `x`. If both `x` and `y` are one-dimensional arrays with the same length, `covariance` returns the covariance between `x` and `y`. If `x` is a two-dimensional array, then `covariance` returns the covariance matrix of `x`; `y` is ignored.
- `y`: A one-dimensional array of the same length as `x`, or `None`;
- `mode`: For `mode` equal to `"Unbiased"` (the default value), the function `covariance` returns the unbiased estimate of the covariance. For `mode` equal to `"ML"`, the function returns the maximum-likelihood estimate of the covariance, which is a biased estimate.

Return values:

- If `x` is one-dimensional and `y==None`: the variance of `x`;
- If `x` and `y` are both one-dimensional and have the same length: the covariance between `x` and `y`;
- If `x` is two-dimensional: the covariance matrix between the columns of `x`. Element `[i,j]` of the covariance matrix contains the covariance between columns `x[:,i]` and `x[:,j]`.
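The two-dimensional case can be sketched as follows. This is a hypothetical re-implementation for illustration; it follows the element convention described above, where element `[i, j]` holds the covariance between columns `x[:, i]` and `x[:, j]`.

```python
import numpy as np

def covariance_matrix(x, mode="Unbiased"):
    """Covariance matrix between the columns of a 2D array."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]                   # number of observations (rows)
    centered = x - x.mean(axis=0)    # subtract each column's mean
    denom = n - 1 if mode == "Unbiased" else n
    return centered.T @ centered / denom

x = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
c = covariance_matrix(x)
# c[0, 0] is the variance of the first column; c[0, 1] is the covariance
# between the two columns.
```

For this example the second column is exactly twice the first, so the unbiased result is `[[1, 2], [2, 4]]`.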

`correlation`

calculates the correlation matrix of an array of data. `>>> correlation(x, y = None, method = "Pearson")`

Arguments:

- `x`: Either a one-dimensional array or a two-dimensional array containing the data for which to calculate the correlation. If `x` is a one-dimensional array and `y==None`, then this function returns `1.0`. If both `x` and `y` are one-dimensional arrays with the same length, `correlation` returns the correlation between `x` and `y`. If `x` is a two-dimensional array, then `correlation` returns the correlation matrix of `x`; `y` is ignored.
- `y`: A one-dimensional array of the same length as `x`, or `None`;
- `method`: Determines which type of correlation is calculated:
  - `"Pearson"`: The Pearson correlation (default);
  - `"Spearman"`: The Spearman rank correlation;
  - `"Intraclass"`: The intraclass correlation.

Return values:

- If `x` is one-dimensional and `y==None`: `1.0`;
- If `x` and `y` are both one-dimensional and have the same length: the correlation between `x` and `y`;
- If `x` is two-dimensional: the correlation matrix between the columns of `x`. Element `[i,j]` of the correlation matrix contains the correlation between columns `x[:,i]` and `x[:,j]`.

`regression`

returns the intercept and slope of a linear regression line fit to two arrays `x` and `y`. `>>> a, b = regression(x, y)`

Arguments:

- `x`: A one-dimensional array of data;
- `y`: A one-dimensional array of data. The lengths of `x` and `y` should be equal.

Return values:

- `a`: The intercept of the linear regression line;
- `b`: The slope of the linear regression line.
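The intercept and slope follow from the standard least-squares formulas. The sketch below is a hypothetical re-implementation for illustration, not the library's code.

```python
def linear_regression(x, y):
    """Return the intercept a and slope b of the least-squares line
    y = a + b * x fit to equal-length sequences x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

a, b = linear_regression([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
# These points lie exactly on y = 1 + 2x, so a == 1.0 and b == 2.0.
```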

D. W. Scott: “Multivariate Density Estimation; Theory, Practice, and Visualization”, Chapter 6. John Wiley and Sons, New York, 1992.

Suppose we have a set of observations ${x}_{i}$, and we want to find the probability density function of the distribution from which these data were drawn. In parametric density estimations, we choose some distribution (such as the normal distribution or the extreme value distribution) and estimate the values of the parameters appearing in these functions from the observed data. However, often the functional form of the true density function is not known. In this case, the probability density function can be estimated non-parametrically by using a kernel density estimation.

| Mnemonic | Kernel name | Function | Optimal bandwidth $h$ |
|---|---|---|---|
| `'u'` | Uniform | $k(t)=\begin{cases}\frac{1}{2} & \text{if } \lvert t\rvert\le 1\\ 0 & \text{if } \lvert t\rvert>1\end{cases}$ | $\sigma\left(\frac{12\sqrt{\pi}}{n}\right)^{1/5}$ |
| `'t'` | Triangle | $k(t)=\begin{cases}1-\lvert t\rvert & \text{if } \lvert t\rvert\le 1\\ 0 & \text{if } \lvert t\rvert>1\end{cases}$ | $\sigma\left(\frac{64\sqrt{\pi}}{n}\right)^{1/5}$ |
| `'e'` | Epanechnikov | $k(t)=\begin{cases}\frac{3}{4}\left(1-t^{2}\right) & \text{if } \lvert t\rvert\le 1\\ 0 & \text{if } \lvert t\rvert>1\end{cases}$ | $\sigma\left(\frac{40\sqrt{\pi}}{n}\right)^{1/5}$ |
| `'b'` | Biweight/quartic | $k(t)=\begin{cases}\frac{15}{16}\left(1-t^{2}\right)^{2} & \text{if } \lvert t\rvert\le 1\\ 0 & \text{if } \lvert t\rvert>1\end{cases}$ | $\sigma\left(\frac{280\sqrt{\pi}}{3n}\right)^{1/5}$ |
| `'3'` | Triweight | $k(t)=\begin{cases}\frac{35}{32}\left(1-t^{2}\right)^{3} & \text{if } \lvert t\rvert\le 1\\ 0 & \text{if } \lvert t\rvert>1\end{cases}$ | $\sigma\left(\frac{25200\sqrt{\pi}}{143n}\right)^{1/5}$ |
| `'c'` | Cosine | $k(t)=\begin{cases}\frac{\pi}{4}\cos\left(\frac{\pi}{2}t\right) & \text{if } \lvert t\rvert\le 1\\ 0 & \text{if } \lvert t\rvert>1\end{cases}$ | $\sigma\left(\frac{\pi^{13/2}}{6n\left(\pi^{2}-8\right)^{2}}\right)^{1/5}$ |
| `'g'` | Gaussian | $k(t)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}t^{2}\right)$ | $\sigma\left(\frac{4}{3n}\right)^{1/5}$ |

Estimating the probability density function with the Gaussian kernel is more computationally intensive than with other kernels, as it has infinite support (i.e., $k\left(t\right)$ is nonzero for all $t$).
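A direct sketch of the kernel density estimate, assuming the standard estimator $\hat{f}(x)=\frac{1}{nh}\sum_{i=1}^{n}k\left(\frac{x-x_{i}}{h}\right)$ with the Epanechnikov kernel from the table above (`epanechnikov` and `kde` are hypothetical names, not part of this library):

```python
def epanechnikov(t):
    """Epanechnikov kernel k(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def kde(x, data, h):
    """Kernel density estimate f(x) = (1 / (n h)) * sum_i k((x - x_i) / h)."""
    n = len(data)
    return sum(epanechnikov((x - xi) / h) for xi in data) / (n * h)

data = [-1.0, 0.0, 1.0]
# With bandwidth h = 2, all three points contribute to the estimate at
# x = 0, since |x - x_i| / h <= 1 for each of them.
y = kde(0.0, data, 2.0)
```

Because the Epanechnikov kernel has finite support, points with $\lvert x-x_{i}\rvert>h$ contribute nothing, which is why the finite-support kernels are cheaper to evaluate than the Gaussian.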

`pdf`

estimates the probability density function from the observed data. You can either specify the values of `x` at which you want to estimate the value `y` of the probability density function explicitly: `>>> y = pdf(data, x, weight = None, h = None, kernel = 'Epanechnikov')`

or you can let the function choose `x` for you: `>>> y, x = pdf(data, weight = None, h = None, kernel = 'Epanechnikov', n = 100)`

In the latter case, the returned array `x` contains `n` equidistant data points covering the domain where $\hat{f}\left(x\right)$ is nonzero.

Arguments:

- `data`: The one-dimensional array `data` contains the observed data from which the probability density function is estimated;
- `weight`: The one-dimensional array `weight`, if given, contains the weights for the observed data. If `weight==None`, then each data point receives an equal weight 1;
- `x`: The value(s) at which the probability density function will be estimated (either a single value, or a 1D array of values). If you don't specify `x`, the function `pdf` will create `x` as a 1D array of `n` values for you and return it together with the estimated probability density function;
- `h`: The bandwidth to be used for the estimation. If `h` is not specified (and also if the user specifies a zero or negative `h`), the optimal bandwidth is used (which can be calculated explicitly by the function `bandwidth`);
- `kernel`: The kernel function can be specified by its name (case is ignored), or by a one-character mnemonic:
  - `'E'` or `'Epanechnikov'`: Epanechnikov kernel (default);
  - `'U'` or `'Uniform'`: Uniform kernel;
  - `'T'` or `'Triangle'`: Triangle kernel;
  - `'G'` or `'Gaussian'`: Gaussian kernel;
  - `'B'` or `'Biweight'`: Quartic/biweight kernel;
  - `'3'` or `'Triweight'`: Triweight kernel;
  - `'C'` or `'Cosine'`: Cosine kernel;
- `n`: The number of points for which the probability density function is to be estimated. This argument is meaningful only if you don't specify `x` explicitly; passing both `x` and `n` raises an error. The default value of `n` is 100.

Return values:

- If you specified `x` explicitly: the estimated probability density, estimated at the values in `x`;
- If you did not specify `x` explicitly: the estimated probability density, as well as the corresponding values of `x`.

`cpdf`

estimates the cumulative probability density function from the observed data. You can either specify the values of `x` at which you want to estimate the value `y` of the cumulative probability density function explicitly: `>>> y = cpdf(data, x, h = None, kernel = 'Epanechnikov')`

or you can let the function choose `x` for you: `>>> y, x = cpdf(data, h = None, kernel = 'Epanechnikov', n = 100)`

In the latter case, the returned array `x` contains `n` equidistant data points covering the domain where $\hat{f}\left(x\right)$ is nonzero; the estimated cumulative probability density is constant (either 0 or 1) outside of this domain.

Arguments:

- `data`: The one-dimensional array `data` contains the observed data from which the cumulative probability density function is estimated;
- `x`: The value(s) at which the cumulative probability density function will be estimated (either a single value, or a 1D array of values). If you don't specify `x`, the function `cpdf` will create `x` as a 1D array of `n` values for you and return it together with the estimated cumulative probability density function;
- `h`: The bandwidth to be used for the estimation. If `h` is not specified (and also if the user specifies a zero or negative `h`), the optimal bandwidth is used (which can be calculated explicitly by the function `bandwidth`);
- `kernel`: The kernel function can be specified by its name (case is ignored), or by a one-character mnemonic:
  - `'E'` or `'Epanechnikov'`: Epanechnikov kernel (default);
  - `'U'` or `'Uniform'`: Uniform kernel;
  - `'T'` or `'Triangle'`: Triangle kernel;
  - `'G'` or `'Gaussian'`: Gaussian kernel;
  - `'B'` or `'Biweight'`: Quartic/biweight kernel;
  - `'3'` or `'Triweight'`: Triweight kernel;
  - `'C'` or `'Cosine'`: Cosine kernel;
- `n`: The number of points for which the cumulative probability density function is to be estimated. This argument is meaningful only if you don't specify `x` explicitly; passing both `x` and `n` raises an error. The default value of `n` is 100.

Return values:

- If you specified `x` explicitly: the estimated cumulative probability density, estimated at the values in `x`;
- If you did not specify `x` explicitly: the estimated cumulative probability density, as well as the corresponding values of `x`.
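The cumulative estimate can be sketched by replacing the kernel $k$ with its integral $K$, assuming the standard estimator $\hat{F}(x)=\frac{1}{n}\sum_{i=1}^{n}K\left(\frac{x-x_{i}}{h}\right)$; for the Epanechnikov kernel, $K(t)=\frac{3t-t^{3}}{4}+\frac{1}{2}$ on $[-1,1]$. The names `epanechnikov_cdf` and `ckde` below are hypothetical, not part of this library.

```python
def epanechnikov_cdf(t):
    """Integral of the Epanechnikov kernel k(u) = 3/4 (1 - u^2) from -1 to t."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return 0.25 * (3.0 * t - t ** 3) + 0.5

def ckde(x, data, h):
    """Kernel estimate of the cumulative probability density at x."""
    return sum(epanechnikov_cdf((x - xi) / h) for xi in data) / len(data)

data = [-1.0, 0.0, 1.0]
# The data are symmetric around 0, so the estimated cumulative
# probability at x = 0 is exactly one half.
y = ckde(0.0, data, 2.0)
```

Far to the right of all data points the estimate is constant at 1, and far to the left it is constant at 0, matching the behavior described above.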

`cpdfc`

estimates the complement of the cumulative probability density function from the observed data. You can either specify the values of `x` at which you want to estimate the value `y` of the complement of the cumulative probability density function explicitly: `>>> y = cpdfc(data, x, h = None, kernel = 'Epanechnikov')`

or you can let the function choose `x` for you: `>>> y, x = cpdfc(data, h = None, kernel = 'Epanechnikov', n = 100)`

In the latter case, the returned array `x` contains `n` equidistant data points covering the domain where $\hat{f}\left(x\right)$ is nonzero; the estimated complement of the cumulative probability density is constant (either 0 or 1) outside of this domain.

Arguments:

- `data`: The one-dimensional array `data` contains the observed data from which the complement of the cumulative probability density function is estimated;
- `x`: The value(s) at which the complement of the cumulative probability density function will be estimated (either a single value, or a 1D array of values). If you don't specify `x`, the function `cpdfc` will create `x` as a 1D array of `n` values for you and return it together with the estimated complement of the cumulative probability density function;
- `h`: The bandwidth to be used for the estimation. If `h` is not specified (and also if the user specifies a zero or negative `h`), the optimal bandwidth is used (which can be calculated explicitly by the function `bandwidth`);
- `kernel`: The kernel function can be specified by its name (case is ignored), or by a one-character mnemonic:
  - `'E'` or `'Epanechnikov'`: Epanechnikov kernel (default);
  - `'U'` or `'Uniform'`: Uniform kernel;
  - `'T'` or `'Triangle'`: Triangle kernel;
  - `'G'` or `'Gaussian'`: Gaussian kernel;
  - `'B'` or `'Biweight'`: Quartic/biweight kernel;
  - `'3'` or `'Triweight'`: Triweight kernel;
  - `'C'` or `'Cosine'`: Cosine kernel;
- `n`: The number of points for which the complement of the cumulative probability density function is to be estimated. This argument is meaningful only if you don't specify `x` explicitly; passing both `x` and `n` raises an error. The default value of `n` is 100.

Return values:

- If you specified `x` explicitly: the estimated complement of the cumulative probability density, estimated at the values in `x`;
- If you did not specify `x` explicitly: the estimated complement of the cumulative probability density, as well as the corresponding values of `x`.

`bandwidth`

calculates the optimal bandwidth from the observed data for a given kernel: `>>> h = bandwidth(data, weight = None, kernel = 'Epanechnikov')`

Arguments:

- `data`: A one-dimensional array containing the observed data from which the probability density function will be calculated;
- `weight`: The one-dimensional array `weight`, if given, contains the weights for the observed data. If `weight==None`, then each data point receives an equal weight 1;
- `kernel`: The kernel function can be specified by its name (case is ignored), or by a one-character mnemonic:
  - `'E'` or `'Epanechnikov'`: Epanechnikov kernel (default);
  - `'U'` or `'Uniform'`: Uniform kernel;
  - `'T'` or `'Triangle'`: Triangle kernel;
  - `'G'` or `'Gaussian'`: Gaussian kernel;
  - `'B'` or `'Biweight'`: Quartic/biweight kernel;
  - `'3'` or `'Triweight'`: Triweight kernel;
  - `'C'` or `'Cosine'`: Cosine kernel.

Return value:

- `bandwidth` returns the optimal bandwidth for the given `data`, using the specified `kernel`. This bandwidth can subsequently be used when estimating the (cumulative) probability density with `pdf`, `cpdf`, or `cpdfc`.
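For the Gaussian kernel, the optimal-bandwidth rule from the table in section 3 is $h=\sigma\left(\frac{4}{3n}\right)^{1/5}$. A sketch, assuming $\sigma$ is the unbiased sample standard deviation of the data (`gaussian_bandwidth` is a hypothetical name; the library's `bandwidth` function may estimate $\sigma$ differently, for example with weights):

```python
import math

def gaussian_bandwidth(data):
    """Optimal bandwidth h = sigma * (4 / (3 n)) ** (1/5) for the
    Gaussian kernel, with sigma estimated from the data."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)  # unbiased estimate
    sigma = math.sqrt(var)
    return sigma * (4.0 / (3.0 * n)) ** 0.2

h = gaussian_bandwidth([0.0, 1.0, 2.0, 3.0, 4.0])
```

The other kernels in the table differ only in the constant inside the fifth root, so the same structure applies to all of them.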
In this example, we use numpy's `random` module to draw 100 random numbers from a standard normal distribution. `>>> from numpy.random import standard_normal`

`>>> data = standard_normal(100)`

We estimate the probability density function, using the default Epanechnikov kernel and the default value for the bandwidth:

`>>> import statistics`

`>>> y, x = statistics.pdf(data)`

The estimated probability density function `y` as a function of `x` is drawn below (figure created by Pygist). Similarly, we can estimate the cumulative probability density function:

`>>> y, x = statistics.cpdf(data)`

In the next example, we use numpy's `random` module to generate 20000 random numbers from a distribution consisting of two Gaussians, one centered around -3 and one centered around 3, both with a standard deviation equal to unity. `>>> from numpy.random import standard_normal, randint`

`>>> n = 20000`

`>>> data = standard_normal(n) + 3.0*(randint(0,2,n)-0.5)`

Next, we estimate the probability density function, using the Epanechnikov kernel and the default value for the bandwidth:

`>>> import statistics`

`>>> y, x = statistics.pdf(data)`

The choice of kernel function usually has a minor effect on the estimated probability density function:

`>>> y, x = statistics.pdf(data, kernel="Gaussian")`

Now, let's find out what the default value for the bandwidth was:

`>>> statistics.bandwidth(data)`

`0.58133427540755089`

Choosing a bandwidth much smaller than the default value results in overfitting:

`>>> y, x = statistics.pdf(data, h = 0.58133427540755089/10)`

Choosing a bandwidth much larger than the default value results in oversmoothing:

`>>> y, x = statistics.pdf(data, h = 0.58133427540755089*4)`

Here, we generate $n=1000$ random numbers $x$, each of which is the maximum of $m$ draws from a standard normal distribution, and approximate the distribution of $x$ by a normal distribution, an extreme value distribution, and the kernel-based estimate, for increasing values of $m$.

`>>> import statistics`

`>>> from numpy.random import standard_normal`

`>>> n = 1000`

`>>> m = 1`

`>>> from numpy import array`

`>>> data = array([max(standard_normal(m)) for i in range(n)])`

`>>> y, x = statistics.pdf(data)`

The estimated probability density, together with the analytically determined probability density, a normal distribution, and the extreme value distribution are drawn in the figures below for increasing values of $m.$

For the standard normal distribution, the tail probability of finding a value larger than 1.96 is equal to 0.025. We can now compare the estimated tail probability to the analytically calculated tail probability, and compare them to the tail probability estimated by fitting a normal distribution and an extreme value distribution to the data. For $m=1,$ the distribution of $x$ reduces to a standard normal distribution, and hence the tail probability is equal to 0.025. The kernel estimate of the tail probability is close to this value:

`# using the data generated for m = 1`

`>>> statistics.cpdfc(data, x = 1.96)`

`[ 0.02511014]`

We found 0.025 by fitting a normal distribution, and 0.050 by fitting an extreme value distribution. As $m$ increases, the analytically determined tail probability will become more similar to the value estimated from the extreme value distribution, and less similar to the estimate obtained by fitting a normal distribution, as shown in the table below:

| $m$ | Exact (analytic) | Kernel estimate | Normal distribution | Extreme value distribution |
|---|---|---|---|---|
| 1 | 0.0250 | 0.0251 | 0.0250 | 0.0497 |
| 3 | 0.0310 | 0.0310 | 0.0250 | 0.0465 |
| 5 | 0.0329 | 0.0332 | 0.0250 | 0.0411 |
| 10 | 0.0352 | 0.0355 | 0.0250 | 0.0451 |
| 25 | 0.0374 | 0.0377 | 0.0250 | 0.0409 |
| 100 | 0.0398 | 0.0396 | 0.0250 | 0.0514 |
| 200 | 0.0405 | 0.0407 | 0.0250 | 0.0482 |
| 1000 | 0.0415 | 0.0417 | 0.0250 | 0.0454 |
| 10000 | 0.0424 | 0.0427 | 0.0250 | 0.0506 |

To check which version of NumPy is installed, type `>>> import numpy; print numpy.version.version` at the Python prompt. To install Statistics for Python, unpack the file:

`gunzip statistics-<version>.tar.gz`

`tar -xvf statistics-<version>.tar`

and change to the directory `statistics-<version>`. Type

`python setup.py config`

`python setup.py build`

`python setup.py install`

This will configure, compile, and install the library. If the installation was successful, you can remove the directory `statistics-<version>`.