Splitting a Poisson Distribution

We consider a remarkable property of the Poisson distribution that has a connection to the multinomial distribution. We start with the following examples.

Example 1
Suppose that the arrivals of customers in a gift shop at an airport follow a Poisson distribution with a mean of \alpha=5 per 10 minutes. Furthermore, suppose that each arrival can be classified into one of three distinct types – type 1 (no purchase), type 2 (purchase under $20), and type 3 (purchase over $20). Records show that about 25% of the customers are of type 1. The percentages of type 2 and type 3 are 60% and 15%, respectively. What is the probability distribution of the number of customers per hour of each type?

Example 2
Roll a fair die N times where N is random and follows a Poisson distribution with parameter \alpha. For each i=1,2,3,4,5,6, let N_i be the number of times the upside of the die is i. What is the probability distribution of each N_i? What is the joint distribution of N_1,N_2,N_3,N_4,N_5,N_6?

In Example 1, the stream of customers arrives according to a Poisson distribution. It can be shown that the stream of each type of customer also has a Poisson distribution. One way to view this example is that we can split the Poisson distribution into three Poisson distributions.

Example 2 also describes a splitting process, i.e. splitting a Poisson variable into 6 different Poisson variables. We can also view Example 2 as a multinomial distribution where the number of trials is not fixed but is random and follows a Poisson distribution. If the number of rolls of the die were fixed in Example 2 (say 10), then each N_i would have a binomial distribution. Yet, with the number of trials being Poisson, each N_i has a Poisson distribution with mean \displaystyle \frac{\alpha}{6}. In this post, we describe this Poisson splitting process in terms of a “random” multinomial distribution (the viewpoint of Example 2).

________________________________________________________________________

Suppose we have a multinomial experiment with parameters N, r, p_1, \cdots, p_r, where

  • N is the number of multinomial trials,
  • r is the number of distinct possible outcomes in each trial (type 1 through type r),
  • the p_i are the probabilities of the r possible outcomes in each trial.

Suppose that N follows a Poisson distribution with parameter \alpha. For each i=1, \cdots, r, let N_i be the number of occurrences of the i^{th} type of outcomes in the N trials. Then N_1,N_2,\cdots,N_r are mutually independent Poisson random variables with parameters \alpha p_1,\alpha p_2,\cdots,\alpha p_r, respectively.

Conditional on the number of trials N, the variables N_1,N_2,\cdots,N_r have a multinomial distribution and their joint probability function is:

\displaystyle (1) \ \ \ \ P(N_1=n_1,N_2=n_2,\cdots,N_r=n_r)=\frac{N!}{n_1! n_2! \cdots n_r!} \ p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}

where n_i are nonnegative integers such that N=n_1+n_2+\cdots+n_r.

Since the total number of multinomial trials N is not fixed and is random, (1) is not the end of the story. The following is the joint probability function of N_1,N_2,\cdots,N_r:

\displaystyle \begin{aligned}(2) \ \ \ \ P(N_1=n_1,N_2=n_2,\cdots,N_r=n_r)&=P(N_1=n_1,N_2=n_2,\cdots,N_r=n_r \lvert N=\sum \limits_{k=1}^r n_k) \\&\ \ \ \ \ \times P(N=\sum \limits_{k=1}^r n_k) \\&\text{ } \\&=\frac{(\sum \limits_{k=1}^r n_k)!}{n_1! \ n_2! \ \cdots \ n_r!} \ p_1^{n_1} \ p_2^{n_2} \ \cdots \ p_r^{n_r} \ \times \frac{e^{-\alpha} \alpha^{\sum \limits_{k=1}^r n_k}}{(\sum \limits_{k=1}^r n_k)!} \\&\text{ } \\&=\frac{e^{-\alpha p_1} \ (\alpha p_1)^{n_1}}{n_1!} \ \frac{e^{-\alpha p_2} \ (\alpha p_2)^{n_2}}{n_2!} \ \cdots \ \frac{e^{-\alpha p_r} \ (\alpha p_r)^{n_r}}{n_r!} \end{aligned}

The last step uses p_1+p_2+\cdots+p_r=1, so that e^{-\alpha}=e^{-\alpha p_1} e^{-\alpha p_2} \cdots e^{-\alpha p_r}. The power of \alpha splits as \alpha^{n_1} \alpha^{n_2} \cdots \alpha^{n_r}, and the two factorials of the sum cancel.

To obtain the marginal probability function of N_j, j=1,2,\cdots,r, we sum out the other variables N_k=n_k (k \ne j) in (2) and obtain the following:

\displaystyle (3) \ \ \ \ P(N_j=n_j)=\frac{e^{-\alpha p_j} \ (\alpha p_j)^{n_j}}{n_j!}

Thus we can conclude that N_j, j=1,2,\cdots,r, has a Poisson distribution with parameter \alpha p_j. Furthermore, the joint probability function of N_1,N_2,\cdots,N_r is the product of the marginal probability functions. Thus we can conclude that N_1,N_2,\cdots,N_r are mutually independent.
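
The factorization in (2) and the marginal in (3) can be checked numerically. The following is a minimal Python sketch, not part of the original derivation. The values \alpha=30 and the probabilities 0.25, 0.60, 0.15 are borrowed from Example 1, while the counts (7, 18, 5), the truncation point and all function names are arbitrary choices made for illustration.

```python
from math import exp, factorial, prod

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable with the given mean."""
    return exp(-mean) * mean**k / factorial(k)

def joint_via_conditioning(counts, alpha, probs):
    """Middle line of (2): multinomial probability given N = sum(counts),
    multiplied by the Poisson probability P(N = sum(counts))."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)          # exact integer multinomial coefficient
    multinomial = coef * prod(p**c for p, c in zip(probs, counts))
    return multinomial * poisson_pmf(n, alpha)

def joint_via_product(counts, alpha, probs):
    """Last line of (2): product of Poisson probabilities with means alpha * p_i."""
    return prod(poisson_pmf(c, alpha * p) for p, c in zip(probs, counts))

alpha = 30.0                           # values taken from Example 1
probs = [0.25, 0.60, 0.15]
counts = (7, 18, 5)                    # arbitrary illustrative outcome

lhs = joint_via_conditioning(counts, alpha, probs)
rhs = joint_via_product(counts, alpha, probs)
print("conditioning:", lhs, " product:", rhs)

# Sum out N_1 and N_3 (truncated at an assumed cutoff of 50) to get P(N_2 = 18).
cap = 50
marginal = sum(
    joint_via_conditioning((n1, 18, n3), alpha, probs)
    for n1 in range(cap)
    for n3 in range(cap)
)
print("marginal P(N_2=18):", marginal, " Poisson(18) pmf:", poisson_pmf(18, 18.0))
```

The two printed pairs should agree up to floating-point error, which is what (2) and (3) assert.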

________________________________________________________________________
Example 1
Let N_1,N_2,N_3 be the number of customers per hour of type 1, type 2, and type 3, respectively. Here, we attempt to split a Poisson distribution with mean 30 per hour (based on 5 per 10 minutes). Thus N_1,N_2,N_3 are mutually independent Poisson variables with means 30 \times 0.25=7.5, 30 \times 0.60=18, 30 \times 0.15=4.5, respectively.
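
As a sanity check on Example 1, the splitting can also be simulated. The sketch below is only an illustration under stated assumptions: arrivals within one hour are generated from exponential gaps with rate 30 per hour (a standard way to produce Poisson counts), each arrival is classified independently with probabilities 0.25, 0.60, 0.15, and the seed, the number of simulated hours and the function name simulate_hour are arbitrary.

```python
import random

def simulate_hour(rate, probs, rng):
    """Return the per-type counts for one hour of arrivals at `rate` per hour."""
    counts = [0] * len(probs)
    types = range(len(probs))
    t = rng.expovariate(rate)              # time of the first arrival, in hours
    while t < 1.0:
        kind = rng.choices(types, weights=probs)[0]   # classify this arrival
        counts[kind] += 1
        t += rng.expovariate(rate)         # exponential gap to the next arrival
    return counts

rng = random.Random(2024)                  # arbitrary seed for reproducibility
probs = [0.25, 0.60, 0.15]
hours = 10000
samples = [simulate_hour(30.0, probs, rng) for _ in range(hours)]

means = [sum(s[i] for s in samples) / hours for i in range(3)]
variances = [sum((s[i] - means[i]) ** 2 for s in samples) / hours for i in range(3)]
cov_12 = sum((s[0] - means[0]) * (s[1] - means[1]) for s in samples) / hours

print("sample means:    ", means)          # expected to be near 7.5, 18, 4.5
print("sample variances:", variances)      # Poisson: variance close to the mean
print("cov(N_1, N_2):   ", cov_12)         # independence: close to 0
```

For each type, the sample mean and sample variance should be close to each other, and the sample covariance between two types should be close to zero, consistent with independent Poisson variables.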

Example 2
As indicated earlier, each N_i, i=1,2,3,4,5,6, has a Poisson distribution with mean \frac{\alpha}{6}. According to (2), the joint probability function of N_1,N_2,N_3,N_4,N_5,N_6 is simply the product of the six marginal Poisson probability functions.

The Poisson Distribution

Let \alpha be a positive constant. Consider the following probability distribution:

\displaystyle (1) \ \ \ \ \ P(X=j)=\frac{e^{-\alpha} \alpha^j}{j!} \ \ \ \ \ j=0,1,2,\cdots

The above distribution is said to be a Poisson distribution with parameter \alpha. The Poisson distribution is usually used to model the random number of events occurring in a fixed time interval. As will be shown below, E(X)=\alpha. Thus the parameter \alpha is the rate of occurrence of the random events; it indicates on average how many events occur per unit of time. Examples of random events that may be modeled by the Poisson distribution include the number of alpha particles emitted by a radioactive substance counted in a prescribed area during a fixed period of time, the number of auto accidents in a fixed period of time or the number of losses arising from a group of insureds during a policy period.

Each of the above examples can be thought of as a process that generates a number of arrivals or changes in a fixed period of time. If such a counting process leads to a Poisson distribution, then the process is said to be a Poisson process.

We now discuss some basic properties of the Poisson distribution. Using the Taylor series expansion of e^{\alpha}, the following shows that (1) is indeed a probability distribution.

\displaystyle . \ \ \ \ \ \ \ \sum \limits_{j=0}^\infty \frac{e^{-\alpha} \alpha^j}{j!}=e^{-\alpha} \sum \limits_{j=0}^\infty \frac{\alpha^j}{j!}=e^{-\alpha} e^{\alpha}=1

The generating function of the Poisson distribution is g(z)=e^{\alpha (z-1)} (see The generating function). The mean and variance can be calculated using the generating function.

\displaystyle \begin{aligned}(2) \ \ \ \ \ &E(X)=g'(1)=\alpha \\&\text{ } \\&E[X(X-1)]=g^{(2)}(1)=\alpha^2 \\&\text{ } \\&Var(X)=E[X(X-1)]+E(X)-E(X)^2=\alpha^2+\alpha-\alpha^2=\alpha \end{aligned}
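
The identities in (2) can be confirmed numerically by summing the series directly. The following Python sketch assumes an illustrative value \alpha=5 and truncates the series at 100 terms; both choices, and the helper name poisson_pmf_table, are ours.

```python
from math import exp

def poisson_pmf_table(alpha, terms):
    """Poisson probabilities P(X=0), ..., P(X=terms-1), built term by term
    so that no large factorial is ever formed."""
    table = [exp(-alpha)]
    for j in range(1, terms):
        table.append(table[-1] * alpha / j)
    return table

alpha = 5.0                                # illustrative parameter
pmf = poisson_pmf_table(alpha, 100)        # the tail beyond 100 terms is negligible

total = sum(pmf)                                                  # should be 1
mean = sum(j * p for j, p in enumerate(pmf))                      # E(X)
second_factorial = sum(j * (j - 1) * p for j, p in enumerate(pmf))  # E[X(X-1)]
variance = second_factorial + mean - mean**2                      # Var(X)

print(total, mean, second_factorial, variance)
```

The printed values should be close to 1, \alpha, \alpha^2 and \alpha, respectively.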

The Poisson distribution can also be interpreted as an approximation to the binomial distribution. It is well known that the Poisson distribution is the limiting case of binomial distributions (see [1] or this post).

\displaystyle (3) \ \ \ \ \ \lim \limits_{n \rightarrow \infty} \binom{n}{j} \biggl(\frac{\alpha}{n}\biggr)^j \biggl(1-\frac{\alpha}{n}\biggr)^{n-j}=\frac{e^{-\alpha} \alpha^j}{j!}

One application of (3) is that we can use Poisson probabilities to approximate binomial probabilities. The approximation is reasonably good when the number of trials n in a binomial distribution is large and the probability of success p is small. The binomial mean is n p and the variance is n p (1-p). When p is small, 1-p is close to 1 and the binomial variance is approximately equal to the mean, i.e. n p (1-p) \approx np. Whenever the variance of a discrete distribution is approximately equal to its mean, the Poisson approximation is quite good. As a rule of thumb, we can use Poisson to approximate binomial if n \ge 100 and p \le 0.01.

As an example, we use the Poisson distribution to estimate the probability that at most 1 person out of 1000 will have a birthday on New Year's Day. Let n=1000 and p=365^{-1}. So we use the Poisson distribution with \alpha=1000 \times 365^{-1}. The following is an estimate using the Poisson distribution.

\displaystyle . \ \ \ \ \ \ \ P(X \le 1)=e^{-\alpha}+\alpha e^{-\alpha}=(1+\alpha) e^{-\alpha}=0.2415
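
To see how close this estimate is, we can compare it with the exact binomial probability. The following is a small Python sketch using the same n=1000 and p=1/365; the function names are ours.

```python
from math import comb, exp

n, p = 1000, 1 / 365
alpha = n * p                              # Poisson parameter used in the estimate

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial(n, p) variable."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def poisson_cdf(k, alpha):
    """P(X <= k) for a Poisson variable with parameter alpha."""
    total, term = 0.0, exp(-alpha)
    for j in range(k + 1):
        total += term
        term *= alpha / (j + 1)            # next Poisson probability
    return total

exact = binomial_cdf(1, n, p)
approx = poisson_cdf(1, alpha)
print("binomial:", round(exact, 4), " Poisson:", round(approx, 4))
```

The two probabilities differ by less than 0.001, so the Poisson estimate is adequate here.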

Another useful property is that the independent sum of Poisson distributions also has a Poisson distribution. Specifically, if each X_i has a Poisson distribution with parameter \alpha_i, then the independent sum X=X_1+\cdots+X_n has a Poisson distribution with parameter \alpha=\alpha_1+\cdots+\alpha_n. One way to see this is that the product of Poisson generating functions has the same general form as g(z)=e^{\alpha (z-1)} (see The generating function). One interpretation of this property is that when merging several arrival processes, each of which follows a Poisson distribution, the result is still a Poisson distribution.

For example, suppose that at an airline ticket counter, the arrival of first class customers follows a Poisson distribution with a mean arrival rate of 8 per 15 minutes and the arrival of customers flying coach follows a Poisson distribution with a mean rate of 12 per 15 minutes. Then the arrival of customers of either type has a Poisson distribution with a mean rate of 20 per 15 minutes or 80 per hour.
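
The merging property can be checked directly for the ticket counter example: the probability function of an independent sum is the convolution of the two probability functions. The Python sketch below compares that convolution with the Poisson probabilities for mean 20 over the first 60 values; the cutoff of 60 and the helper names are illustrative choices.

```python
from math import exp

def poisson_pmf_table(alpha, terms):
    """Poisson probabilities P(X=0), ..., P(X=terms-1)."""
    table = [exp(-alpha)]
    for j in range(1, terms):
        table.append(table[-1] * alpha / j)
    return table

def convolve(a, b, terms):
    """P(X + Y = k) for k < terms, where a and b are the probability
    functions of independent nonnegative integer variables X and Y."""
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(terms)]

terms = 60
first_class = poisson_pmf_table(8.0, terms)    # 8 per 15 minutes
coach = poisson_pmf_table(12.0, terms)         # 12 per 15 minutes

merged = convolve(first_class, coach, terms)
target = poisson_pmf_table(20.0, terms)        # claimed distribution of the sum

max_gap = max(abs(x - y) for x, y in zip(merged, target))
print("largest discrepancy:", max_gap)
```

The largest discrepancy should be at the level of floating-point rounding.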

A Poisson distribution with a large mean can be thought of as an independent sum of Poisson distributions. For example, a Poisson distribution with a mean of 50 is the independent sum of 50 Poisson distributions each with mean 1. Because of the central limit theorem, when the mean is large, we can approximate the Poisson using the normal distribution.
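
As an illustration of the normal approximation, the following Python sketch compares P(X \le 55) for a Poisson distribution with mean 50 against the normal distribution with the same mean and variance. The cutoff 55 and the continuity correction (evaluating the normal distribution function at 55.5) are our own choices and are not taken from the discussion above.

```python
from math import erf, exp, sqrt

def poisson_cdf(k, alpha):
    """P(X <= k) for a Poisson variable with parameter alpha."""
    total, term = 0.0, exp(-alpha)
    for j in range(k + 1):
        total += term
        term *= alpha / (j + 1)
    return total

def normal_cdf(x, mean, sd):
    """Distribution function of a normal variable, via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

alpha = 50.0
exact = poisson_cdf(55, alpha)
# Normal with mean alpha and variance alpha; 55.5 is the continuity correction.
approx = normal_cdf(55.5, alpha, sqrt(alpha))
print("Poisson:", exact, " normal approximation:", approx)
```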

In addition to merging several Poisson distributions into one combined Poisson distribution, we can also split a Poisson into several Poisson distributions. For example, suppose that a stream of customers arrives according to a Poisson distribution with parameter \alpha and each customer can be classified into one of two types (e.g. no purchase vs. purchase) with probabilities p_1 and p_2, respectively. Then the number of “no purchase” customers and the number of “purchase” customers are independent Poisson random variables with parameters \alpha p_1 and \alpha p_2, respectively. For more details on the splitting of Poisson, see Splitting a Poisson Distribution.

Reference

  1. Feller W. An Introduction to Probability Theory and Its Applications, Third Edition, John Wiley & Sons, New York, 1968

The generating function

Consider the function g(z)=\displaystyle e^{\alpha (z-1)} where \alpha is a positive constant. The following shows the derivatives of this function.

\displaystyle \begin{aligned}. \ \ \ \ \ \ &g(z)=e^{\alpha (z-1)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ g(0)=e^{-\alpha} \\&\text{ } \\&g'(z)=e^{\alpha (z-1)} \ \alpha \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ g'(0)=e^{-\alpha} \ \alpha \\&\text{ } \\&g^{(2)}(z)=e^{\alpha (z-1)} \ \alpha^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ g^{(2)}(0)=2! \ \frac{e^{-\alpha} \ \alpha^2}{2!} \\&\text{ } \\&g^{(3)}(z)=e^{\alpha (z-1)} \ \alpha^3 \ \ \ \ \ \ \ \ \ \ \ \ \ \ g^{(3)}(0)=3! \ \frac{e^{-\alpha} \ \alpha^3}{3!} \\&\text{ } \\&\ \ \ \ \ \ \ \ \cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \cdots \\&\text{ } \\&g^{(n)}(z)=e^{\alpha (z-1)} \ \alpha^n \ \ \ \ \ \ \ \ \ \ \ \ \ \ g^{(n)}(0)=n! \ \frac{e^{-\alpha} \ \alpha^n}{n!} \end{aligned}

Note that the derivative of g(z) at each order, evaluated at z=0, is a multiple of a Poisson probability. Thus the Poisson distribution is coded by the function g(z)=\displaystyle e^{\alpha (z-1)}. For this reason, such a function is called a generating function (or probability generating function). This post discusses some basic facts about the generating function (gf) and its cousin, the moment generating function (mgf). One important characteristic is that these functions generate probabilities and moments. Another important characteristic is that there is a one-to-one correspondence between a probability distribution and its generating function and moment generating function, i.e. two random variables with different cumulative distribution functions cannot have the same gf or mgf. In some situations, this fact is useful in working with independent sums of random variables.

————————————————————————————————————

The Generating Function
Suppose that X is a random variable that takes only nonnegative integer values with the probability function given by

\text{ }

(1) \ \ \ \ \ \ P(X=j)=a_j, \ \ \ \ j=0,1,2,\cdots

\text{ }

The idea of the generating function is that we use a power series to capture the entire probability distribution. The following defines the generating function that is associated with the above sequence a_j.

(2) \ \ \ \ \ \ g(z)=a_0+a_1 \ z+a_2 \ z^2+ \cdots=\sum \limits_{j=0}^\infty a_j \ z^j

\text{ }

Since the elements of the sequence a_j are probabilities, we can also call g(z) the generating function of the probability distribution defined by the sequence in (1). The generating function g(z) is defined wherever the power series converges. It is clear that at the minimum, the power series in (2) converges for \lvert z \rvert \le 1.

We discuss the following three properties of generating functions:

  1. The generating function completely determines the distribution.
  2. The moments of the distribution can be derived from the derivatives of the generating function.
  3. The generating function of a sum of independent random variables is the product of the individual generating functions.

The Poisson generating function at the beginning of the post is an example demonstrating property 1 (see Example 0 below for the derivation of the generating function). In some cases, the probability distribution of an independent sum can be deduced from the product of the individual generating functions. Some examples are given below.

————————————————————————————————————
Generating Probabilities
We now discuss the property 1 indicated above. To see that g(z) generates the probabilities, let’s look at the derivatives of g(z):

\displaystyle \begin{aligned}(3) \ \ \ \ \ \ &g'(z)=a_1+2 a_2 \ z+3 a_3 \ z^2+\cdots=\sum \limits_{j=1}^\infty j a_j \ z^{j-1} \\&\text{ } \\&g^{(2)}(z)=2 a_2+6 a_3 \ z+ 12 a_4 \ z^2+\cdots=\sum \limits_{j=2}^\infty j (j-1) a_j \ z^{j-2} \\&\text{ } \\&g^{(3)}(z)=6 a_3+ 24 a_4 \ z+60 a_5 \ z^2+\cdots=\sum \limits_{j=3}^\infty j (j-1)(j-2) a_j \ z^{j-3} \\&\text{ } \\&\ \ \ \ \ \ \ \ \cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \cdots \\&\text{ } \\&g^{(n)}(z)=\sum \limits_{j=n}^\infty j(j-1) \cdots (j-n+1) a_j \ z^{j-n}=\sum \limits_{j=n}^\infty \binom{j}{n} n! \ a_j \ z^{j-n} \end{aligned}

\text{ }

By letting z=0 above, all the terms vanish except for the constant term. We have:

\text{ }

(4) \ \ \ \ \ \ g^{(n)}(0)=n! \ a_n=n! \ P(X=n)

\text{ }

Thus the generating function is a compact way of encoding the probability distribution. The probability distribution determines the generating function as seen in (2). On the other hand, (3) and (4) demonstrate that the generating function also determines the probability distribution.

————————————————————————————————————
Generating Moments
The generating function also determines the moments (property 2 indicated above). For example, we have:

\displaystyle \begin{aligned}(5) \ \ \ \ \ \ &g'(1)=0 \ a_0+a_1+2 a_2+3 a_3+\cdots=\sum \limits_{j=0}^\infty j a_j=E(X) \\&\text{ } \\&g^{(2)}(1)=0 a_0 + 0 a_1+2 a_2+6 a_3+ 12 a_4+\cdots=\sum \limits_{j=0}^\infty j (j-1) a_j=E[X(X-1)] \\&\text{ } \\&E(X)=g'(1) \\&\text{ } \\&E(X^2)=g'(1)+g^{(2)}(1) \end{aligned}

\text{ }

Note that g^{(n)}(1)=E[X(X-1) \cdots (X-(n-1))]. Thus the higher moment E(X^n) can be expressed in terms of g^{(n)}(1) and g^{(k)}(1) where k<n.
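
To make the formulas in (5) concrete, the following Python sketch computes g'(1) and g^{(2)}(1) from the coefficients of a generating function and then recovers the mean, the second moment and the variance. A fair die is used as the illustration; the function name derivative_at_one is ours.

```python
from fractions import Fraction

def derivative_at_one(coeffs, order):
    """g^{(order)}(1) for g(z) = sum of coeffs[j] * z^j, i.e. the factorial
    moment E[X(X-1)...(X-order+1)]."""
    total = Fraction(0)
    for j, a in enumerate(coeffs):
        falling = 1
        for i in range(order):
            falling *= j - i               # j (j-1) ... (j-order+1)
        total += falling * a
    return total

# Coefficients a_0, ..., a_6 for a fair die: a_0 = 0 and a_1 = ... = a_6 = 1/6.
die = [Fraction(0)] + [Fraction(1, 6)] * 6

g1 = derivative_at_one(die, 1)             # E(X)
g2 = derivative_at_one(die, 2)             # E[X(X-1)]
second_moment = g1 + g2                    # E(X^2) = g'(1) + g''(1)
variance = second_moment - g1**2

print(g1, g2, second_moment, variance)     # 7/2 35/3 91/6 35/12
```

Exact fractions are used so that the results can be compared with hand calculations.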
————————————————————————————————————
More General Definitions
Note that the definition in (2) can also be interpreted as the mathematical expectation of z^X, i.e., g(z)=E(z^X). This provides a way to define the generating function for random variables that may take on values outside of the nonnegative integers. The following is a more general definition of the generating function of the random variable X, which is defined for all z where the expectation exists.

\text{ }

(6) \ \ \ \ \ \ g(z)=E(z^X)

\text{ }

————————————————————————————————————
The Generating Function of Independent Sum
Let X_1,X_2,\cdots,X_n be independent random variables with generating functions g_1,g_2,\cdots,g_n, respectively. Then the generating function of X_1+X_2+\cdots+X_n is given by the product g_1 \cdot g_2 \cdots g_n.

Let g(z) be the generating function of the independent sum X_1+X_2+\cdots+X_n. The following derives g(z). Note that the general form of generating function (6) is used.

\displaystyle \begin{aligned}(7) \ \ \ \ \ \ g(z)&=E(z^{X_1+\cdots+X_n}) \\&\text{ } \\&=E(z^{X_1} \cdots z^{X_n}) \\&\text{ } \\&=E(z^{X_1}) \cdots E(z^{X_n}) \\&\text{ } \\&=g_1(z) \cdots g_n(z) \end{aligned}

The probability distribution of a random variable is uniquely determined by its generating function. In particular, the generating function g(z) of the independent sum X_1+X_2+\cdots+X_n that is derived in (7) is unique. So if the product in (7) has the form of the generating function of a known distribution, we can deduce that the sum must have that distribution. See the examples below.

————————————————————————————————————
Example 0
In this example, we derive the generating function of the Poisson distribution. Based on the definition, we have:

\displaystyle \begin{aligned}. \ \ \ \ \ \ g(z)&=\sum \limits_{j=0}^\infty \frac{e^{-\alpha} \alpha^j}{j!} \ z^j \\&\text{ } \\&=\sum \limits_{j=0}^\infty \frac{e^{-\alpha} (\alpha z)^j}{j!}  \\&\text{ } \\&=\frac{e^{-\alpha}}{e^{- \alpha z}} \sum \limits_{j=0}^\infty \frac{e^{-\alpha z} (\alpha z)^j}{j!} \\&\text{ } \\&=e^{\alpha (z-1)} \end{aligned}

\text{ }

Example 1
Suppose that X_1,X_2,\cdots,X_n are independent random variables where each X_i has a Bernoulli distribution with probability of success p. Let q=1-p. The following is the generating function for each X_i.

\text{ }

. \ \ \ \ \ \ g(z)=q+p z

\text{ }

Then the generating function of the sum X=X_1+\cdots+X_n is g(z)^n=(q+p z)^n. The following is the binomial expansion:

\text{ }

\displaystyle \begin{aligned}(8) \ \ \ \ \ \ g(z)^n&=(q+p z)^n \\&\text{ } \\&=\sum \limits_{j=0}^n \binom{n}{j} q^{n-j} \ p^j \ z^j  \end{aligned}

\text{ }

By definition (2), the generating function of X=X_1+\cdots+X_n is:

\text{ }

\text{ }

(9) \ \ \ \ \ \ g(z)^n=\sum \limits_{j=0}^\infty P(X=j) \ z^j

\text{ }

Comparing (8) and (9), we have

\displaystyle (10) \ \ \ \ \ \ P(X=j)=\left\{\begin{matrix}\displaystyle \binom{n}{j} p^j \ q^{n-j}&\ 0 \le j \le n\\{0}&\ j>n \end{matrix}\right.

The probability distribution indicated by (8) and (10) is that of a binomial distribution. Since the probability distribution of a random variable is uniquely determined by its generating function, the independent sum of Bernoulli distributions must have a binomial distribution.

\text{ }

Example 2
Suppose that X_1,X_2,\cdots,X_n are independent and have Poisson distributions with parameters \alpha_1,\alpha_2,\cdots,\alpha_n, respectively. Then the independent sum X=X_1+\cdots+X_n has a Poisson distribution with parameter \alpha=\alpha_1+\cdots+\alpha_n.

Let g(z) be the generating function of X=X_1+\cdots+X_n. For each i, the generating function of X_i is g_i(z)=e^{\alpha_i (z-1)}. The key to the proof is that the product of the g_i has the same general form as the individual g_i.

\displaystyle \begin{aligned}(11) \ \ \ \ \ \ g(z)&=g_1(z) \cdots g_n(z) \\&\text{ } \\&=e^{\alpha_1 (z-1)} \cdots e^{\alpha_n (z-1)} \\&\text{ } \\&=e^{(\alpha_1+\cdots+\alpha_n)(z-1)} \end{aligned}

The generating function in (11) is that of a Poisson distribution with mean \alpha=\alpha_1+\cdots+\alpha_n. Since the generating function uniquely determines the distribution, we can deduce that the sum X=X_1+\cdots+X_n has a Poisson distribution with parameter \alpha=\alpha_1+\cdots+\alpha_n.

\text{ }

Example 3
In rolling a fair die, let X be the number shown on the up face. The associated generating function is:

\displaystyle. \ \ \ \ \ \ g(z)=\frac{1}{6}(z+z^2+z^3+z^4+z^5+z^6)=\frac{z(1-z^6)}{6(1-z)}

The generating function can be further reduced as:

\displaystyle \begin{aligned}. \ \ \ \ \ \ g(z)&=\frac{z(1-z^6)}{6(1-z)} \\&\text{ } \\&=\frac{z(1-z^3)(1+z^3)}{6(1-z)} \\&\text{ } \\&=\frac{z(1-z)(1+z+z^2)(1+z^3)}{6(1-z)} \\&\text{ } \\&=\frac{z(1+z+z^2)(1+z^3)}{6}  \end{aligned}

Suppose that we roll the fair die 4 times. Let W be the sum of the 4 rolls. Then the generating function of W is

\displaystyle. \ \ \ \ \ \  g(z)^4=\frac{z^4 (1+z^3)^4 (1+z+z^2)^4}{6^4}

The random variable W ranges from 4 to 24. Thus the probability function ranges from P(W=4) to P(W=24). To find these probabilities, we simply need to decode the generating function g(z)^4. For example, to find P(W=12), we need to find the coefficient of the term z^{12} in the polynomial g(z)^4. To help this decoding, we can expand two of the polynomials in g(z)^4.

\displaystyle \begin{aligned}. \ \ \ \ \ \ g(z)^4&=\frac{z^4 (1+z^3)^4 (1+z+z^2)^4}{6^4} \\&\text{ } \\&=\frac{z^4 \times A \times B}{6^4} \\&\text{ } \\&A=(1+z^3)^4=1+4z^3+6z^6+4z^9+z^{12} \\&\text{ } \\&B=(1+z+z^2)^4=1+4z+10z^2+16z^3+19z^4+16z^5+10z^6+4z^7+z^8  \end{aligned}

Based on the above polynomials, there are three ways of forming z^{12}. They are: (z^4 \times 1 \times z^8), (z^4 \times 4z^3 \times 16z^5), (z^4 \times 6z^6 \times 10z^2). Thus we have:

\displaystyle. \ \ \ \ \ \  P(W=12)=\frac{1}{6^4}(1+4 \times 16+6 \times 10)=\frac{125}{6^4}

To find the other probabilities, we can follow the same decoding process.
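
The decoding in this example can be carried out mechanically. The Python sketch below represents the generating function of one die as a list of coefficients, multiplies it by itself four times and reads off the coefficients of the product. The helper name multiply is ours, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def multiply(a, b):
    """Coefficients of the product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# g(z) = (z + z^2 + ... + z^6) / 6, stored as coefficients of z^0, ..., z^6.
die = [Fraction(0)] + [Fraction(1, 6)] * 6

dist = [Fraction(1)]                       # generating function of the constant 0
for _ in range(4):                         # raise g(z) to the 4th power
    dist = multiply(dist, die)

for w in range(4, 25):
    print(w, dist[w])                      # P(W = w); Fraction prints in lowest terms
```

The entry dist[12] equals 125/1296, in agreement with the calculation above, and the entries from dist[4] to dist[24] sum to 1.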

————————————————————————————————————
Remark
The probability distribution of a random variable is uniquely determined by its generating function. This fundamental property is useful in determining the distribution of an independent sum. The generating function of the independent sum is simply the product of the individual generating functions. If the product is of a certain distributional form (as in Example 1 and Example 2), then we can deduce that the sum must be of the same distribution.

We can also decode the product of generating functions to obtain the probability function of the independent sum (as in Example 3). The method in Example 3 is quite tedious. But one advantage is that it is a “machine process”, a pretty foolproof process that can be performed mechanically.

The machine process is this: Code the individual probability distribution in a generating function g(z). Then raise it to n. After performing some manipulation to g(z)^n, decode the probabilities from g(z)^n.

As long as we can perform the algebraic manipulation carefully and correctly, this process will be sure to provide the probability distribution of an independent sum.

————————————————————————————————————
The Moment Generating Function
The moment generating function of a random variable X is M_X(t)=E(e^{tX}), defined for all real numbers t for which the expected value exists. The moments can be computed more directly using an mgf. From the theory of mathematical analysis, it can be shown that if M_X(t) exists on some interval -a<t<a, then the derivatives of M_X(t) of all orders exist at t=0. Furthermore, it can be shown that E(X^n)=M_X^{(n)}(0).

Suppose that g(z) is the generating function of a random variable. The following relates the generating function and the moment generating function.

\displaystyle \begin{aligned}. \ \ \ \ \ \ &M_X(t)=g(e^t) \\&\text{ } \\&g(z)=M_X(\ln z)  \end{aligned}
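
As an illustration of the relation M_X(t)=g(e^t), consider the Poisson distribution, whose generating function is g(z)=e^{\alpha (z-1)}. The following Python sketch compares g(e^t) with E(e^{tX}) computed as a truncated series, and estimates M_X'(0) with a finite difference. The values of \alpha and t, the truncation point and the step size are illustrative assumptions.

```python
from math import exp

alpha, t = 3.0, 0.4                        # illustrative values

def poisson_mgf(t, alpha):
    """M_X(t) = g(e^t), where g(z) = exp(alpha * (z - 1))."""
    return exp(alpha * (exp(t) - 1.0))

# E(e^{tX}) as a truncated series over the Poisson probabilities.
series, term = 0.0, exp(-alpha)
for j in range(200):
    series += term * exp(t * j)
    term *= alpha / (j + 1)                # next Poisson probability

closed_form = poisson_mgf(t, alpha)
print("series:", series, " g(e^t):", closed_form)

# E(X) = M_X'(0), estimated with a central difference.
h = 1e-5
first_moment = (poisson_mgf(h, alpha) - poisson_mgf(-h, alpha)) / (2 * h)
print("estimated E(X):", first_moment)     # should be close to alpha
```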

————————————————————————————————————

Reference

  1. Feller W. An Introduction to Probability Theory and Its Applications, Third Edition, John Wiley & Sons, New York, 1968

The hazard rate function

In this post, we introduce the hazard rate function using the notion of a non-homogeneous Poisson process.

In a Poisson process, changes occur at a constant rate \lambda per unit time. Suppose that we interpret the changes in a Poisson process from a mortality point of view, i.e. a change in the Poisson process means a termination of a system, be it biological or manufactured, and this Poisson process counts the number of terminations as they occur. Then the rate of change \lambda is interpreted as a hazard rate (or failure rate or force of mortality). With a constant force of mortality, the time until the next change is exponentially distributed. In this post, we discuss the hazard rate function in a more general setting. The process that counts the number of terminations will no longer have a constant hazard rate, and instead will have a hazard rate function \lambda(t), a function of time t. Such a counting process is called a non-homogeneous Poisson process. We discuss the survival probability models (the time to the next termination) associated with a non-homogeneous Poisson process. We then discuss several important examples of survival probability models, including the Weibull distribution, the Gompertz distribution and the model based on Makeham’s law. We also comment briefly on the connection between the hazard rate function and the tail weight of a distribution.

\text{ }

The Poisson Process
We start with the three postulates of a Poisson process. Consider an experiment in which the occurrences of a certain type of events are counted during a given time interval. We call the occurrence of the type of events in question a change. We assume the following three conditions:

  1. The numbers of changes occurring in nonoverlapping intervals are independent.
  2. The probability of two or more changes taking place in a sufficiently small interval is essentially zero.
  3. The probability of exactly one change in the short interval (t,t+\delta) is approximately \lambda \delta where \delta is sufficiently small and \lambda is a positive constant.

\text{ }

When we interpret the Poisson process from a mortality point of view, the constant \lambda is a hazard rate (or force of mortality), which can be interpreted as the rate of failure at the next instant given that the life has survived to time t. With a constant force of mortality, the survival model (the time until the next termination) has an exponential distribution with mean \frac{1}{\lambda}. We wish to relax the constant force of mortality assumption by making \lambda a function of t instead. The remainder of this post is based on the non-homogeneous Poisson process defined below.

\text{ }

The Non-Homogeneous Poisson Process
We modify condition 3 above by making \lambda(t) a function of t. We have the following modified counting process.

  1. The numbers of changes occurring in nonoverlapping intervals are independent.
  2. The probability of two or more changes taking place in a sufficiently small interval is essentially zero.
  3. The probability of exactly one change in the short interval (t,t+\delta) is approximately \lambda(t) \delta where \delta is sufficiently small and \lambda(t) is a nonnegative function of t.

\text{ }

We focus on the survival model aspect of such counting processes. Such a process can be interpreted as a model for the number of changes that occur in a time interval, where a change means “termination” or “failure” of a system under consideration. The rate of change function \lambda(t) indicated in condition 3 is called the hazard rate function. It is also called the failure rate function in reliability engineering and the force of mortality in life contingency theory.

Based on condition 3 in the non-homogeneous Poisson process, the hazard rate function \lambda(t) can be interpreted as the rate of failure at the next instant given that the life has survived to time t.

Two random variables that naturally arise from a non-homogeneous Poisson process are described here. One is the discrete variable N_t, defined as the number of changes in the time interval (0,t). The other is the continuous random variable T, defined as the time until the occurrence of the first change. The probability distribution of T is called a survival model. The following is the link between N_t and T.

\text{ }

\displaystyle \begin{aligned}(1) \ \ \ \ \ \ \ \ \ &P[T > t]=P[N_t=0] \end{aligned}

\text{ }

Note that P[T > t] is the probability that the next change occurs after time t. This means that there is no change within the interval (0,t). We have the following theorems.

\text{ }

Theorem 1.
Let \displaystyle \Lambda(t)=\int_{0}^{t} \lambda(y) dy. Then e^{-\Lambda(t)} is the probability that there is no change in the interval (0,t). That is, \displaystyle P[N_t=0]=e^{-\Lambda(t)}.

Proof. We are interested in finding the probability of zero changes in the interval (0,y+\delta). By condition 1, the numbers of changes in the nonoverlapping intervals (0,y) and (y,y+\delta) are independent. Thus we have:

\text{ }

\displaystyle (2) \ \ \ \ \ \ \ \ P[N_{y+\delta}=0] \approx P[N_y=0] \times [1-\lambda(y) \delta]

\text{ }

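Note that by condition 3, the probability of exactly one change in the small interval (y,y+\delta) is approximately \lambda(y) \delta, and by condition 2, the probability of two or more changes in that interval is negligible. Thus [1-\lambda(y) \delta] is approximately the probability of no change in the interval (y,y+\delta). Continuing with equation (2) and letting \delta \rightarrow 0, we have the following derivation: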

\text{ }

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\frac{P[N_{y+\delta}=0] - P[N_y=0]}{\delta} \approx -\lambda(y) P[N_y=0] \\&\text{ } \\&\frac{d}{dy} P[N_y=0]=-\lambda(y) P[N_y=0] \\&\text{ } \\&\frac{\frac{d}{dy} P[N_y=0]}{P[N_y=0]}=-\lambda(y) \\&\text{ } \\&\int_0^{t} \frac{\frac{d}{dy} P[N_y=0]}{P[N_y=0]} dy=-\int_0^{t} \lambda(y)dy \end{aligned}

\text{ }

Evaluating the integral on the left hand side with the boundary condition of P[N_0=0]=1 produces the following results:

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\ln P[N_t=0]=-\int_0^{t} \lambda(y)dy \\&\text{ } \\&P[N_t=0]=\displaystyle e^{\displaystyle -\int_0^{t} \lambda(y)dy} \end{aligned}

\text{ }

Theorem 2
As discussed above, let T be the length of the interval that is required to observe the first change. Then the following are the distribution function, survival function and pdf of T:

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &F_T(t)=\displaystyle 1-e^{-\int_0^t \lambda(y) dy} \\&\text{ } \\&S_T(t)=\displaystyle e^{-\int_0^t \lambda(y) dy} \\&\text{ } \\&f_T(t)=\displaystyle \lambda(t) e^{-\int_0^t \lambda(y) dy} \end{aligned}

Proof. In Theorem 1, we derive the probability P[N_y=0] for the discrete variable N_y derived from the non-homogeneous Poisson process. We now consider the continuous random variable T, the time until the first change, which is related to N_t by (1). Thus S_T(t)=P[T > t]=P[N_t=0]=e^{-\int_0^t \lambda(y) dy}. The distribution function and density function can be derived accordingly.
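
The formulas in Theorem 2 can be sketched numerically. The following Python example assumes a hypothetical hazard rate function \lambda(t)=2t, chosen only because its integral \Lambda(t)=t^2 is known exactly. The trapezoid rule, the step counts and the function names are likewise illustrative choices.

```python
from math import exp

def hazard(t):
    """Hypothetical hazard rate function lambda(t) = 2t (illustration only)."""
    return 2.0 * t

def cumulative_hazard(t, steps=2000):
    """Trapezoid-rule approximation of the integral of hazard over (0, t)."""
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * h)
    return total * h

def survival(t):
    """S_T(t) = exp(-Lambda(t)), as in Theorem 2."""
    return exp(-cumulative_hazard(t))

def distribution(t):
    """F_T(t) = 1 - S_T(t)."""
    return 1.0 - survival(t)

def density(t):
    """f_T(t) = lambda(t) * exp(-Lambda(t))."""
    return hazard(t) * survival(t)

t = 1.5
print("S_T(t):", survival(t), " exact exp(-t^2):", exp(-t * t))

# The density should match the derivative of the distribution function.
h = 1e-5
numeric_density = (distribution(t + h) - distribution(t - h)) / (2 * h)
print("f_T(t):", density(t), " numerical derivative of F_T:", numeric_density)
```

Replacing hazard with another nonnegative function gives the corresponding survival model, subject to the accuracy of the numerical integration.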

\text{ }

Theorem 3
The hazard rate function \lambda(t) is equivalent to each of the following:

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\lambda(t)=\frac{f_T(t)}{1-F_T(t)} \\&\text{ } \\&\lambda(t)=\frac{-S_T^{'}(t)}{S_T(t)} \end{aligned}

\text{ }

Remark
Theorem 1 and Theorem 2 show that in a non-homogeneous Poisson process as described above, the hazard rate function \lambda(t) completely specifies the probability distribution of the survival model T (the time until the first change). Once the rate of change function \lambda(t) is known in the non-homogeneous Poisson process, we can use it to generate the survival function S_T(t). All of the examples of survival models given below are derived by assuming the functional form of the hazard rate function. The result in Theorem 2 holds even outside the context of a non-homogeneous Poisson process, that is, given the hazard rate function \lambda(t), we can derive the three distributional items S_T(t), F_T(t), f_T(t).

The ratio in Theorem 3 indicates that the probability distribution determines the hazard rate function. In fact, the ratio in Theorem 3 is the usual definition of the hazard rate function. That is, the hazard rate function can be defined as the ratio of the density and the survival function (one minus the cdf). With this definition, we can also recover the survival function. Whenever \displaystyle \lambda(x)=\frac{f_X(x)}{1-F_X(x)}, we can derive:

\text{ }

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &S_X(x)=\displaystyle e^{-\int_0^x \lambda(y) dy} \end{aligned}

\text{ }

As indicated above, the hazard rate function can be interpreted as the failure rate at time t given that the life in question has survived to time t. It is the rate of failure at the next instant given that the life or system being studied has survived up to time t.

It is interesting to note that the function \Lambda(t)=\int_0^t \lambda(y) dy defined in Theorem 1 is called the cumulative hazard rate function. Thus the cumulative hazard rate function is an alternative way of representing the hazard rate function (see the discussion on Weibull distribution below).
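
The relation between the hazard rate and the survival function can be checked numerically. The following is a minimal Python sketch, not part of the original discussion. The hazard rate \lambda(t)=0.5t and the midpoint-rule integrator are assumptions chosen only for illustration. For this hazard rate the cumulative hazard is 0.25t^2, so the numerical survival function should agree with e^{-0.25 t^2}.

```python
import math

def cumulative_hazard(hazard, t, steps=10_000):
    """Approximate Lambda(t), the integral of the hazard rate over [0, t], by the midpoint rule."""
    width = t / steps
    return sum(hazard((i + 0.5) * width) for i in range(steps)) * width

def survival(hazard, t):
    """S_T(t) = exp(-Lambda(t)), the relation stated in Theorem 2."""
    return math.exp(-cumulative_hazard(hazard, t))

def linear_hazard(t):
    """Hypothetical hazard rate lambda(t) = 0.5 * t, used only as an example."""
    return 0.5 * t

for t in (0.5, 1.0, 2.0):
    # Numerical value next to the closed form exp(-0.25 * t**2)
    print(t, survival(linear_hazard, t), math.exp(-0.25 * t * t))
```

Any other nonnegative hazard rate function can be passed to survival in the same way, which mirrors the remark that the hazard rate alone determines the lifetime distribution.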

——————————————————————————————————————
Examples of Survival Models

–Exponential Distribution–
In many applications, especially those for biological organisms and mechanical systems that wear out over time, the hazard rate \lambda(t) is an increasing function of t. In other words, the older the life in question (the larger the t), the higher the chance of failure at the next instant. For humans, the probability of an 85-year-old dying in the next year is clearly higher than that of a 20-year-old. In a Poisson process, the rate of change \lambda(t)=\lambda indicated in condition 3 is a constant. As a result, the time T until the first change derived in Theorem 2 has an exponential distribution with parameter \lambda. In terms of mortality study or reliability study of machines that wear out over time, this is not a realistic model. However, if the mortality or failure is caused by random external events, this could be an appropriate model.

–Weibull Distribution–
This distribution is an excellent model choice for describing the life of manufactured objects. It is defined by the following cumulative hazard rate function:

\text{ }

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\Lambda(t)=\biggl(\frac{t}{\beta}\biggr)^{\alpha} \end{aligned} where \alpha > 0 and \beta>0

\text{ }

As a result, the hazard rate function, the density function and the survival function for the lifetime distribution are:

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\lambda(t)=\frac{\alpha}{\beta} \biggl(\frac{t}{\beta}\biggr)^{\alpha-1} \\&\text{ } \\&f_T(t)=\frac{\alpha}{\beta} \biggl(\frac{t}{\beta}\biggr)^{\alpha-1} \displaystyle e^{\displaystyle -\biggl[\frac{t}{\beta}\biggr]^{\alpha}} \\&\text{ } \\&S_T(t)=\displaystyle e^{\displaystyle -\biggl[\frac{t}{\beta}\biggr]^{\alpha}} \end{aligned}

\text{ }

The parameter \alpha is the shape parameter and \beta is the scale parameter. When \alpha=1, the hazard rate becomes a constant and the Weibull distribution becomes an exponential distribution.

When the parameter \alpha<1, the failure rate decreases over time. One interpretation is that most of the defective items fail early on in the life cycle. Once they are removed from the population, the failure rate decreases over time.

When the parameter 1<\alpha, the failure rate increases with time. This is a good candidate for a model to describe the lifetime of machines or systems that wear out over time.
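
The effect of the shape parameter can be seen by tabulating the Weibull hazard rate for a few values of \alpha. The following Python sketch uses the formulas above. The scale \beta=1 and the time points are arbitrary choices made for illustration.

```python
import math

def weibull_hazard(t, alpha, beta):
    """Hazard rate (alpha / beta) * (t / beta)**(alpha - 1)."""
    return (alpha / beta) * (t / beta) ** (alpha - 1)

def weibull_survival(t, alpha, beta):
    """Survival function exp(-(t / beta)**alpha)."""
    return math.exp(-((t / beta) ** alpha))

times = [0.5, 1.0, 2.0, 4.0]
for alpha in (0.5, 1.0, 2.0):
    # alpha < 1: decreasing rates; alpha = 1: constant; alpha > 1: increasing
    rates = [round(weibull_hazard(t, alpha, beta=1.0), 4) for t in times]
    print(alpha, rates)
```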

–The Gompertz Distribution–
The Gompertz law states that the force of mortality or failure rate increases exponentially over time. It describes human mortality quite accurately. The following is the hazard rate function:

\text{ }

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\lambda(t)=\alpha e^{\beta t} \end{aligned} where \alpha>0 and \beta>0.

\text{ }

The following are the cumulative hazard rate function as well as the survival function, distribution function and the pdf of the lifetime distribution T.

\text{ }

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\Lambda(t)=\int_0^t \alpha e^{\beta y} dy=\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta} \\&\text{ } \\&S_T(t)=\displaystyle e^{\displaystyle -\biggl(\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta}\biggr)} \\&\text{ } \\&F_T(t)=\displaystyle 1-e^{\displaystyle -\biggl(\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta}\biggr)} \\&\text{ } \\&f_T(t)=\displaystyle \alpha \ e^{\beta t} \ e^{\displaystyle -\biggl(\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta}\biggr)} \end{aligned}

\text{ }

–Makeham’s Law–
Makeham’s Law states that the force of mortality is the Gompertz failure rate plus an age-independent component that accounts for external causes of mortality. The following is the hazard rate function:

\text{ }

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\lambda(t)=\alpha e^{\beta t}+\mu \end{aligned} where \alpha>0, \beta>0 and \mu>0.

\text{ }

The following are the cumulative hazard rate function as well as the survival function, distribution function and the pdf of the lifetime distribution T.

\text{ }

\displaystyle \begin{aligned}. \ \ \ \ \ \ \ \ \ &\Lambda(t)=\int_0^t (\alpha e^{\beta y}+\mu) dy=\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta}+\mu t \\&\text{ } \\&S_T(t)=\displaystyle e^{\displaystyle -\biggl(\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta}+\mu t\biggr)} \\&\text{ } \\&F_T(t)=\displaystyle 1-e^{\displaystyle -\biggl(\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta}+\mu t\biggr)} \\&\text{ } \\&f_T(t)=\biggl( \alpha e^{\beta t}+\mu \biggr) \ e^{\displaystyle -\biggl(\frac{\alpha}{\beta} e^{\beta t}-\frac{\alpha}{\beta}+\mu t\biggr)} \end{aligned}

\text{ }
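
As a sanity check on the Makeham model, the density can be built as the hazard rate times the survival function (Theorem 2) and integrated numerically. The following Python sketch is only an illustration. The parameter values and the midpoint-rule integrator are assumptions, not calibrated mortality figures.

```python
import math

def makeham_hazard(t, alpha, beta, mu):
    """lambda(t) = alpha * e^(beta t) + mu."""
    return alpha * math.exp(beta * t) + mu

def makeham_survival(t, alpha, beta, mu):
    """S_T(t) = exp(-Lambda(t)) with Lambda(t) = (alpha/beta)(e^(beta t) - 1) + mu t."""
    cumulative = (alpha / beta) * (math.exp(beta * t) - 1.0) + mu * t
    return math.exp(-cumulative)

def makeham_pdf(t, alpha, beta, mu):
    """f_T(t) = lambda(t) * S_T(t)."""
    return makeham_hazard(t, alpha, beta, mu) * makeham_survival(t, alpha, beta, mu)

def integrate(f, a, b, steps=20_000):
    """Midpoint rule on [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

params = dict(alpha=0.001, beta=0.1, mu=0.005)  # illustrative values only
# The density should integrate to (nearly) one over a long enough horizon
print(integrate(lambda t: makeham_pdf(t, **params), 0.0, 200.0))
# Integrating the density over [0, 50] should reproduce F_T(50) = 1 - S_T(50)
print(integrate(lambda t: makeham_pdf(t, **params), 0.0, 50.0),
      1.0 - makeham_survival(50.0, **params))
```

Setting mu=0 in the same functions gives the Gompertz model.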

The Tail Weight of a Distribution
The hazard rate function can provide information about the tail of a distribution. If the hazard rate function is decreasing, it is an indication that the distribution has a heavy tail, i.e., the distribution puts significantly more probability on larger values. Conversely, if the hazard rate function is increasing, it is an indication of a lighter tail. In an insurance context, heavy tailed distributions (e.g. the Pareto distribution) are suitable candidates for modeling large insurance losses (see this previous post or [1]).

\text{ }

Reference

  1. Klugman S. A., Panjer H. H., Willmot G. E., Loss Models: From Data to Decisions, Second Edition, Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

The mean excess loss function

We take a closer look at the mean excess loss function. We start with the following example:

Example
Suppose that an entity is exposed to a random loss X. An insurance policy offers protection against this loss. Under this policy, payment is made to the insured entity subject to a deductible d>0, i.e. when a loss is less than d, no payment is made to the insured entity, and when the loss exceeds d, the insured entity is reimbursed for the amount of the loss in excess of the deductible d. Consider the following two questions:

\text{ }

  1. Of all the losses that are eligible to be reimbursed by the insurer, what is the average payment made by the insurer to the insured?
  2. What is the average payment made by the insurer to the insured entity?

\text{ }

The two questions look similar. The difference between the two questions is subtle but important. In the first question, the average is computed over all losses that are eligible for reimbursement (i.e., the loss exceeds the deductible). This is the average amount the insurer is expected to pay in the event that a payment in excess of the deductible is required to be made. So this average is a per payment average.

In the second question, the average is calculated over all losses (regardless of sizes). When the loss does not reach the deductible, the payment is considered zero and when the loss is in excess of the deductible, the payment is X-d. Thus the average is the average amount the insurer has to pay per loss. So the second question is about a per loss average.

—————————————————————————————————————-

The Mean Excess Loss Function
The average in the first question is called the mean excess loss function. Suppose X is the random loss and d>0. The mean excess loss variable is the conditional variable X-d \ \lvert X>d and the mean excess loss function e_X(d) is defined by:

\text{ }

\displaystyle (1) \ \ \ \ \ e_X(d)=E(X-d \lvert X>d)

\text{ }

In an insurance context, the mean excess loss function is the average payment in excess of a threshold given that the loss exceeds the threshold. In a mortality context, the mean excess loss function is called the mean residual life function or the complete expectation of life, and can be interpreted as the expected remaining time until death given that the life in question is alive at age d.

The mean excess loss function is computed by the following depending on whether the loss variable is continuous or discrete.

\text{ }

\displaystyle (2) \ \ \ \ \ e_X(d)=\frac{\int_d^\infty (x-d) \ f_X(x) \ dx}{S_X(d)}

\displaystyle (3) \ \ \ \ \ e_X(d)=\frac{\sum \limits_{x>d} (x-d) \ P(X=x)}{S_X(d)}

\text{ }

The mean excess loss function e_X(d) is defined only when the integral or the sum converges. The following is an equivalent calculation of e_X(d) that may be easier to use in some circumstances.

\text{ }

\displaystyle (4) \ \ \ \ \ e_X(d)=\frac{\int_d^\infty S_X(x) \ dx}{S_X(d)}

\displaystyle (5a) \ \ \ \ \ e_X(d)=\frac{\sum \limits_{x \ge d} S_X(x) }{S_X(d)}

\displaystyle (5b) \ \ \ \ \ e_X(d)=\frac{\biggl(\sum \limits_{x>d} S_X(x)\biggr)+(w+1-d) S_X(w) }{S_X(d)}

\text{ }

In both (5a) and (5b), we assume that the support of X is the set of nonnegative integers. In (5a), we assume that the deductible d is a positive integer. In (5b), the deductible d is free to be any positive number and w is the largest integer such that w \le d. The formulation (4) is obtained by using integration by parts (also see theorem 3.1 in [1]). The formulations of (5a) and (5b) are a result of applying theorem 3.2 in [1].
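
The discrete formulas can be compared numerically. The following Python sketch evaluates (3), (5a) and (5b) for a loss with a Poisson distribution. The Poisson mean of 3 and the truncation of the support at 100 are assumptions made only for this illustration.

```python
import math

MEAN, N_MAX = 3.0, 100  # hypothetical Poisson mean; support truncated at N_MAX

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean ** k / math.factorial(k)

pmf = [poisson_pmf(k, MEAN) for k in range(N_MAX + 1)]

def surv(x):
    """S_X(x) = P(X > x) for any real x >= 0."""
    return sum(p for k, p in enumerate(pmf) if k > x)

def mel_direct(d):
    """Formula (3): sum of (x - d) P(X = x) over x > d, divided by S_X(d)."""
    return sum((k - d) * p for k, p in enumerate(pmf) if k > d) / surv(d)

def mel_5a(d):
    """Formula (5a): d must be a positive integer."""
    return sum(surv(k) for k in range(d, N_MAX + 1)) / surv(d)

def mel_5b(d):
    """Formula (5b): d is any positive number, w is the largest integer <= d."""
    w = math.floor(d)
    tail = sum(surv(k) for k in range(w + 1, N_MAX + 1))
    return (tail + (w + 1 - d) * surv(w)) / surv(d)

print(mel_direct(2), mel_5a(2))        # integer deductible
print(mel_direct(2.5), mel_5b(2.5))    # non-integer deductible
```

In each printed pair the two values should agree up to floating point error.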

The mean excess loss function provides information about the tail weight of a distribution, see the previous post The Pareto distribution. Also see Example 3 below.
—————————————————————————————————————-
The Mean in Question 2
The average that we need to compute is the mean of the following random variable. Note that (a)_+ is the function that assigns the value of a whenever a>0 and otherwise assigns the value of zero.

\text{ }

\displaystyle (6) \ \ \ \ \ (X-d)_+=\left\{\begin{matrix}0&\ X<d\\{X-d}&\ X \ge d \end{matrix}\right.

\text{ }

The mean E((X-d)_+) is calculated over all losses. When the loss is less than the deductible d, the insurer has no obligation to make a payment to the insured and the payment is assumed to be zero in the calculation of E[(X-d)_+]. The following is how this expected value is calculated depending on whether the loss X is continuous or discrete.

\text{ }

\displaystyle (7) \ \ \ \ \ E((X-d)_+)=\int_d^\infty (x-d) \ f_X(x) \ dx

\displaystyle (8) \ \ \ \ \ E((X-d)_+)=\sum \limits_{x>d} (x-d) \ P(X=x)

\text{ }

Based on the definitions, the following is how the two averages are related.

\displaystyle E[(X-d)_+]=e_X(d) \ [1-F_X(d)] \ \ \ \text{or} \ \ \ E[(X-d)_+]=e_X(d) \ S_X(d)
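
To make the difference between the two averages concrete, the following Python sketch computes both for a small discrete loss distribution. The loss amounts, probabilities and the deductible of 400 are made-up numbers used only for illustration.

```python
# Hypothetical loss distribution: loss amount -> probability
losses = {100: 0.4, 500: 0.3, 1000: 0.2, 5000: 0.1}
d = 400  # hypothetical deductible

def per_loss_mean(dist, d):
    """E[(X-d)_+]: average payment over all losses, formula (8)."""
    return sum(max(x - d, 0) * p for x, p in dist.items())

def survival(dist, d):
    """S_X(d) = P(X > d)."""
    return sum(p for x, p in dist.items() if x > d)

def mean_excess_loss(dist, d):
    """e_X(d): average payment over the losses that exceed d."""
    return per_loss_mean(dist, d) / survival(dist, d)

print(per_loss_mean(losses, d))     # per loss average, about 610
print(mean_excess_loss(losses, d))  # per payment average, about 1016.67
```

Here 40% of the losses fall below the deductible and produce no payment, so the per loss average (about 610) is 60% of the per payment average (about 1016.67), in line with the relation E[(X-d)_+]=e_X(d) S_X(d).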

—————————————————————————————————————-
The Limited Expected Value
For a given positive constant u, the limited loss variable is defined by

\text{ }

\displaystyle (9) \ \ \ \ \ X \wedge u=\left\{\begin{matrix}X&\ X<u\\{u}&\ X \ge u \end{matrix}\right.

\text{ }

The expected value E(X \wedge u) is called the limited expected value. In an insurance application, the u is a policy limit that sets a maximum on the benefit to be paid. The following is how the limited expected value is calculated depending on whether the loss X is continuous or discrete.

\text{ }

\displaystyle (10) \ \ \ \ \ E(X \wedge u)=\int_{-\infty}^u x \ f_X(x) \ dx+u \ S_X(u)

\displaystyle (11) \ \ \ \ \ E(X \wedge u)=\biggl(\sum \limits_{x < u} x \ P(X=x)\biggr)+u \ S_X(u)

\text{ }

Interestingly, we have the following relation.

\text{ }

\displaystyle (X-d)_+ + (X \wedge d)=X \ \ \ \ \ \ \text{and} \ \ \ \ \ \ E[(X-d)_+] + E(X \wedge d)=E(X)

\text{ }

The above statement indicates that purchasing a policy with a deductible d and another policy with a policy maximum d is equivalent to buying full coverage.

Another way to interpret X \wedge d is that it is the amount of loss that is eliminated by having a deductible in the insurance policy. If the insurance policy pays the loss in full, then the insurance payment is X and the amount the insurer is expected to pay is E(X). By having a deductible provision in the policy, the insurer is now only liable for the amount (X-d)_+ and the amount the insurer is expected to pay per loss is E[(X-d)_+]. Consequently E(X \wedge d) is the expected amount of the loss that is eliminated by the deductible provision in the policy. The following summarizes this observation.

\text{ }

\displaystyle (X \wedge d)=X-(X-d)_+ \ \ \ \ \ \ \text{and} \ \ \ \ \ \ E(X \wedge d)=E(X)-E[(X-d)_+]

—————————————————————————————————————-
Example 1
Let the loss random variable X be exponential with pdf f(x)=\alpha e^{-\alpha x}. We have E(X)=\frac{1}{\alpha}. Because of the memoryless property of the exponential distribution, given that a loss exceeds the deductible, the mean payment is the same as the original mean. Thus e_X(d)=\frac{1}{\alpha}. Then the per loss average is:

\displaystyle E[(X-d)_+]=e_X(d) \ S(d) = \frac{e^{-\alpha d}}{\alpha}

\text{ }

Thus, with a deductible provision in the policy, the insurer is expected to pay \displaystyle \frac{e^{-\alpha d}}{\alpha} per loss instead of \displaystyle \frac{1}{\alpha}. Thus the expected amount of loss eliminated (from the insurer’s point of view) is \displaystyle E(X \wedge d)=\frac{1-e^{-\alpha d}}{\alpha}.
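
The closed forms in Example 1 can be checked by numerical integration. The following Python sketch is only an illustration. The values \alpha=0.5 and d=3, the midpoint-rule integrator and the truncation of the infinite upper limit are all assumptions.

```python
import math

def integrate(f, a, b, steps=100_000):
    """Midpoint rule on [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

def per_loss_mean(alpha, d, upper=200.0):
    """E[(X-d)_+] via formula (7); the infinite upper limit is truncated at `upper`."""
    return integrate(lambda x: (x - d) * alpha * math.exp(-alpha * x), d, upper)

def limited_mean(alpha, d):
    """E(X ^ d): integral of x f(x) over [0, d] plus d * S(d)."""
    body = integrate(lambda x: x * alpha * math.exp(-alpha * x), 0.0, d)
    return body + d * math.exp(-alpha * d)

alpha, d = 0.5, 3.0  # illustrative values
print(per_loss_mean(alpha, d), math.exp(-alpha * d) / alpha)
print(limited_mean(alpha, d), (1.0 - math.exp(-alpha * d)) / alpha)
print(per_loss_mean(alpha, d) + limited_mean(alpha, d), 1.0 / alpha)
```

The last line illustrates that the two pieces add up to the full expected loss E(X).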

\text{ }

Example 2
Suppose that the loss variable has a Gamma distribution where the rate parameter is \alpha and the shape parameter is n=2. The pdf is \displaystyle g(x)=\alpha^2 \ x \ e^{-\alpha x}. The insurer’s expected payment without the deductible is E(X)=\frac{2}{\alpha}. The survival function S(x) is:

\displaystyle S(x)=e^{-\alpha x}(1+\alpha x)

\text{ }

For the losses that exceed the deductible, the insurer’s expected payment is:

\displaystyle \begin{aligned}e_X(d)&=\frac{\int_d^{\infty} S(x) \ dx}{S(d)}\\&=\frac{\int_d^{\infty} e^{-\alpha x}(1+\alpha x) \ dx}{e^{-\alpha d}(1+\alpha d)} \\&=\frac{\frac{e^{-\alpha d}}{\alpha}+d e^{-\alpha d}+\frac{e^{-\alpha d}}{\alpha}}{e^{-\alpha d}(1+\alpha d)} \\&=\frac{\frac{2}{\alpha}+d}{1+\alpha d} \end{aligned}

\text{ }

Then the insurer’s expected payment per loss is E[(X-d)_+]:

\displaystyle \begin{aligned}E[(X-d)_+]&=e_X(d) \ S(d) \\&=\frac{\frac{2}{\alpha}+d}{1+\alpha d} \ \ e^{-\alpha d}(1+\alpha d) \\&=e^{-\alpha d} \ \biggl(\frac{2}{\alpha}+d\biggr) \end{aligned}

\text{ }

With a deductible in the policy, the following is the expected amount of loss eliminated (from the insurer’s point of view).

\displaystyle \begin{aligned}E[X \wedge d]&=E(X)-E[(X-d)_+] \\&=\frac{2}{\alpha}-e^{-\alpha d} \ \biggl(\frac{2}{\alpha}+d\biggr) \\&=\frac{2}{\alpha}\biggl(1-e^{-\alpha d}\biggr)-d e^{-\alpha d} \end{aligned}
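
The formulas in Example 2 can also be checked numerically by integrating the survival function as in formula (4). In the following Python sketch, the values \alpha=0.5 and d=3 and the truncation of the integral at 100 are assumptions made for illustration.

```python
import math

def integrate(f, a, b, steps=100_000):
    """Midpoint rule on [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

def gamma_survival(x, alpha):
    """S(x) = e^(-alpha x) (1 + alpha x) for the Gamma loss with shape 2."""
    return math.exp(-alpha * x) * (1.0 + alpha * x)

def mean_excess_numeric(d, alpha, upper=100.0):
    """Formula (4), with the infinite upper limit truncated at `upper`."""
    return integrate(lambda x: gamma_survival(x, alpha), d, upper) / gamma_survival(d, alpha)

def mean_excess_closed(d, alpha):
    """Closed form derived in Example 2."""
    return (2.0 / alpha + d) / (1.0 + alpha * d)

alpha, d = 0.5, 3.0  # illustrative values
e_d = mean_excess_numeric(d, alpha)
print(e_d, mean_excess_closed(d, alpha))                                         # per payment
print(e_d * gamma_survival(d, alpha), math.exp(-alpha * d) * (2.0 / alpha + d))  # per loss
```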

\text{ }

Example 3
Suppose the loss variable X has a Pareto distribution with the following pdf:

\text{ }

\displaystyle f_X(x)=\frac{\beta \ \alpha^\beta}{(x+\alpha)^{\beta+1}} \ \ \ \ \ x>0

\text{ }

If the insurance policy is to pay the full loss, then the insurer’s expected payment per loss is \displaystyle E(X)=\frac{\alpha}{\beta-1} provided that the shape parameter \beta is larger than one.

The mean excess loss function of the Pareto distribution has a linear form that is increasing (see the previous post The Pareto distribution). The following is the mean excess loss function:

\text{ }

\displaystyle e_X(d)=\frac{1}{\beta-1} \ d +\frac{\alpha}{\beta-1}=\frac{1}{\beta-1} \ d +E(X)

\text{ }

If the loss is modeled by such a distribution, this is an uninsurable risk! The higher the deductible, the larger the expected payment if such a large loss occurs. The expected payment for large losses is always the unmodified expected loss E(X) plus a component that is increasing in d.

The increasing mean excess loss function is an indication that the Pareto distribution is a heavy tailed distribution. In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.
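
The linear form of the Pareto mean excess loss function can be checked against formula (4). The following Python sketch is only an illustration. The parameters \alpha=2 and \beta=3 are arbitrary, and the substitution x=d+t/(1-t), which maps the infinite range onto (0,1), is a choice made here for the numerical integration rather than something from the discussion above.

```python
def pareto_survival(x, alpha, beta):
    """S_X(x) = (alpha / (x + alpha))**beta."""
    return (alpha / (x + alpha)) ** beta

def mean_excess_numeric(d, alpha, beta, steps=100_000):
    """Integral of S_X over (d, infinity) divided by S_X(d), using x = d + t / (1 - t)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width
        x = d + t / (1.0 - t)
        total += pareto_survival(x, alpha, beta) / (1.0 - t) ** 2  # dx = dt / (1 - t)**2
    return total * width / pareto_survival(d, alpha, beta)

def mean_excess_closed(d, alpha, beta):
    """Closed form (d + alpha) / (beta - 1), valid for beta > 1."""
    return (d + alpha) / (beta - 1.0)

alpha, beta = 2.0, 3.0  # illustrative values
for d in (0.0, 5.0, 20.0):
    print(d, mean_excess_numeric(d, alpha, beta), mean_excess_closed(d, alpha, beta))
```

The printed values grow linearly in d, in contrast with the constant mean excess loss of the exponential distribution.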

Reference

  1. Bowers N. L., Gerber H. U., Hickman J. C., Jones D. A., Nesbitt C. J., Actuarial Mathematics, First Edition, The Society of Actuaries, Itasca, Illinois, 1986
  2. Klugman S. A., Panjer H. H., Willmot G. E., Loss Models: From Data to Decisions, Second Edition, Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

Another way to generate the Pareto distribution

The Pareto distribution is a heavy tailed distribution and a suitable candidate for modeling large insurance losses above a threshold. It is a mixture of exponential distributions with Gamma mixing weights. Another way to generate the Pareto distribution is taking the inverse of another distribution (raising another distribution to the power of minus one). Previous discussion on the Pareto distribution can be found here: An example of a mixture and The Pareto distribution.

\text{ }

Let X be a continuous random variable with pdf f_X(x) and with cdf F_X(x) such that F_X(0)=0. Let Y=X^{-1}. Then the resulting distribution for Y is called an inverse. For example, if X has an exponential distribution, then Y is said to have an inverse exponential distribution. The following are the cdf and pdf of Y=X^{-1}.

\text{ }

\displaystyle F_Y(y)=1-F_X(y^{-1}) \ \ \ \ \ \ \ \ f_Y(y)=f_X(y^{-1}) \ y^{-2}

\text{ }

We now show that the Pareto distribution is an inverse. Suppose that X has the pdf f_X(x)=\beta \ x^{\beta-1}, 0<x<1. We show that Y=X^{-1} has a Pareto distribution with scale parameter \alpha=1 and shape parameter \beta>0. Once this base distribution is established, we can relax the scale parameter to have other positive values.

\text{ }

The cdf of X is F_X(x)=x^\beta where 0<x<1. Since the support of X is 0<x<1, the support of Y is y>1. Thus in deriving the cdf F_Y(y), we only need to focus on y>1 (or 0<x<1). The following is the cdf F_Y(y):

\text{ }

\displaystyle F_Y(y)=1-F_X(y^{-1})=1-y^{-\beta}, \ \ \ \ y>1

\text{ }

Upon differentiation, we obtain the pdf:

\displaystyle f_Y(y)=\beta \ y^{-(\beta+1)}=\frac{\beta}{y^{\beta+1}}, \ \ \ y>1

\text{ }

The above pdf is that of a Pareto distribution with scale parameter \alpha=1 and shape parameter \beta. However, the support of this pdf is y>1. In order to have y>0 as the support, we shift the variable by one (i.e. replace Y by Y-1) and obtain the following alternative version:

\text{ }

\displaystyle f_Y(y)=\frac{\beta}{(y+1)^{\beta+1}}, \ \ \ y>0

\text{ }

We now transform the above pdf to become a true 2-parameter Pareto pdf by relaxing the scale parameter. The result is the following pdf.

\text{ }

\displaystyle f_Y(y)=\frac{\beta \ \alpha^{\beta}}{(y+\alpha)^{\beta+1}}, \ \ \ y>0
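
The construction above can be illustrated by simulation. The following Python sketch draws X with cdf x^\beta on (0,1), takes the inverse, shifts it to start at zero and scales it by \alpha, then compares the empirical survival function with the Pareto survival function (\alpha/(y+\alpha))^\beta. The parameter values, sample size and seed are arbitrary, and generating X as U^{1/\beta} from a uniform U is an assumption of this sketch.

```python
import random

def sample_pareto_from_inverse(alpha, beta, rng):
    """One Pareto draw built from the inverse of X, where X has cdf x**beta on (0, 1)."""
    u = 1.0 - rng.random()          # uniform on (0, 1], avoids u == 0
    x = u ** (1.0 / beta)           # P(X <= x) = x**beta
    return alpha * (1.0 / x - 1.0)  # shift 1/X to start at 0, then scale by alpha

def pareto_survival(y, alpha, beta):
    return (alpha / (y + alpha)) ** beta

rng = random.Random(2024)
alpha, beta, n = 2.0, 3.0, 200_000  # illustrative values
draws = [sample_pareto_from_inverse(alpha, beta, rng) for _ in range(n)]
for y in (0.5, 1.0, 3.0):
    empirical = sum(v > y for v in draws) / n
    print(y, round(empirical, 4), round(pareto_survival(y, alpha, beta), 4))
```

The empirical and theoretical columns should be close, with differences on the order of the sampling error.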

The Pareto distribution

This post takes a closer look at the Pareto distribution. A previous post demonstrates that the Pareto distribution is a mixture of exponential distributions with Gamma mixing weights. We now elaborate more on this point. Through looking at various properties of the Pareto distribution, we also demonstrate that the Pareto distribution is a heavy tailed distribution. In insurance applications, heavy-tailed distributions are essential tools for modeling extreme loss, especially for the more risky types of insurance such as medical malpractice insurance. In financial applications, the study of heavy-tailed distributions provides information about the potential for financial fiasco or financial ruin. The Pareto distribution is a great way to open up a discussion on heavy-tailed distributions.

\text{ }

Update (11/12/2017). This blog post introduces a catalog of many other parametric severity models in addition to Pareto distribution. The link to the catalog is found in that blog post. To go there directly, this is the link.

Update (10/29/2017). This blog post has updated information on Pareto distribution. It also has links to more detailed contents on Pareto distribution in two companion blogs. These links are also given here: more detailed post on Pareto, Pareto Type I and Type II and practice problems on Pareto.

\text{ }

The continuous random variable X with positive support is said to have the Pareto distribution if its probability density function is given by

\displaystyle f_X(x)=\frac{\beta \ \alpha^\beta}{(x+\alpha)^{\beta+1}} \ \ \ \ \ x>0

where \alpha>0 and \beta>0 are constants. The constant \alpha is the scale parameter and \beta is the shape parameter. The following lists several other distributional quantities of the Pareto distribution, which will be used in the discussion below.

\displaystyle S_X(x)=\frac{\alpha^\beta}{(x+\alpha)^\beta}=\biggl(\frac{\alpha}{x+\alpha}\biggr)^\beta \ \ \ \ \ \ \ \ \ \text{survival function}

\displaystyle F_X(x)=1-\biggl(\frac{\alpha}{x+\alpha}\biggr)^\beta \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{distribution function}

\displaystyle E(X)=\frac{\alpha}{\beta-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{mean},\beta>1

\displaystyle E(X^2)=\frac{2 \alpha^2}{(\beta-1)(\beta-2)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{second moment},\beta>2

\displaystyle Var(X)=\frac{\alpha^2 \beta}{(\beta-1)^2(\beta-2)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{variance},\beta>2

\displaystyle E(X^k)=\frac{k! \alpha^k}{(\beta-1) \cdots (\beta-k)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{higher moments},\beta>k, \text{ k positive integer}

——————————————————————————————————————
The Pareto Distribution as a Mixture
The Pareto pdf indicated above can be obtained by mixing exponential distributions using Gamma distributions as weights. Suppose that X follows an exponential distribution (conditional on a parameter value \theta). The following is the conditional pdf of X.

\displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\theta e^{-\theta x} \ \ \ x>0

There is uncertainty in the parameter, which can be viewed as a random variable \Theta. Suppose that \Theta follows a Gamma distribution with rate parameter \alpha and shape parameter \beta. The following is the pdf of \Theta.

\displaystyle f_{\Theta}(\theta)=\frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta} \ \ \ \theta>0

The unconditional pdf of X is the weighted average of the conditional pdfs with the Gamma pdf as weight.

\displaystyle \begin{aligned}f_X(x)&=\int_0^{\infty} f_{X \lvert \Theta}(x \lvert \theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\theta \ e^{-\theta x}\biggr] \ \biggl[\frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^\beta \ e^{-\theta(x+\alpha)} \ d \theta \\&=\frac{\alpha^\beta}{\Gamma(\beta)} \frac{\Gamma(\beta+1)}{(x+\alpha)^{\beta+1}} \int_0^{\infty} \frac{(x+\alpha)^{\beta+1}}{\Gamma(\beta+1)} \ \theta^{\beta+1-1} \ e^{-\theta(x+\alpha)} \ d \theta \\&=\frac{\beta \ \alpha^\beta}{(x+\alpha)^{\beta+1}} \end{aligned}
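
The mixture can also be illustrated by simulation: draw \Theta from the Gamma distribution, then draw X from an exponential distribution with parameter \Theta, and compare the empirical survival function of X with the Pareto survival function (\alpha/(x+\alpha))^\beta. The following Python sketch uses arbitrary parameter values, sample size and seed. Note that random.gammavariate takes a shape and a scale, so the \alpha in the Gamma pdf above (which multiplies \theta in the exponent) enters as the scale 1/\alpha.

```python
import random

def sample_mixture(alpha, beta, rng):
    """One draw of X: Theta ~ Gamma(shape beta, scale 1/alpha), then X ~ exponential with rate Theta."""
    theta = rng.gammavariate(beta, 1.0 / alpha)  # arguments are (shape, scale)
    return rng.expovariate(theta)                # argument is the rate

def pareto_survival(x, alpha, beta):
    return (alpha / (x + alpha)) ** beta

rng = random.Random(12345)
alpha, beta, n = 2.0, 3.0, 200_000  # illustrative values
draws = [sample_mixture(alpha, beta, rng) for _ in range(n)]
for x in (0.5, 1.0, 3.0):
    empirical = sum(v > x for v in draws) / n
    print(x, round(empirical, 4), round(pareto_survival(x, alpha, beta), 4))
```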

In the following discussion, X will denote a random variable with the Pareto distribution as defined above. As will be shown below, the exponential distribution is considered a light tailed distribution. Yet mixing exponentials produces the heavy tailed Pareto distribution. Mixture distributions tend to be heavy tailed (see [1]). The Pareto distribution is a handy example.

——————————————————————————————————————

The Tail Weight of the Pareto Distribution
When a distribution puts significantly more probability on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). According to [1], there are four ways to look for an indication that a distribution is heavy tailed.

  1. Existence of moments.
  2. Speed of decay of the survival function to zero.
  3. Hazard rate function.
  4. Mean excess loss function.

Existence of moments
Note that the existence of the Pareto higher moments E(X^k) is capped by the shape parameter \beta. In particular, the mean E(X)=\frac{\alpha}{\beta-1} does not exist for \beta \le 1. If the Pareto distribution is to model a random loss, and if the mean is infinite (when \beta \le 1), the risk is uninsurable! On the other hand, when \beta \le 2, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.

For a given random variable Z, the existence of all moments E(Z^k), for all positive integers k, indicates a light (right) tail for the distribution of Z. If the positive moments exist only up to a certain positive integer k, it is an indication that the distribution has a heavy right tail. In contrast to the Pareto distribution, the exponential distribution and the Gamma distribution are considered to have light tails since all their moments exist.

The speed of decay of the survival function
The survival function S_X(x)=P(X>x) captures the probability of the tail of a distribution. If the survival function of a distribution decays slowly to zero (equivalently the cdf goes slowly to one), it is another indication that the distribution is heavy tailed.

The following is a comparison of a Pareto survival function and an exponential survival function. The Pareto survival function has parameters (\alpha=2 and \beta=2). The two survival functions are set to have the same 75th percentile (x=2).

\displaystyle \begin{pmatrix} \text{x}&\text{Pareto }S_X(x)&\text{Exponential }S_Y(x)&\displaystyle \frac{S_X(x)}{S_Y(x)} \\\text{ }&\text{ }&\text{ }&\text{ } \\{2}&0.25&0.25&1 \\{10}&0.027777778&0.000976563&28  \\{20}&0.008264463&9.54 \times 10^{-7}&8666 \\{30}&0.00390625&9.31 \times 10^{-10}&4194304 \\{40}&0.002267574&9.09 \times 10^{-13}&2.49 \times 10^{9} \\{60}&0.001040583&8.67 \times 10^{-19}&1.20 \times 10^{15} \\{80}&0.000594884&8.27 \times 10^{-25}&7.19 \times 10^{20} \\{100}&0.000384468&7.89 \times 10^{-31}&4.87 \times 10^{26} \\{120}&0.000268745&7.52 \times 10^{-37}&3.57 \times 10^{32} \\{140}&0.000198373&7.17 \times 10^{-43}&2.76 \times 10^{38} \\{160}&0.000152416&6.84 \times 10^{-49}&2.23 \times 10^{44} \\{180}&0.000120758&6.53 \times 10^{-55}&1.85 \times 10^{50}  \end{pmatrix}

Note that at the large values, the Pareto right tails retain much more probability. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. If a random loss is a heavy tailed phenomenon that is described by the above Pareto survival function (\alpha=2 and \beta=2), then the above exponential survival function is woefully inadequate as a model for this phenomenon even though it may be a good model for describing the loss up to the 75th percentile. It is the large right tail that is problematic (and catastrophic)!
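
The entries of the above table can be reproduced with a few lines of Python. In the following sketch, the exponential rate is obtained by matching the 75th percentile at x=2, i.e. solving e^{-2\lambda}=0.25, which gives \lambda=\ln(4)/2. The function names are only for this illustration.

```python
import math

ALPHA, BETA = 2.0, 2.0       # Pareto parameters used in the table
RATE = math.log(4.0) / 2.0   # exponential rate with S_Y(2) = 0.25

def pareto_sf(x):
    return (ALPHA / (x + ALPHA)) ** BETA

def exponential_sf(x):
    return math.exp(-RATE * x)

for x in (2, 10, 20, 30, 40, 60, 80, 100, 120, 140, 160, 180):
    p, e = pareto_sf(x), exponential_sf(x)
    # columns: x, Pareto survival, exponential survival, ratio
    print(f"{x:>4} {p:.9f} {e:.2e} {p / e:.3g}")
```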

Since the Pareto survival function and the exponential survival function have closed forms, we can also look at their ratio.

\displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\alpha^\beta}{(x+\alpha)^\beta}}{e^{-\lambda x}}=\frac{\alpha^\beta e^{\lambda x}}{(x+\alpha)^\beta} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in x. This ratio goes to infinity as x \rightarrow \infty.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator. We have the same conclusion in comparing the Pareto distribution and the Gamma distribution, that the Pareto is heavier in the tails. In comparing the tail weight, it is equivalent to consider the ratio of density functions (by L’Hopital’s rule).

\displaystyle \lim_{x \rightarrow \infty} \frac{S_1(x)}{S_2(x)}=\lim_{x \rightarrow \infty} \frac{S_1^{'}(x)}{S_2^{'}(x)}=\lim_{x \rightarrow \infty} \frac{f_1(x)}{f_2(x)}

The Hazard Rate Function
The hazard rate function h_X(x) of a random variable X is defined as the ratio of the density function and the survival function.

\displaystyle h_X(x)=\frac{f_X(x)}{S_X(x)}

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged x will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for x units of time. The following is the hazard rate function of the Pareto distribution.

\displaystyle \begin{aligned}h_X(x)&=\frac{f_X(x)}{S_X(x)} \\&=\frac{\beta}{x+\alpha}  \end{aligned}

The interesting point is that the Pareto hazard rate function is a decreasing function in x. Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. One key characteristic of the hazard rate function is that it can generate the survival function.

\displaystyle S_X(x)=e^{\displaystyle -\int_0^x h_X(t) \ dt}

Thus if the hazard rate function is decreasing in x, then the survival function will decay more slowly to zero. To see this, let H_X(x)=\int_0^x h_X(t) \ dt, which is called the cumulative hazard rate function. As indicated above, the survival function can be generated by e^{-H_X(x)}. If h_X(x) is decreasing in x, H_X(x) is smaller than H_Y(x) where h_Y(x) is constant in x or increasing in x. Consequently e^{-H_X(x)} is decaying to zero much more slowly than e^{-H_Y(x)}.
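
As a worked illustration of this point, the cumulative hazard rate function of the Pareto distribution can be computed from h_X(x)=\frac{\beta}{x+\alpha} and compared with that of an exponential distribution with constant hazard rate \lambda (the symbol \lambda is used here only for this comparison):

\displaystyle \begin{aligned}H_X(x)&=\int_0^x \frac{\beta}{t+\alpha} \ dt=\beta \ \ln \biggl(\frac{x+\alpha}{\alpha}\biggr) \\&\text{ } \\S_X(x)&=e^{-H_X(x)}=\biggl(\frac{\alpha}{x+\alpha}\biggr)^\beta \\&\text{ } \\H_Y(x)&=\int_0^x \lambda \ dt=\lambda \ x \end{aligned}

The Pareto cumulative hazard grows only logarithmically in x while the exponential cumulative hazard grows linearly, so e^{-H_X(x)} decays like a power of x and e^{-H_Y(x)} decays exponentially.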

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The Mean Excess Loss Function
Suppose that a property owner is exposed to a random loss Y. The property owner buys an insurance policy with a deductible d such that the insurer will pay a claim in the amount of Y-d if a loss occurs with Y>d. The insurer will pay nothing if the loss is below the deductible. Whenever a loss is above d, what is the average claim the insurer will have to pay? This is one way to look at the mean excess loss function, which represents the expected excess loss over a threshold conditional on the event that the threshold has been exceeded.

Given a loss variable Y and given a deductible d>0, the mean excess loss function is e_Y(d)=E(Y-d \lvert Y>d). For a continuous random variable, it is computed by

\displaystyle e_Y(d)=\frac{\int_d^{\infty} (y-d) \ f_Y(y) \ dy}{S_Y(d)}

Applying the technique of integration by parts produces the following formula:

\displaystyle e_Y(d)=\frac{\int_d^{\infty} S_Y(y) \ dy}{S_Y(d)}
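
The integration by parts step can be sketched as follows, using f_Y(y)=-S_Y^{'}(y) and assuming that (y-d) \ S_Y(y) \rightarrow 0 as y \rightarrow \infty (which is the case when the mean E(Y) is finite):

\displaystyle \begin{aligned}\int_d^{\infty} (y-d) \ f_Y(y) \ dy&=\biggl[-(y-d) \ S_Y(y)\biggr]_d^{\infty}+\int_d^{\infty} S_Y(y) \ dy \\&=\int_d^{\infty} S_Y(y) \ dy \end{aligned}

The boundary term vanishes at y=d because of the factor y-d, and at infinity because of the assumption above. Dividing both sides by S_Y(d) gives the formula.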

It turns out that the mean excess loss function is one more way to examine the tail property of a distribution. The following is the mean excess loss function of the Pareto distribution:

\displaystyle e_X(d)=\frac{d+\alpha}{\beta-1}=\frac{1}{\beta-1} \ d + \frac{\alpha}{\beta-1}

Note that the Pareto mean excess loss function is a linear increasing function of the deductible d. This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If a random loss is modeled by such a distribution, it is a catastrophic risk situation. In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.

——————————————————————————————————————
The Pareto distribution has many economic applications. Since it is a heavy tailed distribution, it is a good candidate for modeling income above a theoretical value and the distribution of insurance claims above a threshold value.

——————————————————————————————————————

Reference

  1. Klugman S. A., Panjer H. H., Willmot G. E., Loss Models: From Data to Decisions, Second Edition, Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004