# The Negative Binomial Distribution

A counting distribution is a discrete distribution with probabilities only on the nonnegative integers. Such distributions are important in insurance applications since they can be used to model the number of events such as losses to the insured or claims to the insurer. Though playing a prominent role in statistical theory, the Poisson distribution is not appropriate in all situations, since it requires that the mean and the variance are equaled. Thus the negative binomial distribution is an excellent alternative to the Poisson distribution, especially in the cases where the observed variance is greater than the observed mean.

The negative binomial distribution arises naturally from a probability experiment of performing a series of independent Bernoulli trials until the occurrence of the rth success where r is a positive integer. From this starting point, we discuss three ways to define the distribution. We then discuss several basic properties of the negative binomial distribution. Emphasis is placed on the close connection between the Poisson distribution and the negative binomial distribution.

________________________________________________________________________

Definitions
We define three versions of the negative binomial distribution. The first two versions arise from the view point of performing a series of independent Bernoulli trials until the rth success where r is a positive integer. A Bernoulli trial is a probability experiment whose outcome is random such that there are two possible outcomes (success or failure).

Let $X_1$ be the number of Bernoulli trials required for the rth success to occur where r is a positive integer. Let $p$ is the probability of success in each trial. The following is the probability function of $X_1$:

$\displaystyle (1) \ \ \ \ \ P(X_1=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots$

The idea for $(1)$ is that for $X_1=x$ to happen, there must be $r-1$ successes in the first $x-1$ trials and one additional success occurring in the last trial (the $x$th trial).

A more common version of the negative binomial distribution is the number of Bernoulli trials in excess of r in order to produce the rth success. In other words, we consider the number of failures before the occurrence of the rth success. Let $X_2$ be this random variable. The following is the probability function of $X_2$:

$\displaystyle (2) \ \ \ \ \ P(X_2=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots$

The idea for $(2)$ is that there are $x+r$ trials and in the first $x+r-1$ trials, there are $x$ failures (or equivalently $r-1$ successes).

In both $(1)$ and $(2)$, the binomial coefficient is defined by

$\displaystyle (3) \ \ \ \ \ \binom{y}{k}=\frac{y!}{k! \ (y-k)!}=\frac{y(y-1) \cdots (y-(k-1))}{k!}$

where $y$ is a positive integer and $k$ is a nonnegative integer. However, the right-hand-side of $(3)$ can be calculated even if $y$ is not a positive integer. Thus the binomial coefficient $\displaystyle \binom{y}{k}$ can be expanded to work for all real number $y$. However $k$ must still be nonnegative integer.

$\displaystyle (4) \ \ \ \ \ \binom{y}{k}=\frac{y(y-1) \cdots (y-(k-1))}{k!}$

For convenience, we let $\displaystyle \binom{y}{0}=1$. When the real number $y>k-1$, the binomial coefficient in $(4)$ can be expressed as:

$\displaystyle (5) \ \ \ \ \ \binom{y}{k}=\frac{\Gamma(y+1)}{\Gamma(k+1) \Gamma(y-k+1)}$

where $\Gamma(\cdot)$ is the gamma function.

With the more relaxed notion of binomial coefficient, the probability function in $(2)$ above can be defined for all real number r. Thus the general version of the negative binomial distribution has two parameters r and $p$, both real numbers, such that $0. The following is its probability function.

$\displaystyle (6) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots$

Whenever r in $(6)$ is a real number that is not a positive integer, the interpretation of counting the number of failures until the occurrence of the rth success is no longer important. Instead we can think of it simply as a count distribution.

The following alternative parametrization of the negative binomial distribution is also useful.

$\displaystyle (6a) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} \biggl(\frac{\alpha}{\alpha+1}\biggr)^r \biggl(\frac{1}{\alpha+1}\biggr)^x \ \ \ \ \ \ \ x=0,1,2,\cdots$

The parameters in this alternative parametrization are r and $\alpha>0$. Clearly, the ratio $\frac{\alpha}{\alpha+1}$ takes the place of $p$ in $(6)$. Unless stated otherwise, we use the parametrization of $(6)$.
________________________________________________________________________

What is negative about the negative binomial distribution?
What is negative about this distribution? What is binomial about this distribution? The name is suggested by the fact that the binomial coefficient in $(6)$ can be rearranged as follows:

\displaystyle \begin{aligned}(7) \ \ \ \ \ \binom{x+r-1}{x}&=\frac{(x+r-1)(x+r-2) \cdots r}{x!} \\&=(-1)^x \frac{(-r-(x-1))(-r-(x-2)) \cdots (-r)}{x!} \\&=(-1)^x \frac{(-r)(-r-1) \cdots (-r-(x-1))}{x!} \\&=(-1)^x \binom{-r}{x} \end{aligned}

The calculation in $(7)$ can be used to verify that $(6)$ is indeed a probability function, that is, all the probabilities sum to 1.

\displaystyle \begin{aligned}(8) \ \ \ \ \ 1&=p^r p^{-r}\\&=p^r (1-q)^{-r} \\&=p^r \sum \limits_{x=0}^\infty \binom{-r}{x} (-q)^x \ \ \ \ \ \ \ \ (8.1) \\&=p^r \sum \limits_{x=0}^\infty (-1)^x \binom{-r}{x} q^x \\&=\sum \limits_{x=0}^\infty \binom{x+r-1}{x} p^r q^x \end{aligned}

In $(8)$, we take $q=1-p$. The step $(8.1)$ above uses the following formula known as the Newton’s binomial formula.

$\displaystyle (9) \ \ \ \ \ (1+t)^w=\sum \limits_{k=0}^\infty \binom{w}{k} t^k$

________________________________________________________________________

The Generating Function
By definition, the following is the generating function of the negative binomial distribution, using :

$\displaystyle (10) \ \ \ \ \ g(z)=\sum \limits_{x=0}^\infty \binom{r+x-1}{x} p^r q^x z^x$

where $q=1-p$. Using a similar calculation as in $(8)$, the generating function can be simplified as:

$\displaystyle (11) \ \ \ \ \ g(z)=p^r (1-q z)^{-r}=\frac{p^r}{(1-q z)^r}=\frac{p^r}{(1-(1-p) z)^r}; \ \ \ \ \ z<\frac{1}{1-p}$

As a result, the moment generating function of the negative binomial distribution is:

$\displaystyle (12) \ \ \ \ \ M(t)=\frac{p^r}{(1-(1-p) e^t)^r}; \ \ \ \ \ \ \ t<-ln(1-p)$

________________________________________________________________________

Independent Sum

One useful property of the negative binomial distribution is that the independent sum of negative binomial random variables, all with the same parameter $p$, also has a negative binomial distribution. Let $Y=Y_1+Y_2+\cdots+Y_n$ be an independent sum such that each $X_i$ has a negative binomial distribution with parameters $r_i$ and $p$. Then the sum $Y=Y_1+Y_2+\cdots+Y_n$ has a negative binomial distribution with parameters $r=r_1+\cdots+r_n$ and $p$.

Note that the generating function of an independent sum is the product of the individual generating functions. The following shows that the product of the individual generating functions is of the same form as $(11)$, thus proving the above assertion.

$\displaystyle (13) \ \ \ \ \ h(z)=\frac{p^{\sum \limits_{i=1}^n r_i}}{(1-(1-p) z)^{\sum \limits_{i=1}^n r_i}}$
________________________________________________________________________

Mean and Variance
The mean and variance can be obtained from the generating function. From $E(X)=g'(1)$ and $E(X^2)=g'(1)+g^{(2)}(1)$, we have:

$\displaystyle (14) \ \ \ \ \ E(X)=\frac{r(1-p)}{p} \ \ \ \ \ \ \ \ \ \ \ \ \ Var(X)=\frac{r(1-p)}{p^2}$

Note that $Var(X)=\frac{1}{p} E(X)>E(X)$. Thus when the sample data suggest that the variance is greater than the mean, the negative binomial distribution is an excellent alternative to the Poisson distribution. For example, suppose that the sample mean and the sample variance are 3.6 and 7.1. In exploring the possibility of fitting the data using the negative binomial distribution, we would be interested in the negative binomial distribution with this mean and variance. Then plugging these into $(14)$ produces the negative binomial distribution with $r=3.7$ and $p=0.507$.
________________________________________________________________________

The Poisson-Gamma Mixture
One important application of the negative binomial distribution is that it is a mixture of a family of Poisson distributions with Gamma mixing weights. Thus the negative binomial distribution can be viewed as a generalization of the Poisson distribution. The negative binomial distribution can be viewed as a Poisson distribution where the Poisson parameter is itself a random variable, distributed according to a Gamma distribution. Thus the negative binomial distribution is known as a Poisson-Gamma mixture.

In an insurance application, the negative binomial distribution can be used as a model for claim frequency when the risks are not homogeneous. Let $N$ has a Poisson distribution with parameter $\theta$, which can be interpreted as the number of claims in a fixed period of time from an insured in a large pool of insureds. There is uncertainty in the parameter $\theta$, reflecting the risk characteristic of the insured. Some insureds are poor risks (with large $\theta$) and some are good risks (with small $\theta$). Thus the parameter $\theta$ should be regarded as a random variable $\Theta$. The following is the conditional distribution of $N$ (conditional on $\Theta=\theta$):

$\displaystyle (15) \ \ \ \ \ P(N=n \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots$

Suppose that $\Theta$ has a Gamma distribution with scale parameter $\alpha$ and shape parameter $\beta$. The following is the probability density function of $\Theta$.

$\displaystyle (16) \ \ \ \ \ g(\theta)=\frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta} \ \ \ \ \ \ \ \ \ \ \theta>0$

Then the joint density of $N$ and $\Theta$ is:

$\displaystyle (17) \ \ \ \ \ P(N=n \lvert \Theta=\theta) \ g(\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta}$

The unconditional distribution of $N$ is obtained by summing out $\theta$ in $(17)$.

\displaystyle \begin{aligned}(18) \ \ \ \ \ P(N=n)&=\int_0^\infty P(N=n \lvert \Theta=\theta) \ g(\theta) \ d \theta \\&=\int_0^\infty \frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta} \ d \theta \\&=\int_0^\infty \frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \int_0^\infty \frac{(\alpha+1)^{n+\beta}}{\Gamma(n+\beta)} \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \\&=\frac{\Gamma(n+\beta)}{\Gamma(n+1) \ \Gamma(\beta)} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \\&=\binom{n+\beta-1}{n} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots \end{aligned}

Note that the integral in the fourth step in $(18)$ is 1.0 since the integrand is the pdf of a Gamma distribution. The above probability function is that of a negative binomial distribution. It is of the same form as $(6a)$. Equivalently, it is also of the form $(6)$ with parameter $r=\beta$ and $p=\frac{\alpha}{\alpha+1}$.

The variance of the negative binomial distribution is greater than the mean. In a Poisson distribution, the mean equals the variance. Thus the unconditional distribution of $N$ is more dispersed than its conditional distributions. This is a characteristic of mixture distributions. The uncertainty in the parameter variable $\Theta$ has the effect of increasing the unconditional variance of the mixture distribution of $N$. The variance of a mixture distribution has two components, the weighted average of the conditional variances and the variance of the conditional means. The second component represents the additional variance introduced by the uncertainty in the parameter $\Theta$ (see The variance of a mixture).

________________________________________________________________________

The Poisson Distribution as Limit of Negative Binomial
There is another connection to the Poisson distribution, that is, the Poisson distribution is a limiting case of the negative binomial distribution. We show that the generating function of the Poisson distribution can be obtained by taking the limit of the negative binomial generating function as $r \rightarrow \infty$. Interestingly, the Poisson distribution is also the limit of the binomial distribution.

In this section, we use the negative binomial parametrization of $(6a)$. By replacing $\frac{\alpha}{\alpha+1}$ for $p$, the following are the mean, variance, and the generating function for the probability function in $(6a)$:

\displaystyle \begin{aligned}(19) \ \ \ \ \ \ &E(X)=\frac{r}{\alpha} \\&\text{ }\\&Var(X)=\frac{\alpha+1}{\alpha} \ \frac{r}{\alpha}=\frac{r(\alpha+1)}{\alpha^2} \\&\text{ } \\&g(z)=\frac{1}{[1-\frac{1}{\alpha}(z-1)]^r} \ \ \ \ \ \ \ z<\alpha+1 \end{aligned}

Let r goes to infinity and $\displaystyle \frac{1}{\alpha}$ goes to zero and at the same time keeping their product constant. Thus $\displaystyle \mu=\frac{r}{\alpha}$ is constant (this is the mean of the negative binomial distribution). We show the following:

$\displaystyle (20) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}=e^{\mu (z-1)}$

The right-hand side of $(20)$ is the generating function of the Poisson distribution with mean $\mu$. The generating function in the left-hand side is that of a negative binomial distribution with mean $\displaystyle \mu=\frac{r}{\alpha}$. The following is the derivation of $(20)$.

\displaystyle \begin{aligned}(21) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(ln[1-\frac{\mu}{r}(z-1)]^{-r}\biggr)} \\&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(-r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \\&=e^{\displaystyle \biggl(\lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \end{aligned}

We now focus on the limit in the exponent.

\displaystyle \begin{aligned}(22) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]&=\lim \limits_{r \rightarrow \infty} \frac{ln(1-\frac{\mu}{r} (z-1))^{-1}}{r^{-1}} \\&=\lim \limits_{r \rightarrow \infty} \frac{(1-\frac{\mu}{r} (z-1)) \ \mu (z-1) r^{-2}}{r^{-2}} \\&=\mu (z-1) \end{aligned}

The middle step in $(22)$ uses the L’Hopital’s Rule. The result in $(20)$ is obtained by combining $(21)$ and $(22)$.

________________________________________________________________________

Reference

1. Klugman S.A., Panjer H. H., Wilmot G. E. Loss Models, From Data to Decisions, Second Edition., Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

# The mean excess loss function

We take a closer look at the mean excess loss function. We start with the following example:

Example
Suppose that an entity is exposed to a random loss $X$. An insurance policy offers protection against this loss. Under this policy, payment is made to the insured entity subject to a deductible $d>0$, i.e. when a loss is less than $d$, no payment is made to the insured entity, and when the loss exceeds $d$, the insured entity is reimbursed for the amount of the loss in excess of the deductible $d$. Consider the following two questions:

$\text{ }$

1. Of all the losses that are eligible to be reimbursed by the insurer, what is the average payment made by the insurer to the insured?
2. What is the average payment made by the insurer to the insured entity?

$\text{ }$

The two questions look similar. The difference between the two questions is subtle but important. In the first question, the average is computed over all losses that are eligible for reimbursement (i.e., the loss exceeds the deductible). This is the average amount the insurer is expected to pay in the event that a payment in excess of the deductible is required to be made. So this average is a per payment average.

In the second question, the average is calculated over all losses (regardless of sizes). When the loss does not reach the deductible, the payment is considered zero and when the loss is in excess of the deductible, the payment is $X-d$. Thus the average is the average amount the insurer has to pay per loss. So the second question is about a per loss average.

—————————————————————————————————————-

The Mean Excess Loss Function
The average in the first question is called the mean excess loss function. Suppose $X$ is the random loss and $d>0$. The mean excess loss variable is the conditional variable $X-d \ \lvert X>d$ and the mean excess loss function $e_X(d)$ is defined by:

$\text{ }$

$\displaystyle (1) \ \ \ \ \ e_X(d)=E(X-d \lvert X>d)$

$\text{ }$

In an insurance context, the mean excess loss function is the average payment in excess of a threshold given that the loss exceeds the threshold. In a mortality context, the mean excess loss function is called the mean residual life function and complete expectation of life and can be interpreted as the remaining time until death given that the life in question is alive at age $d$.

The mean excess loss function is computed by the following depending on whether the loss variable is continuous or discrete.

$\text{ }$

$\displaystyle (2) \ \ \ \ \ e_X(d)=\frac{\int_d^\infty (x-d) \ f_X(x) \ dx}{S_X(d)}$

$\displaystyle (3) \ \ \ \ \ e_X(d)=\frac{\sum \limits_{x>d} (x-d) \ P(X=x)}{S_X(d)}$

$\text{ }$

The mean excess loss function $e_X(d)$ is defined only when the integral or the sum converges. The following is an equivalent calculation of $e_X(d)$ that may be easier to use in some circumstances.

$\text{ }$

$\displaystyle (4) \ \ \ \ \ e_X(d)=\frac{\int_d^\infty S_X(x) \ dx}{S_X(d)}$

$\displaystyle (5a) \ \ \ \ \ e_X(d)=\frac{\sum \limits_{x \ge d} S_X(x) }{S_X(d)}$

$\displaystyle (5b) \ \ \ \ \ e_X(d)=\frac{\biggl(\sum \limits_{x>d} S_X(x)\biggr)+(w+1-d) S_X(w) }{S_X(d)}$

$\text{ }$

In both $(5a)$ and $(5b)$, we assume that the support of $X$ is the set of nonnegative integers. In $(5a)$, we assume that the deductible $d$ is a positive integer. In $(5b)$, the deductible $d$ is free to be any positive number and $w$ is the largest integer such that $w \le d$. The formulation $(4)$ is obtained by using integration by parts (also see theorem 3.1 in [1]). The formulations of $(5a)$ and $(5b)$ are a result of applying theorem 3.2 in [1].

The mean excess loss function provides information about the tail weight of a distribution, see the previous post The Pareto distribution. Also see Example 3 below.
—————————————————————————————————————-
The Mean in Question 2
The average that we need to compute is the mean of the following random variable. Note that $(a)_+$ is the function that assigns the value of $a$ whenever $a>0$ and otherwise assigns the value of zero.

$\text{ }$

$\displaystyle (6) \ \ \ \ \ (X-d)_+=\left\{\begin{matrix}0&\ X

$\text{ }$

The mean $E((X-d)_+)$ is calculated over all losses. When the loss is less than the deductible $d$, the insurer has no obligation to make a payment to the insured and the payment is assumed to be zero in the calculation of $E[(X-d)_+]$. The following is how this expected value is calculated depending on whether the loss $X$ is continuous or discrete.

$\text{ }$

$\displaystyle (7) \ \ \ \ \ E((X-d)_+)=\int_d^\infty (x-d) \ f_X(x) \ dx$

$\displaystyle (8) \ \ \ \ \ E((X-d)_+)=\sum \limits_{x>d} (x-d) \ P(X=x)$

$\text{ }$

Based on the definitions, the following is how the two averages are related.

$\displaystyle E[(X-d)_+]=e_X(d) \ [1-F_X(d)] \ \ \ \text{or} \ \ \ E[(X-d)_+]=e_X(d) \ S_X(d)$

—————————————————————————————————————-
The Limited Expected Value
For a given positive constant $u$, the limited loss variable is defined by

$\text{ }$

$\displaystyle (9) \ \ \ \ \ X \wedge u=\left\{\begin{matrix}X&\ X

$\text{ }$

The expected value $E(X \wedge u)$ is called the limited expected value. In an insurance application, the $u$ is a policy limit that sets a maximum on the benefit to be paid. The following is how the limited expected value is calculated depending on whether the loss $X$ is continuous or discrete.

$\text{ }$

$\displaystyle (9) \ \ \ \ \ E(X \wedge u)=\int_{-\infty}^u x \ f_X(x) \ dx+u \ S_X(u)$

$\displaystyle (10) \ \ \ \ \ E(X \wedge u)=\biggl(\sum \limits_{x < u} x \ P(X=x)\biggr)+u \ S_X(u)$

$\text{ }$

Interestingly, we have the following relation.

$\text{ }$

$\displaystyle (X-d)_+ + (X \wedge d)=X \ \ \ \ \ \ \text{and} \ \ \ \ \ \ E[(X-d)_+] + E(X \wedge d)=E(X)$

$\text{ }$

The above statement indicates that purchasing a policy with a deductible $d$ and another policy with a policy maximum $d$ is equivalent to buying full coverage.

Another way to interpret $X \wedge d$ is that it is the amount of loss that is eliminated by having a deductible in the insurance policy. If the insurance policy pays the loss in full, then the insurance payment is $X$ and the expected amount the insurer is expected to pay is $E(X)$. By having a deductible provision in the policy, the insurer is now only liable for the amount $(X-d)_+$ and the amount the insurer is expected to pay per loss is $E[(X-d)_+]$. Consequently $E(X \wedge d)$ is the expected amount of the loss that is eliminated by the deductible provision in the policy. The following summarizes this observation.

$\text{ }$

$\displaystyle (X \wedge d)=X-(X-d)_+ \ \ \ \ \ \ \text{and} \ \ \ \ \ \ E(X \wedge d)=E(X)-E[(X-d)_+]$

—————————————————————————————————————-
Example 1
Let the loss random variable $X$ be exponential with pdf $f(x)=\alpha e^{-\alpha x}$. We have $E(X)=\frac{1}{\alpha}$. Because of the no memory property of the exponential distribution, given that a loss exceeds the deductible, the mean payment is the same as the original mean. Thus $e_X(d)=\frac{1}{\alpha}$. Then the per loss average is:

$\displaystyle E[(X-d)_+]=e_X(d) \ S(d) = \frac{e^{-\alpha d}}{\alpha}$

$\text{ }$

Thus, with a deductible provision in the policy, the insurer is expected to pay $\displaystyle \frac{e^{-\alpha d}}{\alpha}$ per loss instead of $\displaystyle \frac{1}{\alpha}$. Thus the expected amount of loss eliminated (from the insurer’s point of view) is $\displaystyle E(X \wedge d)=\frac{1-e^{-\alpha d}}{\alpha}$.

$\text{ }$

Example 2
Suppose that the loss variable has a Gamma distribution where the scale parameter is $\alpha$ and the shape parameter is $n=2$. The pdf is $\displaystyle g(x)=\alpha^2 \ x \ e^{-\alpha x}$. The insurer’s expected payment without the deductible is $E(X)=\frac{2}{\alpha}$. The survival function $S(x)$ is:

$\displaystyle S(x)=e^{-\alpha x}(1+\alpha x)$

$\text{ }$

For the losses that exceed the deductible, the insurer’s expected payment is:

\displaystyle \begin{aligned}e_X(d)&=\frac{\int_d^{\infty} S(x) \ dx}{S(d)}\\&=\frac{\int_d^{\infty} e^{-\alpha x}(1+\alpha x) \ dx}{e^{-\alpha d}(1+\alpha d)} \\&=\frac{\frac{e^{-\alpha d}}{\alpha}+d e^{-\alpha d}+\frac{e^{-\alpha d}}{\alpha}}{e^{-\alpha d}(1+\alpha d)} \\&=\frac{\frac{2}{\alpha}+d}{1+\alpha d} \end{aligned}

$\text{ }$

Then the insurer’s expected payment per loss is $E[(X-d)_+]$:

\displaystyle \begin{aligned}E[(X-d)_+]&=e_X(d) \ S(d) \\&=\frac{\frac{2}{\alpha}+d}{1+\alpha d} \ \ e^{-\alpha d}(1+\alpha d) \\&=e^{-\alpha d} \ \biggl(\frac{2}{\alpha}+d\biggr) \end{aligned}

$\text{ }$

With a deductible in the policy, the following is the expected amount of loss eliminated (from the insurer’s point of view).

\displaystyle \begin{aligned}E[X \wedge d]&=E(X)-E[(X-d)_+] \\&=\frac{2}{\alpha}-e^{-\alpha d} \ \biggl(\frac{2}{\alpha}+d\biggr) \\&=\frac{2}{\alpha}\biggl(1-e^{-\alpha d}\biggr)-d e^{-\alpha d} \end{aligned}

$\text{ }$

Example 3
Suppose the loss variable $X$ has a Pareto distribution with the following pdf:

$\text{ }$

$\displaystyle f_X(x)=\frac{\beta \ \alpha^\beta}{(x+\alpha)^{\beta+1}} \ \ \ \ \ x>0$

$\text{ }$

If the insurance policy is to pay the full loss, then the insurer’s expected payment per loss is $\displaystyle E(X)=\frac{\alpha}{\beta-1}$ provided that the shape parameter $\beta$ is larger than one.

The mean excess loss function of the Pareto distribution has a linear form that is increasing (see the previous post The Pareto distribution). The following is the mean excess loss function:

$\text{ }$

$\displaystyle e_X(d)=\frac{1}{\beta-1} \ d +\frac{\alpha}{\beta-1}=\frac{1}{\beta-1} \ d +E(X)$

$\text{ }$

If the loss is modeled by such a distribution, this is an uninsurable risk! First of all, the higher the deductible, the larger the expected payment if such a large loss occurs. The expected payment for large losses is always the unmodified expected $E(X)$ plus a component that is increasing in $d$.

The increasing mean excess loss function is an indication that the Pareto distribution is a heavy tailed distribution. In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.

Reference

1. Bowers N. L., Gerber H. U., Hickman J. C., Jones D. A., Nesbit C. J. Actuarial Mathematics, First Edition., The Society of Actuaries, Itasca, Illinois, 1986
2. Klugman S.A., Panjer H. H., Wilmot G. E. Loss Models, From Data to Decisions, Second Edition., Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

# The Pareto distribution

This post takes a closer look at the Pareto distribution. A previous post demonstrates that the Pareto distribution is a mixture of exponential distributions with Gamma mixing weights. We now elaborate more on this point. Through looking at various properties of the Pareto distribution, we also demonstrate that the Pareto distribution is a heavy tailed distribution. In insurance applications, heavy-tailed distributions are essential tools for modeling extreme loss, especially for the more risky types of insurance such as medical malpractice insurance. In financial applications, the study of heavy-tailed distributions provides information about the potential for financial fiasco or financial ruin. The Pareto distribution is a great way to open up a discussion on heavy-tailed distribution.

$\text{ }$

Update (11/12/2017). This blog post introduces a catalog of many other parametric severity models in addition to Pareto distribution. The link to the catalog is found in that blog post. To go there directly, this is the link.

Update (10/29/2017). This blog post has updated information on Pareto distribution. It also has links to more detailed contents on Pareto distribution in two companion blogs. These links are also given here: more detailed post on Pareto, Pareto Type I and Type II and practice problems on Pareto.

$\text{ }$

The continuous random variable $X$ with positive support is said to have the Pareto distribution if its probability density function is given by

$\displaystyle f_X(x)=\frac{\beta \ \alpha^\beta}{(x+\alpha)^{\beta+1}} \ \ \ \ \ x>0$

where $\alpha>0$ and $\beta>0$ are constant. The constant $\alpha$ is the scale parameter and $\beta$ is the shape parameter. The following lists several other distributional quantities of the Pareto distribution, which will be used in the discussion below.

$\displaystyle S_X(x)=\frac{\alpha^\beta}{(x+\alpha)^\beta}=\biggl(\frac{\alpha}{x+\alpha}\biggr)^\beta \ \ \ \ \ \ \ \ \ \text{survival function}$

$\displaystyle F_X(x)=1-\biggl(\frac{\alpha}{x+\alpha}\biggr)^\beta \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{distribution function}$

$\displaystyle E(X)=\frac{\alpha}{\beta-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{mean},\beta>1$

$\displaystyle E(X^2)=\frac{2 \alpha^2}{(\beta-1)(\beta-2)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{second momemt},\beta>2$

$\displaystyle Var(X)=\frac{\alpha^2 \beta}{(\beta-1)^2(\beta-2)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{variance},\beta>2$

$\displaystyle E(X^k)=\frac{k! \alpha^k}{(\beta-1) \cdots (\beta-k)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{higher moments},\beta>k, \text{ k positive integer}$

——————————————————————————————————————
The Pareto Distribution as a Mixture
The Pareto pdf indicated above can be obtained by mixing exponential distributions using Gamma distributions as weights. Suppose that $X$ follows an exponential distribution (conditional on a parameter value $\theta$). The following is the conditional pdf of $X$.

$\displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\theta e^{-\theta x} \ \ \ x>0$

There is uncertainty in the parameter, which can be viewed as a random variable $\Theta$. Suppose that $\Theta$ follows a Gamma distribution with scale parameter $\alpha$ and shape parameter $\beta$. The following is the pdf of $\Theta$.

$\displaystyle f_{\Theta}(\theta)=\frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta} \ \ \ \theta>0$

The unconditional pdf of $X$ is the weighted average of the conditional pdfs with the Gamma pdf as weight.

\displaystyle \begin{aligned}f_X(x)&=\int_0^{\infty} f_{X \lvert \Theta}(x \lvert \theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\theta \ e^{-\theta x}\biggr] \ \biggl[\frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^\beta \ e^{-\theta(x+\alpha)} \ d \theta \\&=\frac{\alpha^\beta}{\Gamma(\beta)} \frac{\Gamma(\beta+1)}{(x+\alpha)^{\beta+1}} \int_0^{\infty} \frac{(x+\alpha)^{\beta+1}}{\Gamma(\beta+1)} \ \theta^{\beta+1-1} \ e^{-\theta(x+\alpha)} \ d \theta \\&=\frac{\beta \ \alpha^\beta}{(x+\alpha)^{\beta+1}} \end{aligned}

In the following discussion, $X$ will denote the Pareto distribution as defined above. As will be shown below, the exponential distribution is considered a light tailed distribution. Yet mixing exponentials produces the heavy tailed Pareto distribution. Mixture distributions tend to heavy tailed (see [1]). The Pareto distribution is a handy example.

——————————————————————————————————————

The Tail Weight of the Pareto Distribution
When a distribution significantly puts more probability on larger values, the distribution is said to be a heavy tailed distribution (or said to have a larger tail weight). According to [1], there are four ways to look for indication that a distribution is heavy tailed.

1. Existence of moments.
2. Speed of decay of the survival function to zero.
3. Hazard rate function.
4. Mean excess loss function.

Existence of moments
Note that the existence of the Pareto higher moments $E(X^k)$ is capped by the shape parameter $\beta$. In particular, the mean $E(X)=\frac{\alpha}{\beta-1}$ does not exist for $\beta \le 1$. If the Pareto distribution is to model a random loss, and if the mean is infinite (when $\beta=1$), the risk is uninsurable! On the other hand, when $\beta=2$, the Pareto variance does not exist. This shows that for a heavy tailed distribution, the variance may not be a good measure of risk.

For a given random variable $Z$, the existence of all moments $E(Z^k)$, for all positive integers $k$, indicates with a light (right) tail for the distribution of $Z$. The existence of positive moments exists only up to a certain value of a positive integer $k$ is an indication that the distribution has a heavy right tail. In contrast, the exponential distribution and the Gamma distribution are considered to have light tails since all moments exist.

The speed of decay of the survival function
The survival function $S_X(x)=P(X>x)$ captures the probability of the tail of a distribution. If a distribution whose survival function decays slowly to zero (equivalently the cdf goes slowly to one), it is another indication that the distribution is heavy tailed.

The following is a comparison of a Pareto survival function and an exponential survival function. The Pareto survival function has parameters ($\alpha=2$ and $\beta=2$). The two survival functions are set to have the same 75th percentile ($x=2$).

$\displaystyle \begin{pmatrix} \text{x}&\text{Pareto }S_X(x)&\text{Exponential }S_Y(x)&\displaystyle \frac{S_X(x)}{S_Y(x)} \\\text{ }&\text{ }&\text{ }&\text{ } \\{2}&0.25&0.25&1 \\{10}&0.027777778&0.000976563&28 \\{20}&0.008264463&9.54 \times 10^{-7}&8666 \\{30}&0.00390625&9.31 \times 10^{-10}&4194304 \\{40}&0.002267574&9.09 \times 10^{-13}&2.49 \times 10^{9} \\{60}&0.001040583&8.67 \times 10^{-19}&1.20 \times 10^{15} \\{80}&0.000594884&8.27 \times 10^{-25}&7.19 \times 10^{20} \\{100}&0.000384468&7.89 \times 10^{-31}&4.87 \times 10^{26} \\{120}&0.000268745&7.52 \times 10^{-37}&3.57 \times 10^{32} \\{140}&0.000198373&7.17 \times 10^{-43}&2.76 \times 10^{38} \\{160}&0.000152416&6.84 \times 10^{-49}&2.23 \times 10^{44} \\{180}&0.000120758&6.53 \times 10^{-55}&1.85 \times 10^{50} \end{pmatrix}$

Note that at the large values, the Pareto right tails retain much more probability. This is also confirmed by the ratio of the two survival functions, with the ratio approaching infinity. If a random loss is a heavy tailed phenomenon that is described by the above Pareto survival function ($\alpha=2$ and $\beta=2$), then the above exponential survival function is woefully inadequate as a model for this phenomenon even though it may be a good model for describing the loss up to the 75th percentile. It is the large right tail that is problematic (and catastrophic)!

Since the Pareto survival function and the exponential survival function have closed forms, We can also look at their ratio.

$\displaystyle \frac{\text{pareto survival}}{\text{exponential survival}}=\frac{\displaystyle \frac{\alpha^\beta}{(x+\alpha)^\beta}}{e^{-\lambda x}}=\frac{\alpha^\beta e^{\lambda x}}{(x+\alpha)^\beta} \longrightarrow \infty \ \text{ as } x \longrightarrow \infty$

In the above ratio, the numerator has an exponential function with a positive quantity in the exponent, while the denominator has a polynomial in $x$. This ratio goes to infinity as $x \rightarrow \infty$.

In general, whenever the ratio of two survival functions diverges to infinity, it is an indication that the distribution in the numerator of the ratio has a heavier tail. When the ratio goes to infinity, the survival function in the numerator is said to decay slowly to zero as compared to the denominator. We have the same conclusion in comparing the Pareto distribution and the Gamma distribution, that the Pareto is heavier in the tails. In comparing the tail weight, it is equivalent to consider the ratio of density functions (due to the L’Hopital’s rule).

$\displaystyle \lim_{x \rightarrow \infty} \frac{S_1(x)}{S_2(x)}=\lim_{x \rightarrow \infty} \frac{S_1^{'}(x)}{S_2^{'}(x)}=\lim_{x \rightarrow \infty} \frac{f_1(x)}{f_2(x)}$

The Hazard Rate Function
The hazard rate function $h_X(x)$ of a random variable $X$ is defined as the ratio of the density function and the survival function.

$\displaystyle h_X(x)=\frac{f_X(x)}{S_X(s)}$

The hazard rate is called the force of mortality in a life contingency context and can be interpreted as the rate that a person aged $x$ will die in the next instant. The hazard rate is called the failure rate in reliability theory and can be interpreted as the rate that a machine will fail at the next instant given that it has been functioning for $x$ units of time. The following is the hazard rate function of the Pareto distribution.

\displaystyle \begin{aligned}h_X(x)&=\frac{f_X(s)}{S_X(x)} \\&=\frac{\beta}{x+\alpha} \end{aligned}

The interesting point is that the Pareto hazard rate function is an decreasing function in $x$. Another indication of heavy tail weight is that the distribution has a decreasing hazard rate function. One key characteristic of hazard rate function is that it can generate the survival function.

$\displaystyle S_X(x)=e^{\displaystyle -\int_0^x h_X(t) \ dt}$

Thus if the hazard rate function is decreasing in $x$, then the survival function will decay more slowly to zero. To see this, let $H_X(x)=\int_0^x h_X(t) \ dt$, which is called the cumulative hazard rate function. As indicated above, the survival function can be generated by $e^{-H_X(x)}$. If $h_X(x)$ is decreasing in $x$, $H_X(x)$ is smaller than $H_Y(x)$ where $h_Y(x)$ is constant in $x$ or increasing in $x$. Consequently $e^{-H_X(x)}$ is decaying to zero much more slowly than $e^{-H_Y(x)}$.

In contrast, the exponential distribution has a constant hazard rate function, making it a medium tailed distribution. As explained above, any distribution having an increasing hazard rate function is a light tailed distribution.

The Mean Excess Loss Function
Suppose that a property owner is exposed to a random loss $Y$. The property owner buys an insurance policy with a deductible $d$ such that the insurer will pay a claim in the amount of $Y-d$ if a loss occurs with $Y>d$. The insuerer will pay nothing if the loss is below the deductible. Whenever a loss is above $d$, what is the average claim the insurer will have to pay? This is one way to look at mean excess loss function, which represents the expected excess loss over a threshold conditional on the event that the threshold has been exceeded.

Given a loss variable $Y$ and given a deductible $d>0$, the mean excess loss function is $e_Y(d)=E(Y-d \lvert X>d)$. For a continuous random variable, it is computed by

$\displaystyle e_Y(d)=\frac{\int_d^{\infty} (y-d) \ f_Y(y) \ dy}{S_Y(d)}$

Applying the technique of integration by parts produces the following formula:

$\displaystyle e_Y(d)=\frac{\int_d^{\infty} S_Y(y) \ dy}{S_Y(d)}$

It turns out that the mean excess loss function is one more way to examine the tail property of a distribution. The following is the mean excess loss function of the Pareto distribution:

$\displaystyle e_X(d)=\frac{d+\alpha}{\beta-1}=\frac{1}{\beta-1} \ d + \frac{\alpha}{\beta-1}$

Note that the Pareto mean excess loss function is a linear increasing function of the deductible $d$. This means that the larger the deductible, the larger the expected claim if such a large loss occurs! If a random loss is modeled by such a distribution, it is a catastrophic risk situation. In general, an increasing mean excess loss function is an indication of a heavy tailed distribution. On the other hand, a decreasing mean excess loss function indicates a light tailed distribution. The exponential distribution has a constant mean excess loss function and is considered a medium tailed distribution.

——————————————————————————————————————
The Pareto distribution has many economic applications. Since it is a heavy tailed distribution, it is a good candidate for modeling income above a theoretical value and the distribution of insurance claims above a threshold value.

——————————————————————————————————————

Reference

1. Klugman S.A., Panjer H. H., Wilmot G. E. Loss Models, From Data to Decisions, Second Edition., Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

# The variance of a mixture

Suppose $X$ is a mixture distribution that is the result of mixing a family of conditional distributions indexed by a parameter random variable $\Theta$. The uncertainty in the parameter variable $\Theta$ has the effect of increasing the unconditional variance of the mixture $X$. Thus, $Var(X)$ is not simply the weighted average of the conditional variance $Var(X \lvert \Theta)$. The unconditional variance $Var(X)$ is the sum of two components. They are:

$\displaystyle Var(X)=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)]$

The above relationship is called the law of total variance, which is the proper way of computing the unconditional variance $Var(X)$. The first component $E[Var(X \lvert \Theta)]$ is called the expected value of conditional variances, which is the weighted average of the conditional variances. The second component $Var[E(X \lvert \Theta)]$ is called the variance of the conditional means, which represents the additional variance as a result of the uncertainty in the parameter $\Theta$.

We use an example of a two-point mixture to illustrate the law of total variance. The example is followed by a proof of the total law of variance.

Example
Let $U$ be the uniform distribution on the unit interval $(0, 1)$. Suppose that a large population of insureds is composed of “high risk” and “low risk” individuals. The proportion of insured classified as “low risk” is $p$ where $0. The random loss amount $X$ of a “low risk” insured is $U$. The random loss amount $X$ of a “high risk” insured is $U$ shifted by a positive constant $w>0$, i.e. $w+U$. What is the variance of the loss amount of an insured randomly selected from this population?

For convenience, we use $\Theta$ as a parameter to indicate the risk class ($\Theta=1$ is “low risk” and $\Theta=2$ is “high risk”). The following shows the relevant conditional distributional quantities of $X$.

$\displaystyle E(X \lvert \Theta=1)=\frac{1}{2} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ E(X \lvert \Theta=2)=w+\frac{1}{2}$

$\displaystyle Var(X \lvert \Theta=1)=\frac{1}{12} \ \ \ \ \ \ \ \ \ \ \ \ Var(X \lvert \Theta=2)=\frac{1}{12}$

The unconditional mean loss is the weighted average of the conditional mean loss amounts. However, the same idea does not work for variance.

\displaystyle \begin{aligned}E[X]&=p \times E(X \lvert \Theta=1)+(1-p) \times E(X \lvert \Theta=2) \\&=p \times \frac{1}{2}+(1-p) \times (w+\frac{1}{2}) \\&=\frac{1}{2}+(1-p) \ w \end{aligned}

\displaystyle \begin{aligned}Var[X]&\ne p \times Var(X \lvert \Theta=1)+(1-p) \times Var(X \lvert \Theta=2)=\frac{1}{12} \end{aligned}

The conditional variance is the same for both risk classes since the “high risk” loss is a shifted distribution of the “low risk” loss. However, the unconditional variance is more than $\frac{1}{12}$ since the mean loss for the two casses are different (heterogeneous risks across the classes). The uncertainty in the risk classes (i.e. uncertainty in the parameter $\Theta$) introduces additional variance in the loss for a randomly selected insured. The unconditional variance $Var(X)$ is the sum of the following two components:

\displaystyle \begin{aligned}E[Var(X \lvert \Theta)]&=p \times Var(X \lvert \Theta=1)+(1-p) \times Var(X \lvert \Theta=2) \\&=p \times \frac{1}{12}+(1-p) \times \frac{1}{12} \\&=\frac{1}{12} \end{aligned}

\displaystyle \begin{aligned}Var[E(X \lvert \Theta)]&=p \times E(X \lvert \Theta=1)^2+(1-p) \times E(X \lvert \Theta=2)^2 \\&\ \ \ -\biggl(p \times E(X \lvert \Theta=1)+(1-p) \times E(X \lvert \Theta=2)\biggr)^2 \\&=p \times \biggl(\frac{1}{2}\biggr)^2+(1-p) \times \biggl(w+\frac{1}{2}\biggr)^2 \\&\ \ \ -\biggl(p \times \frac{1}{2}+(1-p) \times (w+\frac{1}{2})\biggr)^2 \\&=p \ (1-p) \ w^2 \end{aligned}

\displaystyle \begin{aligned}Var(X)&=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)] \\&=\frac{1}{12}+p \ (1-p) \ w^2 \end{aligned}

The additional variance is in the amount of $p(1-p)w^2$. This is the variance of the conditional means of the risk classes. Note that $w$ is the additional mean loss for a “high risk” insured. The higher the additional mean loss $w$, the more heterogeneous in risk between the two classes, hence the larger the dispersion in unconditional loss.

The total law of variance gives the unconditional variance of a random variable $X$ that is indexed by another random variable $\Theta$. The unconditional variance of $X$ is the sum of two components, namely, the expected value of conditional variances and the variance of the conditional means. The formula is:

$\displaystyle Var(X)=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)]$

The following is the derivation of the formula:

\displaystyle \begin{aligned}Var[X]&=E[X^2]-E[X]^2 \\&=E[E(X^2 \lvert \Theta)]-E[E(X \lvert \Theta)]^2 \\&=E\left\{Var(X \lvert \Theta)+E(X \lvert \Theta)^2 \right\}-E[E(X \lvert \Theta)]^2 \\&=E[Var(X \lvert \Theta)]+E[E(X \lvert \Theta)^2]-E[E(X \lvert \Theta)]^2 \\&=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)] \end{aligned}

See this blog post for practice problems on mixture distributions.

Reference

1. Klugman S.A., Panjer H. H., Wilmot G. E. Loss Models, From Data to Decisions, Second Edition., Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

# An example of a mixture

We use an example to motivate the definition of a mixture distribution.

Example 1

Suppose that the loss arising from an insured randomly selected from a large group of insureds follow an exponential distribution with probability density function (pdf) $f_X(x)=\theta e^{-\theta x}$, $x>0$, where $\theta$ is a parameter that is a positive constant. The mean claim cost for this randomly selected insured is $\frac{1}{\theta}$. So the parameter $\theta$ reflects the risk characteristics of the insured. Since the population of insureds is large, there is uncertainty in the parameter $\theta$. It is more appropriate to regard $\theta$ as a random variable in order to capture the wide range of risk characteristics across the individuals in the population. As a result, the pdf indicated above is not an unconditional pdf, but, rather, a conditional pdf of $X$. The below pdf is conditional on a realized value of the random variable $\Theta$.

$\displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\theta e^{-\theta x}, \ \ \ \ \ x>0$

What about the marginal (unconditional) pdf of $X$? Let’s assume that the pdf of $\Theta$ is given by $\displaystyle f_\Theta(\theta)=\frac{1}{2} \ \theta^2 \ e^{-\theta}$. Then the unconditional pdf of $X$ is the weighted average of the conditional pdf.

\displaystyle \begin{aligned}f_X(x)&=\int_0^{\infty} f_{X \lvert \Theta}(x \lvert \theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\theta \ e^{-\theta x}\biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{1}{2} \ \theta^3 \ e^{-\theta(x+1)} \ d \theta \\&=\frac{1}{2} \frac{6}{(x+1)^4} \int_0^{\infty} \frac{(x+1)^4}{3!} \ \theta^{4-1} \ e^{-\theta(x+1)} \ d \theta \\&=\frac{3}{(x+1)^4} \end{aligned}

Several other distributional quantities are also weighted averages, which include the unconditional mean, and the second moment.

\displaystyle \begin{aligned}E(X)&=\int_0^{\infty} E(X \lvert \Theta=\theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\frac{1}{\theta} \biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{1}{2} \ \theta \ e^{-\theta} \ d \theta \\&=\frac{1}{2} \end{aligned}

\displaystyle \begin{aligned}E(X^2)&=\int_0^{\infty} E(X^2 \lvert \Theta=\theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\frac{2}{\theta^2} \biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} e^{-\theta} \ d \theta \\&=1 \end{aligned}

As a result, the unconditional variance is $Var(X)=1-\frac{1}{4}=\frac{3}{4}$. Note that the unconditional variance is not the weighted average of the conditional variance. The weighted average of the conditional variance only produces $\frac{1}{2}$.

\displaystyle \begin{aligned}E[Var(X \lvert \Theta)]&=\int_0^{\infty} Var(X \lvert \Theta=\theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\frac{1}{\theta^2} \biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{1}{2} \ e^{-\theta} \ d \theta \\&=\frac{1}{2} \end{aligned}

It turns out that the unconditional variance has two components, the expected value of the conditional variances and the variance of the conditional means. In this example, the former is $\frac{1}{2}$ and the latter is $\frac{1}{4}$. The additional variance in the amount of $\frac{1}{4}$ is a reflection that there is uncertainty in the parameter $\theta$.

\displaystyle \begin{aligned}Var(X)&=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)] \\&=\frac{1}{2}+\frac{1}{4}\\&=\frac{3}{4} \end{aligned}

——————————————————————————————————————-

The Definition of Mixture
The unconditional pdf $f_X(x)$ derived in Example 1 is that of a Pareto distribution. Thus the Pareto distribution is a continuous mixture of exponential distributions with Gamma mixing weights.

Mathematically speaking, a mixture arises when a probability density function $f(x \lvert \theta)$ depends on a parameter $\theta$ that is uncertain and is itself a random variable with density $g(\theta)$. Then taking the weighted average of $f(x \lvert \theta)$ with $g(\theta)$ as weight produces the mixture distribution.

A continuous random variable $X$ is said to be a mixture if its probability density function $f_X(x)$ is a weighted average of a family of probability density functions $f(x \lvert \theta)$. The random variable $\Theta$ is said to be the mixing random variable and its pdf $g(\theta)$ is said to be the mixing weight. An equivalent definition of mixture is that the distribution function $F_X(x)$ is a weighted average of a family of distribution functions indexed by a mixing variable. Thus $X$ is a mixture if one of the following holds.

$\displaystyle f_X(x)=\int_{-\infty}^{\infty} f(x \lvert \theta) \ g(\theta) \ d \theta$

$\displaystyle F_X(x)=\int_{-\infty}^{\infty} F(x \lvert \theta) \ g(\theta) \ d \theta$

Similarly, a discrete random variable is a mixture if its probability function (or distribution function) is a weighted sum of a family of probability functions (or distribution functions). Thus $X$ is a mixture if one of the following holds.

$\displaystyle P(X=x)=\sum \limits_{y} P(X=x \lvert Y=y) \ P(Y=y)$

$\displaystyle P(X \le x)=\sum \limits_{y} P(X \le x \lvert Y=y) \ P(Y=y)$

See this blog post for practice problems on mixture distributions.

Reference

1. Klugman S.A., Panjer H. H., Wilmot G. E. Loss Models, From Data to Decisions, Second Edition., Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

# A basic look at joint distributions

This is a discussion of how to work with joint distributions of two random variables. We limit the discussion on continuous random variables. The discussion of the discrete case is similar (for the most part replacing the integral signs with summation signs). Suppose $X$ and $Y$ are continuous random variables where $f_{X,Y}(x,y)$ is the joint probability density function. What this means is that $f_{X,Y}(x,y)$ satisfies the following two properties:

• for each point $(x,y)$ in the Euclidean plane, $f_{X,Y}(x,y)$ is a nonnegative real number,
• $\displaystyle \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dx \ dy=1$.

Because of the second bullet point, the function $f_{X,Y}(x,y)$ must be an integrable function. We will not overly focus on this point and instead be satisfied with knowing that it is possible to integrate $f_{X,Y}(x,y)$ over the entire $xy$ plane and its many reasonable subregions.

Another way to think about $f_{X,Y}(x,y)$ is that it assigns the density to each point in the $xy$ plane (i.e. it tells us how much weight is assigned to each point). Consequently, if we want to know the probability that $(X,Y)$ falls in the region $A$, we simply evaluate the following integral:

$\displaystyle \int_{A} f_{X,Y}(x,y) \ dx \ dy$.

For instance, to find $P(X and $P(X+Y \le z)$, where $z>0$, we evaluate the integral over the regions $x and $x+y \le z$, respectively. The integrals are:

$\displaystyle P(X

$\displaystyle P(X+Y \le z)=\int_{-\infty}^{\infty} \int_{-\infty}^{x} f_{X,Y}(x,y) \ dy \ dx$

Note that $P(X+Y \le z)$ is the distribution function $F_Z(z)=P(X+Y \le z)$ where $Z=X+Y$. Then the pdf of $Z$ is obtained by differentiation, i.e. $f_Z(z)=F_Z^{'}(z)$.

In practice, all integrals involving the density functions need be taken only over those $x$ and $y$ values where the density is positive.

——————————————————————————————————————–

Marginal Density

The joint density function $f_{X,Y}(x,y)$ describes how the two variables behave in relation to one another. The marginal probability density function (marginal pdf) is of interest if we are only concerned in one of the variables. To obtain the marginal pdf of $X$, we simply integrate $f_{X,Y}(x,y)$ and sum out the other variable. The following integral produces the marginal pdf of $X$:

$\displaystyle f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dy$

The marginal pdf of $X$ is obtained by summing all the density along the vertical line that meets the $x$ axis at the point $(x,0)$ (see Figure 1). Thus $f_X(x)$ represents the sum total of all density $f_{X,Y}(x,y)$ along a vertical line.

Obviously, if we find the marginal pdf for each vertical line and sum all the marginal pdfs, the result will be 1.0. Thus $f_X(x)$ can be regarded as a single-variable pdf.

\displaystyle \begin{aligned}\int_{-\infty}^{\infty}f_X(x) \ dx&=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dy \ dx=1 \\&\text{ } \end{aligned}

The same can be said for the marginal pdf of the other variable $Y$, except that $f_Y(y)$ is the sum (integral in this case) of all the density on a horizontal line that meets the $y$ axis at the point $(0,y)$.

$\displaystyle f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dx$

——————————————————————————————————————–

Example 1

Let $X$ and $Y$ be jointly distributed according to the following pdf:

$\displaystyle f_{X,Y}(x,y)=y^2 \ e^{-y(x+1)}, \text{ where } x>0,y>0$

The following derives the marginal pdfs for $X$ and $Y$:

\displaystyle \begin{aligned}f_X(x)&=\int_0^{\infty} y^2 \ e^{-y(x+1)} \ dy \\&\text{ } \\&=\frac{2}{(x+1)^3} \int_0^{\infty} \frac{(x+1)^3}{2!} y^{3-1} \ e^{-y(x+1)} \ dy \\&\text{ } \\&=\frac{2}{(x+1)^3} \end{aligned}

\displaystyle \begin{aligned}f_Y(y)&=\int_0^{\infty} y^2 \ e^{-y(x+1)} \ dx \\&\text{ } \\&=y \ e^{-y} \int_0^{\infty} y \ e^{-y x} \ dx \\&\text{ } \\&=y \ e^{-y} \end{aligned}

In the middle step of the derivation of $f_X(x)$, the integrand is the Gamma pdf with parameters $x+1$ and 3, hence the integral in that step becomes 1. In the middle step for $f_Y(y)$, the integrand is the pdf of an exponential distribution.

——————————————————————————————————————–

Conditional Density

Now consider the joint density $f_{X,Y}(x,y)$ restricted to a vertical line, treating the vertical line as a probability distribution. In essense, we are restricting our focus on one particular realized value of $X$. Given a realized value $x$ of $X$, how do we describe the behavior of the other variable $Y$? Since the marginal pdf $f_X(x)$ is the sum total of all density on a vertical line, we express the conditional density as joint density $f_{X,Y}(x,y)$ as a fraction of $f_X(x)$.

$\displaystyle f_{Y \lvert X}(y \lvert x)=\frac{f_{X,Y}(x,y)}{f_X(x)}$

It is easy to see that $f_{Y \lvert X}(y \lvert x)$ is a probability density function of $Y$. When we already know that $X$ has a realized value, this pdf tells us information about how $Y$ behaves. Thus this pdf is called the conditional pdf of $Y$ given $X=x$.

Given a realized value $x$ of $X$, we may want to know the conditional mean and the higher moments of $Y$.

$\displaystyle E(Y \lvert X=x)=\int_{-\infty}^{\infty} y \ f_{Y \lvert X}(y \lvert x) \ dy$

$\displaystyle E(Y^n \lvert X=x)=\int_{-\infty}^{\infty} y^n \ f_{Y \lvert X}(y \lvert x) \ dy \text{ where } n>1$

In particular, the conditional variance of $Y$ is:

$\displaystyle Var(Y \lvert X=x)=E(Y^2 \lvert X=x)-E(Y \lvert X=x)^2$

The discussion for the conditional density of $X$ given a realized value $y$ of $Y$ is similar, except that we restrict the joint density $f_{X,Y}(x,y)$ on a horizontal line. We have the following information about the conditional distribution of $X$ given a realized value $Y=y$.

$\displaystyle f_{X \lvert Y}(x \lvert y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}$

$\displaystyle E(X \lvert Y=y)=\int_{-\infty}^{\infty} x \ f_{X \lvert Y}(x \lvert y) \ dx$

$\displaystyle E(X^n \lvert Y=y)=\int_{-\infty}^{\infty} x^n \ f_{X \lvert Y}(x \lvert y) \ dx \text{ where } n>1$

In particular, the conditional variance of $X$ is:

$\displaystyle Var(X \lvert Y=y)=E(X^2 \lvert Y=y)-E(X \lvert Y=y)^2$

——————————————————————————————————————–

Example 1 (Continued)

The following derives the conditional density functions:

\displaystyle \begin{aligned}f_{Y \lvert X}(y \lvert x)&=\frac{f_{X,Y}(x,y)}{f_X(x)} \\&\text{ } \\&=\displaystyle \frac{y^2 e^{-y(x+1)}}{\frac{2}{(x+1)^3}} \\&\text{ } \\&=\frac{(x+1)^3}{2!} \ y^2 \ e^{-y(x+1)} \end{aligned}

\displaystyle \begin{aligned}f_{X \lvert Y}(x \lvert y)&=\frac{f_{X,Y}(x,y)}{f_Y(y)} \\&\text{ } \\&=\displaystyle \frac{y^2 e^{-y(x+1)}}{y \ e^{-y}} \\&\text{ } \\&=y \ e^{-y \ x} \end{aligned}

The conditional density $f_{Y \lvert X}(y \lvert x)$ is that of a Gamma distribution with parameters $x+1$ and 3. So given a realized value $x$ of $X$, $Y$ has a Gamma distribution whose scale parameter is $x+1$ and whose shape parameter is 3. On the other hand, the conditional density $f_{X \lvert Y}(x \lvert y)$ is that of an exponential distribution. Given a realized value $y$ of $Y$, $X$ has an exponential distribution with parameter $y$. Since the conditional distributions are familiar parametric distributions, we have the following conditional means and conditional variances.

$\displaystyle E(Y \lvert X=x)=\frac{3}{x+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ Var(Y \lvert X=x)=\frac{3}{(x+1)^2}$

$\displaystyle E(X \lvert Y=y)=\frac{1}{y} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Var(X \lvert Y=y)=\frac{1}{y^2}$

Note that both conditional means are decreasing functions. The larger the realized value of $X$, the smaller the mean $E(Y \lvert X=x)$. Likewise, the larger the realized value of $Y$, the smaller the mean $E(X \lvert Y=y)$. It appears that $X$ and $Y$ moves opposite of each other. This is also confirmed by the fact that $Cov(X,Y)=-1$.
——————————————————————————————————————–

Mixture Distributions

In the preceding discussion, the conditional distributions are derived from the joint distributions and the marginal distributions. In some applications, it is the opposite: we know the conditional distribution of one variable given the other variable and construct the joint distributions. We have the following:

\displaystyle \begin{aligned}f_{X,Y}(x,y)&=f_{Y \lvert X}(y \lvert x) \ f_X(x) \\&\text{ } \\&=f_{X \lvert Y}(x \lvert y) \ f_Y(y) \end{aligned}

The form of the joint pdf indicated above has an interesting interpretation as a mixture. Using an insurance example, suppose that $f_{X \lvert Y}(x \lvert y)$ is a model of the claim cost of a randomly selected insured where $y$ is a realized value of a parameter $Y$ that is to indicate the risk characteristics of an insured. The members of this large population have a wide variety of risk characteristics and the random variable $Y$ is to capture the risk charateristics across the entire population. Consequently, the unconditional claim cost for a randomly selected insured is:

$\displaystyle f_X(x)=\int_{-\infty}^{\infty} f_{X \lvert Y}(x \lvert y) \ f_Y(y) \ dy$

Note that the above unconditional pdf $f_X(x)$ is a weighted average of conditional pdfs. Thus the distribution derived in this manner is called a mixture distribution. The pdf $f_Y(y)$ is called the mixture weight or mixing weight. Some distributional quantities of a mixture distribution are also the weighted average of the conditional counterpart. These include the distribution function, mean, higher moments. Thus we have;

$\displaystyle F_X(x)=\int_{-\infty}^{\infty} F_{X \lvert Y}(x \lvert y) \ f_Y(y) \ dy$

$\displaystyle E(X)=\int_{-\infty}^{\infty} E(X \lvert Y=y) \ f_Y(y) \ dy$

$\displaystyle E(X^k)=\int_{-\infty}^{\infty} E(X^k \lvert Y=y) \ f_Y(y) \ dy$

In the above derivations, the cumulative distribution function $F_X(x)$ and the moments of $E(X^k)$ are weighted averages of their conditional counterparts. However, the variance $Var(X)$ cannot be the weighted average of conditional variances. To find out why, see the post The variance of a mixture.