Derive some facts of the negative binomial distribution

The previous post, The Negative Binomial Distribution, gives a fairly comprehensive discussion of the negative binomial distribution. In this post, we fill in some of the details that are glossed over in that previous post. We cover the following points:

  • Discuss the several versions of the negative binomial distribution.
  • The negative binomial probabilities sum to one, i.e., the negative binomial probability function is a valid one.
  • Derive the moment generating function of the negative binomial distribution.
  • Derive the first and second moments and the variance of the negative binomial distribution.
  • An observation about independent sum of negative binomial distributions.

________________________________________________________________________

Three versions

The negative binomial distribution has two parameters r and p, where r is a positive real number and 0<p<1. The first two versions arise from the case that r is a positive integer, which can be interpreted as the random experiment of a sequence of independent Bernoulli trials until the rth success (the trials have the same probability of success p). In this interpretation, there are two ways of recording the random experiment:

    X = the number of Bernoulli trials required to get the rth success.
    Y = the number of Bernoulli trials that end in failure before getting the rth success.

The other parameter p is the probability of success in each Bernoulli trial. The notation \binom{m}{n} denotes the binomial coefficient, which, for non-negative integers m and n with m \ge n, is defined as:

    \displaystyle \binom{m}{n}=\frac{m!}{n! \ (m-n)!}=\frac{m(m-1) \cdots (m-(n-1))}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)

With this in mind, the following are the probability functions of the random variables X and Y.

    \displaystyle P(X=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The thought process for (1) is that for the event X=x to happen, there must be exactly r-1 successes in the first x-1 trials and one additional success occurring in the last trial (the xth trial). The thought process for (2) is that for the event Y=y to happen, there are y+r trials (y failures and r successes). In the first y+r-1 trials, there must be exactly y failures (or equivalently r-1 successes), and the last trial is a success. Note that X=Y+r. Thus the mean of X follows directly from the mean of Y, a fact we will use below.

Instead of memorizing the probability functions (1) and (2), it is better to understand and remember the thought processes involved. Because of the natural interpretation of performing Bernoulli trials until the rth success, it is a good idea to introduce the negative binomial distribution via the distributions described by (1) and (2), i.e., the case where the parameter r is a positive integer. When r=1, the random experiment is a sequence of independent Bernoulli trials until the first success (this is called the geometric distribution).

Of course, (1) and (2) can also simply be used as counting distributions without any connection with a series of Bernoulli trials (e.g. used in an insurance context as the number of losses or claims arising from a group of insurance policies).

The binomial coefficient in (0) is defined when both numbers are non-negative integers and the top one is greater than or equal to the bottom one. However, the rightmost term in (0) can be calculated even when the top number m is not a non-negative integer. Thus when m is any real number, the rightmost term in (0) can be calculated provided that the bottom number n is a positive integer. For convenience we define \binom{m}{0}=1. With this in mind, the binomial coefficient \binom{m}{n} is defined for any real number m and any non-negative integer n.

The third version of the negative binomial distribution arises from the relaxation of the binomial coefficient \binom{m}{n} just discussed. With this in mind, the probability function in (2) can be defined for any positive real number r:

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

where \displaystyle \binom{y+r-1}{y}=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!}.

Of course when r is a positive integer, versions (2) and (3) are identical. When r is a positive real number but is not an integer, the distribution cannot be interpreted as the number of failures until the occurrence of rth success. Instead, it is used as a counting distribution.
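As a quick illustration, the following Python sketch evaluates the probability functions (1) and (3). The function names gen_binom, nb_pmf and nb_trials_pmf are our own choices for illustration (they are not from any library), and only the standard math module is assumed.

```python
import math

def gen_binom(m, n):
    """Generalized binomial coefficient m(m-1)...(m-(n-1))/n! for real m
    and non-negative integer n, with the convention that it equals 1 when n=0."""
    result = 1.0
    for i in range(n):
        result *= (m - i) / (i + 1)
    return result

def nb_pmf(y, r, p):
    """P(Y=y) as in (3): y failures, r any positive real number, 0<p<1."""
    return gen_binom(y + r - 1, y) * p**r * (1 - p)**y

def nb_trials_pmf(x, r, p):
    """P(X=x) as in (1): x trials in total, r a positive integer."""
    return math.comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

# When r is a positive integer, X=Y+r, so P(X=5) with r=3 equals P(Y=2).
print(nb_trials_pmf(5, 3, 0.5), nb_pmf(2, 3, 0.5))
# A non-integer r is allowed in version (3).
print(nb_pmf(2, 2.5, 0.5))
```

The first print statement shows the two integer-r versions agreeing once the shift X=Y+r is taken into account.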

________________________________________________________________________

The probabilities sum to one

Do the probabilities in (1), (2) or (3) sum to one? For the interpretations of (1) and (2), is it possible to repeatedly perform Bernoulli trials and never get the rth success? For r=1, is it possible to never even get a success? In tossing a fair coin repeatedly, soon enough you will get a head and even if r is a large number, you will eventually get r number of heads. Here we wish to prove this fact mathematically.

To show that (1), (2) and (3) are indeed probability functions, we use a fact concerning Maclaurin’s series expansion of the function (1-x)^{-r}, a fact that is covered in a calculus course. In the following two results, r is a fixed positive real number and y is any non-negative integer:

    \displaystyle \binom{y+r-1}{y}=(-1)^y \ \binom{-r}{y} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

    \displaystyle \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y=(1-x)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

The result (4) is to rearrange the binomial coefficient in probability function (3) to another binomial coefficient with a negative number. This is why there is the word “negative” in negative binomial distribution. The result (5) is the Maclaurin’s series expansion for the function (1-x)^{-r}. We first derive these two facts and then use them to show that the negative binomial probabilities in (3) sum to one. The following derives (4).

    \displaystyle \begin{aligned} \binom{y+r-1}{y}&=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!} \\&=(-1)^y \ \frac{(-r)(-r-1) \cdots (-r-(y-1))}{y!} \\&=(-1)^y \ \binom{-r}{y}  \end{aligned}

To derive (5), let f(x)=(1-x)^{-r}. Based on a theorem that can be found in most calculus texts, the function f(x) has the following Maclaurin’s series expansion (Maclaurin’s series is simply Taylor’s series with center = 0).

    \displaystyle (1-x)^{-r}=f(0)+f^{'}(0)x+\frac{f^{(2)}(0)}{2!}x^2+\frac{f^{(3)}(0)}{3!}x^3+\cdots + \frac{f^{(n)}(0)}{n!}x^n+\cdots

where -1<x<1. Now, filling in the derivatives f^{(n)}(0), we have the following derivation.

    \displaystyle \begin{aligned} (1-x)^{-r}&=1+rx+\frac{(r+1)r}{2!}x^2+\frac{(r+2)(r+1)r}{3!}x^3 \\& \ \ \ \ \ \ \ \ +\cdots+\frac{(r+y-1)(r+y-2) \cdots (r+1)r}{y!}x^y +\cdots \\&=1+(-1)^1 (-r)x+(-1)^2\frac{(-r)(-r-1)}{2!}x^2 \\& \ \ \ \ \ \ +(-1)^3 \frac{(-r)(-r-1)(-r-2)}{3!}x^3 +\cdots \\& \ \ \ \ \ \ +(-1)^y \frac{(-r)(-r-1) \cdots (-r-y+2)(-r-y+1)}{y!}x^y +\cdots  \\&=(-1)^0 \binom{-r}{0}x^0 +(-1)^1 \binom{-r}{1}x^1+(-1)^2 \binom{-r}{2}x^2 \\& \ \ \ \ \ \ +(-1)^3 \binom{-r}{3}x^3+\cdots +(-1)^y \binom{-r}{y}x^y+\cdots    \\&=\sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y \end{aligned}

We can now show that the negative binomial probabilities in (3) sum to one. Let q=1-p.

    \displaystyle \begin{aligned} \sum \limits_{y=0}^\infty \binom{y+r-1}{y} \ p^r \ q^y &=p^r \ \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ q^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ (1-q)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5)\\&=p^r p^{-r} \\&=1 \end{aligned}
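The two facts used above can also be checked numerically. The sketch below is only a sanity check under our own naming (gen_binom and nb_partial_sum are hypothetical helper names): it compares the two sides of (4) for a non-integer r and shows that the partial sums of (3) approach one.

```python
def gen_binom(m, n):
    """Generalized binomial coefficient for real m and non-negative integer n."""
    result = 1.0
    for i in range(n):
        result *= (m - i) / (i + 1)
    return result

def nb_partial_sum(r, p, terms):
    """Sum of the probabilities in (3) over y = 0, 1, ..., terms-1."""
    q = 1 - p
    return sum(gen_binom(y + r - 1, y) * p**r * q**y for y in range(terms))

r, p = 2.5, 0.3

# Identity (4): binom(y+r-1, y) = (-1)^y binom(-r, y)
for y in range(6):
    left = gen_binom(y + r - 1, y)
    right = (-1)**y * gen_binom(-r, y)
    print(y, left, right)

# Partial sums of (3) get closer and closer to one.
for terms in (5, 20, 200):
    print(terms, nb_partial_sum(r, p, terms))
```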

________________________________________________________________________

The moment generating function

We now derive the moment generating function of the negative binomial distribution according to (3). The moment generating function is M(t)=E(e^{tY}) over all real numbers t for which M(t) is defined. The following derivation does the job.

    \displaystyle \begin{aligned} M(t)&=E(e^{tY}) \\&=\sum \limits_{y=0}^\infty \ e^{t y} \ \binom{y+r-1}{y} \ p^r \ (1-p)^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ \binom{y+r-1}{y} \ [(1-p) e^t]^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ (-1)^y \binom{-r}{y} \ [(1-p) e^t]^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ [1-(1-p) \ e^t]^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5) \\&=\frac{p^r}{[1-(1-p) \ e^t]^{r}}\end{aligned}

The above moment generating function works for the negative binomial distribution with respect to (3) and thus to (2). For the distribution in (1), note that X=Y+r. Thus E(e^{tX})=E(e^{t(Y+r)})=e^{tr} \ E(e^{tY}). The moment generating function of (1) is simply the above moment generating function multiplied by the factor e^{tr}. To summarize, the moment generating functions for the three versions are:

    \displaystyle M_X(t)=E[e^{tX}]=\frac{p^r \ e^{tr}}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (1)

    \displaystyle M_Y(t)=E[e^{tY}]=\frac{p^r}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (2) \text{ and } (3)

The domain of the moment generating function is the set of all t for which M_X(t) or M_Y(t) is defined and is positive. Based on the form that it takes, we focus on making sure that 1-(1-p) \ e^t>0. This leads to the domain t<-\text{ln}(1-p).
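As a hedged numerical check of the closed form for M_Y(t), the following sketch compares it with a truncated version of the defining series E(e^{tY}). The helper names (nb_pmf_list, mgf_closed_form, mgf_by_series) and the truncation at 500 terms are our own assumptions; the probabilities are built with the recursion P(Y=y+1)=P(Y=y) \frac{y+r}{y+1} (1-p), which follows from (3).

```python
import math

def nb_pmf_list(r, p, n_terms):
    """Probabilities in (3) for y = 0, ..., n_terms-1, built recursively."""
    probs = [p**r]
    for y in range(n_terms - 1):
        probs.append(probs[-1] * (y + r) / (y + 1) * (1 - p))
    return probs

def mgf_closed_form(t, r, p):
    """M_Y(t) = p^r / [1-(1-p)e^t]^r, valid only for t < -ln(1-p)."""
    if t >= -math.log(1 - p):
        raise ValueError("t is outside the domain t < -ln(1-p)")
    return (p / (1 - (1 - p) * math.exp(t)))**r

def mgf_by_series(t, r, p, n_terms=500):
    """Truncated series for E(e^{tY}); a numerical approximation only."""
    probs = nb_pmf_list(r, p, n_terms)
    return sum(math.exp(t * y) * prob for y, prob in enumerate(probs))

r, p, t = 2.5, 0.4, 0.2   # here -ln(1-p) is about 0.51, so t=0.2 is in the domain
print(mgf_closed_form(t, r, p), mgf_by_series(t, r, p))
```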

________________________________________________________________________

The mean and the variance

With the moment generating function derived in the above section, we can now focus on finding the moments of the negative binomial distribution. To find the moments, simply take the derivatives of the moment generating function and evaluate at t=0. For the distribution represented by the probability function in (3), we calculate the following:

    E(Y)=M_Y^{'}(0)

    E(Y^2)=M_Y^{(2)}(0)

    Var(Y)=E(Y^2)-E(Y)^2

After taking the first and second derivatives and evaluating at t=0, the first and the second moments are:

    \displaystyle E(Y)=r \ \frac{1-p}{p}

    \displaystyle E(Y^2)=\frac{r(1-p)[1+r(1-p)]}{p^2}

The following derives the variance.

    \displaystyle \begin{aligned} Var(Y)&=E(Y^2)-E(Y)^2 \\&=\frac{r(1-p)[1+r(1-p)]}{p^2}-\frac{r^2(1-p)^2}{p^2} \\&=\frac{r(1-p)[1+r(1-p)-r(1-p)]}{p^2} \\&=\frac{r(1-p)}{p^2}  \end{aligned}

The above formula is the variance for the three versions (1), (2) and (3). Note that Var(Y)>E(Y). In contrast, the variance of the Poisson distribution is identical to its mean. Thus in the situation where the variance of observed data is greater than the sample mean, the negative binomial distribution should be a better fit than the Poisson distribution.
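The recipe of differentiating the moment generating function at t=0 can be imitated numerically. The sketch below uses central finite differences in place of exact derivatives, so it is only an approximation; the function names and the step size h=10^{-4} are our own choices.

```python
import math

def nb_mgf(t, r, p):
    """Moment generating function of version (3), for t < -ln(1-p)."""
    return (p / (1 - (1 - p) * math.exp(t)))**r

def moments_from_mgf(r, p, h=1e-4):
    """Approximate E(Y), E(Y^2) and Var(Y) by finite differences of the MGF at t=0."""
    m_plus, m_zero, m_minus = nb_mgf(h, r, p), nb_mgf(0.0, r, p), nb_mgf(-h, r, p)
    first = (m_plus - m_minus) / (2 * h)              # approximates M'(0)
    second = (m_plus - 2 * m_zero + m_minus) / h**2   # approximates M''(0)
    return first, second, second - first**2

r, p = 3, 0.4
mean, second_moment, variance = moments_from_mgf(r, p)
print(mean, r * (1 - p) / p)          # approximate vs. exact mean
print(variance, r * (1 - p) / p**2)   # approximate vs. exact variance
```

For r=3 and p=0.4 the exact mean and variance are 4.5 and 11.25, and the finite-difference values should agree with these to several decimal places.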

________________________________________________________________________

The independent sum

There is an easy consequence that follows from the moment generating function derived above. The sum of several independent negative binomial distributions is also a negative binomial distribution. For example, suppose T_1,T_2, \cdots,T_n are independent negative binomial random variables (version (3)). Suppose each T_j has parameters r_j and p (the second parameter is identical). The moment generating function of the independent sum is the product of the individual moment generating functions. Thus the following is the moment generating function of T=T_1+\cdots+T_n.

    \displaystyle M_T(t)=E[e^{tT}]=\frac{p^g}{[1-(1-p) \ e^t]^{g}}

where g=r_1+\cdots+r_n. The moment generating function uniquely identifies the distribution. The above M_T(t) is that of a negative binomial distribution with parameters g and p according to (3).

A special case is that the sum of n independent geometric distributions is a negative binomial distribution with the r parameter being r=n. The following is the moment generating function of the sum W of n independent geometric distributions.

    \displaystyle M_W(t)=E[e^{tW}]=\frac{p^n}{[1-(1-p) \ e^t]^{n}}
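The independent sum property can also be seen without moment generating functions by convolving two probability functions directly. The following is a minimal sketch with our own helper names (nb_pmf_list and convolve); it compares the convolution of two version (3) distributions sharing the same p with the single distribution having parameter r_1+r_2.

```python
def nb_pmf_list(r, p, n_terms):
    """Probabilities in (3) for y = 0, ..., n_terms-1, built recursively."""
    probs = [p**r]
    for y in range(n_terms - 1):
        probs.append(probs[-1] * (y + r) / (y + 1) * (1 - p))
    return probs

def convolve(a, b):
    """Probability function of the sum of two independent counts.
    Assumes len(b) >= len(a); entry k only needs a[0..k] and b[0..k]."""
    return [sum(a[j] * b[k - j] for j in range(k + 1)) for k in range(len(a))]

r1, r2, p, n_terms = 1.5, 2.0, 0.3, 30
sum_pmf = convolve(nb_pmf_list(r1, p, n_terms), nb_pmf_list(r2, p, n_terms))
direct_pmf = nb_pmf_list(r1 + r2, p, n_terms)
print(max(abs(s - d) for s, d in zip(sum_pmf, direct_pmf)))
```

The printed maximum difference should be at the level of floating point rounding error.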

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Getting from Binomial to Poisson

In many binomial problems, the number of Bernoulli trials n is large, relatively speaking, and the probability of success p is small such that n p is of moderate magnitude. For example, consider problems that deal with rare events where the probability of occurrence is small (as a concrete example, counting the number of people with July 1 as birthday out of a random sample of 1000 people). It is often convenient to approximate such binomial problems using the Poisson distribution. The justification for using the Poisson approximation is that the Poisson distribution is a limiting case of the binomial distribution. Now that cheap computing power is widely available, it is quite easy to use a computer or other computing device to obtain exact binomial probabilities for experiments with 1000 trials or more. Though the Poisson approximation may no longer be necessary for such problems, knowing how to get from binomial to Poisson is important for understanding the Poisson distribution itself.

Consider a counting process that describes the occurrences of a certain type of event of interest in a unit time interval subject to three simplifying assumptions (discussed below). We are interested in counting the number of occurrences of the event of interest in a unit time interval. As a concrete example, consider the number of cars arriving at an observation point on a certain highway in a period of time, say one hour. We wish to model the probability distribution of how many cars will arrive at the observation point on this particular highway in one hour. Let X be the random variable described by this probability distribution. We wish to know the probability that there are k cars arriving in one hour. We start with using a binomial distribution as an approximation to the probability P(X=k). We will see that upon letting n \rightarrow \infty, the P(X=k) is a Poisson probability.

Suppose that we know E(X)=\alpha, perhaps an average obtained after observing cars at the observation points for many hours. The simplifying assumptions alluded to earlier are the following:

  1. The numbers of cars arriving in nonoverlapping time intervals are independent.
  2. The probability of one car arriving in a very short time interval of length h is \alpha h.
  3. The probability of having more than one car arriving in a very short time interval is essentially zero.

Assumption 1 means that a large number of cars arriving in one period does not imply fewer cars will arrive in the next period and vice versa. In other words, the number of cars that arrive in any one given moment does not affect the number of cars that will arrive subsequently. Knowing how many cars arrive in one minute will not help predict the number of cars arriving in the next minute. Assumption 2 means that the rate of cars arriving is dependent only on the length of the time interval and not on when the time interval occurs (e.g. not on whether it is at the beginning of the hour or toward the end of the hour). Assumptions 2 and 3 allow us to think of a very short period of time as a Bernoulli trial. Thinking of the arrival of a car as a success, each short time interval will result in only one success or one failure.

To start, we can break up the hour into 60 minutes (into 60 Bernoulli trials). We then consider the binomial distribution with n=60 and p=\frac{\alpha}{60}. So the following is an approximation to our desired probability distribution.

\displaystyle (1) \ \ \ \ \ P(X=k) \approx \binom{60}{k} \biggl(\frac{\alpha}{60}\biggr)^k \biggr(1-\frac{\alpha}{60}\biggr)^{60-k} \ \ \ \ \ k=0,1,2,\cdots, 60

Conceivably, there can be more than 1 car arriving in a minute and observing cars in a one-minute interval may not be a Bernoulli trial. For a one-minute interval to qualify as a Bernoulli trial, there is either no car arriving or 1 car arriving in that one minute. So we can break up an hour into 3,600 seconds (into 3,600 Bernoulli trials). We now consider the binomial distribution with n=3600 and p=\frac{\alpha}{3600}.

\displaystyle (2) \ \ \ \ \ P(X=k) \approx \binom{3600}{k} \biggl(\frac{\alpha}{3600}\biggr)^k \biggr(1-\frac{\alpha}{3600}\biggr)^{3600-k} \ \ \ \ \ k=0,1,2,\cdots, 3600

It is also conceivable that more than 1 car can arrive in one second and observing cars in one-second interval may still not qualify as a Bernoulli trial. So we need to get more granular. We can divide up the hour into n equal subintervals, each of length \frac{1}{n}. The assumptions 2 and 3 ensure that each subinterval is a Bernoulli trial (either it is a success or a failure; one car arriving or no car arriving). Assumption 1 tells us that all the n subintervals are independent. So breaking up the hour into n moments and counting the number of moments that are successes will result in a binomial distribution with parameters n and p=\frac{\alpha}{n}. So we are ready to proceed with the following approximation to our probability distribution P(X=k).

\displaystyle (3) \ \ \ \ \ P(X=k) \approx \binom{n}{k} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k} \ \ \ \ \ k=0,1,2,\cdots, n

As we get more granular, n \rightarrow \infty. We show that the limit of the binomial probability in (3) is the Poisson distribution with parameter \alpha. We show the following.

\displaystyle (4) \ \ \ \ \ P(X=k) = \lim \limits_{n \rightarrow \infty} \binom{n}{k} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k}=\frac{e^{-\alpha} \alpha^k}{k!} \ \ \ \ \ \ k=0,1,2,\cdots

In the derivation of (4), we need the following two mathematical tools. The statement (5) is one of the definitions of the mathematical constant e. In the statement (6), the integer n in the numerator is greater than the integer k in the denominator. It says that whenever we work with such a ratio of two factorials, the result is the product of n with the smaller integers down to (n-(k-1)). There are exactly k terms in the product.

\displaystyle (5) \ \ \ \ \ \lim \limits_{n \rightarrow \infty} \biggl(1+\frac{x}{n}\biggr)^n=e^x

\displaystyle (6) \ \ \ \ \ \frac{n!}{(n-k)!}=n(n-1)(n-2) \cdots (n-k+1) \ \ \ \ \ \ \ \  k<n

The following is the derivation of (4).

\displaystyle \begin{aligned}(7) \ \ \ \ \  P(X=k)&=\lim \limits_{n \rightarrow \infty} \binom{n}{k} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k} \\&=\lim \limits_{n \rightarrow \infty} \ \frac{n!}{k! (n-k)!} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k} \\&=\lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \biggl(\frac{\alpha^k}{k!}\biggr) \biggr(1-\frac{\alpha}{n}\biggr)^{n} \biggr(1-\frac{\alpha}{n}\biggr)^{-k} \\&=\biggl(\frac{\alpha^k}{k!}\biggr) \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \\&\times \ \ \ \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{n} \ \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{-k} \\&=\frac{e^{-\alpha} \alpha^k}{k!} \end{aligned}

In (7), we have \displaystyle \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k}=1. The reason is that the numerator is a polynomial in n whose leading term is n^k. Upon dividing by n^k and taking the limit, we get 1. Based on (5) with x=-\alpha, we have \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{n}=e^{-\alpha}. For the last limit in the derivation, the exponent -k is fixed while the base tends to 1, so we have \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{-k}=1.
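To see the limit at work, the following small sketch compares the binomial probability with n subintervals against the Poisson probability for a fixed k. The values \alpha=3 and k=2 are arbitrary choices for illustration, and the function names are ours.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, alpha):
    """Poisson probability e^{-alpha} alpha^k / k!."""
    return math.exp(-alpha) * alpha**k / math.factorial(k)

alpha, k = 3.0, 2
target = poisson_pmf(k, alpha)
# n=60 (minutes), n=3600 (seconds), and a much finer subdivision
for n in (60, 3600, 1_000_000):
    approx = binomial_pmf(k, n, alpha / n)
    print(n, approx, abs(approx - target))
```

The absolute difference in the last column should shrink as n grows, in line with the limit in (4).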

We conclude with some comments. As the above derivation shows, the Poisson distribution is at heart a binomial distribution. When we divide the unit time interval into more and more subintervals (as the subintervals get more and more granular), the resulting binomial distribution behaves more and more like the Poisson distribution.

The three assumptions used in the derivation are called the Poisson postulates, which are the underlying assumptions that govern a Poisson process. Such a random process describes the occurrences of some type of event of interest (e.g. the arrivals of cars in our example) in a fixed period of time. The positive constant \alpha indicated in Assumption 2 is the parameter of the Poisson process, which can be interpreted as the rate of occurrences of the event of interest (or rate of changes, or rate of arrivals) in a unit time interval, meaning that the positive constant \alpha is the mean number of occurrences in the unit time interval. The derivation in (7) shows that whenever a certain type of event occurs according to a Poisson process with parameter \alpha, the counting variable of the number of occurrences in the unit time interval is distributed according to the Poisson distribution as indicated in (4).

If we observe the occurrences of events over intervals of length other than unit length, say, in an interval of length t, the counting process is governed by the same three postulates, with the modification to Assumption 2 that the rate of changes of the process is now \alpha t. The mean number of occurrences in the time interval of length t is now \alpha t. Assumption 2 now states that for any very short time interval of length h (and that is also a subinterval of the interval of length t under observation), the probability of having one occurrence of the event in this short interval is (\alpha t)h. Applying the same derivation, it can be shown that the number of occurrences (X_t) in a time interval of length t has the Poisson distribution with the following probability mass function.

\displaystyle (8) \ \ \ \ \ P(X_t=k)=\frac{e^{-\alpha t} \ (\alpha t)^k}{k!} \ \ \ \ \ \ \ \  k=0,1,2,\cdots

Relating Binomial and Negative Binomial

The negative binomial distribution has a natural interpretation as a waiting time until the arrival of the rth success (when the parameter r is a positive integer). The waiting time refers to the number of independent Bernoulli trials needed to reach the rth success. This interpretation of the negative binomial distribution gives us a good way of relating it to the binomial distribution. For example, if more than k failures occur before the rth success, then there can be at most r-1 successes in the first k+r trials. This tells us that the survival function of the negative binomial distribution is the cumulative distribution function (cdf) of a binomial distribution. In this post, we give the details behind this observation. A previous post on the negative binomial distribution is found here.

A random experiment resulting in two distinct outcomes (success or failure) is called a Bernoulli trial (e.g. head or tail in a coin toss, whether or not the birthday of a customer is the first of January, whether an insurance claim is above or below a given threshold etc). Suppose a series of independent Bernoulli trials is performed until reaching the rth success where the probability of success in each trial is p. Let X_r be the number of failures before the occurrence of the rth success. The following is the probability mass function of X_r.

\displaystyle (1) \ \ \ \ P(X_r=k)=\binom{k+r-1}{k} p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,3,\cdots

By definition, the survival function and cdf of X_r are:

\displaystyle (2) \ \ \ \ P(X_r > k)=\sum \limits_{j=k+1}^\infty \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

\displaystyle (3) \ \ \ \ P(X_r \le k)=\sum \limits_{j=0}^k \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

For each nonnegative integer k, let Y_{r+k} be the number of successes in performing a sequence of r+k independent Bernoulli trials where p is the probability of success. In other words, Y_{r+k} has a binomial distribution with parameters r+k and p.

If the random experiment requires more than k failures to reach the rth success, there are at most r-1 successes in the first k+r trials. Thus the survival function of X_r is the same as the cdf of a binomial distribution. Equivalently, the cdf of X_r is the same as the survival function of a binomial distribution. We have the following:

\displaystyle \begin{aligned}(4) \ \ \ \ P(X_r > k)&=P(Y_{k+r} \le r-1) \\&=\sum \limits_{j=0}^{r-1} \binom{k+r}{j} p^j (1-p)^{k+r-j} \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}

\displaystyle \begin{aligned}(5) \ \ \ \ P(X_r \le k)&=P(Y_{k+r} > r-1) \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}
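The relation between the negative binomial survival function and the binomial cdf can be verified numerically for particular values. The sketch below is only a check, with function names of our own choosing; it computes P(X_r > k) as one minus the cdf in (3) and compares it with the binomial cdf P(Y_{k+r} \le r-1).

```python
import math

def nb_cdf(k, r, p):
    """P(X_r <= k): sum of the negative binomial probabilities for j = 0, ..., k."""
    prob, total = p**r, 0.0
    for j in range(k + 1):
        total += prob
        prob *= (j + r) / (j + 1) * (1 - p)   # move from P(X_r=j) to P(X_r=j+1)
    return total

def binom_cdf(m, n, p):
    """P(Y <= m) for Y binomial with n trials and success probability p."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(m + 1))

r, p = 3, 0.4
for k in range(6):
    survival = 1 - nb_cdf(k, r, p)          # P(X_r > k)
    print(k, survival, binom_cdf(r - 1, k + r, p))
```

For each k the two printed columns should agree up to rounding error.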

Remark
The relation (4) is analogous to the relationship between the Gamma distribution and the Poisson distribution. Recall that a Gamma distribution with shape parameter n and rate parameter \alpha, where n is a positive integer, can be interpreted as the waiting time until the nth change in a Poisson process with rate \alpha. Thus, if the nth change takes place after time t, there can be at most n-1 arrivals in the time interval [0,t]. Thus the survival function of this Gamma distribution is the same as the cdf of a Poisson distribution with mean \alpha t. The relation (4) is analogous to the following relation.

\displaystyle (6) \ \ \ \ \int_t^\infty \frac{\alpha^n}{(n-1)!} \ x^{n-1} \ e^{-\alpha x} \ dx=\sum \limits_{j=0}^{n-1} \frac{e^{-\alpha t} \ (\alpha t)^j}{j!}
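The Gamma and Poisson relation above can be checked numerically as well. The following sketch approximates the integral on the left by Simpson's rule, truncating the infinite upper limit at t+60/\alpha (an assumption made only so the integral can be computed), and compares it with the Poisson sum on the right. The function names are ours.

```python
import math

def gamma_survival(t, n, alpha, steps=20000):
    """Simpson's rule approximation of the integral of the Gamma density
    alpha^n x^(n-1) e^(-alpha x) / (n-1)! from t to t + 60/alpha.
    steps must be even; the tail beyond the cutoff is negligible."""
    def density(x):
        return alpha**n / math.factorial(n - 1) * x**(n - 1) * math.exp(-alpha * x)
    upper = t + 60.0 / alpha
    h = (upper - t) / steps
    total = density(t) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(t + i * h)
    return total * h / 3

def poisson_cdf(m, mean):
    """P(N <= m) for a Poisson count N with the given mean."""
    return sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(m + 1))

n, alpha, t = 3, 2.0, 1.5
print(gamma_survival(t, n, alpha), poisson_cdf(n - 1, alpha * t))
```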


The Negative Binomial Distribution

A counting distribution is a discrete distribution with probabilities only on the nonnegative integers. Such distributions are important in insurance applications since they can be used to model the number of events such as losses to the insured or claims to the insurer. Though playing a prominent role in statistical theory, the Poisson distribution is not appropriate in all situations, since it requires that the mean and the variance be equal. Thus the negative binomial distribution is an excellent alternative to the Poisson distribution, especially in the cases where the observed variance is greater than the observed mean.

The negative binomial distribution arises naturally from a probability experiment of performing a series of independent Bernoulli trials until the occurrence of the rth success where r is a positive integer. From this starting point, we discuss three ways to define the distribution. We then discuss several basic properties of the negative binomial distribution. Emphasis is placed on the close connection between the Poisson distribution and the negative binomial distribution.

________________________________________________________________________

Definitions
We define three versions of the negative binomial distribution. The first two versions arise from the viewpoint of performing a series of independent Bernoulli trials until the rth success where r is a positive integer. A Bernoulli trial is a probability experiment whose outcome is random such that there are two possible outcomes (success or failure).

Let X_1 be the number of Bernoulli trials required for the rth success to occur where r is a positive integer. Let p be the probability of success in each trial. The following is the probability function of X_1:

\displaystyle (1) \ \ \ \ \ P(X_1=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots

The idea for (1) is that for X_1=x to happen, there must be r-1 successes in the first x-1 trials and one additional success occurring in the last trial (the xth trial).

A more common version of the negative binomial distribution is the number of Bernoulli trials in excess of r in order to produce the rth success. In other words, we consider the number of failures before the occurrence of the rth success. Let X_2 be this random variable. The following is the probability function of X_2:

\displaystyle (2) \ \ \ \ \ P(X_2=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

The idea for (2) is that there are x+r trials and in the first x+r-1 trials, there are x failures (or equivalently r-1 successes).

In both (1) and (2), the binomial coefficient is defined by

\displaystyle (3) \ \ \ \ \ \binom{y}{k}=\frac{y!}{k! \ (y-k)!}=\frac{y(y-1) \cdots (y-(k-1))}{k!}

where y is a positive integer and k is a nonnegative integer with k \le y. However, the right-hand side of (3) can be calculated even if y is not a positive integer. Thus the binomial coefficient \displaystyle \binom{y}{k} can be extended to work for all real numbers y. However, k must still be a nonnegative integer.

\displaystyle (4) \ \ \ \ \ \binom{y}{k}=\frac{y(y-1) \cdots (y-(k-1))}{k!}

For convenience, we let \displaystyle \binom{y}{0}=1. When the real number y>k-1, the binomial coefficient in (4) can be expressed as:

\displaystyle (5) \ \ \ \ \ \binom{y}{k}=\frac{\Gamma(y+1)}{\Gamma(k+1) \Gamma(y-k+1)}

where \Gamma(\cdot) is the gamma function.

With the more relaxed notion of binomial coefficient, the probability function in (2) above can be defined for all positive real numbers r. Thus the general version of the negative binomial distribution has two parameters r and p, both real numbers, such that r>0 and 0<p<1. The following is its probability function.

\displaystyle (6) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

Whenever r in (6) is a real number that is not a positive integer, the interpretation of counting the number of failures until the occurrence of the rth success no longer applies. Instead we can think of it simply as a count distribution.

The following alternative parametrization of the negative binomial distribution is also useful.

\displaystyle (6a) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} \biggl(\frac{\alpha}{\alpha+1}\biggr)^r \biggl(\frac{1}{\alpha+1}\biggr)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

The parameters in this alternative parametrization are r and \alpha>0. Clearly, the ratio \frac{\alpha}{\alpha+1} takes the place of p in (6). Unless stated otherwise, we use the parametrization of (6).
________________________________________________________________________

What is negative about the negative binomial distribution?
What is negative about this distribution? What is binomial about this distribution? The name is suggested by the fact that the binomial coefficient in (6) can be rearranged as follows:

\displaystyle \begin{aligned}(7) \ \ \ \ \ \binom{x+r-1}{x}&=\frac{(x+r-1)(x+r-2) \cdots r}{x!} \\&=(-1)^x \frac{(-r-(x-1))(-r-(x-2)) \cdots (-r)}{x!} \\&=(-1)^x \frac{(-r)(-r-1) \cdots (-r-(x-1))}{x!} \\&=(-1)^x \binom{-r}{x} \end{aligned}

The calculation in (7) can be used to verify that (6) is indeed a probability function, that is, all the probabilities sum to 1.

\displaystyle \begin{aligned}(8) \ \ \ \ \ 1&=p^r p^{-r}\\&=p^r (1-q)^{-r} \\&=p^r \sum \limits_{x=0}^\infty \binom{-r}{x} (-q)^x \ \ \ \ \ \ \ \ (8.1) \\&=p^r \sum \limits_{x=0}^\infty (-1)^x \binom{-r}{x} q^x \\&=\sum \limits_{x=0}^\infty \binom{x+r-1}{x} p^r q^x \end{aligned}

In (8), we take q=1-p. The step (8.1) above uses the following formula, known as Newton’s binomial formula, with t=-q and w=-r.

\displaystyle (9) \ \ \ \ \ (1+t)^w=\sum \limits_{k=0}^\infty \binom{w}{k} t^k

________________________________________________________________________

The Generating Function
By definition, the following is the generating function of the negative binomial distribution:

\displaystyle (10) \ \ \ \ \ g(z)=\sum \limits_{x=0}^\infty \binom{r+x-1}{x} p^r q^x z^x

where q=1-p. Using a similar calculation as in (8), the generating function can be simplified as:

\displaystyle (11) \ \ \ \ \ g(z)=p^r (1-q z)^{-r}=\frac{p^r}{(1-q z)^r}=\frac{p^r}{(1-(1-p) z)^r}; \ \ \ \ \ z<\frac{1}{1-p}

As a result, the moment generating function of the negative binomial distribution is:

\displaystyle (12) \ \ \ \ \ M(t)=\frac{p^r}{(1-(1-p) e^t)^r}; \ \ \ \ \ \ \ t<-ln(1-p)

________________________________________________________________________

Independent Sum

One useful property of the negative binomial distribution is that the independent sum of negative binomial random variables, all with the same parameter p, also has a negative binomial distribution. Let Y=Y_1+Y_2+\cdots+Y_n be an independent sum such that each Y_i has a negative binomial distribution with parameters r_i and p. Then the sum Y=Y_1+Y_2+\cdots+Y_n has a negative binomial distribution with parameters r=r_1+\cdots+r_n and p.

Note that the generating function of an independent sum is the product of the individual generating functions. The following shows that the product of the individual generating functions is of the same form as (11), thus proving the above assertion.

\displaystyle (13) \ \ \ \ \ h(z)=\frac{p^{\sum \limits_{i=1}^n r_i}}{(1-(1-p) z)^{\sum \limits_{i=1}^n r_i}}
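The claim can also be checked numerically without generating functions. The sketch below convolves two negative binomial probability functions that share the same p and compares the result with the probability function for r_1+r_2. The helper names (nb_pmf, convolve_pmf) and the parameter values are assumptions made for illustration.

```python
import math

def nb_pmf(x, r, p):
    """Negative binomial probability in the form (6), for real r > 0."""
    log_coef = math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(p) + x * math.log(1 - p))

def convolve_pmf(f, g, n_max):
    """Probabilities P(U+V=n), n=0..n_max, for independent U, V on 0,1,2,..."""
    return [sum(f[k] * g[n - k] for k in range(n + 1)) for n in range(n_max + 1)]

# Illustrative parameters (assumed): two different r values, one common p.
r1, r2, p = 1.5, 2.25, 0.3
n_max = 40

f = [nb_pmf(x, r1, p) for x in range(n_max + 1)]
g = [nb_pmf(x, r2, p) for x in range(n_max + 1)]

conv = convolve_pmf(f, g, n_max)
direct = [nb_pmf(x, r1 + r2, p) for x in range(n_max + 1)]

max_diff = max(abs(a - b) for a, b in zip(conv, direct))
print(max_diff)  # expected to be at the level of floating-point rounding
```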
________________________________________________________________________

Mean and Variance
The mean and variance can be obtained from the generating function. From E(X)=g'(1) and E(X^2)=g'(1)+g^{(2)}(1), we have:

\displaystyle (14) \ \ \ \ \ E(X)=\frac{r(1-p)}{p} \ \ \ \ \ \ \ \ \ \ \ \ \ Var(X)=\frac{r(1-p)}{p^2}

Note that Var(X)=\frac{1}{p} E(X)>E(X). Thus when the sample data suggest that the variance is greater than the mean, the negative binomial distribution is an excellent alternative to the Poisson distribution. For example, suppose that the sample mean and the sample variance are 3.6 and 7.1. In exploring the possibility of fitting the data using the negative binomial distribution, we would be interested in the negative binomial distribution with this mean and variance. Then plugging these into (14) produces the negative binomial distribution with r=3.7 and p=0.507.
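The matching of moments in this example can be written out as a short Python sketch. Solving (14) for the parameters gives p=E(X)/Var(X) and r=E(X) \ p/(1-p). The function name nb_from_moments is a hypothetical helper introduced here for illustration.

```python
def nb_from_moments(mean, var):
    """Solve (14) for (r, p) given a mean and a variance with var > mean."""
    if var <= mean:
        raise ValueError("the negative binomial requires variance > mean")
    p = mean / var               # from Var(X) = E(X) / p
    r = mean * p / (1 - p)       # from E(X) = r (1-p) / p
    return r, p

r, p = nb_from_moments(3.6, 7.1)
print(round(r, 1), round(p, 3))  # 3.7 0.507
```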
________________________________________________________________________

The Poisson-Gamma Mixture
One important application of the negative binomial distribution is that it is a mixture of a family of Poisson distributions with Gamma mixing weights. Thus the negative binomial distribution can be viewed as a generalization of the Poisson distribution. The negative binomial distribution can be viewed as a Poisson distribution where the Poisson parameter is itself a random variable, distributed according to a Gamma distribution. Thus the negative binomial distribution is known as a Poisson-Gamma mixture.

In an insurance application, the negative binomial distribution can be used as a model for claim frequency when the risks are not homogeneous. Let N have a Poisson distribution with parameter \theta, which can be interpreted as the number of claims in a fixed period of time from an insured in a large pool of insureds. There is uncertainty in the parameter \theta, reflecting the risk characteristic of the insured. Some insureds are poor risks (with large \theta) and some are good risks (with small \theta). Thus the parameter \theta should be regarded as a random variable \Theta. The following is the conditional distribution of N (conditional on \Theta=\theta):

\displaystyle (15) \ \ \ \ \ P(N=n \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots

Suppose that \Theta has a Gamma distribution with rate parameter \alpha (the reciprocal of the scale parameter) and shape parameter \beta. The following is the probability density function of \Theta.

\displaystyle (16) \ \ \ \ \ g(\theta)=\frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta} \ \ \ \ \ \ \ \ \ \ \theta>0

Then the joint density of N and \Theta is:

\displaystyle (17) \ \ \ \ \ P(N=n \lvert \Theta=\theta) \ g(\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta}

The unconditional distribution of N is obtained by integrating out \theta in (17).

\displaystyle \begin{aligned}(18) \ \ \ \ \ P(N=n)&=\int_0^\infty P(N=n \lvert \Theta=\theta) \ g(\theta) \ d \theta \\&=\int_0^\infty \frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta} \ d \theta \\&=\int_0^\infty \frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \int_0^\infty \frac{(\alpha+1)^{n+\beta}}{\Gamma(n+\beta)} \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \\&=\frac{\Gamma(n+\beta)}{\Gamma(n+1) \ \Gamma(\beta)} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \\&=\binom{n+\beta-1}{n} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots \end{aligned}

Note that the integral in the fourth step in (18) is 1.0 since the integrand is the pdf of a Gamma distribution. The above probability function is that of a negative binomial distribution. It is of the same form as (6a). Equivalently, it is also of the form (6) with parameter r=\beta and p=\frac{\alpha}{\alpha+1}.
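The result of (18) can be checked numerically. The following Python sketch approximates the integral in the first line of (18) by a midpoint rule and compares it with the closed form in the last line. The function names, the integration range, the number of steps, and the values \alpha=2 and \beta=3 are all assumptions chosen for illustration.

```python
import math

def mixture_prob(n, alpha, beta, steps=20000, upper=40.0):
    """Midpoint-rule approximation of the integral in (18)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        # log of the Poisson probability (15)
        log_pois = -theta + n * math.log(theta) - math.lgamma(n + 1)
        # log of the Gamma density (16), with rate alpha and shape beta
        log_dens = (beta * math.log(alpha) - math.lgamma(beta)
                    + (beta - 1) * math.log(theta) - alpha * theta)
        total += math.exp(log_pois + log_dens)
    return total * h

def nb_pmf(n, r, p):
    """Closed form: binom(n+r-1, n) * p^r * (1-p)^n."""
    log_coef = math.lgamma(n + r) - math.lgamma(n + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(p) + n * math.log(1 - p))

alpha, beta = 2.0, 3.0   # illustrative values
for n in range(5):
    print(n, mixture_prob(n, alpha, beta), nb_pmf(n, beta, alpha / (alpha + 1)))
```

The two printed columns should agree to several decimal places, in line with r=\beta and p=\frac{\alpha}{\alpha+1}.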

The variance of the negative binomial distribution is greater than the mean. In a Poisson distribution, the mean equals the variance. Thus the unconditional distribution of N is more dispersed than its conditional distributions. This is a characteristic of mixture distributions. The uncertainty in the parameter variable \Theta has the effect of increasing the unconditional variance of the mixture distribution of N. The variance of a mixture distribution has two components, the weighted average of the conditional variances and the variance of the conditional means. The second component represents the additional variance introduced by the uncertainty in the parameter \Theta (see The variance of a mixture).

________________________________________________________________________

The Poisson Distribution as Limit of Negative Binomial
There is another connection to the Poisson distribution, that is, the Poisson distribution is a limiting case of the negative binomial distribution. We show that the generating function of the Poisson distribution can be obtained by taking the limit of the negative binomial generating function as r \rightarrow \infty. Interestingly, the Poisson distribution is also the limit of the binomial distribution.

In this section, we use the negative binomial parametrization of (6a). By substituting \frac{\alpha}{\alpha+1} for p, we obtain the following mean, variance, and generating function for the probability function in (6a):

\displaystyle \begin{aligned}(19) \ \ \ \ \ \ &E(X)=\frac{r}{\alpha} \\&\text{ }\\&Var(X)=\frac{\alpha+1}{\alpha} \ \frac{r}{\alpha}=\frac{r(\alpha+1)}{\alpha^2} \\&\text{ } \\&g(z)=\frac{1}{[1-\frac{1}{\alpha}(z-1)]^r} \ \ \ \ \ \ \ z<\alpha+1 \end{aligned}

Let r go to infinity and \displaystyle \frac{1}{\alpha} go to zero while keeping their product constant. Thus \displaystyle \mu=\frac{r}{\alpha} is constant (this is the mean of the negative binomial distribution). We show the following:

\displaystyle (20) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}=e^{\mu (z-1)}

The right-hand side of (20) is the generating function of the Poisson distribution with mean \mu. The generating function in the left-hand side is that of a negative binomial distribution with mean \displaystyle \mu=\frac{r}{\alpha}. The following is the derivation of (20).

\displaystyle \begin{aligned}(21) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(ln[1-\frac{\mu}{r}(z-1)]^{-r}\biggr)} \\&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(-r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \\&=e^{\displaystyle \biggl(\lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \end{aligned}

We now focus on the limit in the exponent.

\displaystyle \begin{aligned}(22) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]&=\lim \limits_{r \rightarrow \infty} \frac{ln(1-\frac{\mu}{r} (z-1))^{-1}}{r^{-1}} \\&=\lim \limits_{r \rightarrow \infty} \frac{(1-\frac{\mu}{r} (z-1))^{-1} \ \mu (z-1) r^{-2}}{r^{-2}} \\&=\mu (z-1) \end{aligned}

The middle step in (22) uses L’Hopital’s Rule. The result in (20) is obtained by combining (21) and (22).
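The limit can also be observed directly on the probabilities. In the sketch below, the mean \mu is held fixed while r grows, with \alpha=r/\mu and p=\frac{\alpha}{\alpha+1}. The helper names (nb_pmf, poisson_pmf, max_gap) and the choice \mu=3 are assumptions made for illustration.

```python
import math

def nb_pmf(x, r, p):
    """Negative binomial probability in the form (6)."""
    log_coef = math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(p) + x * math.log(1 - p))

def poisson_pmf(x, mu):
    """Poisson probability with mean mu."""
    return math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1))

def max_gap(r, mu, x_max=30):
    """Largest pointwise gap between the two probability functions on 0..x_max."""
    alpha = r / mu                # keeps the mean r/alpha fixed at mu
    p = alpha / (alpha + 1)
    return max(abs(nb_pmf(x, r, p) - poisson_pmf(x, mu)) for x in range(x_max + 1))

mu = 3.0  # illustrative mean
for r in (1, 10, 100, 10000):
    print(r, max_gap(r, mu))  # the gap should shrink as r grows
```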

________________________________________________________________________

Reference

  1. Klugman S. A., Panjer H. H., Willmot G. E. Loss Models: From Data to Decisions, Second Edition, Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

Splitting a Poisson Distribution

We consider a remarkable property of the Poisson distribution that has a connection to the multinomial distribution. We start with the following examples.

Example 1
Suppose that the arrivals of customers in a gift shop at an airport follow a Poisson distribution with a mean of \alpha=5 per 10 minutes. Furthermore, suppose that each arrival can be classified into one of three distinct types – type 1 (no purchase), type 2 (purchase under $20), and type 3 (purchase over $20). Records show that about 25% of the customers are of type 1. The percentages of type 2 and type 3 are 60% and 15%, respectively. What is the probability distribution of the number of customers per hour of each type?

Example 2
Roll a fair die N times where N is random and follows a Poisson distribution with parameter \alpha. For each i=1,2,3,4,5,6, let N_i be the number of times the upside of the die is i. What is the probability distribution of each N_i? What is the joint distribution of N_1,N_2,N_3,N_4,N_5,N_6?

In Example 1, the stream of customers arrives according to a Poisson distribution. It can be shown that the stream of each type of customers also has a Poisson distribution. One way to view this example is that we can split the Poisson distribution into three Poisson distributions.

Example 2 also describes a splitting process, i.e. splitting a Poisson variable into 6 different Poisson variables. We can also view Example 2 as a multinomial distribution where the number of trials is not fixed but is random and follows a Poisson distribution. If the number of rolls of the die is fixed in Example 2 (say 10), then each N_i would have a binomial distribution. Yet, with the number of trials being Poisson, each N_i has a Poisson distribution with mean \displaystyle \frac{\alpha}{6}. In this post, we describe this Poisson splitting process in terms of a “random” multinomial distribution (the viewpoint of Example 2).

________________________________________________________________________

Suppose we have a multinomial experiment with parameters N, r, p_1, \cdots, p_r, where

  • N is the number of multinomial trials,
  • r is the number of distinct possible outcomes in each trial (type 1 through type r),
  • the p_i are the probabilities of the r possible outcomes in each trial.

Suppose that N follows a Poisson distribution with parameter \alpha. For each i=1, \cdots, r, let N_i be the number of occurrences of the i^{th} type of outcomes in the N trials. Then N_1,N_2,\cdots,N_r are mutually independent Poisson random variables with parameters \alpha p_1,\alpha p_2,\cdots,\alpha p_r, respectively.

For a fixed number of trials N, the variables N_1,N_2,\cdots,N_r have a multinomial distribution and their joint probability function is:

\displaystyle (1) \ \ \ \ P(N_1=n_1,N_2=n_2,\cdots,N_r=n_r)=\frac{N!}{n_1! n_2! \cdots n_r!} \ p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r}

where n_i are nonnegative integers such that N=n_1+n_2+\cdots+n_r.

Since the total number of multinomial trials N is not fixed and is random, (1) is not the end of the story. The following is the joint probability function of N_1,N_2,\cdots,N_r:

\displaystyle \begin{aligned}(2) \ \ \ \ P(N_1=n_1,N_2=n_2,\cdots,N_r=n_r)&=P(N_1=n_1,N_2=n_2,\cdots,N_r=n_r \lvert N=\sum \limits_{k=1}^r n_k) \\&\ \ \ \ \ \times P(N=\sum \limits_{k=1}^r n_k) \\&\text{ } \\&=\frac{(\sum \limits_{k=1}^r n_k)!}{n_1! \ n_2! \ \cdots \ n_r!} \ p_1^{n_1} \ p_2^{n_2} \ \cdots \ p_r^{n_r} \ \times \frac{e^{-\alpha} \alpha^{\sum \limits_{k=1}^r n_k}}{(\sum \limits_{k=1}^r n_k)!} \\&\text{ } \\&=\frac{e^{-\alpha p_1} \ (\alpha p_1)^{n_1}}{n_1!} \ \frac{e^{-\alpha p_2} \ (\alpha p_2)^{n_2}}{n_2!} \ \cdots \ \frac{e^{-\alpha p_r} \ (\alpha p_r)^{n_r}}{n_r!} \end{aligned}

The last step uses the fact that p_1+p_2+\cdots+p_r=1, so that e^{-\alpha}=e^{-\alpha p_1} \ e^{-\alpha p_2} \cdots e^{-\alpha p_r}, and the power of \alpha is distributed as \alpha^{n_1} \ \alpha^{n_2} \cdots \alpha^{n_r}.

To obtain the marginal probability function of N_j, j=1,2,\cdots,r, we sum out the other variables N_k=n_k (k \ne j) in (2) and obtain the following:

\displaystyle (3) \ \ \ \ P(N_j=n_j)=\frac{e^{-\alpha p_j} \ (\alpha p_j)^{n_j}}{n_j!}

Thus we can conclude that N_j, j=1,2,\cdots,r, has a Poisson distribution with parameter \alpha p_j. Furthermore, the joint probability function of N_1,N_2,\cdots,N_r is the product of the marginal probability functions. Thus we can conclude that N_1,N_2,\cdots,N_r are mutually independent.

________________________________________________________________________
Example 1
Let N_1,N_2,N_3 be the number of customers per hour of type 1, type 2, and type 3, respectively. Here, we attempt to split a Poisson distribution with mean 30 per hour (based on 5 per 10 minutes). Thus N_1,N_2,N_3 are mutually independent Poisson variables with means 30 \times 0.25=7.5, 30 \times 0.60=18, 30 \times 0.15=4.5, respectively.

Example 2
As indicated earlier, each N_i, i=1,2,3,4,5,6, has a Poisson distribution with mean \frac{\alpha}{6}. According to (2), the joint probability function of N_1,N_2,N_3,N_4,N_5,N_6 is simply the product of the six marginal Poisson probability functions.
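The marginal result (3) can be verified numerically by conditioning on the total count: given N=m, the count of one type is binomial with m trials. The Python sketch below does this for the type 1 customers in Example 1 (mean 30 per hour, probability 0.25). The function names and the truncation point m_max are assumptions made for illustration.

```python
import math

def poisson_pmf(j, alpha):
    """Poisson probability with parameter alpha."""
    return math.exp(-alpha + j * math.log(alpha) - math.lgamma(j + 1))

def thinned_marginal(k, alpha, p, m_max=300):
    """P(N_i = k): sum over the total N = m of P(N = m) * P(k of type i | m trials)."""
    return sum(
        poisson_pmf(m, alpha) * math.comb(m, k) * p ** k * (1 - p) ** (m - k)
        for m in range(k, m_max + 1)
    )

alpha, p1 = 30.0, 0.25   # Example 1: 30 customers per hour, 25% of type 1
for k in (0, 5, 7, 10):
    # The two values on each line should agree up to rounding error.
    print(k, thinned_marginal(k, alpha, p1), poisson_pmf(k, alpha * p1))
```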

The Poisson Distribution

Let \alpha be a positive constant. Consider the following probability distribution:

\displaystyle (1) \ \ \ \ \ P(X=j)=\frac{e^{-\alpha} \alpha^j}{j!} \ \ \ \ \ j=0,1,2,\cdots

The above distribution is said to be a Poisson distribution with parameter \alpha. The Poisson distribution is usually used to model the random number of events occurring in a fixed time interval. As will be shown below, E(X)=\alpha. Thus the parameter \alpha is the rate of occurrence of the random events; it indicates on average how many events occur per unit of time. Examples of random events that may be modeled by the Poisson distribution include the number of alpha particles emitted by a radioactive substance counted in a prescribed area during a fixed period of time, the number of auto accidents in a fixed period of time or the number of losses arising from a group of insureds during a policy period.

Each of the above examples can be thought of as a process that generates a number of arrivals or changes in a fixed period of time. If such a counting process leads to a Poisson distribution, then the process is said to be a Poisson process.

We now discuss some basic properties of the Poisson distribution. Using the Taylor series expansion of e^{\alpha}, the following shows that (1) is indeed a probability distribution.

\displaystyle . \ \ \ \ \ \ \ \sum \limits_{j=0}^\infty \frac{e^{-\alpha} \alpha^j}{j!}=e^{-\alpha} \sum \limits_{j=0}^\infty \frac{\alpha^j}{j!}=e^{-\alpha} e^{\alpha}=1

The generating function of the Poisson distribution is g(z)=e^{\alpha (z-1)} (see The generating function). The mean and variance can be calculated using the generating function.

\displaystyle \begin{aligned}(2) \ \ \ \ \ &E(X)=g'(1)=\alpha \\&\text{ } \\&E[X(X-1)]=g^{(2)}(1)=\alpha^2 \\&\text{ } \\&Var(X)=E[X(X-1)]+E(X)-E(X)^2=\alpha^2+\alpha-\alpha^2=\alpha \end{aligned}

The Poisson distribution can also be interpreted as an approximation to the binomial distribution. It is well known that the Poisson distribution is the limiting case of binomial distributions (see [1] or this post).

\displaystyle (3) \ \ \ \ \ \lim \limits_{n \rightarrow \infty} \binom{n}{j} \biggl(\frac{\alpha}{n}\biggr)^j \biggl(1-\frac{\alpha}{n}\biggr)^{n-j}=\frac{e^{-\alpha} \alpha^j}{j!}

One application of (3) is that we can use Poisson probabilities to approximate binomial probabilities. The approximation is reasonably good when the number of trials n in a binomial distribution is large and the probability of success p is small. The binomial mean is n p and the variance is n p (1-p). When p is small, 1-p is close to 1 and the binomial variance is approximately equal to the mean, i.e., n p (1-p) \approx n p. Whenever the mean of a discrete distribution is approximately equal to the variance, the Poisson approximation is quite good. As a rule of thumb, we can use Poisson to approximate binomial if n \ge 100 and p \le 0.01.

As an example, we use the Poisson distribution to estimate the probability that at most 1 person out of 1000 will have a birthday on New Year’s Day. Let n=1000 and p=365^{-1}. So we use the Poisson distribution with \alpha=1000 \times 365^{-1}. The following is an estimate using the Poisson distribution.

\displaystyle . \ \ \ \ \ \ \ P(X \le 1)=e^{-\alpha}+\alpha e^{-\alpha}=(1+\alpha) e^{-\alpha}=0.2415
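To see how close the estimate is, the following Python sketch computes the Poisson estimate alongside the exact binomial probability of at most 1 success in 1000 trials. The variable names are illustrative.

```python
import math

n, p = 1000, 1 / 365
alpha = n * p

# Poisson estimate of P(X <= 1)
poisson_est = (1 + alpha) * math.exp(-alpha)

# Exact binomial value: P(X = 0) + P(X = 1)
binomial_exact = (1 - p) ** n + n * p * (1 - p) ** (n - 1)

print(round(poisson_est, 4))      # 0.2415
print(binomial_exact)             # slightly smaller than the Poisson estimate
```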

Another useful property is that the independent sum of Poisson distributions also has a Poisson distribution. Specifically, if each X_i has a Poisson distribution with parameter \alpha_i, then the independent sum X=X_1+\cdots+X_n has a Poisson distribution with parameter \alpha=\alpha_1+\cdots+\alpha_n. One way to see this is that the product of Poisson generating functions has the same general form as g(z)=e^{\alpha (z-1)} (see The generating function). One interpretation of this property is that when merging several arrival processes, each of which follows a Poisson distribution, the result is still a Poisson distribution.

For example, suppose that at an airline ticket counter, the arrival of first class customers follows a Poisson distribution with a mean arrival rate of 8 per 15 minutes and the arrival of customers flying coach follows a Poisson distribution with a mean rate of 12 per 15 minutes. Then the arrival of customers of either type has a Poisson distribution with a mean rate of 20 per 15 minutes or 80 per hour.
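The merging property in the ticket counter example can be checked by convolving the two probability functions directly. This is a sketch under the assumption that the two arrival counts are independent; the helper name poisson_pmf and the cutoff n_max are illustrative.

```python
import math

def poisson_pmf(j, alpha):
    """Poisson probability with parameter alpha."""
    return math.exp(-alpha + j * math.log(alpha) - math.lgamma(j + 1))

n_max = 60
first_class = [poisson_pmf(j, 8.0) for j in range(n_max + 1)]
coach = [poisson_pmf(j, 12.0) for j in range(n_max + 1)]

# P(total = n) = sum over k of P(first class = k) * P(coach = n - k)
merged = [
    sum(first_class[k] * coach[n - k] for k in range(n + 1))
    for n in range(n_max + 1)
]
target = [poisson_pmf(j, 20.0) for j in range(n_max + 1)]

max_diff = max(abs(a - b) for a, b in zip(merged, target))
print(max_diff)  # expected to be at the level of floating-point rounding
```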

A Poisson distribution with a large mean can be thought of as an independent sum of Poisson distributions. For example, a Poisson distribution with a mean of 50 is the independent sum of 50 Poisson distributions each with mean 1. Because of the central limit theorem, when the mean is large, we can approximate the Poisson using the normal distribution.

In addition to merging several Poisson distributions into one combined Poisson distribution, we can also split a Poisson into several Poisson distributions. For example, suppose that a stream of customers arrives according to a Poisson distribution with parameter \alpha and each customer can be classified into one of two types (e.g. no purchase vs. purchase) with probabilities p_1 and p_2, respectively. Then the number of “no purchase” customers and the number of “purchase” customers are independent Poisson random variables with parameters \alpha p_1 and \alpha p_2, respectively. For more details on the splitting of Poisson, see Splitting a Poisson Distribution.

Reference

  1. Feller W. An Introduction to Probability Theory and Its Applications, Third Edition, John Wiley & Sons, New York, 1968

The generating function

Consider the function g(z)=\displaystyle e^{\alpha (z-1)} where \alpha is a positive constant. The following shows the derivatives of this function.

\displaystyle \begin{aligned}. \ \ \ \ \ \ &g(z)=e^{\alpha (z-1)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ g(0)=e^{-\alpha} \\&\text{ } \\&g'(z)=e^{\alpha (z-1)} \ \alpha \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ g'(0)=e^{-\alpha} \ \alpha \\&\text{ } \\&g^{(2)}(z)=e^{\alpha (z-1)} \ \alpha^2 \ \ \ \ \ \ \ \ \ \ \ \ \ \ g^{(2)}(0)=2! \ \frac{e^{-\alpha} \ \alpha^2}{2!} \\&\text{ } \\&g^{(3)}(z)=e^{\alpha (z-1)} \ \alpha^3 \ \ \ \ \ \ \ \ \ \ \ \ \ \ g^{(3)}(0)=3! \ \frac{e^{-\alpha} \ \alpha^3}{3!} \\&\text{ } \\&\ \ \ \ \ \ \ \ \cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \cdots \\&\text{ } \\&g^{(n)}(z)=e^{\alpha (z-1)} \ \alpha^n \ \ \ \ \ \ \ \ \ \ \ \ \ \ g^{(n)}(0)=n! \ \frac{e^{-\alpha} \ \alpha^n}{n!} \end{aligned}

Note that the derivative of g(z) at each order, evaluated at z=0, is a multiple of a Poisson probability. Thus the Poisson distribution is coded by the function g(z)=\displaystyle e^{\alpha (z-1)}. For this reason, such a function is called a generating function (or probability generating function). This post discusses some basic facts about the generating function (gf) and its cousin, the moment generating function (mgf). One important characteristic is that these functions generate probabilities and moments. Another important characteristic is that there is a one-to-one correspondence between a probability distribution and its generating function and moment generating function, i.e. two random variables with different cumulative distribution functions cannot have the same gf or mgf. In some situations, this fact is useful in working with independent sums of random variables.

————————————————————————————————————

The Generating Function
Suppose that X is a random variable that takes only nonnegative integer values with the probability function given by

\text{ }

(1) \ \ \ \ \ \ P(X=j)=a_j, \ \ \ \ j=0,1,2,\cdots

\text{ }

The idea of the generating function is that we use a power series to capture the entire probability distribution. The following defines the generating function that is associated with the above sequence a_j.

(2) \ \ \ \ \ \ g(z)=a_0+a_1 \ z+a_2 \ z^2+ \cdots=\sum \limits_{j=0}^\infty a_j \ z^j

\text{ }

Since the elements of the sequence a_j are probabilities, we can also call g(z) the generating function of the probability distribution defined by the sequence in (1). The generating function g(z) is defined wherever the power series converges. It is clear that at the minimum, the power series in (2) converges for \lvert z \rvert \le 1.

We discuss the following three properties of generating functions:

  1. The generating function completely determines the distribution.
  2. The moments of the distribution can be derived from the derivatives of the generating function.
  3. The generating function of a sum of independent random variables is the product of the individual generating functions.

The Poisson generating function at the beginning of the post is an example demonstrating property 1 (see Example 0 below for the derivation of the generating function). In some cases, the probability distribution of an independent sum can be deduced from the product of the individual generating functions. Some examples are given below.

————————————————————————————————————
Generating Probabilities
We now discuss property 1 indicated above. To see that g(z) generates the probabilities, let’s look at the derivatives of g(z):

\displaystyle \begin{aligned}(3) \ \ \ \ \ \ &g'(z)=a_1+2 a_2 \ z+3 a_3 \ z^2+\cdots=\sum \limits_{j=1}^\infty j a_j \ z^{j-1} \\&\text{ } \\&g^{(2)}(z)=2 a_2+6 a_3 \ z+ 12 a_4 \ z^2+\cdots=\sum \limits_{j=2}^\infty j (j-1) a_j \ z^{j-2} \\&\text{ } \\&g^{(3)}(z)=6 a_3+ 24 a_4 \ z+60 a_5 \ z^2+\cdots=\sum \limits_{j=3}^\infty j (j-1)(j-2) a_j \ z^{j-3} \\&\text{ } \\&\ \ \ \ \ \ \ \ \cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \cdots \\&\text{ } \\&g^{(n)}(z)=\sum \limits_{j=n}^\infty j(j-1) \cdots (j-n+1) a_j \ z^{j-n}=\sum \limits_{j=n}^\infty \binom{j}{n} n! \ a_j \ z^{j-n} \end{aligned}

\text{ }

By letting z=0 above, all the terms vanish except for the constant term. We have:

\text{ }

(4) \ \ \ \ \ \ g^{(n)}(0)=n! \ a_n=n! \ P(X=n)

\text{ }

Thus the generating function is a compact way of encoding the probability distribution. The probability distribution determines the generating function as seen in (2). On the other hand, (3) and (4) demonstrate that the generating function also determines the probability distribution.

————————————————————————————————————
Generating Moments
The generating function also determines the moments (property 2 indicated above). For example, we have:

\displaystyle \begin{aligned}(5) \ \ \ \ \ \ &g'(1)=0 \ a_0+a_1+2 a_2+3 a_3+\cdots=\sum \limits_{j=0}^\infty j a_j=E(X) \\&\text{ } \\&g^{(2)}(1)=0 a_0 + 0 a_1+2 a_2+6 a_3+ 12 a_4+\cdots=\sum \limits_{j=0}^\infty j (j-1) a_j=E[X(X-1)] \\&\text{ } \\&E(X)=g'(1) \\&\text{ } \\&E(X^2)=g'(1)+g^{(2)}(1) \end{aligned}

\text{ }

Note that g^{(n)}(1)=E[X(X-1) \cdots (X-(n-1))]. Thus the higher moment E(X^n) can be expressed in terms of g^{(n)}(1) and g^{(k)}(1) where k<n.
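As a small illustration of (5), the following Python sketch computes g'(1) and g^{(2)}(1) directly from a list of probabilities a_0,a_1,a_2,\cdots and then recovers the mean and variance. The function name factorial_moment and the choice of a fair die as the example distribution are assumptions made for this sketch.

```python
def factorial_moment(pmf, n):
    """g^{(n)}(1) = sum over j of j(j-1)...(j-n+1) * a_j, where pmf[j] = a_j."""
    total = 0.0
    for j, a_j in enumerate(pmf):
        falling = 1
        for i in range(n):
            falling *= (j - i)      # builds j(j-1)...(j-n+1)
        total += falling * a_j
    return total

# Illustrative distribution: the number shown on a fair die (a_0 = 0).
pmf = [0.0] + [1 / 6] * 6

mean = factorial_moment(pmf, 1)                                   # E(X) = g'(1)
second = factorial_moment(pmf, 1) + factorial_moment(pmf, 2)      # E(X^2)
variance = second - mean ** 2

print(mean, variance)
```

For the fair die, the mean is 3.5 and the variance is 35/12, up to floating-point rounding.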
————————————————————————————————————
More General Definitions
Note that the definition in (2) can also be interpreted as the mathematical expectation of z^X, i.e., g(z)=E(z^X). This provides a way to define the generating function for random variables that may take on values outside of the nonnegative integers. The following is a more general definition of the generating function of the random variable X, which is defined for all z where the expectation exists.

\text{ }

(6) \ \ \ \ \ \ g(z)=E(z^X)

\text{ }

————————————————————————————————————
The Generating Function of Independent Sum
Let X_1,X_2,\cdots,X_n be independent random variables with generating functions g_1,g_2,\cdots,g_n, respectively. Then the generating function of X_1+X_2+\cdots+X_n is given by the product g_1 \cdot g_2 \cdots g_n.

Let g(z) be the generating function of the independent sum X_1+X_2+\cdots+X_n. The following derives g(z). Note that the general form of generating function (6) is used.

\displaystyle \begin{aligned}(7) \ \ \ \ \ \ g(z)&=E(z^{X_1+\cdots+X_n}) \\&\text{ } \\&=E(z^{X_1} \cdots z^{X_n}) \\&\text{ } \\&=E(z^{X_1}) \cdots E(z^{X_n}) \\&\text{ } \\&=g_1(z) \cdots g_n(z) \end{aligned}

The probability distribution of a random variable is uniquely determined by its generating function. In particular, the generating function g(z) of the independent sum X_1+X_2+\cdots+X_n that is derived in (7) is unique. So if the generating function is of a particular distribution, we can deduce that the distribution of the sum must be of the same distribution. See the examples below.

————————————————————————————————————
Example 0
In this example, we derive the generating function of the Poisson distribution. Based on the definition, we have:

\displaystyle \begin{aligned}. \ \ \ \ \ \ g(z)&=\sum \limits_{j=0}^\infty \frac{e^{-\alpha} \alpha^j}{j!} \ z^j \\&\text{ } \\&=\sum \limits_{j=0}^\infty \frac{e^{-\alpha} (\alpha z)^j}{j!}  \\&\text{ } \\&=\frac{e^{-\alpha}}{e^{- \alpha z}} \sum \limits_{j=0}^\infty \frac{e^{-\alpha z} (\alpha z)^j}{j!} \\&\text{ } \\&=e^{\alpha (z-1)} \end{aligned}

\text{ }

Example 1
Suppose that X_1,X_2,\cdots,X_n are independent random variables where each X_i has a Bernoulli distribution with probability of success p. Let q=1-p. The following is the generating function for each X_i.

\text{ }

. \ \ \ \ \ \ g(z)=q+p z

\text{ }

Then the generating function of the sum X=X_1+\cdots+X_n is g(z)^n=(q+p z)^n. The following is the binomial expansion:

\text{ }

\displaystyle \begin{aligned}(8) \ \ \ \ \ \ g(z)^n&=(q+p z)^n \\&\text{ } \\&=\sum \limits_{j=0}^n \binom{n}{j} q^{n-j} \ p^j \ z^j  \end{aligned}

\text{ }

By definition (2), the generating function of X=X_1+\cdots+X_n is:

\text{ }

\text{ }

(9) \ \ \ \ \ \ g(z)^n=\sum \limits_{j=0}^\infty P(X=j) \ z^j

\text{ }

Comparing (8) and (9), we have

\displaystyle (10) \ \ \ \ \ \ P(X=j)=\left\{\begin{matrix}\displaystyle \binom{n}{j} p^j \ q^{n-j}&\ 0 \le j \le n\\{0}&\ j>n \end{matrix}\right.

The probability distribution indicated by (8) and (10) is that of a binomial distribution. Since the probability distribution of a random variable is uniquely determined by its generating function, the independent sum of Bernoulli distributions must have a binomial distribution.

\text{ }

Example 2
Suppose that X_1,X_2,\cdots,X_n are independent and have Poisson distributions with parameters \alpha_1,\alpha_2,\cdots,\alpha_n, respectively. Then the independent sum X=X_1+\cdots+X_n has a Poisson distribution with parameter \alpha=\alpha_1+\cdots+\alpha_n.

Let g(z) be the generating function of X=X_1+\cdots+X_n. For each i, the generating function of X_i is g_i(z)=e^{\alpha_i (z-1)}. The key to the proof is that the product of the g_i has the same general form as the individual g_i.

\displaystyle \begin{aligned}(11) \ \ \ \ \ \ g(z)&=g_1(z) \cdots g_n(z) \\&\text{ } \\&=e^{\alpha_1 (z-1)} \cdots e^{\alpha_n (z-1)} \\&\text{ } \\&=e^{(\alpha_1+\cdots+\alpha_n)(z-1)} \end{aligned}

The generating function in (11) is that of a Poisson distribution with mean \alpha=\alpha_1+\cdots+\alpha_n. Since the generating function uniquely determines the distribution, we can deduce that the sum X=X_1+\cdots+X_n has a Poisson distribution with parameter \alpha=\alpha_1+\cdots+\alpha_n.

\text{ }

Example 3
In rolling a fair die, let X be the number shown on the up face. The associated generating function is:

\displaystyle. \ \ \ \ \ \ g(z)=\frac{1}{6}(z+z^2+z^3+z^4+z^5+z^6)=\frac{z(1-z^6)}{6(1-z)}

The generating function can be further reduced as:

\displaystyle \begin{aligned}. \ \ \ \ \ \ g(z)&=\frac{z(1-z^6)}{6(1-z)} \\&\text{ } \\&=\frac{z(1-z^3)(1+z^3)}{6(1-z)} \\&\text{ } \\&=\frac{z(1-z)(1+z+z^2)(1+z^3)}{6(1-z)} \\&\text{ } \\&=\frac{z(1+z+z^2)(1+z^3)}{6}  \end{aligned}

Suppose that we roll the fair die 4 times. Let W be the sum of the 4 rolls. Then the generating function of W is

\displaystyle. \ \ \ \ \ \  g(z)^4=\frac{z^4 (1+z^3)^4 (1+z+z^2)^4}{6^4}

The random variable W ranges from 4 to 24. Thus the probability function ranges from P(W=4) to P(W=24). To find these probabilities, we simply need to decode the generating function g(z)^4. For example, to find P(W=12), we need to find the coefficient of the term z^{12} in the polynomial g(z)^4. To help this decoding, we can expand two of the polynomials in g(z)^4.

\displaystyle \begin{aligned}. \ \ \ \ \ \ g(z)^4&=\frac{z^4 (1+z^3)^4 (1+z+z^2)^4}{6^4} \\&\text{ } \\&=\frac{z^4 \times A \times B}{6^4} \\&\text{ } \\&A=(1+z^3)^4=1+4z^3+6z^6+4z^9+z^{12} \\&\text{ } \\&B=(1+z+z^2)^4=1+4z+10z^2+16z^3+19z^4+16z^5+10z^6+4z^7+z^8  \end{aligned}

Based on the above polynomials, there are three ways of forming z^{12}. They are: (z^4 \times 1 \times z^8), (z^4 \times 4z^3 \times 16z^5), (z^4 \times 6z^6 \times 10z^2). Thus we have:

\displaystyle. \ \ \ \ \ \  P(W=12)=\frac{1}{6^4}(1+4 \times 16+6 \times 10)=\frac{125}{6^4}

To find the other probabilities, we can follow the same decoding process.
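The decoding can also be carried out mechanically. The following Python sketch stores g(z) as a list of coefficients, multiplies it by itself four times, and reads off the coefficient of z^{12}. Exact fractions are used so that no rounding is involved; the helper name poly_mul is an assumption made for this sketch.

```python
from fractions import Fraction

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of z)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# g(z) = (z + z^2 + ... + z^6) / 6 for one roll of a fair die
die = [Fraction(0)] + [Fraction(1, 6)] * 6

# g(z)^4 is the generating function of the sum W of 4 independent rolls
g4 = [Fraction(1)]
for _ in range(4):
    g4 = poly_mul(g4, die)

print(g4[12])  # 125/1296
```

The same list gives every other probability, e.g. g4[4] for P(W=4) and g4[24] for P(W=24).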

————————————————————————————————————
Remark
The probability distribution of a random variable is uniquely determined by its generating function. This fundamental property is useful in determining the distribution of an independent sum. The generating function of the independent sum is simply the product of the individual generating functions. If the product is of a certain distributional form (as in Example 1 and Example 2), then we can deduce that the sum must be of the same distribution.

We can also decode the product of generating functions to obtain the probability function of the independent sum (as in Example 3). The method in Example 3 is quite tedious. But one advantage is that it is a “machine process”, a fairly foolproof process that can be performed mechanically.

The machine process is this: Code the individual probability distribution in a generating function g(z). Then raise it to n. After performing some manipulation to g(z)^n, decode the probabilities from g(z)^n.

As long as we can perform the algebraic manipulation carefully and correctly, this process will be sure to provide the probability distribution of an independent sum.

————————————————————————————————————
The Moment Generating Function
The moment generating function of a random variable X is M_X(t)=E(e^{tX}) on all real numbers t for which the expected value exists. The moments can be computed more directly using an mgf. From the theory of mathematical analysis, it can be shown that if M_X(t) exists on some interval -a<t<a, then the derivatives of M_X(t) of all orders exist at t=0. Furthermore, it can be shown that E(X^n)=M_X^{(n)}(0).

Suppose that g(z) is the generating function of a random variable. The following relates the generating function and the moment generating function.

\displaystyle \begin{aligned}&M_X(t)=g(e^t) \\&\text{ } \\&g(z)=M_X(\ln z)  \end{aligned}
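As a rough numerical illustration of this relationship, the sketch below uses a single fair die as a hypothetical example (the die and all function names are assumptions, not part of the discussion above). It builds the mgf as g(e^t) and approximates the first two derivatives at t=0 by central differences, which should be close to E(X)=7/2 and E(X^2)=91/6.

```python
import math

# Hypothetical example: one fair die, so g(z) = (z + z^2 + ... + z^6) / 6.
pmf = {k: 1 / 6 for k in range(1, 7)}

def g(z):
    """Generating function: sum of P(X = k) * z^k."""
    return sum(p * z**k for k, p in pmf.items())

def mgf(t):
    """Moment generating function obtained from g via M_X(t) = g(e^t)."""
    return g(math.exp(t))

def mgf_direct(t):
    """Moment generating function computed from the definition E(e^{tX})."""
    return sum(p * math.exp(t * k) for k, p in pmf.items())

# Central differences approximate M'(0) = E(X) and M''(0) = E(X^2).
h = 1e-4
first_moment = (mgf(h) - mgf(-h)) / (2 * h)
second_moment = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2

print(first_moment, second_moment)
```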

————————————————————————————————————

Reference

  1. Feller W. An Introduction to Probability Theory and Its Applications, Third Edition, John Wiley & Sons, New York, 1968

The variance of a mixture

Suppose X is a mixture distribution that is the result of mixing a family of conditional distributions indexed by a parameter random variable \Theta. The uncertainty in the parameter variable \Theta has the effect of increasing the unconditional variance of the mixture X. Thus, Var(X) is not simply the weighted average of the conditional variance Var(X \lvert \Theta). The unconditional variance Var(X) is the sum of two components. They are:

\displaystyle Var(X)=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)]

The above relationship is called the law of total variance, which is the proper way of computing the unconditional variance Var(X). The first component E[Var(X \lvert \Theta)] is called the expected value of conditional variances, which is the weighted average of the conditional variances. The second component Var[E(X \lvert \Theta)] is called the variance of the conditional means, which represents the additional variance as a result of the uncertainty in the parameter \Theta.

We use an example of a two-point mixture to illustrate the law of total variance. The example is followed by a proof of the law of total variance.

Example
Let U be the uniform distribution on the unit interval (0, 1). Suppose that a large population of insureds is composed of “high risk” and “low risk” individuals. The proportion of insureds classified as “low risk” is p where 0<p<1. The random loss amount X of a “low risk” insured is U. The random loss amount X of a “high risk” insured is U shifted by a positive constant w>0, i.e. w+U. What is the variance of the loss amount of an insured randomly selected from this population?

For convenience, we use \Theta as a parameter to indicate the risk class (\Theta=1 is “low risk” and \Theta=2 is “high risk”). The following shows the relevant conditional distributional quantities of X.

\displaystyle E(X \lvert \Theta=1)=\frac{1}{2} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ E(X \lvert \Theta=2)=w+\frac{1}{2}

\displaystyle Var(X \lvert \Theta=1)=\frac{1}{12} \ \ \ \ \ \ \ \ \ \ \ \ Var(X \lvert \Theta=2)=\frac{1}{12}

The unconditional mean loss is the weighted average of the conditional mean loss amounts. However, the same idea does not work for variance.

\displaystyle \begin{aligned}E[X]&=p \times E(X \lvert \Theta=1)+(1-p) \times E(X \lvert \Theta=2) \\&=p \times \frac{1}{2}+(1-p) \times (w+\frac{1}{2}) \\&=\frac{1}{2}+(1-p) \ w \end{aligned}

\displaystyle \begin{aligned}Var[X]&\ne p \times Var(X \lvert \Theta=1)+(1-p) \times Var(X \lvert \Theta=2)=\frac{1}{12}  \end{aligned}

The conditional variance is the same for both risk classes since the “high risk” loss is a shifted distribution of the “low risk” loss. However, the unconditional variance is more than \frac{1}{12} since the mean losses for the two classes are different (heterogeneous risks across the classes). The uncertainty in the risk classes (i.e. uncertainty in the parameter \Theta) introduces additional variance in the loss for a randomly selected insured. The unconditional variance Var(X) is the sum of the following two components:

\displaystyle \begin{aligned}E[Var(X \lvert \Theta)]&=p \times Var(X \lvert \Theta=1)+(1-p) \times Var(X \lvert \Theta=2) \\&=p \times \frac{1}{12}+(1-p) \times \frac{1}{12} \\&=\frac{1}{12} \end{aligned}

\displaystyle \begin{aligned}Var[E(X \lvert \Theta)]&=p \times E(X \lvert \Theta=1)^2+(1-p) \times E(X \lvert \Theta=2)^2 \\&\ \ \ -\biggl(p \times E(X \lvert \Theta=1)+(1-p) \times E(X \lvert \Theta=2)\biggr)^2 \\&=p \times \biggl(\frac{1}{2}\biggr)^2+(1-p) \times \biggl(w+\frac{1}{2}\biggr)^2 \\&\ \ \ -\biggl(p \times \frac{1}{2}+(1-p) \times (w+\frac{1}{2})\biggr)^2 \\&=p \ (1-p) \ w^2 \end{aligned}

\displaystyle \begin{aligned}Var(X)&=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)] \\&=\frac{1}{12}+p \ (1-p) \ w^2  \end{aligned}

The additional variance is in the amount of p(1-p)w^2. This is the variance of the conditional means of the risk classes. Note that w is the additional mean loss for a “high risk” insured. The higher the additional mean loss w, the more heterogeneous in risk between the two classes, hence the larger the dispersion in unconditional loss.
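The two-point calculation can be checked with exact arithmetic. The sketch below picks the hypothetical values p=3/4 and w=2 (these numbers are assumptions chosen only for illustration) and compares the law of total variance with a direct computation of E(X^2)-E(X)^2, using E(U^2)=1/3 and E[(w+U)^2]=w^2+w+1/3.

```python
from fractions import Fraction

p = Fraction(3, 4)  # hypothetical proportion of low-risk insureds
w = Fraction(2)     # hypothetical shift for high-risk insureds

weights = [p, 1 - p]
cond_mean = [Fraction(1, 2), w + Fraction(1, 2)]
cond_var = [Fraction(1, 12), Fraction(1, 12)]

# Unconditional mean: weighted average of the conditional means.
mean = sum(wt * m for wt, m in zip(weights, cond_mean))

# Law of total variance: E[Var(X | Theta)] + Var[E(X | Theta)].
expected_cond_var = sum(wt * v for wt, v in zip(weights, cond_var))
var_cond_mean = sum(wt * m**2 for wt, m in zip(weights, cond_mean)) - mean**2
total_var = expected_cond_var + var_cond_mean

# Direct route: E(X^2) from E(U^2) = 1/3 and E[(w + U)^2] = w^2 + w + 1/3.
second_moment = p * Fraction(1, 3) + (1 - p) * (w**2 + w + Fraction(1, 3))
direct_var = second_moment - mean**2

print(total_var, direct_var, p * (1 - p) * w**2)
```

With these values the variance of the conditional means is 3/4 and both routes give 5/6 for Var(X).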

The law of total variance gives the unconditional variance of a random variable X whose distribution is indexed by another random variable \Theta. The unconditional variance of X is the sum of two components, namely, the expected value of conditional variances and the variance of the conditional means. The formula is:

\displaystyle Var(X)=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)]

The following is the derivation of the formula:

\displaystyle \begin{aligned}Var[X]&=E[X^2]-E[X]^2 \\&=E[E(X^2 \lvert \Theta)]-E[E(X \lvert \Theta)]^2 \\&=E\left\{Var(X \lvert \Theta)+E(X \lvert \Theta)^2 \right\}-E[E(X \lvert \Theta)]^2 \\&=E[Var(X \lvert \Theta)]+E[E(X \lvert \Theta)^2]-E[E(X \lvert \Theta)]^2 \\&=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)] \end{aligned}

Additional Practice
See this blog post for practice problems on mixture distributions.

Reference

  1. Klugman S. A., Panjer H. H., Willmot G. E. Loss Models: From Data to Decisions, Second Edition, Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

An example of a mixture

We use an example to motivate the definition of a mixture distribution.

Example 1

Suppose that the loss arising from an insured randomly selected from a large group of insureds follows an exponential distribution with probability density function (pdf) f_X(x)=\theta e^{-\theta x}, x>0, where \theta is a parameter that is a positive constant. The mean claim cost for this randomly selected insured is \frac{1}{\theta}. So the parameter \theta reflects the risk characteristics of the insured. Since the population of insureds is large, there is uncertainty in the parameter \theta. It is more appropriate to regard \theta as a random variable in order to capture the wide range of risk characteristics across the individuals in the population. As a result, the pdf indicated above is not an unconditional pdf, but, rather, a conditional pdf of X. The pdf below is conditional on a realized value \theta of the random variable \Theta.

    \displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\theta e^{-\theta x}, \ \ \ \ \ x>0

What about the marginal (unconditional) pdf of X? Let’s assume that the pdf of \Theta is given by \displaystyle f_\Theta(\theta)=\frac{1}{2} \ \theta^2 \ e^{-\theta}. Then the unconditional pdf of X is the weighted average of the conditional pdf.

    \displaystyle \begin{aligned}f_X(x)&=\int_0^{\infty} f_{X \lvert \Theta}(x \lvert \theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\theta \ e^{-\theta x}\biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{1}{2} \ \theta^3 \ e^{-\theta(x+1)} \ d \theta \\&=\frac{1}{2} \frac{6}{(x+1)^4} \int_0^{\infty} \frac{(x+1)^4}{3!} \ \theta^{4-1} \ e^{-\theta(x+1)} \ d \theta \\&=\frac{3}{(x+1)^4} \end{aligned}
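The closed form can be spot-checked numerically. The sketch below approximates the mixing integral with a plain midpoint rule; the truncation point 60 and the number of steps are arbitrary choices made for this illustration, and the function names are hypothetical.

```python
import math

def integrate(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def cond_pdf(x, theta):
    """Conditional pdf of X given Theta = theta (exponential with rate theta)."""
    return theta * math.exp(-theta * x)

def mixing_pdf(theta):
    """Pdf of Theta as assumed in Example 1."""
    return 0.5 * theta**2 * math.exp(-theta)

def mixture_pdf(x, upper=60.0):
    """Unconditional pdf of X: the conditional pdf averaged over Theta.

    The infinite upper limit is truncated at `upper`; the neglected tail
    is negligible because the integrand decays exponentially.
    """
    return integrate(lambda t: cond_pdf(x, t) * mixing_pdf(t), 0.0, upper)

for x in (0.5, 1.0, 3.0):
    # Numerical value next to the closed form 3 / (x + 1)^4.
    print(x, mixture_pdf(x), 3 / (x + 1) ** 4)
```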

Several other distributional quantities are also weighted averages, including the unconditional mean and the unconditional second moment.

    \displaystyle \begin{aligned}E(X)&=\int_0^{\infty} E(X \lvert \Theta=\theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\frac{1}{\theta} \biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{1}{2} \ \theta \ e^{-\theta} \ d \theta \\&=\frac{1}{2} \end{aligned}

    \displaystyle \begin{aligned}E(X^2)&=\int_0^{\infty} E(X^2 \lvert \Theta=\theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\frac{2}{\theta^2} \biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} e^{-\theta} \ d \theta \\&=1 \end{aligned}

As a result, the unconditional variance is Var(X)=1-\frac{1}{4}=\frac{3}{4}. Note that the unconditional variance is not the weighted average of the conditional variance. The weighted average of the conditional variance only produces \frac{1}{2}.

\displaystyle \begin{aligned}E[Var(X \lvert \Theta)]&=\int_0^{\infty} Var(X \lvert \Theta=\theta) \ f_\Theta(\theta) \ d \theta \\&=\int_0^{\infty} \biggl[\frac{1}{\theta^2} \biggr] \ \biggl[\frac{1}{2} \ \theta^2 \ e^{-\theta}\biggr] \ d \theta \\&=\int_0^{\infty} \frac{1}{2} \ e^{-\theta} \ d \theta \\&=\frac{1}{2} \end{aligned}

It turns out that the unconditional variance has two components, the expected value of the conditional variances and the variance of the conditional means. In this example, the former is \frac{1}{2} and the latter is \frac{1}{4}. The additional variance in the amount of \frac{1}{4} is a reflection that there is uncertainty in the parameter \theta.

\displaystyle \begin{aligned}Var(X)&=E[Var(X \lvert \Theta)]+Var[E(X \lvert \Theta)] \\&=\frac{1}{2}+\frac{1}{4}\\&=\frac{3}{4}  \end{aligned}
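The two components can also be computed numerically from the conditional mean \frac{1}{\theta} and the conditional variance \frac{1}{\theta^2} of the exponential distribution. The following is a minimal sketch using a midpoint rule; the truncation point and step count are arbitrary choices, and the function names are hypothetical.

```python
import math

def integrate(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def mixing_pdf(theta):
    """Pdf of Theta as assumed in Example 1."""
    return 0.5 * theta**2 * math.exp(-theta)

UPPER = 60.0  # truncation point standing in for infinity (arbitrary choice)

# E[E(X | Theta)]: average of the conditional means 1/theta.
mean = integrate(lambda t: (1 / t) * mixing_pdf(t), 0.0, UPPER)

# E[Var(X | Theta)]: average of the conditional variances 1/theta^2.
expected_cond_var = integrate(lambda t: (1 / t**2) * mixing_pdf(t), 0.0, UPPER)

# Var[E(X | Theta)] = E[(1/Theta)^2] - (E[1/Theta])^2.
var_cond_mean = integrate(lambda t: (1 / t) ** 2 * mixing_pdf(t), 0.0, UPPER) - mean**2

total_var = expected_cond_var + var_cond_mean
print(expected_cond_var, var_cond_mean, total_var)
```

The three printed values should be close to 1/2, 1/4 and 3/4 respectively.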

——————————————————————————————————————-

The Definition of Mixture
The unconditional pdf f_X(x) derived in Example 1 is that of a Pareto distribution. Thus the Pareto distribution is a continuous mixture of exponential distributions with Gamma mixing weights.

Mathematically speaking, a mixture arises when a probability density function f(x \lvert \theta) depends on a parameter \theta that is uncertain and is itself a random variable with density g(\theta). Then taking the weighted average of f(x \lvert \theta) with g(\theta) as weight produces the mixture distribution.

A continuous random variable X is said to be a mixture if its probability density function f_X(x) is a weighted average of a family of probability density functions f(x \lvert \theta). The random variable \Theta is said to be the mixing random variable and its pdf g(\theta) is said to be the mixing weight. An equivalent definition of mixture is that the distribution function F_X(x) is a weighted average of a family of distribution functions indexed by a mixing variable. Thus X is a mixture if one of the following holds.

\displaystyle f_X(x)=\int_{-\infty}^{\infty} f(x \lvert \theta) \ g(\theta) \ d \theta

\displaystyle F_X(x)=\int_{-\infty}^{\infty} F(x \lvert \theta) \ g(\theta) \ d \theta

Similarly, a discrete random variable is a mixture if its probability function (or distribution function) is a weighted sum of a family of probability functions (or distribution functions). Thus X is a mixture if one of the following holds.

\displaystyle P(X=x)=\sum \limits_{y} P(X=x \lvert Y=y) \ P(Y=y)

\displaystyle P(X \le x)=\sum \limits_{y} P(X \le x \lvert Y=y) \ P(Y=y)
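To make the discrete definition concrete, the sketch below mixes two binomial distributions. Everything in it is a hypothetical illustration: the mixing variable Y takes the values 1/5 and 7/10 with probabilities 3/5 and 2/5, and given Y=y the variable X is binomial with 3 trials and success probability y.

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, q, x):
    """P(X = x) for a binomial distribution with n trials and success probability q."""
    return comb(n, x) * q**x * (1 - q) ** (n - x)

def binom_cdf(n, q, x):
    """P(X <= x) for the same binomial distribution."""
    return sum(binom_pmf(n, q, k) for k in range(x + 1))

# Hypothetical mixing distribution: keys are values y of Y, values are P(Y = y).
mixing = {Fraction(1, 5): Fraction(3, 5), Fraction(7, 10): Fraction(2, 5)}
n_trials = 3

# P(X = x) = sum over y of P(X = x | Y = y) P(Y = y).
mix_pmf = {
    x: sum(binom_pmf(n_trials, y, x) * wt for y, wt in mixing.items())
    for x in range(n_trials + 1)
}

# P(X <= x) = sum over y of P(X <= x | Y = y) P(Y = y).
mix_cdf = {
    x: sum(binom_cdf(n_trials, y, x) * wt for y, wt in mixing.items())
    for x in range(n_trials + 1)
}

print(mix_pmf)
print(mix_cdf)
```

Summing the mixed probability function cumulatively gives the same values as mixing the distribution functions, which is why the two definitions are equivalent.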

Additional Practice
See this blog post for practice problems on mixture distributions.

Reference

  1. Klugman S. A., Panjer H. H., Willmot G. E. Loss Models: From Data to Decisions, Second Edition, Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

A basic look at joint distributions

This is a discussion of how to work with joint distributions of two random variables. We limit the discussion to continuous random variables. The discussion of the discrete case is similar (for the most part replacing the integral signs with summation signs). Suppose X and Y are continuous random variables where f_{X,Y}(x,y) is the joint probability density function. What this means is that f_{X,Y}(x,y) satisfies the following two properties:

  • for each point (x,y) in the Euclidean plane, f_{X,Y}(x,y) is a nonnegative real number,
  • \displaystyle \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dx \ dy=1.

Because of the second bullet point, the function f_{X,Y}(x,y) must be an integrable function. We will not overly focus on this point and instead be satisfied with knowing that it is possible to integrate f_{X,Y}(x,y) over the entire xy plane and its many reasonable subregions.

Another way to think about f_{X,Y}(x,y) is that it assigns the density to each point in the xy plane (i.e. it tells us how much weight is assigned to each point). Consequently, if we want to know the probability that (X,Y) falls in the region A, we simply evaluate the following integral:

    \displaystyle \int_{A} f_{X,Y}(x,y) \ dx \ dy.

For instance, to find P(X<Y) and P(X+Y \le z), where z>0, we evaluate the integral over the regions x<y and x+y \le z, respectively. The integrals are:

    \displaystyle P(X<Y)=\int_{-\infty}^{\infty} \int_{x}^{\infty} f_{X,Y}(x,y) \ dy \ dx

    \displaystyle P(X+Y \le z)=\int_{-\infty}^{\infty} \int_{-\infty}^{z-x} f_{X,Y}(x,y) \ dy \ dx

Note that P(X+Y \le z) is the distribution function F_Z(z)=P(X+Y \le z) where Z=X+Y. Then the pdf of Z is obtained by differentiation, i.e. f_Z(z)=F_Z^{'}(z).

In practice, all integrals involving the density functions need be taken only over those x and y values where the density is positive.
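As an illustration of integrating a joint density over the regions x<y and x+y \le z, the sketch below assumes that X and Y are independent exponential random variables with mean 1, so that f_{X,Y}(x,y)=e^{-x-y} for x>0 and y>0. This joint density is an assumption made only for this example. Under it, P(X<Y)=\frac{1}{2} and P(X+Y \le z)=1-e^{-z}-z e^{-z}. The integrals are taken only over the positive quadrant, and the infinite limits are truncated at an arbitrary point.

```python
import math

def integrate(f, a, b, n=400):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def joint_pdf(x, y):
    """Assumed joint density: independent exponentials with mean 1 (x > 0, y > 0)."""
    return math.exp(-x - y)

UPPER = 20.0  # truncation point standing in for infinity (arbitrary choice)

def prob_x_less_than_y():
    """P(X < Y): for each x > 0, integrate y from x upward."""
    return integrate(
        lambda x: integrate(lambda y: joint_pdf(x, y), x, UPPER), 0.0, UPPER
    )

def prob_sum_at_most(z):
    """P(X + Y <= z): for each x in (0, z), integrate y from 0 to z - x."""
    return integrate(
        lambda x: integrate(lambda y: joint_pdf(x, y), 0.0, z - x), 0.0, z
    )

print(prob_x_less_than_y())
print(prob_sum_at_most(2.0), 1 - math.exp(-2.0) - 2.0 * math.exp(-2.0))
```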

——————————————————————————————————————–

Marginal Density

The joint density function f_{X,Y}(x,y) describes how the two variables behave in relation to one another. The marginal probability density function (marginal pdf) is of interest if we are concerned with only one of the variables. To obtain the marginal pdf of X, we simply integrate out the other variable from f_{X,Y}(x,y). The following integral produces the marginal pdf of X:

    \displaystyle f_X(x)=\int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dy

The marginal pdf of X is obtained by summing all the density along the vertical line that meets the x axis at the point (x,0) (see Figure 1). Thus f_X(x) represents the sum total of all density f_{X,Y}(x,y) along a vertical line.

Obviously, if we find the marginal pdf for each vertical line and sum all the marginal pdfs, the result will be 1.0. Thus f_X(x) can be regarded as a single-variable pdf.

    \displaystyle \begin{aligned}\int_{-\infty}^{\infty}f_X(x) \ dx&=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dy \ dx=1 \\&\text{ } \end{aligned}

The same can be said for the marginal pdf of the other variable Y, except that f_Y(y) is the sum (integral in this case) of all the density on a horizontal line that meets the y axis at the point (0,y).

    \displaystyle f_Y(y)=\int_{-\infty}^{\infty} f_{X,Y}(x,y) \ dx

——————————————————————————————————————–

Example 1

Let X and Y be jointly distributed according to the following pdf:

    \displaystyle f_{X,Y}(x,y)=y^2 \ e^{-y(x+1)}, \text{ where } x>0,y>0

The following derives the marginal pdfs for X and Y:

    \displaystyle \begin{aligned}f_X(x)&=\int_0^{\infty} y^2 \ e^{-y(x+1)} \ dy \\&\text{ } \\&=\frac{2}{(x+1)^3} \int_0^{\infty} \frac{(x+1)^3}{2!} y^{3-1} \ e^{-y(x+1)} \ dy \\&\text{ } \\&=\frac{2}{(x+1)^3} \end{aligned}

    \displaystyle \begin{aligned}f_Y(y)&=\int_0^{\infty} y^2 \ e^{-y(x+1)} \ dx \\&\text{ } \\&=y \ e^{-y} \int_0^{\infty} y \ e^{-y x} \ dx \\&\text{ } \\&=y \ e^{-y} \end{aligned}

In the middle step of the derivation of f_X(x), the integrand is the Gamma pdf with parameters x+1 and 3, hence the integral in that step becomes 1. In the middle step for f_Y(y), the integrand is the pdf of an exponential distribution.
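The two marginal pdfs can be spot-checked by integrating the joint pdf numerically along a vertical line and along a horizontal line. The following is a minimal sketch using a midpoint rule; the truncation point, step count, and function names are arbitrary choices for this illustration.

```python
import math

def integrate(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def joint_pdf(x, y):
    """Joint pdf of Example 1, positive for x > 0 and y > 0."""
    return y**2 * math.exp(-y * (x + 1))

def marginal_x(x, upper=60.0):
    """f_X(x): integrate the joint pdf over y along the vertical line at x."""
    return integrate(lambda y: joint_pdf(x, y), 0.0, upper)

def marginal_y(y, upper=60.0):
    """f_Y(y): integrate the joint pdf over x along the horizontal line at y.

    The truncation at `upper` is only adequate when y is not close to 0,
    since the integrand decays like exp(-y * x).
    """
    return integrate(lambda x: joint_pdf(x, y), 0.0, upper)

# Numerical values next to the closed forms 2 / (x + 1)^3 and y * exp(-y).
print(marginal_x(1.0), 2 / (1.0 + 1) ** 3)
print(marginal_y(2.0), 2.0 * math.exp(-2.0))
```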

——————————————————————————————————————–

Conditional Density

Now consider the joint density f_{X,Y}(x,y) restricted to a vertical line, treating the vertical line as a probability distribution. In essence, we are restricting our focus to one particular realized value of X. Given a realized value x of X, how do we describe the behavior of the other variable Y? Since the marginal pdf f_X(x) is the sum total of all density on a vertical line, we express the conditional density as the joint density f_{X,Y}(x,y) taken as a fraction of f_X(x).

    \displaystyle f_{Y \lvert X}(y \lvert x)=\frac{f_{X,Y}(x,y)}{f_X(x)}

It is easy to see that f_{Y \lvert X}(y \lvert x) is a probability density function of Y. When we already know that X has a realized value, this pdf tells us information about how Y behaves. Thus this pdf is called the conditional pdf of Y given X=x.

Given a realized value x of X, we may want to know the conditional mean and the higher moments of Y.

    \displaystyle E(Y \lvert X=x)=\int_{-\infty}^{\infty} y \ f_{Y \lvert X}(y \lvert x) \ dy

    \displaystyle E(Y^n \lvert X=x)=\int_{-\infty}^{\infty} y^n \ f_{Y \lvert X}(y \lvert x) \ dy \text{ where } n>1

In particular, the conditional variance of Y is:

    \displaystyle Var(Y \lvert X=x)=E(Y^2 \lvert X=x)-E(Y \lvert X=x)^2

The discussion for the conditional density of X given a realized value y of Y is similar, except that we restrict the joint density f_{X,Y}(x,y) on a horizontal line. We have the following information about the conditional distribution of X given a realized value Y=y.

    \displaystyle f_{X \lvert Y}(x \lvert y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}

    \displaystyle E(X \lvert Y=y)=\int_{-\infty}^{\infty} x \ f_{X \lvert Y}(x \lvert y) \ dx

    \displaystyle E(X^n \lvert Y=y)=\int_{-\infty}^{\infty} x^n \ f_{X \lvert Y}(x \lvert y) \ dx \text{ where } n>1

In particular, the conditional variance of X is:

    \displaystyle Var(X \lvert Y=y)=E(X^2 \lvert Y=y)-E(X \lvert Y=y)^2

——————————————————————————————————————–

Example 1 (Continued)

The following derives the conditional density functions:

    \displaystyle \begin{aligned}f_{Y \lvert X}(y \lvert x)&=\frac{f_{X,Y}(x,y)}{f_X(x)} \\&\text{ } \\&=\displaystyle \frac{y^2 e^{-y(x+1)}}{\frac{2}{(x+1)^3}}  \\&\text{ } \\&=\frac{(x+1)^3}{2!} \ y^2 \ e^{-y(x+1)} \end{aligned}

    \displaystyle \begin{aligned}f_{X \lvert Y}(x \lvert y)&=\frac{f_{X,Y}(x,y)}{f_Y(y)} \\&\text{ } \\&=\displaystyle \frac{y^2 e^{-y(x+1)}}{y \ e^{-y}}  \\&\text{ } \\&=y \ e^{-y \ x} \end{aligned}

The conditional density f_{Y \lvert X}(y \lvert x) is that of a Gamma distribution with parameters x+1 and 3. So given a realized value x of X, Y has a Gamma distribution whose rate parameter is x+1 (i.e. the scale parameter is \frac{1}{x+1}) and whose shape parameter is 3. On the other hand, the conditional density f_{X \lvert Y}(x \lvert y) is that of an exponential distribution. Given a realized value y of Y, X has an exponential distribution with rate parameter y. Since the conditional distributions are familiar parametric distributions, we have the following conditional means and conditional variances.

    \displaystyle E(Y \lvert X=x)=\frac{3}{x+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ Var(Y \lvert X=x)=\frac{3}{(x+1)^2}

    \displaystyle E(X \lvert Y=y)=\frac{1}{y} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Var(X \lvert Y=y)=\frac{1}{y^2}

Note that both conditional means are decreasing functions. The larger the realized value of X, the smaller the mean E(Y \lvert X=x). Likewise, the larger the realized value of Y, the smaller the mean E(X \lvert Y=y). It appears that X and Y move in opposite directions. This is also confirmed by the fact that Cov(X,Y)=-1.
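The value of the covariance can be recovered from quantities already derived, by conditioning on Y: E(X)=E[E(X \lvert Y)], E(XY)=E[Y \ E(X \lvert Y)], with E(X \lvert Y=y)=\frac{1}{y} and f_Y(y)=y e^{-y}. The sketch below evaluates these expectations with a midpoint rule; the truncation point, step count, and names are arbitrary choices for this illustration.

```python
import math

def integrate(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def pdf_y(y):
    """Marginal pdf of Y from Example 1."""
    return y * math.exp(-y)

def cond_mean_x(y):
    """E(X | Y = y) for the exponential conditional distribution."""
    return 1 / y

UPPER = 60.0  # truncation point standing in for infinity (arbitrary choice)

# E(Y) computed directly from the marginal pdf of Y.
mean_y = integrate(lambda y: y * pdf_y(y), 0.0, UPPER)

# E(X) = E[E(X | Y)] and E(XY) = E[Y * E(X | Y)].
mean_x = integrate(lambda y: cond_mean_x(y) * pdf_y(y), 0.0, UPPER)
mean_xy = integrate(lambda y: y * cond_mean_x(y) * pdf_y(y), 0.0, UPPER)

cov_xy = mean_xy - mean_x * mean_y
print(mean_x, mean_y, mean_xy, cov_xy)
```

The printed values should be close to 1, 2, 1 and -1 respectively.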
——————————————————————————————————————–

Mixture Distributions

In the preceding discussion, the conditional distributions are derived from the joint distributions and the marginal distributions. In some applications, it is the opposite: we know the conditional distribution of one variable given the other variable and construct the joint distributions. We have the following:

    \displaystyle \begin{aligned}f_{X,Y}(x,y)&=f_{Y \lvert X}(y \lvert x) \ f_X(x) \\&\text{ } \\&=f_{X \lvert Y}(x \lvert y) \ f_Y(y) \end{aligned}

The form of the joint pdf indicated above has an interesting interpretation as a mixture. Using an insurance example, suppose that f_{X \lvert Y}(x \lvert y) is a model of the claim cost of a randomly selected insured where y is a realized value of a parameter Y that indicates the risk characteristics of an insured. The members of this large population have a wide variety of risk characteristics, and the random variable Y captures the risk characteristics across the entire population. Consequently, the unconditional pdf of the claim cost for a randomly selected insured is:

    \displaystyle f_X(x)=\int_{-\infty}^{\infty} f_{X \lvert Y}(x \lvert y) \ f_Y(y) \ dy

Note that the above unconditional pdf f_X(x) is a weighted average of conditional pdfs. Thus the distribution derived in this manner is called a mixture distribution. The pdf f_Y(y) is called the mixture weight or mixing weight. Some distributional quantities of a mixture distribution are also weighted averages of their conditional counterparts. These include the distribution function, the mean, and the higher moments. Thus we have:

    \displaystyle F_X(x)=\int_{-\infty}^{\infty} F_{X \lvert Y}(x \lvert y) \ f_Y(y) \ dy

    \displaystyle E(X)=\int_{-\infty}^{\infty} E(X \lvert Y=y) \ f_Y(y) \ dy

    \displaystyle E(X^k)=\int_{-\infty}^{\infty} E(X^k \lvert Y=y) \ f_Y(y) \ dy

In the above formulas, the cumulative distribution function F_X(x) and the moments E(X^k) are weighted averages of their conditional counterparts. However, the variance Var(X) is not, in general, the weighted average of the conditional variances. To find out why, see the post The variance of a mixture.