Bayesian Regression with Count Data

Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The third of these is regression with count data (also called Poisson regression).

A regression with count data behaves exactly like a linear model: it makes a prediction simply by computing a sum of the independent variables $\mathbf{X}$ weighted by the estimated coefficients $\boldsymbol{\beta}$, plus an intercept $\alpha$. However, instead of returning a continuous value $y$, as linear regression does, it returns the natural log of $y$.

We use regression with count data when our dependent variable is restricted to non-negative integers (counts, including zero), i.e. $y \in \{0, 1, 2, \ldots\}$. See the figure below for a graphical intuition of the exponential function:

using Plots, LaTeXStrings

plot(exp, -6, 6, label=false,
    xlabel=L"x", ylabel=L"e^x")

Exponential Function

As we can see, the exponential function maps any real number to a positive real number, that is, to the range between 0 and $+\infty$ (non-inclusive):

$$ \text{Exponential}(x) = \{ \mathbb{R} \in (-\infty, +\infty) \} \to \{ \mathbb{R} \in (0, +\infty) \} $$

That is, the exponential function is the ideal candidate for when we need to convert something continuous without restrictions to something continuous restricted to taking positive values only. That is why it is used when we need a model to have a positive-only dependent variable. This is the case of a dependent variable for count data.
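As a quick numerical check of this mapping, here is a minimal sketch in plain Julia. The input values are arbitrary and chosen only for illustration:

```julia
# Arbitrary real inputs: negative, zero and positive (illustrative values only)
xs = [-6.0, -1.0, 0.0, 1.0, 6.0]

# exp maps each real number to a strictly positive real number
ys = exp.(xs)

# log is the inverse mapping, taking us back to the unrestricted scale
xs_back = log.(ys)
```

Every element of `ys` is greater than zero, no matter how negative the input is, and `log` recovers the original unrestricted values.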

Comparison with Linear Regression

Linear regression has the following mathematical formulation:

$$ \text{Linear} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n $$

  • $\theta$ - model parameters

    • $\theta_0$ - intercept

    • $\theta_1, \theta_2, \dots$ - coefficients of the independent variables $x_1, x_2, \dots$

  • $n$ - total number of independent variables

Regression with count data sets the logarithm of $y$ equal to the same linear term:

$$ \log(y) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n $$

which is the same as:

$$ y = e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n)} $$
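To make the relationship between the linear term and $y$ concrete, here is a small sketch in plain Julia. The parameter vector `θ` and the predictor values `x` are hypothetical numbers picked for illustration, not estimates from any dataset:

```julia
# Hypothetical parameters: θ[1] is the intercept θ₀, the rest are coefficients
θ = [0.5, 0.2, -0.1]
# Hypothetical values of the independent variables x₁ and x₂
x = [3.0, 4.0]

# Linear term: θ₀ + θ₁x₁ + θ₂x₂
linear_term = θ[1] + sum(θ[2:end] .* x)

# In regression with count data the linear term is log(y), so y = exp(linear term)
y = exp(linear_term)
```

The linear term can be any real number, but `y` is always positive, which is what we need for an expected count.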

Bayesian Regression with Count Data

We can model regression with count data in two ways: the first option uses a Poisson likelihood function and the second option uses a negative binomial likelihood function.

With the Poisson likelihood we model a discrete and non-negative dependent variable $y$ by assuming that a given number of independent $y$ events will occur with a known constant average rate.

With a negative binomial likelihood, we model a discrete and non-negative dependent variable $y$ by assuming that a given number $n$ of independent $y$ events will occur by asking a yes-no question for each $n$ with probability $p$ until $k$ success(es) is obtained. Note that it becomes identical to the Poisson likelihood in the limit $k \to \infty$. This makes the negative binomial a robust replacement for the Poisson likelihood when modeling phenomena with overdispersion (more variation in the data than the model expects). Overdispersion is a problem for the Poisson likelihood because it assumes that the dependent variable $y$ has the same mean and variance, while in the negative binomial likelihood the mean and the variance do not need to be equal.
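The difference in the mean-variance relationship can be checked directly with Distributions.jl. This is a minimal sketch; the rate of 5 and the values of `r` and `p` are assumptions chosen so that both distributions share the same mean:

```julia
using Distributions

# Poisson with a hypothetical rate of 5: mean and variance coincide
pois = Poisson(5.0)
mean(pois), var(pois)

# Negative binomial in Distributions.jl's (r, p) parameterization,
# with r and p chosen so that the mean is also 5
negbin = NegativeBinomial(2.0, 2 / 7)
mean(negbin), var(negbin)   # the variance exceeds the mean
```

With the same mean, the negative binomial has room for extra variance, which is exactly what overdispersed count data requires.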

Using Poisson Likelihood

$$ \begin{aligned} \mathbf{y} &\sim \text{Poisson}\left( e^{(\alpha + \mathbf{X} \cdot \boldsymbol{\beta})} \right) \\ \alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\ \boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}}) \end{aligned} $$

where:

  • $\mathbf{y}$ – discrete and non-negative dependent variable.

  • $e$ – exponential.

  • $\alpha$ – intercept.

  • $\boldsymbol{\beta}$ – coefficient vector.

  • $\mathbf{X}$ – data matrix.

As we can see, the linear predictor $\alpha + \mathbf{X} \cdot \boldsymbol{\beta}$ is the logarithm of the value of $y$. So we need to apply the exponential function to the values of the linear predictor:

$$ \begin{aligned} \log(\mathbf{y}) &= \alpha + \mathbf{X} \cdot \boldsymbol{\beta} \\ \mathbf{y} &= e^{\alpha + \mathbf{X} \cdot \boldsymbol{\beta}} \\ \mathbf{y} &= e^{\alpha} \cdot e^{\left( \mathbf{X} \cdot \boldsymbol{\beta} \right) } \end{aligned} $$

The intercept $\alpha$ and coefficients $\boldsymbol{\beta}$ are only interpretable after exponentiation.
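The factorization above is what gives exponentiated coefficients their multiplicative reading. Here is a small sketch in plain Julia; the intercept, coefficients and data matrix are hypothetical values used only to illustrate the identity:

```julia
# Hypothetical intercept, coefficients and data matrix (2 observations, 2 predictors)
α = 1.0
β = [0.3, -0.2]
X = [1.0 2.0;
     2.0 2.0]   # the rows differ only by one unit in the first predictor

# Expected counts on the original scale
μ = exp.(α .+ X * β)

# Equivalent factorized form: exp(α) times exp(Xβ)
μ_factored = exp(α) .* exp.(X * β)

# Ratio of expected counts for a one-unit increase in the first predictor
ratio = μ[2] / μ[1]   # equals exp(β[1])
```

A one-unit increase in a predictor multiplies the expected count by the exponential of its coefficient, which is why we read exponentiated coefficients as multiplicative effects.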

What remains is to specify the model parameters' prior distributions:

  • Prior Distribution of $\alpha$ – Knowledge we possess regarding the model's intercept.

  • Prior Distribution of $\boldsymbol{\beta}$ – Knowledge we possess regarding the model's independent variables' coefficients.

Our goal is to instantiate a regression with count data using the observed data ($\mathbf{y}$ and $\mathbf{X}$) and find the posterior distribution of our model's parameters of interest ($\alpha$ and $\boldsymbol{\beta}$). This means to find the full posterior distribution of:

$$ P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\alpha, \boldsymbol{\beta} \mid \mathbf{y}) $$

Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our regression with count data. This is due to the Poisson not having a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.

This is easily accomplished with Turing:

using Turing
using LazyArrays
using Random: seed!
seed!(123)

@model function poissonreg(X, y; predictors=size(X, 2))
    #priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    #likelihood
    y ~ arraydist(LazyArray(@~ LogPoisson.(α .+ X * β)))
end;

Here I am specifying very weakly informative priors:

  • $\alpha \sim \text{Normal}(0, 2.5)$ – This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should easily cover all possible values of $\alpha$. Remember that the normal distribution has support over the whole real line, $(-\infty, +\infty)$.

  • $\boldsymbol{\beta} \sim \text{Student-}t(0,1,3)$ – The predictors all have a prior distribution of a Student-$t$ distribution centered on 0 with scale 1 and degrees of freedom $\nu = 3$. That wide-tailed $t$ distribution will cover all possible values for our coefficients. Remember the Student-$t$ also has support over the whole real line, $(-\infty, +\infty)$. Also, filldist() is a handy Turing function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.

Turing's arraydist() function wraps an array of distributions, returning a new distribution that samples from the individual distributions. The LazyArrays' LazyArray() constructor wraps a lazy object, which represents a computation producing an array, as an array. Last, but not least, the macro @~ creates a broadcast and is a nice shorthand for the familiar dot . broadcasting operator in Julia. This is an efficient way to tell Turing that our y vector is distributed lazily as a LogPoisson broadcasted to α added to the product of the data matrix X and β coefficient vector. LogPoisson is Turing's efficient distribution that already applies exponentiation to all the linear predictors.
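To see what "already applies exponentiation" means in practice, the sketch below compares the log-density of LogPoisson against a plain Poisson whose rate has been exponentiated by hand. The value of `η` and the observed count of 3 are hypothetical, and the snippet assumes that Turing re-exports `Poisson` and `logpdf` from Distributions.jl:

```julia
using Turing

# Hypothetical value of the linear predictor (log of the expected count)
η = 1.5

# LogPoisson takes the log-rate directly...
lp_log = logpdf(LogPoisson(η), 3)

# ...while Poisson needs the rate on the original scale
lp_plain = logpdf(Poisson(exp(η)), 3)
```

Both calls should give the same log-density, so passing the raw linear predictor to LogPoisson is equivalent to calling `exp.()` ourselves before using Poisson.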

Using Negative Binomial Likelihood

$$ \begin{aligned} \mathbf{y} &\sim \text{Negative Binomial}\left( e^{(\alpha + \mathbf{X} \cdot \boldsymbol{\beta})}, \phi \right) \\ \phi &\sim \text{Gamma}(0.01, 0.01) \\ \alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\ \boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}}) \end{aligned} $$

where:

  • $\mathbf{y}$ – discrete and non-negative dependent variable.

  • $e$ – exponential.

  • $\phi$ – dispersion.

  • $\phi^-$ – reciprocal dispersion.

  • $\alpha$ – intercept.

  • $\boldsymbol{\beta}$ – coefficient vector.

  • $\mathbf{X}$ – data matrix.

Note that when we compare with the Poisson model, we have a new parameter $\phi$ that parameterizes the negative binomial likelihood. This parameter controls the dispersion of the negative binomial distribution (in the alternative parameterization discussed later in this section it corresponds to the number of successes $r$), and we generally use a Gamma distribution as prior so that the inverse of $\phi$, which is $\phi^-$, fulfills the function of a "reciprocal dispersion" parameter. Most of the time we use a weakly informative prior with shape $\alpha = 0.01$ and scale $\theta = 0.01$ (Gelman et al., 2013; 2020). But you can also use $\phi^- \sim \text{Exponential}(1)$ as prior (McElreath, 2020).

Here is what a $\text{Gamma}(0.01, 0.01)$ looks like:

using StatsPlots, Distributions
plot(Gamma(0.01, 0.01),
    lw=2, label=false,
    xlabel=L"\phi",
    ylabel="Density",
    xlims=(0, 0.001))

Gamma Distribution with $\alpha = 0.01$ and $\theta = 0.01$
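To complement the plot, the following sketch inspects the same prior numerically. It relies on the fact that Distributions.jl parameterizes `Gamma` by shape and scale, so the mean is the product of the two parameters:

```julia
using Distributions

# Distributions.jl parameterizes Gamma by shape α and scale θ
prior = Gamma(0.01, 0.01)

shape(prior), scale(prior)

# Mean is α·θ and variance is α·θ²
mean(prior), var(prior)
```

The mean of this prior is very small, which is why the plot above restricts the x-axis to values close to zero.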

In both likelihood options, what remains is to specify the model parameters' prior distributions:

  • Prior Distribution of $\alpha$ – Knowledge we possess regarding the model's intercept.

  • Prior Distribution of $\boldsymbol{\beta}$ – Knowledge we possess regarding the model's independent variables' coefficients.

Our goal is to instantiate a regression with count data using the observed data ($\mathbf{y}$ and $\mathbf{X}$) and find the posterior distribution of our model's parameters of interest ($\alpha$ and $\boldsymbol{\beta}$). This means to find the full posterior distribution of:

$$ P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\alpha, \boldsymbol{\beta} \mid \mathbf{y}) $$

Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our regression with count data. This is due to neither the Poisson nor negative binomial distributions having a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.

Alternative Negative Binomial Parameterization

One last thing before we get into the details of the negative binomial distribution is to consider an alternative parameterization. Julia's Distributions.jl and, consequently, Turing's parameterization of the negative binomial distribution follows the Wolfram reference:

$$ \text{Negative-Binomial}(k \mid r, p) \sim \frac{\Gamma(k+r)}{k! \Gamma(r)} p^r (1 - p)^k, \quad \text{for } k = 0, 1, 2, \ldots $$

where:

  • $k$ – number of failures before the $r$th success in a sequence of independent Bernoulli trials

  • $r$ – number of successes

  • $p$ – probability of success in an individual Bernoulli trial

This is not ideal for most of the modeling situations in which we would employ the negative binomial distribution. In particular, we want to have a parameterization that is more appropriate for count data. What we need is the familiar mean (or location) and variance (or scale) parameterization. If we look in Stan's documentation for the neg_binomial_2 function, we have the following two equations:

$$ \begin{aligned} \mu &= \frac{r (1 - p)}{p} \\ \mu + \frac{\mu^2}{\phi} &= \frac{r (1 - p)}{p^2} \end{aligned} $$

With a little bit of algebra, we can substitute the first equation into the right-hand side of the second equation and get the following:

$$ \begin{aligned} \mu + \frac{\mu^2}{\phi} &= \frac{\mu}{p} \\ 1 + \frac{\mu}{\phi} &= \frac{1}{p} \\ p &= \frac{1}{1 + \frac{\mu}{\phi}} \end{aligned} $$

Then, substituting this $p$ back into the first equation, we have:

$$ \begin{aligned} \mu &= r \left(1 - \frac{1}{1 + \frac{\mu}{\phi}} \right) \cdot \left(1 + \frac{\mu}{\phi} \right) \\ \mu &= r \left( \left(1 + \frac{\mu}{\phi} \right) - 1 \right) \\ \mu &= r \cdot \frac{\mu}{\phi} \\ r &= \phi \end{aligned} $$

Hence, the resulting map is $\text{Negative-Binomial}(\mu, \phi) \equiv \text{Negative-Binomial} \left( r = \phi, p = \frac{1}{1 + \frac{\mu}{\phi}} \right)$. I would like to point out that this implementation was done by Tor Fjelde in a COVID-19 model with the code available on GitHub. So we can use this parameterization in our negative binomial regression model. But first, we need to define an alternative negative binomial distribution function:

function NegativeBinomial2(μ, ϕ)
    p = 1 / (1 + μ / ϕ)
    p = p > 0 ? p : 1e-4 # numerical stability
    r = ϕ

    return NegativeBinomial(r, p)
end
NegativeBinomial2 (generic function with 1 method)
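As a sanity check of the derivation, we can verify that the mean and variance of the resulting distribution match the two Stan equations. This is a sketch that repeats the function definition so it runs on its own; the values of `μ` and `ϕ` are hypothetical:

```julia
using Distributions

# Same mean/dispersion parameterization as defined above
function NegativeBinomial2(μ, ϕ)
    p = 1 / (1 + μ / ϕ)
    p = p > 0 ? p : 1e-4 # numerical stability
    r = ϕ
    return NegativeBinomial(r, p)
end

# Hypothetical mean and dispersion, for illustration only
μ, ϕ = 10.0, 2.0
d = NegativeBinomial2(μ, ϕ)

mean(d)   # should match μ
var(d)    # should match μ + μ² / ϕ
```

Note that smaller values of `ϕ` inflate the variance term `μ² / ϕ`, while very large values of `ϕ` bring the variance back toward the mean, as in the Poisson case.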

Now we create our Turing model with the alternative NegativeBinomial2 parameterization:

@model function negbinreg(X, y; predictors=size(X, 2))
    #priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)
    ϕ⁻ ~ Gamma(0.01, 0.01)
    ϕ = 1 / ϕ⁻

    #likelihood
    y ~ arraydist(LazyArray(@~ NegativeBinomial2.(exp.(α .+ X * β), ϕ)))
end;

Here I am also specifying very weakly informative priors:

  • $\alpha \sim \text{Normal}(0, 2.5)$ – This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should easily cover all possible values of $\alpha$. Remember that the normal distribution has support over the whole real line, $(-\infty, +\infty)$.

  • $\boldsymbol{\beta} \sim \text{Student-}t(0,1,3)$ – The predictors all have a prior distribution of a Student-$t$ distribution centered on 0 with scale 1 and degrees of freedom $\nu = 3$. That wide-tailed $t$ distribution will cover all possible values for our coefficients. Remember the Student-$t$ also has support over the whole real line, $(-\infty, +\infty)$. Also, filldist() is a handy Turing function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.

  • $\phi^- \sim \text{Gamma}(0.01, 0.01)$ – reciprocal of the overdispersion parameter $\phi$ of the negative binomial distribution, as specified in the model code.

Turing's arraydist() function wraps an array of distributions, returning a new distribution that samples from the individual distributions. The LazyArrays' LazyArray() constructor wraps a lazy object, which represents a computation producing an array, as an array. Last, but not least, the macro @~ creates a broadcast and is a nice shorthand for the familiar dot . broadcasting operator in Julia. This is an efficient way to tell Turing that our y vector is distributed lazily as a NegativeBinomial2 broadcasted to α added to the product of the data matrix X and β coefficient vector. Note that NegativeBinomial2 does not apply exponentiation, so we had to apply the broadcasted exp.() function to all the linear predictors.

Example - Roaches Extermination

For our example, I will use a famous dataset called roaches (Gelman & Hill, 2007), which is data on the efficacy of a pest management system at reducing the number of roaches in urban apartments. It has 262 observations and the following variables:

  • y – number of roaches caught.

  • roach1 – pretreatment number of roaches.

  • treatment – binary/dummy (0 or 1) for treatment indicator.

  • senior – binary/dummy (0 or 1) for only elderly residents in building.

  • exposure2 – number of days for which the roach traps were used

Ok let's read our data with CSV.jl and output into a DataFrame from DataFrames.jl:

using DataFrames, CSV, HTTP

url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/roaches.csv"
roaches = CSV.read(HTTP.get(url).body, DataFrame)
describe(roaches)
5×7 DataFrame
 Row │ variable   mean       min   median   max        nmissing  eltype
     │ Symbol     Float64    Real  Float64  Real       Int64     DataType
─────┼────────────────────────────────────────────────────────────────────
   1 │ y          25.6489     0        3.0  357               0  Int64
   2 │ roach1     42.1935     0.0      7.0  450.0             0  Float64
   3 │ treatment   0.603053   0        1.0    1               0  Int64
   4 │ senior      0.305344   0        0.0    1               0  Int64
   5 │ exposure2   1.02105    0.2      1.0    4.28571         0  Float64

As you can see from the describe() output, the average number of roaches caught by the pest management system is around 26 roaches. The average number of roaches pretreatment is around 42 roaches (oh boy...). 30% of the buildings have only elderly residents and 60% of the buildings received a treatment by the pest management system. Also note that the traps were generally set for only 1 day, and the exposure ranges from 0.2 days (almost 5 hours) to 4.3 days (approximately 4 days and 7 hours).
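Since the choice between the two likelihoods hinges on overdispersion, a quick look at the variance-to-mean ratio of the counts can be informative before fitting anything. The helper `dispersion_ratio` below is a hypothetical name introduced for this sketch, and the toy counts are made up for illustration; they are not the roaches data:

```julia
using Statistics

# Ratio of sample variance to sample mean; values far above 1 suggest overdispersion
dispersion_ratio(y) = var(y) / mean(y)

# Hypothetical toy counts with many small values and a few large ones
toy_counts = [0, 0, 1, 2, 3, 3, 5, 40, 120]
dispersion_ratio(toy_counts)
```

With the DataFrame loaded above, calling `dispersion_ratio(roaches.y)` would give the corresponding figure for the roaches counts. This is only a rough marginal check, since it ignores the predictors.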

Poisson Regression

Let's first run the Poisson regression. First, we instantiate our model with the data:

X = Matrix(select(roaches, Not(:y)))
y = roaches[:, :y]
model_poisson = poissonreg(X, y);

And, finally, we will sample from the Turing model. We will be using the default NUTS() sampler with 1_000 samples, with 4 Markov chains using multiple threads MCMCThreads():

chain_poisson = sample(model_poisson, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain_poisson)
Summary Statistics
  parameters      mean       std   naive_se      mcse         ess      rhat   ess_per_sec
      Symbol   Float64   Float64    Float64   Float64     Float64   Float64       Float64

           α    2.9680    0.0446     0.0007    0.0009   2094.8385    1.0004       49.2289
        β[1]    0.0065    0.0001     0.0000    0.0000   3335.8001    1.0006       78.3917
        β[2]   -0.5151    0.0245     0.0004    0.0005   2405.7072    1.0016       56.5344
        β[3]   -0.3785    0.0328     0.0005    0.0009   1605.7037    1.0005       37.7342
        β[4]    0.1605    0.0378     0.0006    0.0009   2041.3466    1.0006       47.9719

We had no problem with the Markov chains as all the rhat are well below 1.01 (or above 0.99). Note that the coefficients are in log scale. So we need to apply the exponential function to them. We can do this with a transformation in a DataFrame constructed from a Chains object:

using Chain

@chain quantile(chain_poisson) begin
    DataFrame
    select(_,
        :parameters,
        names(_, r"%") .=> ByRow(exp),
        renamecols=false)
end
5×6 DataFrame
 Row │ parameters  2.5%       25.0%      50.0%      75.0%      97.5%
     │ Symbol      Float64    Float64    Float64    Float64    Float64
─────┼───────────────────────────────────────────────────────────────────
   1 │ α           17.9011    18.8684    19.4457    20.0206    21.3264
   2 │ β[1]         1.00636    1.00649    1.00655    1.00661    1.00672
   3 │ β[2]         0.569928   0.587616   0.597261   0.607139   0.627645
   4 │ β[3]         0.639823   0.670909   0.684702   0.700423   0.729319
   5 │ β[4]         1.0892     1.14579    1.17467    1.20449    1.2622

Let's analyze our results. The intercept α is the basal number of roaches caught y and has a median value of 19.4 roaches caught. The remaining 95% credible intervals for the βs can be interpreted as follows:

  • β[1] – first column of X, roach1, has a 95% credible interval from 1.006 to 1.007. This means that each one-unit increase in roach1 is related to an increase of roughly 0.6% to 0.7% in roaches caught.

  • β[2] – second column of X, treatment, has a 95% credible interval from 0.57 to 0.63. This means that if an apartment was treated with the pest management system then we expect a decrease of around 40% in roaches caught.

  • β[3] – third column of X, senior, has a 95% credible interval from 0.64 to 0.73. This means that if an apartment building has only elderly residents then we expect a decrease of around 30% in roaches caught.

  • β[4] – fourth column of X, exposure2, has a 95% credible interval from 1.09 to 1.26. For each additional day of trap exposure in an apartment, we expect an increase of between 9% and 26% in roaches caught.

That's how you interpret 95% credible intervals from a quantile() output of a regression with count data Chains object converted from a log scale.
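The conversion from a log-scale coefficient to a percentage change can be written as a one-line helper. `percent_change` is a hypothetical name used only in this sketch; the inputs are the posterior means of β[2] and β[4] reported in the summary statistics above:

```julia
# Convert a log-scale coefficient into a percentage change in the expected count
percent_change(β) = 100 * (exp(β) - 1)

# Posterior means of β[2] (treatment) and β[4] (exposure2) from the summary table above
percent_change(-0.5151)   # negative: fewer roaches expected
percent_change(0.1605)    # positive: more roaches expected
```

A negative coefficient gives a percentage decrease and a positive one an increase, which matches the reading of the credible intervals above.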

Negative Binomial Regression

Let's now run the negative binomial regression.

model_negbin = negbinreg(X, y);

We will also use the default NUTS() sampler with 1_000 samples, with 4 Markov chains using multiple threads MCMCThreads():

chain_negbin = sample(model_negbin, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain_negbin)
Summary Statistics
  parameters      mean       std   naive_se      mcse         ess      rhat   ess_per_sec
      Symbol   Float64   Float64    Float64   Float64     Float64   Float64       Float64

           α    2.3917    0.3749     0.0059    0.0087   2021.2558    1.0022       44.9278
        β[1]    0.0127    0.0015     0.0000    0.0000   4211.0012    0.9995       93.6007
        β[2]   -0.7326    0.1512     0.0024    0.0028   2860.7723    0.9995       63.5883
        β[3]   -0.3233    0.1661     0.0026    0.0026   3291.0800    0.9995       73.1530
        β[4]    0.4319    0.3441     0.0054    0.0083   1957.8123    1.0025       43.5176
          ϕ⁻    1.4058    0.0777     0.0012    0.0013   2834.8365    0.9994       63.0118

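We had no problem with the Markov chains as all the rhat are well below 1.01 (or above 0.99). Note that the coefficients are in log scale. So we need to also apply the exponential function as we did before. Keep in mind that this transformation exponentiates every row, including ϕ⁻, which is not on the log scale; for ϕ⁻, refer to the summary statistics above instead.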

@chain quantile(chain_negbin) begin
    DataFrame
    select(_,
        :parameters,
        names(_, r"%") .=> ByRow(exp),
        renamecols=false)
end
6×6 DataFrame
 Row │ parameters  2.5%      25.0%     50.0%      75.0%      97.5%
     │ Symbol      Float64   Float64   Float64    Float64    Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ α           4.99184   8.54313   11.1166    14.1096    22.7458
   2 │ β[1]        1.00992   1.01166    1.0127     1.01384    1.01596
   3 │ β[2]        0.355529  0.434152   0.482392   0.534852   0.643815
   4 │ β[3]        0.521363  0.645817   0.725943   0.808503   1.00054
   5 │ β[4]        0.811722  1.21325    1.51185    1.92875    3.16543
   6 │ ϕ⁻          3.52958   3.86366    4.06888    4.29254    4.75894

Our results show much more uncertainty in the coefficients than in the Poisson regression. So it might be best to use the Poisson regression in the roaches dataset.

References

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.

McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.