Bayesian Logistic Regression

Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The first is logistic regression (also called binomial regression).

A logistic regression behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables $\mathbf{X}$ by the estimated coefficients $\boldsymbol{\beta}$, plus an intercept $\alpha$. However, instead of returning a continuous value $y$, as linear regression does, it returns the logistic function of $y$:

$$\text{Logistic}(x) = \frac{1}{1 + e^{-x}}$$

We use logistic regression when our dependent variable is binary. It has only two distinct values, usually encoded as $0$ or $1$. See the figure below for a graphical intuition of the logistic function:

using CairoMakie

# the logistic (sigmoid) function: maps any real number into (0, 1)
function logistic(x)
    return 1 / (1 + exp(-x))
end

f, ax, l = lines(-10 .. 10, logistic; axis=(; xlabel=L"x", ylabel=L"\mathrm{Logistic}(x)"))
f

Logistic Function

As we can see, the logistic function is basically a mapping of any real number to a real number in the range between 0 and 1:

$$\text{Logistic}(x) = \{ \mathbb{R} \in [-\infty, +\infty] \} \to \{ \mathbb{R} \in (0, 1) \}$$

That is, the logistic function is the ideal candidate for when we need to convert something continuous without restrictions to something continuous restricted between 0 and 1. That is why it is used when we need a model to have a probability as a dependent variable (remembering that any real number between 0 and 1 is a valid probability). In the case of a binary dependent variable, we can use this probability as the chance of the dependent variable taking a value of 0 or 1.
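To make the mapping concrete, here is a quick check using the logistic function defined above: large negative inputs are squashed towards 0, large positive inputs towards 1, and 0 maps to exactly 0.5:

# extreme inputs get squashed into (0, 1); 0 maps to 0.5
logistic(-10), logistic(0), logistic(10)
# (4.5398e-5, 0.5, 0.99995) -- approximately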

Comparison with Linear Regression

Linear regression follows the following mathematical formulation:

$$\text{Linear} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
  • $\theta$ – model parameters

    • $\theta_0$ – intercept

    • $\theta_1, \theta_2, \dots$ – coefficients of the independent variables $x_1, x_2, \dots$

  • $n$ – total number of independent variables

Logistic regression would add the logistic function to the linear term:

  • $\hat{p} = \text{Logistic}(\text{Linear}) = \frac{1}{1 + e^{-\text{Linear}}}$ – predicted probability of the observation being the value 1

  • $\hat{\mathbf{y}} = \left\{\begin{array}{ll} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{array}\right.$ – predicted discrete value of $\mathbf{y}$

Example:

$$\text{Probability of Death} = \text{Logistic}\big(-10 + 10 \cdot \text{cancer} + 12 \cdot \text{diabetes} + 8 \cdot \text{obesity}\big)$$
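For instance, with hypothetical indicator values (a patient with cancer and obesity but without diabetes, purely for illustration), the linear term is $-10 + 10 + 0 + 8 = 8$ and the predicted probability of death is $\text{Logistic}(8) \approx 0.9997$:

# hypothetical indicators, purely for illustration
cancer, diabetes, obesity = 1, 0, 1
linear = -10 + 10 * cancer + 12 * diabetes + 8 * obesity  # = 8
prob_death = logistic(linear)                             # ≈ 0.9997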

Bayesian Logistic Regression

We can model logistic regression in two ways: with a Bernoulli likelihood function or with a binomial likelihood function.

With the Bernoulli likelihood, we model a binary dependent variable $y$ which is the result of a Bernoulli trial with a certain probability $p$.

With the binomial likelihood, we model a discrete dependent variable $y$ which is the number of successes of $n$ independent Bernoulli trials.

Using Bernoulli Likelihood

$$
\begin{aligned}
\mathbf{y} &\sim \text{Bernoulli}(\mathbf{p}) \\
\mathbf{p} &\sim \text{Logistic}(\alpha + \mathbf{X} \cdot \boldsymbol{\beta}) \\
\alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\
\boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}})
\end{aligned}
$$

where:

  • $\mathbf{y}$ – binary dependent variable.

  • $\mathbf{p}$ – probability of $\mathbf{y}$ taking the value of 1 – success in an independent Bernoulli trial.

  • $\text{Logistic}$ – logistic function.

  • $\alpha$ – intercept.

  • $\boldsymbol{\beta}$ – coefficient vector.

  • $\mathbf{X}$ – data matrix.

Using Binomial Likelihood

$$
\begin{aligned}
\mathbf{y} &\sim \text{Binomial}(n, \mathbf{p}) \\
\mathbf{p} &\sim \text{Logistic}(\alpha + \mathbf{X} \cdot \boldsymbol{\beta}) \\
\alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\
\boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}})
\end{aligned}
$$

where:

  • $\mathbf{y}$ – dependent variable – number of successes of $n$ independent Bernoulli trials.

  • $n$ – number of independent Bernoulli trials.

  • $\mathbf{p}$ – probability of success in each independent Bernoulli trial.

  • $\text{Logistic}$ – logistic function.

  • $\alpha$ – intercept.

  • $\boldsymbol{\beta}$ – coefficient vector.

  • $\mathbf{X}$ – data matrix.

In both likelihood options, what remains is to specify the model parameters' prior distributions:

  • Prior Distribution of $\alpha$ – Knowledge we possess regarding the model's intercept.

  • Prior Distribution of $\boldsymbol{\beta}$ – Knowledge we possess regarding the model's independent variables' coefficients.

Our goal is to instantiate a logistic regression with the observed data ($\mathbf{y}$ and $\mathbf{X}$) and find the posterior distribution of our model's parameters of interest ($\alpha$ and $\boldsymbol{\beta}$). This means finding the full posterior distribution of:

$$P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\alpha, \boldsymbol{\beta} \mid \mathbf{y})$$

Note that, contrary to linear regression, which uses a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our logistic regression. This is because neither the Bernoulli nor the binomial distribution has a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.

Also note that the Bernoulli distribution is a special case of the binomial distribution where $n = 1$:

$$\text{Bernoulli}(p) = \text{Binomial}(1, p)$$
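We can check this equivalence numerically with Distributions.jl (which Turing re-exports), using an arbitrary probability purely for illustration:

using Distributions

p = 0.3  # arbitrary probability, just for illustration
pdf(Bernoulli(p), 1) ≈ pdf(Binomial(1, p), 1)  # true
pdf(Bernoulli(p), 0) ≈ pdf(Binomial(1, p), 0)  # true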

Specifying such a model is easily accomplished with Turing:

using Turing
using LazyArrays
using Random: seed!
seed!(123)

@model function logreg(X, y; predictors=size(X, 2))
    #priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    #likelihood
    return y ~ arraydist(LazyArray(@~ BernoulliLogit.(α .+ X * β)))
end;

Here I am specifying very weakly informative priors:

  • $\alpha \sim \text{Normal}(0, 2.5)$ – This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should easily cover all possible values of $\alpha$. Remember that the normal distribution has support over the entire real line $(-\infty, +\infty)$.

  • $\boldsymbol{\beta} \sim \text{Student-}t(0, 1, 3)$ – The predictors all have a prior distribution of a Student-$t$ distribution centered on 0 with scale 1 and degrees of freedom $\nu = 3$. That wide-tailed $t$ distribution will cover all possible values for our coefficients. Remember that the Student-$t$ also has support over the entire real line $(-\infty, +\infty)$. Also, filldist() is a handy Turing function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.

Turing's arraydist() function wraps an array of distributions and returns a new distribution that samples from the individual distributions. And LazyArrays' LazyArray() constructor wraps a computation that produces an array into a lazy array. Last but not least, the macro @~ creates a lazy broadcast and is a nice shorthand for the familiar dot . broadcasting operator in Julia. This is an efficient way to tell Turing that our y vector is distributed lazily as a BernoulliLogit broadcasted over α added to the product of the data matrix X and the β coefficient vector.
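As a small aside, filldist(TDist(3), predictors) is just a product distribution of independent Student-$t$(3) coefficients, and the lazy likelihood line is, up to memory efficiency, equivalent to an eager version without LazyArrays. A minimal sketch (the value 4 is arbitrary, purely for illustration):

# filldist repeats one univariate distribution into a multivariate one;
# here: four independent Student-t(3) coefficients sampled at once
β_prior = filldist(TDist(3), 4)
rand(β_prior)  # 4-element Vector{Float64}

# without LazyArrays the likelihood could be written eagerly (same model,
# just allocates the full array of distributions):
# y ~ arraydist(BernoulliLogit.(α .+ X * β))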

If your dependent variable y is discrete and represents the number of successes of $n$ independent Bernoulli trials, you can use the binomial likelihood in your model:

y ~ arraydist(LazyArray(@~ BinomialLogit.(n, α .+ X * β)))
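Putting that together, a sketch of the full binomial-likelihood version of the model might look like this, assuming n is a vector with the number of trials per observation (the model name and arguments here are illustrative and not used in the example below):

# sketch of a binomial-likelihood model; `n` holds the number of
# independent Bernoulli trials for each observation (illustrative only)
@model function binomreg(X, n, y; predictors=size(X, 2))
    # priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    # likelihood: y[i] successes out of n[i] trials
    return y ~ arraydist(LazyArray(@~ BinomialLogit.(n, α .+ X * β)))
end;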

Example - Contaminated Water Wells

For our example, I will use a famous dataset called wells (Gelman & Hill, 2007), which is data from a survey of 3,200 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well in the nearby area and the survey was conducted several years later to learn which of the affected residents had switched wells. It has 3,200 observations and the following variables:

  • switch – binary/dummy (0 or 1) for well-switching.

  • arsenic – arsenic level in respondent's well.

  • dist – distance (meters) from the respondent's house to the nearest well with safe drinking water.

  • assoc – binary/dummy (0 or 1) if member(s) of the household participate in community organizations.

  • educ – years of education (head of household).

Ok let's read our data with CSV.jl and output into a DataFrame from DataFrames.jl:

using DataFrames
using CSV
using HTTP

url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/wells.csv"
wells = CSV.read(HTTP.get(url).body, DataFrame)
describe(wells)
5×7 DataFrame
 Row │ variable  mean       min    median   max      nmissing  eltype
     │ Symbol    Float64    Real   Float64  Real     Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ switch     0.575166  0       1.0       1             0  Int64
   2 │ arsenic    1.65693   0.51    1.3       9.65          0  Float64
   3 │ dist      48.3319    0.387  36.7615  339.531         0  Float64
   4 │ assoc      0.422848  0       0.0       1             0  Int64
   5 │ educ       4.82848   0       5.0      17             0  Int64

As you can see from the describe() output, 58% of the respondents switched wells and 42% of respondents are somehow engaged in community organizations. The average years of education of the household's head is approximately 5 years, ranging from 0 (no education at all) to 17 years. The distance to safe drinking water is measured in meters and averages 48 m, ranging from less than 1 m to 339 m. Regarding arsenic levels I cannot comment much, because the only thing I know is that it is toxic and you probably would never want your well contaminated with it. Here, we believe that all of those variables somehow influence the probability of a respondent switching to a safe well.

Now let's instantiate our model with the data:

X = Matrix(select(wells, Not(:switch)))
y = wells[:, :switch]
model = logreg(X, y);

And, finally, we will sample from the Turing model. We will be using the default NUTS() sampler with 1_000 samples, with 4 Markov chains using multiple threads MCMCThreads():

chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain)
Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ess_per_sec
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64       Float64

           α   -0.1537    0.1008    0.0026   1487.6006   2091.0499    1.0020       26.5174
        β[1]    0.4664    0.0419    0.0009   2237.3707   2405.0432    1.0010       39.8825
        β[2]   -0.0090    0.0010    0.0000   4269.6543   3192.4775    1.0009       76.1093
        β[3]   -0.1226    0.0777    0.0019   1680.2431   1877.4329    1.0002       29.9514
        β[4]    0.0424    0.0096    0.0002   2656.3110   2520.0618    1.0020       47.3504

We had no problem with the Markov chains, as all the rhat values are well below 1.01 (and above 0.99). Note that the coefficients are on the log-odds scale. They are the natural log of the odds[1], and odds are defined as:

$$\text{odds} = \frac{p}{1-p}$$

where $p$ is a probability. So log-odds is defined as:

$$\log(\text{odds}) = \log \left( \frac{p}{1-p} \right)$$

So, in order to get the odds from log-odds, we must undo the log operation with exponentiation. This translates to:

$$\text{odds} = \exp(\log(\text{odds}))$$
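A quick numeric sanity check of these definitions, using an arbitrary probability of 0.75:

p = 0.75
odds = p / (1 - p)   # 3.0
logodds = log(odds)  # ≈ 1.0986
exp(logodds)         # back to 3.0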

We can do this with a transformation in a DataFrame constructed from a Chains object:

using Chain

@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
end
5×6 DataFrame
 Row │ parameters  2.5%      25.0%     50.0%     75.0%     97.5%
     │ Symbol      Float64   Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────────────
   1 │ α           0.700906  0.801908  0.859547  0.91731   1.04274
   2 │ β[1]        1.47317   1.54865   1.59402   1.63747   1.7351
   3 │ β[2]        0.989048  0.990354  0.991039  0.991766  0.993111
   4 │ β[3]        0.75859   0.84017   0.885127  0.931194  1.03098
   5 │ β[4]        1.0235    1.03671   1.04353   1.0502    1.06242

Our interpretation of odds is the same as in betting games. Anything below 1 signals that $y$ being $1$ is unlikely, and anything above 1 increases the probability of $y$ being $1$, while 1 itself is neutral odds for $y$ being either $1$ or $0$. Since I am not a gambling man, let's talk about probabilities. So I will create a function called logodds2prob() that converts log-odds to probabilities:

function logodds2prob(logodds::Float64)
    return exp(logodds) / (1 + exp(logodds))
end

@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(logodds2prob); renamecols=false)
end
5×6 DataFrame
 Row │ parameters  2.5%      25.0%     50.0%     75.0%     97.5%
     │ Symbol      Float64   Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────────────
   1 │ α           0.412078  0.445033  0.462234  0.478436  0.510462
   2 │ β[1]        0.59566   0.607635  0.614497  0.620848  0.634383
   3 │ β[2]        0.497247  0.497577  0.49775   0.497933  0.498272
   4 │ β[3]        0.431363  0.456572  0.469532  0.482186  0.507626
   5 │ β[4]        0.505807  0.509012  0.510649  0.512243  0.515133

There you go, much better now. Let's analyze our results. The intercept α is the basal switch probability, which has a median value of 46%. Any coefficient whose 95% credible interval captures the value $\frac{1}{2} = 0.5$ tells us that its effect on the propensity to switch is inconclusive. This is pretty much analogous to a 95% credible interval that captures 0 for linear regression coefficients. So this rules out β[3], which is the third column of X – assoc. The remaining 95% credible intervals can be interpreted as follows:

  • β[1] – first column of X, arsenic, has a 95% credible interval from 0.595 to 0.634. This means that each increase of one unit in arsenic is related to an increase of 9.6% to 13.4% in the propensity of switch being 1.

  • β[2] – second column of X, dist, has a 95% credible interval from 0.497 to 0.498. So we expect that each increase of one meter in dist is related to a decrease of roughly 0.2% to 0.3% in the propensity of switch being 1.

  • β[4] – fourth column of X, educ, has a 95% credible interval from 0.506 to 0.515. Each increase of one year in educ is related to an increase of 0.6% to 1.5% in the propensity of switch being 1.

That's how you interpret 95% credible intervals from a quantile() output of a logistic regression Chains object converted from log-odds to probability.
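As a final illustration, here is a minimal sketch of turning the posterior into a point prediction for a hypothetical new household, reusing the logistic function defined at the beginning of this section. The covariate values are made up, and this uses posterior means as a plug-in approximation rather than the full posterior predictive:

# posterior means of the parameters
α̂ = mean(chain[:α])
β̂ = [mean(chain[Symbol("β[$i]")]) for i in 1:4]

# hypothetical new household (values purely for illustration):
# arsenic = 2.0, dist = 50 m, assoc = 0, educ = 5
x_new = [2.0, 50.0, 0.0, 5.0]

p_switch = logistic(α̂ + x_new' * β̂)  # plug-in probability of switching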

Footnotes

[1] Actually, the logit function, or log-odds, is the logarithm of the odds $\frac{p}{1-p}$, where $p$ is a probability.

References

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.