Robust Bayesian Regression

Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The fourth of these is robust regression.

A robust regression behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables X by the estimated coefficients β, plus an intercept α. However, instead of using a Gaussian/normal likelihood function, it uses a Student-t likelihood function.

We use robust regression in the same context as linear regression: our dependent variable is continuous. But robust regression allows us to better handle outliers in our data.

Before we dive into the nuts and bolts of robust regression, let's recall the Gaussian/normal curve, which has a bell shape (figure below). It does not have "fat tails" (sometimes called "long tails"); in other words, its observations do not stray far from the mean. When we use this distribution as a likelihood function in a Bayesian model, we force all estimates to be conditioned on a normal distribution of the dependent variable. If there are many outliers in the data (observations quite far from the mean), the estimates of the independent variables' coefficients become unstable. This is because the normal distribution cannot accommodate observations that are very spread away from the mean without changing the mean's position (or location). In other words, the bell curve needs to "shift" to accommodate outliers, thus making the inference unstable.
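
Before plotting that curve, here is a tiny numerical illustration of the instability described above (made-up numbers, purely for intuition): the sample mean, which is the maximum-likelihood estimate of a normal distribution's location, jumps when a single outlier appears, while the median barely moves.

using Statistics: mean, median

data = [1.2, 0.8, 1.1, 0.9, 1.0]     # well-behaved observations
data_outlier = [data; 15.0]          # same observations plus one outlier

mean(data), mean(data_outlier)       # mean jumps from 1.0 to ≈ 3.33
median(data), median(data_outlier)   # median moves from 1.0 to only 1.05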

using CairoMakie
using Distributions

f, ax, l = lines(-4 .. 4, Normal(0, 1); linewidth=5, axis=(; xlabel=L"x", ylabel="Density"))

Normal with μ = 0 and σ = 1

So we need a more "malleable" distribution as a likelihood function: a distribution that is more robust to outliers. A distribution similar to the normal but with "fatter" (or "longer") tails, precisely so that it can accommodate observations very far from the mean without having to "shift" the mean's position (or location). For that we have the Student-t distribution. See the figure below to remember its shape.

f, ax, l = lines(-4 .. 4, TDist(2); linewidth=5, axis=(xlabel=L"x", ylabel="Density"))

Student-t with ν = 2

Comparison Between Normal and Student-t

Take a look at the tails in the comparison below:

f, ax, l = lines(
    -4 .. 4,
    Normal(0, 1);
    linewidth=5,
    label="Normal",
    axis=(; xlabel=L"x", ylabel="Density"),
)
lines!(ax, -4 .. 4, TDist(2); linewidth=5, label="Student")
axislegend(ax)

Comparison between Normal and Student-t Distributions
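
To put a number on those "fatter tails", we can compare the probability mass each of the two curves plotted above puts beyond x = 4. This is just a quick check using Distributions.jl, which is already loaded:

ccdf(Normal(0, 1), 4)   # upper-tail probability ≈ 3.2e-5
ccdf(TDist(2), 4)       # upper-tail probability ≈ 0.029, about 900 times larger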

Bayesian Robust Regression

The standard approach for modeling a continuous dependent variable is with a Gaussian/normal likelihood function. This implies that the model error, σ, of the Gaussian/normal likelihood function is distributed as a normal distribution:

\begin{aligned}
\mathbf{y} &\sim \text{Normal}\left( \alpha + \mathbf{X} \cdot \boldsymbol{\beta}, \sigma \right) \\
\alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\
\boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}}) \\
\sigma &\sim \text{Exponential}(\lambda_\sigma)
\end{aligned}

From a Bayesian point of view, there is nothing special about the Gaussian/normal likelihood function. It is just a probability distribution specified in a model. We can make the model more robust by using a Student-t distribution as the likelihood function. This implies that the model error, σ, does not follow a normal distribution; instead, it follows a Student-t distribution:

\begin{aligned}
\mathbf{y} &\sim \text{Student}\left( \nu, \alpha + \mathbf{X} \cdot \boldsymbol{\beta}, \sigma \right) \\
\alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\
\boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}}) \\
\nu &\sim \text{Log-Normal}(2, 1) \\
\sigma &\sim \text{Exponential}(\lambda_\sigma)
\end{aligned}

Here are some differences:

  1. The Student-t likelihood function requires one additional parameter: ν, the degrees of freedom. It controls how "fat" (or "long") the tails will be. Values of ν > 20 make the Student-t distribution practically indistinguishable from a normal distribution (see the quick check after this list).

  2. There is nothing special about ν. It is just another parameter for the model to estimate, so we just specify a prior on it. In this case, I am using a Log-Normal distribution with location parameter 2 and scale parameter 1.
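
As a quick check on the first point (nothing here beyond Distributions.jl, already loaded), compare a tail quantile of the Student-t for large and small ν with the standard normal:

quantile(Normal(0, 1), 0.975)   # ≈ 1.96
quantile(TDist(30), 0.975)      # ≈ 2.04, already close to the normal
quantile(TDist(3), 0.975)       # ≈ 3.18, much further out in the tail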

Note that there is also nothing special about the priors of the β coefficients or the intercept α. We could very well specify other distributions as priors, or make the model even more robust to outliers by specifying the priors as Student-t distributions with degrees of freedom ν = 3:

\begin{aligned}
\alpha &\sim \text{Student}(\nu_\alpha = 3, \mu_\alpha, \sigma_\alpha) \\
\boldsymbol{\beta} &\sim \text{Student}(\nu_{\boldsymbol{\beta}} = 3, \mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}})
\end{aligned}

Our goal is to instantiate a robust regression with the observed data (y and X) and find the posterior distribution of our model's parameters of interest (α and β). This means finding the full posterior distribution of:

P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\alpha, \boldsymbol{\beta}, \sigma, \nu \mid \mathbf{y})

This is easily accomplished with Turing:

using Turing
using Statistics: mean, median, std
using StatsBase: mad
using Random: seed!
seed!(123)

@model function robustreg(X, y; predictors=size(X, 2))
    #priors
    α ~ LocationScale(median(y), 2.5 * mad(y), TDist(3))
    β ~ filldist(TDist(3), predictors)
    σ ~ Exponential(1)
    ν ~ LogNormal(2, 1)

    #likelihood
    return y ~ arraydist(LocationScale.(α .+ X * β, σ, TDist.(ν)))
end;

Here I am specifying very weakly informative priors:

  • α ~ Student-t(median(y), 2.5 · MAD(y), ν = 3) – This means a Student-t distribution with 3 degrees of freedom, centered on y's median, with scale 2.5 times the median absolute deviation (MAD) of y (see the quick sanity check after this list). That prior should cover all plausible values of α with ease. Remember that the Student-t distribution has support over the whole real line, i.e. (-∞, +∞). The LocationScale() function from Distributions.jl (re-exported by Turing) adds location and scale parameters to distributions that don't have them. This is the case with TDist(), which only takes the degrees of freedom ν as a parameter.

  • β ~ Student-t(0, 1, ν = 3) – The predictors all have a prior of a Student-t distribution centered on 0, with scale 1 and 3 degrees of freedom. That wide-tailed t distribution will cover all plausible values for our coefficients. Remember that the Student-t also has support over the whole real line, i.e. (-∞, +∞). Also, filldist() is a nice Turing function that takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution as many times as requested.

  • σ ~ Exponential(1) – A wide-tailed, positive-only distribution perfectly suited for our model's error.
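
As a quick sanity check on the intercept prior mentioned in the first bullet above, we can look at the central 95% interval it implies. The location and scale below are made-up stand-ins for median(y) and 2.5 * mad(y); the actual values come from the data we load later.

prior_α = LocationScale(41.0, 2.5 * 15.0, TDist(3))   # illustrative stand-in values
quantile.(prior_α, [0.025, 0.5, 0.975])               # a very wide interval around the stand-in median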

Turing's arraydist() function wraps an array of distributions and returns a new distribution that samples from those individual distributions. It pairs nicely with the familiar dot . broadcasting operator in Julia. Here we specify that the y vector is distributed as a LocationScale distribution broadcast over its location parameter, α .+ X * β (the intercept plus the product of the data matrix X and the coefficient vector β), and its scale parameter σ. Inside each LocationScale we place a broadcasted TDist with ν degrees of freedom. An equivalent formulation with an explicit loop is sketched below.
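
Here is a minimal sketch of that equivalent loop-based formulation, replacing arraydist() with one Student-t statement per observation. The name robustreg_loop is hypothetical; the vectorized model above is the one we actually sample from (and is typically faster):

@model function robustreg_loop(X, y; predictors=size(X, 2))
    # same priors as before
    α ~ LocationScale(median(y), 2.5 * mad(y), TDist(3))
    β ~ filldist(TDist(3), predictors)
    σ ~ Exponential(1)
    ν ~ LogNormal(2, 1)

    # likelihood: one Student-t observation per row,
    # with location α + X[i, :]'β and scale σ
    μ = α .+ X * β
    for i in eachindex(y)
        y[i] ~ LocationScale(μ[i], σ, TDist(ν))
    end
end;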

Example - Duncan's Prestige

For our example, I will use a famous dataset called duncan (Duncan, 1961), which contains data on occupations' prestige and is filled with outliers. It has 45 observations and the following variables:

  • profession: name of the profession.

  • type: type of occupation. A qualitative variable:

    • prof - professional or management.

    • wc - white-collar.

    • bc - blue-collar.

  • income: percentage of people in the occupation earning over US$ 3,500 per year in 1950 (roughly US$ 36,000 in 2017).

  • education: percentage of people in the occupation who had a high school diploma in 1949 (which, being cynical, we can say is somewhat equivalent to a PhD degree in 2017).

  • prestige: percentage of respondents in the survey who classified their occupation as at least "good" with respect to prestige.

OK, let's read our data with CSV.jl and output it into a DataFrame from DataFrames.jl:

using DataFrames
using CSV
using HTTP

url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/duncan.csv"
duncan = CSV.read(HTTP.get(url).body, DataFrame)
describe(duncan)
5×7 DataFrame
 Row │ variable    mean     min          median  max             nmissing  eltype
     │ Symbol      Union…   Any          Union…  Any             Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ profession           RR.engineer          welfare.worker         0  String31
   2 │ type                 bc                   wc                     0  String7
   3 │ income      41.8667  7            42.0    81                     0  Int64
   4 │ education   52.5556  7            45.0    100                    0  Int64
   5 │ prestige    47.6889  3            41.0    97                     0  Int64

As you can see from the describe() output, the mean of prestige (the percentage of respondents who classified the occupation as at least "good" with respect to prestige) is around 47%, with a median of 41%. But the prestige variable is very dispersed and actually has a bimodal distribution:

using AlgebraOfGraphics

f = Figure()
plt = data(duncan) * mapping(:prestige) * AlgebraOfGraphics.density()
draw!(f[1, 1], plt)


Density Plot of prestige

Besides that, the mean prestige per type shows us where the source of variation might come from:

gdf = groupby(duncan, :type)
f = Figure()
plt =
    data(combine(gdf, :prestige => mean)) * mapping(:type, :prestige_mean) * visual(BarPlot)
draw!(f[1, 1], plt)


Mean prestige per type

Now let's instantiate our model with the data:

X = Matrix(select(duncan, [:income, :education]))
y = duncan[:, :prestige]
model = robustreg(X, y);

And, finally, we will sample from the Turing model. We will be using the default NUTS() sampler with 1_000 samples per chain, on 4 Markov chains using multiple threads via MCMCThreads():

chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain)
Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ess_per_sec
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64       Float64

           α   -7.5790    3.0116    0.0628   2309.7654   1978.4710    1.0042       70.5574
        β[1]    0.7719    0.1024    0.0025   1748.2750   1843.0804    1.0028       53.4053
        β[2]    0.4464    0.0836    0.0020   1702.5880   1820.3078    1.0052       52.0097
           σ    7.1578    1.7728    0.0477   1225.3095    826.2654    1.0027       37.4300
           ν    2.8148    2.4661    0.0635   1452.4245   1333.5578    1.0009       44.3678

We had no problems with the Markov chains, as all the rhat values are well below 1.01 (or above 0.99). Also note that the degrees of freedom parameter ν has a posterior mean of around 3, which indeed signals that our model needed fat tails to make robust inferences. Our model has an error σ of around 7, so it estimates occupations' prestige to within ±7. The intercept α is the basal occupation's prestige value: each occupation starts at around -7 ± 7 prestige before we add the coefficients multiplied by the occupation's independent variables. And for our coefficients β, quantile() tells us the uncertainty around their estimates:

quantile(chain)
Quantiles
  parameters       2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol    Float64   Float64   Float64   Float64   Float64

           α   -13.3670   -9.6056   -7.5776   -5.5600   -1.7272
        β[1]     0.5567    0.7114    0.7743    0.8380    0.9659
        β[2]     0.2796    0.3936    0.4460    0.4993    0.6170
           σ     4.1120    5.8269    7.0177    8.3067   10.9849
           ν     1.0269    1.6074    2.1733    3.1841    8.1579

  • β[1] – the first column of X, income, has a 95% credible interval from roughly 0.56 to 0.97. This means that each additional percentage point of an occupation's members earning more than US$ 3,500 per year is associated with an increase of roughly 0.6 to 1.0 in the occupation's prestige.

  • β[2] – the second column of X, education, has a 95% credible interval from roughly 0.28 to 0.62. So we expect that each additional percentage point of an occupation's members holding a high school diploma increases the occupation's prestige by roughly 0.3 to 0.6.

That's how you interpret 95% credible intervals from a quantile() output of a robust regression Chains object.
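
As a final, purely illustrative sketch of how these coefficients combine, below is a posterior-mean point prediction for a hypothetical occupation with income = 60 and education = 80 (made-up values); the indexing assumes the usual MCMCChains parameter-name symbols:

# posterior means, averaged over all iterations and chains
post_α = mean(chain[:α])
post_β1 = mean(chain[Symbol("β[1]")])
post_β2 = mean(chain[Symbol("β[2]")])

# point prediction for the hypothetical occupation
predicted_prestige = post_α + post_β1 * 60 + post_β2 * 80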

References

Duncan, O. D. (1961). A socioeconomic index for all occupations. Class: Critical Concepts, 1, 388–426.