Bayesian Ordinal Regression

Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The second is ordinal regression.

A ordinal regression behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables X\mathbf{X} by the estimated coefficients β\boldsymbol{\beta}, but now we have not only one intercept but several intercepts αk\alpha_k for kKk \in K.

We use ordinal regression when our dependent variable is ordinal. That means it has different that have a "natural order"**. Most important, the distance between values is not the same. For example, imagine a pain score scale that goes from 1 to 10. The distance between 1 and 2 is different from the distance 9 to 10. Another example is opinion pools with their ubiquously disagree-agree range of plausible values. These are also known as Likert scale variables. The distance between "disagree" to "not agree or disagree" is different than the distance between "agree" and "strongly agree".

This assumption is what we call the "metric" assumption, also called as the equidistant assumption. Almost always when we model an ordinal dependent variable this assumption is violated. Thus, we cannot blindly employ linear regression here.

How to deal with Ordered Discrete Dependent Variable?

So, how we deal with ordered discrete responses in our dependent variable? This is similar with the previous logistic regression approach. We have to do a non-linear transformation of the dependent variable.

Cumulative Distribution Function (CDF)

In the case of ordinal regression, we need to first transform the dependent variable into a cumulative scale. We need to calculate the cumulative distribution function (CDF) of our dependent variable:

P(Yy)=i=yminyP(Y=i)P(Y \leq y) = \sum^y_{i=y_{\text{min}}} P(Y = i)

The CDF is a monotonic increasing function that depicts the probability of our random variable YY taking values less than a certain value. In our case, the discrete ordinal case, these values can be represented as positive integers ranging from 1 to the length of possible values. For instance, a 6-categorical ordered response variable will have 6 possible values, and all their CDFs will lie between 0 and 1. Furthermore, their sum will be 1; since it represents the total probability of the variable taking any of the possible values, i.e. 100%.

Log-cumulative-odds

That is still not enough, we need to apply the logit function to the CDF:

logit(x)=logistic1(x)=ln(x1x)\mathrm{logit}(x) = \mathrm{logistic}^{-1}(x) = \ln\left(\frac{x}{1 -x}\right)

where ln\ln is the natural logarithm function.

The logit is the inverse of the logistic transformation, it takes as a input any number between 0 and 1 (where a probability is the perfect candidate) and outputs a real number, which we call the log-odds.

Since we are taking the log-odds of the CDF, we can call this complete transformation as log-odds of the CDF, or log-cumulative-odds.

K1K-1 Intercepts

Now, the next question is: what do we do with the log-cumulative-odds?

We need the log-cumulative-odds because it allows us to construct different intercepts for the possible values our ordinal dependent variable. We create an unique intercept for each possible outcome kKk \in K.

Notice that the highest probable value of YY will always have a log-cumulative-odds of \infty, since for p=1p=1:

lnp1p=ln111=ln0=\ln \frac{p}{1-p} = \ln \frac{1}{1-1} = \ln 0 = \infty

Thus, we only need K1K-1 intercepts for a KK possible depedent variables' response values. These are known as cut points.

Each intercept implies a CDF for each value KK. This allows us to violate the equidistant assumption absent in most ordinal variables.

Each intercept implies a log-cumulative-odds for each kKk \in K. We also need to undo the cumulative nature of the K1K-1 intercepts. We can accomplish this by first converting a log-cumulative-odds back to a cumulative probability. This is done by undoing the logit transformation and applying the logistic function:

logit1(x)=logistic(x)=11+ex\mathrm{logit}^{-1}(x) = \mathrm{logistic}(x) = \frac{1}{1 + e^{-x}}

Then, finally, we can remove the cumulative from "cumulative probability" by subtraction of each of the kk cut points by their previous k1k-1 cut point:

P(Y=k)=P(Yk)P(Yk1)P(Y=k) = P(Y \leq k) - P(Y \leq k-1)

where YY is the depedent variable and kKk \in K are the cut points for each intercept.

Let me show you an example with some syntethic data.

using DataFrames
using CairoMakie
using AlgebraOfGraphics
using Distributions
using StatsFuns: logit

Here we have a discrete variable x with 6 possible ordered values as response. The values range from 1 to 6 having probability, respectively: 10%, 15%, 33%, 25%, 10%, and 7%; represented with the probs vector. The underlying distribution is represented by a Categorical distribution from Distributions.jl, which takes a vector of probabilities as parameters (probs).

For each value we are calculating:

  1. Probability Mass Function with the pdf function

  2. Cumulative Distribution Function with the cdf function

  3. Log-cumulative-odds with the logit transformation of the CDF

In the plot below there are 3 subplots:

  • Upper corner: histogram of x

  • Left lower corner: CDF of x

  • Right lower corner: log-cumulative-odds of x

let
    probs = [0.10, 0.15, 0.33, 0.25, 0.10, 0.07]
    dist = Categorical(probs)
    x = 1:length(probs)
    x_pmf = pdf.(dist, x)
    x_cdf = cdf.(dist, x)
    x_logodds_cdf = logit.(x_cdf)
    df = DataFrame(;
        x,
        x_pmf,
        x_cdf,
        x_logodds_cdf)
    labels = ["CDF", "Log-cumulative-odds"]
    fig = Figure()
    plt1 = data(df) *
           mapping(:x, :x_pmf) *
           visual(BarPlot)
    plt2 = data(df) *
           mapping(:x,
               [:x_cdf, :x_logodds_cdf];
               col=dims(1) => renamer(labels)) *
           visual(ScatterLines)
    axis = (; xticks=1:6)
    draw!(fig[1, 2:3], plt1; axis)
    draw!(fig[2, 1:4], plt2;
        axis,
        facet=(; linkyaxes=:none))
    fig
end

Ordinal Dependent Variable

As we can see, we have K1K-1 (in our case 61=56-1=5) intercept values in log-cumulative-odds. You can carly see that these values they violate the equidistant assumption for metric response values. The spacing between the cut points are not the same, they vary.

Adding Coefficients β\boldsymbol{\beta}

Ok, the K1K-1 intercepts α\boldsymbol{\alpha} are done. Now let's add coefficients to act as covariate effects in our ordinal regression model.

We transformed everything into log-odds scale so that we can add effects (coefficients multiplying a covariate) or basal rates (intercepts) together. For each kK1k \in K-1, we calculate:

ϕk=αk+βixi\phi_k = \alpha_k + \beta_i x_i

where αk\alpha_k is the log-cumulative-odds for the kK1k \in K-1 intercepts, βi\beta_i is the coefficient for the iith covariate xx. Finally, ϕk\phi_k represents the linear predictor for the kkth intercept.

Observe that the coefficient β\beta is being added to a log-cumulative-odds, such that, it will be expressed in a log-cumulative-odds also.

We can express it in matrix form:

ϕ=α+Xβ\boldsymbol{\phi} = \boldsymbol{\alpha} + \mathbf{X} \cdot \boldsymbol{\beta}

where ϕ\boldsymbol{\phi}, α\boldsymbol{\alpha} and β\boldsymbol{\beta} are vectors and X\mathbf{X} is the data matrix where each row is an observation and each column a covariate.

This still obeys the ordered constraint on the dependent variable possible values.

How to Interpret Coefficient β\boldsymbol{\beta}?

Now, suppose we have found our ideal value for our β\boldsymbol{\beta}. How we would interpret our β\boldsymbol{\beta} estimated values?

First, to recap, β\boldsymbol{\beta} measures the strength of association between the covariate x\mathbf{x} and depedent variable y\mathbf{y}. But, β\boldsymbol{\beta} is nested inside a non-linear transformation called logistic function:

logistic(β)=11+eβ\mathrm{logistic}(\boldsymbol{\beta}) = \frac{1}{1 + e^{-\boldsymbol{\beta}}}

So, our first step is to undo the logistic function. There is a function that is called logit function that is the inverse of the logistic function:

logistic1(x)=logit(x)=ln(x1x)\mathrm{logistic}^{-1}(x) = \mathrm{logit}(x) = \ln\left(\frac{x}{1 -x}\right)

where ln\ln is the natural logarithm function.

If we analyze closely the logit function we can find that inside the ln\ln there is a disguised odds in the x1x\frac{x}{1 -x}. Hence, our β\boldsymbol{\beta} can be interpreted as the logarithm of the odds, or in short form: the log-odds.

We already saw how odds, log-odds, and probability are related in the previous logistic regression tutorial. So you might want to go back there to get the full explanation.

The log-odds are the key to interpret coefficient β\boldsymbol{\beta}**. Any positive value of β\beta means that there exists a positive association between xx and yy, while any negative values indicate a negative association between xx and yy. Values close to 0 demonstrates the lack of association between xx and yy.

Likelihood

We have almost everything we need for our ordinal regression. We are only missing a final touch. Currently our logistic function outputs a vector of probabilities that sums to 1.

All of the intercepts αk\alpha_k along with the coefficients βi\beta_i are in log-cumulative-odds scale. If we apply the logistic function to the linear predictors ϕk\phi_k we get K1K-1 probabilities: one for each ϕk\phi_k

We need a likelihood that can handle a vector of probabilities and outputs a single discrete value. The categorical distribution is the perfect candidate.

Bayesian Ordinal Regression

Now we have all the components for our Bayesian ordinal regression specification:

yCategorical(p)p=Logistic(ϕ)ϕ=α+Xβα1=CDF(y1)αk=CDF(yk)CDF(yk1)for1<k<K1αK1=1CDF(yK1) \begin{aligned} \mathbf{y} &\sim \text{Categorical}(\mathbf{p}) \\ \mathbf{p} &= \text{Logistic}(\boldsymbol{\phi}) \\ \boldsymbol{\phi} &= \boldsymbol{\alpha} + \mathbf{X} \cdot \boldsymbol{\beta} \\ \alpha_1 &= \text{CDF}(y_1) \\ \alpha_k &= \text{CDF}(y_k) - \text{CDF}(y_{k-1}) \quad \text{for} \quad 1 < k < K-1 \\ \alpha_{K-1} &= 1 - \text{CDF}(y_{K-1}) \end{aligned}

where:

  • y\mathbf{y} – ordered discrete dependent variable.

  • p\mathbf{p} – probability vector of length KK.

  • KK – number of possible values y\mathbf{y} can take, i.e. number of ordered discrete values.

  • ϕ\boldsymbol{\phi} – log-cumulative-odds, i.e. cut points considering the intercepts and covariates effect.

  • αk\alpha_k – intercept in log-cumulative-odds for each kK1k \in K-1.

  • X\mathbf{X} – covariate data matrix.

  • β\boldsymbol{\beta} – coefficient vector of the same length as the number of columns in X\mathbf{X}.

  • logistic\mathrm{logistic} – logistic function.

  • CDF\mathrm{CDF}cumulative distribution function.

What remains is to specify the model parameters' prior distributions:

  • Prior Distribution of α\boldsymbol{\alpha} – Knowledge we possess regarding the model's intercepts.

  • Prior Distribution of β\boldsymbol{\beta} – Knowledge we possess regarding the model's independent variables' coefficients.

Our goal is to instantiate an ordinal regression with the observed data (y\mathbf{y} and X\mathbf{X}) and find the posterior distribution of our model's parameters of interest (α\boldsymbol{\alpha} and β\boldsymbol{\beta}). This means to find the full posterior distribution of:

P(θy)=P(α,βy) P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\boldsymbol{\alpha}, \boldsymbol{\beta} \mid \mathbf{y})

Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter σ\sigma in our ordinal regression. This is due to the Categorical distribution not having a "scale" parameter such as the σ\sigma parameter in the Gaussian/normal distribution.

This is easily accomplished with Turing:

using Turing
using Bijectors
using LazyArrays
using LinearAlgebra
using Random: seed!
seed!(123)

@model function ordreg(X, y; predictors=size(X, 2), ncateg=maximum(y))
    #priors
    cutpoints ~ Bijectors.ordered(filldist(TDist(3) * 5, ncateg - 1))
    β ~ filldist(TDist(3) * 2.5, predictors)

    #likelihood
    y ~ arraydist([OrderedLogistic(X[i, :] ⋅ β, cutpoints) for i in 1:length(y)])
end;

First, let's deal with the new stuff in our model: the Bijectors.ordered. As I've said in the 4. How to use Turing, Turing has a rich ecossystem of packages. Bijectors implements a set of functions for transforming constrained random variables (e.g. simplexes, intervals) to Euclidean space. Here we are defining cutpoints as a ncateg - 1 vector of Student-tt distributions with mean 0, standard deviation 5 and degrees of freedom ν=3\nu = 3. Remember that we only need K1K-1 cutpoints for all of our KK intercepts. And we are also contraining it to be an ordered vector with Bijectors.ordered, such that for all cutpoints cic_i we have c1<c2<...ck1c_1 < c_2 < ... c_{k-1}.

As before, we are giving β\boldsymbol{\beta} a very weakly informative priors of a Student-tt distribution centered on 0 with variance 1 and degrees of freedom ν=3\nu = 3. That wide-tailed tt distribution will cover all possible values for our coefficients. Remember the Student-tt also has support over all the real number line (,+)\in (-\infty, +\infty). Also the filldist() is a nice Turing's function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.

Finally, in the likelihood, Turing's arraydist() function wraps an array of distributions returning a new distribution sampling from the individual distributions. And we use some indexing inside an array literal.

Example - Esoph

For our example, I will use a famous dataset called esoph (Breslow & Day, 1980), which is data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. It has records for 88 age/alcohol/tobacco combinations:

  • agegp: Age group

    • 1: 25-34 years

    • 2: 35-44 years

    • 3: 45-54 years

    • 4: 55-64 years

    • 5: 65-74 years

    • 6: 75+ years

  • alcgp: Alcohol consumption

    • 1: 0-39 g/day

    • 2: 40-79 g/day

    • 3: 80-119 g/day

    • 4: 120+ g/day

  • tobgp: Tobacco consumption

    • 1: 0-9 g/day

    • 2: 10-19 g/day

    • 3: 20-29 g/day

    • 4: 30+ g/day

  • ncases: Number of cases

  • ncontrols: Number of controls

Ok let's read our data with CSV.jl and output into a DataFrame from DataFrames.jl:

using DataFrames, CSV, HTTP

url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/esoph.csv"
esoph = CSV.read(HTTP.get(url).body, DataFrame)
describe(esoph)
5×7 DataFrame
 Row │ variable   mean     min        median  max     nmissing  eltype
     │ Symbol     Union…   Any        Union…  Any     Int64     DataType
─────┼───────────────────────────────────────────────────────────────────
   1 │ agegp               25-34              75+            0  String7
   2 │ alcgp               0-39g/day          80-119         0  String15
   3 │ tobgp               0-9g/day           30+            0  String15
   4 │ ncases     2.27273  0          1.0     17             0  Int64
   5 │ ncontrols  8.80682  0          4.0     60             0  Int64

As you can see from the describe() output 58% of the respondents switched wells and 42% percent of respondents somehow are engaged in community organizations. The average years of education of the household's head is approximate 5 years and ranges from 0 (no education at all) to 17 years. The distance to safe drinking water is measured in meters and averages 48m ranging from less than 1m to 339m. Regarding arsenic levels I cannot comment because the only thing I know that it is toxic and you probably would never want to have your well contaminated with it. Here, we believe that all of those variables somehow influence the probability of a respondent switch to a safe well.

Now let's us instantiate our model with the data. But here I need to do some data wrangling to create the data matrix X. Specifically, I need to convert all of the categorical variables to integer values, representing the ordinal values of both our independent and also dependent variables:

using CategoricalArrays

transform!(
    esoph,
    :agegp =>
        x -> categorical(
            x; levels=["25-34", "35-44", "45-54", "55-64", "65-74", "75+"], ordered=true
        ),
    :alcgp =>
        x -> categorical(x; levels=["0-39g/day", "40-79", "80-119", "120+"], ordered=true),
    :tobgp =>
        x -> categorical(x; levels=["0-9g/day", "10-19", "20-29", "30+"], ordered=true);
    renamecols=false
)
transform!(esoph, [:agegp, :alcgp, :tobgp] .=> ByRow(levelcode); renamecols=false)

X = Matrix(select(esoph, [:agegp, :alcgp]))
y = esoph[:, :tobgp]
model = ordreg(X, y);

And, finally, we will sample from the Turing model. We will be using the default NUTS() sampler with 1_000 samples, with 4 Markov chains using multiple threads MCMCThreads():

chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain)
Summary Statistics
    parameters      mean       std   naive_se      mcse         ess      rhat   ess_per_sec
        Symbol   Float64   Float64    Float64   Float64     Float64   Float64       Float64

  cutpoints[1]   -1.4068    0.6390     0.0101    0.0125   2477.8651    1.0002      110.1861
  cutpoints[2]   -0.2369    0.6221     0.0098    0.0118   2550.2363    1.0001      113.4043
  cutpoints[3]    0.8158    0.6283     0.0099    0.0125   2597.9588    1.0001      115.5264
          β[1]   -0.0719    0.1151     0.0018    0.0020   2781.9947    1.0001      123.7102
          β[2]   -0.0718    0.1703     0.0027    0.0028   3285.7691    1.0003      146.1121

We had no problem with the Markov chains as all the rhat are well below 1.01 (or above 0.99). Note that the coefficients are in log-odds scale. They are the natural log of the odds[1], and odds is defined as:

odds=p1p \text{odds} = \frac{p}{1-p}

where pp is a probability. So log-odds is defined as:

log(odds)=log(p1x) \log(\text{odds}) = \log \left( \frac{p}{1-x} \right)

So in order to get odds from a log-odds we must undo the log operation with a exponentiation. This translates to:

odds=exp(log(odds)) \text{odds} = \exp ( \log ( \text{odds} ))

We can do this with a transformation in a DataFrame constructed from a Chains object:

using Chain

@chain quantile(chain) begin
    DataFrame
    select(_,
        :parameters,
        names(_, r"%") .=> ByRow(exp),
        renamecols=false)
end
5×6 DataFrame
 Row │ parameters    2.5%       25.0%     50.0%     75.0%     97.5%
     │ Symbol        Float64    Float64   Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ cutpoints[1]  0.0718185  0.158259  0.241939  0.372134  0.867115
   2 │ cutpoints[2]  0.243682   0.516606  0.786366  1.18029   2.74772
   3 │ cutpoints[3]  0.678895   1.47193   2.23704   3.40228   8.0828
   4 │ β[1]          0.741014   0.862452  0.927978  1.00343   1.16994
   5 │ β[2]          0.667512   0.829369  0.929348  1.04207   1.3054

Our interpretation of odds is the same as in betting games. Anything below 1 signals a unlikely probability that yy will be 11. And anything above 1 increases the probability of yy being 11, while 1 itself is a neutral odds for yy being either 11 or 00. Since I am not a gambling man, let's talk about probabilities. So I will create a function called logodds2prob() that converts log-odds to probabilities:

function logodds2prob(logodds::Float64)
    return exp(logodds) / (1 + exp(logodds))
end

@chain quantile(chain) begin
    DataFrame
    select(_,
        :parameters,
        names(_, r"%") .=> ByRow(logodds2prob),
        renamecols=false)
end
5×6 DataFrame
 Row │ parameters    2.5%       25.0%     50.0%     75.0%     97.5%
     │ Symbol        Float64    Float64   Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ cutpoints[1]  0.0670062  0.136635  0.194807  0.271208  0.464414
   2 │ cutpoints[2]  0.195936   0.340633  0.440204  0.541345  0.733171
   3 │ cutpoints[3]  0.40437    0.595458  0.691076  0.772845  0.889902
   4 │ β[1]          0.425622   0.463073  0.481322  0.500857  0.539158
   5 │ β[2]          0.400304   0.453364  0.48169   0.510301  0.566236

There you go, much better now. Let's analyze our results. The cutpoints is the basal rate of the probability of our dependent variable having values less than a certain value. For example the cutpoint for having values less than 2 which its code represents the tobacco comsumption of 10-19 g/day has a median value of 20%.

Now let's take a look at our coefficients All coefficients whose 95% credible intervals captures the value 12=0.5\frac{1}{2} = 0.5 tells that the effect on the propensity of tobacco comsumption is inconclusive. It is pretty much similar to a 95% credible interval that captures the 0 in the linear regression coefficients.

That's how you interpret 95% credible intervals from a quantile() output of a ordinal regression Chains object converted from log-odds to probability.

Footnotes

[1] actually the logit function or the log-odds is the logarithm of the odds p1p\frac{p}{1-p} where pp is a probability.

References

Breslow, N. E. & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.