Bayesian Ordinal Regression
Leaving the universe of linear models, we start to venture into generalized linear models (GLMs). The second of these is ordinal regression.
An ordinal regression behaves very much like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables $\mathbf{X}$ by the estimated coefficients $\boldsymbol{\beta}$, but now we have not only one intercept but several intercepts $\alpha_k$ for $k \in K$.
We use ordinal regression when our dependent variable is ordinal. That means it has different values that have a "natural order". Most importantly, the distance between values is not necessarily the same. For example, imagine a pain score scale that goes from 1 to 10. The distance between 1 and 2 is not necessarily the same as the distance from 9 to 10. Another example is opinion polls with their ubiquitous disagree-agree range of plausible values. These are also known as Likert scale variables. The distance from "disagree" to "neither agree nor disagree" is not the same as the distance from "agree" to "strongly agree".
This assumption is what we call the "metric" assumption, also called the equidistant assumption. Almost always, when we model an ordinal dependent variable, this assumption is violated. Thus, we cannot blindly employ linear regression here.
How to Deal with an Ordered Discrete Dependent Variable?
So, how do we deal with ordered discrete responses in our dependent variable? This is similar to the previous logistic regression approach: we have to do a non-linear transformation of the dependent variable.
Cumulative Distribution Function (CDF)
In the case of ordinal regression, we need to first transform the dependent variable into a cumulative scale, i.e. calculate the cumulative distribution function (CDF) of our dependent variable:

$$P(Y \leq y) = \sum_{i=y_{\min}}^{y} P(Y = i)$$
The CDF is a monotonically increasing function that depicts the probability of our random variable taking values less than or equal to a certain value. In our discrete ordinal case, these values can be represented as positive integers ranging from 1 to the number of possible values. For instance, a 6-categorical ordered response variable will have 6 possible values, and all their CDFs will lie between 0 and 1. Furthermore, the CDF of the largest value will be exactly 1, since it represents the total probability of the variable taking any of the possible values, i.e. 100%.
Log-cumulative-odds
That is still not enough. We also need to apply the logit function to the CDF:

$$\operatorname{logit}(x) = \ln \left( \frac{x}{1 - x} \right)$$

where $\ln$ is the natural logarithm function.
The logit is the inverse of the logistic transformation: it takes as input any number between 0 and 1 (where a probability is the perfect candidate) and outputs a real number, which we call the log-odds.
Since we are taking the log-odds of the CDF, we can call this complete transformation the log-odds of the CDF, or log-cumulative-odds.
Intercepts
Now, the next question is: what do we do with the log-cumulative-odds?
We need the log-cumulative-odds because it allows us to construct different intercepts for the possible values of our ordinal dependent variable. We create a unique intercept for each possible outcome $k \in K$.
Notice that the highest possible value of $Y$ will always have a log-cumulative-odds of $\infty$, since for $P(Y \leq y_{\max}) = 1$:

$$\ln \left( \frac{1}{1 - 1} \right) = \ln \left( \frac{1}{0} \right) = \infty$$
Thus, we only need $K - 1$ intercepts for the $K$ possible response values of the dependent variable. These are known as cut points.
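A tiny sketch verifies this: the CDF of the largest category is 1, and the logit of 1 is infinite, which is why that category needs no cut point of its own:

using StatsFuns: logit

logit(1.0)   # Inf: the largest category has no finite cut point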
Each intercept implies a CDF for each value $k$. This allows us to abandon the equidistant assumption, which does not hold for most ordinal variables.
Each intercept implies a log-cumulative-odds for each $k$. We also need to undo the cumulative nature of the $K - 1$ intercepts. We can accomplish this by first converting a log-cumulative-odds back to a cumulative probability, i.e. undoing the logit transformation by applying the logistic function:

$$\operatorname{logit}^{-1}(x) = \operatorname{logistic}(x) = \frac{1}{1 + e^{-x}}$$
Then, finally, we can remove the "cumulative" from "cumulative probability" by subtracting, at each cut point, the cumulative probability of the previous cut point:

$$P(Y = k) = P(Y \leq k) - P(Y \leq k - 1)$$

where $Y$ is the dependent variable and $k \in K$ are the cut points for each intercept.
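As a quick illustration with made-up cut point values (these are not estimates from any model), here is how the logistic-then-difference step recovers ordinary category probabilities from log-cumulative-odds:

using StatsFuns: logistic

cutpoints = [-2.0, -0.5, 1.0, 2.5]          # hypothetical cut points for K = 5 categories
cumprobs = vcat(logistic.(cutpoints), 1.0)  # P(Y ≤ k); the last category always has CDF 1
probs = vcat(cumprobs[1], diff(cumprobs))   # P(Y = k) = P(Y ≤ k) - P(Y ≤ k - 1)
sum(probs)                                  # ≈ 1.0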
Let me show you an example with some synthetic data.
using DataFrames
using CairoMakie
using AlgebraOfGraphics
using Distributions
using StatsFuns: logit
Here we have a discrete variable x with 6 possible ordered values as response. The values range from 1 to 6 with probability, respectively: 10%, 15%, 33%, 25%, 10%, and 7%; represented with the probs vector. The underlying distribution is represented by a Categorical distribution from Distributions.jl, which takes a vector of probabilities as parameter (probs).
For each value we are calculating:

- Probability Mass Function with the pdf function
- Cumulative Distribution Function with the cdf function
- Log-cumulative-odds with the logit transformation of the CDF
In the plot below there are 3 subplots:

- Upper corner: histogram of x
- Lower left corner: CDF of x
- Lower right corner: log-cumulative-odds of x
let
    probs = [0.10, 0.15, 0.33, 0.25, 0.10, 0.07]
    dist = Categorical(probs)
    x = 1:length(probs)
    x_pmf = pdf.(dist, x)
    x_cdf = cdf.(dist, x)
    x_logodds_cdf = logit.(x_cdf)
    df = DataFrame(; x, x_pmf, x_cdf, x_logodds_cdf)
    labels = ["CDF", "Log-cumulative-odds"]
    f = Figure()
    plt1 = data(df) * mapping(:x, :x_pmf) * visual(BarPlot)
    plt2 =
        data(df) *
        mapping(:x, [:x_cdf, :x_logodds_cdf]; col=dims(1) => renamer(labels)) *
        visual(ScatterLines)
    axis = (; xticks=1:6)
    draw!(f[1, 2:3], plt1; axis)
    draw!(f[2, 1:4], plt2; axis, facet=(; linkyaxes=:none))
    f
end
As we can see, we have $K - 1$ (in our case $6 - 1 = 5$) finite intercept values in log-cumulative-odds. You can clearly see that these values violate the equidistant assumption for metric response values: the spacing between the cut points is not the same; it varies.
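To make this concrete, here is a small sketch, reusing the probs vector from above, that computes the gaps between successive finite cut points; they are clearly unequal:

using Distributions
using StatsFuns: logit

probs = [0.10, 0.15, 0.33, 0.25, 0.10, 0.07]
dist = Categorical(probs)
diff(logit.(cdf.(dist, 1:5)))   # the gaps between the 5 finite cut points all differ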
Adding Coefficients
Ok, the intercepts are done. Now let's add coefficients to act as covariate effects in our ordinal regression model.
We transformed everything into the log-odds scale so that we can add effects (coefficients multiplying a covariate) or basal rates (intercepts) together. For each $k \in K - 1$, we calculate:

$$\phi_k = \alpha_k + \beta_i x_i$$

where $\alpha_k$ is the log-cumulative-odds for the $k \in K - 1$ intercepts, and $\beta_i$ is the coefficient for the $i$th covariate $x_i$. Finally, $\phi_k$ represents the linear predictor for the $k$th intercept.
Observe that the coefficient $\beta$ is being added to a log-cumulative-odds, so it will also be expressed in log-cumulative-odds.
We can express it in matrix form:

$$\boldsymbol{\phi} = \boldsymbol{\alpha} + \mathbf{X} \cdot \boldsymbol{\beta}$$

where $\boldsymbol{\phi}$, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are vectors and $\mathbf{X}$ is the data matrix, in which each row is an observation and each column a covariate.
This still obeys the ordered constraint on the dependent variable possible values.
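Here is a minimal numeric sketch of this matrix form. All values for X, α and β below are made up solely to show the broadcasting involved; they are not estimates:

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # 3 observations, 2 covariates (made-up data)
β = [0.5, -0.25]                  # hypothetical coefficients
α = [-1.0, 0.0, 1.0]              # hypothetical cut points (K = 4)
ϕ = X * β .+ α'                   # 3×3 matrix: rows are observations, columns are cut points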
How to Interpret Coefficient $\beta$?
Now, suppose we have found our ideal value for $\boldsymbol{\beta}$. How would we interpret our estimated $\boldsymbol{\beta}$ values?
First, to recap, $\beta$ measures the strength of association between the covariate $x$ and the dependent variable $y$. But $\beta$ is nested inside a non-linear transformation called the logistic function:

$$\operatorname{logistic}(x) = \frac{1}{1 + e^{-x}}$$

So, our first step is to undo the logistic function. Its inverse is called the logit function:

$$\operatorname{logit}(x) = \ln \left( \frac{x}{1 - x} \right)$$

where $\ln$ is the natural logarithm function.
If we look closely at the logit function, we can see that inside the $\ln$ there is a disguised odds: $\frac{x}{1 - x}$. Hence, our $\beta$ can be interpreted as the logarithm of the odds, or, in short form, the log-odds.
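A one-line sanity check of that identity (just a sketch with an arbitrary probability value):

using StatsFuns: logit

p = 0.7
logit(p) ≈ log(p / (1 - p))   # true: the logit is exactly the log of the odds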
We already saw how odds, log-odds, and probability are related in the previous logistic regression tutorial. So you might want to go back there to get the full explanation.
The log-odds are the key to interpreting the coefficient $\beta$. Any positive value of $\beta$ means that there is a positive association between $x$ and $y$, while any negative value indicates a negative association between $x$ and $y$. Values close to 0 demonstrate the lack of association between $x$ and $y$.
Likelihood
We have almost everything we need for our ordinal regression; we are only missing a final touch. All of the intercepts $\alpha_k$ along with the coefficients $\beta_i$ are in log-cumulative-odds scale. If we apply the logistic function to the linear predictors $\phi_k$ we get $K - 1$ cumulative probabilities, one for each $\phi_k$, and, after undoing their cumulative nature, a vector of $K$ probabilities that sums to 1.
We need a likelihood that can handle a vector of probabilities and output a single discrete value. The categorical distribution is the perfect candidate.
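As a small sketch, reusing the probs vector from the synthetic example above, the Categorical distribution takes a probability vector and yields a single discrete outcome:

using Distributions

probs = [0.10, 0.15, 0.33, 0.25, 0.10, 0.07]
lik = Categorical(probs)
rand(lik)      # a single draw in 1:6
pdf(lik, 3)    # probability of observing the value 3, i.e. 0.33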
Bayesian Ordinal Regression
Now we have all the components for our Bayesian ordinal regression specification:

$$
\begin{aligned}
\mathbf{y} &\sim \text{Categorical}(\mathbf{p}) \\
\mathbf{p} &= \text{Logistic}(\boldsymbol{\phi}) \\
\boldsymbol{\phi} &= \boldsymbol{\alpha} + \mathbf{X} \cdot \boldsymbol{\beta} \\
\alpha_1 &= \text{CDF}(y_1) \\
\alpha_k &= \text{CDF}(y_k) - \text{CDF}(y_{k-1}) \quad \text{for} \quad 1 < k < K - 1 \\
\alpha_{K-1} &= 1 - \text{CDF}(y_{K-1})
\end{aligned}
$$
where:
$\mathbf{y}$ – ordered discrete dependent variable.
$\mathbf{p}$ – probability vector of length $K$.
$K$ – number of possible values $\mathbf{y}$ can take, i.e. the number of ordered discrete values.
$\boldsymbol{\phi}$ – log-cumulative-odds, i.e. the cut points considering the intercepts and the covariates' effects.
$\alpha_k$ – intercept in log-cumulative-odds for each $k \in K - 1$.
$\mathbf{X}$ – covariate data matrix.
$\boldsymbol{\beta}$ – coefficient vector of the same length as the number of columns in $\mathbf{X}$.
$\text{Logistic}$ – logistic function.
$\text{CDF}$ – cumulative distribution function.
What remains is to specify the model parameters' prior distributions:
Prior Distribution of $\boldsymbol{\alpha}$ – Knowledge we possess regarding the model's intercepts.
Prior Distribution of $\boldsymbol{\beta}$ – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate an ordinal regression with the observed data ($\mathbf{y}$ and $\mathbf{X}$) and find the posterior distribution of our model's parameters of interest ($\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$). This means finding the full posterior distribution of:

$$P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\boldsymbol{\alpha}, \boldsymbol{\beta} \mid \mathbf{y})$$
Note that, contrary to the linear regression, which uses a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our ordinal regression. This is due to the Categorical distribution not having a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.
This is easily accomplished with Turing:
using Turing
using Bijectors
using LazyArrays
using LinearAlgebra
using Random: seed!
using Bijectors: transformed, OrderedBijector
seed!(123)
@model function ordreg(X, y; predictors=size(X, 2), ncateg=maximum(y))
    #priors
    cutpoints ~ transformed(filldist(TDist(3) * 5, ncateg - 1), OrderedBijector())
    β ~ filldist(TDist(3) * 2.5, predictors)

    #likelihood
    return y ~ arraydist([OrderedLogistic(X[i, :] ⋅ β, cutpoints) for i in 1:length(y)])
end;
First, let's deal with the new stuff in our model: Bijectors.jl's transformed and OrderedBijector. As I've said in 4. How to use Turing, Turing has a rich ecosystem of packages. Bijectors implements a set of functions for transforming constrained random variables (e.g. simplexes, intervals) to Euclidean space. Here we are defining cutpoints as a ncateg - 1 vector of Student-$t$ distributions with mean 0, standard deviation 5 and degrees of freedom $\nu = 3$. Remember that we only need $K - 1$ cut points for our $K$ intercepts. And we are also constraining it to be an ordered vector with transformed(d, OrderedBijector()), such that for all cut points $c_i$ we have $c_1 < c_2 < \dots < c_{K-1}$.
As before, we are giving very weakly informative priors to the coefficients $\boldsymbol{\beta}$: a Student-$t$ distribution centered on 0 with standard deviation 2.5 and degrees of freedom $\nu = 3$. That wide-tailed distribution will cover all possible values for our coefficients. Remember that the Student-$t$ also has support over the whole real number line, i.e. $(-\infty, +\infty)$. Also, filldist() is a nice Turing function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
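Here is a quick sketch (not part of the original analysis) of what filldist() gives you, using the same scaled Student-$t$ as the coefficients' prior above:

using Turing

d = filldist(TDist(3) * 2.5, 3)   # 3 iid scaled Student-t distributions bundled into one
rand(d)                           # returns a length-3 vector, one draw per component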
Finally, in the likelihood, Turing's arraydist() function wraps an array of distributions, returning a new distribution that samples from the individual distributions. We build that array with a comprehension, pairing each observation's linear predictor X[i, :] ⋅ β with the shared cutpoints inside an OrderedLogistic distribution.
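If the arraydist() formulation feels opaque, an equivalent, though typically slower, way of writing the same likelihood is to loop over the observations. This is only a sketch for illustration (the name ordreg_loop is mine, not part of the original code), reusing the same priors:

using Turing
using Bijectors: transformed, OrderedBijector
using LinearAlgebra

@model function ordreg_loop(X, y; predictors=size(X, 2), ncateg=maximum(y))
    #priors
    cutpoints ~ transformed(filldist(TDist(3) * 5, ncateg - 1), OrderedBijector())
    β ~ filldist(TDist(3) * 2.5, predictors)

    #likelihood: one OrderedLogistic per observation
    for i in eachindex(y)
        y[i] ~ OrderedLogistic(X[i, :] ⋅ β, cutpoints)
    end
end;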
Example - Esoph
For our example, I will use a famous dataset called esoph (Breslow & Day, 1980), which is data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. It has records for 88 age/alcohol/tobacco combinations:
- agegp: Age group
  - 1: 25-34 years
  - 2: 35-44 years
  - 3: 45-54 years
  - 4: 55-64 years
  - 5: 65-74 years
  - 6: 75+ years
- alcgp: Alcohol consumption
  - 1: 0-39 g/day
  - 2: 40-79 g/day
  - 3: 80-119 g/day
  - 4: 120+ g/day
- tobgp: Tobacco consumption
  - 1: 0-9 g/day
  - 2: 10-19 g/day
  - 3: 20-29 g/day
  - 4: 30+ g/day
- ncases: Number of cases
- ncontrols: Number of controls
Ok, let's read our data with CSV.jl and output it into a DataFrame from DataFrames.jl:
using DataFrames
using CSV
using HTTP
url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/esoph.csv"
esoph = CSV.read(HTTP.get(url).body, DataFrame)
88×5 DataFrame
Row │ agegp alcgp tobgp ncases ncontrols
│ String7 String15 String15 Int64 Int64
─────┼─────────────────────────────────────────────────
1 │ 25-34 0-39g/day 0-9g/day 0 40
2 │ 25-34 0-39g/day 10-19 0 10
3 │ 25-34 0-39g/day 20-29 0 6
4 │ 25-34 0-39g/day 30+ 0 5
5 │ 25-34 40-79 0-9g/day 0 27
6 │ 25-34 40-79 10-19 0 7
7 │ 25-34 40-79 20-29 0 4
8 │ 25-34 40-79 30+ 0 7
9 │ 25-34 80-119 0-9g/day 0 2
10 │ 25-34 80-119 10-19 0 1
11 │ 25-34 80-119 30+ 0 2
12 │ 25-34 120+ 0-9g/day 0 1
13 │ 25-34 120+ 10-19 1 0
14 │ 25-34 120+ 20-29 0 1
15 │ 25-34 120+ 30+ 0 2
16 │ 35-44 0-39g/day 0-9g/day 0 60
17 │ 35-44 0-39g/day 10-19 1 13
18 │ 35-44 0-39g/day 20-29 0 7
19 │ 35-44 0-39g/day 30+ 0 8
20 │ 35-44 40-79 0-9g/day 0 35
21 │ 35-44 40-79 10-19 3 20
22 │ 35-44 40-79 20-29 1 13
23 │ 35-44 40-79 30+ 0 8
24 │ 35-44 80-119 0-9g/day 0 11
25 │ 35-44 80-119 10-19 0 6
26 │ 35-44 80-119 20-29 0 2
27 │ 35-44 80-119 30+ 0 1
28 │ 35-44 120+ 0-9g/day 2 1
29 │ 35-44 120+ 10-19 0 3
30 │ 35-44 120+ 20-29 2 2
31 │ 45-54 0-39g/day 0-9g/day 1 45
32 │ 45-54 0-39g/day 10-19 0 18
33 │ 45-54 0-39g/day 20-29 0 10
34 │ 45-54 0-39g/day 30+ 0 4
35 │ 45-54 40-79 0-9g/day 6 32
36 │ 45-54 40-79 10-19 4 17
37 │ 45-54 40-79 20-29 5 10
38 │ 45-54 40-79 30+ 5 2
39 │ 45-54 80-119 0-9g/day 3 13
40 │ 45-54 80-119 10-19 6 8
41 │ 45-54 80-119 20-29 1 4
42 │ 45-54 80-119 30+ 2 2
43 │ 45-54 120+ 0-9g/day 4 0
44 │ 45-54 120+ 10-19 3 1
45 │ 45-54 120+ 20-29 2 1
46 │ 45-54 120+ 30+ 4 0
47 │ 55-64 0-39g/day 0-9g/day 2 47
48 │ 55-64 0-39g/day 10-19 3 19
49 │ 55-64 0-39g/day 20-29 3 9
50 │ 55-64 0-39g/day 30+ 4 2
51 │ 55-64 40-79 0-9g/day 9 31
52 │ 55-64 40-79 10-19 6 15
53 │ 55-64 40-79 20-29 4 13
54 │ 55-64 40-79 30+ 3 3
55 │ 55-64 80-119 0-9g/day 9 9
56 │ 55-64 80-119 10-19 8 7
57 │ 55-64 80-119 20-29 3 3
58 │ 55-64 80-119 30+ 4 0
59 │ 55-64 120+ 0-9g/day 5 5
60 │ 55-64 120+ 10-19 6 1
61 │ 55-64 120+ 20-29 2 1
62 │ 55-64 120+ 30+ 5 1
63 │ 65-74 0-39g/day 0-9g/day 5 43
64 │ 65-74 0-39g/day 10-19 4 10
65 │ 65-74 0-39g/day 20-29 2 5
66 │ 65-74 0-39g/day 30+ 0 2
67 │ 65-74 40-79 0-9g/day 17 17
68 │ 65-74 40-79 10-19 3 7
69 │ 65-74 40-79 20-29 5 4
70 │ 65-74 80-119 0-9g/day 6 7
71 │ 65-74 80-119 10-19 4 8
72 │ 65-74 80-119 20-29 2 1
73 │ 65-74 80-119 30+ 1 0
74 │ 65-74 120+ 0-9g/day 3 1
75 │ 65-74 120+ 10-19 1 1
76 │ 65-74 120+ 20-29 1 0
77 │ 65-74 120+ 30+ 1 0
78 │ 75+ 0-39g/day 0-9g/day 1 17
79 │ 75+ 0-39g/day 10-19 2 4
80 │ 75+ 0-39g/day 30+ 1 2
81 │ 75+ 40-79 0-9g/day 2 3
82 │ 75+ 40-79 10-19 1 2
83 │ 75+ 40-79 20-29 0 3
84 │ 75+ 40-79 30+ 1 0
85 │ 75+ 80-119 0-9g/day 1 0
86 │ 75+ 80-119 10-19 1 0
87 │ 75+ 120+ 0-9g/day 2 0
88 │ 75+ 120+ 10-19 1 0
Now let's instantiate our model with the data. But here I need to do some data wrangling to create the data matrix X. Specifically, I need to convert all of the categorical variables to integer values representing the ordinal values of both our independent and our dependent variables:
using CategoricalArrays
DataFrames.transform!(
    esoph,
    :agegp =>
        x -> categorical(
            x; levels=["25-34", "35-44", "45-54", "55-64", "65-74", "75+"], ordered=true
        ),
    :alcgp =>
        x -> categorical(x; levels=["0-39g/day", "40-79", "80-119", "120+"], ordered=true),
    :tobgp =>
        x -> categorical(x; levels=["0-9g/day", "10-19", "20-29", "30+"], ordered=true);
    renamecols=false,
)
DataFrames.transform!(
    esoph, [:agegp, :alcgp, :tobgp] .=> ByRow(levelcode); renamecols=false
)
X = Matrix(select(esoph, [:agegp, :alcgp]))
y = esoph[:, :tobgp]
model = ordreg(X, y);
And, finally, we will sample from the Turing model. We will be using the default NUTS() sampler with 1_000 samples, with 4 Markov chains using multiple threads, MCMCThreads():
chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
cutpoints[1] -1.3806 0.6268 0.0119 2788.6718 2472.5534 1.0009 69.0744
cutpoints[2] -0.2021 0.6072 0.0113 2897.1099 2438.5542 1.0013 71.7604
cutpoints[3] 0.8493 0.6203 0.0114 2962.0272 2770.1618 1.0021 73.3684
β[1] -0.0696 0.1161 0.0021 3108.6746 2347.4018 1.0007 77.0008
β[2] -0.0658 0.1691 0.0030 3128.0391 2909.9287 1.0016 77.4804
We had no problem with the Markov chains, as all the rhat are well below 1.01 (or above 0.99). Note that the coefficients are in log-odds scale. They are the natural log of the odds[1], and odds are defined as:

$$\text{odds} = \frac{p}{1 - p}$$

where $p$ is a probability. So log-odds is defined as:

$$\log(\text{odds}) = \ln \left( \frac{p}{1 - p} \right)$$

So, in order to get odds from log-odds, we must undo the log operation with exponentiation. This translates to:

$$\text{odds} = e^{\log(\text{odds})}$$
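For instance, taking the (rounded) posterior mean of cutpoints[2] from the summary above, a rough sketch:

logodds = -0.2021    # posterior mean of cutpoints[2] from summarystats(chain)
odds = exp(logodds)  # ≈ 0.82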
We can do this with a transformation in a DataFrame constructed from a Chains object:
using Chain
@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
end
5×6 DataFrame
Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
│ Symbol Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────
1 │ cutpoints[1] 0.0738932 0.163178 0.252302 0.389008 0.845115
2 │ cutpoints[2] 0.254298 0.534801 0.818166 1.2405 2.71577
3 │ cutpoints[3] 0.704014 1.52676 2.32062 3.56916 8.04289
4 │ β[1] 0.745092 0.863281 0.933127 1.00918 1.17439
5 │ β[2] 0.677071 0.832166 0.935842 1.05535 1.29072
Our interpretation of odds is the same as in betting games. Anything below 1 signals an unlikely probability that $y$ will be 1. And anything above 1 increases the probability of $y$ being 1, while 1 itself is a neutral odds for $y$ being either 1 or 0. Since I am not a gambling man, let's talk about probabilities. So I will create a function called logodds2prob() that converts log-odds to probabilities:
function logodds2prob(logodds::Float64)
    return exp(logodds) / (1 + exp(logodds))
end
@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(logodds2prob); renamecols=false)
end
5×6 DataFrame
Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
│ Symbol Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────
1 │ cutpoints[1] 0.0688087 0.140286 0.201471 0.280062 0.458028
2 │ cutpoints[2] 0.202741 0.34845 0.449995 0.553671 0.730877
3 │ cutpoints[3] 0.41315 0.604236 0.698851 0.781141 0.889416
4 │ β[1] 0.426964 0.463312 0.482703 0.502285 0.540101
5 │ β[2] 0.403722 0.454198 0.483429 0.513466 0.563457
There you go, much better now. Let's analyze our results. The cutpoints are the basal rates of the probability of our dependent variable taking values less than a certain value. For example, the cutpoint for values less than 2, whose code represents a tobacco consumption of 10-19 g/day, has a median value of 20%.
Now let's take a look at our coefficients. All coefficients whose 95% credible intervals capture the value 0.5 (the probability-scale equivalent of an odds of 1) tell us that the effect on the propensity of tobacco consumption is inconclusive. This is pretty much similar to a 95% credible interval that captures 0 in the linear regression coefficients.
That's how you interpret 95% credible intervals from the quantile() output of an ordinal regression Chains object converted from log-odds to probability.
Footnotes
[1] Actually, the logit function, or the log-odds, is the logarithm of the odds $\frac{p}{1-p}$, where $p$ is a probability.
References
Breslow, N. E. & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.