Leaving the universe of linear models, we start to venture into generalized linear models (GLMs). The second of these is ordinal regression.
An ordinal regression behaves almost exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables $\mathbf{X}$ by the estimated coefficients $\boldsymbol{\beta}$, but now we have not only one intercept but several intercepts $\alpha_k$ for $k \in \{1, \dots, K-1\}$.
We use ordinal regression when our dependent variable is ordinal. That means it has different values that have a "natural order". Most importantly, the distance between values is not necessarily the same. For example, imagine a pain score scale that goes from 1 to 10. The distance between 1 and 2 is different from the distance between 9 and 10. Another example is opinion polls with their ubiquitous disagree-agree range of plausible values. These are also known as Likert scale variables. The distance from "disagree" to "neither agree nor disagree" is different from the distance between "agree" and "strongly agree".
This assumption is what we call the "metric" assumption, also called the equidistant assumption. Almost always, when we model an ordinal dependent variable, this assumption is violated. Thus, we cannot blindly employ linear regression here.
So, how do we deal with ordered discrete responses in our dependent variable? The answer is similar to the previous logistic regression approach: we have to do a non-linear transformation of the dependent variable.
In the case of ordinal regression, we need to first transform the dependent variable into a cumulative scale. We need to calculate the cumulative distribution function (CDF) of our dependent variable:

$$P(Y \leq y) = \sum_{i = y_{\min}}^{y} P(Y = i)$$

The CDF is a monotonically increasing function that depicts the probability of our random variable taking values less than or equal to a certain value. In our case, the discrete ordinal case, these values can be represented as positive integers ranging from 1 to the number of possible values. For instance, a 6-categorical ordered response variable will have 6 possible values, and all of their CDF values will lie between 0 and 1. Furthermore, the CDF of the largest value is exactly 1, since it accumulates the total probability of the variable taking any of the possible values, i.e. 100%.
That is still not enough; we also need to apply the logit function to the CDF:

$$\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$$

where $\ln$ is the natural logarithm function.

The logit is the inverse of the logistic transformation: it takes as input any number between 0 and 1 (where a probability is the perfect candidate) and outputs a real number, which we call the log-odds.
Since we are taking the log-odds of the CDF, we can call this complete transformation the log-odds of the CDF, or log-cumulative-odds.
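As a quick sanity check, the logit and logistic functions are inverses of each other. Here is a minimal sketch in plain Julia (the names `mylogit` and `mylogistic` are defined locally just for illustration; StatsFuns.jl exports equivalent `logit` and `logistic` functions):

```julia
# Local definitions for illustration only.
mylogistic(x) = 1 / (1 + exp(-x))   # maps the real line into (0, 1)
mylogit(p) = log(p / (1 - p))       # maps (0, 1) onto the real line

# logit maps probabilities to log-odds:
mylogit(0.5)                 # 0.0, i.e. even odds

# ...and logistic undoes it:
mylogistic(mylogit(0.73))    # recovers 0.73
```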
Now, the next question is: what do we do with the log-cumulative-odds?
We need the log-cumulative-odds because it allows us to construct different intercepts for the $K$ possible values of our ordinal dependent variable. We create a unique intercept for each possible outcome $k \in \{1, \dots, K-1\}$.

Notice that the highest possible value of $Y$ will always have a log-cumulative-odds of $\infty$, since for $p = 1$:

$$\operatorname{logit}(1) = \ln\left(\frac{1}{1 - 1}\right) = \ln\left(\frac{1}{0}\right) = \infty$$

Thus, we only need $K - 1$ intercepts for the $K$ possible response values of the dependent variable. These are known as cut points.
Each intercept implies a CDF value for each $k$. This allows us to dispense with the equidistant assumption, which does not hold for most ordinal variables.
Each intercept implies a log-cumulative-odds for each $k$. We also need to undo the cumulative nature of the $K - 1$ intercepts. We can accomplish this by first converting a log-cumulative-odds back to a cumulative probability. This is done by undoing the logit transformation, i.e. applying the logistic function:

$$\operatorname{logit}^{-1}(x) = \operatorname{logistic}(x) = \frac{1}{1 + e^{-x}}$$
Then, finally, we can remove the "cumulative" from "cumulative probability" by subtracting from each cut point's cumulative probability that of the previous cut point:

$$P(Y = k) = P(Y \leq k) - P(Y \leq k - 1)$$

where $Y$ is the dependent variable and $k \in \{1, \dots, K-1\}$ indexes the cut points.
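To make the two undoing steps concrete, here is a small sketch (with made-up cut-point values, chosen only for illustration) that converts log-cumulative-odds back into the probability of each category:

```julia
# Illustrative cut points in log-cumulative-odds:
# K = 4 categories, so K - 1 = 3 cut points.
cutpoints = [-1.1, 0.3, 1.6]

logistic(x) = 1 / (1 + exp(-x))

# Step 1 — undo the logit: cumulative probabilities P(Y ≤ k).
cum_probs = logistic.(cutpoints)

# Step 2 — undo the "cumulative": P(Y = k) = P(Y ≤ k) - P(Y ≤ k - 1),
# padding with 0 below the first category and 1 above the last.
probs = diff(vcat(0.0, cum_probs, 1.0))

sum(probs)   # a valid probability vector: sums to 1
```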
Let me show you an example with some synthetic data.
```julia
using DataFrames
using CairoMakie
using AlgebraOfGraphics
using Distributions
using StatsFuns: logit
```
Here we have a discrete variable `x` with 6 possible ordered values as response. The values range from 1 to 6, having, respectively, the probabilities 10%, 15%, 33%, 25%, 10%, and 7%, represented by the `probs` vector. The underlying distribution is represented by a `Categorical` distribution from `Distributions.jl`, which takes a vector of probabilities as parameter (`probs`).
For each value of `x` we are calculating:

- the Probability Mass Function with the `pdf` function;
- the Cumulative Distribution Function with the `cdf` function;
- the log-cumulative-odds with the `logit` transformation of the CDF.
In the plot below there are 3 subplots:

- Upper corner: histogram of `x`;
- Lower left corner: CDF of `x`;
- Lower right corner: log-cumulative-odds of `x`.
```julia
let
    probs = [0.10, 0.15, 0.33, 0.25, 0.10, 0.07]
    dist = Categorical(probs)
    x = 1:length(probs)
    x_pmf = pdf.(dist, x)
    x_cdf = cdf.(dist, x)
    x_logodds_cdf = logit.(x_cdf)
    df = DataFrame(; x, x_pmf, x_cdf, x_logodds_cdf)
    labels = ["CDF", "Log-cumulative-odds"]
    fig = Figure()
    plt1 = data(df) * mapping(:x, :x_pmf) * visual(BarPlot)
    plt2 =
        data(df) *
        mapping(:x, [:x_cdf, :x_logodds_cdf]; col=dims(1) => renamer(labels)) *
        visual(ScatterLines)
    axis = (; xticks=1:6)
    draw!(fig[1, 2:3], plt1; axis)
    draw!(fig[2, 1:4], plt2; axis, facet=(; linkyaxes=:none))
    fig
end
```
As we can see, we have $K - 1$ (in our case $6 - 1 = 5$) intercept values in log-cumulative-odds. You can clearly see that these values violate the equidistant assumption for metric response values: the spacing between the cut points is not the same; it varies.
Ok, the intercepts are done. Now let's add coefficients to act as covariate effects in our ordinal regression model.
We transformed everything into the log-odds scale so that we can add effects (coefficients multiplying a covariate) and basal rates (intercepts) together. For each $k \in \{1, \dots, K-1\}$, we calculate:

$$\phi_k = \alpha_k + \beta_i x_i$$

where $\alpha_k$ is the log-cumulative-odds for the $k$-th intercept, and $\beta_i$ is the coefficient for the $i$-th covariate $x_i$. Finally, $\phi_k$ represents the linear predictor for the $k$-th intercept.
Observe that the coefficient $\beta$ is being added to a log-cumulative-odds, such that it will also be expressed on the log-cumulative-odds scale.
We can express it in matrix form:

$$\boldsymbol{\phi} = \boldsymbol{\alpha} + \mathbf{X} \cdot \boldsymbol{\beta}$$

where $\boldsymbol{\phi}$, $\boldsymbol{\alpha}$, and $\boldsymbol{\beta}$ are vectors and $\mathbf{X}$ is the data matrix, in which each row is an observation and each column a covariate.
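In code, the matrix form is just a matrix-vector product, with each observation's linear predictor shifting every cut point. A sketch with hypothetical numbers (all values below are invented for illustration):

```julia
# Hypothetical data: 3 observations, 2 covariates.
X = [1.0 2.0;
     0.5 1.5;
     2.0 0.0]
β = [0.4, -0.3]          # one coefficient per covariate
α = [-1.0, 0.2, 1.3]     # K - 1 = 3 intercepts (cut points)

η = X * β                # one linear predictor per observation

# Each observation shifts every cut point by its own η,
# giving a 3×3 matrix: rows = observations, columns = cut points.
ϕ = [αk + η[i] for i in eachindex(η), αk in α]
size(ϕ)                  # (3, 3)
```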
This still obeys the ordered constraint on the dependent variable's possible values.
Now, suppose we have found our ideal value for our $\boldsymbol{\beta}$. How would we interpret our estimated values?
First, to recap, $\beta$ measures the strength of association between the covariate $x$ and the dependent variable $y$. But $\beta$ is nested inside a non-linear transformation called the logistic function:

$$\operatorname{logistic}(x) = \frac{1}{1 + e^{-x}}$$

So our first step is to undo the logistic function. There is a function called the logit function that is the inverse of the logistic function:

$$\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$$

where $\ln$ is the natural logarithm function.

If we analyze the logit function closely, we can find that inside the $\ln$ there is a disguised odds: $\frac{x}{1 - x}$. Hence, our $\beta$ can be interpreted as the logarithm of the odds, or, in short form: the log-odds.
We already saw how odds, log-odds, and probability are related in the previous logistic regression tutorial. So you might want to go back there to get the full explanation.
The log-odds are the key to interpreting the coefficient $\beta$. Any positive value of $\beta$ means that there exists a positive association between $x$ and $y$, while any negative value indicates a negative association between $x$ and $y$. Values close to 0 demonstrate the lack of association between $x$ and $y$.
We have almost everything we need for our ordinal regression. We are only missing a final touch. All of the intercepts, along with the coefficients, are on the log-cumulative-odds scale. If we apply the logistic function to the linear predictors and subtract adjacent cut points, we get a vector of probabilities that sums to 1: one probability for each possible response value.
We need a likelihood that can handle a vector of probabilities and output a single discrete value. The categorical distribution is the perfect candidate.
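To see why the categorical distribution fits, here is a dependency-free sketch of drawing one ordinal value from a probability vector (`Categorical` from Distributions.jl does this, and much more, for you; `draw_categorical` below is a hypothetical helper, not a library function):

```julia
# Draw one value k ∈ {1, …, K} from a probability vector p that sums to 1,
# by walking the cumulative probabilities until a uniform draw is exceeded.
function draw_categorical(p::Vector{Float64})
    u = rand()
    c = 0.0
    for (k, pk) in enumerate(p)
        c += pk
        u <= c && return k
    end
    return length(p)   # guard against floating-point round-off
end

p = [0.10, 0.15, 0.33, 0.25, 0.10, 0.07]
draw_categorical(p)    # a single discrete value between 1 and 6
```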
Now we have all the components for our Bayesian ordinal regression specification:
- $\mathbf{y}$ – ordered discrete dependent variable.
- $\mathbf{p}$ – probability vector of length $K$.
- $K$ – number of possible values $\mathbf{y}$ can take, i.e. the number of ordered discrete values.
- $\boldsymbol{\eta}$ – log-cumulative-odds, i.e. the cut points considering the intercepts and the covariates' effects.
- $\alpha_k$ – intercept in log-cumulative-odds for each $k \in \{1, \dots, K-1\}$.
- $\mathbf{X}$ – covariate data matrix.
- $\boldsymbol{\beta}$ – coefficient vector of the same length as the number of columns in $\mathbf{X}$.
- $\operatorname{logistic}$ – logistic function.
- $\operatorname{CDF}$ – cumulative distribution function.
What remains is to specify the model parameters' prior distributions:
- Prior distribution of $\boldsymbol{\alpha}$ – knowledge we possess regarding the model's intercepts.
- Prior distribution of $\boldsymbol{\beta}$ – knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate an ordinal regression with the observed data ($\mathbf{y}$ and $\mathbf{X}$) and find the posterior distribution of our model's parameters of interest ($\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$). This means finding the full posterior distribution of:

$$P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\boldsymbol{\alpha}, \boldsymbol{\beta} \mid \mathbf{y})$$
Note that, contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our ordinal regression. This is due to the categorical distribution not having a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.
This is easily accomplished with Turing:
```julia
using Turing
using Bijectors
using LazyArrays
using LinearAlgebra
using Random: seed!
seed!(123)

@model function ordreg(X, y; predictors=size(X, 2), ncateg=maximum(y))
    #priors
    cutpoints ~ Bijectors.ordered(filldist(TDist(3) * 5, ncateg - 1))
    β ~ filldist(TDist(3) * 2.5, predictors)

    #likelihood
    y ~ arraydist([OrderedLogistic(X[i, :] ⋅ β, cutpoints) for i in 1:length(y)])
end;
```
First, let's deal with the new stuff in our model: `Bijectors.ordered`. As I've said in 4. How to use Turing, Turing has a rich ecosystem of packages. Bijectors implements a set of functions for transforming constrained random variables (e.g. simplexes, intervals) to Euclidean space. Here we are defining `cutpoints` as an `ncateg - 1` vector of Student-$t$ distributions with mean 0, standard deviation 5, and degrees of freedom $\nu = 3$. Remember that we only need $K - 1$ cut points for the $K$ possible response values. And we are also constraining it to be an ordered vector with `Bijectors.ordered`, such that the cut points satisfy $c_1 < c_2 < \dots < c_{K-1}$.
As before, we are giving very weakly informative priors to $\boldsymbol{\beta}$: a Student-$t$ distribution centered on 0 with scale 2.5 and degrees of freedom $\nu = 3$. That wide-tailed distribution will cover all possible values for our coefficients. Remember that the Student-$t$ also has support over the whole real number line, $\mathbb{R}$. Also, `filldist()` is a nice Turing function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
Finally, in the likelihood, Turing's `arraydist()` function wraps an array of distributions, returning a new distribution that samples from the individual distributions. Here we use a comprehension to build one `OrderedLogistic` distribution per observation, indexing the corresponding row of `X`.
For our example, I will use a famous dataset called `esoph` (Breslow & Day, 1980), which contains data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. It has records for 88 age/alcohol/tobacco combinations:
- `agegp`: Age group
  - 1: 25-34 years
  - 2: 35-44 years
  - 3: 45-54 years
  - 4: 55-64 years
  - 5: 65-74 years
  - 6: 75+ years
- `alcgp`: Alcohol consumption
  - 1: 0-39 g/day
  - 2: 40-79 g/day
  - 3: 80-119 g/day
  - 4: 120+ g/day
- `tobgp`: Tobacco consumption
  - 1: 0-9 g/day
  - 2: 10-19 g/day
  - 3: 20-29 g/day
  - 4: 30+ g/day
- `ncases`: Number of cases
- `ncontrols`: Number of controls
Ok, let's read our data with `CSV.jl` and output it into a `DataFrame`:
```julia
using DataFrames, CSV, HTTP

url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/esoph.csv"
esoph = CSV.read(HTTP.get(url).body, DataFrame)
describe(esoph)
```
```
5×7 DataFrame
 Row │ variable   mean     min        median  max     nmissing  eltype
     │ Symbol     Union…   Any        Union…  Any     Int64     DataType
─────┼───────────────────────────────────────────────────────────────────
   1 │ agegp               25-34              75+            0  String7
   2 │ alcgp               0-39g/day          80-119         0  String15
   3 │ tobgp               0-9g/day           30+            0  String15
   4 │ ncases     2.27273  0          1.0     17             0  Int64
   5 │ ncontrols  8.80682  0          4.0     60             0  Int64
```
As you can see from the `describe()` output, the three covariates `agegp`, `alcgp`, and `tobgp` are string-coded ordinal categories: age groups ranging from "25-34" to "75+" years, alcohol consumption from "0-39" to "120+" g/day, and tobacco consumption from "0-9" to "30+" g/day. The number of cancer cases per combination averages around 2.3 and the number of controls around 8.8. Here, we believe that age group and alcohol consumption somehow influence the propensity for tobacco consumption.
Now let us instantiate our model with the data. But first I need to do some data wrangling to create the data matrix `X`. Specifically, I need to convert all of the categorical variables to integer values, representing the ordinal values of both our independent and dependent variables:
```julia
using CategoricalArrays

transform!(
    esoph,
    :agegp =>
        x -> categorical(
            x; levels=["25-34", "35-44", "45-54", "55-64", "65-74", "75+"], ordered=true
        ),
    :alcgp =>
        x -> categorical(x; levels=["0-39g/day", "40-79", "80-119", "120+"], ordered=true),
    :tobgp =>
        x -> categorical(x; levels=["0-9g/day", "10-19", "20-29", "30+"], ordered=true);
    renamecols=false,
)
transform!(esoph, [:agegp, :alcgp, :tobgp] .=> ByRow(levelcode); renamecols=false)

X = Matrix(select(esoph, [:agegp, :alcgp]))
y = esoph[:, :tobgp]
model = ordreg(X, y);
```
And, finally, we will sample from the Turing model. We will be using the default `NUTS()` sampler with `1_000` samples, on 4 Markov chains, using multiple threads with `MCMCThreads()`:
```julia
chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain)
```
```
Summary Statistics
    parameters      mean       std   naive_se      mcse        ess      rhat   ess_per_sec
        Symbol   Float64   Float64    Float64   Float64    Float64   Float64       Float64

  cutpoints[1]   -1.4068    0.6390     0.0101    0.0125  2477.8651    1.0002      110.1861
  cutpoints[2]   -0.2369    0.6221     0.0098    0.0118  2550.2363    1.0001      113.4043
  cutpoints[3]    0.8158    0.6283     0.0099    0.0125  2597.9588    1.0001      115.5264
          β[1]   -0.0719    0.1151     0.0018    0.0020  2781.9947    1.0001      123.7102
          β[2]   -0.0718    0.1703     0.0027    0.0028  3285.7691    1.0003      146.1121
```
We had no problem with the Markov chains as all the `rhat` are well below `1.01` (or above `0.99`). Note that the coefficients are in log-odds scale. They are the natural log of the odds, and odds are defined as:

$$\text{odds} = \frac{p}{1 - p}$$

where $p$ is a probability. So log-odds is defined as:

$$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right)$$

So, in order to get odds from log-odds, we must undo the log operation with exponentiation. This translates to:

$$\text{odds} = e^{\log(\text{odds})}$$
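A one-liner check of that exponentiation round-trip, with an arbitrary probability picked for illustration:

```julia
# Hypothetical probability, just to illustrate the odds round-trip.
p = 0.6
odds = p / (1 - p)        # ≈ 1.5
log_odds = log(odds)      # the scale the model coefficients live on
exp(log_odds)             # exponentiation recovers the odds, ≈ 1.5
```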
We can do this with a transformation on a `DataFrame` constructed from a `Chains` object:
```julia
using Chain

@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(exp), renamecols=false)
end
```
```
5×6 DataFrame
 Row │ parameters    2.5%       25.0%     50.0%     75.0%     97.5%
     │ Symbol        Float64    Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────────────────
   1 │ cutpoints[1]  0.0718185  0.158259  0.241939  0.372134  0.867115
   2 │ cutpoints[2]  0.243682   0.516606  0.786366  1.18029   2.74772
   3 │ cutpoints[3]  0.678895   1.47193   2.23704   3.40228   8.0828
   4 │ β[1]          0.741014   0.862452  0.927978  1.00343   1.16994
   5 │ β[2]          0.667512   0.829369  0.929348  1.04207   1.3054
```
Our interpretation of odds is the same as in betting games. Anything below 1 signals that the event is unlikely, and anything above 1 makes it more likely, while 1 itself is a neutral odds: the event is as likely to happen as not. Since I am not a gambling man, let's talk about probabilities. So I will create a function called `logodds2prob()` that converts log-odds to probabilities:
```julia
function logodds2prob(logodds::Float64)
    return exp(logodds) / (1 + exp(logodds))
end

@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(logodds2prob), renamecols=false)
end
```
```
5×6 DataFrame
 Row │ parameters    2.5%       25.0%     50.0%     75.0%     97.5%
     │ Symbol        Float64    Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────────────────
   1 │ cutpoints[1]  0.0670062  0.136635  0.194807  0.271208  0.464414
   2 │ cutpoints[2]  0.195936   0.340633  0.440204  0.541345  0.733171
   3 │ cutpoints[3]  0.40437    0.595458  0.691076  0.772845  0.889902
   4 │ β[1]          0.425622   0.463073  0.481322  0.500857  0.539158
   5 │ β[2]          0.400304   0.453364  0.48169   0.510301  0.566236
```
There you go, much better now. Let's analyze our results. The `cutpoints` are the basal rates of the probability of our dependent variable taking values less than a certain value. For example, the cut point for taking values less than `2` (where the code `2` represents a tobacco consumption of 10-19 g/day) has a median probability of approximately 20%.
Now let's take a look at our coefficients. All coefficients whose 95% credible intervals capture the value $\frac{1}{2} = 0.5$ tell us that the effect on the propensity for tobacco consumption is inconclusive. This is pretty much analogous to a 95% credible interval that captures 0 for linear regression coefficients.
That's how you interpret 95% credible intervals from a `quantile()` output of an ordinal regression `Chains` object converted from log-odds to probability.
Footnote: actually, the logit function, or the log-odds, is the logarithm of the odds $\frac{p}{1 - p}$, where $p$ is a probability.
Breslow, N. E. & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.