Noam Ross (with lots of slides by David L Miller)
August 5th, 2017
Models that look like:
\[ y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \ldots + \epsilon_i \]
(describe the response, \( y_i \), as a linear combination of the covariates, \( x_{ji} \), with an offset)
We can make \( y_i\sim \) any exponential family distribution (Normal, Poisson, etc.).
The error term \( \epsilon_i \) is normally distributed (usually).
lm(y ~ x1, data=dat)
lm(y ~ poly(x1, 2), data=dat)  # poly(x1, 2) already includes the linear term
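For a non-Normal response we switch to glm(); a minimal sketch, assuming the same hypothetical dat, with a Poisson count:

glm(y ~ x1, data = dat, family = poisson)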
(Sort of. Let's fire up RStudio)
In a GLM we model the mean of the data as a sum of linear terms:
\[ y_i = \beta_0 +\sum_j \color{red}{ \beta_j x_{ji}} +\epsilon_i \]
A GAM is a sum of smooth functions, or smooths:
\[ y_i = \beta_0 + \sum_j \color{red}{s_j(x_{ji})} + \epsilon_i \]
where \( \epsilon_i \sim N(0, \sigma^2) \), \( y_i \sim \text{Normal} \) (for now)
We call the above equation the linear predictor in both cases.
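In R, the mgcv package fits these models; a minimal sketch, again assuming a hypothetical dat:

library(mgcv)
gam(y ~ s(x1), data = dat)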
There are several different types of splines, with different applications.
[Figure: basis functions of a cubic spline (top) and a thin-plate spline (bottom)]
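In mgcv the spline type is chosen with the bs argument to s(), for example:

s(x1, bs = "tp")  # thin-plate regression spline (the default)
s(x1, bs = "cr")  # cubic regression spline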
Back to RStudio to look at this in our model!
\[ \int_\mathbb{R} \left( \frac{\partial^2 f(x)}{\partial x^2}\right)^2 \text{d}x = \boldsymbol{\beta}^\text{T}S\boldsymbol{\beta} = W \]
(Wiggliness is 100% the right mathy word)
We penalize wiggliness to avoid overfitting, maximizing the penalized log-likelihood
\[ \log(\text{Likelihood}) - \lambda W \]
where the smoothing parameter \( \lambda \) controls the trade-off between fit and smoothness.
mgcv can estimate \( \lambda \) for us by REML. Hence gam(..., method="REML")
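A sketch, with the smoothing parameter estimated by REML (hypothetical dat):

gam(y ~ s(x1), data = dat, method = "REML")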
A smooth's effective degrees of freedom (EDF) are bounded above by its basis size:
\[ \text{EDF} < k \]
We set the basis size for s() terms with s(variable, k=n)
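A sketch of setting and then checking the basis size (hypothetical dat):

m <- gam(y ~ s(x1, k = 20), data = dat, method = "REML")
gam.check(m)  # reports EDF, k' and a k-index test for each smooth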
\[ \text{count}_i = A_i \exp \left( \beta_0 + s(x_i, y_i) + s(\text{Depth}_i)\right) \]
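One way to write this in mgcv (a sketch: the data frame dat and the Tweedie family are assumptions; the area \( A_i \) enters as a log offset):

gam(count ~ s(x, y) + s(Depth) + offset(log(A)),
    data = dat, family = tw(), method = "REML")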
Without uncertainty, we're not doing statistics
Taken together, the model coefficients have approximately a multivariate normal distribution around the estimates:
\[ \boldsymbol{\beta} \sim N(\hat{\boldsymbol{\beta}}, \mathbf{V}_\boldsymbol{\beta}) \]
However, this distribution is conditional on the estimated smoothing parameter.
In mgcv, vcov(model) returns \( \mathbf{V}_\boldsymbol{\beta} \), the variance-covariance matrix of the coefficients. vcov(model, unconditional = TRUE) corrects for uncertainty in the smoothing parameter.
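For a fitted gam object called model:

Vb   <- vcov(model)                        # conditional on the smoothing parameter
Vb_u <- vcov(model, unconditional = TRUE)  # accounts for smoothing parameter uncertainty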
We can plot confidence intervals around smooths with plot(model), and get standard errors for predictions with se.fit in the predict() function, applied to new data (e.g. a prediction data frame preddata).
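A sketch, assuming a fitted model and a hypothetical prediction data frame preddata:

plot(model, shade = TRUE)  # smooths with approximate confidence bands
pr <- predict(model, newdata = preddata, se.fit = TRUE)
ci <- cbind(lower = pr$fit - 2 * pr$se.fit, upper = pr$fit + 2 * pr$se.fit)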
Because our spline coefficients covary, it can be somewhat misleading to just plot confidence intervals. Sometimes we want to look at a better sample of possible predictions.
We can sample values of our coefficients from a multivariate normal using vcov(model).
Since our model is ultimately a linear function of these coefficients, we can predict using
\[ \hat{y} = L_p \boldsymbol{\hat{\beta}} \]
where \( L_p \) is the linear predictor matrix, constructed from our data and our basis functions.
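Putting this together, a sketch of simulating predictions (model and preddata as above; rmvn() is mgcv's multivariate normal sampler):

Lp    <- predict(model, newdata = preddata, type = "lpmatrix")  # linear predictor matrix
betas <- rmvn(100, coef(model), vcov(model, unconditional = TRUE))  # 100 coefficient draws
sims  <- Lp %*% t(betas)  # each column is one simulated set of predictions (link scale)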
mgcv does most of the hard work for us: a GAM is just a glm with extra s() terms, and mgcv shields you from the messy math underneath.