Generalized Additive Models

Noam Ross (with lots of slides by David L Miller)
August 5th, 2018

Overview

  • Motivation
  • Getting our feet wet with an example
  • What is a GAM?
  • What is smoothing?
  • How do GAMs work? (Roughly)

Generalized Additive Models

  • Generalized = many response distributions
  • Additive = terms add together
  • Models = Models

To GAMs from GLMs and LMs

(Generalized) Linear Models

Models that look like:

\[ y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \ldots + \epsilon_i \]

(describe the response, \( y_i \), as a linear combination of the covariates, \( x_{ji} \), with an offset)

We can make \( y_i\sim \) any exponential family distribution (Normal, Poisson, etc).

Error term \( \epsilon_i \) is normally distributed (usually).
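As a (hypothetical) concrete example, here is the same linear predictor fit as an LM and then as a Poisson GLM; dat, y, x1, and x2 are placeholder names, not objects from these slides:

lm_mod  <- lm(y ~ x1 + x2, data = dat)                      # linear model: Normal response
glm_mod <- glm(y ~ x1 + x2, family = poisson, data = dat)   # same linear predictor, Poisson response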

Why bother with anything more complicated?!

Is this linear?

plot of chunk islinear

Is this linear? Maybe?

lm(y ~ x1, data=dat)

plot of chunk maybe

What can we do?

Adding a quadratic term?

lm(y ~ x1 + I(x1^2), data=dat)

plot of chunk quadratic

Is this sustainable?

  • Adding quadratic (and higher-order) terms can make sense
  • But this feels a bit ad hoc
  • Wouldn't it be better to have a framework to deal with these issues?

plot of chunk ruhroh

Wait, before we do math, let's talk dolphins

a pantropical spotted dolphin doing its thing

(Sort of. Let's fire up RStudio)

How is the GAM different?

In a GLM we model the mean of the data as a sum of linear terms:

\[ y_i = \beta_0 +\sum_j \color{red}{ \beta_j x_{ji}} +\epsilon_i \]

In a GAM, the mean is instead modelled as a sum of smooth functions, or smooths:

\[ y_i = \beta_0 + \sum_j \color{red}{s_j(x_{ji})} + \epsilon_i \]

where \( \epsilon_i \sim N(0, \sigma^2) \), \( y_i \sim \text{Normal} \) (for now)

We call the above equation the linear predictor in both cases.

Why use functions rather than just coefficients?

  • Want to model the covariates flexibly
  • Covariates and response not necessarily linearly related!
  • Want some “wiggles”

plot of chunk smoothdat

Why use functions rather than just coefficients?

  • Want to model the covariates flexibly
  • Covariates and response not necessarily linearly related!
  • Want some “wiggles”

plot of chunk wsmooths

What are smooths, smoothing, and wiggliness?

Splines

  • Functions made of other, simpler basis functions (usually polynomials)
  • Each basis function \( b_k \) has a coefficient \( \beta_k \)
  • Our spline is just the sum: \( s(x) = \sum_{k=1}^K \beta_k b_k(x) \) (a small construction sketch follows below)
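A minimal sketch of this construction using mgcv's basis-building machinery; the data and the coefficients below are invented purely for illustration:

library(mgcv)

x  <- seq(0, 1, length.out = 100)
sm <- smoothCon(s(x, k = 6), data = data.frame(x = x))[[1]]

# sm$X has one column per basis function b_k, evaluated at each x
beta   <- rnorm(ncol(sm$X))    # pretend these are fitted coefficients
spline <- sm$X %*% beta        # s(x) = sum_k beta_k * b_k(x)

matplot(x, sm$X, type = "l", lty = 1)   # the individual basis functions
plot(x, spline, type = "l")             # their weighted sum: one spline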

Splines (2)

Several different types of splines, with different applications

Basis functions of a cubic spline (top) and a thin-plate spline (bottom)

Splines (4)

  • We often write linear models in matrix notation: \( X\boldsymbol{\beta} \)
    • \( X \) is our data
    • \( \boldsymbol{\beta} \) are parameters we need to estimate
  • For a GAM it's the same
    • \( X \) has columns for each basis function, evaluated at each observation
    • again, this is the linear predictor
  • Let's look at these:

Back to RStudio to look at this in our model!
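For example, once we have a fitted mgcv model (here called dolphin_model, a placeholder name for whatever we fit in RStudio), we can inspect \( X \) and \( \boldsymbol{\beta} \) directly:

X <- model.matrix(dolphin_model)   # basis functions evaluated at each observation
dim(X)                             # rows = observations, columns = coefficients
head(colnames(X))                  # intercept plus the s(...) basis columns
head(coef(dolphin_model))          # the corresponding estimated betas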

Avoiding Overfitting

plot of chunk wiggles

  • Want a line that is “close” to all the data (high likelihood)
  • Splines could just fit every data point, but this would be overfitting. We know there is “error”.
  • It's easy to overfit; we want a smooth curve.
  • How do we measure smoothness? Calculus!

Wiggliness by derivatives

Animation of derivatives

What was that grey bit? Wiggliness!

\[ \int_\mathbb{R} \left( \frac{\partial^2 f(x)}{\partial x^2}\right)^2 \text{d}x = \boldsymbol{\beta}^\text{T}S\boldsymbol{\beta} = W \]

(Wiggliness is 100% the right mathy word)

We penalize wiggliness to avoid overfitting.
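A small sketch of computing \( W \) for a single smooth, reusing the made-up basis from the spline sketch above; the coefficients are again invented:

library(mgcv)

x  <- seq(0, 1, length.out = 100)
sm <- smoothCon(s(x, k = 6), data = data.frame(x = x))[[1]]

S    <- sm$S[[1]]                       # penalty matrix for this basis
beta <- rnorm(ncol(sm$X))               # pretend coefficients
W    <- drop(t(beta) %*% S %*% beta)    # larger W = a wigglier curve
W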

Making wiggliness matter

  • \( W \) measures wiggliness
  • The (log) likelihood measures closeness to the data
  • We use a smoothing parameter (\( \lambda \)) to define the trade-off and find the spline coefficients (\( \beta_k \)) that maximize

\[ \log(\text{Likelihood}) - \lambda W \]
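One way to see this trade-off is to fix \( \lambda \) by hand via gam()'s sp argument and compare the fits; a sketch, with dat, y, and x1 as placeholder names:

library(mgcv)

m_wiggly <- gam(y ~ s(x1, k = 20), data = dat, sp = 1e-6)   # tiny penalty: very wiggly fit
m_smooth <- gam(y ~ s(x1, k = 20), data = dat, sp = 1e3)    # big penalty: nearly a straight line
plot(m_wiggly)
plot(m_smooth)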

Picking the right smoothing parameter

plot of chunk wiggles-plot

Picking the right smoothing parameter

  • Two ways to think about how to optimize \( \lambda \):
    • Predictive: Minimize out-of-sample error
    • Bayesian: Put priors on our basis coefficients
  • Many methods: AIC, Mallows' \( C_p \), GCV, ML, REML
  • In practice, use REML, because of its numerical stability:

Hence gam(..., method="REML")
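For instance, a minimal sketch of a typical call (dat and its columns are placeholders):

library(mgcv)

m <- gam(y ~ s(x1) + s(x2), data = dat, method = "REML")
summary(m)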

Maximum wiggliness

  • We set basis complexity or “size” (\( k \))
  • This is the maximum wiggliness; it can be thought of as the number of small functions that make up a curve.
  • Once smoothing is applied, curves have fewer effective degrees of freedom (EDF):

\[ \text{EDF} < k \]

  • \( k \) must be “large enough”; the \( \lambda \) penalty does the rest
  • Bigger \( k \) increases computational cost
  • In mgcv, default \( k \) values are arbitrary

A wiggly exercise!

  • We set \( k \) in s() terms with s(variable, k=n)
  • Re-fit models with small, medium and large \( k \) values.
  • Look at coefficients, model.matrix, summaries and plots
  • How do deviance explained, EDF values, smooth shapes change?
  • What is the default \( k \)? (One possible setup is sketched below.)
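One possible setup, as a sketch; dolphins, count, and depth are placeholder names echoing the dolphin example, not the workshop's actual objects:

library(mgcv)

m_small <- gam(count ~ s(depth, k = 3),  data = dolphins, family = poisson, method = "REML")
m_large <- gam(count ~ s(depth, k = 30), data = dolphins, family = poisson, method = "REML")

summary(m_small)              # deviance explained and EDF per smooth
coef(m_large)                 # one coefficient per basis function
head(model.matrix(m_large))   # the basis columns behind those coefficients
plot(m_large)                 # shape of the estimated smooth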

a pantropical spotted dolphin doing its thing

GAM summary so far

  • GAMs give us a framework to model flexible nonlinear relationships
  • Use little functions (basis functions) to make big functions (smooths)
  • Use a penalty to trade off wiggliness/generality
  • Need to make sure your smooths are wiggly enough

Check-in: do we need COFFEE?

Predictions and Uncertainty

What is a prediction?

  • Evaluate the model, at a particular covariate combination
  • Answering (e.g.) the question “at a given depth and location, how many dolphins?”
  • Steps:
    1. evaluate the \( s(\ldots) \) terms
    2. move to the response scale (exponentiate? Do nothing?)
    3. multiply by any offset, etc. (these steps are sketched in code after the equation below)

\[ \text{count}_i = A_i \exp \left( \beta_0 + s(x_i, y_i) + s(\text{Depth}_i)\right) \]
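A sketch of those steps in code, assuming a fitted model dolphin_model, a prediction grid preddata with an area column holding \( A_i \), and the offset applied by hand after prediction (all placeholder names):

eta    <- predict(dolphin_model, newdata = preddata, type = "link")   # step 1: beta_0 + sum of s() terms
mu     <- exp(eta)                                                    # step 2: back to the response scale
counts <- preddata$area * mu                                          # step 3: multiply by the offset A_i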

What about uncertainty?

Without uncertainty, we're not doing statistics

Where does uncertainty come from?

  • \( \boldsymbol{\beta} \): uncertainty in the spline parameters
  • \( \boldsymbol{\lambda} \): uncertainty in the smoothing parameter

Parameter uncertainty

Taken together, the model coefficients have an approximately multivariate normal distribution centred on the estimated values:

\[ \boldsymbol{\beta} \sim N(\hat{\boldsymbol{\beta}}, \mathbf{V}_\boldsymbol{\beta}) \]

However, this distribution is conditional on the smoothing parameter.

In mgcv, vcov(model) returns \( \mathbf{V}_\boldsymbol{\beta} \), the variance-covariance matrix of the coefficients.

vcov(model, unconditional = TRUE) corrects for uncertainty in the smoothing parameter.

How do we make use of this information?

  • confidence intervals in plot()
  • standard errors using se.fit in the predict() function (sketched below)
  • simulating from the distribution of our coefficients using \( \mathbf{V}_\boldsymbol{\beta} \)
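For example, approximate 95% intervals from se.fit, computed on the link scale and then exponentiated (a sketch, with the same placeholder names as before and a log link assumed):

p     <- predict(dolphin_model, newdata = preddata, type = "link", se.fit = TRUE)
lower <- exp(p$fit - 1.96 * p$se.fit)   # lower bound on the response scale
upper <- exp(p$fit + 1.96 * p$se.fit)   # upper bound on the response scale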

Back to sea!

a pantropical spotted dolphin doing its thing

Exercise

  • Make predictions of dolphin counts at all points in preddata.
  • Plot separate maps of the mean, low, and high estimates

Varying spline shapes

  • Because our spline coefficients covary, it can be somewhat misleading to just plot confidence intervals. Sometimes we want to look at a better sample of possible predictions.

  • We can sample values of our coefficients from a multivariate normal using vcov(model).

  • Since our model is ultimately a linear function of these coefficients, we can predict using

\[ \hat{y} = L_p \boldsymbol{\hat{\beta}} \]

where \( L_p \) is the linear predictor matrix, which is constructed from our data and our basis functions.
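A sketch of that procedure in mgcv (same placeholder names as before; 100 draws chosen arbitrarily):

library(mgcv)

Lp    <- predict(dolphin_model, newdata = preddata, type = "lpmatrix")   # the L_p matrix
betas <- rmvn(100, coef(dolphin_model),
              vcov(dolphin_model, unconditional = TRUE))                 # 100 draws of beta
sims  <- Lp %*% t(betas)   # each column is one simulated linear predictor over preddata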

Simulating parameters

Animation of uncertainty

Prediction / Uncertainty Summary

  • Everything comes from variance of parameters
  • Can include uncertainty in the smoothing parameter too
  • Need to re-project/scale them to get the quantities we need
  • mgcv does most of the hard work for us
  • Fancy stuff possible with a little math

Okay, that was a lot of information

Summary

  • GAMs are GLMs plus some extra wiggles
  • Need to make sure things are just wiggly enough
    • Basis + penalty is the way to do this
  • Fitting looks like glm with extra s() terms
  • Most stuff comes down to matrix algebra, which mgcv shields you from
    • To do fancy stuff, get inside the matrices

COFFEE