Model checking

Noam Ross (with lots of help from Eric J Pedersen)
August 5th, 2017

Outline

So you have a GAM:

  • How do you know you have the right degrees of freedom? gam.check()
  • Diagnosing model issues: gam.check() part 2
  • When covariates aren't independent: estimating concurvity

GAMs are models too

With all models, how accurate your predictions will be depends on how good the model is

plot of chunk misspecify

So how do we test how well our model fits?

Examples:

set.seed(2)
n = 400
x1 = rnorm(n)
x2 = rnorm(n)
y_val =1 + 2*cos(pi*x1) + 2/(1+exp(-5*(x2)))
y_norm = y_val + rnorm(n, 0, 0.5)
y_negbinom = rnbinom(n, mu = exp(y_val),size=10)
y_binom = rbinom(n,1,prob = exp(y_val)/(1+exp(y_val)))

plot of chunk sims_plot

gam.check() part 1: do you have the right functional form?

How well does the model fit?

  • Many choices: k, family, type of smoother, …
  • How do we assess how well our model fits?

Basis size (k)

  • Set k per term
  • e.g. s(x, k=10) or s(x, y, k=100)
  • Penalty removes “extra” wigglyness
    • up to a point!
  • (But computation is slower with bigger k)

Checking basis size

norm_model_1 = gam(y_norm~s(x1,k=4)+s(x2,k=4),method= "REML")
gam.check(norm_model_1)

Method: REML   Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-0.0003467788,0.0005154578]
(score 736.9402 & scale 2.252304).
Hessian positive definite, eigenvalue range [0.000346021,198.5041].
Model rank =  7 / 7 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

        k'  edf k-index p-value    
s(x1) 3.00 1.00    0.13  <2e-16 ***
s(x2) 3.00 2.91    1.04    0.79    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Checking basis size

norm_model_2 = gam(y_norm~s(x1,k=12)+s(x2,k=4),method= "REML")
gam.check(norm_model_2)

Method: REML   Optimizer: outer newton
full convergence after 11 iterations.
Gradient range [-5.658609e-06,5.392657e-06]
(score 345.3111 & scale 0.2706205).
Hessian positive definite, eigenvalue range [0.967727,198.6299].
Model rank =  15 / 15 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'   edf k-index p-value    
s(x1) 11.00 10.84    0.99    0.41    
s(x2)  3.00  2.98    0.86  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Checking basis size

norm_model_3 = gam(y_norm~s(x1,k=12)+s(x2,k=12),method= "REML")
gam.check(norm_model_3)

Method: REML   Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-5.103686e-05,-1.089481e-08]
(score 334.2084 & scale 0.2485446).
Hessian positive definite, eigenvalue range [2.812257,198.6868].
Model rank =  23 / 23 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'   edf k-index p-value
s(x1) 11.00 10.85    0.98    0.28
s(x2) 11.00  7.95    0.95    0.12

Checking basis size

layout(matrix(1:6,ncol=2,byrow = T))
plot(norm_model_1);plot(norm_model_2);plot(norm_model_3)

plot of chunk gam_check_norm4

layout(1)

Using gam.check() part 2: visual checks

gam.check() plots

gam.check() creates 4 plots:

  1. Quantile-quantile plots of residuals. If the model is right, should follow 1-1 line

  2. Histogram of residuals

  3. Residuals vs. linear predictor

  4. Observed vs. fitted values

gam.check() uses deviance residuals by default

gam.check() plots: Gaussian data, Gaussian model

norm_model = gam(y_norm~s(x1,k=12)+s(x2,k=12),method= "REML")
gam.check(norm_model)

plot of chunk gam_check_plots1

gam.check() plots: negative binomial data, Poisson model

pois_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12),family=poisson,method= "REML")
gam.check(pois_model)

plot of chunk gam_check_plots2

gam.check() plots: negative binomial data, negative binomial model

negbin_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12),family=nb,method= "REML")
gam.check(negbin_model)

plot of chunk gam_check_plots3

Exercises

  1. You previously fit models to dolphin models with various \( k \) values. Run gam.check() on models with both high and low values and inspect the results.

  2. Look at the gam.check() plots and find other problems with residual distribution. Can you fix these?

    • (Hint: look at ?quasipoisson, ?negbin, and ?tw)

Concurvity

What is concurvity?

  • Nonlinear measure, similar to co-linearity

  • Measures, for each smooth term, how well this term could be approximated by

    • concurvity(model, full=TRUE): some combination of all other smooth terms
    • concurvity(model, full=FALSE): Each of the other smooth terms in the model (useful for identifying which terms are causing issues)

A demonstration

plot of chunk concurve1

[1] "concurvity(m1, full=FALSE)"

A demonstration

[1] "concurvity(m1, full=FALSE)"
$worst
         para s(x1_cc) s(x2_cc)
para        1    0.000    0.000
s(x1_cc)    0    1.000    0.119
s(x2_cc)    0    0.119    1.000

$observed
         para s(x1_cc) s(x2_cc)
para        1    0.000    0.000
s(x1_cc)    0    1.000    0.039
s(x2_cc)    0    0.069    1.000

$estimate
         para s(x1_cc) s(x2_cc)
para        1    0.000    0.000
s(x1_cc)    0    1.000    0.035
s(x2_cc)    0    0.039    1.000

A demonstration

plot of chunk concurve2

[1] "concurvity(m1, full=TRUE)"
         para s(x1_cc) s(x2_cc)
worst       0     0.32     0.32
observed    0     0.03     0.19
estimate    0     0.08     0.19

A demonstration

plot of chunk concurve3

[1] "concurvity(m1, full=TRUE)"
         para s(x1_cc) s(x2_cc)
worst       0     0.84     0.84
observed    0     0.22     0.57
estimate    0     0.28     0.60

A demonstration

plot of chunk concurve4

[1] "concurvity(m1, full=TRUE)"
         para s(x1_cc) s(x2_cc)
worst       0      1.0     1.00
observed    0      1.0     1.00
estimate    0      0.3     0.97

Concurvity: things to remember

  • Can make your model unstable to small changes
  • cor(data) not sufficient: use the concurvity(model) function
  • Not always obvious from plots of smooths!!

Overall

Make sure to test your model! GAMs are powerful, but with great power…

You should at least:

  1. Check if your smooths are sufficiently smooth

  2. Test if you have the right distribution

  3. Make sure there's no patterns left in your data

  4. If you have time series, grouped, spatial, etc. data, check for dependencies