Noam Ross (with lots of help from Eric J Pedersen)
August 5th, 2018
So you have a GAM: what now? gam.check() helps in two parts. Part 1 checks that the basis dimension (k) of each smooth is large enough; part 2 provides visual checks of the residuals. With all models, how accurate your predictions will be depends on how good the model is.
library(mgcv)

# Simulate two covariates and a nonlinear "true" signal
set.seed(2)
n = 400
x1 = rnorm(n)
x2 = rnorm(n)
y_val = 1 + 2*cos(pi*x1) + 2/(1 + exp(-5*x2))

# Three response types built from the same signal
y_norm = y_val + rnorm(n, 0, 0.5)                           # Gaussian
y_negbinom = rnbinom(n, mu = exp(y_val), size = 10)         # negative binomial
y_binom = rbinom(n, 1, prob = exp(y_val)/(1 + exp(y_val)))  # binomial
The basis dimension k sets the maximum wiggliness of each smooth and is chosen per term: s(x, k=10) or s(x, y, k=100). If k is too small the smooth cannot capture the true shape, so gam.check() tests whether k is large enough.
norm_model_1 = gam(y_norm~s(x1,k=4)+s(x2,k=4),method= "REML")
gam.check(norm_model_1)
Method: REML Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-0.0003467788,0.0005154578]
(score 736.9402 & scale 2.252304).
Hessian positive definite, eigenvalue range [0.000346021,198.5041].
Model rank = 7 / 7
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x1) 3.00 1.00 0.13 <2e-16 ***
s(x2) 3.00 2.91 1.04 0.79
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
norm_model_2 = gam(y_norm~s(x1,k=12)+s(x2,k=4),method= "REML")
gam.check(norm_model_2)
Method: REML Optimizer: outer newton
full convergence after 11 iterations.
Gradient range [-5.658609e-06,5.392657e-06]
(score 345.3111 & scale 0.2706205).
Hessian positive definite, eigenvalue range [0.967727,198.6299].
Model rank = 15 / 15
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x1) 11.00 10.84 0.99 0.41
s(x2) 3.00 2.98 0.86 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
norm_model_3 = gam(y_norm~s(x1,k=12)+s(x2,k=12),method= "REML")
gam.check(norm_model_3)
Method: REML Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-5.103686e-05,-1.089481e-08]
(score 334.2084 & scale 0.2485446).
Hessian positive definite, eigenvalue range [2.812257,198.6868].
Model rank = 23 / 23
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x1) 11.00 10.85 0.98 0.28
s(x2) 11.00 7.95 0.95 0.12
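If you only want the basis-dimension table without the diagnostic plots, mgcv also provides k.check(), which runs the same test that gam.check() reports; for example:
# Just the k', edf, k-index, and p-value table for each smooth
k.check(norm_model_3)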
layout(matrix(1:6,ncol=2,byrow = T))
plot(norm_model_1);plot(norm_model_2);plot(norm_model_3)
layout(1)
gam.check() creates four plots:
Quantile-quantile plot of residuals: if the model is right, the points should follow the 1-1 line
Histogram of residuals
Residuals vs. linear predictor
Observed vs. fitted values
gam.check() uses deviance residuals by default.
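You can also pull those residuals out and build the plots yourself; a minimal sketch (not part of the original code), using norm_model_3 from above:
# Deviance residuals, the default used by gam.check()
resid_dev = residuals(norm_model_3, type = "deviance")

# Residuals vs. linear predictor, and observed vs. fitted, by hand
plot(norm_model_3$linear.predictors, resid_dev,
     xlab = "linear predictor", ylab = "deviance residuals")
plot(fitted(norm_model_3), y_norm,
     xlab = "fitted values", ylab = "observed values")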
# Gaussian model on the Gaussian response: residual checks should look fine
norm_model = gam(y_norm~s(x1,k=12)+s(x2,k=12), method= "REML")
gam.check(norm_model)

# Poisson model on negative binomial data: overdispersion shows up in the residuals
pois_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12), family=poisson, method= "REML")
gam.check(pois_model)

# Negative binomial model on the same data: the residual checks improve
negbin_model = gam(y_negbinom~s(x1,k=12)+s(x2,k=12), family=nb, method= "REML")
gam.check(negbin_model)
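To compare the candidate distributions more directly, one option (an addition here, not from the original material) is to draw QQ plots with simulated reference bands and to compare AIC:
# QQ plots with simulated reference intervals (rep > 0 adds the bands)
qq.gam(pois_model, rep = 100)
qq.gam(negbin_model, rep = 100)

# Information criteria point the same way: lower is better
AIC(pois_model, negbin_model)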
Exercise: You previously fit models to the dolphin data with various \( k \) values. Run gam.check() on models with both high and low \( k \) values and inspect the results.
Look at the gam.check() plots and find other problems with the residual distribution. Can you fix these?
(Hint: see ?quasipoisson, ?negbin, and ?tw.)

Concurvity is a nonlinear measure, similar to co-linearity. It measures, for each smooth term, how well that term could be approximated by:
concurvity(model, full=TRUE): some combination of all the other smooth terms
concurvity(model, full=FALSE): each of the other smooth terms in the model (useful for identifying which terms are causing issues)
Values range from 0 (no problem) to 1 (total lack of identifiability).
[1] "concurvity(m1, full=FALSE)"
$worst
para s(x1_cc) s(x2_cc)
para 1 0.000 0.000
s(x1_cc) 0 1.000 0.119
s(x2_cc) 0 0.119 1.000
$observed
para s(x1_cc) s(x2_cc)
para 1 0.000 0.000
s(x1_cc) 0 1.000 0.039
s(x2_cc) 0 0.069 1.000
$estimate
para s(x1_cc) s(x2_cc)
para 1 0.000 0.000
s(x1_cc) 0 1.000 0.035
s(x2_cc) 0 0.039 1.000
[1] "concurvity(m1, full=TRUE)"
para s(x1_cc) s(x2_cc)
worst 0 0.32 0.32
observed 0 0.03 0.19
estimate 0 0.08 0.19
[1] "concurvity(m1, full=TRUE)"
para s(x1_cc) s(x2_cc)
worst 0 0.84 0.84
observed 0 0.22 0.57
estimate 0 0.28 0.60
[1] "concurvity(m1, full=TRUE)"
para s(x1_cc) s(x2_cc)
worst 0 1.0 1.00
observed 0 1.0 1.00
estimate 0 0.3 0.97
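As a rough illustration (not the simulation used for the output above), here is how such a check might be set up, with hypothetical correlated covariates x1_cc and x2_cc:
# Two covariates that share a common signal, so their smooths overlap
set.seed(3)
common = rnorm(n)
x1_cc = common + rnorm(n, 0, 0.5)
x2_cc = common + rnorm(n, 0, 0.5)
y_cc = 1 + 2*cos(pi*x1_cc) + 2/(1 + exp(-5*x2_cc)) + rnorm(n, 0, 0.5)

m1 = gam(y_cc ~ s(x1_cc) + s(x2_cc), method = "REML")

concurvity(m1, full = TRUE)   # overall: each smooth vs. all the others
concurvity(m1, full = FALSE)  # pairwise: which terms cause the problem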
Checking cor(data) is not sufficient: use the concurvity(model) function.
Make sure to test your model! GAMs are powerful, but with great power…
You should at least:
Check if your smooths are sufficiently smooth
Test if you have the right distribution
Make sure there are no patterns left in your residuals
If you have time series, grouped, spatial, etc. data, check for dependencies
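Putting those checks together for one of the models fit above, a minimal sketch (the acf() call is only illustrative here, since these simulated data have no time ordering):
final_model = gam(y_negbinom ~ s(x1, k = 12) + s(x2, k = 12),
                  family = nb, method = "REML")

gam.check(final_model)                           # basis dimension and residual plots
concurvity(final_model, full = TRUE)             # are the smooths confounded with each other?
acf(residuals(final_model, type = "deviance"))   # quick look for serial dependence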