Eric Pedersen, Gavin Simpson, David Miller, Noam Ross
August 5th, 2017
Most glm families (Poisson, Gamma, Gaussian, Binomial) are exponential families
\[ f(x|\theta) \sim exp(\sum_i \eta_i(\theta)T_i(x) - A(\theta)) \]
mgcv
has expanded to cover many new familiesModelling exta zeros with zero-inflated and adjusted families
NOTE: All the distributions we're covering here have their own quirks. Read the help files carefully before using them!
tw()
nb()
betar()
family in mgcvThe binomial model also model's proportions — more specifically it models the number of successes in \( m \) trials. If you have data of this form then model the binomial counts as this can yield predicted counts if required.
If you have true percentage or proportion data, say estimated proportional plant cover in a quadrat, then the beta model is appropriate.
Also, if all you have is the percentages, the beta model is unlikely to be terribly bad.
To illustrate the use of the betar()
family in mgcv we use a behavioural data set of observations on captive cheetahs. These data are provided and extensively analysed in Zuur et al () and originate from Quirke et al (2012).
cheetah <- read.table("../data/beta-regression/ZooData.txt", header = TRUE)
names(cheetah)
[1] "Number" "Scans" "Proportion" "Size" "Visual"
[6] "Raised" "Visitors" "Feeding" "Oc" "Other"
[11] "Enrichment" "Group" "Sex" "Enclosure" "Vehicle"
[16] "Diet" "Age" "Zoo" "Eps"
cheetah <- transform(cheetah, Raised = factor(Raised),
Feeding = factor(Feeding),
Oc = factor(Oc),
Other = factor(Other),
Enrichment = factor(Enrichment),
Group = factor(Group),
Sex = factor(Sex, labels = c("Male","Female")),
Zoo = factor(Zoo))
m <- gam(Proportion ~ s(log(Size)) + s(Visitors) + s(Enclosure) +
s(Vehicle) + s(Age) + s(Zoo, bs = "re") +
Feeding + Oc + Other + Enrichment + Group + Sex,
data = cheetah, family = betar(), method = "REML")
Family: Beta regression(14.008)
Link function: logit
Formula:
Proportion ~ s(log(Size)) + s(Visitors) + s(Enclosure) + s(Vehicle) +
s(Age) + s(Zoo, bs = "re") + Feeding + Oc + Other + Enrichment +
Group + Sex
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.62169 0.28642 -9.153 < 2e-16 ***
Feeding2 -0.47017 0.24004 -1.959 0.050150 .
Oc2 0.89373 0.23419 3.816 0.000135 ***
Other2 -0.08821 0.22064 -0.400 0.689298
Enrichment2 -0.17822 0.24557 -0.726 0.468000
Group2 -0.57575 0.21491 -2.679 0.007383 **
SexFemale 0.16166 0.17415 0.928 0.353278
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(log(Size)) 2.8849324 3.606 27.686 1.23e-05 ***
s(Visitors) 1.0000413 1.000 0.088 0.76717
s(Enclosure) 1.6013748 1.979 1.177 0.51516
s(Vehicle) 1.0000770 1.000 7.391 0.00656 **
s(Age) 1.0008134 1.002 7.217 0.00726 **
s(Zoo) 0.0000135 8.000 0.000 0.62532
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.289 Deviance explained = 102%
-REML = -170.92 Scale est. = 1 n = 88
set.seed(4)
n=300
dat = data.frame(x=seq(0,10,length=n))
dat$f = 20*exp(-dat$x)*dat$x
dat$y = 1*rt(n,df = 3) + dat$f
norm_mod = gam(y~s(x,k=20), data=dat, family=gaussian(link="identity"))
t_mod = gam(y~s(x,k=20), data=dat, family=scat(link="identity"))
Family: Scaled t(2.976,0.968)
Link function: identity
Formula:
y ~ s(x, k = 20)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.02664 0.06853 29.57 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(x) 13.27 15.71 1221 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.695 Deviance explained = 63.1%
-REML = 546.75 Scale est. = 1 n = 300
n= 200
dat = data.frame(x1 = runif(n,-1,1),x2=2*pi*runif(n))
dat$f = dat$x1^2 + sin(dat$x2)
dat$y_latent = dat$f + rnorm(n,dat$f)
dat$y = ifelse(dat$y_latent<0,1, ifelse(dat$y_latent<0.5,2,3))
ocat_model = gam(y~s(x1)+s(x2), family=ocat(R=3),data=dat)
plot(ocat_model,page=1)
summary(ocat_model)
Family: Ordered Categorical(-1,-0.09)
Link function: identity
Formula:
y ~ s(x1) + s(x2)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5010 0.2792 1.794 0.0727 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(x1) 3.452 4.282 18.67 0.00133 **
s(x2) 5.195 6.270 84.34 1.09e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance explained = 57.7%
-REML = 97.38 Scale est. = 1 n = 200
head(model_dat)
tree_cover road_dist y
1 0.51 8.6 1
2 0.31 9.9 0
3 0.43 8.2 1
4 0.69 2.9 1
5 0.09 0.7 1
6 0.23 5.6 1
pairs(model_dat)
multinom_pred_data = as.data.frame(expand.grid(road_dist =seq(0,10,length=50),
tree_cover =c(0,0.33,0.66,1)))
multinom_pred = predict(multinom_model, multinom_pred_data,type = "response")
colnames(multinom_pred) = c("monkey","deer","pig")
multinom_pred_data = cbind(multinom_pred_data,multinom_pred)
multinom_pred_data_long = multinom_pred_data %>%
gather(species, probability, monkey, deer,pig)%>%
mutate(tree_cover =paste("tree cover = ", tree_cover,sep=""))
ggplot(aes(road_dist, probability,color=species),data=multinom_pred_data_long)+
geom_line()+
facet_grid(.~tree_cover)+
theme_bw(20)
formula = y~s(x1)+s(x2), weights= censor.var,family=cox.ph
formula = list(y~s(x1)+s(x2), ~s(x2)+s(x3)), family=gaulss
formula = list(y~s(x1)+s(x2), ~s(x2)+s(x3)), family=ziplss
That's the end of this section! We convene after lunch (1:00 PM). You'll get to work through a few more advanced examples of your choice.