Exercise 6. Diet data: tabulating incidence rates and modelling with Poisson regression
You may have to install the required packages the first time you use them. You can install a package by install.packages("package_of_interest") for each package you require.
library(biostat3)
library(dplyr) # for data manipulationLoad the diet data using time-on-study as the timescale.
head(diet)## id chd y hieng energy job month height weight doe dox
## 1 127 0 16.791239 low 2023.25 conductor 2 173.9900 61.46280 1960-02-16 1976-12-01
## 2 200 0 19.958933 low 2448.68 bank 12 177.8000 73.48320 1956-12-16 1976-12-01
## 3 198 0 19.958933 low 2281.38 bank 12 NA NA 1956-12-16 1976-12-01
## 4 222 0 15.394935 low 2467.95 bank 2 158.7500 58.24224 1957-02-16 1972-07-10
## 5 305 1 1.494866 low 2362.93 bank 1 NA NA 1960-01-16 1961-07-15
## 6 173 0 15.958932 low 2380.11 conductor 12 164.4904 79.01712 1960-12-16 1976-12-01
## dob y1k bmi jobNumber
## 1 1910-09-27 0.016791239 20.30316 1
## 2 1909-06-18 0.019958933 23.24473 2
## 3 1910-06-30 0.019958933 NA 2
## 4 1902-07-11 0.015394935 23.11057 2
## 5 1913-06-30 0.001494866 NA 2
## 6 1915-06-28 0.015958932 29.20385 1
summary(diet)## id chd y hieng energy
## Min. : 1 Min. :0.0000 Min. : 0.2875 low :155 Min. :1748
## 1st Qu.: 85 1st Qu.:0.0000 1st Qu.:10.7762 high:182 1st Qu.:2537
## Median :169 Median :0.0000 Median :15.4606 Median :2803
## Mean :169 Mean :0.1365 Mean :13.6607 Mean :2829
## 3rd Qu.:253 3rd Qu.:0.0000 3rd Qu.:17.0431 3rd Qu.:3110
## Max. :337 Max. :1.0000 Max. :20.0411 Max. :4396
##
## job month height weight doe
## driver :102 Min. : 1.000 Min. :152.4 Min. : 46.72 Min. :1956-11-16
## conductor: 84 1st Qu.: 3.000 1st Qu.:168.9 1st Qu.: 64.64 1st Qu.:1959-01-16
## bank :151 Median : 6.000 Median :173.0 Median : 72.80 Median :1960-02-16
## Mean : 6.231 Mean :173.4 Mean : 72.54 Mean :1960-06-22
## 3rd Qu.:10.000 3rd Qu.:177.8 3rd Qu.: 79.83 3rd Qu.:1961-06-16
## Max. :12.000 Max. :190.5 Max. :106.14 Max. :1966-09-16
## NA's :5 NA's :4
## dox dob y1k bmi
## Min. :1958-08-29 Min. :1892-01-10 Min. :0.0002875 Min. :15.88
## 1st Qu.:1972-09-29 1st Qu.:1906-01-18 1st Qu.:0.0107762 1st Qu.:21.59
## Median :1976-12-01 Median :1911-02-25 Median :0.0154606 Median :24.11
## Mean :1974-02-19 Mean :1911-01-04 Mean :0.0136607 Mean :24.12
## 3rd Qu.:1976-12-01 3rd Qu.:1915-01-30 3rd Qu.:0.0170431 3rd Qu.:26.50
## Max. :1976-12-01 Max. :1930-09-19 Max. :0.0200411 Max. :33.29
## NA's :5
## jobNumber
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :1.145
## 3rd Qu.:2.000
## Max. :2.000
##
(a)
diet <- biostat3::diet
diet$y1k <- diet$y/1000
diet.ir6a <- survRate(Surv(y/1000,chd) ~ hieng, data=diet)
## or
diet %>%
group_by(hieng) %>%
summarise(Event = sum(chd), Time = sum(y1k), Rate = Event/Time, # group sums
CI_low = poisson.test(Event,Time)$conf.int[1],
CI_high = poisson.test(Event,Time)$conf.int[2]) ## # A tibble: 2 x 6
## hieng Event Time Rate CI_low CI_high
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 low 28 2.06 13.6 9.03 19.6
## 2 high 18 2.54 7.07 4.19 11.2
## IRR
with(diet.ir6a, poisson.test(event,tstop)) ##
## Comparison of Poisson rates
##
## data: event time base: tstop
## count1 = 28, expected count1 = 20.578, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 1.026173 3.688904
## sample estimates:
## rate ratio
## 1.921747
with(diet.ir6a, poisson.test(rev(event),rev(tstop)))##
## Comparison of Poisson rates
##
## data: rev(event) time base: rev(tstop)
## count1 = 18, expected count1 = 25.422, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.2710832 0.9744943
## sample estimates:
## rate ratio
## 0.5203599
We see that individuals with a high energy intake have a lower CHD incidence rate. The estimated crude incidence rate ratio is 0.52 (95% CI: 0.27, 0.97).
(b)
poisson6b <- glm( chd ~ hieng + offset( log( y1k ) ), family=poisson, data=diet)
summary(poisson6b)##
## Call:
## glm(formula = chd ~ hieng + offset(log(y1k)), family = poisson,
## data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7382 -0.6337 -0.4899 -0.3891 3.0161
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.6098 0.1890 13.811 <2e-16 ***
## hienghigh -0.6532 0.3021 -2.162 0.0306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 258.00 on 335 degrees of freedom
## AIC: 354
##
## Number of Fisher Scoring iterations: 6
eform(poisson6b)## exp(beta) 2.5 % 97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh 0.5203599 0.2878432 0.9407012
eform(poisson6b, method="Profile")## Waiting for profiling to be done...
## exp(beta) 2.5 % 97.5 %
## (Intercept) 13.5959916 9.1614552 19.2715805
## hienghigh 0.5203599 0.2829171 0.9328392
The point estimate for the IRR calculated by the Poisson regression is the same as the IRR calculated in 6(a). A theoretical observation: if we consider the data as being cross-classified solely by hieng then the Poisson regression model with one parameter is a saturated model so the IRR estimated from the model will be identical to the ‘observed’ IRR. That is, the model is a perfect fit.
(c)
hist6c <- hist(diet$energy, breaks=25, probability=TRUE, xlab="Energy (units)")
curve(dnorm(x, mean=mean(diet$energy), sd=sd(diet$energy)), col = "red", add=TRUE)quantile(diet$energy, probs=c(0.01,0.05,0.1,0.25,0.5,0.75,0.90,0.95,0.99))## 1% 5% 10% 25% 50% 75% 90% 95% 99%
## 1887.268 2177.276 2314.114 2536.690 2802.980 3109.660 3365.644 3588.178 4046.820
# For kurtosis and skewness, see package e1071The histogram gives us an idea of the distribution of energy intake. We can also tabulate moments and percentiles of the distribution.
(d)
diet$eng3 <- cut(diet$energy, breaks=c(1500,2500,3000,4500),labels=c("low","medium","high"),
right = FALSE)
cbind(Freq=table(diet$eng3),
Prop=table(diet$eng3)/nrow(diet))## Freq Prop
## low 75 0.2225519
## medium 150 0.4451039
## high 112 0.3323442
(e)
## rates and IRRs
diet.ir6e <- survRate(Surv(y/1000,chd) ~ eng3, data=diet)
print(diet.ir6e)## eng3 tstop event rate lower upper
## eng3=low low 0.9466338 16 16.901995 9.660951 27.447781
## eng3=medium medium 2.0172621 22 10.905871 6.834651 16.511619
## eng3=high high 1.6397728 8 4.878725 2.106287 9.613033
# calculate IRR and confidence intervals
with(diet.ir6e, rate[eng3=="medium"] / rate[eng3=="low"])## [1] 0.6452416
with(diet.ir6e[c(2,1),], { # compare second row with first row
poisson.test(event, tstop)
})##
## Comparison of Poisson rates
##
## data: event time base: tstop
## count1 = 22, expected count1 = 25.863, p-value = 0.2221
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.3237007 1.3143509
## sample estimates:
## rate ratio
## 0.6452416
with(diet.ir6e, rate[eng3=="high"] / rate[eng3=="low"])## [1] 0.2886479
with(diet.ir6e[c(3,1),], { # compare third row with first row
poisson.test(event, tstop)
})##
## Comparison of Poisson rates
##
## data: event time base: tstop
## count1 = 8, expected count1 = 15.216, p-value = 0.004579
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.1069490 0.7148284
## sample estimates:
## rate ratio
## 0.2886479
We see that the CHD incidence rate decreases as the level of total energy intake increases.
(f)
diet <- mutate(diet,
X1 = as.numeric(eng3 == "low"),
X2 = as.numeric(eng3 == "medium"),
X3 = as.numeric(eng3 == "high"))
# or
diet <- biostat3::addIndicators(diet, ~eng3+0) %>%
mutate(X1 = eng3low, X2 = eng3medium, X3 = eng3high)
colSums(diet[c("X1","X2","X3")])## X1 X2 X3
## 75 150 112
(g)
filter(diet, eng3=="low") %>% select(c(energy,eng3,X1,X2,X3)) %>% head## energy eng3 X1 X2 X3
## 1 2023.25 low 1 0 0
## 2 2448.68 low 1 0 0
## 3 2281.38 low 1 0 0
## 4 2467.95 low 1 0 0
## 5 2362.93 low 1 0 0
## 6 2380.11 low 1 0 0
filter(diet, eng3=="medium") %>% select(c(energy,eng3,X1,X2,X3)) %>% head## energy eng3 X1 X2 X3
## 1 2664.64 medium 0 1 0
## 2 2533.33 medium 0 1 0
## 3 2854.08 medium 0 1 0
## 4 2673.77 medium 0 1 0
## 5 2766.88 medium 0 1 0
## 6 2586.69 medium 0 1 0
filter(diet, eng3=="high") %>% select(c(energy,eng3,X1,X2,X3)) %>% head## energy eng3 X1 X2 X3
## 1 3067.36 high 0 0 1
## 2 3298.95 high 0 0 1
## 3 3147.60 high 0 0 1
## 4 3180.47 high 0 0 1
## 5 3045.81 high 0 0 1
## 6 3060.03 high 0 0 1
(h)
poisson6h <- glm( chd ~ X2 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
summary(poisson6h)##
## Call:
## glm(formula = chd ~ X2 + X3 + offset(log(y1k)), family = poisson,
## data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8231 -0.6052 -0.4532 -0.3650 2.9434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.8274 0.2500 11.312 < 2e-16 ***
## X2 -0.4381 0.3285 -1.334 0.18233
## X3 -1.2425 0.4330 -2.870 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 253.62 on 334 degrees of freedom
## AIC: 351.62
##
## Number of Fisher Scoring iterations: 6
eform(poisson6h)## exp(beta) 2.5 % 97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## X2 0.6452416 0.3389050 1.2284761
## X3 0.2886478 0.1235407 0.6744143
Level 1 of the categorized total energy is the reference category. The estimated rate ratio comparing level 2 to level 1 is 0.6452 and the estimated rate ratio comparing level 3 to level 1 is 0.2886.
(i)
poisson6i <- glm( chd ~ X1 + X3 + offset( log( y1k ) ), family=poisson, data=diet )
# or
poisson6i <- glm( chd ~ I(eng3=="low") + I(eng3=="high") + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6i )##
## Call:
## glm(formula = chd ~ I(eng3 == "low") + I(eng3 == "high") + offset(log(y1k)),
## family = poisson, data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8231 -0.6052 -0.4532 -0.3650 2.9434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3893 0.2132 11.207 <2e-16 ***
## I(eng3 == "low")TRUE 0.4381 0.3285 1.334 0.1823
## I(eng3 == "high")TRUE -0.8044 0.4129 -1.948 0.0514 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 253.62 on 334 degrees of freedom
## AIC: 351.62
##
## Number of Fisher Scoring iterations: 6
eform( poisson6i )## exp(beta) 2.5 % 97.5 %
## (Intercept) 10.9058706 7.1810131 16.562846
## I(eng3 == "low")TRUE 1.5498071 0.8140167 2.950679
## I(eng3 == "high")TRUE 0.4473485 0.1991681 1.004783
Now use level 2 as the reference (by omitting X2 but including X1 and X3). The estimated rate ratio comparing level 1 to level 2 is 1.5498 and the estimated rate ratio comparing level 3 to level 2 is 0.4473.
(j)
poisson6j <- glm( chd ~ eng3 + offset( log( y1k ) ), family=poisson, data=diet )
summary( poisson6j )##
## Call:
## glm(formula = chd ~ eng3 + offset(log(y1k)), family = poisson,
## data = diet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8231 -0.6052 -0.4532 -0.3650 2.9434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.8274 0.2500 11.312 < 2e-16 ***
## eng3medium -0.4381 0.3285 -1.334 0.18233
## eng3high -1.2425 0.4330 -2.870 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.82 on 336 degrees of freedom
## Residual deviance: 253.62 on 334 degrees of freedom
## AIC: 351.62
##
## Number of Fisher Scoring iterations: 6
eform( poisson6j )## exp(beta) 2.5 % 97.5 %
## (Intercept) 16.9019960 10.3556032 27.5867534
## eng3medium 0.6452416 0.3389050 1.2284761
## eng3high 0.2886478 0.1235407 0.6744143
The estimates are identical (as we would hope) when we have R create indicator variables for us.
(k)
Somehow (there are many different alternatives) you’ll need to calculate the total number of events and the total person-time at risk and then calculate the incidence rate as events/person-time. For example,
summarise(diet, rate = sum(chd) / sum(y))## rate
## 1 0.009992031
The estimated incidence rate is 0.00999 events per person-year.