This post will illustrate how to conduct statistical analysis on count data.
When discussing the modeling of count data, it’s important to clarify exactly what is meant by a count. The word “count” is typically used as a verb meaning to enumerate units, items, or events. We might count the number of road kills observed on a stretch of highway, how many patients died at a particular hospital within 48 hours of having a myocardial infarction.
The datset that will be explored is a forest fire dataset. In this dataset it contains the following:
cases: number of forest fire cases reported in each of the 253 areas between June and September, 2017;
size: the size of each area (in km2);
windspeed: the average wind speed (in mph) of each area over the observation period;
temperature: the average temperature (in oC) of each area over the observation period;
wui: a quantity measuring the woodland-urban interface, the proportion of each area where houses are in or near wildland vegetation;
terrain: a variable describing the main topographical feature of each area.
A local authoriy is looking at using the data from past forest fires to drive policy decision on what polidices would best prevent a protect from local forest fires from breaking out.
In this example the data is bounded at zero, with forest fire cases being the dependent variable then it is not possible to obtain negative counts. Consequently, it would not be suitable to use standard linear regression in this case, since the assumptions that Y (the dependent variable) follows a normal distribution is violated, as it assumes normal errors around the mean, and hence equally weights them. This means that the likelihood of a count of one is just as likely as negative one, which is not the case for this data here.
#Histogram of forest fire cases
hist(forestfire$cases,breaks = c(0,0.9999,1.9999,2.9999,3.9999,4.9999,5.9999),
freq = TRUE,xlab='Cases',main = 'Histogram of number of forest fires')
the AREAS in the plot are wrong -- rather use 'freq = FALSE'
The above histogram shows how the distribution of the data does not follow a normal distribution and is thus unsuitable for linear regression.
#Plot of the data
plot(forestfire)
In the above pair plot it is evident that there is no liner correlation between any of the variables. This further illustrates how a linear regression model is not suitable. An alternative is to use a Poisson regression model or one of its variants. These models have a number of advantages over an ordinary linear regression model, including a skew, discrete distribution, and the restriction of predicted values to non-negative numbers. A Poisson model is similar to an ordinary linear regression, with two exceptions. First, it assumes that the errors follow a Poisson, not a normal, distribution. Second, rather than modeling Y as a linear function of the regression coefficients, it models the natural log of the response variable, ln(Y), as a linear function of the coefficients.
Each count can be modeled within a Poisson distribution with a mean value of
\[\textrm{Pr}(Y=y|\lambda)=\frac{e^{-\lambda}\lambda^{y}}{y!} \quad (y=0,1,2,...). \] In the above equation the distribution is specified using a single parameter lambda. The parameter lambda is the mean incidence count. Exposure may be time, space, distance, area, volume, or population size.
The parameter lambda may be interpreted as the risk of a new occurrence of the event during a specified exposure period, t. The probability of y events is then given by
\[\Pr(Y = y | \lambda,t) = \frac{e^{-\lambda t}(\lambda t)^y}{y!} \quad (y=0,1,2,...).\] When employing a rate parameter over time, area or volume to a Poisson model, then it is entered as the natural log of t as an offset into the estimating algorithm. The offset with the fitted value is expressed as
\[\lambda = e^{(x\beta + \log(t))},\] and can be rewritten as
\[e^{x\beta} = \frac{\lambda}{t}, \quad \lambda = t e^{x\beta},\] log(t) is entered into the estimating algorithm as a constant (Hilbe, 2014).
#Number of data points
length(forestfire$cases)
[1] 253
summary(forestfire)
cases size windspeed temperature wui terrain
Min. :0.000 Min. : 60.8 Min. : 1.984 Min. :20.00 Min. :0.05015 forest : 38
1st Qu.:0.000 1st Qu.:106.3 1st Qu.: 6.705 1st Qu.:22.23 1st Qu.:0.16471 hillyArea: 51
Median :1.000 Median :117.7 Median : 7.950 Median :24.32 Median :0.20477 plain :117
Mean :1.115 Mean :119.1 Mean : 8.051 Mean :24.28 Mean :0.20087 swamp : 47
3rd Qu.:2.000 3rd Qu.:130.8 3rd Qu.: 9.479 3rd Qu.:26.02 3rd Qu.:0.23695
Max. :5.000 Max. :168.5 Max. :15.621 Max. :31.03 Max. :0.35279
Having established that linear regression is unsuited to model this count data, a Poisson regression can be used as an appropriate model to use in this case, provided it satisfies the basic requirements of a Poisson model. The general formula for the Poisson regres- sion model is:
\[Y_i = \lambda_i = \exp^{x_{i}\beta},\] where Y is the predicted count, lambda is the predicted event rate for ith sample, and e are the regressors.
The basic Poisson assumptions are:
1. The distribution is discrete with a single parameter, the mean. It is the expected number of times that an item or event occurs per unit of time, area or volume.
2. The response terms are non negative integers.
3. Observations are independent of one another.
4. The mean and variance of the model are almost the same.
One condition that the Poisson regression must satisfy is to ensure that the data is not significantly over or under dispersed. This means that the mean should be approximately the same as the variance (Hilbe, 2014).
#Compare Mean and Variance the number of cases
mean(forestfire$cases)
[1] 1.114625
var(forestfire$cases)
[1] 1.165381
mean(forestfire$size)
[1] 119.0527
var(forestfire$size)
[1] 366.4492
mean(forestfire$cases)/mean(forestfire$size)
[1] 0.00936245
var(forestfire$cases)/var(forestfire$size)
[1] 0.003180196
#Difference in rate mean and rate variance
mean(forestfire$cases)/mean(forestfire$size)-var(forestfire$cases)/var(forestfire$size)
[1] 0.006182253
Subsequently, the difference between the rate mean and rate variance are almost identical (a difference of 0.006) suggesting there is no substantial over or under dispersion here.
The general Poisson model using the size of each area as the offset term gives the following regression formula: \[\log(\lambda_i) = \log(\textrm{size}) + \beta_0 + \beta_1\textrm{Windspeed} + \beta_2\textrm{Temprature} +\beta_3 \textrm{WUI}\\ + \beta_4\textrm{Terrain}X_1\textrm{HillyArea}+\beta_4\textrm{Terrain}X_2\textrm{Plain}+\beta_4\textrm{Terrain}X_3\textrm{Swamp}.\]
#Fitting a poisson model
FullPoissonModel <- glm(forestfire$cases ~ forestfire$windspeed +
forestfire$temperature + forestfire$wui
+ forestfire$terrain + offset(log(forestfire$size)),poisson)
summary(FullPoissonModel)
Call:
glm(formula = forestfire$cases ~ forestfire$windspeed + forestfire$temperature +
forestfire$wui + forestfire$terrain + offset(log(forestfire$size)),
family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.14120 -1.29547 -0.06082 0.46339 2.62288
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.439233 0.687780 -9.362 < 2e-16 ***
forestfire$windspeed 0.024764 0.028430 0.871 0.38374
forestfire$temperature 0.067991 0.023652 2.875 0.00405 **
forestfire$wui 0.549923 1.117609 0.492 0.62268
forestfire$terrainhillyArea -0.276622 0.196199 -1.410 0.15857
forestfire$terrainplain -0.353247 0.166549 -2.121 0.03392 *
forestfire$terrainswamp -0.006726 0.186626 -0.036 0.97125
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 299.85 on 252 degrees of freedom
Residual deviance: 284.72 on 246 degrees of freedom
AIC: 696.14
Number of Fisher Scoring iterations: 5
The P-value for temprature is less than 0.05 so this is sgnificant. Although the P-value for the terrain plain is close to 0.05 and is only slightly larger. Subsequntly, the next step is to fit a model with the temprature only.
#fit the model with temprature only
TempPoissonModel <- glm(forestfire$cases ~ forestfire$temperature +
offset(log(forestfire$size)),poisson)
summary(TempPoissonModel)
Call:
glm(formula = forestfire$cases ~ forestfire$temperature + offset(log(forestfire$size)),
family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.96352 -1.35488 -0.08503 0.53047 2.73585
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.20382 0.58385 -10.626 <2e-16 ***
forestfire$temperature 0.06258 0.02352 2.661 0.0078 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 299.85 on 252 degrees of freedom
Residual deviance: 292.80 on 251 degrees of freedom
AIC: 694.21
Number of Fisher Scoring iterations: 5
The next step is to compare the two models and decide which one would ultimatly be best to use as a final model.
#Compare the full model with the reduced model
library(lmtest)
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
lrtest(FullPoissonModel,TempPoissonModel)
Likelihood ratio test
Model 1: forestfire$cases ~ forestfire$windspeed + forestfire$temperature +
forestfire$wui + forestfire$terrain + offset(log(forestfire$size))
Model 2: forestfire$cases ~ forestfire$temperature + offset(log(forestfire$size))
#Df LogLik Df Chisq Pr(>Chisq)
1 7 -341.07
2 2 -345.11 -5 8.079 0.1519
H0: Smaller model is best H1: Larger model is better
Here P > 0.05 so we retain H0 and the full model is worse. We go for the simpler model. However, model 1 does have a slightly lower LogLiklihood.
The regression model is consequently: \[ \log(\lambda_i) = \log(\textrm{size}) + \beta_0 + \beta_1\textrm{Temprature}\] \[ \log(\lambda_i) = -5.90696+ 0.05267*\textrm{Temprature}\]
Goodness of fit
anova(TempPoissonModel,test = "Chisq")
Analysis of Deviance Table
Model: poisson, link: log
Response: forestfire$cases
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 252 299.85
forestfire$temperature 1 7.0473 251 292.80 0.007939 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
H0: Null Modell is best.
H1: Full model fits better.
Here P-val < 0.05 so reject H0, and the Temp Model is a good fit.
Check for over/underdispersion
library(AER)
deviance(TempPoissonModel)/TempPoissonModel$df.residual
[1] 1.166541
dispersiontest(TempPoissonModel)
Overdispersion test
data: TempPoissonModel
z = 0.079021, p-value = 0.4685
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
1.007029
Here P-val > 0.05 so retain null hypothesis, and true dispersion is not significantly greater than one and the data is not overdispersed.
Check for influential and leverage points
library(car)
influencePlot(TempPoissonModel)
cooks <- cooks.distance(TempPoissonModel)
leverage <- hatvalues(TempPoissonModel)
forestfire[cooks>0.018, ] #2p/n p=2 n=220
forestfire[leverage>0.018, ]
Check for zero inflation
library(pscl)
zerocheck <- zeroinfl(forestfire$cases ~ forestfire$temperature+
offset(log(forestfire$size)), dist='poisson')
AIC(TempPoissonModel,zerocheck)
The Temp Poisson Model has a lower AIC and suggesting it is a better fit than a zero inflated model. This suggests that the data is not zero inflated.
Half normal probability plot.
library(faraway)
halfnorm(residuals(TempPoissonModel))
Half-Normal plots are obtained by plotting the ordered absolute values of a model diagnostic versus the expected order statistic of a half-normal distribution.
From the final regression model above it is clear that temperature is of key importance when it comes to pre- vention of forest fires. With a positive coefficient value of approximately 0.05 meaning there is a positive correlation with the number of fires and the temperature. Additionally, what the model has shown is that since the other factors are not statistically significant at the 5% level then there is no significant greater chance of forest fires occurring for different terrains. Furthermore, the wind-speed and proportion of woodland urban interface (WUI) does not have any significant added reason in a greater number of counts of forest fires occurring.
A policy maker would perhaps take greater precautions against forest fires in areas that have higher temperatures, to have a notable impact on reducing and preventing forest fires. Furthermore, the greatest drive in policy decsions should be focused around temprature. For example, allocating more resources when tempratures rise across all different terrains.