Note that we are using a data file that has already been explored and modified to remove outliers. Read the introduction to the model to see where to find the original data file and explanation on data exploration

MODEL

Our aim is to standardise hilsa catch (kg) as a function of effort (trip days and net length) among months and regions To demonstrate the process of model development we start with a Poisson distribution and include all covariates

M1 <- glmmTMB(Catch ~ Area + Nlength + TripDays + YrMonth,
                      family = "poisson"(link = "log"),
                      data = hils)

check_overdispersion(M1)

## # Overdispersion test
## 
##        dispersion ratio =    74.825
##   Pearson's Chi-Squared = 19379.679
##                 p-value =   < 0.001

## Overdispersion detected.

The model is overdispersed (p < 0.001). Clearly this is related to the number of zeroes in our data.

Now we try a negative binomial distribution

M2 <- glmmTMB(Catch ~ Area + YrMonth + Nlength + TripDays,
                      family = "nbinom1"(link = "log"),
                      ziformula=~ 0,
                      data = hils)
check_overdispersion(M2)

## # Overdispersion test
## 
##        dispersion ratio =   0.891
##   Pearson's Chi-Squared = 229.806
##                 p-value =   0.896

## No overdispersion detected.

The overdispersion statistic is better, but the fit can still be improved. For that we handle dependency due to area (spatial) and month (temporal) as random terms.

M3 <- glmmTMB(Catch ~ TripDays + Nlength +
                     (1|Area) + (1|YrMonth),
                      family = "nbinom1"(link = "log"),
                      ziformula=~ 0,
                      data = hils)

check_overdispersion(M3)

## # Overdispersion test
## 
##        dispersion ratio =   0.879
##   Pearson's Chi-Squared = 228.581
##                 p-value =    0.92

## No overdispersion detected.

Overdispersion is fine, but we can a warning that the model is probably overparameterised. This is because the data does not contain enough information to estimate model parameters properly). So we will use a zero-inflated negative binomial model (ZINB1)

# Try a zero-inflated negative binomial model (ZINB1)
M4 <- glmmTMB(Catch ~ Area + Nlength + TripDays + YrMonth, # models count data
                      family = "nbinom1"(link = "log"),
                      ziformula=~ 1,                       # models binomial data
                      data = hils)
check_overdispersion(M4)

## # Overdispersion test
## 
##        dispersion ratio =   0.894
##   Pearson's Chi-Squared = 229.807
##                 p-value =   0.888

## No overdispersion detected.

Now the dispersion ratio is clearly smaller than 1, so the model is a bit underdispersed. We will next include the area and time into both the count data and the binomial data part.

# Zero-inflated negative binomial (ZINB2)
M5 <- glmmTMB(Catch ~ Area + Nlength + TripDays + YrMonth, # models count data
                      family = "nbinom1"(link = "log"),
                      ziformula =~ Area + YrMonth,         # models binomial data
                      data = hils)
check_overdispersion(M5)

## # Overdispersion test
## 
##        dispersion ratio =   0.911
##   Pearson's Chi-Squared = 230.597
##                 p-value =   0.841

## No overdispersion detected.

Compare all models using Akaike information criterion

Out <- AIC(M1,M2,M3,M4,M5)
rownames(Out) <- c("Poisson GLM",
                   "NB GLM",
                   "NB GLMM",
                   "ZINB1 GLM",
                   "ZINB2 GLM")
colnames(Out) <- c("df", "AIC")
round(Out,0)

##             df   AIC
## Poisson GLM  7 11939
## NB GLM       8  1904
## NB GLMM      6  1911
## ZINB1 GLM    9  1906
## ZINB2 GLM   13  1913

The negative binomial GLM model (M2) has the lowest AIC value, so we consider it the best model. Next, we will simulate new data from model parameters and plot this datas as a frequency histogram. We expect the observed number of zeros in the data should fall within a the histogram for simulated number of zeros. We will do that with both model 2 and model 3 (even though we had a warning when we ran model 3)

par(mfrow = c(2,1), mar = c(5,5,3,3))

# Model M2 (NB GLM)
testZeroInflation(M2)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 0.76227, p-value = 0.008
## alternative hypothesis: two.sided

# Model M2 (NB GLMM)
testZeroInflation(M3)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 0.77705, p-value = 0.096
## alternative hypothesis: two.sided

We see that the second model produces a more likely distribution of data relative to our observation (red line, or observed number of zeros, fall within the simulated values), although the model cannot fully estimate all paramters given this dataset.

Plot model outputs and parameters

We plot model outputs to show predictions on how catches will depend on the number of days spent fishing and gillnet lengths.

We can also plot how catches depend on the trip length during different months and in different fishing areas.

Plot model parameter estimates in a publication format

## Random effect variances not available. Returned R2 does not account for random effects.

	NB GLM (Hilsha)
Coeffcient	Log-Mean	Conf. Int (95%)	P-value
(Intercept)	2.49	2.06 – 2.92	<0.001
Area [LowerMeghna]	-0.11	-0.48 – 0.26	0.556
Area [LowerPadma]	-0.44	-0.89 – 0.01	0.056
YrMonth [Oct92]	0.38	0.10 – 0.66	0.007
YrMonth [Sep92]	-0.08	-0.35 – 0.20	0.596
Nlength	0.00	-0.00 – 0.00	0.174
TripDays	0.41	0.33 – 0.49	<0.001
Observations	266
R² conditional / R² marginal	NA / 0.960

R code for hilsa (Tenualosa ilisha) CPUE data: zero inflated models

Carl Smith

2022-11-23

Data exploration: outliers

Normality and homogeneity of variance

How many zeros in the data?

MODEL

Plot model outputs and parameters

Plot model parameter estimates in a publication format