Binary Regression Models with Log-Link in the Cohort Studies

Regression models have been used to control confounding in food-borne cohort studies, and logistic regression has been commonly used because it converges easily. However, logistic regression provides estimates of the odds ratio (OR), which approximates the risk ratio (RR) only when the incidence is lower than 10%, an unlikely situation in food-borne outbreaks. Recent developments have resolved the convergence problems of binary models with log link. Food items significant in the univariable analysis were included in the multivariable analysis of two recent Finnish norovirus outbreaks. We used both log and logistic regression models in R and a Bayesian model in WinBUGS, run through R with data handling in SPSS. The log-link model could be used to identify the vehicle in the two norovirus outbreak datasets. Convergence problems were solved using Bayesian modelling. The binary model with log link provided accurate and useful estimates of RR that estimate the true risk, making it a suitable method of choice for multivariable analysis of outbreak cohort studies.


BACKGROUND
Regression models have been used in analytical outbreak studies when several variables are significant in the univariable analysis. More specifically, they have been used to control confounding in food-borne analytical studies [1]. Logistic regression has been commonly used in cohort settings because it converges in most situations [2][3][4]. However, logistic regression provides estimates of the odds ratio (OR), which can be used as risk ratio (RR) estimates only when the incidence is lower than 10%, an unlikely situation in many food-borne outbreaks [5]. From a theoretical point of view, log regression models should be used in cohort settings, but the estimates are often on the boundary of the parameter space, which produces convergence difficulties [6]. To overcome this problem, several methods have recently been published to enable the convergence of binary models with log link, such as maximum likelihood estimation [6] or modified Poisson regression [7]. Attempts have also been made to use mathematical equations to convert OR values to RR values [8]. However, the validity of these equations has been questioned [9], and recent developments in Bayesian modelling have resolved the convergence problems of binary log-link models [10,11]. We used a simple binary model with log link in a Bayesian framework, applying an algorithm that confines the estimates within the parameter space, using WinBUGS with applicable results.
*Address correspondence to this author at the Department of Infectious Disease Surveillance and Control, National Institute for Health and Welfare, Helsinki, Finland; Tel: 00-358-29-524 8914; Fax: 00-358-29-524 8468; E-mail: katri.jalava@thl.fi
The aim of this study was to show the applicability of the binary model with log link in outbreak situations. Furthermore, by applying the method to two recent Finnish norovirus outbreaks in which multivariable analysis was needed, we showed empirically that the estimates from binary models with log link are appropriate estimates of RR, unlike those obtained from logistic regression. We further confirmed these results theoretically by showing that the commonly used transformation equation for converting ORs to RRs is not generally valid.
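The divergence between OR and RR at high attack rates can be illustrated with a small calculation. The sketch below uses two hypothetical cohorts (the counts are invented for illustration and are not taken from the outbreak datasets), both with a risk ratio of 2: one where illness is rare and one where it is common. The function names are likewise assumptions of this example.

```python
def risk_ratio(a, n1, c, n0):
    """Risk ratio from a cases among n1 exposed and c cases among n0 unexposed."""
    return (a / n1) / (c / n0)


def odds_ratio(a, n1, c, n0):
    """Odds ratio from the same 2x2 table."""
    return (a / (n1 - a)) / (c / (n0 - c))


# Hypothetical cohorts (illustrative counts only), both with RR = 2.
rare = (10, 100, 5, 100)     # risks 10% vs 5%
common = (80, 100, 40, 100)  # risks 80% vs 40%

for label, table in (("rare", rare), ("common", common)):
    print(label, "RR =", round(risk_ratio(*table), 2),
          "OR =", round(odds_ratio(*table), 2))
```

With rare illness the OR (about 2.1) stays close to the RR, whereas with common illness the same RR of 2 corresponds to an OR of 6, so reading the OR as if it were an RR would overstate the risk threefold.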

MATERIALS AND METHODS
We applied the Bayesian log regression method to two recent real outbreak datasets with high attack rates, in which several variables were significant in the univariable analysis and multivariable log regression was therefore required. Briefly, the first outbreak (outbreak 1) was caused by a norovirus in a workplace canteen, with an attack rate of 53% and a total cohort size of 175. In the univariable analysis, pesto chicken with potatoes and salmon in indie sauce were significant (Table 1). The second outbreak (outbreak 2) was also a norovirus outbreak in a workplace canteen, with an attack rate of 57% and a total cohort size of 74. In the univariable analysis, cold fish items and pasta salad were significant (Table 1). We used both log and logistic regression models in R (function glm and package glm2) and created a binary Bayesian model with log link in WinBUGS, which was run through R with data handling in SPSS. R is a free statistical and mathematical software environment with a number of application packages available [12]. The data and code for the respective Bayesian model are presented in Supplementary Material 1-6. Cases with missing data were excluded. The mathematical proof of the invalidity of the conversion of OR to RR is presented in Supplementary Material 7.
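To make the log-link binomial model concrete, the following is a minimal frequentist sketch, not the Bayesian WinBUGS model of the present study (which is given in the Supplementary Material). It fits log P(ill) = b0 + b1*x for a single binary exposure by Fisher scoring, halving the step whenever a proposed update would leave the parameter space (fitted risk of 1 or more) or lower the likelihood. The function name, the grouped data layout and the counts are all hypothetical.

```python
import math


def fit_log_binomial(groups, max_iter=50, tol=1e-10):
    """Fit log P(ill) = b0 + b1*x by Fisher scoring with step-halving.

    groups: list of (x, cases, n) tuples with x coded 0/1.
    Returns (b0, b1); exp(b1) is the estimated risk ratio.
    """
    def loglik(b0, b1):
        total = 0.0
        for x, cases, n in groups:
            eta = b0 + b1 * x
            if eta >= 0:  # fitted risk >= 1: outside the parameter space
                return -math.inf
            total += cases * eta + (n - cases) * math.log1p(-math.exp(eta))
        return total

    # Start from the overall attack rate, which lies inside the parameter space.
    b0 = math.log(sum(g[1] for g in groups) / sum(g[2] for g in groups))
    b1 = 0.0
    current = loglik(b0, b1)

    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, cases, n in groups:
            p = math.exp(b0 + b1 * x)
            score = (cases - n * p) / (1 - p)   # d loglik / d eta
            weight = n * p / (1 - p)            # expected information
            g0 += score
            g1 += score * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det

        # Step-halving: shrink the update until the likelihood does not decrease.
        step = 1.0
        candidate = loglik(b0 + d0, b1 + d1)
        while candidate < current and step > 1e-8:
            step /= 2
            candidate = loglik(b0 + step * d0, b1 + step * d1)
        if candidate < current:
            break  # no improving step found
        b0 += step * d0
        b1 += step * d1
        converged = candidate - current < tol
        current = candidate
        if converged:
            break
    return b0, b1


# Hypothetical grouped data: (exposed?, cases, cohort members).
groups = [(1, 60, 80), (0, 20, 70)]
b0, b1 = fit_log_binomial(groups)
print("estimated RR:", round(math.exp(b1), 3))
```

With a single binary exposure the model is saturated, so the fitted RR should match the crude value (60/80)/(20/70) = 2.625. With several correlated exposures and high attack rates the maximum often lies on the boundary of the parameter space, which is where this kind of simple scheme tends to struggle.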

RESULTS AND DISCUSSION
A binary model with log link should be used for multivariable analysis of cohort studies in outbreak situations when attack rates are >10% [8]. However, due to convergence problems, logistic regression models have been used in practical situations even with higher attack rates [2][3][4]. Formulas for converting odds ratios to risk ratios have also been suggested [8], but their validity has previously been questioned [9]. We applied Bayesian binary regression modelling with log link and obtained good convergence; the data handling was done in SPSS, and WinBUGS was used through R. The log-link model provided accurate and useful estimates of RR that estimate the true risk.
In outbreak 1, the exposure date was determined to be Tuesday based on the epidemic curve and the incubation period for norovirus infection (data not shown). Of the food exposures served on that date, pesto chicken with potatoes was associated with a higher risk in the univariable analysis, and salmon in indie sauce with a lower risk (Table 1). The Bayesian log regression model was used to demonstrate dose response (Table 2). In the univariable analysis of outbreak 2, cold fish and pasta salad had high risk ratios (Table 1). The Bayesian binary model with log link identified cold fish as the vehicle of the outbreak (Table 2). In both outbreak 1 and outbreak 2, the logistic OR estimates were higher than the RR estimates from the log-link model.
The most notable difference between the logistic and log regression models was the magnitude of the point estimates. Overall, the logistic regression model gave higher point estimates and thus overestimated the risk. In contrast, the point estimates and confidence intervals from the log-link models were much lower than those from the logistic regression and thus better estimated the true risk. The log regression theoretically estimates risk ratios in the population, and the obtained RRs were close to those obtained in the univariable analysis. Furthermore, the point estimates from the logistic regression were unrealistically large in practice. This has a sound theoretical basis [5].
We also attempted to perform the log regression within the frequentist framework in R using various algorithms designed for multivariable analysis (e.g. fitting with a stricter form of step-halving in glm2, or using an expectation-maximization algorithm [13]), but these were either difficult to use or did not always converge (data not shown). A drawback of Bayesian modelling is the need to construct the model, but the one used in the present study is very simple. The mathematical proof of the invalidity of the conversion formula suggested by Zhang et al. [8] is presented in Supplementary Material 7. Briefly, the standard formula for the risk ratio is rewritten to include odds ratio estimates. The resulting formula is the one presented for converting OR values to RR [8]. However, as the values of all the explanatory variables x_i are used in the formula, it is not generally valid but depends on the data.
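The dependence on the data can be sketched as follows. The notation here (p_0, p_1, z, \beta_z, \gamma_i) is an assumption of this sketch and does not reproduce the derivation in Supplementary Material 7. For a crude 2x2 table the conversion is an exact identity, but once the OR comes from a logistic model with further covariates, the baseline risk p_0, and with it the converted RR, changes with the covariate values x_i.

```latex
% Crude table: p_1 = risk in the exposed, p_0 = risk in the unexposed
\[
\mathrm{RR} = \frac{p_1}{p_0}, \qquad
\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
\quad\Longrightarrow\quad
\mathrm{RR} = \frac{\mathrm{OR}}{(1-p_0) + p_0\,\mathrm{OR}} .
\]

% Logistic model with exposure z and further covariates x_i:
% logit P(Y = 1 | z, x) = \beta_0 + \beta_z z + \sum_i \gamma_i x_i
\[
p_0(x) = \frac{\exp\bigl(\beta_0 + \sum_i \gamma_i x_i\bigr)}
              {1 + \exp\bigl(\beta_0 + \sum_i \gamma_i x_i\bigr)}, \qquad
\mathrm{RR}(x) = \frac{\exp(\beta_z)}
                      {\bigl(1 - p_0(x)\bigr) + p_0(x)\,\exp(\beta_z)} .
\]
```

Under these assumptions a single adjusted OR, exp(\beta_z), maps to a different RR for every covariate pattern x, so no single converted value is valid for the whole cohort.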

CONCLUSIONS
The binary model with log link provided accurate and useful estimates of RR that estimate the true risk, and thus proved to be a suitable method for multivariable analysis of cohort studies in outbreak situations. Bayesian modelling was essential to ensure convergence of the model.

OR = Odds ratio
RR = Risk ratio