#load packages
library(mvtnorm)
library(systemfit)

#generate bivariate outcome data
n <- 1000
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2, byrow = TRUE) # cov matrix
mu <- c(E = 0.5, C = 0.5) #mean E and C outcome
ec_data <- rmvnorm(n, mean = mu, sigma = Sigma)
ec_data_new <- ec_data

#rescale e to 0-1 range
ec_data_new[,1] <- (ec_data[,1]-min(ec_data[,1]))/(max(ec_data[,1])-min(ec_data[,1]))
#rescale c to 0-1000 range
ec_data_new[,2] <- (ec_data[,2]-min(ec_data[,2]))/(max(ec_data[,2])-min(ec_data[,2]))*1000

#add treatment covariate (under assumption of better outcome and lower costs for t=1 vs t=0)
t_data <- ifelse(ec_data_new[,1]>0.6 & ec_data_new[,2]<500, 1, 0)
t_data <- ifelse(ec_data_new[,1]<0.6 & ec_data_new[,2]>500, 0, t_data)
#randomly allocate treatment when the outcome pattern is ambiguous
t_data <- ifelse(ec_data_new[,1]<0.6 & ec_data_new[,2]<500, rbinom(n,1,0.5), t_data)
t_data <- ifelse(ec_data_new[,1]>0.6 & ec_data_new[,2]>500, rbinom(n,1,0.5), t_data)

#obtain data frame
data.df <- data.frame(ec_data_new[,1], ec_data_new[,2], t_data)
names(data.df) <- c("e","c","t")
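A couple of quick sanity checks (these are my own additions, not part of the original simulation) can confirm that the rescaled outcomes retain the intended positive correlation and that both treatment groups are populated:

#empirical correlation between effectiveness and costs (should be close to 0.7,
#since the rescaling above is a monotone linear transformation)
cor(data.df$e, data.df$c)
#number of individuals in each treatment group
table(data.df$t)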
Seemingly Unrelated Regression
Hello everyone and welcome back to my regular post on this blog. Today I would like to talk about some technicalities related to a quite popular statistical method used in the analysis of trial-based economic evaluations, called seemingly unrelated regression or SUR (Zellner 1962). Apologies to those who are not really into the details of statistical methodology, but I hope that sharing more information about these methods may lead to a better understanding of why we should or should not prefer them when analysing cost-effectiveness data compared to alternative approaches, e.g. a series of independent linear regressions.
In short, a SUR model consists of a system of regression equations that are linked to each other by the fact that their residuals are allowed to be correlated. There are two main advantages of using a SUR system compared to fitting a series of separate regressions:
Gain in efficiency by recognising the correlation between individual level outcomes in the parameter estimation (Greene et al. 2008).
Possibility to impose and/or test restrictions involving parameters in different equations (Fiebig 2001).
General SUR framework
Suppose \(y_{it}\) is a dependent variable, \(\boldsymbol{x}_{it}=(x_{it1},\ldots,x_{itK_i})\) is a vector of \(K_i\) explanatory variables for the \(i\)-th equation, and \(u_{it}\) is an unobservable error term, where the double index \(it\) denotes the \(t\)-th observation of the \(i\)-th equation in the system. A classical SUR model is given by the system of linear regression equations
\[ \begin{align} y_{1t}=& \beta_1 x_{1t} + u_{1t} \\ &\vdots \\ y_{Nt}=& \beta_N x_{Nt} + u_{Nt}, \end{align} \] where \(i=1,\ldots,N\), \(t=1,\ldots,T\), and equation \(i\) contains \(K_i\) regressors. Further simplification can be obtained by stacking observations either in the \(t\) or \(i\) dimension: for example, if we stack for each observation \(t\), let \(Y_t=(y_{1t},\ldots,y_{Nt})^{\prime}\), \(\boldsymbol X_{t}=\text{diag}(x_{1t},\ldots,x_{Nt})\), \(U_t=(u_{1t},\ldots,u_{Nt})^{\prime}\), and \(\boldsymbol \beta=(\beta_1^{\prime},\ldots,\beta_N^{\prime})^{\prime}\), then we have
\[ Y_t = \boldsymbol X_t^{\prime} \boldsymbol \beta + U_t. \] Alternatively, a SUR model can be represented in the form of a multivariate regression with parameter restrictions: define \(\boldsymbol X_{t}=[x_{1t}^{\prime},\ldots,x_{Nt}^{\prime}]^{\prime}\) and \(A(\boldsymbol \beta)=\text{diag}(\beta_1,\ldots,\beta_N)\), so that we obtain
\[ Y_t = A(\boldsymbol \beta)^\prime X_t + U_t, \] with the coefficients satisfying the restrictions \(\text{vec}(A(\boldsymbol \beta))=G \boldsymbol \beta\), for some full rank matrix \(G\).
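As a small illustration of this restriction (my own addition, not in the original post), take \(N=2\) equations with a single regressor each, so that \(\boldsymbol \beta=(\beta_1,\beta_2)^{\prime}\) and \(A(\boldsymbol \beta)\) is a \(2\times 2\) diagonal matrix; the matrix \(G\) then simply places the elements of \(\boldsymbol \beta\) in the right positions of \(\text{vec}(A(\boldsymbol \beta))\):
\[ \text{vec}(A(\boldsymbol \beta)) = \begin{pmatrix} \beta_1 \\ 0 \\ 0 \\ \beta_2 \end{pmatrix} = \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}}_{G} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}. \]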
In the classical SUR model we assume that, for each \(i=1,\ldots,N\), the regressor matrix \(\boldsymbol x_{i}=[x_{i1},\ldots,x_{iT}]^{\prime}\) is of full rank \(K_i\) and that, conditional on all the regressors, the errors \(U_t\) are \(\text{iid}\) over \(t\) with mean zero and homoskedastic variance \(\Sigma=\text{E}[U_t U_t^{\prime}\mid X]\), where \(\Sigma\) is positive definite with \(\sigma_{ij}=\text{E}[u_{it} u_{jt}\mid X]\) denoting its \(ij\)-th element. Under these assumptions the covariance matrix of the stacked errors is given by \(\text{E}[\text{vec}(U)\text{vec}(U)^{\prime}]=\Sigma \otimes I_T\).
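To make the efficiency argument more concrete, it may help to write down the (infeasible) GLS estimator implied by this setup; this is the standard textbook expression rather than something derived in the original post:
\[ \hat{\boldsymbol \beta}_{GLS} = \left( \sum_{t=1}^{T} \boldsymbol X_t \Sigma^{-1} \boldsymbol X_t^{\prime} \right)^{-1} \sum_{t=1}^{T} \boldsymbol X_t \Sigma^{-1} Y_t. \]
In practice \(\Sigma\) is unknown and is replaced by an estimate based on the residuals from equation-by-equation OLS, which yields the feasible GLS (SUR) estimator. Two well-known special cases are worth keeping in mind: if \(\Sigma\) is diagonal (uncorrelated errors), or if all equations contain exactly the same regressors, the SUR estimator reduces to equation-by-equation OLS and no efficiency is gained.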
SUR applied to trial-based economic evaluations
In the context of trial-based economic evaluations, particularly in a cross-sectional analysis setting based on some individual level cost and effectiveness measures \((c_{i},e_{i})\), SUR models are typically preferred to the fitting of separate linear regressions as they recognise the correlation between individual costs and health outcomes in the parameter estimation. The set of individual level covariates \((\boldsymbol x^c_{i},\boldsymbol x^e_{i})\) can also differ for each endpoint but, often, the same \(\boldsymbol x_i\) are used for each endpoint, so that the model becomes
\[ \begin{align} c_{i}=& \boldsymbol \beta_c \boldsymbol x_{i} + u_{ci} \\ e_{i}=& \boldsymbol \beta_e \boldsymbol x_{i} + u_{ei}, \end{align} \] where \(\boldsymbol x_{i}\) denotes the set of common individual level covariates, typically including a treatment indicator and other baseline variables; the coefficients of the treatment indicator represent the incremental cost and incremental effectiveness parameters. Typically, SUR assumes that the individual error terms follow a bivariate normal distribution with mean zero and variances \(\sigma^2_c\) and \(\sigma^2_e\), with the correlation between the outcomes captured through the correlation parameter \(\rho\) (Gomes et al. 2012).
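Written out explicitly (this is just the standard way of stating the assumption, not something spelled out in the original post), the error terms are taken to satisfy
\[ \begin{pmatrix} u_{ci} \\ u_{ei} \end{pmatrix} \sim \text{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \; \begin{pmatrix} \sigma^2_c & \rho \sigma_c \sigma_e \\ \rho \sigma_c \sigma_e & \sigma^2_e \end{pmatrix} \right), \]
and the off-diagonal term \(\rho \sigma_c \sigma_e\) is exactly the piece of information that a pair of separate linear regressions would ignore.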
How to fit SUR models in R?
In R, the systemfit package allows the user to specify multiple equations and fit them within a SUR framework. After doing so, one can perform tests on coefficients across the equations. As an example, let's simulate some bivariate cost and effectiveness data (the data-generation code is shown in the chunk at the top of this post).
Next, let's define the system of two equations using the R formula notation and pass them as a list to the systemfit command. A summary of the systemfit output first shows a summary of the system, then of the separate equations, and then how the residuals of the two equations are related. These are followed by the OLS fits of the separate equations.
#write formulae for the two outcome equations
eq_e <- e ~ t
eq_c <- c ~ t

#fit the system of equations (note: systemfit defaults to OLS estimation)
fitsur_ec <- systemfit(list(readreg = eq_e, mathreg = eq_c), data = data.df)

#summarise results
summary(fitsur_ec)
systemfit results
method: OLS
N DF SSR detRCov OLS-R2 McElroy-R2
system 2000 1996 25310258 281.502 0.00189 0.032915
N DF SSR MSE RMSE R2 Adj R2
readreg 1000 998 2.23759e+01 2.2421e-02 0.149736 0.022027 0.021047
mathreg 1000 998 2.53102e+07 2.5361e+04 159.251239 0.001890 0.000890
The covariance matrix of the residuals
readreg mathreg
readreg 0.0224207 16.9443
mathreg 16.9442872 25360.9572
The correlations of the residuals
readreg mathreg
readreg 1.000000 0.710584
mathreg 0.710584 1.000000
OLS estimates for 'readreg' (equation 1)
Model Formula: e ~ t
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.47041588 0.00477582 98.49946 < 2.22e-16 ***
t 0.17366078 0.03662888 4.74109 2.4346e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.149736 on 998 degrees of freedom
Number of observations: 1000 Degrees of Freedom: 998
SSR: 22.375889 MSE: 0.022421 Root MSE: 0.149736
Multiple R-Squared: 0.022027 Adjusted R-Squared: 0.021047
OLS estimates for 'mathreg' (equation 2)
Model Formula: c ~ t
Estimate Std. Error t value Pr(>|t|)
(Intercept) 511.30534 5.07933 100.66402 < 2e-16 ***
t -53.56035 38.95665 -1.37487 0.16948
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 159.251239 on 998 degrees of freedom
Number of observations: 1000 Degrees of Freedom: 998
SSR: 25310235.331251 MSE: 25360.957246 Root MSE: 159.251239
Multiple R-Squared: 0.00189 Adjusted R-Squared: 0.00089
From the output above we can extract the coefficient estimates and the covariance matrix for each equation in the model via the commands coef(fitsur_ec) and vcov(fitsur_ec). In the summary(fitsur_ec) output we also see results related to standard testing procedures, such as t-statistics and p-values for the estimated coefficients of each covariate included in each equation of the system. In this case, for both equations, we only have treatment as a covariate, whose coefficients represent the mean incremental effectiveness and mean incremental cost between the two groups. We can also compute the related \(95\%\) confidence intervals using the command confint(fitsur_ec).
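One caveat worth flagging: the summary above reports method: OLS, because OLS is the default estimation method of systemfit; to obtain the feasible GLS (SUR) estimates one needs to request method = "SUR" explicitly. Since both equations here contain the same single covariate t, the SUR point estimates coincide with the equation-by-equation OLS ones, but the joint covariance matrix is still what allows us to propagate the correlation between costs and effects. The sketch below is my own illustration (not part of the original post) and assumes systemfit's usual eqnLabel_regressor naming of the coefficients (e.g. readreg_t); the willingness-to-pay value wtp = 20000 is an arbitrary choice.

#refit the system requesting SUR (feasible GLS) estimation explicitly
fitsur_ec2 <- systemfit(list(readreg = eq_e, mathreg = eq_c),
                        method = "SUR", data = data.df)

#incremental effectiveness and incremental cost (coefficients of t in each equation)
delta_e <- coef(fitsur_ec2)["readreg_t"]
delta_c <- coef(fitsur_ec2)["mathreg_t"]

#incremental cost-effectiveness ratio
icer <- delta_c / delta_e

#incremental net monetary benefit at a given willingness-to-pay threshold,
#with a standard error that uses the cross-equation covariance of the estimates
wtp <- 20000
g <- rep(0, length(coef(fitsur_ec2)))
names(g) <- names(coef(fitsur_ec2))
g["readreg_t"] <- wtp
g["mathreg_t"] <- -1
inmb <- sum(g * coef(fitsur_ec2))
se_inmb <- sqrt(as.numeric(t(g) %*% vcov(fitsur_ec2) %*% g))

#95% confidence intervals for all coefficients
confint(fitsur_ec2)

If in doubt about the labels used for the coefficients, printing names(coef(fitsur_ec2)) shows the names actually assigned by the fitted object.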
That is all for today, and I hope you liked this quick overview of this specific topic. I might keep doing these “quick” descriptions of specific methods that are often used in economic evaluations, as I find them quite easy to follow and a nice way to provide a preliminary introduction to the topic under consideration. People can then decide to go into more depth on these topics based on their own interest/curiosity, for example by consulting the literature or asking some experts. Till next time!