Seemingly Unreleated Regression

Quarto
R
Academia
Missing Data
Published

July 9, 2025

Hello everyone and welcome back to my regular post on this blog. Today I would like to talk about some technicalities related to a quite popular statistical methods used in the analysis of trial-based economic evaluations called seemingly unrelated regression or SUR (Zellner 1962). Apologies for those who are not really into the details of statistical methodology but I hope that sharing more information about these methods may lead to a better understanding of why we should or should not prefer them when analysing cost-effectiveness data compared to alternative approaches, e.g. a series of independent linear regressions.

In short, SURs consist in a system of regression equations that are linked to each other by the fact that their residuals are allowed to be correlated. There are two main advantages of using a SUR system compared to fitting a series of separate regressions:

  1. Gain in efficiency by recognising the correlation between individual level outcomes in the parameter estimation (Greene et al. 2008).

  2. Possibility yo impose and/or test restrictions involving parameters in different equations (Fiebig 2001).

General SUR framework

Suppose \(y_{it}\) is a dependent variable, \(\boldsymbol{x}_{it}=(x_{it1},\ldots,x_{itK})\) is a \(K\) vector of explanatory variables for observational unit \(i\), and \(u_{it}\) is an unobservable error term, where the double index \(it\) denotes the \(t\)-th observation of the \(i\)-th equation in the system. A classical SUR model is given by the system of linear regression equations

\[ \begin{align} y_{it}=& \beta_1 x_{1t} + u_{1t} \\ &\vdots \\ y_{Nt}=& \beta_N x_{Nt} + u_{Nt}, \end{align} \] where \(i=1,\ldots,N\), \(t=1,\ldots,T\) and \(L=K_1,\ldots,K_N\). Further simplification can be obtained by stacking observations either in the \(t\) or \(i\) dimension: for example, if we stack for each observation \(t\), let \(Y_t=(y_{1t},\ldots,y_{Nt})^{\prime}\), \(\boldsymbol X_{t}=\text{diag}(x_{1t},\ldots,x_{Nt})\), \(U_t=(u_{1t},\ldots,u_{Nt})^{\prime}\), and \(\boldsymbol \beta=(\beta_1^{\prime},\ldots,\beta_N^{\prime})^{\prime}\), then we have

\[ Y_t = \boldsymbol X_t^{\prime} \boldsymbol \beta + U_t. \] Alternatively, another way to represent a SUR model is to write it in a form of multivariate regression with parameter restrictions: define \(\boldsymbol X_{t}=[x_{1t}^{\prime},\ldots,x_{Nt}^{\prime}]^{\prime}\) and \(A(\boldsymbol \beta)=\text{diag}(\beta_1,\ldots,\beta_N)\), so that we obtain

\[ Y_t = A(\boldsymbol \beta)^\prime X_t + U_t, \] with the coefficients satisfying the restrictions \(\text{vec}(A(\boldsymbol \beta))=G \boldsymbol \beta\), for some full rank matrix \(G\).

In the classical SUR model we assume that for each \(i=1,\ldots,N\), \(\boldsymbol x_{i}=[x_{i1},\ldots,x_{iT}]^{\prime}\) is of full rank \(K\) and that, conditional on all the regressors, the errors \(U_t\) are \(\text{iid}\) over \(t\) with mean zero and homoskedastic variance \(\Sigma=\text{E}[u_t u_t^{\prime}\mid X]\), and that \(\Sigma\) is positive definite with \(\sigma_{ij}=\text{E}[u_{it} u_{jt}\mid X]\) denoting the \(ij\)-th element of \(\Sigma\). Under this assumption the covariance matrix of the errors is given by \(\text{E}[\text{vec}(U)\text{vec}(U)^{\prime}]=\Sigma \otimes I_T\).

SUR applied to trial-based economic evaluations

In the context of trial-based economic evaluations, particularly in a cross-sectional analysis setting based on some individual level cost and effectiveness measures \((c_{i},e_{i})\), SUR models are typically preferred to the fitting of separate linear regressions as they recognise the correlation between individual costs and health outcomes in the parameter estimation. The set of individual level covariates \((\boldsymbol x^c_{i},\boldsymbol x^e_{i})\) can also differ for each endpoint but, often, the same \(\boldsymbol x_i\) are used for each endpoint, so that the model becomes

\[ \begin{align} c_{i}=& \boldsymbol \beta_c \boldsymbol x_{i} + u_{ci} \\ &\vdots \\ e_{i}=& \boldsymbol \beta_e \boldsymbol x_{i} + u_{ei}, \end{align} \] where \(\boldsymbol x_{i}\) denotes the set of common individual level covariates, typically including a treatment indicator and other baseline variables, whose regression coefficients are often meant to represent the incremental cost and effectiveness parameters estimated via OLS approach. Typically, SUR assumes that the individual error terms follow a bivariate normal distribution with mean zero and variances \(\sigma^2_c\) and \(\sigma^2_e\), and with the correlation between the outcomes being captured through the correlation parameter \(\rho\) (Gomes et al. 2012).

How to fit SUR models in R ?

In R, the systemfit package allows a user to specify multiple equations and fit them in an SUR framework. After doing so, one can perform tests on coefficients across the equations. As an example, let’s simulate some bivariate cost and effectiveness data.

#load packages
library(mvtnorm)
library(systemfit)

#generate bivariate outcome data
n <- 1000
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2, byrow = TRUE) # cov matrix
mu <- c(E = 0.5, C = 0.5) #mean E and C outcome
ec_data <- rmvnorm(n, mean = mu, sigma = Sigma)
ec_data_new <- ec_data
#rescale e to 0-1 range
ec_data_new[,1] <- (ec_data[,1]-min(ec_data[,1]))/(max(ec_data[,1])-min(ec_data[,1]))
#rescale c to 0-1000 range
ec_data_new[,2] <- (ec_data[,2]-min(ec_data[,2]))/(max(ec_data[,2])-min(ec_data[,2]))*1000

#add treatment covariate (under assumption of better outcome and less costs for t=1 vs t=0)
t_data <- ifelse(ec_data_new[,1]>0.6 & ec_data_new[,2]<500,1,0)
t_data <- ifelse(ec_data_new[,1]<0.6 & ec_data_new[,2]>500,0,t_data)
t_data <- ifelse(ec_data_new[,1]<0.6 & ec_data_new[,2]<500,rbinom(1,1,0.5),t_data)
t_data <- ifelse(ec_data_new[,1]>0.6 & ec_data_new[,2]>500,rbinom(1,1,0.5),t_data)

#obtain data frame
data.df <- data.frame(ec_data_new[,1],ec_data_new[,2],t_data)
names(data.df) <- c("e","c","t")

Next, let’s define the two system of equations using the R formula notation and pass them in a list to the systemfit command. A summary of the systemfit first shows a summary of the system, then the separate equations, and then how the residuals of the two equations are related. These are followed by the OLS fits of the separate equations.

#write formulae for the two outcome equations
eq_e <- e ~ t
eq_c <- c ~ t

#fit SUR model
fitsur_ec <- systemfit(list(readreg = eq_e, mathreg = eq_c), data=data.df)

#summarise results
summary(fitsur_ec)

systemfit results 
method: OLS 

          N   DF      SSR detRCov  OLS-R2 McElroy-R2
system 2000 1996 23334544 314.158 0.12458   0.130534

           N  DF         SSR         MSE       RMSE       R2    Adj R2
readreg 1000 998 3.20731e+01 3.21370e-02   0.179269 0.000947 -0.000054
mathreg 1000 998 2.33345e+07 2.33813e+04 152.909366 0.124580  0.123703

The covariance matrix of the residuals
           readreg    mathreg
readreg  0.0321374    20.9107
mathreg 20.9106661 23381.2744

The correlations of the residuals
         readreg  mathreg
readreg 1.000000 0.762831
mathreg 0.762831 1.000000


OLS estimates for 'readreg' (equation 1)
Model Formula: e ~ t

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.4844150  0.0114066 42.4679  < 2e-16 ***
t           -0.0127861  0.0131450 -0.9727  0.33094    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.179269 on 998 degrees of freedom
Number of observations: 1000 Degrees of Freedom: 998 
SSR: 32.07315 MSE: 0.032137 Root MSE: 0.179269 
Multiple R-Squared: 0.000947 Adjusted R-Squared: -5.4e-05 


OLS estimates for 'mathreg' (equation 2)
Model Formula: c ~ t

              Estimate Std. Error  t value   Pr(>|t|)    
(Intercept)  580.70122    9.72939  59.6853 < 2.22e-16 ***
t           -133.61961   11.21213 -11.9174 < 2.22e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 152.909366 on 998 degrees of freedom
Number of observations: 1000 Degrees of Freedom: 998 
SSR: 23334511.81372 MSE: 23381.274362 Root MSE: 152.909366 
Multiple R-Squared: 0.12458 Adjusted R-Squared: 0.123703 

From the output above we can extract via the command coef(fitsur_ec) and vcov(fitsur_ec) the coefficient estimates and the covariance matrix for each equation in the model. In the summary(fitsur_ec) output we also see results related to standard testing procedures, such as t and p-values related to the estimated coefficients for each covariate included in each equation within the system. In this case, for both equations, we only have treatment as the covariate, whose coefficients represent the mean incremental between the two groups in terms of effectiveness and cost variables. We can also compute related \(95\%\) confidence intervals using the command confint(fitsur_ec).

For today it is all and I hope you liked this quick overview of this specific topic. I might decide to do these “quick” descriptions of specific methods that are often used in economic evaluations, as I find them to be quite easy to follow and to have a sort of preliminary introduction to the topic under consideration. People may then decide to go in more depth into these topics based on their interest/curiosity, for example by consulting the literature or asking some experts. Till next time!

References

Fiebig, Denzil G. 2001. “Seemingly Unrelated Regression.” A Companion to Theoretical Econometrics, 101–21.
Gomes, Manuel, Richard Grieve, Richard Nixon, Edmond S-W Ng, James Carpenter, and Simon G Thompson. 2012. “Methods for Covariate Adjustment in Cost-Effectiveness Analysis That Use Cluster Randomised Trials.” Health Economics 21 (9): 1101–18.
Greene, William H et al. 2008. “The Econometric Approach to Efficiency Analysis.” The Measurement of Productive Efficiency and Productivity Growth 1 (1): 92–250.
Zellner, Arnold. 1962. “An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias.” Journal of the American Statistical Association 57 (298): 348–68.