Key elements of study design

Categories: Quarto, R, Academia, Medical Statistics, Exercises

Published: December 4, 2025

Welcome back, dear readers, to a new blog post. Today I would like to resume the series on medical statistics that I started a few months ago by introducing, in a very simple and brief way, the topic of study design. This does not require any advanced statistical skills or expertise, but I think it is important for a medical statistician to know at least some key elements and notions of study design. This is particularly relevant for knowing which statistical methods are best suited to the analysis of study data, given the biases and issues that may arise from the way data were collected and measured, which mainly depends on the study design at hand. So, for those who find this topic interesting or curious, let’s have a quick introduction to the classical designs and their key differences.

Study Design

Study design is one of the key components of a typical medical research process, which can be generally summarised into a series of sequential steps:

  1. Research problem identification

  2. Literature appraisal

  3. Study design

  4. Implementation: patient recruitment & data collection

  5. Data analysis and results interpretation

  6. Writing and publishing of results

It is important to highlight that the choice of the study design is a fundamental component of the research process and that it should match the specific research question formulated. In general, two main classical types of studies are distinguished in the literature: Observational studies and Randomised Controlled Trials/Intervention studies.

  1. Randomised Controlled Trials (RCTs) are typically characterised by the existence of a planned intervention, often compared to a control group which is free from the intervention. Subjects are randomly allocated to either the intervention or control group and followed over time. For instance, consider a trial to evaluate the effect of intensive insulin on retinopathy:
- The *control group* receives standard insulin therapy

- Subjects are randomised to either the standard or the intensive insulin group (*intervention group*)

- Everyone is followed over time and retinopathy is then evaluated

The RCT design is often suitable when testing new drugs or clinical management approaches, or when determining prognosis.

  2. Observational studies are characterised by the lack of a planned intervention, and are often used to study disease prevalence or trends over time, or to investigate the relationship between exposure to risk factors and risk of disease or other health outcomes. Typical examples include health surveys or studies investigating the association between risk factors and disease incidence.
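To make the random allocation step of an RCT a bit more concrete, here is a minimal Python sketch. It uses permuted blocks of size 4 with 1:1 allocation, which is my own choice for illustration (the description above only requires that allocation is random). The function name `block_randomise`, the arm labels and the seed are all hypothetical.

```python
import random


def block_randomise(n_subjects, block_size=4, seed=2025):
    """Allocate subjects 1:1 to 'control' or 'intervention' using permuted blocks.

    Illustrative sketch only: block size, labels and seed are assumptions.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)  # fixed seed so the allocation list is reproducible
    allocation = []
    while len(allocation) < n_subjects:
        # Each block contains the same number of subjects per arm, in random order
        block = ["control", "intervention"] * (block_size // 2)
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_subjects]


allocation = block_randomise(20)
print(allocation)
print("control:", allocation.count("control"),
      "intervention:", allocation.count("intervention"))
```

Because 20 is a multiple of the block size, the two arms end up with 10 subjects each; with simple (unblocked) randomisation the arms could be unbalanced by chance.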

Some important methodological aspects of study design include:

  1. Specification of a research objective (e.g. description of disease occurrence)

  2. Choice of an appropriate design based on the research objective

  3. Specification of the target population (e.g. through inclusion/exclusion criteria)

  4. Outcome specification with a distinction between:

- *Primary* - based on the principal objective of the study (e.g. disease or not)
- *Secondary* - any other outcome that may or may not be related to the primary
  5. If the aim is to make comparisons, a control group is needed, in which patients are not exposed to the risk factor or are disease free

  6. Plan the sample size to provide adequate power and precision:

- Power is the probability of detecting a real effect if there is one, with larger studies typically providing greater power. However, it is important to remember that, if tests show no statistically significant effect, this may be due to either the lack of an effect or insufficient power to detect it. Without a power-based sample size calculation, it is difficult to rule out the second possibility.

- Precision often corresponds to the magnitude of the standard error or the width of the confidence interval associated with the desired estimates from a study, where greater precision means smaller standard errors and narrower confidence intervals.
  7. Potential confounding needs to be accounted for to avoid biased results. Confounders are risk factors not directly of interest, which are often assumed to be independently associated with both the outcome and the main risk factor. For instance, co-morbidity may be a confounding variable for the effect of type of surgery (main risk factor) on post-operative morbidity (outcome). It is important that confounding variables are controlled for at the design or analysis stage:
- At the design stage: often done through matching in observational studies and through randomisation in RCTs.

- At the analysis stage: often done through regression or stratification, provided that information on all relevant confounders was collected at the design stage.
  8. It is also important to avoid other types of biases:
- *Selection bias*, i.e. bias in the selection of subjects (e.g. due to missing observations). It may be avoided or reduced through careful selection of an appropriate sample, attempts to ensure completeness of data, and a planned approach to dealing with missing values.

- *Response bias*, i.e. bias due to patients knowing about the treatment/study objective, leading to a distortion of subjective measures (e.g. pain relief). A possible solution is to blind patients to the treatment/study objective.

- *Observer bias*, i.e. bias similar to response bias but related to observers rather than patients. A possible solution is to blind observers to the treatment/study objective.

- *Recall bias*, i.e. bias due to patients recalling past events differently depending on their outcome condition (e.g. subjects with the disease are more likely to remember). Possible solutions include the avoidance of leading questions, careful design of data abstraction forms, or training of interviewers/data abstractors.
  9. A key element is also the set of approaches used to handle the collected data, covering data responsibility, double data entry, validity checks of the database, and data confidentiality.

  10. Finally, it is important to produce a statistical analysis plan which outlines the statistical analysis in advance, to avoid data dredging and to make the analysis as objective as possible.
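As a rough illustration of the sample size point in the list above, the sketch below computes the number of subjects per group needed to compare two proportions. It assumes the usual normal-approximation formula with equal group sizes and a two-sided test; the function name `n_per_group` and the 30% vs 20% event proportions are made up for the example, not taken from a real study.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect p1 vs p2 (two-sided test).

    Normal approximation, equal allocation; a sketch, not a full design tool.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    p_bar = (p1 + p2) / 2                          # pooled proportion under H0
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical event proportions: 30% in one group, 20% in the other
n_80 = n_per_group(0.30, 0.20, power=0.80)
n_90 = n_per_group(0.30, 0.20, power=0.90)
print("Per group for 80% power:", n_80)
print("Per group for 90% power:", n_90)
```

With these made-up inputs the function returns 294 subjects per group for 80% power, and asking for 90% power increases the requirement, which is the trade-off described in the power bullet above.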

Types of observational studies

Among the types of observational studies, two main groups are distinguished:

  1. Descriptive - aim to describe the frequency or patterns of a disease, help to generate hypotheses and plan health programmes. Examples include: case-report/series and cross-sectional studies.

  2. Analytic - aim to test hypotheses about the relationship between exposure to risk factors and health outcome, and to estimate measures of this association. Examples include: cohort and case-control studies.

These types of studies are often used when it is unethical or impractical to perform an intervention study.

Case report/series

A case report corresponds to the profile of a single patient reported in detail by clinicians, while a case series is its extension to a number of patients with a given condition. Key limitations of this design include: the lack of a control group, the impossibility of investigating associations, and generalisability issues.

Cross-sectional study

In a cross-sectional study, all information is collected at a single time point, with a control group also included. Exposure information is ascertained at the same time as the outcome, and there is no fixed format as long as the aim is to obtain information from samples regarding prevalence, distribution, and associations. Key limitations of this design are that it is prone to bias and can, at most, establish association.

Cohort study

A cohort study often starts by identifying a suitable group of subjects and fixing an appropriate follow-up period; the subjects are then followed over time, attempting to ensure completeness of follow-up, and event rates are finally compared between the subjects exposed to the risk factor and those not exposed. Data can be collected either retrospectively or prospectively. Key strengths of this design include its suitability for studying rare or uncommon exposures, time-varying exposures, multiple outcomes and temporal relationships, and the possibility of estimating incidence rates. Its limitations include its unsuitability for studying rare outcomes or events that take a long time to occur, and the substantial impact of attrition/loss to follow-up.
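The comparison of event rates between exposed and unexposed subjects can be sketched in a few lines of Python. The counts and person-years below are invented purely for illustration, and the helper `incidence_rate` is a hypothetical name.

```python
def incidence_rate(events, person_years):
    """Number of events per person-year of follow-up."""
    return events / person_years


# Hypothetical cohort: all numbers are invented for illustration
rate_exposed = incidence_rate(events=30, person_years=4000)
rate_unexposed = incidence_rate(events=15, person_years=5000)

rate_ratio = rate_exposed / rate_unexposed
rate_difference = rate_exposed - rate_unexposed

print(f"Exposed:   {1000 * rate_exposed:.1f} per 1000 person-years")
print(f"Unexposed: {1000 * rate_unexposed:.1f} per 1000 person-years")
print(f"Rate ratio: {rate_ratio:.2f}")
print(f"Rate difference: {1000 * rate_difference:.1f} per 1000 person-years")
```

Using person-years in the denominator is what lets a cohort study deal with subjects who are followed for different lengths of time, e.g. because of loss to follow-up.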

Case-control study

A case-control study is a retrospective study which examines how past exposures contribute to current health conditions, e.g. history of smoking and lung cancer. Recruitment involves a group of subjects with a particular health condition (cases) and a group of subjects without it (controls), with the objective of comparing their exposure to the risk factor of interest.

Within a case-control study, confounding factors may be controlled by matching or multiple regression analysis. In particular, matching may be performed at an individual level (e.g. one-to-one) or at a group level (e.g. controls selected to ensure they have similar covariate distributions to cases). The main rationale for matching is to ensure that comparison groups are similar with regard to various factors, thus improving efficiency/precision by controlling for factors that are difficult to measure or not of interest.

Matching on a factor unrelated to both disease and exposure can result in a substantial loss of power, and matching on a variable associated only with the exposure but not with the disease will also reduce the power of the study. Matching is worthwhile on variables which are strong confounders, such as risk factors for the disease of interest or factors strongly related to them. It is possible to match more than one control per case to ensure adequate sample sizes when cases are rare, although the gain in efficiency becomes small once the number of matched controls per case exceeds \(4\), i.e. adding further controls changes the analysis very little.
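To give a feel for why going beyond about \(4\) controls per case brings little benefit, the sketch below uses a commonly quoted approximation that is not part of this post, so take it as an assumption: when there is no true association, the efficiency of \(m\) controls per case relative to an unlimited number of controls is roughly \(m/(m+1)\). The function name `relative_efficiency` is hypothetical.

```python
def relative_efficiency(m):
    """Approximate efficiency of m controls per case vs unlimited controls.

    Assumes the m / (m + 1) rule of thumb (valid under no true association).
    """
    return m / (m + 1)


efficiencies = {m: relative_efficiency(m) for m in range(1, 7)}
for m, eff in efficiencies.items():
    print(f"{m} control(s) per case: relative efficiency = {eff:.3f}")
```

Under this approximation, efficiency goes from 0.50 with one control per case to 0.80 with four, but only to about 0.83 with five, so each extra control buys less and less.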

Key strengths of a case-control design include its suitability for studying rare diseases, low costs and fast implementation, while its limitations include its greater proneness to bias compared with cohort studies and the need for a careful selection of the controls. In particular, the exposure of controls to risk factors and confounders should be representative of that in the population at risk of becoming cases; exposures of controls should be measurable with similar accuracy to those of cases; controls suffering from a disease linked to the same exposure should be avoided. Typical sources of bias and issues in this design include: recall and observer bias; the impossibility of estimating incidence rates and risks; unsuitability for multiple outcomes.

Case-Control study - example

Let’s consider a hypothetical case-control study of a given disease or condition, set in a community of \(200,000\) people in which \(50\%\) are exposed to a factor F and the rest are unexposed. In \(1\) year, \(100\) cases of disease occur in the exposed group and \(50\) cases in the unexposed group. The data can be summarised in Table 1.

Table 1
              Exposed - Yes Exposed - No  Total
Disease - Yes           100           50    150
Disease - No          99900        99950 199850
Total                100000       100000 200000

If we compute the risks for each exposure group we obtain: \(\text{Risk}_{\text{exp}}=\frac{100}{10^{5}} = 0.001\) and \(\text{Risk}_{\text{unexp}}=\frac{50}{10^{5}} = 5\times 10^{-4}\), while the overall risk across exposure groups is \(\text{Risk}_{\text{tot}}=\frac{150}{2\times 10^{5}} = 7.5\times 10^{-4}\). The odds ratio (OR) and risk ratio (RR) of exposed vs unexposed are therefore \(\text{OR}=\frac{100/99900}{50/99950} \approx 2.001\) and \(\text{RR}=\frac{0.001}{5\times 10^{-4}} = 2\), respectively.
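For readers who like to check the arithmetic, the quantities for Table 1 can be reproduced with a short Python sketch. The function `two_by_two_measures` and its argument names are hypothetical helpers written for this example.

```python
def two_by_two_measures(a, b, c, d):
    """Risks, odds, RR and OR from a 2x2 table.

    a = exposed with disease,    b = unexposed with disease,
    c = exposed without disease, d = unexposed without disease.
    """
    risk_exp = a / (a + c)
    risk_unexp = b / (b + d)
    odds_exp = a / c
    odds_unexp = b / d
    return {
        "risk_exp": risk_exp,
        "risk_unexp": risk_unexp,
        "risk_ratio": risk_exp / risk_unexp,
        "odds_ratio": odds_exp / odds_unexp,
    }


# Counts taken from Table 1
table1 = two_by_two_measures(a=100, b=50, c=99900, d=99950)
for name, value in table1.items():
    print(f"{name}: {value:.6f}")
```

Since the disease is rare in this community, the odds and the risks are almost identical, which is why the OR and the RR nearly coincide.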

If we sampled all cases and all controls we would recover \(\text{OR}\approx 2\) but, in reality, this does not happen since only a sample of cases and controls is selected. Suppose we sample \(60\%\) of the cases (\(90/150\) subjects with the disease) in such a way that the prevalence of exposure is representative of the community cases: a \(2:1\) ratio gives \(60\) cases in the exposed group and \(30\) in the unexposed group. Next, we sample \(1\) control per case (\(90\) controls), again ensuring that the prevalence of exposure is representative of the community controls: a \(1:1\) ratio gives \(45\) controls in each exposure group. The new data are summarised in Table 2.

Table 2
              Exposed - Yes Exposed - No Total
Disease - Yes            60           30    90
Disease - No             45           45    90
Total                   105           75   180

Using the new data, we can now compute the same quantities: \(\text{OR}=\frac{60/45}{30/45} = 2\) and \(\text{RR}=\frac{60/105}{30/75} \approx 1.43\), respectively. We can see that the risk ratio is now incorrect, since the proportion of diseased subjects in each exposure group was artificially inflated by the nature of the case-control design. Estimates of risk from such a design are distorted because cases and controls are sampled with different (and typically unknown) sampling fractions: the risk of disease within each exposure group is distorted, but the OR is still valid. A case-control study can adequately approximate the RR only when the disease is rare in the population, because the OR is then close to the RR (as in Table 1, where both are about \(2\)).
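The fact that the OR survives case-control sampling while the RR does not can also be checked numerically. The sketch below applies different sampling fractions to cases and non-cases of the Table 1 community and computes the expected OR and the "apparent" RR. It assumes sampling is independent of exposure within cases and within controls; the function name `expected_measures` and the scenario labels are hypothetical, and expected counts are not rounded, so the second scenario matches Table 2 only approximately.

```python
# Community counts from Table 1
POPULATION = {"exp_cases": 100, "unexp_cases": 50,
              "exp_noncases": 99900, "unexp_noncases": 99950}


def expected_measures(frac_cases, frac_controls, pop=POPULATION):
    """Expected OR and apparent RR when cases and non-cases are sampled
    with different fractions, independently of exposure (an assumption)."""
    a = pop["exp_cases"] * frac_cases
    b = pop["unexp_cases"] * frac_cases
    c = pop["exp_noncases"] * frac_controls
    d = pop["unexp_noncases"] * frac_controls
    odds_ratio = (a / c) / (b / d)
    apparent_rr = (a / (a + c)) / (b / (b + d))
    return odds_ratio, apparent_rr


scenarios = {
    "whole community": (1.0, 1.0),
    "60% of cases, 1 control per case": (0.6, 90 / 199850),
    "all cases, 4 controls per case": (1.0, 600 / 199850),
}
results = {name: expected_measures(*fracs) for name, fracs in scenarios.items()}
for name, (odds_ratio, apparent_rr) in results.items():
    print(f"{name}: OR = {odds_ratio:.3f}, apparent RR = {apparent_rr:.3f}")
```

The sampling fractions cancel out in the OR, which therefore stays the same in every scenario, while the apparent RR changes with the number of controls sampled per case and only equals the community value when everybody is included.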

Reliability and Causality

The types of study design can also be ranked according to their reliability, where higher reliability is associated with a higher chance of inferring causal effects: RCTs are the “gold standard”, followed in order by cohort studies, case-control studies, cross-sectional studies, and case reports/series. It is important to stress that analytic observational studies must satisfy a strict set of criteria to allow causal inference.

Often, studies are carried out because we are interested in identifying causal associations, e.g. to use an intervention to prevent diseases. It is important to remember that an association/correlation observed in a study simply describes a situation where phenomena occur together more often than expected by chance; this does not necessarily mean that there is a causal link between the variables, i.e. that changes in one variable directly cause changes in the other. There are some general criteria that can be used to assess causation in analytic observational studies:

  • Time sequence, i.e. cause precedes disease.
  • Strength of association: cause strongly associated with disease.
  • Dose-response: greater exposure to cause leads to higher risk of disease.
  • Consistency: cause associated with disease in different populations and studies.
  • Biological plausibility.
  • Coherence: parallels can be drawn with examples from other established cause-effect relationships.
  • Specificity: a specific cause leads to a specific effect.
  • Reversibility: stopping cause reduces risk of disease.

Conclusions

To conclude:

  • Cohort/case-control studies often focus on describing the natural history or etiology of the disease.
  • Case report/series and cross-sectional studies focus on the description of disease frequency, natural history of the disease, and generate hypotheses.
  • Observational studies and RCTs may contribute complementary evidence, e.g. through the identification of large effects or infrequent outcomes.

So, what do you think of today’s post? I hope you found it interesting, although it was not really statistically dense, which some may see as a plus (I don’t but, hey, sometimes I also need to vary my way of writing). I provide below some additional references with examples of some of the different types of study designs mentioned above. See you next year!

References

Fiebig, Denzil G. 2001. “Seemingly Unrelated Regression.” A Companion to Theoretical Econometrics, 101–21.
Gomes, Manuel, Richard Grieve, Richard Nixon, Edmond S-W Ng, James Carpenter, and Simon G Thompson. 2012. “Methods for Covariate Adjustment in Cost-Effectiveness Analysis That Use Cluster Randomised Trials.” Health Economics 21 (9): 1101–18.
Greene, William H et al. 2008. “The Econometric Approach to Efficiency Analysis.” The Measurement of Productive Efficiency and Productivity Growth 1 (1): 92–250.
Zellner, Arnold. 1962. “An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias.” Journal of the American Statistical Association 57 (298): 348–68.