**David J Pasta and Dave P Miller at ICON Clinical Research take a look at the statistical issues that regularly occur in observational studies**

Several statistical issues arise commonly in observational studies that differ from those arising in randomised controlled trials (RCTs). RCTs have many issues of their own, of course, but the act of randomisation itself provides statistical benefits not available in observational studies. In a study where group membership is assigned randomly, one can assume that the groups are roughly equivalent at the point of randomisation on every observed characteristic as well as on unobserved characteristics. After randomisation, differential dropout may cause the groups who complete the entire study to differ over time, but the randomised intent-to-treat cohort has special unalterable properties. An unlucky assignment may result in unbalanced groups, but the laws of statistics put limits on how unbalanced the groups can be on average in repeated studies.

Many of the statistical issues in randomised studies stem from one of two sources: unlucky assignment, and changes in the composition of the groups over time. There are also issues in RCTs related to noncompliance, missed scheduled visits or other assessments, early termination, administration of the incorrect treatment and other deviations from the planned conduct of the trial. All of these statistical headaches threaten the validity of a simple group comparison that relies on randomisation to ensure all the groups are the same. Nevertheless, the randomisation does provide a basis from which to analyse and interpret; at the time of randomisation, the groups can be assumed to be approximately equivalent.

**GROUP DIFFERENCES IN OBSERVATIONAL STUDIES**

In observational studies, there is usually no randomisation (there are exceptions, but these are rare enough to be ignored for the purposes of this article). In the absence of randomisation, there is no reasonable basis for assuming that groups are equivalent – indeed, they generally are not. In the context of clinical research, where interest often focuses on treatment alternatives, patients receiving different treatments generally differ systematically from each other. One common source of such differences is ‘indication bias’, which loosely translates to ‘sick patients get treated’ (and those who are worse get treated more aggressively). Looking at the number of patients with fever who are taking non-steroidal anti-inflammatory drugs (NSAIDs), one might be tempted to conclude that the NSAIDs cause fever (or at least, that they do not help). Yet, there is ample evidence from randomised trials that NSAIDs help reduce fever, which is why we find so many people with fevers taking NSAIDs. Another example of this sort of bias, although unrelated to drug treatment, is the observation that the death rate from tuberculosis is very high in Switzerland. Is this because Switzerland is particularly bad for patients with tuberculosis? No, it’s because the cool dry mountain air of Switzerland is especially good for these patients and so they tend to go there in the later stages of the disease.

There are a number of statistical techniques that can be used to address group differences in observational studies in an effort to gain useful insights (1,2). The most common approaches are matching and modelling.

**MEASURING DIFFERENCES: MATCHING**

In this method, an attempt is made to match members of one group with members of another based on observed characteristics. Perhaps the most common form of matching is in case-control studies, in which patients with the disease (cases) are matched to control patients who do not have the disease. Case-control studies are often retrospective: cases are identified and then matched to controls and their history studied, often through the review of charts, electronic medical records or insurance claims. Special statistical techniques are needed with retrospective case-control studies (compared to prospective cohort studies), but they are a very useful tool.

Matching can also be used to compare treatment groups, rather than just those with and without the disease. These studies generally try to match patient groups at a point in time and then follow them over time as cohorts. The studies might be performed entirely prospectively, but more frequently they are used when data have already been collected. For example, to compare the effect of five different statin drugs on preventing myocardial infarction (MI), one might examine medical claims from previous years to assign patients to groups based on treatment and then evaluate the rate of MI in subsequent years.

Matching is an appealing approach, but is not as desirable as randomisation since it cannot control for unobserved differences. In addition, it is difficult to find good matches for ‘unusual’ patients; thus either match quality is compromised or patients are omitted as unmatchable. Developing an appropriate matching algorithm is challenging, as exact matching on all measured characteristics is unrealistic. Which characteristics are most important to match exactly, or at least closely – and how close is close enough? A statistical technique that was developed to address this question is ‘propensity score matching’ (3).

The mathematics behind propensity scores is elegant and somewhat complex, but the idea is simple enough – the important variables to match up are those that are the most different between the groups. The idea is to use the observed characteristics to try to accurately predict group membership. In the simplest case of two groups, such as a comparison of treatment A and treatment B, the idea is to try to predict (using a logistic regression model) which patients are most likely to get treatment A. The predicted probability of receiving treatment A is calculated from the logistic regression model and is called the propensity score, as it represents the propensity of that patient to receive treatment A. Either treatment can be modelled, as the propensity score for treatment B is just one minus the propensity score for treatment A. It turns out that matching on the propensity score effectively matches the patients on all the characteristics that were included in the model. The multivariate matching problem has been reduced to matching on a single variable.
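As a minimal sketch of the idea above, the following Python code fits a logistic regression by plain gradient ascent, computes each patient's propensity score, and then pairs patients greedily on that single score. The covariates, the treatment indicators, the caliper of 0.1 and all function names are hypothetical, chosen only for illustration; a real analysis would normally rely on an established statistics package and a more carefully chosen matching algorithm.

```python
import math

def propensity(beta, x):
    """Predicted probability of receiving treatment A given covariates x."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit a logistic regression by gradient ascent on the log-likelihood.
    Returns [intercept, coefficient_1, ...]."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            resid = yi - propensity(beta, xi)
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def greedy_match(scores, got_a, caliper=0.1):
    """1:1 nearest-neighbour matching on the score, without replacement.
    Treatment A patients with no B patient within the caliper stay unmatched."""
    a_ids = [i for i, t in enumerate(got_a) if t == 1]
    b_ids = [i for i, t in enumerate(got_a) if t == 0]
    pairs = []
    for i in a_ids:
        if not b_ids:
            break
        j = min(b_ids, key=lambda k: abs(scores[k] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            b_ids.remove(j)
    return pairs

# Hypothetical data: [standardised age, severe disease (0/1)] per patient.
X = [[-1.2, 0], [-0.8, 0], [-0.5, 1], [-0.1, 0], [0.0, 1],
     [0.3, 0], [0.6, 1], [0.9, 1], [1.1, 0], [1.4, 1]]
got_a = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]  # 1 = treatment A, 0 = treatment B

beta = fit_logistic(X, got_a)
scores = [propensity(beta, x) for x in X]
pairs = greedy_match(scores, got_a)
for i, j in pairs:
    print(f"A patient {i} (score {scores[i]:.2f}) matched to "
          f"B patient {j} (score {scores[j]:.2f})")
```

Greedy matching depends on the order in which patients are processed, which is one reason the resulting matches should be checked for quality rather than accepted blindly.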

The calculation of propensity scores offers the additional benefit of providing insight into the group characteristics. The more accurately the propensity score model predicts which patients get what treatments, the more different the two groups are. If the model does little better than chance, then there are few systematic differences between the groups (on the measured characteristics). In some extreme cases, the prediction of group membership is so close to perfect that patients in the two treatment groups have very little overlap in propensity scores. This makes it possible to match only a small fraction of the original cases and accurately reflects the difficulty of reaching conclusions comparing the two treatments (as they are being administered to markedly different patient populations). There are additional technical considerations when using propensity score matching, including the need to evaluate the resulting matches for quality rather than accepting them blindly, but they are beyond the scope of this article.
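Two simple diagnostics can make this overlap concrete. The sketch below, in Python, computes a concordance (c) statistic, which is 0.5 when the model does no better than chance and 1.0 when it separates the groups perfectly, and the range of scores shared by both groups. The scores are invented for illustration and the function names are hypothetical.

```python
def c_statistic(scores_a, scores_b):
    """Probability that a randomly chosen treatment A patient has a higher
    propensity score than a randomly chosen treatment B patient (ties count half)."""
    wins = 0.0
    for sa in scores_a:
        for sb in scores_b:
            if sa > sb:
                wins += 1.0
            elif sa == sb:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))

def common_support(scores_a, scores_b):
    """Range of scores observed in both groups; empty if lower bound > upper bound."""
    lower = max(min(scores_a), min(scores_b))
    upper = min(max(scores_a), max(scores_b))
    return lower, upper

def fraction_in_support(scores, lower, upper):
    """Share of patients whose score lies inside the common range."""
    return sum(lower <= s <= upper for s in scores) / len(scores)

# Hypothetical propensity scores for the two treatment groups.
scores_a = [0.35, 0.50, 0.62, 0.80, 0.91]
scores_b = [0.08, 0.20, 0.33, 0.50, 0.66]

lower, upper = common_support(scores_a, scores_b)
print("c statistic:", c_statistic(scores_a, scores_b))
print("common support:", (lower, upper))
print("A in support:", fraction_in_support(scores_a, lower, upper))
print("B in support:", fraction_in_support(scores_b, lower, upper))
```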

**MODELLING**

An alternative to matching is to adjust statistically for group differences by including measured characteristics as covariates in a statistical model. Most models are linear in the covariates, so the adjustment takes account of the linear effect of the covariates; this is usually good enough, as the linear component generally encompasses most of the material effect of a variable. The traditional approach is to include any variables in the model that are considered material. Those covariates can be important to statistically adjust for differences between groups.

The principle of propensity scores can be used in statistical models in much the same way as it is used for matching (3). Propensity scores are calculated just as they would be in preparation for matching, and the diagnostic information about how much the groups overlap is similarly useful. Instead of matching on the propensity score, the scores are usually divided into five or so categories and the resulting categorical variable is added to the model. Adding the propensity score to the model sometimes substantially changes the magnitude, or even the sign, of the coefficient representing the group difference, although more typically the value changes only a little but the estimated variability goes up considerably. This accurately reflects the fact that the information about group differences arises mostly from the overlapping part of the patient populations, and that the less the groups overlap the smaller the effective number of patients. One way to think about this approach is to think of the categories of propensity scores as types of patients. With five categories of the propensity score variable, patients are being divided into five groups and comparisons are carried out within groups. The two extreme categories generally have few (or even no) patients from the ‘other’ group, so most comparisons are based on the middle three groups (and even then there may be substantial imbalance between the groups – which is statistically inefficient).
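The ‘types of patients’ view can be sketched directly. The Python code below is a simplified illustration, not the regression model itself: it assigns patients to five rank-based propensity score categories, compares mean outcomes within each category, and combines the within-category differences weighted by category size. Categories that lack one of the two groups contribute nothing, which mirrors the loss of information described above. The data and function names are hypothetical.

```python
from statistics import mean

def quintile_strata(scores):
    """Assign each patient to one of five roughly equal-sized categories (0-4)
    according to the rank of the propensity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = min(4, rank * 5 // len(scores))
    return strata

def stratified_difference(scores, treated, outcome):
    """Size-weighted average of within-category differences in mean outcome.
    Returns the estimate and the list of categories that contained both groups."""
    strata = quintile_strata(scores)
    weighted_sum, total_weight, used = 0.0, 0, []
    for s in range(5):
        a = [y for y, st, t in zip(outcome, strata, treated) if st == s and t == 1]
        b = [y for y, st, t in zip(outcome, strata, treated) if st == s and t == 0]
        if not a or not b:
            continue  # no within-category comparison is possible
        weight = len(a) + len(b)
        weighted_sum += weight * (mean(a) - mean(b))
        total_weight += weight
        used.append(s)
    if total_weight == 0:
        raise ValueError("no category contains patients from both groups")
    return weighted_sum / total_weight, used

# Hypothetical data for 20 patients, ordered by propensity score.
scores = [0.05 * k + 0.02 for k in range(20)]
treated = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
outcome = [3.1, 2.8, 3.4, 3.0, 4.2, 6.1, 3.9, 4.4, 7.0, 5.1,
           7.3, 4.8, 8.2, 7.9, 6.0, 8.5, 9.1, 9.4, 8.8, 9.9]

estimate, used = stratified_difference(scores, treated, outcome)
print("adjusted difference:", round(estimate, 2), "from categories", used)
```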

These two approaches – adjusting using covariates directly and adjusting using the propensity score – tend to give similar results, but there are differences. Table 1 summarises the advantages and disadvantages of propensity scores.

**CHOOSING TIME ZERO**

In clinical trials, ‘time zero’ is usually the randomisation date (or some nearby date) at which the groups can be assumed to be similar. Having a clearly defined ‘time zero’ is critical for time-to-event analyses and for any analyses that attempt to adjust for baseline characteristics. In some observational studies, this might be the date of diagnosis, which also has some external validity, but in many observational studies the appropriate ‘time zero’ may not be obvious. In particular, using the date of enrolment as ‘time zero’ may not be appropriate, as this date has no special significance in the course of the disease.

Consider patients in an observational study of cancer or another disease that puts patients at a high death risk. Usually the patients are a mix of newly-diagnosed (incident) and previously diagnosed (prevalent) cases. When performing statistical analyses, it is important to recognise that the patients who enrol, say, three years after diagnosis, are representative at best of the subset of patients who were diagnosed three years earlier and survived. All the patients who died within three years of diagnosis are of course not available to enrol. This does not even consider that patients who are more ill may be less likely to enrol in an observational study. If all eligible patients are consecutively enrolled, statistical methods are available to handle this left-truncation but it adds considerable complexity to the analysis (4,5).
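One way such left-truncation is commonly handled is a product-limit (Kaplan-Meier-type) estimate in which a patient counts as ‘at risk’ only between enrolment and death or censoring. The Python sketch below illustrates that risk-set adjustment on invented data; times are years since diagnosis, the function name is hypothetical, and the estimate is conditional on surviving to the earliest entry time.

```python
def product_limit_left_truncated(entry, exit_time, died):
    """Product-limit survival estimate with delayed entry.
    entry[i]     : time since diagnosis at which patient i enrolled
    exit_time[i] : time since diagnosis of death or censoring
    died[i]      : 1 if the patient died at exit_time[i], 0 if censored
    Returns a list of (event time, estimated survival) pairs."""
    if any(a >= b for a, b in zip(entry, exit_time)):
        raise ValueError("each patient must enrol before their exit time")
    event_times = sorted({t for t, d in zip(exit_time, died) if d})
    survival, s = [], 1.0
    for t in event_times:
        # At risk at t: already enrolled and not yet dead or censored.
        at_risk = sum(1 for a, b in zip(entry, exit_time) if a < t <= b)
        deaths = sum(1 for b, d in zip(exit_time, died) if b == t and d)
        s *= 1.0 - deaths / at_risk
        survival.append((t, s))
    return survival

# Hypothetical cohort: three incident cases (entry at 0) and two prevalent
# cases who enrolled three years after diagnosis.
entry = [0, 0, 0, 3, 3]
exit_time = [2, 5, 6, 4, 7]
died = [1, 1, 0, 1, 1]

adjusted = product_limit_left_truncated(entry, exit_time, died)
naive = product_limit_left_truncated([0] * 5, exit_time, died)  # ignores delayed entry
print("adjusted:", adjusted)
print("naive:   ", naive)
```

In this toy data the naive estimate treats the two prevalent patients as if they had been observed from diagnosis, so it overstates early survival relative to the adjusted estimate.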

If the date of diagnosis is unknown, or it is not of special relevance, some other ‘time zero’ or ‘index date’ might be selected. It might be: a calendar date of particular importance, such as the date of approval of a new therapy; the date a patient achieved a certain age; the date of a specific event such as a biopsy or surgical treatment; the start of a new drug therapy. It is sometimes difficult to find an appropriate corresponding ‘index date’ for a control group. One approach would be to use, for example, the first clinic visit in a given calendar year, but because of possible seasonal effects, it might be better to use the first clinic visit on or after the patient’s birthday in that year. This assures a reasonable distribution of index dates throughout the seasons.
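The birthday rule is straightforward to express in code. The Python sketch below picks, for one patient, the first clinic visit on or after the birthday in a given year. Two details are assumptions made here rather than anything specified above: a 29 February birthday is moved to 1 March in non-leap years, and the search window closes at the patient's next birthday. The function names are hypothetical.

```python
from datetime import date

def birthday_in_year(birth_date, year):
    """The patient's birthday in the given year (29 Feb becomes 1 Mar
    in non-leap years; this convention is an assumption)."""
    try:
        return birth_date.replace(year=year)
    except ValueError:
        return date(year, 3, 1)

def index_date(birth_date, visits, year):
    """First clinic visit on or after the birthday in `year` and before the
    next birthday (assumed window). Returns None if there is no such visit."""
    start = birthday_in_year(birth_date, year)
    end = birthday_in_year(birth_date, year + 1)
    candidates = [v for v in visits if start <= v < end]
    return min(candidates) if candidates else None

# Hypothetical patient born 15 June 1950 with three visits in 2005.
visits = [date(2005, 2, 1), date(2005, 7, 10), date(2005, 9, 1)]
print(index_date(date(1950, 6, 15), visits, 2005))
```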

**MISSING DATA IN OBSERVATIONAL STUDIES**

Missing data is an issue in any study, but it is especially prominent in observational studies (6). In clinical trials there are often mandated visits at various points during the study. Some of those visits may be missed, and that is important for the statistical analysis, but large numbers of patients generally have data for most mandated visits. In observational studies, there may be only one visit per year or maybe none at all. It is often unrealistic to assume no change in status between visits, and indeed the absence of visits is often an indication that the patient is much better or much worse than at the last visit.

One of the most common approaches to handling missing data is simply to use the available data to estimate values (for example, using mixed models to estimate patient-specific trend lines and then combining the information across patients). Another approach is to assume the best (or worst) of the omitted values, either deliberately biasing the data against the desired alternative, or evaluating both extremes to establish a range of possible results.
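For a yes/no outcome, evaluating both extremes amounts to simple arithmetic. The Python sketch below, using invented data and hypothetical function names, bounds a response rate by treating every missing value first as a non-response and then as a response, and bounds a difference between two groups by pairing the least and most favourable scenarios.

```python
def response_rate_bounds(outcomes):
    """Lowest and highest possible response rate when missing values
    (None) are all counted as 0 or all counted as 1."""
    n = len(outcomes)
    observed = [o for o in outcomes if o is not None]
    missing = n - len(observed)
    lowest = sum(observed) / n
    highest = (sum(observed) + missing) / n
    return lowest, highest

def difference_bounds(group_a, group_b):
    """Range of possible differences in response rate (A minus B)."""
    low_a, high_a = response_rate_bounds(group_a)
    low_b, high_b = response_rate_bounds(group_b)
    return low_a - high_b, high_a - low_b

# Hypothetical outcomes: 1 = response, 0 = no response, None = missing.
group_a = [1, 1, None, 0]
group_b = [0, None, 1, 0]

print("A:", response_rate_bounds(group_a))
print("B:", response_rate_bounds(group_b))
print("A minus B:", difference_bounds(group_a, group_b))
```

The lower end of the difference range is the scenario deliberately biased against group A; if a conclusion survives even there, the missing data cannot overturn it.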

Another approach to handling missing values is to use multiple imputation (7). With multiple imputation, an analysis is performed after imputing missing values. The missing values are then imputed again and the analysis repeated. This is done multiple times and the results evaluated. The multiple imputations have the effect of ‘propagating the uncertainty’ associated with the missing values (if you impute the missing values just once, confidence intervals and measures of uncertainty tend to be overconfident because you are acting as though the formerly missing data were actually known). The multiple imputation approach provides answers that capture the uncertainty introduced by the missing data.
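The combining step is commonly done with what are known as Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and its variance is the average within-imputation variance plus an inflated between-imputation variance. The Python sketch below applies those rules to invented numbers; the estimates, variances and function name are hypothetical, and the imputation and analysis steps themselves are not shown.

```python
from statistics import mean, variance

def pool_imputations(estimates, variances):
    """Combine results from m imputed datasets.
    estimates[k] : the estimate from imputed dataset k
    variances[k] : its squared standard error
    Returns (pooled estimate, total variance)."""
    m = len(estimates)
    if m < 2:
        raise ValueError("at least two imputations are needed")
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # spread across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical treatment-effect estimates from five imputed datasets.
estimates = [1.8, 2.1, 2.4, 1.9, 2.3]
variances = [0.20, 0.22, 0.19, 0.21, 0.20]

pooled, total = pool_imputations(estimates, variances)
print("pooled estimate:", round(pooled, 3))
print("total variance: ", round(total, 3), "vs single-imputation", variances[0])
```

The total variance exceeds the within-imputation variance whenever the imputed datasets disagree, which is how the uncertainty due to the missing values is carried into the final answer.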

**CONCLUSION**

Good statistical methods exist for most of the common problems that arise in observational studies. However, most of these methods require stronger assumptions than those needed in randomised trials. Ideally, these statistical issues are considered when developing the protocol and designing the case report forms in order to minimise the impact of the assumptions. Given all of these challenges, some may ask whether it would be better for all research studies to be randomised. One difficulty with that approach is that randomised trials do not do a very good job of mimicking actual care. In clinical practice, treatments are not assigned at random, and follow-up assessments are not nearly as rigorous as they are in clinical trials. Patient care, from the initial assignment of treatment throughout the course of the disease, is very unlike a clinical trial, and observational research is necessary to provide balance to the clinical trial results. Additionally, not every comparison of interest is amenable to randomisation (8). Gender differences may be very helpful to explore in order to understand a disease, but we certainly cannot randomise patients to be male or female. Similarly, it is important to understand time-to-treatment when allocating medical resources, but we cannot ethically withhold or delay treatment for some patients. As observational studies become more and more common, the statistical challenges need to be more broadly appreciated.

**References**

1. Cochran WG and Rubin DB, Controlling Bias in Observational Studies: A Review, Sankhya, Ser A 35: pp417-446, 1973
2. Cochran WG, Planning and Analysis of Observational Studies, New York: John Wiley, 1983
3. Rosenbaum PR and Rubin DB, The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika 70: pp41-55, 1983
4. Tsai W, Jewell N and Wang M, A Note on the Product-Limit Estimator under Right Censoring and Left Truncation, Biometrika 74: pp883-886, 1987
5. Keiding N, Independent Delayed Entry, in Klein JP and Goel PK (eds), Survival Analysis: State of the Art, pp309-326, 1992
6. Little RJA and Rubin DB, Statistical Analysis with Missing Data, 2nd edition, New York: John Wiley, 2002
7. Rubin DB and Schenker N, Multiple Imputation in Health-Care Data Bases: An Overview and Some Applications, Statistics in Medicine 10: pp585-598, 1991
8. Smith GCS and Pell JP, Parachute Use to Prevent Death and Major Trauma Related to Gravitational Challenge: Systematic Review of Randomised Controlled Trials, BMJ 327: pp1459-1461, 2003