1 School of Population Health, The University of Queensland, Brisbane, Queensland, Australia.
2 Population Studies and Human Genetics Division, Queensland Institute of Medical Research, Brisbane, Queensland, Australia.
Received for publication July 9, 2004; accepted for publication August 31, 2004.
ABSTRACT
Key words: epidemiologic factors; longitudinal studies
INTRODUCTION
In the absence of standard reporting guidelines, authors may refer to theoretical papers and texts describing observational longitudinal research designs (22–24). Although some of these sources provide comprehensive coverage of aspects of observational longitudinal research on which internal and external validity of results depend, others are brief. A few authors have developed checklists with which to assess the quality of reporting of articles, including observational longitudinal research (10, 25–28). These checklists differ in their coverage of elements relevant to the design of observational longitudinal research. The majority are brief or nonspecific and focus on quality judgments. The most comprehensive of these is the Transparent Reporting of Evaluations with Nonrandomized Designs (TREND) statement (10), which is very detailed and places a particular emphasis on interventions. It provides a detailed assessment of the quality of these designs and has suggestions for better reporting. However, none of the checklists offers a simple or straightforward set of guidelines for how observational longitudinal studies should be reported. Adequate reporting is the only means by which proper interpretation can occur (20). The success of CONSORT illustrates the benefits to be gained from improved communication between authors, editors, and readers about research design fundamentals.
The aim of this study was to identify desirable elements in the reporting of observational longitudinal research, construct a CONSORT-style checklist and flow diagram, and test the checklist against published observational longitudinal research. Like other authors (3, 27), we focused on the adequacy of reporting (i.e., whether or not an aspect was reported) and did not attempt to assess quality per se. A secondary aim was to explore the likely value to editors and authors of a checklist and flow diagram, covering desirable reporting elements, that would help readers evaluate observational longitudinal research.
MATERIALS AND METHODS
A draft outline of essential elements related to threats to the internal validity of observational longitudinal research was created. A working group of nine epidemiologists, biostatisticians, and social scientists, with a wide range of qualifications, experience, and clinical interests, contributed and revised checklist criteria. For each essential element identified (e.g., selection bias), the most important criteria (descriptors) to describe an observational longitudinal study were identified (e.g., sampling frame, consent rates, loss to follow-up, item nonresponse). Through this iterative revision process, other criteria fundamental to describing observational longitudinal research adequately (e.g., setting) and to considering generalizability were added to the checklist. Criteria were to be scored as reported (yes), not reported (no), or not applicable to report. To score "yes," a criterion had to be reported in enough detail to allow the reader to judge that the definition had been met. If inadequate information about a criterion was reported, it was scored "no." If authors referred readers to another publication for specific details about the study methods (e.g., sampling or eligibility), the criterion was scored "no."
The draft checklist was piloted by the first two authors (L. T. and R. W.), who independently rated 10 articles describing observational longitudinal research (defined as studies in which any designated group of persons was followed or traced over a period of time) (37). Following the pilot study, the criteria were reviewed and modified by the working group. Once the final checklist was agreed upon, it was tested on a random selection of articles describing observational longitudinal research. The clinical area of stroke was chosen as an example because it is the current field of interest of the first author. None of the other authors or members of the working group had substantive experience in stroke research. Six journals publishing epidemiology, clinical, and rehabilitation stroke research, with a range of impact factors (from 0.9 to 8.6), were chosen: American Journal of Epidemiology, Journal of Epidemiology and Community Health, Stroke, Annals of Neurology, Archives of Physical Medicine and Rehabilitation, and American Journal of Physical Medicine and Rehabilitation.
Ten articles reporting observational longitudinal research were randomly sampled from each journal. The sampling frame was every volume of the six journals published between June 1999 and June 2003 inclusive. Ten randomly generated volume/issue "pairs" (e.g., issue 3, 2002) were produced for each journal. Potentially eligible articles were identified from words such as "longitudinal," "follow-up," "outcomes," "prospective," or "observational" appearing in the title or abstract. Content eligibility was assessed by the presence of any of the words "stroke," or "cerebrovascular accident," or "CVA," or "acquired brain injury," or "infarct" coupled with a structure or hemisphere of the brain; or words illustrative of stroke symptoms, for example, "hemiplegia," "hemiparesis," or "neglect." Exclusion criteria were words indicating that the study was randomized; an intervention; a case series; a case-control, cross-sectional, or retrospective study; or a systematic review. Studies of animals were also excluded. When more than one eligible article was identified in a particular volume/issue pair for a selected journal, all were numbered and one selected randomly. When a volume/issue pair had no eligible articles, a new volume/issue pair was randomly generated for the same journal. The American Journal of Epidemiology and the Journal of Epidemiology and Community Health had only three and six eligible articles, respectively, within the sampling frame, so all were included. None of the authors or the members of the working group was an author of any of the sampled publications.
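To make the sampling scheme concrete, the Python sketch below (ours, not the authors' code) shows one way the random volume/issue pairs and the within-issue article selection could be generated; the 12-issues-per-year frame, the fixed seed, and the example article list are assumptions for illustration only.

```python
# Illustrative sketch of the volume/issue sampling scheme described above;
# not the authors' code. Frame boundaries are simplified to whole years.
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Assumed sampling frame: (year, issue) pairs, 12 issues per year,
# approximating the June 1999 - June 2003 window.
frame = [(year, issue) for year in range(1999, 2004) for issue in range(1, 13)]

def draw_volume_issue_pairs(n_pairs=10):
    """Randomly generate n volume/issue 'pairs', e.g., (2002, 3)."""
    return random.sample(frame, n_pairs)

def select_one_article(eligible_articles):
    """When a pair yields several eligible articles, number them and
    select one at random, as in the procedure described above."""
    return random.choice(eligible_articles)

print(draw_volume_issue_pairs())
print(select_one_article(["article A", "article B", "article C"]))
```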
Of the 49 articles selected, six were published from June to December 1999, 11 during 2000, 10 during 2001, and 11 each during 2002 and from January to June 2003. The article list is available at the following website: http://www.sph.uq.edu.au/hisdu/bias_refs.html. Each article was independently rated with the checklist by the first two authors, who then compared ratings and resolved disagreements by consensus. When disagreements could not be resolved, a third independent rater made the final judgment. Besides the rating of each article with the checklist, it was noted whether the study was primarily etiologic (n = 20), prognostic (n = 25), or both (n = 4). The text word count of each article was also estimated. The working group also drafted a summary flow diagram to represent the essential elements of participant recruitment and follow-up in observational longitudinal studies.
Statistical analysis
Descriptive statistics were computed for each checklist criterion by type of study (etiologic or prognostic), journal, and word count. Agreement between the two raters on the 33 criteria was summarized by percentage agreement, presented here by the median and quartiles. For each article, the number of criteria reported was divided by the number of relevant criteria to give a score reflecting the proportion of relevant or applicable criteria reported. For example, if 12 criteria were reported when 33 were applicable, the proportion was 0.36; if 12 criteria were reported when 31 were applicable, the proportion was 0.39. The comparison between type of study (etiologic or prognostic) and proportion of criteria reported was analyzed by using an independent-samples t test. The association between estimated word count and the proportion of criteria reported was analyzed by using Spearman's correlation coefficient. Analyses were performed with SPSS software (version 11.5; SPSS Inc., Chicago, Illinois).
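As a minimal sketch of these computations (using scipy in place of SPSS, and with invented data values and a hypothetical group split), the proportion score, the independent-samples t test, and the Spearman correlation could be computed as follows:

```python
# Sketch of the analyses described above; data values are invented.
import numpy as np
from scipy import stats

# Proportion of applicable criteria reported per article, as in the text:
# 12 of 33 applicable -> 0.36; 12 of 31 applicable -> 0.39.
reported   = np.array([12, 12, 20, 25])
applicable = np.array([33, 31, 33, 33])
proportions = reported / applicable

# Independent-samples t test comparing etiologic and prognostic studies
# (the group split here is hypothetical).
t_stat, p_t = stats.ttest_ind(proportions[:2], proportions[2:])

# Spearman correlation between estimated word count and proportion reported.
word_counts = np.array([3200, 4100, 2800, 3600])
rho, p_rho = stats.spearmanr(word_counts, proportions)

print(f"proportions: {np.round(proportions, 2)}")
print(f"t = {t_stat:.2f} (p = {p_t:.2f}); rho = {rho:.2f} (p = {p_rho:.2f})")
```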
RESULTS
Across the 49 articles, the mean proportion of applicable criteria reported was 0.51 (standard deviation (SD), 0.15; range, 0.12–0.82). The association between type of study (etiologic or prognostic) and proportion of criteria reported was not statistically significant (t(43 df) = 0.31, p = 0.76, two sided; studies with both an etiologic and prognostic focus, n = 4, were not included). When analyzed by journal, the mean proportions of applicable criteria reported were 0.66 (SD, 0.03) for the American Journal of Epidemiology (impact factor = 4.2), 0.57 (SD, 0.11) for the Journal of Epidemiology and Community Health (impact factor = 2.1), 0.54 (SD, 0.13) for Archives of Physical Medicine and Rehabilitation (impact factor = 1.3), 0.49 (SD, 0.13) for Stroke (impact factor = 5.1), 0.46 (SD, 0.13) for the American Journal of Physical Medicine and Rehabilitation (impact factor = 0.9), and 0.46 (SD, 0.19) for Annals of Neurology (impact factor = 8.6). We found no relation between word count and proportion of checklist criteria reported (Spearman's correlation coefficient = 0.12, p = 0.41, two sided).
Table 2 shows the total number of articles that reported each of the 33 criteria, overall and by type of study. The table also shows the total number (and percentage) of articles for which it was applicable to report each criterion. Eleven articles had one or more criteria that were not applicable to report. Table 2 shows that "reasons for loss to follow-up at each stage," "accounting for loss to follow-up in the analysis," and "accounting for missing data in the analysis" were the criteria to which "not applicable" most often applied.
The best reporting was for criteria describing the study rationale and population as well as how data were collected and analyzed (each criterion reported in 45 or more articles). Qualitative and quantitative assessments of bias (30–35 articles) and confounders (38 articles) were also generally well reported. The most poorly reported criteria (reported in fewer than 10 articles each) were justification for the numbers in the study (e.g., in terms of power to detect effects), reasons for not meeting eligibility criteria, numbers consenting/not consenting, reasons for nonconsent, comparison of consenters with nonconsenters, and accounting for missing data items or loss to follow-up in analyses. Also notable was the general lack of reporting of measures of absolute effect, even though such measures are regularly described in epidemiology textbooks as a particular strength of observational longitudinal studies.
Development of the flow diagram
As a result of developing the checklist and rating the articles, we produced a flow diagram, modeled on CONSORT (5), to help clarify the numerical history of an observational longitudinal study (figure 1). It records the numbers of participants, and the reasons for their inclusion or loss, at each stage: eligibility, consent, participation in each wave, and attrition. These main elements were chosen because they provide information at a glance on probable selection-driven threats to internal and external validity.
DISCUSSION
We have shown variable reporting of some of the major threats to the internal and external validity of observational longitudinal studies. In the articles sampled, on average about half of the 33 checklist criteria were reported, with no differences by type of study and no association with word count. The criteria in the checklist representing selection bias were the least frequently reported overall, although issues of measurement quality were also neglected, with fewer than half of the articles discussing either reliability or validity. These findings are concerning because, if observational longitudinal studies are to be accepted as valuable sources of evidence, complete reporting is required.
Aspects of recruitment, particularly the proportion of sampled subjects meeting the eligibility criteria and then consenting to participate, were poorly reported. In addition, the reasons that people did not consent, and comparisons of consenters with nonconsenters in terms of baseline demographic or clinical features, were typically not reported. These aspects of selection bias are potentially important; if consenters differ from nonconsenters, the study findings may be affected. Dunn et al. (38) recently showed nonconsent in five large epidemiologic studies to be about 30 percent and illustrated how nonconsenters and nonresponders can account for 30–60 percent of the original sample. They recommend that researchers plan their sample sizes a priori to account for potential losses and consider the biases likely to be associated with nonconsent and dropout.
Although the numbers of participants at each stage of a study were recorded in half the articles, accounting for loss to follow-up and missing data items in the analyses was rarely reported. Data missing not at random can be a source of bias affecting internal validity and can also influence estimates of absolute prevalence or incidence (39, 40). In this study, we assessed how missing data were handled by whether the articles described imputation, weighting, or sensitivity analyses. It is acknowledged that, while authors may not statistically account for missing data in these ways, they may speculate about the likely impact of missing data on results. When authors did so, it was captured under criterion 31 of the checklist: "Was the impact of biases estimated quantitatively?" Approximately 60 percent of articles acknowledged the possible quantitative impact of various biases, illustrating a general awareness among authors, or determination by editors, of the necessity for doing so. Methods for dealing with missing data in observational longitudinal research range from simple analysis of between-group differences to complex imputation techniques (41). Although debate exists about the benefits of such imputation methods, it is at least desirable to determine the pattern of missingness, how ignorable or informative the missing data are, and the potential impact that imputation or other approaches may have on the final estimates (40).
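As one concrete (and deliberately crude) example of the kind of sensitivity analysis the checklist asks authors to report, the sketch below bounds a complete-case estimate by imputing extreme values for missing outcomes; the data are simulated, and the approach is ours rather than drawn from the articles reviewed.

```python
# A simple bounding sensitivity analysis for missing outcome data;
# simulated data, for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
outcome = pd.Series(rng.normal(50, 10, size=200))  # e.g., a function score
missing_idx = rng.choice(200, size=40, replace=False)
outcome.iloc[missing_idx] = np.nan                 # 20% lost to follow-up

complete_case = outcome.mean()                     # pandas skips NaN by default
best_case  = outcome.fillna(outcome.max()).mean()  # assume missing did very well
worst_case = outcome.fillna(outcome.min()).mean()  # assume missing did very poorly

# If the bounds are far apart, conclusions are sensitive to the
# (untestable) assumption that data are missing at random.
print(f"complete case: {complete_case:.1f}; "
      f"bounds: [{worst_case:.1f}, {best_case:.1f}]")
```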
None of the 49 articles included any justification for the sample size. An issue for many longitudinal observational studies is lack of statistical power or precision to determine real differences until sufficient follow-up time has passed to accumulate enough outcomes (42). Although the appropriateness of calculating statistical power for these research designs has been questioned (41), a priori consideration of the precision of a longitudinal study to accurately quantify the difference between effects of exposures on an outcome is desirable (35, 38).
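A minimal sketch of such an a priori calculation, inflating recruitment for expected losses in the spirit of Dunn et al.'s recommendation, might look like the following; the effect size, power, and attrition figures are assumptions chosen only for illustration.

```python
# A priori sample-size sketch that inflates recruitment for attrition;
# all inputs are illustrative assumptions.
import math
from statsmodels.stats.power import TTestIndPower

effect_size = 0.3  # assumed standardized difference between exposure groups
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80)

attrition = 0.30   # plausible combined nonconsent/dropout fraction
n_recruit = math.ceil(n_per_group / (1 - attrition))

print(f"analyzable n per group: {math.ceil(n_per_group)}; "
      f"recruit per group: {n_recruit}")
```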
Absolute effect sizes, defined in this study as the difference in rates of disease between groups defined by an exposure (for example, attributable risk), were also infrequently reported. Inclusion of this criterion was strongly debated by the working group because it is not relevant for all observational longitudinal studies. However, absolute effect estimates are a useful measure of association in epidemiologic research (39) and are an underutilized strength of observational longitudinal studies. In the checklist, absolute effects can be seen primarily as a descriptive criterion rather than as an element representing threats to internal validity.
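For concreteness, the attributable risk referred to here is simply a risk (or rate) difference; the incidences in the worked example below are invented.

```latex
% Attributable risk as a risk difference; incidences are illustrative.
\mathrm{AR} = I_{\text{exposed}} - I_{\text{unexposed}} = 0.15 - 0.10 = 0.05
% i.e., 50 excess cases per 1\,000 exposed persons over the follow-up period.
```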
About 40 percent of the articles reported the reliability and validity of the instruments used. In a study of the reporting of psychometric qualities of measures in 171 articles describing rehabilitation studies, Dijkers et al. (43) also found poor reporting, with reliability and validity mentioned in only 20 percent and 7 percent of articles, respectively. Having reliable and valid instruments is one of the best ways of reducing measurement bias in epidemiologic research. Requiring authors to report these psychometric properties may improve the quality of the instruments used and the confidence with which conclusions can be drawn from the results. Obviously, this requirement is unrealistic for every measure in a long list of variables, but it is desirable to have some assessment of measurement quality for the core variables, including confounders, in a particular analysis.
Only four criteria were universally reported in the articles: the study objectives, the study population, the number of participants at the beginning, and the method of data collection. Criteria about confounding, and actions to account for confounding in the analysis, were also generally well reported (in more than 60 percent of the articles). This issue is important because confounding is one of the major limitations of nonrandomized designs such as observational longitudinal studies, and adjustment in the analysis is essential for identifying true effects.
Despite the variable reporting of actions taken to reduce bias, chance, and confounding, three quarters of the articles discussed generalizability of the results to the target population. In some cases, authors acknowledged caveats to generalizability because of limitations such as selection bias. However, it is important to recognize that generalizability should be considered only once assumptions of internal validity are satisfied.
We have shown a need for improved reporting of observational longitudinal research, through application of a reasonable set of criteria and a flow diagram. Even though the clinical example used in this study was stroke, the checklist and flow diagram are independent of topic and so are directly applicable to other fields. If authors are required to report criteria such as those listed in the present study, they may think more carefully about design and analysis issues from the beginning of the study, thus raising the overall quality of research (23, 34). Epidemiologists and biostatisticians may be more prone to report these features because of their training (44), which may partially explain why the articles in the epidemiology journals in this study reported the most checklist criteria. Journal policy toward reporting observational longitudinal research can clearly contribute. A review of authors' guidelines for the six journals used in this study showed a rather low level of required detail specific to nonrandomized designs. The reporting of methodological detail about aspects that threaten internal validity is the domain of editors (and journal policy) and authors. Higher journal quality indicators, such as impact factors, have been linked to better overall reporting in randomized and nonrandomized studies (45); however, we failed to show a clear trend in this study.
We developed a flow diagram that summarizes sample selection, participant recruitment, eligibility criteria, consent and reasons for nonconsent, timing of follow-ups, and attrition at each stage. The choice of criteria to include was based on the desire to capture the key aspects that allow editors and readers to rapidly judge threats to the internal and external validity of the study, balanced with the need to keep the diagram relatively simple. Detail about the analysis was not included to avoid complicating the diagram. As expressed by Rennie, commenting on the benefits of CONSORT, "[when using a] ... checklist and flow diagram, it takes a fraction of the time to get the essential information necessary to assess the quality of a trial" (46, p. 2006).
We recommend that editors move to require authors to use a structured approach to presenting the architecture of observational longitudinal research to communicate essential details about the study design. Doing so may force researchers to organize their thinking during an early stage of their research. The combination of a checklist such as ours, a flow diagram, and, ideally, a structured abstract (47) offers a starting point for consideration.
ACKNOWLEDGMENTS
Assistance is acknowledged from Drs. A. Barnett, Z. Clavarino, J. Najman, A. Lopez, P. Schluter, G. Williams, J. Van Der Pols, A. Mamun, and R. Alati from the Longitudinal Studies Unit at the School of Population Health, University of Queensland (URL: http://hisdu.sph.uq.edu.au/lsu/); and from Dr. A. Green from the Queensland Institute of Medical Research.
REFERENCES