Assessing research outcomes by postal questionnaire with telephone follow-up

CJ Parker,a ME Dewey b and, on behalf of the TOTAL Study Group,c

a Division of Rehabilitation and Ageing, University of Nottingham, Nottingham, UK.
b Trent Institute for Health Services Research, University of Nottingham, Nottingham, UK.
c The TOTAL Study Group. Nottingham: John Gladman, Avril Drummond, Michael Dewey, Nadina Lincoln, Chris Parker, Philippa Logan, Kate Radford; Aintree: Department of Medicine for the Elderly Research Team: Anil Sharma, Caroline Watkins, Michael Leathley, Hazel Dickinson, Elaine Mackie, Jan Rhodes, Liz Lightbody; OT service: Julie Murray, Geralyn Lennon, Carol Mullarkey, Hazel McCormick, Val Chisnall, Sean McCann; Bristol: Meg Birch, Helen Smith, Dave Gamm, Mary Vincent, Sally Chorlton, Dee Sessions; Edinburgh: Martin Dennis, Alison Chalmers, Lesley Moffat; Glasgow: Peter Langhorne, Joyce Peters, Karen Blackwood, Louise Gilbertson; Newcastle: David Barer, Susan Fall.

Reprint requests: Dr ME Dewey, Trent Institute for Health Services Research, Floor B Medical School, Queens Medical Centre, Nottingham NG7 2UH, UK. E-mail: michael.dewey@nottingham.ac.uk


    Abstract
 
Background Face-to-face assessment of research outcomes is expensive and may introduce bias. Postal questionnaires offer a cheaper alternative which avoids observer bias, but non-response and incomplete response reduce the effective sample size and may be equally serious sources of bias. This study examines the extent and potential effects of missing data in the postal collection of outcomes for a large rehabilitation trial.

Methods Questionnaires containing a number of established scales were posted to participants in a trial of occupational therapy after stroke. Response was maximized by telephone and postal reminders, and incomplete questionnaires were followed up by telephone. Scale scores obtained by imputing values to questionnaire items missing on return were compared with those achieved by telephone follow-up.

Findings Response to the initial posting was 60%, rising to 85% after reminders. Participants receiving the experimental treatment were more likely to respond without a reminder. There were no significant differences on any known factors between eventual responders and non-responders. Of the questionnaires, 43% were incomplete on return: partial responders differed significantly from complete responders on baseline disability and home circumstances. Of the incomplete questionnaires, 71% were resolved by telephone follow-up. In these, the scale scores achieved by telephone were generally higher than those derived by conventional imputation.

Conclusion Postal outcome assessment achieved a good response rate, but considerable effort was needed to minimize non-response and incomplete response, both of which could have been serious sources of bias.

Keywords Research outcomes, bias, postal questionnaires, stroke therapy

Accepted 8 May 2000


    Introduction
 
In both population-based studies and trials of health care procedures, outcome measurements are often obtained from people living in their own homes. Face-to-face interviews are frequently used to collect the data, but it is known that this can introduce bias. For example, interviewer characteristics such as age, sex and race, and interviewer knowledge of the purpose of the study, have been shown to influence the results of surveys.1,2 Trial outcomes may be assessed by an independent clinician who is masked to treatment group, but opportunities for accidental unmasking are difficult to eliminate3 and may result in serious bias.4 In studies conducted over several geographical areas the use of separate interviewers or assessors in each area presents another potential source of bias.

There are also practical difficulties in conducting face-to-face interviews or assessments in the community. The costs can be substantial, particularly if the study population is geographically dispersed. Study participants who move out of the area may be lost to follow-up. Standardizing interview or assessment procedures across a large study requires extensive training at the outset and periodically thereafter, particularly in a lengthy study when there may be substantial staff turnover.

Measurement of outcomes by postal questionnaire offers both methodological and practical advantages over home visits. The potential bias introduced by interviewer or assessor is avoided, and the measurement procedure can be standardized across a large study. Respondents may answer sensitive questions more honestly in a self-completed questionnaire.5 Costs are considerably reduced,6 and provided that participants who move can be traced there is no additional difficulty or cost in obtaining their data. However, some outcomes are clearly unsuitable for postal assessment, and in others the data obtained by this method may be of poor quality. Response rates to postal questionnaires can be low, introducing bias if responders differ from non-responders.7 Returned questionnaires may contain missing, unclear or invalid item responses, making the calculation of scale scores problematic and again potentially introducing bias. Both non-response and incomplete response reduce the effective sample size, making it more likely that a true effect will not be detected.

The aim of this report is to describe the practical and methodological implications of a postal assessment process enhanced by telephone contact. The results presented here derive from a rehabilitation trial with a large multi-centre sample and a number of well-known outcome scales, and are relevant to planning future studies using postal assessment of outcome, both in rehabilitation research and more widely.


    Methods
 
TOTAL (Trial of Occupational Therapy and Leisure) was a randomized trial comparing two forms of occupational therapy for stroke patients with a no-treatment control group: the methodology has been described in full elsewhere.8 Patients with a recent stroke were recruited at five centres: Aintree (Fazakerley Hospital), Bristol (Southmead Hospital), Edinburgh (Western General Hospital), Glasgow (Royal Infirmary) and Nottingham (University Hospital). On discharge, baseline information was collected on eligible and consenting patients and comprised date of birth, sex, marital status, date and side of stroke, date of discharge, main co-resident, pre-stroke modified Rankin score9 and current Barthel score.10 Patients were randomly allocated to three groups: one receiving occupational therapy based on leisure activities, one receiving conventional occupational therapy based on activities of daily living (ADL), and one receiving no occupational therapy within the trial (control).

Outcome assessments were conducted by post from Nottingham 6 months and 12 months after random allocation: this paper presents data from the 6-month assessment. The follow-up procedure was conducted by the trial secretary, who was masked to group allocation. Each centre provided details of patient deaths and changes of address. The patient follow-up questionnaire contained a number of scales with established postal use: (1) the Barthel ADL Index,10 a 10-item scale of self-care ADL ability; (2) the Nottingham Extended ADL scale (NEADL),11 a 22-item scale of instrumental ADL such as outdoor mobility, household tasks and leisure pursuits; (3) the Nottingham Leisure Questionnaire (NLQ),12,13 a 30-item scale of leisure activity; (4) the General Health Questionnaire, 12-item version (GHQ12),14 a widely used measure of emotional health; and (5) the London Handicap Scale (LHS),15 a 6-section scale of handicap.

The questionnaire was designed for ease of completion, with large font size and multiple choice tick boxes for each question. Any carer helping the patient to complete the questionnaire was asked to record their relationship to the patient. A carer questionnaire was also included, with items on the level of care provided and the GHQ12. A covering letter reminded patients of the purpose of the trial and emphasized the value of each participant's contribution.

If no response was received within 2–3 weeks an attempt was made to contact the patient by telephone, and a further questionnaire was posted if the original had been mislaid or if telephone contact was not achieved after repeated attempts. Non-responders were traced through computerized hospital systems and GP surgeries with the help of staff at each centre. Each patient received a maximum of one telephone reminder and one postal reminder. Any incomplete or ambiguous information in returned questionnaires was clarified by telephone if contact could be made. Items resolved by telephone were coded for data entry such that the nature of the problem (response missing or ambiguous) and the response achieved by telephone contact were captured.

Statistical methods
The relationship between response and known characteristics of participants was examined using logistic regression. The characteristics entered into the regression were those collected at baseline, such as age, sex and level of disability, as well as treatment group and whether a carer helped the patient to complete the questionnaire. Baseline age and Barthel score were treated as categorical variables to avoid constraining them to having a linear effect on the outcomes. The odds ratios (OR) presented are those derived from the logistic models and are adjusted for all other significant predictors of that outcome.
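The odds ratios reported in the tables are adjusted estimates from the logistic models, but the underlying quantity can be illustrated from a single 2x2 table. A minimal sketch (the counts are hypothetical, and Woolf's log-OR method is assumed for the confidence interval):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and 95% CI from a 2x2 table.

    a/b = responders/non-responders in one group,
    c/d = responders/non-responders in the comparison group.
    The interval uses Woolf's method: SE of log(OR) from cell counts.
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts for illustration only (not the trial's data)
print(odds_ratio_ci(50, 50, 25, 75))  # OR = 3.0 with its 95% CI
```

In the paper itself the ORs come from the fitted logistic models, so they are adjusted for the other significant predictors rather than computed from a raw table as here.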

The conventional imputation strategy appropriate to the scale was used to compute scale scores when one or more item responses were missing. Thus for the Barthel, NEADL (scored 0,1,2,3) and NLQ (scored 0,1,2), missing was treated as ‘never’ or ‘not at all’ and counted zero towards the total score. For the GHQ12 (scored 0,1,2,3), missing was treated as ‘same as usual’ and counted one towards the total. The scoring system for the LHS does not allow a total to be computed if any item is missing. Multiple responses to single-choice items were treated in the same way as those missing.
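The imputation rules above can be expressed compactly. A minimal sketch, assuming item responses are coded as integers with None marking a missing or multiple response (the function name and encoding are our own; the LHS branch reproduces only the property that matters here, that no total is computed when any item is missing):

```python
def scale_score(items, scale):
    """Score a scale from item responses using the conventional
    imputation strategy: missing/multiple (None) counts 0 for the
    Barthel, NEADL and NLQ, and 1 ('same as usual') for the GHQ12.
    The LHS does not permit a total if any item is missing."""
    fill = {"barthel": 0, "neadl": 0, "nlq": 0, "ghq12": 1}
    if scale == "lhs":
        if any(v is None for v in items):
            return None  # score cannot be computed
        return sum(items)
    return sum(fill[scale] if v is None else v for v in items)

# e.g. a NEADL item left blank contributes 0 to the total
print(scale_score([2, None, 3], "neadl"))  # 5
```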


    Results
 
Response to initial posting
During the period July 1996 to June 1998, 466 patients were recruited and randomized; of these, 26 (6%) died before the 6-month follow-up. Questionnaires were posted to the remaining 440: 264 (60%) returned the questionnaire without reminder. The only participant characteristics predicting response to the initial posting were treatment group and centre: patients in the leisure treatment group were more likely, and patients at the Aintree centre less likely, to respond. The proportions responding, together with adjusted OR and 95% CI, are shown in Table 1.


Table 1 Predictors of response to initial posting
 
Response to reminders
After a total of 106 telephone and 111 postal reminders the eventual response was raised to 374 (85%). The median time interval between despatch and return for all questionnaires was 12 days, quartiles (7,24). Contact was made with 47 (71%) of the non-responders: of these 29 refused overtly or implicitly and 10 were too ill. The remainder were confused (3), had a language difficulty (3) or were unable to complete the questionnaire and had no help (2).

The only significant predictor of eventual return of the questionnaire was centre (Table 2). There was a non-significant trend towards lower response among patients with a lower baseline Barthel score (P = 0.14).


Table 2 Predictors of eventual response
 
Incomplete questionnaires
The number of patient questionnaires returned was 373 (one responder returned only the carer questionnaire). Of these, 160 (43%) were incomplete, with missing responses (none of the multiple choice boxes ticked for that item) and/or multiple responses (more than one box ticked for an item requiring a single response). The median number of problem items per incomplete questionnaire was 3, quartiles (1,7). There were more questionnaires with missing responses (142, 38%) than with multiple responses (71, 19%). Not all the measurement scales were equally affected: Table 3 shows the proportion of questionnaires with missing or multiple responses on each scale. The NLQ and NEADL were the most likely to have missing responses (25% and 19% of questionnaires, respectively) and the LHS to have multiple responses (12%). There was a median of two problem items for each scale except the GHQ12, where it was four.


Table 3 Missing or multiple item responses on return of questionnaire
 
Significant predictors of a questionnaire being returned incomplete were whether a carer had helped with completion, whether the patient lived alone, and baseline Barthel score. Patients who had no help with completion and those living alone were more likely to return incomplete questionnaires. There was a non-linear effect of Barthel score at baseline, with both severely disabled and non-disabled patients being more likely than moderately disabled patients to return incomplete questionnaires (Table 4).


Table 4 Predictors of returned questionnaire being incomplete
 
Telephone follow-up of incomplete questionnaires
Of the 160 participants with incomplete questionnaires, 113 (71%) were contacted by telephone and the missing and multiple responses resolved. Successful resolution of an incomplete questionnaire was not related to any of the known variables. The proportion of questionnaires with unresolved missing or multiple responses on each scale is shown in Table 5. Overall, the proportion with any problem items fell from 43% before telephone follow-up to 13% afterwards. On the two scales most affected by missing responses, the NLQ and NEADL, the proportions fell from 25% and 19%, respectively, to 6% on each. The other main problem area, multiple responses on the LHS, was reduced from 12% to 3%.


Table 5 Missing or multiple item responses after telephone follow-up
 
Where telephone follow-up was achieved, the scale scores thus obtained were compared with those computed from the incomplete questionnaire using conventional imputation. On the Barthel scale, 19 incomplete scores were resolved by telephone, 70% of those with incomplete response on this scale. In those resolved, telephone follow-up resulted in a change to the imputed score in 15 (79%). The proportion of the entire followed-up sample affected by score change was 4%. The size of the change in score ranged from +1 to +20 points (median +5): this includes five scores completed entirely by telephone. These results are summarized in Table 6 together with similar information for the other scales used. They show an overall tendency for ADL and leisure scores to be increased by telephone follow-up, with a larger proportion of the sample being affected for the NEADL and NLQ. On the GHQ12 the score could increase (worsen) or decrease (improve) because missing was counted as a score of 1: on average there was a small increase. Because of the method of scoring the LHS, all resolved questionnaires had a change in score and the size of the change was typically large. Taking all the scales overall, the proportion of the followed-up sample affected by score change varied between 4% and 14%, and the median score change in affected questionnaires ranged between 1.5 and 63 points.


Table 6 Effect of telephone follow-up on scale scores
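A comparison of this kind reduces, per scale, to the proportion of followed-up questionnaires whose score changed and the median change among those affected. A minimal sketch with hypothetical scores (the function and data are illustrative, not the trial's):

```python
from statistics import median

def score_change_summary(imputed, resolved):
    """Summarize the effect of telephone resolution on scale scores.

    `imputed` and `resolved` are parallel lists: the score computed by
    conventional imputation and the score after telephone follow-up.
    Returns the proportion of questionnaires whose score changed and
    the median change among those affected."""
    changes = [r - i for i, r in zip(imputed, resolved) if r != i]
    return {
        "prop_changed": len(changes) / len(imputed),
        "median_change": median(changes) if changes else 0,
    }

# Hypothetical Barthel scores: two of four questionnaires change
summary = score_change_summary([10, 12, 15, 20], [15, 12, 16, 20])
```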
 

    Discussion
 
Postal assessment of outcomes in the TOTAL study achieved a high response rate. However, our experience emphasizes the need for considerable efforts after the initial posting of questionnaires to maximize response and reduce the opportunities for bias. On the basis of our results the expected workload to obtain 100 fully completed questionnaires, in addition to any efforts to trace patients, could be described as: 118 initial postings, 28 telephone reminders, 30 postal reminders and 43 telephone contacts to resolve incomplete responses.
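These per-100 figures follow directly from the counts reported in the Results (440 questionnaires posted, 374 eventual responses, 106 telephone and 111 postal reminders, 160 incomplete returns), scaled to 100 eventual responses. The arithmetic:

```python
# Counts reported in the Results section of this paper
posted, returned = 440, 374
phone_reminders, postal_reminders = 106, 111
incomplete = 160  # returned questionnaires needing telephone resolution

def per_100_responses(n):
    # Scale a count to a rate per 100 eventual responses
    return round(100 * n / returned)

workload = {
    "initial postings": per_100_responses(posted),
    "telephone reminders": per_100_responses(phone_reminders),
    "postal reminders": per_100_responses(postal_reminders),
    "resolution calls": per_100_responses(incomplete),
}
print(workload)
# {'initial postings': 118, 'telephone reminders': 28,
#  'postal reminders': 30, 'resolution calls': 43}
```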

Responders who needed reminders were different from those who did not, with patients in the leisure treatment group (the experimental treatment in this trial) being more likely to respond to the initial posting. Postal assessment without the capacity to remind non-responders would therefore have resulted in a higher proportion of followed-up patients in this group, raising the possibility of bias. Previous research has found differences between participants who respond readily and those needing more persistence,16 but the relationship between readiness to respond and treatment group has particular implications for trials in which masking of participants and clinicians is not possible, and may indicate greater enthusiasm generated by the ‘new’ treatment. There is no evidence that the level of non-response remaining after reminders was a source of bias, as the probability of eventually returning a questionnaire was not significantly related to treatment group or to any of the baseline factors such as age or disability. There were, however, significant differences in response rates across centres, which was perhaps surprising given that the follow-up procedure was conducted centrally. Possible explanations include differences between the populations and varying encouragement to respond by local therapists: if the latter was an important factor it would have implications for maximizing response in future multi-centre studies.

Almost half the returned questionnaires had some missing or ambiguous item responses, a much higher proportion than that typically found in questionnaire surveys of the general population.7 Incomplete responders differed significantly from complete responders: they were more likely to have had no help with completion, to live alone, and to have been at either end of the disability spectrum at baseline. Failure to use data from incomplete responders would therefore have presented another opportunity for bias to enter the study.

The outcome of the main study depended not on the individual item responses but on the scale scores derived from them. We compared two ways of calculating scale scores when one or more item responses were missing: the simple, conventional imputation strategy and telephone follow-up. Neither method represents a gold standard for filling in missing responses. Conventional imputation is a broad-brush technique making assumptions that are unlikely to be justified. There are of course more complex imputation techniques which are mainly used in large-scale population surveys.17 The response achieved by telephone may not have accorded perfectly with the one which would have been entered on the questionnaire. The method of administration has been found to influence response to questions, for example in depression scales with similar items to the GHQ.18 The task of resolving incomplete questionnaires by telephone was large in terms of the number of patients to contact but was usually limited to a small number of items per questionnaire. Patients were generally willing and able to answer items by telephone, although other studies have identified problems in conducting telephone interviews with elderly respondents.19

We found that conventional imputation underestimated scale scores measuring independence in activities of daily living, involvement in leisure activities, and emotional ill-health, relative to telephone contact. As the number of problem items per scale was typically small, the resulting difference in scores was also on average small, but for some scales it affected a significant proportion of the followed-up sample. The analysis of the main trial used the scores achieved by telephone as the less imperfect of the two available measures.

In summary, postal questionnaires offer a practical method of collecting research outcomes, avoiding the potential bias related to interviewer or independent assessor, but should not be considered an option requiring little effort. Failure to invest in the administration time needed to maximize the quantity and quality of information obtained could replace one source of bias by others which are equally serious.


    Acknowledgments
 
We acknowledge with thanks the financial support provided by the NHS Research and Development Programme (Cardiovascular Disease and Stroke), by the NHS R&D Programme for Health Technology Assessment, and by Lothian Health. We are grateful for the assistance provided by the people listed below and for the co-operation of the patients taking part in the trial. Nottingham: Viv Kirk, Sandra Harrison, Patricia Church, Ward F21 staff, Occupational Therapists at QMC, Lings Bar, Highbury, City Hospital; Aintree: Ann Brierly, Betty Poole; Bristol: Carol Flower, Sonia Hardy, Pam Holt, Chris Wood, Sue Simeone, Sue Robins; Glasgow: Sheena Jamieson, Rachel McCaffrey, Laura Gordon, Claire O'Reilly; Newcastle: Randomisation service staff.


    References
 
1 Streiner DL, Norman GR. Health Measurement Scales, A Practical Guide to Their Development and Use. 2nd Edn. Oxford: Oxford University Press, 1995, pp.189–205.

2 Sackett DL. Bias in analytic research. J Chron Dis 1979;32:51–63.

3 Siemonsma PC, Walker MF. Practical guidelines for independent assessment in randomised controlled trials of rehabilitation. Clin Rehab 1997;11:273–79.

4 Noseworthy JH, Ebers GC, Vandervoort MK, Farquhar RE, Yetisir E, Roberts R. The impact of blinding on the results of a randomised, placebo-controlled multiple sclerosis trial. Neurology 1994;44:16–20.

5 Schwarz N, Strack F, Hippler HJ, Bishop G. The impact of administration mode on response effects in survey measurement. Appl Cognit Psychol 1991;5:193–212.

6 Siemiatycki J. A comparison of mail, telephone, and home interview strategies for household health surveys. Am J Public Health 1979;69:238–45.

7 Foster K. Evaluating Non-response on Household Surveys. Government Statistical Service Methodology Series No. 8. London: Office for National Statistics, 1998.

8 Parker CJ, Gladman JG, Drummond AER et al. on behalf of the TOTAL study group. A multi-centre randomised controlled trial of leisure therapy and conventional occupational therapy after stroke. Submitted.

9 Van Swieten JC, Koudstaal PJ, Visser MC, Schouten HJA, Van Gijn J. Interobserver agreement for the assessment of handicap in stroke patients. Stroke 1988;19:604–07.

10 Mahoney FI, Barthel DW. Functional evaluation: the Barthel Index. Maryland State Med J 1965;14:661–65.

11 Nouri FM, Lincoln NB. An extended activities of daily living scale for stroke patients. Clin Rehab 1987;1:301–05.

12 Drummond AER, Walker MF. The Nottingham Leisure Questionnaire for stroke patients. Br J Occup Ther 1994;57:414–18.

13 Parker CJ, Logan PA, Gladman JRF, Drummond AER. A shortened version of the Nottingham Leisure Questionnaire. Clin Rehab 1997;11:267–68.

14 Goldberg D, Williams P. A User's Guide to the General Health Questionnaire. Windsor: NFER-NELSON, 1992.

15 Harwood R, Gompertz P, Ebrahim S. Handicap one year after stroke: validity of a new scale. J Neurol 1994;57:825–29.

16 Rao PSRS. Callbacks, follow-ups, and repeated telephone calls. In: Madow WG, Olkin I, Rubin DB (eds). Incomplete Data in Sample Surveys. Vol. 2: Theory and Bibliographies. New York: Academic Press, 1983, pp.33–44.

17 Report of the Task Force on Imputation. Government Statistical Service Methodology Series No. 3. London: Office for National Statistics, 1996.

18 Geerlings SW, Beekman ATF, Deeg DJH, Van Tilburg W, Smit JH. The Center for Epidemiologic Studies Depression scale (CES-D) in a mixed-mode repeated measurements design: sex and age effects in older adults. Int J Meth Psychiatr Res 1999;8:102–09.

19 Wilson K, Roe B. Interviewing older people by telephone following initial contact by postal survey. J Adv Nurs 1998;27:575–81.