Selection of Control Groups by Using a Commercial Database and Random Digit Dialing

Sara H. Olson, Laura Mignone and Susan Harlap

1 Epidemiology Service, Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY.
2 Department of Obstetrics and Gynecology, New York University Medical Center, New York, NY.


    ABSTRACT
 
Identifying a control group when cases come from a specialized hospital is a challenge for epidemiologists. The authors compared controls recruited by using a commercial database with those recruited by random digit dialing in the context of a hospital-based case-control study of ovarian cancer. This part of the study was conducted in 1997–1998 among women aged 18 years or older who resided in the New York metropolitan area. A mailing list owner grouped cases into "lifestyle" clusters based on US zip+4 postal code microneighborhoods and generated a random sample of potential controls with the same distribution across the clusters. Controls recruited from the commercial database (n = 82) and from random digit dialing (n = 90) were similar in age and race. Women from the commercial database had somewhat more education and higher incomes and were more similar to the cases on these measures. The control groups resembled each other closely in terms of oral contraceptive use, nulliparity, and religion and differed from the cases on these measures. Response rates were similar for the two groups. Only 28% of the cases were included on the mailing list, indicating that it did not reflect the source population of the cases. Use of a commercial database provided a control group whose socioeconomic factors were similar to those of cases at a lower cost than when random digit dialing was used but did not result in a higher response rate. Am J Epidemiol 2000;152:585–92.

case-control studies; databases; epidemiologic methods; socioeconomic factors


    INTRODUCTION
 
Selection of an appropriate control group is an important concern in designing case-control studies. Controls should be selected from the same population as the cases and independently of their exposure status, so that they represent the source population with regard to exposure. Controls should be eligible to be cases if they were diagnosed with the disease of interest during recruitment (1, 2). Studies based in specialized cancer hospitals have an advantage in that a large number of cases is available for study. In this setting, however, selection of an appropriate control group is difficult, since the source population for the cases cannot be defined. As both geographic and socioeconomic factors determine use of a particular hospital, controls may be sought from patients' neighborhoods. "Modified" random digit dialing, in which cases' telephone numbers are used to construct a sampling frame for locating controls, has often been used in these circumstances.

Companies such as Claritas Inc. (San Diego, California) have developed systems that classify neighborhoods into "lifestyle" clusters. Marketers and charities use these systems to describe their customers or donors. The PRIZM system developed by Claritas Inc. categorizes every zip+4 postal code microneighborhood in the United States into one of 62 clusters defined by such factors as residents' income and education, age of head of household, household size, length of residence, race, foreign birth, population and housing density, home ownership/rental, and home value. These clusters are based on the US census and are augmented by information from surveys and data on consumer purchases, for example, from the use of credit cards. The 62 PRIZM clusters can be collapsed into 15 larger groups defined by socioeconomic status and type of area (urban, suburban, second city (smaller cities or satellite cities of major urban areas), small town, or rural). As examples of the type of information available on the clusters, residents of Winner's Circle areas, within the Elite Suburbs socioeconomic group, are described as executive suburban families with a head of household aged 35–64 years and a median income of $90,700 that have a passport and read epicurean magazines. Residents of Old Yankee Rows, within the Urban Midscale group, are described as empty-nest, middle-class families with a head of household aged 25–34 or >65 years and a median income of $34,600 that belong to a union and buy pop music. The clusters and broader socioeconomic groups are listed and described in table 1. More information can be found at Internet site www.claritasexpress.com.


 
TABLE 1. PRIZM lifestyle clusters used to classify US neighborhoods*

 
We questioned whether PRIZM clusters could be used to provide a sampling frame for the selection of controls in studies in which it was appropriate for socioeconomic or geographic characteristics of controls to resemble those of cases. To determine the feasibility and cost of doing so, we devised an experiment to compare controls selected this way with controls selected by using random digit dialing. This experiment was applied during selection of controls for a hospital-based study in New York City.


    MATERIALS AND METHODS
 
We conducted this study in the context of a case-control study of ovarian cancer at Memorial Sloan-Kettering Cancer Center. To establish the sampling frame for both random digit dialing and the commercial database, we used a subset of patients eligible for this study: all those diagnosed between October 1993 and December 1996. There were 301 such patients in the hospital database with addresses in the New York metropolitan area and with telephone numbers.

Controls recruited from the commercial database
We obtained patients' zip+4 codes and sent them to Experian (Allen, Texas), a company that owns a commercial database that classifies households by PRIZM codes. Of the 301 zip+4 codes sent to Experian, 286 were included in Experian's database and were assigned lifestyle clusters. Our patients were found in 40 of the 62 clusters, with the highest concentrations in the following clusters: Winner's Circle, 10.1 percent; Money and Brains, 9.4 percent; Old Yankee Rows, 8.7 percent; and Urban Gold Coast, 8.0 percent. About three-quarters of the cases were included in 13 of the clusters. By using the distribution of our cases across the clusters, Experian sampled from the database and provided us with a list of 1,503 women with the same distribution across lifestyle clusters as our cases and living in the same counties. All households in this database have listed telephone numbers. Between July 1997 and January 1998, we randomly selected about 20 names per week from this sampling frame, for a total of 421 names. We sent a letter to each woman selected, explaining the purpose of the study and that we would follow up with a telephone call. The letters were written on hospital letterhead and were signed by the study's principal investigator; the envelopes were hand addressed and stamped. These procedures are recommended for increasing response rates to mail surveys (3).
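
The frequency matching described above, in which the sample of potential controls is given the same distribution across lifestyle clusters as the cases, can be sketched in Python as follows. This is only an illustration under stated assumptions and is not Experian's actual procedure: the function names, the largest-remainder rounding rule, and the input layout (a list of cluster labels for cases and a dictionary of candidates per cluster) are all hypothetical.

```python
import random
from collections import Counter


def allocate_by_cluster(case_clusters, total):
    """Split `total` sample slots across clusters in proportion to the cases.

    Uses largest-remainder rounding (an assumption) so the quotas sum to `total`.
    """
    counts = Counter(case_clusters)
    n_cases = sum(counts.values())
    exact = {c: total * k / n_cases for c, k in counts.items()}
    quota = {c: int(x) for c, x in exact.items()}
    leftover = total - sum(quota.values())
    # Give the remaining slots to the clusters with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_fraction[:leftover]:
        quota[c] += 1
    return quota


def sample_controls(frame, quota, seed=0):
    """Draw a simple random sample within each cluster.

    `frame` maps a cluster label to the list of candidate identifiers
    available in that cluster (hypothetical structure).
    """
    rng = random.Random(seed)
    sampled = {}
    for cluster, k in quota.items():
        candidates = frame.get(cluster, [])
        sampled[cluster] = rng.sample(candidates, min(k, len(candidates)))
    return sampled
```

With case labels such as `["A"] * 5 + ["B"] * 3 + ["C"] * 2` and a requested total of 20, `allocate_by_cluster` returns quotas of 10, 6, and 4, which `sample_controls` then fills from the candidate lists.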

Controls recruited from random digit dialing
Roper Starch Worldwide Inc. (Princeton, New Jersey), a company that specializes in telephone survey research, conducted the random digit dialing. We provided Roper Starch Worldwide Inc. with the telephone numbers (minus the last three digits) of the 301 cases and age quotas by 5-year age groups based on the age distribution of the cases. Modified random digit dialing was used. Roper Starch Worldwide Inc. generated a list of numbers that began with the same first seven digits as the cases' 10-digit telephone numbers. After the sample was drawn, the company used a computer program to cross-check the numbers selected against listings in the yellow pages of the telephone book to eliminate business numbers and then automatically dialed the remaining telephone numbers, eliminating those that triggered a message that the number was nonworking. This procedure was repeated every 4–6 weeks during the study. Between February and September 1997, interviewers at Roper Starch Worldwide Inc. called randomly selected numbers up to 16 times, using a computer-generated algorithm to distribute callbacks over different days and times. Interviewers administered a very brief questionnaire, ascertaining whether there was a woman in the household who was eligible in terms of age and, if so, obtaining her name and address as well as the best time to call.
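
The number-generation step of modified random digit dialing can be sketched as follows: the first seven digits of each case's 10-digit number are kept and the last three are drawn at random. This is a minimal sketch, not the survey company's software. The function name, the number of candidates drawn per prefix, and the use of fictional 555 numbers are assumptions, and the screening of business and nonworking numbers described above is not modeled.

```python
import random


def modified_rdd_numbers(case_numbers, per_prefix, seed=0):
    """Generate candidate numbers sharing the first seven digits of case numbers.

    `case_numbers` are 10-digit strings; `per_prefix` (hypothetical parameter)
    is how many distinct three-digit endings to draw for each seven-digit prefix.
    """
    rng = random.Random(seed)
    prefixes = sorted({number[:7] for number in case_numbers})
    candidates = []
    for prefix in prefixes:
        # Draw distinct endings 000-999 without replacement.
        for ending in rng.sample(range(1000), per_prefix):
            candidates.append(f"{prefix}{ending:03d}")
    return candidates


# Fictional example numbers, for illustration only.
example = modified_rdd_numbers(["2125550123", "7185551234"], per_prefix=3)
```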

We received the names of 298 age-eligible women from Roper Starch Worldwide Inc. over an 8-month period. Because of time constraints, we telephoned only 231 of these women. The 67 women who were not telephoned by our interviewers were mainly those who had been contacted initially by Roper Starch Worldwide Inc. during the last 2 months of the project. Roper Starch Worldwide Inc. called 1,637 telephone numbers, of which 200 (12 percent) were of unknown usability (i.e., not answered after 16 tries) and 90 (5 percent) were known to be for households but eligibility could not be determined in the time frame of the study. The response rate for this phase of the study was 72.2 percent, calculated as the number of calls completed (those eligible plus those ineligible) divided by the number completed plus the number of women who refused.

Interviewer contacts
Interviewers employed and supervised by the Epidemiology Service at Memorial Sloan-Kettering Cancer Center made telephone calls to potential controls from both sources. The interviewers obtained preliminary verbal consent, mailed the consent form, and, after the signed form was returned, called again to schedule the interview. The interview, conducted by telephone, took on average 68 minutes to complete. The consent form included consent to give both blood for genetic testing and saliva; however, we included participants whether or not they agreed to give biologic specimens. We paid respondents in the control groups $50 for their participation. All procedures and instruments were approved by the Institutional Review Board.


    RESULTS
 
Comparison of response rates
Of the 421 women from the commercial database to whom we sent letters, we located 350 who were eligible. For the other 71 (16.9 percent) women, the telephone number was not in service, the woman no longer had that number, or the woman did not speak English or Spanish. From the Roper Starch Worldwide Inc. list of 298 women who agreed to be contacted, we telephoned 231 of them. The response to interviewers' telephone conversations with potential controls is shown in table 2. For the sample recruited from the commercial database, 56.3 percent of the women gave preliminary verbal consent, while 30.6 percent refused to consider taking part and 13.1 percent could not be reached. For the sample recruited from random digit dialing, 86.6 percent of the women gave verbal permission to receive the consent form, while 10.4 percent refused and 3.1 percent could not be reached. The difference between the two groups reflects the fact that the controls in the random digit dialing group had already agreed to be contacted again, while the controls recruited from the commercial database had only received a letter before being contacted by our interviewers.


 
TABLE 2. Response to interviewers'* telephone calls to secure verbal consent from potential controls recruited from the commercial database and from random digit dialing, New York, 1997–1998

 

The response rates for the two types of controls after they received the consent form are shown in table 3; rates were the same for the two groups, about 45 percent. The overall response rate for the controls recruited from the commercial database was 91/350, or 26.0 percent. When the response rate from the initial telephone calls to locate eligible respondents was taken into account, the overall response rate for controls located by using random digit dialing was 28.1 percent ((90/231) × 72.2 percent). The proportion of controls who gave blood was similar in the two groups, 76.8 percent for the commercial database and 72.2 percent for random digit dialing (data not shown in tables).
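
The two overall response rates quoted above can be reproduced with a few lines of Python. All figures come from the text; only the helper name `pct` and the variable names are introduced here for illustration.

```python
def pct(numerator, denominator):
    """Return numerator/denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


# Commercial database: 91 signed consent forms among 350 eligible women.
commercial_overall = pct(91, 350)

# Random digit dialing: 90 of 231 women telephoned, multiplied by the
# 72.2 percent response rate of the initial screening calls.
screening_rate = 72.2
rdd_overall = pct(90, 231) * screening_rate / 100.0

print(round(commercial_overall, 1))  # 26.0
print(round(rdd_overall, 1))         # 28.1
```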


 
TABLE 3. Response from potential controls recruited from the commercial database and from random digit dialing who received a consent form, New York, 1997–1998

 
The response rate among eligible cases at Memorial Sloan-Kettering Cancer Center was 41.3 percent. The reasons for nonparticipation of cases were refusal (34.0 percent), being deceased or too ill (9.0 percent), physician refusal (4.4 percent), and decision pending when the study closed (6.9 percent); for 4.4 percent, the reason was not recorded. The proportion who gave blood was 80.8 percent (data not shown in tables).

Characteristics of participants
Demographic characteristics of the controls identified from the commercial database and from random digit dialing who completed the study (excluding nine women from the commercial database who signed the consent form but could not be interviewed before the recruitment period ended) are shown in table 4. Women in the two control groups were similar in terms of age, race, and religion. Those identified by using the commercial database were of somewhat higher socioeconomic status as indicated by measures of education and income, although these differences were not statistically significant. There was a large and significant difference in area of residence; more commercial database controls lived in New Jersey and Connecticut and more random digit dialing controls in New York City and New York State.


 
TABLE 4. Demographic characteristics of controls recruited from the commercial database and from random digit dialing and of cases who completed interviews, New York, 1997–1998*

 

Because the sampling frame for the commercial database included only women who had listed telephone numbers, we were interested in determining how many of the women located by random digit dialing had unlisted numbers. We determined whether numbers were listed by looking them up in telephone books or on the Internet or by calling directory assistance. We used the same procedures to locate the telephone numbers of controls selected by using the commercial database to account for women who, after the database was assembled, might have changed their names or addresses or requested that their numbers be unlisted. We did not attempt to look up telephone numbers for the cases, since they had been identified earlier (October 1993 to December 1996) and we did not have access to all telephone books from those years. As expected, more controls identified from random digit dialing had unlisted telephone numbers, 27 versus 5 percent.

Demographic characteristics of cases are also shown in table 4. By design, we intended the characteristics of cases and controls to be similar, with the exception of religion. We found cases to be similar to both control groups in terms of age and race. Their education and income levels were similar to those of the controls recruited from the commercial database and higher than those identified by using random digit dialing. Cases were more likely to refuse to answer the question on income or to say they didn't know. Their geographic distribution was similar to that of the random digit dialing control group. Cases were more likely to be Jewish.

We compared the commercial database and random digit dialing control groups in terms of their use of oral contraceptives and parity, two factors related to risk of ovarian cancer (table 5). The two groups were very similar in terms of oral contraceptive use and the percentage of women who were nulliparous, but the commercial database controls were much more likely to have two or more children. As we expected, there were substantial differences between cases and each control group; cases were more likely to be nulliparous and less likely to have used oral contraceptives.


 
TABLE 5. Use of oral contraceptives by and parity in controls recruited from the commercial database, controls recruited from random digit dialing, and cases who completed interviews, New York, 1997–1998

 
We were concerned that differences in parity between the commercial database controls and the cases might be attributable to differences in state of residence, since more controls were from suburban areas. To address this issue, we looked at the relation between case-control status and parity for the controls recruited from the commercial database by using the following geographic strata: Connecticut, New Jersey, New York City, and other areas of New York State. In each of the four strata, parity was lower for cases than for controls identified by using the commercial database. We also computed crude and adjusted odds ratios for the risk of ovarian cancer associated with parity and found some evidence of confounding by area of residence. For these controls, the crude odds ratio for any children versus none was 0.56, while the Mantel-Haenszel adjusted odds ratio was 0.81. We repeated this analysis for the random digit dialing controls but did not find evidence of confounding by area of residence: the crude and adjusted odds ratios were 0.47 and 0.50, respectively.
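
The comparison of crude and Mantel-Haenszel adjusted odds ratios can be illustrated with the short sketch below. The stratum counts are entirely hypothetical, chosen only to show how a crude estimate can differ from a stratum-adjusted one; they are not the study's data, which are not given in the text. Each stratum is a tuple `(a, b, c, d)` of exposed cases, unexposed cases, exposed controls, and unexposed controls, where "exposed" would correspond to having any children.

```python
def crude_odds_ratio(strata):
    """Odds ratio from the 2x2 table obtained by pooling all strata."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel summary odds ratio across strata.

    OR_MH = sum(a*d/n) / sum(b*c/n), with n the stratum total.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# HYPOTHETICAL counts for two areas of residence (not from the study).
hypothetical_strata = [(10, 10, 40, 20), (30, 10, 30, 5)]
crude = crude_odds_ratio(hypothetical_strata)
adjusted = mantel_haenszel_odds_ratio(hypothetical_strata)
```

In this made-up example the odds ratio is the same within each stratum, yet the pooled (crude) estimate is closer to 1 because exposure and case-control mix differ by stratum. That is the pattern of confounding by area of residence that the adjustment is meant to remove.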

Participation of cases and commercial database controls by area and socioeconomic status
The availability of lifestyle cluster codes for the women in the commercial database sampling frame and for cases enabled us to compare results of recruitment attempts according to the groups based on area and socioeconomic status (table 6). Among the controls identified from the commercial database, about one-third of the women in the Elite Suburbs and the Landed Gentry groups who were approached signed the consent form; in contrast, only 15 percent of those in the Urban Midscale group did so. Among the cases we approached, the percentage of women who consented was lowest in the Second City group. The largest differences between cases and controls were in the Urban Midscale and all other socioeconomic groups, those representing less-affluent areas, in which the proportion of women who consented was much higher among cases than among controls.


 
TABLE 6. Percentage of controls recruited from the commercial database and of cases who signed the consent form, according to PRIZM* socioeconomic group, New York, 1997–1998

 

Completeness of the sampling frame for the commercial database
An additional analysis was undertaken to evaluate the completeness of the commercial database as a source of controls. We did so by determining how many of the 301 patients on our original list were included in the commercial database. The list owner provided us with a file of all of the 1,216 women listed who lived in the zip+4 areas in which our patients lived. We found that 85 (28.2 percent) of the 301 cases appeared on this list. The 301 patients lived in 253 different zip+4 areas. In these areas, the number of households on Experian's list ranged from 1 to 21, with a mean of 4.8 and a median of 4. Since some of the women on the case list might not have been included in the database because they had moved or died, we also analyzed the proportion of households with women identified by random digit dialing who were included in the commercial database. Experian conducted this analysis by computer matching the telephone numbers from random digit dialing to their database. Of those households in which the random digit dialing study determined that there was an adult woman (n = 466), 36.3 percent of them were in the commercial database. Since about 27 percent of the study women from the random digit dialing control group had unlisted telephone numbers (table 4), a maximum of about 73 percent could have been included on the commercial database list.

Costs
Obtaining the Experian mailing list of 1,503 names with the same distribution across the lifestyle clusters as our cases cost $1,500. The cost to send 421 letters included about 10 person-days for drawing the weekly samples, producing the letters and envelopes, and postage: about $1,600. Having Roper Starch Worldwide Inc. provide the names of 298 eligible women cost $22,000. Since we did not use all names provided by Roper Starch Worldwide Inc. before the data collection phase ended, we prorated this amount to estimate the cost of the 231 names we actually used; the prorated cost was $17,050. The cost of obtaining names for potential respondents from the commercial database was therefore 18 percent of the cost of obtaining names by using random digit dialing.
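
The cost comparison above can be checked with the following arithmetic. The dollar amounts and counts are taken from the text; the variable names are illustrative, and the exact prorated figure differs slightly from the rounded $17,050 reported.

```python
# Random digit dialing: $22,000 for 298 names, of which 231 were used.
rdd_total_cost = 22_000
names_provided = 298
names_used = 231
rdd_prorated = rdd_total_cost * names_used / names_provided  # about $17,054

# Commercial database: list purchase plus letter preparation and postage.
commercial_cost = 1_500 + 1_600

# Ratio against the rounded prorated figure reported in the text.
reported_prorated = 17_050
cost_ratio_pct = 100.0 * commercial_cost / reported_prorated

print(round(rdd_prorated))    # 17054
print(round(cost_ratio_pct))  # 18
```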

We investigated whether it took longer for our interviewers to reach controls recruited from the commercial database list than those recruited by random digit dialing, whom Roper Starch Worldwide Inc. had already contacted by telephone. The mean number of days on which potential respondents were called was similar for the two groups: 3.0 (standard deviation, 2.5) for the commercial database and 2.7 (standard deviation, 2.2) for random digit dialing. Any additional cost involved in reaching controls identified from the commercial database compared with random digit dialing appears to be minor.


    DISCUSSION
 
We used a commercial database to select controls matched to cases on lifestyle clusters and compared the characteristics of this control group with those of a control group selected by random digit dialing and matched to cases on age group. The two control groups were similar in terms of age and race, use of oral contraceptives, and religion, factors related to risk of ovarian cancer. The response rate was similar for the two control groups. The main advantages of using the commercial database versus random digit dialing as a source of a control group were that the socioeconomic factors (education and income) of participants from the list were more similar to those of the cases and the cost was considerably lower. Similarity in socioeconomic factors might make these controls more likely to have been in the case group had they been diagnosed with ovarian cancer, and it provides some control of unknown risk factors related to socioeconomic status.

A disadvantage of using the commercial database was that the sampling frame did not include most of the cases, indicating that the list contains a relatively small proportion of the source population. While part of the reason that we could not locate our cases on this list may have been that they had died, had moved, or had unlisted telephone numbers, results for the households identified by random digit dialing confirmed the incompleteness of the commercial database sampling frame. This problem might be overcome in future studies if other databases, or perhaps a combination of databases, were used. In addition, study investigators who use a commercial database might check before purchasing one to determine what proportion of their cases is included on the list being considered. In contrast, the sampling frame for random digit dialing by definition includes all cases, since it is based on their telephone numbers.

Another potential disadvantage of the commercial database was that a large proportion of respondents were residents of New Jersey or Connecticut rather than New York. While they were similar to cases in terms of other demographic measures, those respondents who lived farther away from New York City might have been less likely to use our medical center had they been diagnosed with ovarian cancer. In addition, the geographic distribution of these controls appeared to confound the relation between parity and case-control status. We could have averted this problem if we had frequency-matched the commercial database list by county as well as by lifestyle cluster. While a fairly large number (16.9 percent) of women on the list were ineligible to participate, indicating that the list was out of date, this problem was minor because the amount of time and the cost of determining that these women were ineligible were negligible.

Overall response rates were low for both control groups. The procedures for this particular study are likely to have discouraged participation: we required signed informed consent forms before the interview, conducted a long interview, and requested blood and saliva samples. A large proportion of women who considered participating refused after reading the consent form, which was long and drew attention to the potential risks of genetic testing. There is general agreement in the epidemiologic community that response rates have declined (4). Social and economic factors such as families in which all adults work, the prevalence of telemarketing, and the use of answering machines and caller identification to screen telephone calls are likely to affect response. These problems may be particularly severe in urban areas in which tertiary centers are located. Although the New York area is an attractive setting for epidemiologic research because of the concentration and diversity of the population, these social and economic factors may be particularly important here.

Low response rates raise the potential for biased results if those persons who respond are different from those who do not. In most situations, including those in which random digit dialing is used, little information is available on the characteristics of persons who do not respond, so it is difficult to evaluate the potential for bias. Use of a commercial database with PRIZM codes assigned to cases and controls enabled us to compare responses according to broad socioeconomic groups. We found that the greatest discrepancy between cases and controls recruited from the commercial database was in the participation rates of women in somewhat lower socioeconomic groups. This finding is consistent with other studies that have evaluated characteristics of persons who respond and those who do not (5, 6). This information on cases and controls would enable investigators to adjust results for nonresponse, which is not possible in most epidemiologic studies.

We know of only one other report of a novel strategy for recruiting controls for case-control studies when cases come from a tertiary center. Hudmon et al. (7) recruited controls who were smokers from a large, multispecialty health maintenance organization, which did not strictly represent the source population for their cases. They gave screening questionnaires to patients at the health maintenance organization and included a question on their willingness to take part in a study. Although these authors were unable to assess the proportion of patients who completed the screening questionnaires, they reported that about three-quarters of those who did answered yes or maybe to the question about participating in a future study and that they were able to recruit 87 percent of those patients for a study.

Because of the degree to which the demographic characteristics of the respondents recruited from the commercial database were similar to those of the cases and the lower cost of obtaining these controls, we conclude that commercial databases can provide an alternative to random digit dialing. However, in future studies in which this source of controls is used, an attempt should be made to find a more complete database or a combination of databases that more closely resembles the source population for the cases. The problem of lower response rates needs to be addressed by the epidemiologic community.


    ACKNOWLEDGMENTS
 
This research was supported by National Cancer Institute awards RO3 CA72456 (Dr. Olson) and RO1 CA61088 (Dr. Harlap) and a grant from the Thomas G. Borowik Foundation (Dr. Harlap).

The authors thank the interviewers—Christine Nakraseive, Monica Melo, and Lauren McGuinn.


    NOTES
 
Reprint requests to Dr. Sara H. Olson, Box 44, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021 (e-mail: olsons@mskcc.org).


    REFERENCES
 

  1. Rothman KJ, Greenland S. Modern epidemiology. Philadelphia, PA: Lippincott-Raven, 1998.
  2. Miettinen OS. The "case-control" study: valid selection of subjects. J Chronic Dis 1985;38:543–8.
  3. Dillman DA. Mail and telephone surveys. The total design method. New York, NY: John Wiley & Sons, 1978.
  4. Hartge P. Raising response rates: getting to yes. Epidemiology 1999;10:105–7.
  5. Olson SH, Kelsey JL, Pearson TA, et al. Evaluation of random digit dialing as a method of control selection in case-control studies. Am J Epidemiol 1992;135:210–22.
  6. Groves RM, Lyberg LE. An overview of nonresponse issues in telephone surveys. In: Groves RM, Biemer PP, Lyberg LE, et al, eds. Telephone survey methodology. New York, NY: John Wiley & Sons, 1988:191–211.
  7. Hudmon KS, Honn SE, Jiang H, et al. Identifying and recruiting healthy control subjects from a managed care organization: a methodology for molecular epidemiological case-control studies of cancer. Cancer Epidemiol Biomarkers Prev 1997;6:565–71.
Received for publication April 26, 1999. Accepted for publication November 4, 1999.