CORRESPONDENCE

Cancer Information on the World Wide Web: Gross Characteristics

Craig W. Trumbo

Correspondence to: Craig W. Trumbo, PhD, University of Vermont, Office of Health Promotion Research, 1 South Prospect St., Burlington, VT 05401 (e-mail: craig.trumbo@uvm.edu)

Web-based cancer information has been examined (1–7), but its gross characteristics have not been previously described. This study provides such a description of sites sampled in March 2001 and in March 2003.

In the sample of Web content taken in March 2001, iterative trials were used to determine the simplest search string that would yield results free of erroneous material. The search string used was "cancer OR oncology OR neoplas* OR tumor OR malignan* -astro* -horoscope -tropic -crab -zodiac." Searches were executed with the search engines Northern Light (www.northernlight.com) and AltaVista (www.altavista.com). In 2001, the search engine Google did not allow for Boolean searches. The top 200 results from each of the two search engines were combined, and duplicates and any remaining erroneous links were removed, as were sites for academic journals and Web portals. The result was 306 uniform resource locators (URLs).
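The combining and de-duplication step can be illustrated with a minimal Python sketch. The normalization rule, the function names, and the example URLs are assumptions made for illustration; the source does not describe how duplicates were identified, and the removal of erroneous links, journal sites, and portals was a manual judgment that is represented here only by a hypothetical exclusion set.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Build a comparison key: lowercase host without 'www.', path without trailing slash."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def combine_results(*result_lists, top_n=200, excluded=frozenset()):
    """Merge the top_n URLs of each list, dropping duplicates and excluded keys."""
    seen = set()
    combined = []
    for results in result_lists:
        for url in results[:top_n]:
            key = normalize(url)
            if key in seen or key in excluded:
                continue
            seen.add(key)
            combined.append(url)
    return combined


# Hypothetical result lists standing in for the two search engines.
engine_a = ["http://www.example.org/cancer/", "http://example.com/a"]
engine_b = ["http://example.org/cancer", "http://example.net/b"]
sample = combine_results(engine_a, engine_b)
```

Here the two spellings of the example.org address collapse to one entry, so `sample` holds three URLs.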

A second sample was taken in March 2003. At that time, Northern Light had suspended operation and Google had introduced Boolean searches. The second sample was done exclusively with Google, again taking the top 400 returns and removing the same variety of material. The result was 326 URLs. These samples represent a fair picture of the body of material available to a casual searcher at those times.

Coders were trained to examine the content on a number of manifest variables: home page status, provision of a dated indicator of page currency, and the presence of information on 15 forms of cancer (plus "other"). An index of content breadth was calculated as the sum of the number of forms of cancer present on the page.
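As an illustration of the breadth index, the following Python sketch sums the cancer categories coded as present on a page. The category labels are hypothetical, because the 15 forms of cancer are not listed here, and whether "other" counted toward the index is an assumption.

```python
def breadth_index(coded_page):
    """Count the cancer categories that a coder marked as present on a page."""
    return sum(1 for present in coded_page.values() if present)


# Hypothetical coding of one page; labels are illustrative only.
page = {"breast": True, "lung": True, "prostate": False, "other": True}
print(breadth_index(page))  # 3
```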

To gauge the embeddedness of the page in the Web, the number of links pointing to each given page was assessed by use of the site http://www.linkstoyou.com/Checklinks.htm. The variable reports the number of links pointing toward the page found in AltaVista, excluding internal links. The longevity of the pages in the first sample was assessed by checking to see if the URL was still live 1 and 2 years later (March 2002 and March 2003).

Finally, two readability scores were calculated. The Flesch Reading Ease score rates text on a 100-point scale, with higher scores being easier to read (a score of 60–70 is desirable). The Flesch–Kincaid Grade Level scale places the text on a U.S. grade school level. Both were calculated with the readability utility provided in Microsoft Word. To compute the readability scores, a block of text (e.g., a paragraph) was randomly selected from the page. When the opening page did not provide sufficient text, a link was randomly selected and followed to the next page. It was not necessary to proceed further.
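For readers who wish to approximate these scores outside Microsoft Word, the standard published formulas can be sketched in Python as follows. The syllable counter is a rough heuristic and the sentence splitter is simplistic, both assumptions of this sketch, so results will not necessarily match those produced by Word.

```python
import re


def count_syllables(word):
    """Heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text):
    """Return (sentences, words, syllables) for a block of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(sentences, words, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(sentences, words, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical block of text standing in for a sampled paragraph.
counts = text_counts("The cat sat. The dog ran!")
ease = flesch_reading_ease(*counts)
grade = flesch_kincaid_grade(*counts)
```

Both formulas depend only on average sentence length and average syllables per word, which is why a single sampled paragraph is enough to produce a score.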

When we analyzed the characteristics of the page, several interesting differences were observed between the two samples (Table 1). The 2003 sample included a much greater percentage of home pages, of pages that were more thoroughly linked to, and of content with improved readability. However, even the 10th grade reading level found in the 2003 sample is more difficult than is recommended for a general audience.


Table 1. Page and content characteristics

 
When we analyzed the characteristics of the content on cancer, the only categories of cancer that did not increase in their representation were childhood cancers and other. There was only a slight change in the rankings of the various cancers across the two samples (Spearman rank order correlation, rs = .70; P = .003). As reflected by the jump in the breadth score, it appears that the typical Web site in the 2003 sample addresses a greater variety of cancers than that in the 2001 sample. This might represent a shift away from more specialized Web sites and toward sites that serve a broader audience. Alternatively, this is possibly a consequence of overall site development and expansion. Search engine selection criteria may have also evolved to favor more broadly defined sites.
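The rank comparison can be illustrated with a short Python sketch of the Spearman coefficient, computed as the Pearson correlation of the ranks with tied values given their average rank. The percentages below are hypothetical placeholders, not the values from Table 1, and the P value is not computed.

```python
def average_ranks(values):
    """Rank values from 1 (largest) downward; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank order correlation between two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical percentages of pages covering five cancers in each sample.
pct_2001 = [30, 25, 20, 15, 10]
pct_2003 = [40, 45, 30, 35, 20]
rs = spearman_rs(pct_2001, pct_2003)
```

A coefficient near 1 means the cancers keep nearly the same order of prominence in both samples, even if every percentage rises.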

The assessment of page longevity found that 1 year later 79% (95% confidence interval [CI] = 73% to 83%) of URLs were alive and that 2 years later 58% (95% CI = 52% to 63%) were still alive. Home pages were much more likely than other pages to remain alive across the 2-year study period. At 1 year, 96% of home pages were alive versus 73% of other pages (χ² = 16; P<.001), and at 2 years, 94% of home pages remained alive versus 47% of other pages (χ² = 48; P<.001).
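A confidence interval of this kind can be approximated with the normal (Wald) formula for a proportion, sketched below in Python. The method used to obtain the reported intervals is not stated, so the choice of the Wald interval is an assumption; the example applies it to the 2-year survival figure and the 306 URLs of the first sample.

```python
import math


def wald_ci(proportion, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion."""
    se = math.sqrt(proportion * (1 - proportion) / n)
    return proportion - z * se, proportion + z * se


# 58% of the 306 URLs in the first sample were alive after 2 years.
low, high = wald_ci(0.58, 306)
print(f"{low:.3f} to {high:.3f}")
```

Under this assumption the interval runs from roughly 52% to 64%, close to the reported range.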

In the broadest sense, it might be argued that our analysis shows that in a number of ways the Web's cancer content may have improved during the 2 years of this study. Nearly all forms of cancer are better represented on a greater percentage of home pages that are more extensively linked through the Web. Readability has also improved.

This approach to evaluating Web content may hold some promise. All of the studies done to date on the Web's representation of cancer have focused on single cancers as exemplars, generalizing the results to the broader content. The differences seen in this study between specific cancers may suggest that such generalizations are not entirely reliable.

That said, it must also be pointed out that the picture painted by this examination of the Web's cancer content is still only partial at best. Sampling broad categories of Web content is problematic, and the sheer volume of material available makes a complete characterization nearly impossible. Changes in search engines, especially across longer time spans, make comparisons difficult. It might also be argued that the population of Web sites is unknowable and that all that can ever be observed are the results provided by idiosyncratic and evolving search engines. In that sense, our study is not characterizing the Web but, rather, is characterizing the nature of search returns. In any case, the refinement of this manner of Web content evaluation could provide valuable feedback for those working to provide cancer information to the public.

NOTES

Supported by Public Health Service grant CA88604-02 from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services.

REFERENCES

1 Wilson FL, Baker LM, Brown-Syed C, Gollop C. An analysis of the readability and cultural sensitivity of information on the National Cancer Institute's Web site: CancerNet. Oncol Nurs Forum 2000;27:1403–9.

2 Biermann JS, Golladay GJ, Greenfield ML, Baker LH. Evaluation of cancer information on the Internet. Cancer 1999;86:381–90.

3 Bichakjian CK, Schwartz JL, Wang TS, Hall JM, Johnson TM, Bierman JS. Melanoma information on the Internet: often incomplete–a public health opportunity? J Clin Oncol 2002;20:134–41.

4 Hellawell GO, Turner KJ, Le Monnier KJ, Brewster SF. Urology and the Internet: an evaluation of internet use by urology patients and of information available on urological topics. BJU Int 2000;86:191–4.

5 Hoffman-Goetz L, Clarke JN. Quality of breast cancer sites on the World Wide Web. Can J Public Health 2000;91:281–4.

6 Meric F, Bernstam EV, Mirza NQ, Hunt KK, Ames FC, Ross MI, et al. Breast cancer on the world wide web: cross sectional survey of quality of information and popularity of websites. BMJ 2002;324:577–81.

7 Tamm EP, Raval BK, Huynh PT. Evaluation of the quality of self-education mammography material available for patients on the Internet. Acad Radiol 2000;7:137–41.



             
Copyright © 2004 Oxford University Press (unless otherwise stated)