The versions of the DTD files that were current when the Medline subset was captured are archived here on the TREC Genome data site. We recommend that you download the DTDs from here and then modify the XML files to use your local copy of the DTDs. The XML files are currently set up to use the DTDs at NLM. This will work, but the NLM DTDs are subject to change, whereas the TREC Medline subset is static.

The DTDs are a bit confusing because of the number of files and the fact that they reference one another in a hierarchical order. All of the DTDs here are required to give the full picture of the XML subset layout. The DTD files are referenced in the following order:

    pubmed_021101.dtd
    nlmmedline_021101.dtd
    nlmmedlinecitation_021101.dtd
    nlmcommon_021101.dtd

NOTES:

1) There is a good description of the DTD files at the following URL:
   http://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html
   This web page describes the various XML tags and what you can expect to find in each of them.

2) Here is a web page that lists the various diacritic and UTF-8 encodings that can be found in the XML files:
   http://www.nlm.nih.gov/databases/dtd/medline_character_database.utf8
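
As a rough sketch of the recommendation above to point the XML files at a local copy of the DTDs, the following Python function rewrites the system identifier in a file's DOCTYPE declaration. The function name localize_doctype, the directory name "dtd", and the sample DOCTYPE line are illustrative assumptions, not taken from the actual data files; check the DOCTYPE in your own copy of the XML before relying on this.

```python
import re

# Matches the first quoted http(s) URL ending in a .dtd file name
# inside a DOCTYPE declaration, e.g. ".../pubmed_021101.dtd".
_DOCTYPE_RE = re.compile(
    r'(<!DOCTYPE\b[^>]*?)(["\'])https?://[^"\']*/([^/"\']+\.dtd)\2',
    re.IGNORECASE,
)


def localize_doctype(xml_text, local_dtd_dir):
    """Return xml_text with the DOCTYPE's remote DTD URL replaced by
    a path under local_dtd_dir. Text without such a URL is returned
    unchanged. (Hypothetical helper, for illustration only.)"""

    def _swap(match):
        prefix, quote, dtd_name = match.group(1), match.group(2), match.group(3)
        local_path = f"{local_dtd_dir.rstrip('/')}/{dtd_name}"
        return f"{prefix}{quote}{local_path}{quote}"

    # count=1: a document has only one DOCTYPE declaration.
    return _DOCTYPE_RE.sub(_swap, xml_text, count=1)


if __name__ == "__main__":
    # Assumed example header; the real files may differ.
    sample = (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle//EN" '
        '"http://www.example.org/DTD/pubmed_021101.dtd">\n'
        "<PubmedArticleSet/>"
    )
    print(localize_doctype(sample, "dtd"))
```

Only the XML file's own DOCTYPE is touched here. If the archived DTDs refer to one another by absolute URL rather than by relative file name, those references may need the same kind of edit so that the whole chain resolves locally.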