The CHEMDNER corpus of chemicals and drugs and its annotation principles
The CHEMDNER corpus was developed to advance chemical named entity recognition (NER) in the scientific literature. It comprises 10,000 PubMed abstracts annotated with over 84,000 chemical entity mentions. The annotations follow structured guidelines and cover diverse mention classes, such as systematic names, formulas, and identifiers, so that the corpus reflects chemical language as it is used across major scientific disciplines. The dataset has been used to train and evaluate NER systems in the BioCreative challenge, where participating systems reached high accuracy. As a gold-standard resource, the CHEMDNER corpus supports robust evaluation and improvement of NER tools for applications in biomedical and chemical informatics.
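To illustrate how a gold-standard corpus of this kind can be used for evaluation, the following minimal sketch compares predicted chemical mentions against gold annotations and computes precision, recall, and F-score. It assumes exact matching on document identifier and character offsets, which may differ from the official challenge criteria. The `Mention` type, the `score_mentions` function, and the example offsets are hypothetical and are not taken from the corpus or its tooling.

```python
from typing import Dict, NamedTuple, Set


class Mention(NamedTuple):
    """A chemical mention located by document id and character offsets (hypothetical layout)."""
    doc_id: str
    start: int
    end: int


def score_mentions(gold: Set[Mention], predicted: Set[Mention]) -> Dict[str, float]:
    """Compute exact-match precision, recall, and F1 over mention sets."""
    # A prediction counts as correct only if doc id and both offsets match a gold mention.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data: three gold mentions, two predictions, one exact match.
gold = {Mention("doc1", 0, 7), Mention("doc1", 20, 31), Mention("doc2", 5, 9)}
predicted = {Mention("doc1", 0, 7), Mention("doc2", 5, 10)}

scores = score_mentions(gold, predicted)
print(scores)
```

In this toy example the second prediction overlaps a gold mention but ends one character late, so under exact matching it counts as both a false positive and a missed gold mention. A more lenient overlap-based criterion would score it differently.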