Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations from Patent Texts

This study presents a manually annotated gold standard corpus of chemistry-disease relations in patent texts, intended to support the development of reliable relation extraction methods. Named entities for chemicals and diseases are first identified with OCMiner; the relevant relations between them, such as “treats” or “induces,” are then annotated manually. Two reasoning methods, chain reasoning and enumeration reasoning, are introduced to infer additional relations that patents express only indirectly, improving annotation precision and efficiency. The resulting corpus provides a benchmark for evaluating automated extraction methods and for improving knowledge retrieval in pharmacological and chemical research applications.
