Automatic identification of relevant chemical compounds from patents

This study presents an automated approach to identifying chemical compounds in patent documents and classifying their relevance, that is, whether a compound is essential to the patent’s core invention. The system combines dictionary-based and morphology-based recognition tools to extract chemical entities, then classifies each entity’s relevance from its context. It achieves an F-score of 86% for compound recognition. Filtering out irrelevant compounds significantly reduces the volume of data and makes patent databases more useful. The system was evaluated on a specialized corpus and is intended to improve the efficiency of databases such as Reaxys, giving faster access to key compounds for scientific and commercial research.
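To make the recognition step and its evaluation more concrete, the following is a minimal sketch and not the system described in the study. It assumes a toy dictionary, a handful of name suffixes as the morphological cue, and the balanced F1 measure (the harmonic mean of precision and recall). The names `DICTIONARY`, `SUFFIXES`, `recognize`, and `f_score` are hypothetical, and the context-based relevance classification is not covered.

```python
import re

# Hypothetical toy dictionary; a real system would use a large curated lexicon.
DICTIONARY = {"aspirin", "ethanol", "acetone"}

# Hypothetical morphological cue: suffixes that often end chemical names.
SUFFIXES = ("amine", "azole", "oxide", "benzene")

# A token starts with a letter and may contain letters, digits, and hyphens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9\-]*")


def recognize(text):
    """Return distinct candidate compound names in order of first appearance."""
    found = []
    for token in TOKEN.findall(text):
        word = token.lower()
        # Accept a token on a dictionary hit or a suffix match.
        if word in DICTIONARY or word.endswith(SUFFIXES):
            if word not in found:
                found.append(word)
    return found


def f_score(predicted, gold):
    """Balanced F1 over two collections of compound names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ("The mixture of aspirin and methylamine was dissolved "
                "in ethanol; toluene was omitted.")
    gold = ["aspirin", "methylamine", "ethanol", "toluene"]
    predicted = recognize(sentence)
    print(predicted)
    print(round(f_score(predicted, gold), 3))
```

In this toy sentence the sketch misses "toluene", which is in neither the dictionary nor the suffix list, so recall drops while precision stays perfect. Combining more than one recognizer is one way to close gaps of this kind.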
