Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

To improve chemical data extraction from the scientific literature, researchers have explored approaches for identifying and analyzing specific compounds in text.

This study examines the extraction and analysis of chemical structures, particularly per- and polyfluoroalkyl substances (PFAS), from literature and patent documents using open-source cheminformatics toolkits such as CDK, RDKit, and OpenChemLib. It addresses the difficulty of standardizing chemical data when the same compound can be represented in many ways in text, and uses PFAS as a case study to compare structural definitions and their effect on which compounds are extracted. Consistent cheminformatics approaches are shown to be essential for effective environmental monitoring and regulatory assessment of PFAS, which are persistent “forever chemicals” with significant health concerns.
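To make concrete how the choice of structural definition changes which compounds count as PFAS, here is a minimal sketch using RDKit. It is not the study's workflow. The two SMARTS patterns are illustrative assumptions: `oecd_like` is loosely modeled on the 2021 OECD wording (at least one fully fluorinated methyl or methylene carbon with no H, Cl, Br, or I attached), and `chain_only` is a hypothetical stricter rule that requires a CF3 group bonded to a second fluorinated carbon. The names `DEFINITIONS` and `classify` are made up for this example.

```python
from rdkit import Chem

# Assumption: simplified, illustrative SMARTS patterns. They approximate
# two possible PFAS definitions and are not taken from the publication.
DEFINITIONS = {
    # Saturated carbon, no hydrogens, at least two F, and the remaining
    # two substituents are anything except H, Cl, Br, or I.
    "oecd_like": Chem.MolFromSmarts(
        "[CX4;H0](F)(F)([!#1;!Cl;!Br;!I])[!#1;!Cl;!Br;!I]"
    ),
    # Hypothetical stricter rule: a CF3 group attached to a carbon
    # that itself carries at least two fluorines.
    "chain_only": Chem.MolFromSmarts("[CX4](F)(F)(F)[CX4](F)(F)"),
}


def classify(smiles):
    """Return, per definition name, whether the structure matches it."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        name: mol.HasSubstructMatch(pattern)
        for name, pattern in DEFINITIONS.items()
    }


if __name__ == "__main__":
    examples = {
        "PFOA": "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
        "trifluoroacetic acid": "OC(=O)C(F)(F)F",
        "ethanol": "CCO",
    }
    for name, smiles in examples.items():
        print(name, classify(smiles))
```

Under these assumed patterns, a short-chain compound such as trifluoroacetic acid satisfies the broader rule but not the stricter one, which is the kind of disagreement between definitions that changes the size of an extracted PFAS set.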
