Find. Extract. Predict.

Comparing software tools for optical chemical structure recognition

28 May 2024

The development of transformer-based machine learning models has sparked new interest in image processing, followed by the creation of AI tools for optical chemical structure recognition (OCSR).

Given recent technological developments, the OntoChem R&D team reviewed modern OCSR tools to see whether they could replace the current code used in the OC Processor. The review was published in Digital Discovery, a journal of the Royal Society of Chemistry.

OSRA, the software currently used in the OC Processor, was originally developed at the National Cancer Institute in the USA to recognize chemical structures in documents. In the OC Processor pipeline, OSRA is used to extract chemical compounds and reactions from a variety of sources, including images, tables, and scanned documents.

In the study, the OntoChem R&D team tested the ability of the newest publicly available machine learning tools to decipher depictions of simple chemical structures, multiple chemical structures and chemical reactions. The tools (MolScribe, RxnScribe, DECIMER, Molvec, ReactionDataExtractor, SwinOCSR, and OCMR) were compared with OSRA. The team used 2702 images from patents that contain chemical structures and reactions, and at each stage four human chemists supported the work with independent quality control.

“In general, we were impressed with the quality of structure recognition from images, particularly for simple chemical structures and reactions, where modern AI tools significantly outperformed OSRA,” says the lead author, Lutz Weber. Each tool had a high recall rate of close to 100%, indicating that the methods were able to predict molecules for all or most of the chemistry-containing images.

However, for images containing multiple chemical structures and for large molecules, OSRA still outperformed the other tools.

As a result, a new chemical image classifier (ChemIC) has been developed to route each image to the most appropriate OCSR tool.

Using MolScribe for single-structure images, OSRA for multi-structure images and RxnScribe for reactions would yield better overall results:
Precision = 74 % (+12)
Recall = 90 % (+12)
F1 = 80 % (+12)
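
The routing idea described above can be illustrated with a minimal Python sketch. It is not ChemIC's actual code: the class names, the route_image function and the file-name-based toy_classifier are hypothetical placeholders, and only the image-type-to-tool mapping is taken from the article.

```python
from enum import Enum
from typing import Callable, Dict


class ImageClass(Enum):
    """Hypothetical image categories; the real ChemIC labels may differ."""
    SINGLE_STRUCTURE = "single"
    MULTIPLE_STRUCTURES = "multiple"
    REACTION = "reaction"


# Mapping taken from the article: which OCSR tool handles which image type.
ROUTING: Dict[ImageClass, str] = {
    ImageClass.SINGLE_STRUCTURE: "MolScribe",
    ImageClass.MULTIPLE_STRUCTURES: "OSRA",
    ImageClass.REACTION: "RxnScribe",
}


def route_image(image_path: str, classify: Callable[[str], ImageClass]) -> str:
    """Return the name of the tool that should process the given image."""
    return ROUTING[classify(image_path)]


def toy_classifier(image_path: str) -> ImageClass:
    """Placeholder classifier keyed on the file name (assumption for the demo).

    A real classifier would inspect the image content instead.
    """
    name = image_path.lower()
    if "reaction" in name:
        return ImageClass.REACTION
    if "multi" in name:
        return ImageClass.MULTIPLE_STRUCTURES
    return ImageClass.SINGLE_STRUCTURE


if __name__ == "__main__":
    for path in ["compound_12.png", "multi_panel_3.png", "reaction_scheme_7.png"]:
        print(path, "->", route_image(path, toy_classifier))
```

Keeping the mapping in a single dictionary means that a tool can be swapped for one image type without touching the classifier.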

The research was not intended to examine the underlying reasons for each tool's strengths and weaknesses, but to identify a better overall process for image recognition that delivers improved information to SciWalker and to databases such as Google Patents or PubChem.

“As such, we have identified some issues that could be addressed to help AI methods perform better in the near future. Improving the resizing of images could help with predictions of larger molecules, as could using training sets with multiple chemical structures,” says Weber.

“Such improvements would support researchers and we are working to build them into OntoChem as part of our drive to refine and enhance the service provided to researchers,” he adds.

To learn more about OntoChem’s solutions, please contact us.

Misha Kidambi