Processing images to chemical structures
A lot of scientific information is captured in images, and we use machine learning techniques such as deep neural networks to classify them. For example, we applied transfer learning to train a deep convolutional neural network as an ML classifier that detects whether an image contains a chemical structure. If it does, the image is processed with optical structure recognition (OSR) software such as OSRA (https://sourceforge.net/p/osra/wiki/Home/).
In this way, we have processed all images from US, European Patent Office (EPO) and World Intellectual Property Organization (WIPO) patents and converted every image containing chemical structures into compound structures, each with its SMILES string, InChI and a unique ontology concept ID (OCID, https://registry.identifiers.org/registry/ocid). For example, the following information is extracted from an image of patent US-08754081-B2.
If an unknown compound is found in an image, it is registered in our registration system, which makes these compounds openly available through Google BigQuery in the SciWalker Open Data project (https://console.cloud.google.com/bigquery?project=sciwalker-open-data). This table currently contains more than 130 million compounds with unique InChIKeys.
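The idea of registering only previously unknown compounds can be illustrated with a minimal Python sketch. It assumes that the InChIKey serves as the uniqueness key, as the table description suggests. The class `CompoundRegistry` and its method `register_if_unknown` are hypothetical names for an in-memory stand-in, not the actual registration system, whose interface is not described here.

```python
from typing import Dict


class CompoundRegistry:
    """Hypothetical in-memory stand-in for a compound registration system."""

    def __init__(self) -> None:
        # Maps InChIKey -> SMILES; the InChIKey is assumed to be the unique key.
        self._by_inchikey: Dict[str, str] = {}

    def register_if_unknown(self, inchikey: str, smiles: str) -> bool:
        """Store the compound if its InChIKey is new; return True if it was added."""
        if inchikey in self._by_inchikey:
            return False  # already known, nothing to register
        self._by_inchikey[inchikey] = smiles
        return True

    def __len__(self) -> int:
        return len(self._by_inchikey)


registry = CompoundRegistry()
# Benzene, registered once; the second attempt is recognized as a duplicate.
first = registry.register_if_unknown("UHOVQNZJYSORNB-UHFFFAOYSA-N", "c1ccccc1")
second = registry.register_if_unknown("UHOVQNZJYSORNB-UHFFFAOYSA-N", "C1=CC=CC=C1")
```

Keying on the InChIKey rather than the SMILES string means that two different SMILES spellings of the same compound are treated as one entry.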
Our software OC|image2structure provides a RESTful service for extracting chemical structures from images. More specifically, it encompasses a pipeline for
- classifying images (i.e. deciding if an image depicts chemical structures) and
- extracting chemical structures (via OSR, i.e. optical structure recognition) from them.
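The two pipeline steps above can be sketched in Python as follows. This is only an illustration of the control flow, not the OC|image2structure implementation: `process_image`, `StructureResult`, and the two placeholder callables are hypothetical names. In practice, `classify` would wrap the trained image classifier and `extract` would wrap an OSR tool such as OSRA.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StructureResult:
    image_path: str
    is_structure: bool
    smiles: Optional[str] = None  # set only if OSR returned a structure


def process_image(
    image_path: str,
    classify: Callable[[str], bool],
    extract: Callable[[str], Optional[str]],
) -> StructureResult:
    """Run classification first and OSR only on images classified as structures."""
    # Step 1: decide whether the image depicts a chemical structure.
    if not classify(image_path):
        return StructureResult(image_path, False)
    # Step 2: run optical structure recognition on the image.
    return StructureResult(image_path, True, extract(image_path))


# Placeholder callables for demonstration only (assumptions, not real models).
def demo_classify(path: str) -> bool:
    return "structure" in path


def demo_extract(path: str) -> Optional[str]:
    return "c1ccccc1"  # a fixed SMILES string standing in for real OSR output


result = process_image("fig_structure_01.png", demo_classify, demo_extract)
skipped = process_image("fig_plot_02.png", demo_classify, demo_extract)
```

Passing the classifier and the OSR step in as callables keeps the sketch independent of any particular model or OSR tool, and makes the point that the comparatively expensive OSR step runs only on images that pass the classifier.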
OC|image2structure is designed as a client-server solution. It can run either locally on a single machine or on a server that is accessible within a network.
If this is of interest to you, please contact us and we will send you more information on the OC|image2structure processing pipeline.