
OntoChem extracts U.S. Food and Drug Administration SPL files

23 February 2021

The “Structured Product Labeling” (SPL) files of the United States FDA are a valuable public resource for drugs on the market. The FDA registration system assigns a UNII (Unique Ingredient Identifier) number to each drug and its chemical structure information. UNII numbers are also used in several other databases, such as drug label collections and chemical databases like PubChem, DrugCentral and others. Drug INNs, trade names and other information can be added from the NIH's RxNorm.

OntoChem is now extracting chemical structure information from FDA SPL files. Small molecule structures are extracted in the form of MOL files and registered in OntoChem’s chemical registration system and database, where an ontology concept identifier (OCID) is assigned to each compound. Proteins and peptides are extracted as amino acid sequence data and registered with OCIDs in OntoChem’s BLAST server. The registered small molecules and sequences are made publicly available in Google’s BigQuery “SciWalker Open Data” dataset, which is updated daily. So far, OntoChem’s compound registry on BigQuery contains 134 million unique chemical structures extracted from patents and publications.

We already provide more than 12 million protein sequences, extracted mainly from USPTO, EPO and WIPO patents. FDA SPL is now the first non-patent source from which sequences are extracted automatically. Of particular interest are those SPL files that combine sequence data with canonical MOL data, for example for antibody-drug conjugates (ADCs) like Kadcyla, UNII:SE2KH7T06F.

OCIDs can be used for federated searches across different chemical databases via SQL queries on BigQuery, or they can be searched in OntoChem’s SciWalker system using established chemical and BLAST searches. Please see for reference.
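To illustrate what a federated search on a shared identifier looks like, here is a minimal sketch. It does not use BigQuery or the actual “SciWalker Open Data” schema: an in-memory SQLite database stands in for BigQuery, and the table names (`fda_spl`, `patents`), the column names and all values are hypothetical placeholders invented for this example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables keyed by OCID.

    Table names, columns and values are made up for illustration; they are
    not the real dataset schema and not real identifiers.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE fda_spl (ocid TEXT, unii TEXT);
        CREATE TABLE patents (ocid TEXT, patent_id TEXT);
        INSERT INTO fda_spl VALUES
            ('OCID-A', 'UNII-PLACEHOLDER-1'),
            ('OCID-B', 'UNII-PLACEHOLDER-2');
        INSERT INTO patents VALUES
            ('OCID-A', 'PATENT-1'),
            ('OCID-A', 'PATENT-2'),
            ('OCID-C', 'PATENT-3');
        """
    )
    return conn


# Join the two sources on the shared OCID column, so that every compound
# registered from an SPL file is linked to the patents mentioning it.
FEDERATED_QUERY = """
SELECT s.unii, p.patent_id
FROM fda_spl AS s
JOIN patents AS p ON p.ocid = s.ocid
ORDER BY p.patent_id
"""


def federated_search(conn):
    """Return (unii, patent_id) pairs for compounds present in both tables."""
    return conn.execute(FEDERATED_QUERY).fetchall()


if __name__ == "__main__":
    for unii, patent_id in federated_search(build_demo_db()):
        print(unii, patent_id)
```

The same join pattern should carry over to BigQuery SQL, provided the tables involved expose the OCID as a common column; the actual table and column names would have to be taken from the dataset itself.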

Felix Berthelmann