
ACS Spring Meeting talk about “Open Data Exchange” with Google and Collabra

13 April 2021

Have a look at the talk we gave together with Google and Collabra at the ACS Spring Meeting 2021, “Synergy through integration of data sources”.

Access to open-source data does not necessarily come for free, especially when the information is continually updated and cross-referenced to new content. What, then, does it cost to make open-source content widely accessible and useful?

The talk examines the challenges of sharing life science data on open platforms. Analyzing the cost of accessing open-source data starts with distinguishing between the casual user and the power user. Casual users want to analyze as many data resources as possible with minimal or no infrastructure overhead or cost. That currently limits them to small bulk datasets that fit on a laptop, or to web-based interfaces that retrieve data from a single system. Power users who analyze multiple large datasets have to build and maintain in-house data repositories on dedicated machines or clusters.

Exposing scientific information in a way that enables ad-hoc integration and analysis across many datasets has previously required a concerted team effort to normalize the data and collect it into a single database release. With a common storage and analysis platform and the right cost-sharing model, the scientific community can collaboratively expand and access seemingly unrelated datasets and pose interdisciplinary questions relevant to the advancement of science.

Examples of such cross-querying include identifying molecules in litigation, associating government agencies with funding for specific drugs and diseases, and tracking clinical trials against chemical structures in patents, grant applications and scientific documents. In semantic aggregation, tagging compounds according to their structure and properties allows knowledge from different data sources to be cross-correlated and fed into machine learning and prediction systems.
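To make the idea of cross-querying concrete, here is a minimal sketch of how records from two sources could be linked once compounds carry a shared tag. It is not the platform or method presented in the talk. The record layout, the `compound_key` field, the helper names `index_by_compound` and `cross_query`, and all identifiers are invented for illustration, and the sketch assumes each compound has already been assigned a normalized key derived from its structure.

```python
from collections import defaultdict

# Toy records; every identifier below is invented for illustration.
# "compound_key" stands in for a normalized, structure-derived compound tag.
patents = [
    {"patent_id": "PAT-1", "compound_key": "CMPD-001"},
    {"patent_id": "PAT-2", "compound_key": "CMPD-002"},
    {"patent_id": "PAT-3", "compound_key": "CMPD-001"},
]
trials = [
    {"trial_id": "TRIAL-9", "compound_key": "CMPD-001"},
    {"trial_id": "TRIAL-7", "compound_key": "CMPD-003"},
]


def index_by_compound(records, id_field):
    """Group record identifiers by the compound tag they mention."""
    index = defaultdict(list)
    for record in records:
        index[record["compound_key"]].append(record[id_field])
    return index


def cross_query(patent_records, trial_records):
    """Return, per compound, the patents and trials that both mention it."""
    patent_index = index_by_compound(patent_records, "patent_id")
    trial_index = index_by_compound(trial_records, "trial_id")
    # Only compounds present in both sources can be cross-correlated.
    shared = sorted(patent_index.keys() & trial_index.keys())
    return {
        key: {"patents": patent_index[key], "trials": trial_index[key]}
        for key in shared
    }


linked = cross_query(patents, trials)
print(linked)
```

The join only works because both sources use the same compound tag, which is the role the normalization step plays. On a shared platform the same lookup would more likely be a database join across hosted datasets than an in-memory dictionary.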

As shared data platforms are established, it’s important that the scientific community be realistic about what it costs to sustain and maintain high-quality data sources.

Presentation: ACS presentation.pdf (2.6 MB)

Felix Berthelmann