SN SciGraph is the largest Linked Open Data aggregation platform in the scholarly domain. It makes data that is machine-readable, interoperable and re-usable freely available to the global community. Launched in February 2017, SN SciGraph was designed to help researchers, scholars, editors, librarians, funders, and developers overcome the challenge of extracting and merging large datasets from multiple repositories, and in multiple formats.
With millions of Linked Open Data* connections on one platform, SN SciGraph speeds up content discovery and broadens the scope of research by exposing previously unseen patterns and presenting new perspectives. By linking Springer Nature publications to other data types such as grants, conferences, and freely available taxonomies, SN SciGraph helps academic institutions and commercial organisations conduct more accurate analysis to make more effective investment decisions. The team behind SN SciGraph is also working closely with developers at Hack Days and other interactive forums, encouraging the re-use of datasets to create new applications that support the scholarly community.
Henning Schoenenberger has a background in data validation, data standards and delivery, and is Director of Product Data & Metadata at Springer Nature. Markus Kaindl is Senior Manager for Semantic Data and has a background in computational linguistics. He is responsible for document enrichment projects with the aim of making content smarter and acts as business owner for SN SciGraph and SN Insights. Henning and Markus talk here about the origins of SN SciGraph; the importance of open research and analysis of open data; and how they imagine the future of scholarly data integration.
One of the main challenges for researchers, developers, librarians and anyone else dealing with large volumes of scholarly data is that they regularly need to pull publications data from multiple repositories and combine it themselves. This means locating and extracting data in different formats (HTML, PDF, XML, CSV etc.) and from numerous sources (databases, APIs, zip files etc.), all of which can be cumbersome, time-consuming and error-prone work. Part of the reason this is still a common challenge is that many of the tools that overcome data silos are very costly. With SN SciGraph we wanted to help data experts overcome this obstacle, and we’re working on making it easier for those less specialised in data extraction and analysis to get to that same point.
The SN SciGraph project began after the Springer Nature merger, when we uncovered a joint desire to bring data silos together in a meaningful way for our colleagues and customers. Our goal was to increase the visibility of vast publications datasets by combining and linking information from multiple sources, on one open access platform. This would significantly speed up discovery via library catalogues and our content platforms, including SpringerLink and Nature.com. There was already a Linked Open Data pilot (focused on proceedings in Computer Science) running on our conference portal, and a similar prototype on Nature.com. Together they formed the foundation for SN SciGraph. We’ve also responded to significantly growing interest from the library community in the power of Linked Open Data to facilitate discovery. An illustration of this is the recent joint initiative led by Stanford University library to ‘integrate library data into the Web, in a semantic way, so it can be discovered intelligently in Web searches as well as in a library’s catalogue’ (Stanford Libraries, 2018).
"Ultimately, SN SciGraph was inspired by the Linked Open Data and Library Communities, who are the real evangelists of this important and growing movement."
- Henning Schoenenberger
We’re proud of our status as a publisher of first-class research with the largest global book portfolio and second largest journals repository, but we also want to focus our efforts on really getting into this content and aggregating its associated data in a machine-readable form. Linked Open Data represents one of the most promising technologies for doing just that. Our vision is to become a truly smart, data-driven organisation and optimise the knowledge we’re responsible for to advance scientific progress. The data in SN SciGraph is publicly available, so anyone can interact with it, and we’re already seeing this have a positive impact on the open research movement.
We were an early adopter of open access and open research, even before pressure from the market grew. Today, there’s huge market demand for publishers to have processes in place that really support open research and open data. Because we started down this path early, we now have the facilities to help authors make their work more discoverable by enriching it with metadata. We’re also helping researchers make new discoveries and inspiring new applications for scientific advancement. Our intention is to be bold in the open access, open research and open data movements – we want to harness and refine Linked Open Data as an effective enabler of open research.
"We help authors to manage, publish and describe their open data, but we also try to make this information smarter, more connected and re-usable in the web of data."
- Henning Schoenenberger
One early outcome of SN SciGraph is that we’ve been increasingly able to engage with groups such as the research data community. By collaborating with partners such as DBpedia, the Knowledge Media Institute, and SemSpect, we’re able to become more open, responsive and forward-looking.
We’re seeing real demand for greater transparency of the data that publishers harvest from different sources. And as a publisher, we are so much more than a content pipeline. There’s growing potential for us to support and further the open research movement and we want to be avant-garde in this area:
"We’re encouraging industry partners to build applications on top of the linked data in SN SciGraph. Publishers need to collaborate with each other and with technology providers to make more valuable tools for the research community, and really deliver what it expects today."
- Markus Kaindl
Publishing Linked Open Data has given us another channel (in addition to existing standards such as MARC records for libraries and ONIX feeds for retailers) for distributing our metadata openly on the web in machine-readable and human-digestible ways. Librarians, researchers and analysts can download our offering for free and use it in their applications without any additional charge. They can also use SN SciGraph’s data explorer to browse and examine the data connections we make available in the platform. We provide programmatic access to SN SciGraph data via an API as well, so there are lots of ways for individuals and organisations to access this information.
By adding millions of data connections to the Linked Open Data cloud, we’re not only making our publications more discoverable to the research community, we’re also helping them find more valuable information about authors, affiliated institutions and even individual research projects.
"By being compliant with Schema.org - a standard that’s actively promoted by search engine giants like Google, Microsoft and Yandex - we make it even easier for researchers, institutions and commercial organisations to work with our data because we’re following a simple data model that they are already familiar with."
- Markus Kaindl
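To give a sense of what a Schema.org-style description looks like in practice, here is a minimal sketch of an article record written as JSON-LD. The types and properties (`ScholarlyArticle`, `Person`, `author`, `name`) are real Schema.org terms, but the identifiers and values are invented for illustration, and the actual SN SciGraph records may be structured differently.

```python
import json

# Illustrative record only: the example.org identifiers and the values are
# invented, and the real SN SciGraph data model may differ.
article = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "@id": "http://example.org/article/1",
    "name": "An example article",
    "author": {
        "@type": "Person",
        "@id": "http://example.org/person/42",
        "name": "A. Researcher",
    },
}

# Serialise to JSON-LD text and read it back, as a consuming application might.
serialized = json.dumps(article, indent=2)
restored = json.loads(serialized)
print(serialized)
```

Because the vocabulary is shared, a developer who has worked with Schema.org markup elsewhere can read a record like this without learning a publisher-specific schema first.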
We’ve worked with several specialist partners to develop SN SciGraph. Digital Science has supported us with both the data to build the platform and expertise from senior data scientists. And the semantic technology developer Ontotext provided us with their graph database GraphDB. Our long-term vision is to ingest data from other providers including publishers, funders and conference organisers into SN SciGraph, to increase the potential for generating new knowledge from links in a continually expanding scholarly data universe.
We’ve got a way to go with AI, but we’re laying strong foundations. The potential for harnessing machine learning to generate new insights from millions of existing connections within the scholarly knowledge graph is difficult to overestimate. Think of a dataset that connects proteins, genes, diseases, publications, targets, methods, clinical trials and drugs. If you took millions of those data points across several levels of connectedness, the human mind couldn’t easily process that level of information to derive new knowledge. Artificial intelligence will uncover brilliant new patterns that we weren’t aware of by connecting enough data in a meaningful way.
That’s the ultimate vision for SN SciGraph and we’re on the path to achieving it. The first Hack Day we held in London last year invited developers to start using Linked Open Data to see what kind of solutions they could generate. Using data from SN SciGraph and other freely available LOD offerings, they developed a tool that can help authors get published in the optimal journal and a smart peer-review assignment system, both in a very short timeframe. So, we’re already identifying lots of potential and we’re really just getting started.
Our first priority is to release data more quickly through automation. We also want to simplify the data model to encourage re-use by the wider community. After that, the next logical step is to make our Linked Data offering pan-publisher by releasing core metadata and connections to and from other publications outside of Springer Nature - starting with citations and references. We also want to add more data to SN SciGraph so that it continues to grow organically as it has in the past, and improve the API to make it more powerful and easier for developers to use.
Encouraging other publishers to add their data to SN SciGraph or to a publisher-independent platform underpins our whole mission. If publishers are reluctant to add their data to a Springer Nature platform, we should all add it to a global repository of Linked Open Data.
"For me, the future of data integration means one day having the ultimate scientific knowledge graph that sits in the linked data cloud and is open and re-usable for everyone."
- Markus Kaindl
To that end, we’ll continue to reach out to other publishers to convince them of our mission and the benefits this will have for advancing discovery and solving the grand challenges of humankind. We believe that with SN SciGraph we’ve made the first step in that direction, but there’s still a way to go.
* The metadata contained in SN SciGraph is encoded in RDF (Resource Description Framework) and constructed as ‘Triples’. Each triple consists of a subject, a predicate and an object, which together state a single fact. Millions of triples, connected through shared URIs (Uniform Resource Identifiers), form a graph.
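The triple structure described in this note can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how SN SciGraph stores or serves its data: the `example.org` URIs are hypothetical, the helper names `build_graph` and `objects` are made up for this sketch, and the predicates are borrowed from Schema.org purely as familiar examples.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: a subject, a predicate and an object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs for illustration; not real SN SciGraph identifiers.
ARTICLE = "http://example.org/article/1"
AUTHOR = "http://example.org/person/42"
GRANT = "http://example.org/grant/7"

triples = [
    Triple(ARTICLE, "http://schema.org/author", AUTHOR),
    Triple(ARTICLE, "http://schema.org/name", "An example article"),
    Triple(GRANT, "http://schema.org/fundedItem", ARTICLE),
]


def build_graph(facts):
    """Index triples by subject; shared URIs are what link facts together."""
    graph = defaultdict(list)
    for fact in facts:
        graph[fact.subject].append((fact.predicate, fact.obj))
    return graph


def objects(facts, subject, predicate):
    """Return every object stated for a given subject and predicate."""
    return [f.obj for f in facts if f.subject == subject and f.predicate == predicate]


graph = build_graph(triples)
authors = objects(triples, ARTICLE, "http://schema.org/author")
funded = objects(triples, GRANT, "http://schema.org/fundedItem")
```

Here the article URI appears as the subject of two triples and as the object of a third, so following it connects a grant to an article and the article to its author. In practice a dedicated RDF library or graph database would replace these hand-rolled helpers.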