Text and Data Mining: Uncovering Hidden Data Points and Powering New Discoveries

By: Guest contributor, Tue Aug 9 2022


The path to innovation requires the systematic analysis of millions of documents. But completing this process manually takes considerable time and effort. Text and data mining (TDM) enables researchers to speed up and enhance this work, allowing them to make new discoveries faster. In this blog, we look at how TDM works, what it means for librarians, and what Springer Nature is doing to enable it.

The digital age has given us unprecedented access to information. Researchers can now obtain far more research in their subject areas than ever before. On the one hand, this is exciting, providing opportunities to make new discoveries by building on the wealth of existing research. But on the other hand, it presents an overwhelming challenge – trying to analyse findings from the millions of academic articles published every year.

Even within niche subject areas, the sheer volume of papers, pre-prints, and data published is far too great for an individual researcher to stay abreast of. Yet, within this wealth of research could lie the answers to some of our biggest societal challenges. So how can researchers best use the information available to them?

While there are many options, one of the most promising areas being explored to make new discoveries and identify important patterns is text and data mining (TDM). TDM was the subject of a recent webinar presented by Springer Nature’s Director for Data Solutions, Dr. Prathik Roy. Dr. Roy described in detail how TDM is being used in the research community and what Springer Nature is doing to support it.

What is TDM?

First, it’s worth taking a minute to explain exactly what TDM is. In short, it’s an automated process of selecting and analysing large amounts of text or data resources for purposes such as searching, finding patterns, discovering relationships, semantic analysis and more. This is done in a way that can provide valuable information needed for studies and further research.

The goal of TDM is to filter through information, identify pieces of data, and find the relationships and patterns among them. What is revolutionary is the ability of researchers to explore a dataset without knowing what specific questions to ask. Essentially, AI is now maturing from a role where it simply surfaces information to one where it can make recommendations and decisions, as well as generate content.

“Essentially what tends to happen is that these machine learning or AI algorithms go through the full text of articles and are able to classify the various aspects of each article,” explained Dr. Roy. “For instance, it will ask questions like, is it talking about a gene? Is it talking about a specific disease? Or is it talking about specific symptoms? And then it’s able to cluster the articles based on this.”
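To make the classify-then-cluster idea concrete, here is a minimal Python sketch. It assumes tiny hand-written vocabularies (the `VOCAB` dictionary and the sample articles are hypothetical) and simple phrase matching; the machine learning algorithms Dr. Roy describes would use trained models instead of lookups.

```python
import re
from collections import defaultdict

# Hypothetical vocabularies for illustration only; a real pipeline would
# rely on a trained entity recogniser rather than fixed word lists.
VOCAB = {
    "gene": {"brca1", "tp53"},
    "disease": {"lung cancer", "influenza"},
    "symptom": {"cough", "fever"},
}

def tag_article(text):
    """Return {category: set of matched terms} for one article."""
    lowered = text.lower()
    tags = {}
    for category, terms in VOCAB.items():
        found = {
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)
        }
        if found:
            tags[category] = found
    return tags

def group_by_term(articles, category):
    """Group article ids by the terms of one category that they mention."""
    groups = defaultdict(list)
    for article_id, text in articles.items():
        for term in sorted(tag_article(text).get(category, ())):
            groups[term].append(article_id)
    return dict(groups)

# Made-up example sentences standing in for full-text articles.
articles = {
    "a1": "Persistent cough is an early sign of lung cancer.",
    "a2": "TP53 mutations are common in lung cancer.",
    "a3": "Fever and cough are typical of influenza.",
}
print(group_by_term(articles, "disease"))
```

Grouping by "disease" puts the first two articles together under lung cancer and the third under influenza, which is the kind of clustering described in the quote, at toy scale.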

Once the algorithm has categorised articles in this way, it can then score the relationship between two types of categories. For instance, it could be used to assess the relationship between symptoms and a specific disease, by analysing how often that symptom is mentioned in relation to a disease. A high score – where there is a clear correlation between mentions of the symptom and mentions of the disease – could help identify the best drug to treat that disease. And this is just one example. TDM has a variety of uses across all fields.
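One simple way to score such a relationship is to count how often two terms appear in the same document. The sketch below uses a Jaccard-style overlap score, which is our choice for illustration rather than a method named in the webinar; the function name and sample sentences are hypothetical.

```python
def cooccurrence_score(docs, symptom, disease):
    """Documents mentioning both terms divided by documents mentioning either.

    Uses plain substring matching, so it is a rough illustration only.
    """
    symptom, disease = symptom.lower(), disease.lower()
    both = 0
    either = 0
    for text in docs:
        lowered = text.lower()
        has_symptom = symptom in lowered
        has_disease = disease in lowered
        if has_symptom and has_disease:
            both += 1
        if has_symptom or has_disease:
            either += 1
    return both / either if either else 0.0

# Made-up sentences standing in for article text.
docs = [
    "Cough and fever are common in influenza.",
    "Influenza often presents with fever.",
    "Fever was not observed in the asthma cohort.",
    "Asthma patients reported a chronic cough.",
]
print(round(cooccurrence_score(docs, "fever", "influenza"), 2))
print(round(cooccurrence_score(docs, "cough", "asthma"), 2))
```

A score like this only shows that two terms tend to appear together. It says nothing about cause, so in practice it is a way to shortlist relationships worth a closer look.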

Discoverability and pattern discernment

While TDM has a whole range of use cases, two of the most important right now are ‘discoverability’ and ‘pattern discernment’, as Dr. Roy described during the webinar.

The ultimate goal of discoverability, according to Dr. Roy, is to “match what you're looking for and then eliminate any irrelevant material from this discovery process.” It should mean that when you’re searching for particular keywords or phrases, only highly relevant articles are delivered back in that search.

For example, say you were searching for articles that showed a link between carcinogens from tobacco and a specific type of cancer such as lung cancer. A ‘traditional’ search could deliver you any number of articles that mention carcinogens, tobacco and/or lung cancer. Using TDM techniques, however, you could retrieve only those where specific carcinogens have an effect on the lungs.
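The difference between the two kinds of search can be sketched in a few lines of Python. This toy filter keeps an article only when a carcinogen and a lung-related term occur in the same sentence, instead of anywhere in the text. The term lists, function names and sample articles are all hypothetical, and real TDM tools use far richer language analysis than same-sentence matching.

```python
import re

# Hypothetical term lists for illustration.
CARCINOGENS = {"benzo[a]pyrene", "nnk", "formaldehyde"}
LUNG_TERMS = {"lung", "lungs", "pulmonary"}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def links_carcinogen_to_lung(text):
    """True if any single sentence names both a carcinogen and a lung term."""
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z0-9\[\]-]+", sentence.lower()))
        if tokens & CARCINOGENS and tokens & LUNG_TERMS:
            return True
    return False

# Made-up article snippets.
articles = {
    "keyword-only": "Tobacco smoke contains formaldehyde. Lung cancer rates vary by region.",
    "linked": "NNK induces tumours in lung tissue of exposed mice.",
}
hits = [name for name, text in articles.items() if links_carcinogen_to_lung(text)]
print(hits)
```

A plain keyword search would return both snippets, because each mentions a carcinogen and the lungs somewhere. The sentence-level filter keeps only the one that connects the two.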

The goal of pattern discernment, meanwhile, is to find patterns and trends across a dataset. The outcome of this will be hypotheses and predictions of likely prospects for therapy, material design, or strategy, as opposed to articles. For example, this technique could be used to match the biochemical properties of molecules to a viral protein's properties in order to identify a molecule likely to bind to the virus.

There are already many examples of where TDM can make (or has already made) a significant impact in speeding up research discoveries and making the previously impossible possible. Dr. Roy touched on just a few of these in his presentation.

There’s no doubt that there is huge potential for the future of TDM and what it could do to power new and innovative research.

What does this mean for librarians and information professionals?

As information professionals, knowledge workers and librarians, you have a long familiarity with managing and searching within large sets of information. It’s likely you’re responsible for evaluating and managing subscriptions to value-added online services, identifying and acquiring specialized datasets for researchers, and managing internal resources and collections and making them discoverable.

This knowledge means you can bring a unique perspective to TDM projects – after all, you understand how information is used within your organizations, and you know how to make that information more discoverable and hence more valuable.

The value of TDM depends on knowing what sources to include, what kinds of connections to monitor and what types of metadata are necessary for a particular project. Again, librarians and info professionals bring the ability to ask the right questions, which enables them to see the larger context and identify the specific sets of information that would provide the richest insights.

For lots more insight on this topic, take a look at our whitepaper on TDM for librarians and information professionals.

Springer Nature’s TDM tools

As the volume of scientific publications increases and TDM software tools improve, Springer Nature has created a formalized process to enable TDM, with the aim of making it as simple as possible for researchers.

A growing number of Springer Nature’s journal articles are published open access. TDM is usually allowed without restrictions on these publications since the majority of Springer Nature open access content is licensed under CC-BY.

Dr. Roy concluded his webinar presentation by giving an overview of the various tools Springer Nature has created to facilitate TDM of our content. The key ones you need to be aware of are:

  • Metadata API: Metadata and abstracts for online documents (journal articles, book chapters, protocols, etc.)
  • Meta API: New versioned metadata for online documents with additional fields and links to source content.
  • Fulltext API for Open Access content: Fulltext content (where available) for Springer Nature Open Access XML
  • Fulltext API for Open Access and pay-walled content (under license): Fulltext content (where available) for all Springer Nature XML
  • Journal header data API: "journal-level" API that provides XML based on the Journal ID
  • Citations API
  • SN SciGraph APIs: Linked Data API (using SciGraph URLs) or Redirect API (using common identifiers such as DOIs).

You can access all the APIs mentioned above on our API portal. Springer Nature is also participating in the Crossref TDM working group and we recommend Crossref services for pan-publisher TDM.
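As a rough idea of what a request to one of these APIs might look like, the Python sketch below builds a query URL without sending it. The host, the `metadata/json` path and the parameter names (`q`, `p`, `s`, `api_key`) are assumptions based on common usage and should be checked against the API portal; `build_query_url` and `YOUR_API_KEY` are placeholders of our own.

```python
from urllib.parse import urlencode

# Assumed host; confirm the current address on the API portal.
BASE_URL = "https://api.springernature.com"

def build_query_url(endpoint, query, api_key, page_size=10, start=1):
    """Build a query URL (no network call is made).

    The parameter names q, p, s and api_key are assumptions, not confirmed
    by this article.
    """
    params = {"q": query, "p": page_size, "s": start, "api_key": api_key}
    return f"{BASE_URL}/{endpoint.strip('/')}?{urlencode(params)}"

url = build_query_url("metadata/json", 'keyword:"text mining"', "YOUR_API_KEY")
print(url)
```

With a valid key, the resulting URL could then be fetched with any HTTP client and the response parsed for further mining.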

Helpful resources

Interested in finding out more about text and data mining? Here are some useful links:

And don’t forget, you can also watch the webinar with Dr. Roy and download the presentation slides.


Author: Guest contributor

Guest Contributors include Springer Nature staff and authors, industry experts, society partners, and many others. If you are interested in being a Guest Contributor, please contact us via email: thesource@springernature.com.