White Paper

Bringing Insight to Data: Info Pros’ Role in Text and Data Mining

Information professionals, knowledge workers and librarians have a long familiarity with managing and searching within large sets of information. They are often responsible for evaluating and managing subscriptions to value-added online services such as Springer Nature; they identify and acquire specialized datasets for researchers; and they manage and make discoverable internal resources and collections.

What has revolutionized how info pros look at the research landscape is the development of sophisticated tools for text and data mining (TDM) of large data sets. Info pros bring a unique perspective to TDM projects—they understand how information is used within their organizations, and they know how to make that information more discoverable and hence more valuable. 

The goal of text and data mining is to filter through information, identify pieces of data, and find the relationships and patterns among them. What is revolutionary is the ability of researchers to explore a dataset without knowing what specific questions to ask. 

TDM Basics

A TDM project usually starts with a large corpus of data, such as a bibliographic database of citations and abstracts of research articles, or an authoritative reference source. A TDM tool analyzes each record and extracts individual pieces of information in a structured format. These individual information units, called “semantic triples”, consist of three elements that reflect a piece of knowledge, in the format subject – predicate – object. A fact such as The sky is blue could be represented by the triple the_sky – has_the_color – blue. Similarly, semantic triples generated from a bibliographic record might include this_article – has_the_author – John_Doe and John_Doe – is_affiliated_with – Drexel_University. As simple as this looks, it can be transformative when applied to all the types of information in a dataset: a book chapter, an organization profile, an author, and so on.
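
The triple format described above can be sketched in a few lines of Python. This is a minimal illustration rather than the output of any particular TDM tool: the `Triple` type and the helper functions are hypothetical names, and the sample data simply restates the examples in the text.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One unit of knowledge in subject - predicate - object form."""
    subject: str
    predicate: str
    obj: str


# Sample triples restating the examples in the text.
TRIPLES = [
    Triple("the_sky", "has_the_color", "blue"),
    Triple("this_article", "has_the_author", "John_Doe"),
    Triple("John_Doe", "is_affiliated_with", "Drexel_University"),
]


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def author_affiliations(triples, article):
    """Chain two triples: article -> author -> affiliation."""
    return [
        affiliation
        for author in objects_of(triples, article, "has_the_author")
        for affiliation in objects_of(triples, author, "is_affiliated_with")
    ]


print(author_affiliations(TRIPLES, "this_article"))
```

Because every fact has the same three-part shape, a question the original record never answered directly (which institutions are behind this article?) can be answered by following one triple to the next.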

The Google Books Ngram Viewer (books.google.com/ngrams) demonstrates the power of TDM when applied to the full text of books. This Google project analyzed the digitized content of millions of books, parsing each word and sentence. For example, each word in the sentence The school nurse treated the boy is analyzed for meaning and relationship – the word “school” is an adjective modifying the noun “nurse”, and the subject “nurse” is conducting the action of “treated” to the object “boy”. We can query the Ngram Viewer for instances in which the word “nurse” is used as a noun (rather than a verb, as in nursing someone back to health) and is modified by an adjective. Note that it is not necessary to specify what words were used as adjectives modifying the word “nurse”; the query retrieves and sorts all the adjectives to identify the ten most frequent phrases and their relative frequency over time. Figure 1 shows that use of the phrase “head nurse” peaked during the 1940s and 1950s.

Figure 1: Google Books Ngram Viewer search result

The Google Books Ngram Viewer provides insight into books, but TDM projects can also involve multiple databases or data collections. For example, an info pro may want to improve the researchers’ discovery process with an API that takes a query, identifies any search terms that reference medical concepts, and then looks up each concept in the US National Library of Medicine’s Medical Subject Headings (MeSH) hierarchical thesaurus. The original query could be expanded to include not only the MeSH descriptor for that medical concept but also all the descriptors that fall within that concept. A search for opioid dependence, for example, would be expanded to include the MeSH descriptors Opioid-Related Disorders [C25.775.675], Heroin Dependence [C25.775.675.400], Morphine Dependence [C25.775.675.600], and Opium Dependence [C25.775.675.800].
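
The expansion step described above can be sketched as follows. This is an illustration under stated assumptions, not the National Library of Medicine's API: the `MESH_TREE` table is a hand-copied slice containing only the four descriptors named in the text, and `TERM_TO_DESCRIPTOR`, `expand_query` and `to_boolean_query` are hypothetical names. The sketch relies on the fact that a narrower descriptor's tree number begins with the tree number of its parent.

```python
# A tiny slice of MeSH copied from the text; a real project would load
# the full thesaurus instead.
MESH_TREE = {
    "Opioid-Related Disorders": "C25.775.675",
    "Heroin Dependence": "C25.775.675.400",
    "Morphine Dependence": "C25.775.675.600",
    "Opium Dependence": "C25.775.675.800",
}

# Hypothetical mapping from a user's search phrase to a MeSH descriptor.
TERM_TO_DESCRIPTOR = {"opioid dependence": "Opioid-Related Disorders"}


def expand_query(query):
    """Return the matched descriptor plus every descriptor beneath it."""
    descriptor = TERM_TO_DESCRIPTOR.get(query.strip().lower())
    if descriptor is None:
        return [query]  # no medical concept recognized; leave query as-is
    root = MESH_TREE[descriptor]
    matches = [
        name for name, number in MESH_TREE.items()
        if number == root or number.startswith(root + ".")
    ]
    # Sort by tree number so the broadest descriptor comes first.
    return sorted(matches, key=MESH_TREE.get)


def to_boolean_query(descriptors):
    """Join descriptors into a single OR query string."""
    return " OR ".join(f'"{d}"' for d in descriptors)


print(to_boolean_query(expand_query("opioid dependence")))
```

Matching on `root + "."` rather than on `root` alone avoids accidentally picking up a sibling branch whose number merely starts with the same digits.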

Today, info pros face the problem of having more digitized information available than can be contained in any database, and their concern is how to make sense of the knowledge buried within all that information. As it turns out, access to the full text of an article or report isn’t necessarily the richest form of content anymore. Now, what’s important is not just the information within individual records but the relationships among those pieces of information. An information scientist at a large pharmaceutical company described how TDM enabled her to find relationships based not only on abstracts but also on the full text, and noted that the value of TDM grows with full-text availability. “TDM enables us to do more complex searches using a large number of synonyms through ontologies, where regular search systems reach their limitations. Moreover, we can extract the relevant passages and information for a certain question from huge amounts of textual literature or patent information instead of delivering only a list of hit documents.”

Mining Content with APIs

Info pros have used APIs for years to facilitate and streamline access to information. With the expansion of open access content and smart data annotation, info pros have even more options for enhancing information discovery. Springer Nature, for example, has created APIs that enable info pros and researchers to draw new insights from scientific, technical and medical (STM) content from both Springer Nature and its partners in the scholarly domain (see dev.springernature.com for more information on its APIs). A simple use of an API with linked open data is to generate the number of citations to a given article based on its DOI, a meaningful piece of information both for researchers as they read through retrieved material and for decision-makers as they evaluate the impact of their organization’s research. The Springer Nature Journal Suggester (journalsuggester.springer.com) helps researchers identify the journals best suited for a particular manuscript; an author provides a title and abstract and the API returns a ranked list of recommended journals, along with impact factor and acceptance rate for each title. Taking advantage of Springer Nature’s addition of chemical compound annotations to its content, an API could take a researcher’s query and expand the search to retrieve content with any synonym of that compound.
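
The citation-count use case can be sketched without reference to any specific service. The example below does not call Springer Nature's APIs; it assumes the citation links have already been retrieved as pairs of DOIs. The DOIs are placeholders rather than real articles, and `citation_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical citation links as (citing DOI, cited DOI) pairs.
# These DOIs are placeholders, not real articles.
CITATION_LINKS = [
    ("10.1000/example.2", "10.1000/example.1"),
    ("10.1000/example.3", "10.1000/example.1"),
    ("10.1000/example.3", "10.1000/example.2"),
    ("10.1000/EXAMPLE.3", "10.1000/example.1"),  # same link, different case
]


def citation_counts(links):
    """Count the distinct citing articles for each cited DOI."""
    # DOIs are case-insensitive, so normalize before removing duplicates.
    unique_links = {(citing.lower(), cited.lower()) for citing, cited in links}
    return Counter(cited for _citing, cited in unique_links)


counts = citation_counts(CITATION_LINKS)
print(counts["10.1000/example.1"])
```

Normalizing and de-duplicating the links first matters when they are harvested from more than one source, since the same citation can otherwise be counted twice.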

Competitive intelligence professionals are also using APIs to retrieve hidden information. They can monitor published papers and conference proceedings for news on competitors or new players in their field as well as developments in adjacent markets. APIs can even tap into insights from job posting sites such as Glassdoor.com to identify the top skills or professions that a competitor is recruiting.

Role and Skills of Info Pros

An effective TDM project is like a really smart, well-connected researcher. Imagine what she brings to each new project:

  • She regularly monitors professional journals, conference proceedings, books, videos, webinars, reports, patents and other material in her field.
  • She participates in professional conferences, where she meets other researchers in related fields and learns about their current projects.
  • She collaborates with colleagues to publish her findings in peer-reviewed journals.
  • She monitors organizations and funding sources in her field and analyzes grant patterns.

Because she is familiar with information from a wide range of sources, she can see trends and relationships among concepts that would not be obvious to the casual observer. Perhaps she knows to watch for new developments from South Korea, based on a conference presentation she heard and a recent uptick she noticed in grants to universities there. She probably has an internal taxonomy of all the topics she follows, so she intuitively sees connections between related concepts.

Now imagine exponentially expanding that researcher’s perspective to include all the information available in her field. And imagine a similar superhuman researcher for every imaginable field of inquiry. That is what text and data mining initiatives offer to an organization and, as with any information or knowledge management project, information professionals can play a key role. 

As researchers bring more data analytics skills to the table, and as more information—both free and in subscription services—is available, there is a greater need for information professionals who understand how to find, enhance, manage and preserve information, particularly in the arena of text and data mining.

The value of TDM depends on knowing what sources to include, what kinds of connections to monitor and what types of metadata are necessary for a particular project. Info pros bring the ability to ask the right questions, which enables them to see the larger context and identify the specific sets of information that would provide the richest insights. Info pros know which resources to use, weighing the limitations, restrictions and cost of each source. They understand how researchers use information—their approach to a problem, their information-seeking behavior, and what they do with the information next. Info pros know that their clients aren’t interested in which specialized search terms to use or how to harmonize data from multiple sources; info pros build portals and APIs to help their clients get from question to insight as quickly as possible.

Info pros know what data sources to look at: government agency data sets and open-data initiatives; collaborative repositories of data underlying scientific publications, such as Dryad (datadryad.org) and ICPSR (www.icpsr.umich.edu/icpsrweb/index.jsp); and commercial services such as Springer Nature.

One of the underappreciated skills of info pros is what was once called the “reference interview” and is now more properly called an information-needs interview. Before an info pro can connect a researcher with the right TDM tools, the right data sets and the right approaches to find meaningful insights from the information, they have to understand what the researcher’s underlying needs are, including the aspects the researcher may not even think to ask about. Info pros are also accustomed to dealing with questions that don’t have easy answers. They know that solving a client’s problem often means pulling material from a variety of sources, collaborating with other groups and figuring out how else they could get to the answer.

Scott Attenborough, TDM industry observer and owner of Content Capital LLC, commented, “Sure, info pros have the skills to create the right queries or build the hierarchies, but the real fun is learning the business of the person you are working with. Info pros’ clients often don’t even know what questions to ask, so our job is to understand each client’s use case and then create the right tool to help them understand something important to them: who’s working on what molecule, or how this company is working on that disease.”

However, info pros’ familiarity with a wide range of information sources can sometimes get in their way. They are accustomed to searching bibliographic databases, combing through millions of articles, and conducting more and more focused searches until they retrieve a manageable number of articles for their researcher. TDM projects, on the other hand, involve searching for patterns and unexpected insights while looking at how pieces of information fit together. Searchers do not necessarily know what they will find when they start their research, and the “answer” will as often be a series of graphics as a collection of articles.

A problem familiar to any online searcher is the difficulty of finding relevant material on a topic that is not consistently indexed. A pharmaceutical compound may be referenced differently based on the writer’s country, language or custom; on top of that, the subject indexing may not include all of its component parts. A disease may be known by different names—what is called amyotrophic lateral sclerosis or Lou Gehrig’s disease in the U.S. is known as Motor Neuron Disease in the U.K. The same word may have different meanings depending on the context—hearing aids and AIDS (Acquired Immunodeficiency Syndrome), for example. Articles about specific cancers—retinoblastoma or Kahler’s disease—may not mention the word cancer. When multiple datasets are being searched simultaneously, the problem with inconsistent terminology becomes even greater.

TDM projects can address this problem head-on by bringing in authoritative datasets that make disparate information findable by linking all versions of a concept to a single authoritative entry. Take DBpedia, for example. DBpedia (wiki.dbpedia.org) is a crowd-sourced open data project that is creating a semantic knowledge graph based on trusted information. Structured data is extracted from Wikipedia and assembled into a dataset in a consistent, searchable format. Content providers as varied as Springer Nature, Eurostat and the BBC can integrate backlinks from DBpedia to their content, increasing the discoverability of their content and enabling researchers to identify new insights from their data.
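
The idea of linking every version of a concept to a single authoritative entry can be illustrated with a small lookup table. This sketch does not use DBpedia or any real authority file: the `AUTHORITY` table holds only the disease names mentioned earlier in this paper, the records are invented for illustration, and `normalize` and `search` are hypothetical helpers.

```python
# Hypothetical authority file: one canonical entry, several variant names.
AUTHORITY = {
    "amyotrophic lateral sclerosis": [
        "amyotrophic lateral sclerosis",
        "Lou Gehrig's disease",
        "motor neuron disease",
    ],
}

# Invert the table so that any variant points to its canonical entry.
VARIANT_TO_CANONICAL = {
    variant.casefold(): canonical
    for canonical, variants in AUTHORITY.items()
    for variant in variants
}


def normalize(term):
    """Return the canonical entry for a term, or None if it is unknown."""
    return VARIANT_TO_CANONICAL.get(term.strip().casefold())


def search(records, term):
    """Find records indexed under any variant of the searched concept."""
    target = normalize(term)
    if target is None:
        return []
    return [
        record["title"] for record in records
        if target in {normalize(t) for t in record["terms"]}
    ]


# Invented records, each indexed under a different name for the disease.
RECORDS = [
    {"title": "Record A", "terms": ["Lou Gehrig's disease"]},
    {"title": "Record B", "terms": ["Motor Neuron Disease"]},
    {"title": "Record C", "terms": ["retinoblastoma"]},
]

print(search(RECORDS, "amyotrophic lateral sclerosis"))
```

A search on any one of the variant names retrieves both Record A and Record B, even though neither record uses the searcher's wording.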

As an information scientist at a large pharmaceutical company noted, “with TDM we are able to find more reliably and precisely numeric/quantitative information (e.g., dosages) and can even extract them as metadata. The same is true for the extraction of other parameters. Ontologies are of great use to discriminate the context for results/extracted values.” 

TDM Project Examples

One of the biggest challenges when designing a TDM project is to be as expansive as possible when considering what resources and which data elements to include. The following are a few examples of the wide range of use cases for text and data mining technologies.

  • An info pro may want to identify predatory publishers and conferences—organizations that purport to provide true editorial support and peer review and that charge authors or speakers fees without providing the promised services. Problematic practices might be detected by charting the number of different organizations represented by conference speakers and comparing that year to year and with known reliable conferences, or by comparing citation and reference metrics of a suspect journal’s authors or conference speakers to those of other authors or speakers.
  • To enhance information exploration for internal staff, an API could be developed for researchers who are reading an article of interest and want to learn more. The API could take the article’s DOI (digital object identifier) and show a graphic mapping the reader to articles on the topic written by other employees, links to relevant datasets, information on research grants on this topic, links to conferences at which the author has presented, and so on.
  • A university library could develop a curated collection of relevant datasets and research data for use by its researchers. By combining data elements from a repository of the data underlying scientific and scholarly publications such as Dryad and from a bibliographic database of published material such as Springer Nature, info pros could monitor the literature for available datasets on research topics of particular interest to their users. Related applications could include monitoring any further references of a dataset and tracking all references to datasets by university researchers.
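
The first use case above, charting the number of different organizations represented by conference speakers, can be sketched as follows. All of the data is made up for illustration, the function names are hypothetical, and the 0.5 threshold is an arbitrary assumption; a low count is a prompt for closer review, not proof of a problem.

```python
from collections import defaultdict

# Made-up speaker records: (conference, year, speaker's organization).
SPEAKERS = [
    ("Reliable Conf", 2016, "Org A"),
    ("Reliable Conf", 2016, "Org B"),
    ("Reliable Conf", 2016, "Org C"),
    ("Reliable Conf", 2017, "Org A"),
    ("Reliable Conf", 2017, "Org B"),
    ("Reliable Conf", 2017, "Org C"),
    ("Reliable Conf", 2017, "Org D"),
    ("Suspect Conf", 2016, "Org X"),
    ("Suspect Conf", 2016, "Org Y"),
    ("Suspect Conf", 2017, "Org X"),
    ("Suspect Conf", 2017, "Org X"),
    ("Suspect Conf", 2017, "Org X"),
]


def organization_diversity(speakers):
    """Count distinct speaker organizations per (conference, year)."""
    organizations = defaultdict(set)
    for conference, year, organization in speakers:
        organizations[(conference, year)].add(organization)
    return {key: len(orgs) for key, orgs in sorted(organizations.items())}


def flag_low_diversity(diversity, baseline_conference, ratio=0.5):
    """Flag conference-years with fewer than `ratio` times the number of
    organizations seen at the baseline conference in the same year."""
    flagged = []
    for (conference, year), count in diversity.items():
        baseline = diversity.get((baseline_conference, year))
        if conference == baseline_conference or baseline is None:
            continue
        if count < ratio * baseline:
            flagged.append((conference, year))
    return flagged


diversity = organization_diversity(SPEAKERS)
print(flag_low_diversity(diversity, "Reliable Conf"))
```

Comparing each conference against a known reliable one in the same year, rather than against a fixed number, keeps the check meaningful for both large and small fields.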

Just as info pros play key roles in bringing the best, most authoritative and most cost-effective online resources into their organizations, so they can bring a unique set of skills and expertise to TDM projects within their organizations.

                                                                                  # # #

Springer Nature recently created a portal with information on TDM tools and resources as well as Springer Nature policy regarding TDM usage at www.springernature.com/text-and-data-mining.

Interested in seeing how Springer Nature can help you with your institutional TDM activities? Contact rd@springernature.com.