New scientific literature is being published at an exponential pace. For an individual researcher, an extensive review of even a single topic can take months. What if a machine could read all those papers in a couple of minutes and point the way forward? Although this may seem impossible for now, a novel approach has shown that the literature can be mined for the discovery of new materials with very little supervision.
[Figure: Growth in scientific literature]
“The paper establishes that text mining of scientific literature can uncover hidden knowledge, and that pure text-based extraction can establish basic scientific knowledge,” said study co-author Gerbrand Ceder, who also has an appointment in UC Berkeley’s Department of Materials Science and Engineering.
The algorithm employed for this unsupervised learning is called Word2vec. It takes a large body of text as input and produces a vector space of several hundred dimensions, with each unique word in the text assigned a corresponding vector. A shallow neural network is trained so that words sharing common contexts end up in close vicinity of each other, which means similar words receive similar vector representations.
A famous example of such a semantic relation is
“king” – “queen” ≈ “man” – “woman”. Here the subtraction on each side is performed between the vectors of the corresponding words, and both differences isolate the same direction in the space, the one representing gender.
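The arithmetic can be illustrated with hand-crafted toy vectors. Real Word2vec embeddings are learned, not designed; the vectors below are invented so that dimension 0 encodes gender and dimension 1 encodes royalty:

```python
import numpy as np

# Toy embeddings: dimension 0 ~ gender, dimension 1 ~ royalty.
vectors = {
    "king":  np.array([ 1.0, 1.0, 0.2]),
    "queen": np.array([-1.0, 1.0, 0.2]),
    "man":   np.array([ 1.0, 0.0, 0.1]),
    "woman": np.array([-1.0, 0.0, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Both subtractions cancel everything except the gender direction.
diff_royal = vectors["king"] - vectors["queen"]  # [2, 0, 0]
diff_plain = vectors["man"] - vectors["woman"]   # [2, 0, 0]
similarity = cosine(diff_royal, diff_plain)      # 1.0: the two differences align
```

In a trained model the two difference vectors are only approximately parallel, which is why the relation is written with ≈ rather than =.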
A team in Berkeley Lab’s Energy Storage and Distributed Resources Division collected 3.3 million abstracts of published scientific literature and fed them into the algorithm. The algorithm predicted discoveries of new thermoelectric materials years in advance, and it suggested already-known materials as candidates for thermoelectric applications. It was also able to learn the meaning of scientific terms and concepts, such as crystal structure, simply from the positions of words in the abstracts and their co-occurrences with other words.
The algorithm looks like a genie possessing magical powers: it has been able to predict discoveries years in advance. To verify the predictions, the team tested the top thermoelectric candidates it suggested. The algorithm ranked each compound by the similarity of its word vector to the vector for the word “thermoelectric”.
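That ranking step can be sketched with cosine similarity over a toy embedding table. The compound names and vector values below are invented for illustration and are not the study's actual candidates or embeddings:

```python
import numpy as np

# Invented word vectors; in the real system these come from the trained model.
embeddings = {
    "thermoelectric": np.array([0.9, 0.1, 0.0]),
    "CuGaTe2":        np.array([0.8, 0.2, 0.1]),  # hypothetical close candidate
    "Bi2Te3":         np.array([0.7, 0.3, 0.0]),
    "NaCl":           np.array([0.0, 0.1, 0.9]),  # unrelated to thermoelectrics
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank every compound by similarity of its vector to "thermoelectric".
target = embeddings["thermoelectric"]
ranking = sorted(
    (word for word in embeddings if word != "thermoelectric"),
    key=lambda word: cosine(embeddings[word], target),
    reverse=True,
)
```

Compounds whose vectors point in nearly the same direction as “thermoelectric” land at the top of the list; unrelated materials fall to the bottom.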
The top 10 results were a surprise. All had power factors slightly higher than the average known thermoelectric, while the top three had power factors above the 95th percentile of known thermoelectrics.
The team also tested whether the algorithm could perform experiments “in the past” by training it only on abstracts published up to a given year, say 2000. Again, its accuracy surprised: many of its top predictions turned up in later studies. For example, three of the top five predictions from a model trained on data up to the year 2008 have since been discovered, whereas the remaining two contain toxic elements.
The code for the algorithm is available at https://github.com/materialsintelligence/mat2vec.