NOMAD Kaggle 2018

NOMAD 2018 Kaggle research competition: A paradigm shift in solving materials science grand challenges by crowd sourcing solutions through an open and global big-data competition

Innovative materials design is needed to tackle some of the most important health, environmental, energy, societal, and economic challenges. Improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to a growing global demand. Transparent conductors are an important class of compounds that are both electrically conductive and have low absorption in the visible range, which are typically competing properties. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity to a high enough degree to be used as transparent conducting materials.

To address the need for finding new materials with an ideal target functionality, the Novel Materials Discovery (NOMAD) Centre of Excellence organized a crowd-sourced data analytics competition with Kaggle, one of the best-known online platforms for hosting big-data competitions. Kaggle has a community of over half a million users from around the world with various backgrounds in computer science, statistics, biology, and medicine. The competition ran from December 18, 2017 to February 15, 2018 and involved nearly 900 participants. The goal was to develop or apply data analytics models for the prediction of two target properties: the formation energy (an indicator of the stability of a material) and the bandgap energy (an indicator of the potential for transparency over the visible range), to facilitate the discovery of new transparent conductors and allow for advancements in (opto)electronic technologies. A total of 5,000 euros in prizes was awarded to the top three participants with the best-performing models (i.e., the lowest average root mean square log error (RMSLE) of the formation and bandgap energies). The RMSLE is defined as:

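$$\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}$$

where N is the total number of observations, ŷi is the predicted value, and yi is the reference value for either the formation or bandgap energies.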
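As a rough illustration of the metric, the sketch below computes the RMSLE for one target and averages it over the two targets. It assumes the usual log(1 + x) form of the RMSLE; the function names `rmsle` and `competition_score` are illustrative and do not come from the competition code.

```python
import math


def rmsle(predicted, reference):
    """Root mean square log error between two equal-length sequences.

    Assumes the common log(1 + x) form of the metric.
    """
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(r)) ** 2
        for p, r in zip(predicted, reference)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def competition_score(pred_formation, ref_formation, pred_bandgap, ref_bandgap):
    """Average of the per-target RMSLE values (formation and bandgap energy)."""
    return 0.5 * (rmsle(pred_formation, ref_formation)
                  + rmsle(pred_bandgap, ref_bandgap))
```

Because the error is taken on log(1 + x), a given absolute deviation is penalized more strongly for small reference values than for large ones.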

The dataset consists of 3,000 materials: 2,400 made up the training set and the remaining 600 were used as the test set, for which only structures and input features were provided and the target properties were kept secret. Of that test set, 100 materials were used to determine the public leaderboard score so that participants could assess their model performance on the fly (the exact values used in this assessment were kept secret). The top three winners of the competition were determined by the private leaderboard score, based on the remaining 500 test-set materials.

Because only 100 values were used for the public leaderboard, participants had to ensure the predictive accuracy of their models on unseen data, even when this disagreed with the public leaderboard score. This is evident in the summary of the average RMSLE for all participants with scores below 0.25 in Figure 1, where a large shift in the values is seen between the public leaderboard (100 compounds) and the private leaderboard (500 compounds). The winning score has an RMSLE of 0.0509, while the 2nd and 3rd place winners were closely spaced, with RMSLEs of 0.0521 and 0.0523, respectively. Within the first bin, there were a total of four participants with an RMSLE of 0.053 or lower (i.e., 0.45% of participants).

Figure 1. Histogram of averaged RMSLE in the predictions for bandgap and formation energies.

Discussion of top performers

Table 1. A summary of the winners of the NOMAD2018 Kaggle competition

| Ranking | Name | Public | Private | Approach keywords |
|---|---|---|---|---|
| 1st | Tony Y. | 0.0381 | 0.0509 | Crystal graph n-grams (a) + kernel ridge regression (b) |
| 2nd | Dr. Yury Lysogorskiy | 0.0461 | 0.0521 | Derived and BOP-based features (c) + gradient boosting trees (d) |
| 3rd | Lars Blumenthal | 0.0446 | 0.0523 | SOAP-based descriptor (e) + neural network (f) |

(a) Ref. 1; (b) Ref. 2; (c) Refs. 3-5; (d) Ref. 6; (e) Refs. 7-8; (f) Ref. 9

1st place winning solution

Tony Y., who is the CEO of a startup in Japan, is the 1st place winner.

The 1st place winning solution used the metal-oxygen coordination number, derived from the number of bonds whose lengths fall within the sum of the experimental Shannon ionic radii (enlarged by 30-50% depending on the crystal structure type). These ionic bonds are then used to build a crystal graph, in which each atom is a node and the edges between nodes are defined by the ionic bonds. Figure 2 shows such a graph for a sequence of 6 atoms, with the coordination number given for each atom.

Figure 2. Depiction of a crystal graph representation of In3Ga1O6 showing the connections between each atom (node) that are defined by the ionic bonds.
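A minimal sketch of this bond criterion is given below. It is a simplified illustration, not the winner's code: it ignores periodic images of the unit cell, uses a single enlargement factor `scale`, and the `RADII` table holds approximate six-coordinate ionic radii included only for illustration.

```python
import math

# Approximate ionic radii in angstroms, for illustration only (assumption).
RADII = {"In": 0.80, "Ga": 0.62, "Al": 0.535, "O": 1.40}


def build_graph(atoms, scale=1.3):
    """Build an adjacency map {atom index: set of bonded atom indices}.

    atoms: list of (element symbol, (x, y, z)) with Cartesian positions.
    A metal-oxygen pair is bonded when its distance is within
    scale * (r_metal + r_oxygen). Periodic images are NOT considered.
    """
    graph = {i: set() for i in range(len(atoms))}
    for i, (sym_i, pos_i) in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            sym_j, pos_j = atoms[j]
            # Skip metal-metal and oxygen-oxygen pairs.
            if (sym_i == "O") == (sym_j == "O"):
                continue
            cutoff = scale * (RADII[sym_i] + RADII[sym_j])
            if math.dist(pos_i, pos_j) <= cutoff:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def coordination_numbers(graph):
    """Number of bonded neighbors for every atom in the graph."""
    return {i: len(neighbors) for i, neighbors in graph.items()}
```

For a real crystal, the neighbor search would also have to include atoms in neighboring periodic cells, which this sketch leaves out for brevity.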

Using this crystal graph, sequences of coordination numbers combined with element symbols (n-gram items) are counted for single atomic sites (unigram), pairs of nearest neighbors (bigram), sequences of three atoms (trigram), and sequences of four atoms (quadgram). These counts are then divided by either the number of atoms in the unit cell or the cell volume to account for the various unit cell sizes in the dataset. These features were used in a kernel ridge regression (KRR) model with a Gaussian radial basis function kernel; hyperparameters were obtained by grid searches with 5-fold cross-validation (CV), whose scores compare well with the private leaderboard score (Table 2).
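The n-gram counting can be sketched as follows. This is one plausible reading of the description rather than the winner's implementation: n-grams are taken as simple paths of n connected atoms, each undirected path is counted once, and a sequence and its reverse are treated as the same item. The node labels (for example "In6" for a six-fold coordinated indium atom) and the function names are illustrative.

```python
from collections import Counter


def ngram_counts(graph, labels, n):
    """Count label sequences along simple paths of n connected atoms.

    graph:  {atom index: set of neighbor indices}
    labels: {atom index: label such as "In6" (element + coordination number)}
    Each undirected path is counted once; a sequence and its reverse
    are mapped to the same key (assumption).
    """
    counts = Counter()

    def extend(path):
        if len(path) == n:
            # Every path with n > 1 is reached from both ends; keep one copy.
            if n == 1 or path[0] < path[-1]:
                seq = tuple(labels[i] for i in path)
                counts[min(seq, seq[::-1])] += 1
            return
        for neighbor in graph[path[-1]]:
            if neighbor not in path:
                extend(path + [neighbor])

    for start in graph:
        extend([start])
    return counts


def normalize(counts, divisor):
    """Divide counts by the number of atoms or by the cell volume."""
    return {item: value / divisor for item, value in counts.items()}
```

Normalizing by the number of atoms or the cell volume keeps features comparable between unit cells of different size, as described above.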

Table 2. Performance summary of n-grams of various lengths for the RMSLE for the training set of the formation energy and bandgap energy, the averaged 5-fold CV and the RMSLE for the public and private leaderboard.

| Model | Formation energy (eV/cation) | Bandgap (eV) | 5-fold CV | Public | Private |
|---|---|---|---|---|---|
| Unigram | 0.0248 | 0.0859 | 0.0553 | 0.0519 | 0.0596 |
| Bigram | 0.0248 | 0.0860 | 0.0554 | 0.0513 | 0.0568 |
| Trigram | 0.0239 | 0.0834 | 0.0537 | 0.0439 | 0.0542 |
| Quadgram | 0.0250 | 0.0922 | 0.0586 | 0.0525 | 0.0566 |

The winning solution was constructed from an ensemble score of the trigram and quadgram:

$$P_{\mathrm{mix}} = a_{\mathrm{mix}}\,P(\mathrm{trigram}) + (1 - a_{\mathrm{mix}})\,P(\mathrm{quadgram})$$

with a mixing parameter of 0.64 for the formation energy and 0.69 for the bandgap energy, which resulted in an RMSLE of 0.0387 and 0.0509 for the public and private leaderboards, respectively. The distribution of errors between the predictions and the actual DFT reference values for both the formation energies (left panel) and bandgap energies (right panel) for the total test set of 600 materials is shown in Figure 3.

Figure 3. Differences between the actual values and predictions for the formation energies (left) and the bandgap energies (right) for the total test set of 600 materials using the crystal graph n-gram + KRR 1st place model.
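The linear mixing of the trigram and quadgram predictions amounts to a weighted average per material. A minimal sketch, with the hypothetical function name `mix_predictions`:

```python
def mix_predictions(p_trigram, p_quadgram, a_mix):
    """Blend two prediction lists: a_mix * trigram + (1 - a_mix) * quadgram."""
    if not 0.0 <= a_mix <= 1.0:
        raise ValueError("a_mix must lie between 0 and 1")
    if len(p_trigram) != len(p_quadgram):
        raise ValueError("prediction lists must have equal length")
    return [a_mix * t + (1.0 - a_mix) * q
            for t, q in zip(p_trigram, p_quadgram)]
```

With the reported mixing parameters, the call would use `a_mix=0.64` for the formation energy predictions and `a_mix=0.69` for the bandgap predictions.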

2nd place winning solution

Dr. Yury Lysogorskiy, a postdoctoral researcher at ICAMS, Ruhr-Universität Bochum in Germany, is the 2nd place winner. He worked with Dr. Thomas Hammerschmidt, a research group leader at the same institution.

The strategy was to generate several derived features for each atom. These included geometrical properties (e.g., bond distances), elemental properties (e.g., electron affinity, ionization potential, atomic volume, covalent radius), and calculated bandgap and formation energies obtained from the Materials Project for In2O3, Ga2O3, and Al2O3, linearly scaled based on composition. In addition, the local atomic environment of each atom was characterized on the basis of self-returning hopping paths, as used in the methodology of bond-order potentials (BOP; Refs. 3-5). This per-atom information was then transformed to per-structure information by clustering and statistical aggregation (average and standard deviation). In this way, a total of 6,950 derived features were produced. Of these, the top performers were selected separately for the formation energy and the bandgap energy, based on feature importances calculated with an extreme gradient boosting regression tree approach (XGBoost). This procedure identified 212 top features for the formation energy and 174 for the bandgap energy. Although the most important features were the mean of the indium nearest-neighbor distances for the formation energy and the average bandgap energy of the binary oxides for the bandgap energy, about 50% of the overall feature importance was assigned to BOP-based features. The selected features were then used with a light gradient boosting regression tree model (LightGBM), with hyperparameters tuned using 10-fold shuffled CV, which resulted in scores of 0.0462 and 0.0521 for the public and private leaderboards, respectively.
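Three of the steps described here are simple enough to sketch with the Python standard library: the composition-weighted scaling of a binary-oxide property, the per-atom to per-structure aggregation, and the selection of the most important features. The function names are hypothetical, the importances are assumed to come from a separately trained boosted-tree model, and the clustering step and the BOP features themselves are not reproduced.

```python
import statistics


def composition_weighted(x_in, x_ga, x_al, binary_values):
    """Linearly interpolate a binary-oxide property by cation fractions.

    binary_values: {"In2O3": v, "Ga2O3": v, "Al2O3": v}
    """
    total = x_in + x_ga + x_al
    weighted = (x_in * binary_values["In2O3"]
                + x_ga * binary_values["Ga2O3"]
                + x_al * binary_values["Al2O3"])
    return weighted / total


def aggregate(per_atom_values):
    """Per-structure mean and (population) standard deviation."""
    return (statistics.fmean(per_atom_values),
            statistics.pstdev(per_atom_values))


def top_features(importances, k):
    """Names of the k features with the highest importance.

    importances: {feature name: importance}, e.g. taken from a
    trained gradient boosting model (not computed here).
    """
    return sorted(importances, key=importances.get, reverse=True)[:k]
```

As a usage example with placeholder numbers (not Materials Project values), `composition_weighted(0.5, 0.5, 0.0, {"In2O3": 1.0, "Ga2O3": 3.0, "Al2O3": 5.0})` gives the midpoint of the In2O3 and Ga2O3 entries.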

The difference between the predictions from this approach and actual DFT reference values for both the formation energy (left) and bandgap (right) for the total test set of 600 materials is shown in Figure 4.

Figure 4. Differences between the actual values and predictions for the formation energy (left) and the bandgap energy (right) for the derived features combined with LightGBM.

3rd place winning solution

Lars Blumenthal, a 3rd year PhD student at the EPSRC Centre for Doctoral Training on Theory and Simulation of Materials in the Department of Physics at Imperial College London, is the 3rd place winner.

In this approach, a descriptor based on the smooth overlap of atomic positions (SOAP), developed by Bartók et al. (Refs. 7-8), was used to incorporate information about the local atomic environment. The SOAP descriptor represents the local environment of a given atom by constructing a Gaussian-smeared density of the neighbors within a specific cutoff radius, which is expanded in a basis of radial basis functions and spherical harmonics. This forms the spherical power spectrum corresponding to the neighbor density for each atom. The SOAP vectors derived for each atom in a material are then averaged to obtain one feature vector describing the chemical environment of the average atom. These mean feature vectors for each compound were then scaled so that each dimension had zero mean and unit variance.
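The averaging and scaling steps can be sketched as below, taking the per-atom SOAP vectors as given (computing them requires a dedicated descriptor library and is not shown). The function names are illustrative, and mapping constant feature columns to zero is an assumption made to avoid division by zero.

```python
import statistics


def mean_vector(per_atom_vectors):
    """Average the per-atom descriptor vectors of one material."""
    n_atoms = len(per_atom_vectors)
    return [sum(column) / n_atoms for column in zip(*per_atom_vectors)]


def standardize(rows):
    """Scale each feature column to zero mean and unit variance.

    rows: one mean feature vector per material.
    Constant columns are mapped to 0.0 (assumption).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(column) for column in columns]
    # pstdev returns 0.0 for a constant column; fall back to 1.0.
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]
```

In practice the means and standard deviations would be computed on the training set only and then reused to scale the test set.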

These SOAP features were used in a three-layer feed-forward neural network (NN) implemented in PyTorch, with batch normalization and 20% dropout in each layer. For predicting the bandgap energies and the formation energies, the initial layer had 1024 neurons and 512 neurons, respectively. In both cases, the remaining two layers had 256 neurons. The neural networks were trained for 200 and 250 epochs for the prediction of the bandgap energies and the formation energies, respectively. The final predictions were based on 200 independently trained neural networks, which all shared an identical architecture but had different initial weights. This approach resulted in scores of 0.0448 and 0.0523 for the public and private leaderboards, respectively. The distribution of errors between the predictions and the actual DFT reference formation energies (left) and bandgap energies (right) for the total test set of 600 materials is shown in Figure 5.

Figure 5. Differences between the reference values and predictions for the formation energy (left) and the bandgap energy (right) for the SOAP+NN 3rd place model.