Global pharmaceutical company Bristol-Myers Squibb recently concluded an ML hackathon on Kaggle.
The Bristol-Myers Squibb – Molecular Translation competition was held from March 2 to June 3, 2021. Participants were asked to interpret old chemical images and predict their respective International Chemical Identifier (InChI) text strings. Contestants were given access to a large set of synthetic image data from Bristol-Myers Squibb, and submissions were evaluated on the mean Levenshtein distance between the InChI strings a particular team submitted and the ground-truth InChI values.
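The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal, self-contained sketch of the metric (the function names here are illustrative, not from the competition's scoring code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via classic dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion of ca
                curr[j - 1] + 1,           # insertion of cb
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def mean_levenshtein(preds, truths):
    """Average edit distance across predictions, as in the leaderboard metric."""
    return sum(levenshtein(p, t) for p, t in zip(preds, truths)) / len(preds)
```

A perfect submission therefore scores 0, and every wrong, missing, or extra character in a predicted InChI string adds to the score.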
A total of 874 teams and 1,171 participants took part in the competition from across the globe. The winners were announced on June 11, 2021, with Team ‘SIMM DDDC’ bagging the first prize, a cash prize worth $25,000; Team ‘∫∫ℳℎ | Zootropolis’ ranking second and winning $15,000; and Team ‘kyamaro’ ranking third and winning $10,000. Besides this, teams ranking between fourth and eleventh won gold medals; those from twelfth to fiftieth won silver medals; and those ranked between fifty-first and hundredth won bronze medals.
The solution from Japanese team ‘kyamaro’ to the chemical image problem consisted of three phases.
The first phase was image-captioning training. The team noticed that the test data had more salt-and-pepper noise than the training data, and they addressed this by adding the same kind of noise as an augmentation during training.
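Salt-and-pepper augmentation simply flips a random fraction of pixels to pure black or pure white, so that the model learns to caption noisy images too. A minimal sketch on a plain 2-D grayscale grid; the function name and the `amount` value are illustrative, not the team's actual settings:

```python
import random

def add_salt_and_pepper(pixels, amount=0.05, rng=None):
    """Return a copy of a 2-D grayscale image with `amount` of its pixels
    flipped to pure black (0) or pure white (255).

    `pixels` is a list of rows of integer intensities. Illustrative only.
    """
    rng = rng or random.Random(0)
    h, w = len(pixels), len(pixels[0])
    out = [row[:] for row in pixels]          # leave the input untouched
    for _ in range(int(amount * h * w)):
        i, j = rng.randrange(h), rng.randrange(w)
        out[i][j] = rng.choice([0, 255])      # salt (white) or pepper (black)
    return out
```

In practice this kind of transform would be applied on the fly inside the training data loader, alongside the usual resizing and normalization.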
Phase two required generating a variety of high-quality InChI candidates. The team did this through beam search, using several models: Swin Transformer + BERT decoder and Transformer-in-Transformer + BERT decoder.
Finally, phase three handled the re-ranking of InChI candidates. The code uses the rdkit.Chem.MolFromInchi function to validate each InChI candidate (recording the result as ‘is_valid’).
After this, the loss used for training (cross-entropy or focal loss) is calculated for each candidate under multiple models and averaged across them. The candidates are then sorted by ‘is_valid’ in descending order and ‘loss’ in ascending order, and the top-ranked InChI is the final output. One can find the code used for this solution here.
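The ranking rule described above reduces to a two-key sort. A minimal sketch, assuming each candidate already carries its validity flag (in the real code, 1 when rdkit.Chem.MolFromInchi returns a molecule rather than None) and its model-averaged loss:

```python
def rerank(candidates):
    """Pick the best InChI: valid candidates first, then lowest averaged loss.

    Each candidate is a dict with 'inchi', 'is_valid' (0 or 1), and 'loss'
    (cross-entropy / focal loss averaged across models). Illustrative schema.
    """
    # Negating 'is_valid' sorts valid (1) candidates before invalid (0) ones;
    # ties are broken by ascending loss.
    ranked = sorted(candidates, key=lambda c: (-c["is_valid"], c["loss"]))
    return ranked[0]["inchi"]
```

The effect is that a chemically valid string always beats an invalid one, even if the invalid string happens to have a lower loss.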
Hungry for Gold
Another possible solution was presented by team ‘Hungry for Gold’, consisting of four Indians and one Japanese participant. The group secured the seventh position and a gold medal. Their final solution consisted of four vital elements:
- Training transformer-to-transformer models effectively
- Logit-level ensemble at each time step
- RDKit post-processing
- Analysis of all submissions and validation InChIs
They trained a ViT model by manually adjusting the learning rate and increasing random-scale augmentation. They also added a Swin Transformer to their ensemble. One can read about their solution and the code they used here.
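Logit-level ensembling averages the models' raw output scores at each decoding step before choosing the next token, rather than merging finished strings afterwards. A minimal greedy-decoding sketch, with the decoder forward passes stubbed out as plain callables (all names here are illustrative, not from the team's code):

```python
def ensemble_greedy_decode(models, steps):
    """Greedy decoding with logit averaging across models at each time step.

    `models` is a list of callables mapping the tokens decoded so far to a
    list of per-token logits (stand-ins for real decoder forward passes).
    """
    tokens = []
    for _ in range(steps):
        logit_lists = [m(tokens) for m in models]        # one logit vector per model
        vocab = len(logit_lists[0])
        avg = [sum(l[v] for l in logit_lists) / len(models) for v in range(vocab)]
        tokens.append(max(range(vocab), key=avg.__getitem__))  # argmax of the average
    return tokens
```

Because the averaging happens before the argmax, the ensemble can pick a token that no single model would have chosen on its own, which is the point of combining at the logit level.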
Yet another solution was presented by a team of five, ‘MFL Eindhoven’. Participant Murat Cihan Sorkun, who admitted to being a Kaggle ‘newbie’, spoke about his team’s solution in a post. The team started image captioning with a ResNet+LSTM notebook. However, they decided to shift to an end-to-end TPU notebook to enable faster training. They used EfficientNet-B1 + LSTM to train their models.
With several models in hand, the team then looked at ensembling them to boost their scores. They used three merge algorithms: ‘Gold Rush’, which merges two submissions by selecting the valid molecule detected by RDKit; ‘The Collector of Lost Souls’, which runs inference from each of the above epochs only for invalid molecules; and ‘Multimerge’, which further validated the valid molecules based on voting and minimum distance. The team started with two people and later added three more members, whose models significantly improved their scores.
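A ‘Gold Rush’-style merge can be sketched as a per-image choice between two submissions, preferring whichever prediction parses as a valid molecule. In the team's code the validity check would come from RDKit; here it is passed in as a predicate so the sketch stays self-contained (the function name and fallback behaviour are assumptions):

```python
def gold_rush_merge(sub_a, sub_b, is_valid):
    """Merge two submissions image by image, preferring a valid InChI.

    `sub_a` / `sub_b` map image_id -> InChI string; `is_valid` is a predicate
    (in practice, e.g., a check that rdkit.Chem.MolFromInchi parses the string).
    Falls back to `sub_a` when neither candidate is valid. Illustrative only.
    """
    merged = {}
    for image_id, inchi_a in sub_a.items():
        inchi_b = sub_b.get(image_id, inchi_a)
        if is_valid(inchi_a):
            merged[image_id] = inchi_a
        elif is_valid(inchi_b):
            merged[image_id] = inchi_b
        else:
            merged[image_id] = inchi_a  # no valid option; keep the primary guess
    return merged
```

The other two algorithms build on the same idea: re-running inference only where the current prediction is invalid, and voting among valid candidates.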
MFL Eindhoven ranked thirteenth, winning a silver medal.
Team Stas SI
One-person team Stas SI (a Kaggle username) used molecular graphs and atom coordinates to arrive at their solution, training a neural network for object detection to locate atoms and bond coordinates.
Stas SI ranked 24th.
Team Nikita Kozodoi
Finally, another team of one, participant Nikita Kozodoi, came 47th with a solution that used an ensemble of seven CNN-LSTM encoder-decoder models.
While the top two winners are yet to share their work, one can also go through the solutions provided by other winning teams:
- 4th Place
- 5th Place
- 6th Place
- 8th Place
- 9th Place a
- 9th Place b
- 10th Place a
- 10th Place b
- 12th Place
- 15th Place
- 18th Place
- 50th Place
While many publications use machine-readable chemical descriptions to help researchers work with large data sets, the approach has limitations. Firstly, many scanned documents cannot be automatically searched for specific chemical depictions. Secondly, many public data sets are too small to support modern machine learning models. Finally, dated sources usually have some level of image corruption, which reduces performance. This makes it challenging to reliably convert scanned chemical-structure images into ML-compatible formats. Competitions like Bristol-Myers Squibb – Molecular Translation help advance chemical research and development efforts via machine learning frameworks.