MachineHack successfully concluded Embold’s Hackathon, the GitHub Bugs Prediction Challenge, on 18th October 2020. Participants were asked to predict bugs from GitHub titles and body text. The emboldened hackathon was warmly welcomed by data scientists, with active participation from close to 500 practitioners.
In this hackathon, organised in partnership with Embold, participants were challenged to come up with an algorithm that can classify GitHub text data as bugs, features, or questions. Embold.io is a software quality platform that helps teams deliver quality code in a short time, and for this hackathon the participants’ code quality was scored using the Embold Code Analysis platform.
After a two-stage evaluation based on participants’ standing on the private leaderboard and their Embold Scorecard, three participants topped our leaderboard. Here, we introduce the champions of the GitHub Bugs Prediction Challenge and describe their approaches to solving the problem.
Winner 01: Ankur Kumar
A problem solver at heart, Ankur Kumar strives to build creative and effective solutions through cutting-edge AI techniques. In his current role as an assistant manager of the group data and analytics cell at Aditya Birla Group, he develops business-specific NLP and computer vision models to drive business objectives. Ankur started his career in the data science industry in 2017 with Dataval Analytics Inc. as a data analyst and developer, where he contributed significantly to the development of complex products and solutions based on natural language understanding and computer vision. He has worked across various domains, including financial services (capital and insurance), retail services, and the agricultural sector.
Ankur is also an active open-source contributor to Keras and has created and open-sourced NLP Docker, which has over 75K active users. In addition, he has created open-source Python packages including NLP-preprocessing and model-x, which have over 9K and 5K downloads respectively. He is also an active contributor in the big data and machine learning domain at Stack Overflow, a question-and-answer site for professional programmers.
To solve the complex problem of the Embold Hackathon, Ankur started with basic data exploration, which included language detection, n-gram frequency analysis, word clouds, and target class distribution. He then built a Transformer-based model (bert-base-uncased), which gave decent accuracy. Ankur also trained a masked language model (MLM) on the given data, both the training and test datasets, and then fine-tuned it on the classification task, which improved the accuracy. Finally, he built multiple weak learners based on MLM-trained transformers and combined them into ensemble models.
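The exploration steps mentioned above, n-gram frequency analysis and target class distribution, can be illustrated with a minimal sketch. This is not Ankur's code: the function names, the whitespace tokenisation, and the toy rows are assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n=2, k=3):
    """Count n-grams over lowercased, whitespace-tokenised texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts.most_common(k)


def class_distribution(labels):
    """Return the share of each label, which exposes class imbalance."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy rows standing in for the real GitHub text data.
texts = [
    "app crashes on start",
    "app crashes on login",
    "add dark mode",
    "how to install on windows",
]
labels = ["bug", "bug", "feature", "question"]

bigram_counts = top_ngrams(texts, n=2, k=2)
label_shares = class_distribution(labels)
```

On real data, the most frequent n-grams per class and the label shares would typically guide choices such as text cleaning and whether class weighting is needed.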
Winner 02: Saurabh Kumar
Saurabh Kumar is a data scientist who got interested in the field back in 2014, when he first heard about a machine learning algorithm named Random Forest, which was performing well in classification tasks compared to traditional classifiers. Saurabh was overwhelmed: he was struck by the amount of information available online and astounded by the variety of real-world problems that machine learning algorithms can solve. “Since then, I have managed to keep curiosity and consistency in learning about the field,” said Saurabh.
To solve Embold’s GitHub Bugs Prediction Challenge, Saurabh started with transfer learning models on GPUs, since the data was massive and training a single model took a huge amount of time. “I quickly exhausted all my GPU resources,” said Saurabh.
Once the GPUs were exhausted, Saurabh switched to TPUs provided by Google, which drastically reduced the training time and enabled more experiments. He trained XLM-RoBERTa-Large, RoBERTa-Large, and RoBERTa-Small models. The final solution was a blend of simple training and KFold training.
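As a rough illustration of blending a single training run with KFold training, the sketch below splits sample indices into folds and averages two probability vectors. It is written under assumptions: "simple training" is read here as one model trained once, the blend is taken to be a weighted average, and the helper names and numbers are hypothetical rather than taken from Saurabh's solution.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs so every sample is validated once."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def blend(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two per-class probability vectors."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]


folds = list(kfold_indices(n_samples=10, n_splits=3))

# Hypothetical test-set probabilities for one sample over (bug, feature, question).
single_run = [0.6, 0.3, 0.1]  # assumed output of a model trained once
kfold_mean = [0.4, 0.5, 0.1]  # assumed mean of the fold models' outputs
blended = blend(single_run, kfold_mean)
```

In practice each fold would train one model on `train` and validate on `valid`, and the fold models' test predictions would be averaged before blending.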
When asked about the experience, Saurabh said, “My experience on this platform is great, as MachineHack is continuously evolving in enriching users’ experience. Also, moderators are helpful and prompt in answering participants’ queries.”
Winner 03: Salim Shaikh
Currently working in the data science team of HDFC Bank, Salim Shaikh has always had an inclination towards playing with numbers and getting insights from them. With a master’s degree in statistics and work experience in the telecom and banking sectors, Salim got a chance to get his hands on various algorithms.
“Earlier I was not aware of data science as a field, but after my placement at Vodafone Idea, I found my future roadmap,” said Salim. “Having worked in telecom and banking domains, two of the largest sources of data, I got multiple opportunities to try out various algorithms, which also led me to participate in various hackathons organized by Kaggle, Analytics Vidhya, Zindi, HackerEarth and of course MachineHack.”
This led to collaborative learning for Salim: reading the solutions of other participants, teaming up with them, and exchanging ideas. “Data Science is an ocean, and of course, there is a lot more to cover. I would like to thank MachineHack and other platforms for giving us a chance to keep ourselves updated with the trend,” added Salim.
When asked about the hackathon at hand, Salim explained that he started by building a model using just the body from the training dataset, which provided a decent score. He then concatenated the title and body and retrained the model, which provided a lift over the previous score.
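Joining the title and body into a single model input can be sketched as follows. The field names, the separator, and the handling of empty fields are assumptions for illustration; the source does not say how the two fields were joined.

```python
def build_text(title, body, sep=" "):
    """Join title and body into one model input, skipping empty fields."""
    parts = [part.strip() for part in (title, body) if part and part.strip()]
    return sep.join(parts)


# Hypothetical rows; the field names are assumed.
rows = [
    {"title": "Crash on start", "body": "The app exits immediately."},
    {"title": "Add dark mode", "body": ""},
]
inputs = [build_text(row["title"], row["body"]) for row in rows]
```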
Finally, he combined Train_extra with the training dataset (450,000 rows x 3 columns); because of the size of the data, one epoch took one and a half to two hours to train. After training various models of the BERT family, Salim noted that RoBERTa Base outperformed the rest, so it became the preferred choice. The final solution was an ensemble of three RoBERTa models with different parameters, which achieved 0.85289 as the best score and 0.85427 as the final score.
When asked about the experience, he said, “I have had quite a decent experience on MachineHack. Every weekend we get some new problems covering all the aspects like tabular, text, vision, etc. to brainstorm and learn. I look forward to more such hackathons in the future.”
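One common way to ensemble several classifiers is to average their per-class probabilities and pick the highest-scoring class. The sketch below shows that idea for three models. The source does not state how Salim combined his models, so the averaging scheme, the label encoding (0 = bug, 1 = feature, 2 = question), and the numbers are all assumptions.

```python
def average_probabilities(model_probs):
    """Average per-class probabilities across models for each sample."""
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[s][c] for m in model_probs) / n_models for c in range(n_classes)]
        for s in range(n_samples)
    ]


def predict_labels(avg_probs):
    """Pick the index of the highest averaged probability for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in avg_probs]


# Hypothetical outputs of three models for two samples.
# Assumed label encoding: 0 = bug, 1 = feature, 2 = question.
model_a = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]
model_b = [[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]
model_c = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]

averaged = average_probabilities([model_a, model_b, model_c])
predicted = predict_labels(averaged)
```

For the second sample, model_a alone would choose class 1, but the averaged probabilities favour class 2, which is the kind of correction an ensemble is meant to provide.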