“Can you come up with an algorithm that can predict the bugs, features, and questions based on GitHub titles?”
An average smartphone OS contains more than 10 million lines of code. A million lines of code take 18,000 pages to print, which is equal to 14 copies of Tolstoy’s War and Peace put together! There is always a simpler, shorter version of the code along with a longer, more exhaustive version.
The number of tools, languages, techniques, and applications that the machine learning ecosystem has nurtured can be overwhelming for a developer. Even more daunting is keeping the code from going stale. Hidden technical debt within a pipeline can make the product dysfunctional. So, what if there were a tool that did this job for us, serving us clean code and answering all our queries?
If you are one of those ML fanatics who think that this can and should be done, then you should definitely check out this new hackathon brought to you by MachineHack in association with Embold.
Embold.io is a software quality platform that helps companies produce quality code in a short time through an easy-to-navigate interface. Embold combines machine learning, rigorous statistical algorithms, and powerful programming techniques to develop cutting-edge products for the industry.
Why Should You Participate?
- Chance to win bounties worth INR 25,000 by competing against top MachineHackers.
- Opportunity to deploy state-of-the-art language models like BERT.
- Exposure to solving use cases at the organizational level.
Overview Of The Hackathon
In this hackathon, we are challenging the machine learning community to come up with an algorithm that can predict bugs, features, and questions based on GitHub titles and text bodies. Text data can pose a lot of challenges, especially when the dataset is big.
Leverage state-of-the-art NLP models like BERT and other pretrained models at your disposal to come up with the best model. The winner’s model will be evaluated using a code quality score check on the Embold Code Analysis platform.
Dataset Description:
- Training set: 150,000 rows x 3 columns (includes the Label column as the target variable)
- Test set: 30,000 rows x 2 columns
Attribute Description:
- Title – the title of the GitHub bug, feature, or question
- Body – the body of the GitHub bug, feature, or question
- Label – the class label (target variable) of the entry
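To make the task concrete, here is a minimal baseline sketch, not the organizers' method: a multinomial Naive Bayes classifier written in plain Python that predicts a label from the combined title and body. The class name `NaiveBayesIssueClassifier`, the toy rows, and the string labels `"bug"`, `"feature"`, and `"question"` are all assumptions made for illustration; the actual dataset may encode its labels differently.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesIssueClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def fit(self, rows):
        # rows: iterable of (title, body, label) tuples, mirroring the
        # three training columns described above.
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for title, body, label in rows:
            tokens = tokenize(title + " " + body)
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, title, body):
        # Ignore tokens never seen during training.
        tokens = [t for t in tokenize(title + " " + body) if t in self.vocab]
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy rows; the label strings are an assumption.
toy_rows = [
    ("App crashes on startup", "Null pointer exception thrown when launching", "bug"),
    ("Login fails with error", "Stack trace shows exception after crash", "bug"),
    ("Add dark mode support", "It would be nice to add a theme option", "feature"),
    ("Support export to CSV", "Please add an option to export reports", "feature"),
    ("How do I configure the proxy?", "What is the correct way to set this up", "question"),
    ("How to install on Windows?", "What steps do I need to follow", "question"),
]

model = NaiveBayesIssueClassifier().fit(toy_rows)
prediction = model.predict("Crash when saving file", "Exception thrown on save")
print(prediction)
```

On the real data, the toy rows would be replaced by the training rows, and a pretrained model such as BERT would likely outperform this simple bag-of-words baseline; the sketch is only meant to show the input and output shape of the problem.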