Machine Learning and Data Science have captured the attention of a young generation seeking an exciting and well-paid career. As data volumes and technology grow, the gap between Machine Learning Engineers and Data Scientists is narrowing, leading companies to look for full-stack data scientists. Demand for people with Machine Learning and Data Science skills is rising rapidly, and organizations are struggling to find the right fit. A good MLDS profile requires not only a strong analytical and engineering background but also a good understanding of algorithms, advanced statistics, optimization approaches, coding and distributed computing. One of the few ways to master this new domain is through constant learning, practice and showcasing skills in hackathons. This is where MachineHack can help. With the help of our top hackathons, you can become an expert in handling data and build a path to a successful Machine Learning and Data Science career.
Armanik, a multinational pharmaceutical company based in Texas, USA, is one of the largest pharmaceutical companies by both market capitalization and sales. Armanik manufactures drugs across multiple therapy areas: Cardiovascular, Diabetes, HIV and Immunology. Their innovative SGLT 2 is the market leader in diabetes therapy. Recently, the company announced that it has successfully completed a Phase 3 trial for an Anti-TNF drug in Rheumatoid Arthritis (RA) therapy. The company expects to get approval for its new drug in the next 6 months. Given the competition in the market, Prakash Vishwanathan, CEO of Armanik, has reached out to ZS to help identify the patient population in the U.S. who are likely to switch products in the RA market. ZS has proposed a machine learning-based approach using medical transactional data: first identify the factors most closely associated with RA patients who switch, then use those factors to predict which patients are likely to switch in the near term.
Can you help ZS achieve the objectives described below?
File Descriptions:
- Train_data.csv – Data for training (transactional data for feature creation)
- Train_labels.csv – Outcome flag for train patients (1/0)
- Test_data.csv – Data for testing (create the final features for modelling)
- Fitness_values.csv – Fitness values for features created with train data. Use this file to match against the fitness values of the features you create, using the train data only.
- Sample Submission.csv – Sample submission format for submitting the predictions (don't shuffle the Patient_ID; keep the sequence intact).
- Train Data: 627 MB
- Test Data: 273 MB
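Because Train_data.csv is transactional, each patient spans many rows that have to be collapsed into a single feature row before modelling. The sketch below shows one possible way to do that while streaming the file row by row, so memory use stays bounded even for files of this size. The column names patient_id, event_name and event_time, the window lengths, and the count-within-window features are all assumptions made for illustration; they are not taken from the actual files.

```python
import csv
import io
from collections import defaultdict


def count_features(rows, windows=(30, 90)):
    """Collapse transactional rows into per-patient event counts.

    Each row is a dict with the hypothetical keys patient_id, event_name
    and event_time (days before an anchor date; smaller means more recent).
    Returns {patient_id: {feature_name: count}}.
    """
    features = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Touch the patient first so patients with no in-window events
        # still get an (empty) entry.
        patient = features[row["patient_id"]]
        days = int(row["event_time"])
        for window in windows:
            if days <= window:
                patient[f"count_{row['event_name']}_last_{window}d"] += 1
    return {pid: dict(feats) for pid, feats in features.items()}


# Tiny in-memory stand-in for Train_data.csv (made-up values).
sample = io.StringIO(
    "patient_id,event_name,event_time\n"
    "p1,drug_a,10\n"
    "p1,drug_a,60\n"
    "p1,visit,5\n"
    "p2,drug_a,100\n"
)

# csv.DictReader yields one row at a time, so the whole file is never
# held in memory at once.
result = count_features(csv.DictReader(sample))
print(result)
```

To run this against a real file, replace the StringIO object with an open file handle, for example `open("Train_data.csv", newline="")`, and adjust the column names to match the actual header.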
Packages allowed (applicable only to Objective 1):
- Cython extension
ZS is a professional services firm that works side by side with companies to help develop and deliver products that drive customer value and company results. We leverage our deep industry expertise, leading-edge analytics, technology and strategy to create solutions that work in the real world. With more than 35 years of experience and 6,000-plus ZS-ers in 23 offices worldwide, we are passionately committed to helping companies and their customers thrive. Our most valuable asset is our people—a fact that’s reflected in our values-driven organization in which new perspectives are integral and new ideas are celebrated. We apply our knowledge, capabilities and innovation-oriented approach in industries ranging from healthcare and life sciences to high-tech, financial services, travel and transportation, and beyond.
- Hackathons are open to all registered users at www.machinehack.com. Participants must be 18 years or older.
- Only one account is allowed per participant; submissions from multiple accounts will lead to disqualification. This is an individual exercise.
- We expect that you respect the spirit of the competition and do not cheat. Privately sharing code or data is not permitted—any case of code plagiarism will result in the disqualification of all users involved.
- You must submit the well-commented, reproducible source code you write for this contest as a .zip or .tar archive; otherwise the submission will not be considered.
- The ideal candidate is expected to have 2-6 years of experience working as a Data Scientist or a Machine Learning Engineer.
- The leaderboard will be updated on the basis of the AUC score of the submitted predictions.
- Users must have an updated MachineHack profile and must specify their LinkedIn for final shortlisting.
Submission limits:
- You can make a maximum of 3 Excel file submissions per day.
- Hackathon will be live from 3rd January 2020 to 13th January 2020.
- Phase 1: AUC evaluation of submitted predictions – 3rd January 2020 to 10th January 2020.
- Phase 2: Time complexity evaluation of Submitted code files – 11th January 2020 to 13th January 2020.
Submitting Your Files For Evaluation
Submission window: January 3rd to January 10th 2020
The hackathon assignment requires participants to submit an Excel file (only .xlsx files are allowed) containing the unique identifier ‘patient_id’ and the corresponding prediction classes ‘outcome_flag’. Submissions are evaluated on AUC score, and the leaderboard is updated accordingly.
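Since the patient order of the sample submission must be preserved, it can help to build the submission rows from the sample file's ID order rather than from the model's output order. The following is a minimal sketch under stated assumptions: the helper names order_like_sample and write_xlsx are hypothetical, the column headers follow the names quoted above (check them against the sample file, including capitalization), and the writer assumes pandas and openpyxl are installed.

```python
def order_like_sample(sample_ids, predictions):
    """Return (patient_id, outcome_flag) rows in the sample file's order.

    sample_ids: patient IDs in the order given by the sample submission.
    predictions: dict mapping patient ID to the predicted value.
    Raises ValueError if any patient has no prediction.
    """
    missing = [pid for pid in sample_ids if pid not in predictions]
    if missing:
        raise ValueError(f"no prediction for {len(missing)} patient(s)")
    return [(pid, predictions[pid]) for pid in sample_ids]


def write_xlsx(rows, path):
    """Write the rows to an .xlsx file (assumes pandas and openpyxl)."""
    # Imported lazily so the ordering helper above has no dependencies.
    import pandas as pd

    frame = pd.DataFrame(rows, columns=["patient_id", "outcome_flag"])
    frame.to_excel(path, index=False)


# Made-up IDs and values; the sample order here is p3, p1, p2.
rows = order_like_sample(
    ["p3", "p1", "p2"],
    {"p1": 0.2, "p2": 0.9, "p3": 0.5},
)
print(rows)
```

Calling `write_xlsx(rows, "submission.xlsx")` would then produce the file; the file name is only an example.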
Submission window: January 11th to January 13th 2020
After 10th January, the Phase 2 submission window (FINAL SUBMISSION) will be enabled for sharing the following files as a single zip archive:
- The phase 1 submission file with the best AUC score on the leaderboard. (Best Score.xlsx)
- A csv file containing a list of all the features along with their fitness value. (Fitness_Score.csv)
- A Python script that can be executed on a machine with 16 GB of RAM to recreate the features and fitness values. (Feature_Pipeline.py)
- Well-commented, reproducible source code (Python script) for the best AUC, which will be evaluated for time complexity. (Model.py)
- A fully documented approach in a PPT.
Participants must upload the above files in a zip archive to a submission portal which will be enabled on January 11th 2020.
- The best submission will receive a MacBook and a free pass to the Machine Learning Developers Summit 2020.
- The shortlisted participants will get an opportunity to present their approach at MLDS 2020 to a panel of experienced ZS leaders and participants of the summit.
- The top 15 participants will get an opportunity to be interviewed for the role of Data Scientist / Machine Learning Engineer at ZS Associates (you can choose between the two roles based on your area of expertise).
THE CANDIDATES WILL HAVE TO MAKE SUBMISSIONS AS PER THE GUIDELINES MENTIONED BELOW:
Objective 1: Auto Feature Engineering
- A CSV file containing a list of all the features along with their fitness value.
Once the set of mandatory features has been created from the training data, evaluate the fitness of these features using the following methodology. (A starter Python notebook is available here.)
Any submission whose total error exceeds 1% will be considered invalid. The total error is the sum of the % errors across the fitness values of all features (~40,000 features).
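The validity rule can be checked locally before submitting. The sketch below is an illustration only: it assumes the % error of a feature is the absolute difference between the submitted and reference fitness values, divided by the absolute reference value, times 100. The organizers' exact formula is not stated here, and the function name total_percent_error and the feature names are hypothetical.

```python
def total_percent_error(reference, submitted):
    """Sum of per-feature percentage errors (assumed definition).

    reference: dict of feature name -> fitness value from Fitness_values.csv.
    submitted: dict of feature name -> fitness value you computed.
    Raises ValueError if a reference feature is missing from submitted.
    """
    total = 0.0
    for name, ref in reference.items():
        if name not in submitted:
            raise ValueError(f"missing feature: {name}")
        diff = abs(submitted[name] - ref)
        if ref == 0:
            # Percentage error is undefined for a zero reference; treat any
            # mismatch as infinite error (an assumption of this sketch).
            total += 0.0 if diff == 0 else float("inf")
        else:
            total += 100.0 * diff / abs(ref)
    return total


# Made-up fitness values for two hypothetical features.
reference = {"f1": 0.50, "f2": 0.20}
submitted = {"f1": 0.501, "f2": 0.20}

error = total_percent_error(reference, submitted)
print(f"total error: {error:.3f}%", "valid" if error <= 1.0 else "invalid")
```

Because the errors are summed over roughly 40,000 features, even tiny per-feature deviations add up, so it is worth running a check like this on the full feature set rather than on a sample.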
Objective 2: Patient Switching Probabilities
Test Set Predictions: The test set predictions are evaluated using AUC. Submit an Excel file (.xlsx only) containing the prediction probability for every patient in the test data.
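To estimate the leaderboard metric on a held-out part of the training data, AUC can be computed from the ranks of the predicted scores. The sketch below uses the standard rank-based (Mann-Whitney) formula with tied scores sharing an average rank; it is not the organizers' scorer, whose implementation is not specified, and the function name auc_score is hypothetical.

```python
def auc_score(labels, scores):
    """AUC: probability that a random positive outranks a random negative.

    labels: sequence of 0/1 outcome flags.
    scores: predicted probabilities, same length as labels.
    Tied scores share their average rank, so a tie counts as half.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of equal scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Made-up labels and scores for four patients.
auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

AUC depends only on how the scores rank patients, not on their absolute values, which is why graded probabilities usually score better than hard 0/1 predictions.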
The top submissions will be decided based on the leaderboard score and an evaluation of the other submission documents (code file, .csv file, supporting notebooks, and the model pipeline in .py or .ipynb format). All good submissions will be eligible to receive the bounties.