MachineHack recently concluded the Data Engineering Championship, a hiring hackathon for data scientists and data engineers, organised in association with Publicis Sapient, iMerit, USEReady, Tiger Analytics and The Math Company.
The hackathon was part of the Data Engineering Summit (DES) 2022, presented by Google Cloud and organised by Analytics India Magazine, and drew over 700 registrations. The winners stood a chance to present their solution approach at DES 2022 and to land an interview with one of the leading analytics organisations.
You can read more about the dataset here.
Here are the solution approaches of the winners who secured the top three positions in the Data Engineering Championship.
Rank 01: Sylas John Rathinaraj
Rathinaraj got interested in predictive analytics in 2017. He attended Coursera and Udemy courses in statistics, exploratory data analysis (EDA), machine learning, data science and deep learning to improve his skills. In addition, he has participated in a slew of ML hackathons on different platforms to test and build his knowledge.
The participants were given an airport details dataset combined with weather information. For columns such as ‘DATE’, ‘LOW’, ‘HIGH’ and ‘TIMESTAMP’, participants could impute a constant value. Missing records in the year column can be imputed with 2020, since every other record has 2020 as the year. The same holds for the month column, which can be imputed with 01, since every other record has 01 (January). The main challenges in the dataset were:
- Data missingness
- Formula column with uncertainty
In the airport details and weather information dataset, 20 per cent of the data is missing for several columns. The bar chart below shows the count of non-missing records for each column.
The formula columns can be computed easily from the columns they depend on. For missing records in those dependent columns, Rathinaraj imputed the mean value of the corresponding group (a group-by mean).
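The two imputation ideas described above, a constant fill and a group-by mean fill, can be sketched with pandas as follows. This is a minimal illustration, not Rathinaraj's actual code: the toy data, the column names YEAR, MONTH, AIRPORT and TMAX, and the choice of AIRPORT as the grouping key are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data. All column names and values here are assumptions for illustration.
df = pd.DataFrame({
    "YEAR": [2020, np.nan, 2020, 2020, np.nan, 2020],
    "MONTH": [1, 1, np.nan, 1, 1, 1],
    "AIRPORT": ["A", "A", "A", "B", "B", "B"],
    "TMAX": [30.0, np.nan, 34.0, 50.0, 54.0, np.nan],
})

# Constant imputation: every observed record is from January 2020,
# so the gaps are filled with the same constants.
df["YEAR"] = df["YEAR"].fillna(2020).astype(int)
df["MONTH"] = df["MONTH"].fillna(1).astype(int)

# Group-by mean imputation: each missing TMAX is replaced by the mean
# of the non-missing TMAX values in the same (assumed) group.
group_mean = df.groupby("AIRPORT")["TMAX"].transform("mean")
df["TMAX"] = df["TMAX"].fillna(group_mean)

print(df)
```

Filling from the group mean rather than the overall mean keeps an imputed value close to the other records that share the same grouping key, which matters when groups differ a lot from each other.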
Formula column with uncertainty
The competition defined the WIND_CHILL column as “the perceived temperature due to the cooling effect of wind blowing”. Rathinaraj combined TMAX (maximum temperature), AWND (maximum wind speed of the day), SNOW and the time of day at which the flight departs to calculate the WIND_CHILL column. WIND_CHILL values range from 0 to 80 degrees Fahrenheit. Getting this column right was vital for the best score, since a wrong calculation can raise the mean absolute error by an amount in that same range (0 to 80).
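The article does not give the formula used in the competition or by Rathinaraj. For readers unfamiliar with the concept, the sketch below shows the standard US National Weather Service wind chill formula, which takes only air temperature (in degrees Fahrenheit) and wind speed (in miles per hour). It is an assumption that anything similar was used here; the competition's column also involved SNOW and departure time, which this formula ignores. The function name `wind_chill_f` is hypothetical.

```python
def wind_chill_f(temp_f: float, wind_mph: float) -> float:
    """NWS wind chill in deg F from air temperature (deg F) and wind speed (mph).

    Assumption: this is the standard NWS formula, not the competition's own.
    """
    # The NWS formula is defined only for temperatures at or below 50 F
    # and wind speeds above 3 mph. Outside that range this sketch simply
    # returns the air temperature unchanged.
    if temp_f > 50.0 or wind_mph <= 3.0:
        return temp_f
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v


print(round(wind_chill_f(30.0, 10.0), 1))  # 30 F air temperature, 10 mph wind
print(round(wind_chill_f(60.0, 10.0), 1))  # outside the valid range
```

Because the formula has a limited validity range, any real pipeline would need an explicit rule for records outside it; returning the air temperature is just one simple choice.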
Rathinaraj feels that MachineHack offers participants competitions across different ML and data engineering domains. “Participating in the competition helps me to become more knowledgeable. After the competition ends, I always spend time exploring the top-ranked achiever’s solution approach and codes,” he adds.
Check out Winners Solutions here.
Rank 02: Jeena Binex
Jeena worked as an embedded system engineer for nine years in Mumbai. For the last five years, she has worked at a courier company in Singapore, where she maintains the in-house ERP system, built on .NET Framework and a SQL database, and analyses the available data to identify trends in sales, operations, customer service and so on.
“I started analysing the data with the limited knowledge I had, and my interest in data analysing started here and hence decided to have in-depth knowledge in this field. So, in July 2021, I enrolled in a data science online course. After spending 12 months in the course, I studied supervised and unsupervised machine learning and time series. Then, I moved on to deep learning, NLP,” she said.
Jeena’s approach to the problem included the following steps:
- Reading through the dataset and understanding the meaning of each column of the dataset (26 columns)
- Reading through the features to be created and identifying the columns of the dataset that contribute to them. The main aim was to fill the missing values in these columns.
- Filling the missing values with two approaches: regression imputation, and mean and median imputation.
- Finally, she calculated the features using the formulas.
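The difference between the two imputation approaches in the steps above can be sketched with NumPy. This is an illustrative example under stated assumptions, not Jeena's code: the toy arrays, the use of TMAX as the predictor and TMIN as the column with gaps, and the straight-line fit are all assumptions.

```python
import numpy as np

# Toy data. TMAX as predictor and TMIN as the column with gaps are assumptions.
tmax = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
tmin = np.array([20.0, 25.0, 30.0, 35.0, np.nan])

missing = np.isnan(tmin)

# Mean and median imputation: every gap gets the same constant,
# computed from the observed values only.
mean_filled = np.where(missing, np.nanmean(tmin), tmin)
median_filled = np.where(missing, np.nanmedian(tmin), tmin)

# Regression imputation: fit a line on the complete rows,
# then predict the missing values from the related column.
slope, intercept = np.polyfit(tmax[~missing], tmin[~missing], deg=1)
regression_filled = tmin.copy()
regression_filled[missing] = slope * tmax[missing] + intercept

print(mean_filled[-1], median_filled[-1], regression_filled[-1])
```

Mean and median fills ignore the other columns, while a regression fill uses a correlated column to estimate each gap, so it tends to help when the columns are strongly related.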
“Solving hackathons helped put into practice the knowledge I gained from the theory, which was a huge confidence booster for me,” concluded Jeena.
Check out Winners Solutions here.
Rank 03: Suresh Arunachalam
Suresh has always been passionate about data science and curious about understanding its connection with real-world business use cases. “This curiosity enabled me to spend additional effort during the day and the weekends to learn more about it from the internet, which eventually created a pathway to knowing about the hackathon events happening across the globe in the data science space,” he said.
According to Suresh, the use case was to calculate wind chill, airline seat distribution, snow ratio and a few other useful pieces of information, along with the date and timestamp, from an airport data dump to help airline companies plan their trips. The dump contained about 200k rows and 26 columns with various information (such as wind speed, latitude, longitude, snowfall, flight ID, etc.).
He followed these steps:
- First, he removed the unwanted columns and replaced the null values using the max and median functions from NumPy.
- He did a column split to form the date and timestamp using Pandas.
- He then performed some basic arithmetic operations to calculate the expected use case results.
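The first two steps above can be sketched with pandas and NumPy as follows. This is a rough illustration rather than Suresh's code: the toy data and every column name (FLIGHT_ID, UNUSED, AWND, SNOW, DEPARTURE) are assumptions, and the NaN-aware `np.nanmedian` and `np.nanmax` are used because plain `np.median` and `np.max` return NaN when nulls are present. The arithmetic for the final features is left out, since the article does not state the formulas.

```python
import numpy as np
import pandas as pd

# Toy data. All column names and values are assumptions for illustration.
df = pd.DataFrame({
    "FLIGHT_ID": [101, 102, 103],
    "UNUSED": ["x", "y", "z"],
    "AWND": [5.0, np.nan, 9.0],
    "SNOW": [0.0, 2.0, np.nan],
    "DEPARTURE": [
        "2020-01-05 08:30:00",
        "2020-01-05 14:10:00",
        "2020-01-06 21:45:00",
    ],
})

# Step 1: drop a column that is not needed, then fill the nulls.
df = df.drop(columns=["UNUSED"])
df["AWND"] = df["AWND"].fillna(np.nanmedian(df["AWND"].to_numpy()))
df["SNOW"] = df["SNOW"].fillna(np.nanmax(df["SNOW"].to_numpy()))

# Step 2: split the combined string into separate date and time columns.
df[["DATE", "TIMESTAMP"]] = df["DEPARTURE"].str.split(" ", expand=True)

print(df)
```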
“I was delighted to be part of this hackathon event conducted by MachineHack, which helped me to improve my analytical and problem-solving skills. Moreover, the rules and guidelines set by MachineHack for such events helped in intuiting my competitive skills to keep myself in the top three positions every day on the leaderboard,” Suresh adds.