Last updated February 17, 2020

Introduction To GitHub For Aspiring Data Scientists

Published on February 17, 2020

by Anu Thomas

Popularly considered to be the exclusive domain of software developers, Git, the most widely used modern version control system in the world, is increasingly being used by data scientists today. This may strike some people as odd: how can a version control tool add value to the daily work of a data scientist, who is commonly thought to work in a silo?

The key here is that the scope of a data scientist’s work is broadening. Any software professional worth their salt sees value in working as a team, and it is no different for a data scientist. In addition, data scientists are increasingly expected to write code to put models into production.

As a result, experience with version control is increasingly seen as a requirement for all data scientists today. Imagine this scenario: two data scientists in a team are working on the same function to build a machine learning model.

Now, suppose that one of them makes some changes to the function, pushes them to a remote repository and gets them merged into the master branch. Let us call this version 1.1 of the model. A colleague then makes further changes to the same function, starting from version 1.1, and these changes are also merged into the master branch (version 1.2). If bugs are discovered in version 1.2 at any point, it is possible to go back to version 1.1, or to fix the bug and make a new commit.
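The scenario above can be sketched on a local machine. This is a minimal illustration, not a prescribed workflow: the file name ‘model.py’, its contents, the tags ‘v1.1’ and ‘v1.2’ and the demo identity are all hypothetical, and ‘git revert’ is just one way to go back to the earlier behaviour by making a new commit.

```shell
set -e
# Throwaway repository in a temporary directory (illustration only)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Version 1.1: the first data scientist's change
echo "def score(x): return x * 2" > model.py
git add model.py
git commit -q -m "Version 1.1"
git tag v1.1

# Version 1.2: the colleague's change, which turns out to be buggy
echo "def score(x): return x / 0" > model.py
git commit -q -am "Version 1.2"
git tag v1.2

# Undo version 1.2 with a new commit; history keeps all three commits
git revert --no-edit HEAD
cat model.py
```

Because the revert is itself a commit, nothing is lost: version 1.2 stays in the history and can still be inspected or fixed later.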

Bearing in mind these use cases, it would be good for data scientists to master the basics of GitHub, a firmly established platform that uses Git to apply version control to code. Home to over 40 million developers, it also opens up a lot of opportunities for data scientists to collaborate and manage projects together.

Now, you do not need to be an expert in Git to be able to use GitHub effectively – the key here is to understand the workflow of Git and how it can be used in the daily work of a data scientist. If you are already familiar with the platform and want to build a better data science portfolio, check out these seven tips. However, if you are still struggling to get your basics right while collaborating, here are some guidelines:

How to create a repository

To give you some context, the files for a project are stored in a central remote location called a ‘repository’. To get started, first create a free account by signing up on GitHub. Once you have created an account, click on the ‘new’ button to create a repository.

The next step is to type a name for your project and choose whether to make it public or private. Following this, check the box that says ‘initialise with README.md’ and click on ‘create repository’. If, however, you want to collaborate on someone else’s project, open the repository you wish to work on and click on the ‘Fork’ button.

You can now add and change files, but first you need to clone the repository to your local machine. Then make the changes you wish, and create a commit for each set of changes you want to record. This is where the concept of ‘branching’ comes in: a branch is a separate line of development within the repository, where you can make changes without fear of breaking the master copy.
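The clone-and-commit steps can be tried without touching a real project. In the sketch below, a local repository in a temporary directory stands in for the GitHub repository (an assumption made so the example runs offline); in practice the argument to ‘git clone’ would be the URL shown on your repository page. The file name ‘model.py’ and the demo identity are hypothetical.

```shell
set -e
demo=$(mktemp -d)

# Stand-in for the GitHub repository, seeded with a README (assumption)
git init -q "$demo/remote"
git -C "$demo/remote" symbolic-ref HEAD refs/heads/master
echo "# demo-project" > "$demo/remote/README.md"
git -C "$demo/remote" add README.md
git -C "$demo/remote" -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "Initial commit"

# Clone the repository to the local machine
git clone -q "$demo/remote" "$demo/local"
cd "$demo/local"
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Make a change, stage it, and record it as a commit
echo "print('hello')" > model.py
git add model.py
git commit -q -m "Add model script"

# Show the history: the README commit plus the new one
git log --oneline
```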

The local version of your repository is ready!

How to create a branch

As mentioned earlier, branching allows the project host to review the changes you have made before merging them into the master copy. This is especially important when other features rely on the code in the master branch working.

It is always good practice to check that your local project is up to date with the remote repository before creating a branch. To update your local repository, type ‘git pull’. Now, to create a branch, type ‘git branch my-branch’, and then switch to it with ‘git checkout my-branch’.
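Here is a small sketch of updating and then branching. As an assumption for the sake of a runnable example, a local repository in a temporary directory plays the role of the remote. Note that ‘git branch’ only creates the branch; a separate ‘git checkout’ is what switches to it.

```shell
set -e
demo=$(mktemp -d)

# Stand-in for the remote repository (assumption, so the example runs offline)
git init -q "$demo/remote"
git -C "$demo/remote" symbolic-ref HEAD refs/heads/master
echo "# demo-project" > "$demo/remote/README.md"
git -C "$demo/remote" add README.md
git -C "$demo/remote" -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "Initial commit"

git clone -q "$demo/remote" "$demo/local"
cd "$demo/local"

# Bring the local master branch up to date with the remote
git pull -q

# Create the branch, then switch to it
git branch my-branch
git checkout -q my-branch

# Print the name of the branch that is now checked out
git rev-parse --abbrev-ref HEAD
```

The two branch commands can also be combined into one with ‘git checkout -b my-branch’.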

Pull requests

Pull requests facilitate merging changes into the master version by allowing reviewers to check for conflicts before accepting alterations.

Now, before opening a pull request, you need to add and commit your changes. When you push from a new branch for the first time, you need to add the argument ‘--set-upstream origin my-branch’ to ‘git push’.
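The add, commit and first push can be sketched as follows. Again, a local repository stands in for GitHub (an assumption so the example runs offline), and the file name ‘features.py’ and the demo identity are hypothetical. Opening the pull request itself happens on the GitHub website and is not part of this sketch.

```shell
set -e
demo=$(mktemp -d)

# Stand-in for the remote repository (assumption)
git init -q "$demo/remote"
git -C "$demo/remote" symbolic-ref HEAD refs/heads/master
echo "# demo-project" > "$demo/remote/README.md"
git -C "$demo/remote" add README.md
git -C "$demo/remote" -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q -m "Initial commit"

git clone -q "$demo/remote" "$demo/local"
cd "$demo/local"
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Work on a new branch: add and commit the changes
git checkout -q -b my-branch
echo "FEATURES = ['age', 'income']" > features.py
git add features.py
git commit -q -m "Add feature list"

# First push from the new branch: create it on the remote and track it
git push -q --set-upstream origin my-branch

# Confirm which remote branch my-branch now tracks
git rev-parse --abbrev-ref "my-branch@{upstream}"
```

Once the upstream is set, later pushes from the same branch need only ‘git push’.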

After the push, GitHub will show a message on the repository page. Click on ‘Compare & pull request’ and then click on ‘Create pull request’.

If you are collaborating with someone, they can add comments as well. Once the comments are resolved and all changes have been reviewed, the reviewer can merge the pull request, after which your changes become part of the master branch.
