Moving a data science workflow from a laptop or desktop to the cloud is usually motivated by two things: access to more memory and the ability to share the load of large datasets. As data volumes grow, running code faster and handling more information become essential, which makes the cloud attractive for data science work. The choice of platform depends on one’s priorities, and Google Cloud Platform (GCP) is one of several on which a data science environment can be set up comfortably.
This walkthrough covers the steps that follow once you have a GCP account and a project. If you do not have them yet, first create an account with the ‘try it free’ option and then create a new project from the Resource Management page.
GCP offers many services and makes it easy to install the libraries and tools one needs. To work this way, one needs access to a virtual computer on GCP, which is called a virtual machine (VM) instance. This does not give access to a dedicated computer, but it provides the CPU and memory needed. Google Compute Engine (GCE) lists its pricing options on its website, and one can choose any of them based on the requirements.
First, go to the VM instances page; GCP asks for login credentials if one is not already logged in. Then create a project or, if one already exists, select it.
Click on Create instance on the VM instances page, which appears once a project has been selected. Then:
- Enter a name for the instance in the ‘Name’ block.
- Under ‘Zone’, select a geographical zone close to where one works. GCP provides details on the various zones on its site.
- Under ‘Machine Type’, decide how powerful a machine is needed to carry out the tasks. GCP documents the available machine types on its site.
- Under ‘Boot Disk’, choose the operating system the virtual instance runs on. One example is Ubuntu LTS, an accessible version of Linux.
- Finally, click Create.
Data Science Environment Set Up
A Google Compute Engine instance is launched from the Google Compute Engine page, and its console is then used to set up the data science environment.
First Step: Install Anaconda from the command line. The Unix tool curl is the easiest way to download the binary setup file. By default curl prints whatever it downloads from a URL to the terminal; the -O flag (capital letter O, not zero) makes it write the download to a file named after the remote file instead.
curl -O https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
After the download, one can use bash to run the installer script:
bash Anaconda3-5.0.1-Linux-x86_64.sh
Type ‘yes’ to accept the license. When the installation is finished, add the ‘conda’ program to the PATH, the list of directories that the OS searches for programs.
This finishes the Anaconda installation.
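To make the PATH lookup concrete, here is a minimal Python sketch of how a directory placed at the front of PATH gets searched first. It assumes the installer's default location of ~/anaconda3; the helper prepend_to_path is a hypothetical name used only for illustration, and the sketch inspects a PATH string rather than changing the shell's own PATH.

```python
import os
import shutil

def prepend_to_path(directory, path):
    """Return `path` with `directory` moved to the front, without duplicates."""
    parts = [p for p in path.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory, *parts])

# Assumed install location; adjust if the installer was pointed elsewhere.
conda_bin = os.path.expanduser("~/anaconda3/bin")
search_path = prepend_to_path(conda_bin, os.environ.get("PATH", ""))

# shutil.which checks the directories in order, much as the shell does.
# It returns None when no directory holds an executable named "conda".
print(shutil.which("conda", path=search_path))
```

If this prints a path ending in anaconda3/bin/conda, the same directory added to PATH in the shell's startup file should make the conda command available in new sessions.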
Setting up a static IP Address
Setting up a static IP address is important because the external IP address assigned by default is dynamic and can change when the instance is restarted, which would change the address used to reach services on it such as Jupyter Notebook.
To do this, go to Networking -> VPC Network -> External IP Addresses in the GCP console.
GCP usually charges a small fee for a reserved static IP address that is not attached to a running machine. Details are given in the GCP pricing documentation.
Add Firewall Exceptions
With a static IP address, the local computer and the cloud instance have a fixed address to communicate over, but most cloud providers have a firewall that blocks incoming access to most ports. In this walkthrough the Jupyter Notebook server runs on port 8000, so an exception must be added for incoming requests on that port.
One can modify the firewall settings so that incoming network packets can reach the server. Go to Firewall rules, then click on ‘Create Firewall Rule’.
Next, fill in the required fields:
- Name: a name for the firewall rule
- Source IP ranges: 0.0.0.0/0
- Allowed protocols and ports: tcp:8000
The last step is to configure Jupyter Notebook to use the TCP port specified in the firewall rule.
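The source range 0.0.0.0/0 is CIDR notation for every IPv4 address, so the rule above lets any machine on the internet reach port 8000. As an illustration only (this is not how GCP evaluates rules internally, and is_allowed is a hypothetical helper), Python's standard ipaddress module shows how a source range matches an address and what a narrower range would look like. The addresses used are placeholders from the ranges reserved for documentation.

```python
import ipaddress

def is_allowed(source_ip, source_ranges):
    """Return True if source_ip falls inside any of the CIDR ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(r) for r in source_ranges)

# 0.0.0.0/0 matches every IPv4 address.
print(is_allowed("203.0.113.7", ["0.0.0.0/0"]))        # True

# A narrower range admits only addresses inside it.
print(is_allowed("203.0.113.7", ["198.51.100.0/24"]))   # False
print(is_allowed("198.51.100.20", ["198.51.100.0/24"])) # True
```

Restricting the source range to one's own network is a common way to limit who can reach the notebook server, at the cost of updating the rule whenever that network's address changes.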
Configure Jupyter Notebook to Use the Port
To configure the port:
Run the following command to generate a config file:
jupyter notebook --generate-config
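Start Jupyter Notebook with the following flags. The --ip=0.0.0.0 flag makes the server listen on all network interfaces; by default it accepts connections only from the instance itself, so the page would not be reachable from the local computer:
jupyter notebook --no-browser --ip=0.0.0.0 --port=8000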
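The generated config file is written to ~/.jupyter/jupyter_notebook_config.py and is itself Python, so the same settings can be stored there instead of being passed as flags each time. A minimal sketch, assuming the classic Notebook server (releases built on Jupyter Server use c.ServerApp in place of c.NotebookApp):

```python
# ~/.jupyter/jupyter_notebook_config.py
# `c` is the config object Jupyter supplies when it loads this file,
# so this fragment is not meant to be run on its own.
c = get_config()  # noqa: F821

c.NotebookApp.ip = "0.0.0.0"        # listen on all interfaces, not only localhost
c.NotebookApp.port = 8000           # must match the port opened in the firewall rule
c.NotebookApp.open_browser = False  # same effect as --no-browser
```

With these lines in place, running jupyter notebook without extra flags should start the server on port 8000.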
On the local computer, navigate to http://IP_ADDRESS:8000, where IP_ADDRESS is the static IP address reserved earlier. The Jupyter Notebook page loads, and one can upload files directly to the cloud instance using the Upload option.
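If the page does not load, a quick check from the local computer can show whether the port is reachable at all. The sketch below uses only the standard library; notebook_url and port_is_open are hypothetical helper names, and 203.0.113.10 is a placeholder to be replaced with the instance's static IP address.

```python
import socket

def notebook_url(ip_address, port=8000):
    """Build the address to open in the browser."""
    return f"http://{ip_address}:{port}"

def port_is_open(host, port=8000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name lookup failed
        return False

# Placeholder address from the documentation range; substitute the static IP.
print(notebook_url("203.0.113.10"))  # http://203.0.113.10:8000
```

Calling port_is_open with the static IP address returns False when the firewall rule is missing, the server is not running, or the server is listening only on localhost.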
This completes the setup of a data science environment on Google Cloud Platform (GCP).