In a simple term, data science derives insight from data to help businesses make well calculated and evaluated decisions. Nowadays, we are generating 1000’s of billion bytes of data. This generation rate is behind the popularity of data science tools such as Artificial Intelligence, machine learning and data-handling tools. Data science involves extracting, processing and analysing tons of data. You might know SQL is one of the most prominent tools used to deal with such a large amount of data. Every application that manipulates data over the web needs to store that data somewhere. Whether it is Big data or a few rows table, a small startup or government or databases spanning multiple platforms such as mobile phones, you will find SQL for all these applications.
SQL can be used to store, access and extract massive amounts of data to carry out the whole Data Science process smoothly. The beginning process involved in Data Science is to perform a lot of querying operations, a lot of search operation, extraction operation, editing or modifying operation to do all that while having Big Data we need a huge management system along with that we need language to perform all the operation that we want to do with data that’s why SQL comes in picture.
So what is SQL? It stands for Structured Query Language; it is the language of database and specially built to communicate with databases. SQL is the simplest language, as simple as the English language, because commands are structured straight as general English words. Those words are structured like a declared statement. That’s why SQL is sometimes also called a declarative language.
Sign up for your weekly dose of what's up in emerging technology.
As it is a database language we need to choose a database management system (DBMS) in order to perform queries. Before choosing DBMS, we will go through what database means and what are the types of database.
The database is an organized collection of data stored and accessed electronically from a computer system and is a bidirectional process. Different kinds of databases are used nowadays based on features, capability, functionality, size, and performance. Those are Distributed database, Object-oriented database, Centralized database, operational database, NoSQL Database, Relational database, etc. Among these databases, Relational databases and NoSQL databases are popular Data Science.
A relational database, information in a dataset is stored in objects which are called Tables. These are uniquely identified by their name and number of columns, rows comprised in it. And this many tables will have relationships or links between them, so this database is called a Relational Database. In simple terms, this database is like an Excel spreadsheet.
Let’s see some of the popular Database management systems used widely and are MongoDB, Postgre, Microsoft Access, Microsoft SQL Server, MySQL and Oracle DB; each of these databases has its own pros and cons and is used accordingly.
To demonstrate the SQL will be using MySQL database management system, it is an open-source relational database system (RDBMS). In addition to Relational databases, SQL and RDBMS like MySQL works with an operating system to implement a relational database in computer storage systems, manage users, allow for network access and facilitate testing database integrity and creation of backups.
To proceed with MySQL, we need to download and install MySQL server to your machine from the official website next, you need to install mysql.connector to connect python script to the MySQL database this can be done by installing the MSI file from here or directly you can use pip command to install it like;
! pip install mysql-connector-python
If you wish to work directly to SQL without using python in that sequence, you can use MySQL Workbench, which provides all the functionality we are going to perform for that you need to download and install MySQL workbench from here. After installing and connecting to the server, the interface looks like this;
Python Implementation of MySQL:
Connect to MySQL server and database:
After successfully installing the MySQL server, the connect() constructor is used to connect with the MySQL server and returns the connection object as shown below;
import mysql.connector db = mysql.connector.connect(host='localhost',user='root', passwd='password', database='tutorial')
All the parameters defined in connect are available on the server-side; if the server is hosted at a local machine host and the user will be the same, passwd will be as defined while creating the server, and the database holds the name of the database to which connect.
For fresh server there will not be any dataset available you need to create one as shown below;
import mysql.connector db = mysql.connector.connect(host='localhost',user='root', passwd='password') cursor = db.cursor() cursor.execute('CREATE DATABASE tutorial') cursor.execute('SHOW databases') datasets = cursor.fetchall() for dataset in datasets: print(dataset)
The cursor class is used to execute the SQL statement in python, where the execute method is used to compile a SQL statement. So whatever we write in the execute() method is a SQL command. In the above example, the statements in capital latter are keywords; these can be written in small letters also.
In the above example, we have created a database named tutorial, cursor.fetchall() holds the status of the last execution done by the cursor. Below is the list of databases available at this server;
‘USE tutorial’ selects the database in which we are going to create the table. ‘CREATE TABLE employees’ create a table named employees further in a round bracket in the same command we have defined the name of the column; there is a total three-column we have created as emp_no, name address.
NOT NULL means while we are filling the data, we have entered all column values; if not, it will through an error. This is because we have used NOT NULL statements. This statement is very helpful where we need to collect some necessary information from the user.
AUTO_INCREMENT is used to increment each value uniformly by one; here, in this case, we have used it in emp_no. So whenever adding data to the table, we need not specify emp_no as it is automatically increased for each instance.
PRIMARY_KEY is used to represent a unique value in a table with which we can find out the particular row.
int, VARCHAR is allowed data type for each column; if through name for emp_no it will give error.
query = "INSERT INTO employees (name, address) VALUES (%s, %s)" values = [ ("Sam", "LA"), ("Mick", "NewYork"), ("Michael", "HongKong"), ("Neha", "Delhi")] cursor.executemany(query,values) db.commit() print(cursor.rowcount,'rows added')
See the description of the table;
cursor.execute('DESC employees') describes = cursor.fetchall() for dec in describes: print(dec)
See the inserted data;
cursor.execute('SELECT * FROM employees') records = cursor.fetchall() for record in records: print(record)
SELECT * indicates we have selected all the column from table employees if wish to select the particular column you can directly specify name there
query = "SELECT * FROM employees WHERE emp_no = 3" cursor.execute(query) print(cursor.fetchall())
The statement is pretty straightforward; it has selected row no 3 for all the columns. Here you can specify any condition that needs to be executed.
Use the order by to sort the data in ascending order or descending order by default; it will sort in ascending order for descending use keyword DESC.
query = "SELECT * FROM employees ORDER BY name" cursor.execute(query) results = cursor.fetchall() for result in results: print(result)
query = "SELECT * FROM employees ORDER BY address DESC" cursor.execute(query) results = cursor.fetchall() for result in results: print(result)
Delete and Update commands:
This section will see how to delete the row-wise elements, update the elements, and delete tables and databases from the server.
query = "DELETE FROM employees WHERE emp_no = 2" cursor.execute(query) db.commit() query = "SELECT * FROM employees" cursor.execute(query) results = cursor.fetchall() for result in results: print(result)
The second row is deleted.
query = "UPDATE employees SET name = 'Mohit' WHERE emp_no = 1" cursor.execute(query) db.commit() query = "SELECT * FROM employees" cursor.execute(query) results = cursor.fetchall() for result in results: print(result)
We have updated the first column and first instance by replacing SAM by Mohit;
This is all about a quick start article for SQL. We have seen how a data management system utilises SQL to satisfy the related operation. We have also seen how relational databases can be manipulated by using mysql.connector. Also, we can perform all this activity directly in MySQL workbench, where we import datasets from different sources such as CSV and JASON files. To use this with colab, follow this setup from the support page, which shows how can we connect to MySQL through Colab.