Guide to Regular Expressions (Regex) with Python Code

Introduction

A regular expression (regex, regexp) is a sequence of characters that defines a search pattern. Such patterns are typically used for find or find-and-replace operations on strings.

Regular expressions are commonly used in search engines, text processing, web scraping, pattern matching, etc. A regex specifies the rules for matching a set of possible strings; with these rules you can ask questions such as “does this string match the pattern?” or “where in this string does the pattern occur?”. Regex can also be used to modify strings in various ways. This article covers some common uses of regular expressions with the re module in Python.

Python has a built-in module called re for working with regular expressions. It provides functions such as findall, search, split and sub for searching strings for one or more patterns and for manipulating the matches.

Let’s get started with the re module

Import the re module.

import re

After importing the re module let’s look at a basic search operation: 

 pattern = "^analytics.*magazine$"
 test_string = 'analytics india magazine'
 result = re.search(pattern, test_string)
 if result:
   print("Search successful.")
 else:
   print("Search unsuccessful.")  

Output:

 Search successful.

In this example, the pattern matches a string that starts with “analytics” and ends with “magazine”, with any characters (other than a newline) in between, so the search on ‘analytics india magazine’ succeeds. To go further with re, you need to know the metacharacters. Metacharacters are characters with a special meaning in a pattern; they control how the regular expression matches and are used to define the pattern for searching or manipulation.

Python’s re module uses the following metacharacters:

 . ^ $ * + ? { } [ ] \ | ( )

For some examples of metacharacter, you can go through this link.
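
As a rough illustration of a few of these metacharacters, here is a minimal sketch. The sample strings and variable names are made up for this example; only the re functions themselves come from the standard library.

```python
import re

text = "analytics india magazine"

# . matches any single character except a newline
dot_matches = re.findall(r"a.a", text)

# ^ anchors the match at the start of the string, $ at the end
starts = re.search(r"^analytics", text) is not None
ends = re.search(r"magazine$", text) is not None

# ? makes the preceding character optional
optional = re.findall(r"colou?r", "color colour")

# | matches either of the alternatives
either = re.findall(r"india|magazine", text)

# {2} repeats the preceding character exactly twice
repeated = re.findall(r"l{2}", "hello all")

print(dot_matches)   # ['ana', 'aga']
print(starts, ends)  # True True
print(optional)      # ['color', 'colour']
print(either)        # ['india', 'magazine']
print(repeated)      # ['ll', 'll']
```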

Special sequence 

There are some basic predefined character classes, which are represented by special sequences. Each special sequence has its own meaning and helps us find or match strings or sets of strings in a pattern. A special sequence consists of a \ (backslash) followed by a character.

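Some common regex special sequences and their meanings are listed below:

  • \A matches only at the start of the string
  • \b matches at a word boundary (the start or end of a word)
  • \B matches where there is no word boundary
  • \d matches a digit
  • \D matches any character that is not a digit
  • \s matches a whitespace character
  • \S matches any character that is not whitespace
  • \w matches a word character (a letter, digit or underscore)
  • \W matches any character that is not a word character
  • \Z matches only at the end of the string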

For some examples of special sequences, you can go through this link.
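
The following sketch shows a few special sequences in use. The sample sentence and variable names are assumptions chosen for illustration. The patterns are written as raw strings (r"...") so that Python passes the backslash through to re unchanged.

```python
import re

text = "Issue 42 of analytics india magazine"

# \d+ matches one or more digits
digits = re.findall(r"\d+", text)

# \w+ matches runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)

# \b requires a word boundary, so only words that START with "in" match
in_words = re.findall(r"\bin\w*", text)

# \A matches only at the very start of the string
at_start = re.search(r"\AIssue", text) is not None

print(digits)    # ['42']
print(words)     # ['Issue', '42', 'of', 'analytics', 'india', 'magazine']
print(in_words)  # ['india']
print(at_start)  # True
```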

Sets (character sets or character classes) 

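A character set is a group of characters enclosed in square brackets. A set matches exactly one character out of those listed. Simply place the characters you want to match inside square brackets. For example, we can use we[ea]k to match either week or weak.

This is a common feature of regular expressions. It lets you search for a word even if it is spelled in more than one way. Some sets in Python regex and their meanings are listed below:

  • [arn] matches any one of the characters a, r or n
  • [a-n] matches any lowercase letter from a to n
  • [^arn] matches any character except a, r and n
  • [0-9] matches any digit from 0 to 9
  • [a-zA-Z] matches any letter from a to z, lowercase or uppercase
  • [+] matches a literal +, because most metacharacters lose their special meaning inside a set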

For some examples of character sets, you can go through this link.
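
A short sketch of character sets, using the we[ea]k pattern from the text above. The other sample strings and the variable names are made up for illustration.

```python
import re

# [ea] matches exactly one character: either e or a
spellings = re.findall(r"we[ea]k", "a weak week")

# [0-9]+ matches one or more digits
numbers = re.findall(r"[0-9]+", "room 101, floor 3")

# [a-c] matches any one lowercase letter from a to c
a_to_c = re.findall(r"[a-c]", "analytics")

# [^...] negates the set: here, anything that is not a vowel
non_vowels = re.findall(r"[^aeiou]", "india")

print(spellings)   # ['weak', 'week']
print(numbers)     # ['101', '3']
print(a_to_c)      # ['a', 'a', 'c']
print(non_vowels)  # ['n', 'd']
```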

The compile() function

re.compile(pattern, flags=0):

The compile function compiles a regular expression pattern into a pattern object, which can then be reused for matching. In this article, we will use this function together with the other functions. In Python, we can use it like this:

Input:

 import re
 pattern=re.compile('analytics india magazine')
 print(pattern) 

Output:

 re.compile('analytics india magazine')
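
A compiled pattern object also has its own search and findall methods, and compile accepts optional flags. The sketch below assumes a case-insensitive search is wanted; the sample strings and variable names are illustrative only.

```python
import re

# Compile once, with a flag that makes matching case-insensitive
pattern = re.compile(r"magazine", re.IGNORECASE)

# The same pattern object can be reused on different strings
found = pattern.search("Analytics India Magazine") is not None
all_cases = pattern.findall("Magazine, magazine, MAGAZINE")

print(pattern.pattern)  # magazine
print(found)            # True
print(all_cases)        # ['Magazine', 'magazine', 'MAGAZINE']
```

Compiling is mainly a convenience when the same pattern is applied many times, since the pattern and its flags are written in one place.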

The findall() function

This function returns all non-overlapping matches of a pattern in a string as a list. Below is an example.

Input: 

 import re
 pattern=re.compile('analytics india magazine')
 txt = "The article in analytics india magazine"
 x = re.findall(pattern, txt)
 print(x) 

Output:

 ['analytics india magazine']

If there is no match in the text, findall returns an empty list.

Input: 

 pattern=re.compile('nothing')
 txt = "The article in analytics india magazine"
 x = re.findall(pattern, txt)
 print(x) 

Output:

 []

In the examples above, findall is given the pattern object that we compiled.
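
findall becomes more useful with a pattern that can match several different substrings. The sketch below uses a made-up sentence and a simple date-like pattern as an assumption. Note that when the pattern contains groups, findall returns a list of tuples instead of a list of strings.

```python
import re

txt = "Backups ran on 2021-03-14 and again on 2021-04-02."

# Without groups: a list of the full matches
dates = re.findall(r"\d{4}-\d{2}-\d{2}", txt)

# With groups: a list of tuples, one item per group
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", txt)

print(dates)  # ['2021-03-14', '2021-04-02']
print(parts)  # [('2021', '03', '14'), ('2021', '04', '02')]
```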

The search() function

The search function scans the string for the first location where the pattern matches and returns a match object, which tells us where the match starts.

Input: 

 txt = "i love the analytics india magazine"
 pattern=re.compile('analytics india magazine')
 x = re.search(pattern, txt)
 print("The match starts at position:", x.start())

Output:

 The match starts at position: 11

If no match is found, search returns None.

Input:

 txt = "i love the analytics india magazine"
 pattern=re.compile('nothing')
 x = re.search(pattern, txt)
 print(x) 

Output:

 None
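
Because search can return None, calling a method such as start() on the result without checking it first raises an AttributeError. A minimal sketch of a guard follows; find_start is a hypothetical helper name, not part of the re module.

```python
import re

def find_start(pattern, text):
    """Return the start index of the first match, or -1 if there is none."""
    match = re.search(pattern, text)
    # A match object is always truthy; None is falsy
    return match.start() if match else -1

txt = "i love the analytics india magazine"
found_at = find_start("analytics", txt)
missing = find_start("nothing", txt)

print(found_at)  # 11
print(missing)   # -1
```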

Match object

A match object holds information about the search we have just done. Here we look at three of its members:

  • object.string
  • object.span()
  • object.group()

object.string

This attribute holds the string that was passed to the search function.

Input:

 txt = "i love the analytics india magazine"
 pattern=re.compile('analytics india magazine')
 x = re.search(pattern, txt)
 print(x.string) 

Output:

 i love the analytics india magazine

object.span()

We use it to get the start and end positions of the match; it returns them as a tuple.

Input:

 txt = "i love the analytics india magazine"
 pattern=re.compile('analytics india magazine')
 x = re.search(pattern, txt)
 print(x.span()) 

Output:

 (11, 35)

object.group()

This function returns the part of the string where the search function found the match.

Input :

 txt = "i love the analytics india magazine"
 pattern=re.compile('analytics india magazine')
 x = re.search(pattern, txt)
 print(x.group()) 

Output:

 analytics india magazine
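
When the pattern contains parenthesised groups, group() and span() also accept a group number. The sketch below assumes a pattern with two groups around the word “india”; the pattern and variable names are chosen only for illustration.

```python
import re

txt = "i love the analytics india magazine"
x = re.search(r"(\w+) india (\w+)", txt)

whole = x.group()       # the entire match (same as x.group(0))
first = x.group(1)      # text captured by the first group
second = x.group(2)     # text captured by the second group
both = x.groups()       # all captured groups as a tuple
first_span = x.span(1)  # start and end positions of the first group

print(whole)       # analytics india magazine
print(first)       # analytics
print(second)      # magazine
print(both)        # ('analytics', 'magazine')
print(first_span)  # (11, 20)
```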

The split() function

The split function splits a string at each match of the pattern and returns the pieces as a list of strings.

Input :

 txt = "analytics_india magazine"
 x = re.split(r"\s", txt)
 print(x) 

Output:

 ['analytics_india', 'magazine']

Splitting text using _(underscore) separator:

Input: 

 txt = "analytics_india magazine"
 x = re.split("_", txt)
 print(x) 

Output:

 ['analytics', 'india magazine']
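
The two separators can also be combined in one pattern with a character set, and the optional maxsplit argument limits the number of splits. A minimal sketch, with illustrative variable names:

```python
import re

txt = "analytics_india magazine"

# [_\s] matches either an underscore or a whitespace character
all_parts = re.split(r"[_\s]", txt)

# maxsplit=1 stops after the first split; the rest stays in one piece
first_only = re.split(r"[_\s]", txt, maxsplit=1)

print(all_parts)   # ['analytics', 'india', 'magazine']
print(first_only)  # ['analytics', 'india magazine']
```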

The sub() function

The sub function replaces every match of the pattern in the string with the text of your choice.

Input:

 txt = "i love the analytics india magazine"
 x = re.sub(r"\s", "_", txt)
 print(x) 

Output:

 i_love_the_analytics_india_magazine
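
sub also takes an optional count argument, the replacement text can refer back to captured groups with \1, \2 and so on, and the related subn function reports how many replacements were made. The sketch below is illustrative; the patterns and variable names are assumptions.

```python
import re

txt = "i love the analytics india magazine"

# count=2 replaces only the first two matches
first_two = re.sub(r"\s", "_", txt, count=2)

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+) india (\w+)", r"\2 india \1", txt)

# subn returns a tuple: (new string, number of replacements)
replaced, n = re.subn(r"\s", "_", txt)

print(first_two)  # i_love_the analytics india magazine
print(swapped)    # i love the magazine india analytics
print(n)          # 5
```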

Applications of Regex

  • Regex is widely used in text processing.
  • It is also widely used in search engines to improve the user’s search experience.
  • Regex is used in data scraping (web scraping), data munging, data wrangling and many other tasks.

In this article, we have seen the basics of regex. We discussed and implemented some functions and looked at metacharacters, special sequences and character sets. Regex has many use cases, and knowing what each function and special character does will help you make good use of the re module.

Reference:

All the information written in this article is gathered from :

Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
