Introduction
A regular expression (regex or regexp) is a sequence of characters that defines a search pattern over text. These patterns are typically used for find and find-and-replace operations on strings.
Regular expressions are commonly used in search engines, text processing, web scraping and pattern matching. A pattern specifies the rules for matching a set of possible strings, and with those rules you can ask questions such as “does this string belong to a particular set of strings?” or “where in the text does this pattern occur?”. Regex can also be used to modify strings in various ways. This article covers some common uses of regular expressions with the re module in Python.
Python has a built-in package called re for regex, which provides functions such as findall, search, split and sub for searching for one or more patterns in a string.
Let’s get started with the Regular Expression (re) module
The re module ships with Python's standard library, so nothing needs to be installed before working with regular expressions.
Import the re module.
import re
After importing the re module, let’s look at a basic search operation:
pattern = r"^analytics.*magazine$"
test_string = 'analytics india magazine'
result = re.search(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
Output:
Search successful.
In this example, the pattern ^analytics.*magazine$ matches any string that starts with analytics and ends with magazine, so the search on ‘analytics india magazine’ succeeds. To go further with re, knowledge of metacharacters is necessary. Metacharacters are special characters that change how the regular expression matches text, and they are mostly used to define the pattern for a search or manipulation.
Below is the list of metacharacters:
. ^ $ * + ? { } [ ] \ | ( )
For some examples of metacharacters, you can go through this link.
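As a minimal sketch of how a few metacharacters behave, the snippet below tries several small patterns with re.search and re.findall. The sample strings and variable names are illustrative assumptions, not part of the original examples.

```python
import re

text = "analytics india magazine"

# ^ anchors the match at the start of the string, $ at the end.
starts_with = re.search(r"^analytics", text) is not None
ends_with = re.search(r"magazine$", text) is not None

# . matches any single character except a newline.
any_char = re.findall(r"m.g", text)

# ? makes the preceding character optional.
optional = re.findall(r"colou?r", "color and colour")

# | means "either the pattern on the left or the one on the right".
either = re.findall(r"india|magazine", text)

# A backslash removes the special meaning of a metacharacter.
literal_dot = re.findall(r"\.", "v1.2.3")

print(starts_with, ends_with)  # True True
print(any_char)                # ['mag']
print(optional)                # ['color', 'colour']
print(either)                  # ['india', 'magazine']
print(literal_dot)             # ['.', '.']
```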
Special sequence
Special sequences represent some basic predefined character classes. Each special sequence has a specific meaning that helps us match particular kinds of characters or positions in a string. A special sequence consists of a \ (backslash) followed by a character, usually a letter.
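Regex special sequences and their meanings are given in the following list:
- \A: matches only at the start of the string
- \b: matches at a word boundary (the start or end of a word)
- \B: matches at a position that is not a word boundary
- \d: matches any decimal digit
- \D: matches any character that is not a decimal digit
- \s: matches any whitespace character (space, tab, newline and so on)
- \S: matches any non-whitespace character
- \w: matches any word character (letters, digits and the underscore)
- \W: matches any character that is not a word character
- \Z: matches only at the end of the string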
For some examples of special sequences, you can go through this link.
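The following sketch shows a few common special sequences in use. The sample sentence is made up for illustration, and only standard re functions are used.

```python
import re

text = "Issue 42 was published on 2021-07-15"

# \d+ matches runs of decimal digits.
digits = re.findall(r"\d+", text)

# \w+ matches runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)

# \s+ matches runs of whitespace, so it works as a splitting rule.
parts = re.split(r"\s+", text)

# \b marks a word boundary, so this only matches "on" as a whole word.
whole_word = re.search(r"\bon\b", text)

print(digits)              # ['42', '2021', '07', '15']
print(words)               # ['Issue', '42', 'was', 'published', 'on', '2021', '07', '15']
print(parts)               # ['Issue', '42', 'was', 'published', 'on', '2021-07-15']
print(whole_word.group())  # on
```

Writing the patterns as raw strings (with the r prefix) keeps Python from interpreting the backslashes before the re module sees them.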
Sets (character sets or character classes)
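A character set is a group of characters enclosed in square brackets. A set matches exactly one character out of those it lists. Simply place the characters you want to match inside square brackets. For example, we can use we[ea]k to match either week or weak.
This is a common feature of regular expressions. It lets you search for a word even when it appears with a common misspelling. Some sets with a special meaning in Python regex are listed below:
- [arn]: matches one of the characters a, r or n
- [a-n]: matches any lowercase character from a to n
- [^arn]: matches any character except a, r and n
- [0-9]: matches any digit from 0 to 9
- [a-zA-Z]: matches any letter, lowercase or uppercase
- [+]: matches a literal +, because characters such as +, *, ., |, (, ), $, { and } have no special meaning inside a set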
For some examples of character sets, you can go through this link.
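A short sketch of character sets in practice follows. It reuses the we[ea]k pattern from the text; the sample sentence and variable names are assumptions made for illustration.

```python
import re

text = "A weak week: 3 cats, 12 dogs!"

# [ea] matches exactly one character, either e or a.
week_or_weak = re.findall(r"we[ea]k", text)

# [0-9] is a range; + repeats it one or more times.
numbers = re.findall(r"[0-9]+", text)

# A leading ^ inside the brackets negates the set:
# here, anything that is neither a letter nor whitespace.
non_letters = re.findall(r"[^a-zA-Z\s]", text)

print(week_or_weak)  # ['weak', 'week']
print(numbers)       # ['3', '12']
print(non_letters)   # [':', '3', ',', '1', '2', '!']
```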
The compile() function
re.compile(pattern, flags=0):
The compile function turns a pattern string into a pattern object, which can then be reused for matching. In this article, we will use this function together with the other functions. In Python, we can use it like this:
Input:
import re
pattern = re.compile('analytics india magazine')
print(pattern)
Output:
re.compile('analytics india magazine')
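As a hedged sketch of why compiling can be useful, the example below compiles a pattern once with the re.IGNORECASE flag and reuses it over several strings. The headlines list is invented for illustration.

```python
import re

# Compile once; re.IGNORECASE makes the match case-insensitive.
pattern = re.compile(r"analytics india magazine", re.IGNORECASE)

# Hypothetical sample data.
headlines = [
    "Analytics India Magazine turns ten",
    "A story about something else",
    "Read ANALYTICS INDIA MAGAZINE today",
]

# A compiled pattern object has its own search(), findall(),
# split() and sub() methods, so it can be reused directly.
matching = [h for h in headlines if pattern.search(h)]

print(matching)
# ['Analytics India Magazine turns ten', 'Read ANALYTICS INDIA MAGAZINE today']
```

Calling pattern.search(text) is equivalent to re.search(pattern, text); the method form simply avoids passing the pattern each time.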
The findall() function
This function returns all non-overlapping matches of a pattern in a string as a list. Below is an example.
Input:
import re
pattern = re.compile('analytics india magazine')
txt = "The article in analytics india magazine"
x = re.findall(pattern, txt)
print(x)
Output:
['analytics india magazine']
If there is no match in the text, findall returns an empty list.
Input:
pattern = re.compile('nothing')
txt = "The article in analytics india magazine"
x = re.findall(pattern, txt)
print(x)
Output:
[]
In the examples above, findall receives the compiled pattern object as its first argument.
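The findall examples above use a fixed phrase as the pattern. As a small sketch with a more general pattern, the snippet below collects every number in a string, and also shows that findall returns tuples when the pattern contains more than one capture group. The sample text is an assumption made for illustration.

```python
import re

txt = "order 12 shipped, order 7 pending, order 305 delivered"

# Without groups, findall returns the full text of every match.
numbers = re.findall(r"\d+", txt)

# With several capture groups, each match becomes a tuple of the groups.
pairs = re.findall(r"order (\d+) (\w+)", txt)

print(numbers)  # ['12', '7', '305']
print(pairs)    # [('12', 'shipped'), ('7', 'pending'), ('305', 'delivered')]
```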
The search() function
The search function scans a string for the first location where the pattern matches and returns a match object, which can report the position where the match starts.
Input:
txt = "i love the analytics india magazine"
pattern = re.compile('analytics india magazine')
x = re.search(pattern, txt)
print("The match starts at position:", x.start(), pattern)
Output:
The match starts at position: 11 re.compile('analytics india magazine')
If no match is found, search returns None.
Input:
txt = "i love the analytics india magazine"
pattern = re.compile('nothing')
x = re.search(pattern, txt)
print(x)
Output:
None
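Because search returns None when nothing matches, calling a method such as start() on the result without checking it first raises an AttributeError. A minimal sketch of a guard follows; find_start is a hypothetical helper written for this example, not part of the re module.

```python
import re

def find_start(pattern, text):
    """Return the start index of the first match, or -1 if there is none."""
    match = re.search(pattern, text)
    if match is None:
        # No match: avoid calling .start() on None.
        return -1
    return match.start()

txt = "i love the analytics india magazine"
print(find_start("analytics india magazine", txt))  # 11
print(find_start("nothing", txt))                   # -1
```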
Match object
A match object holds information about the search we have just performed. Three of its members are used here:
- object.string
- object.span()
- object.group()
object.string
This attribute holds the string that was passed to the search function. Note that string is an attribute, not a method, so it is used without parentheses.
Input:
txt = "i love the analytics india magazine"
pattern = re.compile('analytics india magazine')
x = re.search(pattern, txt)
print(x.string)
Output:
i love the analytics india magazine
object.span()
We use it to get the start and end positions of the match; it returns them as a tuple.
Input:
txt = "i love the analytics india magazine"
pattern = re.compile('analytics india magazine')
x = re.search(pattern, txt)
print(x.span())
Output:
(11, 35)
object.group()
This function returns the part of the string where the search function found the match.
Input:
txt = "i love the analytics india magazine"
pattern = re.compile('analytics india magazine')
x = re.search(pattern, txt)
print(x.group())
Output:
analytics india magazine
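The group and span methods also accept a group number or name when the pattern contains capture groups. The sketch below assumes a pattern with one named group (first, a name chosen for this example) and one numbered group.

```python
import re

txt = "i love the analytics india magazine"

# (?P<first>...) is a named group; the second (\w+) is group number 2.
match = re.search(r"(?P<first>\w+) (\w+) magazine", txt)

print(match.group())         # analytics india magazine
print(match.group("first"))  # analytics
print(match.group(2))        # india
print(match.groups())        # ('analytics', 'india')

# span(n) gives the position of a single group, usable for slicing.
start, end = match.span(2)
print(start, end, txt[start:end])  # 21 26 india
```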
The split() function
The split function splits a string wherever the pattern matches and returns the resulting list of strings.
Input:
txt = "analytics_india magazine"
x = re.split(r"\s", txt)
print(x)
Output:
['analytics_india', 'magazine']
Splitting the text on the _ (underscore) separator:
Input:
txt = "analytics_india magazine"
x = re.split("_", txt)
print(x)
Output:
['analytics', 'india magazine']
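The two splits above can be combined. As a small sketch, a character set lets split treat both the underscore and whitespace as separators, and the maxsplit argument limits how many splits are made.

```python
import re

txt = "analytics_india magazine"

# [_\s] matches either an underscore or a whitespace character.
parts = re.split(r"[_\s]", txt)

# maxsplit=1 stops after the first separator; the rest stays intact.
first_only = re.split(r"[_\s]", txt, maxsplit=1)

print(parts)       # ['analytics', 'india', 'magazine']
print(first_only)  # ['analytics', 'india magazine']
```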
The sub() function
The sub function replaces every match of the pattern in a string with the text of your choice.
Input:
txt = "i love the analytics india magazine"
x = re.sub(r"\s", "_", txt)
print(x)
Output:
i_love_the_analytics_india_magazine
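Beyond a plain replacement string, sub also accepts a count limit, backreferences to captured groups, and a function as the replacement. The following sketch illustrates these options on the same sample text; the variable names are chosen for this example.

```python
import re

txt = "i love the analytics india magazine"

# count limits the number of replacements (here, the first two spaces).
first_two = re.sub(r"\s", "_", txt, count=2)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(analytics) (india)", r"\2 \1", txt)

# The replacement can be a function that receives each match object.
capitalized = re.sub(r"\b\w", lambda m: m.group().upper(), txt)

print(first_two)    # i_love_the analytics india magazine
print(swapped)      # i love the india analytics magazine
print(capitalized)  # I Love The Analytics India Magazine
```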
Applications of Regex
- Regex is widely used in text processing.
- It is also widely used in search engines to improve the user's search experience.
- Regex is used in data scraping (web scraping), data munging, data wrangling and many other tasks.
These are the basics of regex covered in this article. We discussed and implemented some functions, metacharacters, special sequences and character sets. Regex has many use cases, and knowing what each function and special character does will help you get the most out of re.
References:
The information in this article is gathered from:
- Regular Expression HOWTO.
- Google colab for metacharacters.
- Google colab for special sequences.
- Google colab for character sets.
- Google colab for function codes.