Regular Expression in Python

Regular expressions in Python are used to match a pattern against a string. Formally, a regular expression is a sequence of characters that defines a search pattern. The re module (short for "regular expression") is the Python module that provides all the regular expression features.

1. Using Python’s re module

Let’s look at some common uses of Python’s re module. It is part of the standard library, so we don’t need to install it.

1.1) re.search()

re.search(pattern, str) searches str (the search string) for the first place where pattern, a regular expression, matches. It returns a match object if the pattern is found and None otherwise.

Let us look at an example:

import re

str = 'This is a sample text which we use to search a pattern within the text.'

pat = r'text'

match = re.search(pat, str)

if match is None:
    print('Pattern not found')
else:
    print('Pattern found!')
    print('Match object', match)

Output

Pattern found!
Match object <re.Match object; span=(17, 21), match='text'>

As the output shows, there is indeed a match for the pattern. We searched for the simple word text in str, and the span gives the indices of the match. That is, str[17] through str[20] is matched (the end index, 21, is exclusive), which corresponds to the substring text, as expected. But re.search() returns only the first match.
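The match object also exposes the matched text and its indices directly. The following is a minimal sketch of that, reusing the sample sentence; the variable is named text here (rather than str) only to avoid shadowing Python's built-in str type.

```python
import re

text = 'This is a sample text which we use to search a pattern within the text.'
match = re.search(r'text', text)

if match is not None:
    print(match.group())    # the matched substring: 'text'
    print(match.start())    # index where the match begins: 17
    print(match.end())      # index just past the match: 21
    start, end = match.span()
    print(text[start:end])  # slicing with the span gives back 'text'
```

Because the end index is exclusive, text[start:end] reproduces the match exactly.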

1.2) re.findall()

To get every match rather than just the first, we use re.findall(pat, str), which returns a list of all non-overlapping matched strings (the list is empty if nothing matches).

>>> matches = re.findall(pat, str)
>>> print(matches)
['text', 'text']
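re.findall() returns only the matched strings, not where they occur. If the positions matter too, re.finditer() from the same module yields one match object per match. The sketch below is illustrative; the variable names text and positions are assumptions, not part of the example above.

```python
import re

text = 'This is a sample text which we use to search a pattern within the text.'

# finditer yields a match object for each non-overlapping match, left to right
positions = [(m.group(), m.span()) for m in re.finditer(r'text', text)]
print(positions)

# findall returns an empty list when nothing matches
no_matches = re.findall(r'missing', text)
print(no_matches)  # []
```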

re.findall() is a convenient way to extract every occurrence of a pattern, and it works on any string, such as the contents of a file.

import re
with open('text.txt', 'r') as f:
    matches = re.findall(r'pattern', f.read())
print(matches)

2. Rules of Regular Expression in Python

Before we go further, let us look at the rules that regular expressions follow, since we need them to write pattern strings.

2.1) Identifiers

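These are the pattern identifiers and the rule that each identifier follows.

Pattern   Rule
\d        Matches any digit (0-9)
\D        Matches any character that is not a digit
\s        Matches any whitespace character (space, tab, newline)
\S        Matches any character that is not whitespace
\w        Matches any letter, digit, or underscore
\W        Matches any character that is not a letter, digit, or underscore
.         Matches any character except a newline (\n)
\.        Matches a literal full stop
\b        Matches a word boundary (the zero-width position between a word character and a non-word character)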
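Each identifier can be tried on its own with re.findall(). The following is a small sketch; the sample strings are made up for illustration.

```python
import re

sample = 'Room 42, call J. Doe.'

digits = re.findall(r'\d', sample)      # each digit: ['4', '2']
spaces = re.findall(r'\s', sample)      # each whitespace character (four spaces here)
dots = re.findall(r'\.', sample)        # literal full stops: ['.', '.']
word_chars = re.findall(r'\w', 'a_1!')  # letters, digits, underscore: ['a', '_', '1']
non_digits = re.findall(r'\D', 'a1b2')  # everything except digits: ['a', 'b']
any_char = re.findall(r'.', 'a\nb')     # '.' skips the newline: ['a', 'b']
whole_word = re.findall(r'\bcat\b', 'cat concat cats')  # 'cat' as a whole word: ['cat']

print(digits, spaces, dots, word_chars, non_digits, any_char, whole_word)
```

Note that \b consumes no characters: it only asserts that the match starts or ends at the edge of a word, which is why concat and cats are not matched.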

2.2) Modifiers

Apart from identifiers, there are certain operators/modifiers which regular expressions follow.

Modifier  Rule
*         Matches zero or more occurrences of the preceding character/identifier
+         Matches one or more occurrences of the preceding character/identifier
?         Matches zero or one occurrence of the preceding character/identifier
$         Anchors the match at the end of the string
^         Anchors the match at the start of the string
{1,3}     Matches if the preceding item repeats 1 to 3 times
{3}       Matches if the preceding item repeats exactly 3 times
{3,}      Matches if the preceding item repeats 3 or more times
[a-z]     Matches any single character from a to z
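The modifiers can be checked one at a time in the same way. This is a minimal sketch with made-up input strings; each comment shows the result for that input.

```python
import re

star = re.findall(r'ab*', 'a ab abbb')            # ['a', 'ab', 'abbb']
plus = re.findall(r'ab+', 'a ab abbb')            # ['ab', 'abbb']
optional = re.findall(r'colou?r', 'color colour') # ['color', 'colour']
exactly3 = re.findall(r'\d{3}', '12 345 6789')    # ['345', '678']
one_to_3 = re.findall(r'\d{1,3}', '12 345 6789')  # ['12', '345', '678', '9']
three_up = re.findall(r'\d{3,}', '12 345 6789')   # ['345', '6789']
lowercase = re.findall(r'[a-z]', 'aB3c')          # ['a', 'c']

# ^ and $ anchor the pattern to the start and end of the string
starts = re.search(r'^Hello', 'Hello world') is not None  # True
ends = re.search(r'world$', 'Hello world') is not None    # True
not_at_start = re.search(r'^world', 'Hello world')        # None

print(star, plus, optional, exactly3, one_to_3, three_up, lowercase)
print(starts, ends, not_at_start)
```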

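Here is a longer example that combines some of the above rules.

The pattern below matches ar followed by one or more e characters (the {1,} quantifier applies only to the character immediately before it, so this matches are, aree, and so on), then a single whitespace character, then one or more characters that are each a lowercase letter, a digit, a comma, or whitespace. The match stops at the nearest full stop, since the full stop is not included in the character class.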

import re

str = 'There are 10,000 to 20000 students in the college. This can mean anything.\n'

pat = r'are{1,}\s[a-z0-9,\s]+'

match = re.search(pat, str)
matches = re.findall(pat, str)

if match is None:
    print('Pattern not found')
else:
    print('Pattern found!')
    print('Match object', match)
    print('Listing all matches:', matches)

Output

Pattern found!
Match object <re.Match object; span=(6, 49), match='are 10,000 to 20000 students in the college'>
Listing all matches: ['are 10,000 to 20000 students in the college']
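One detail worth checking is what the quantifier in are{1,} actually repeats. The sketch below compares it with a grouped version; the (?:...) non-capturing group is a re feature this article does not otherwise cover, and the input strings are made up for illustration.

```python
import re

# The quantifier binds only to the preceding 'e', so extra e's are absorbed
# and 'ar' on its own does not match
single_char = re.findall(r'are{1,}', 'are areee ar')
print(single_char)  # ['are', 'areee']

# Wrapping the word in (?:...) makes the quantifier repeat the whole word
whole_word = re.findall(r'(?:are){1,}', 'are areare')
print(whole_word)   # ['are', 'areare']
```

In the example above this makes no visible difference, because the sentence contains a plain are with a single e.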

3. Conclusion

We learned the basics of regular expressions and how to use Python’s re module, with re.search() and re.findall(), to match patterns built from regular expression rules.
