Regex - Python Regular Expression "\1"

Regular expressions, also commonly known as regex, are used to find a particular pattern in a given string. They consist of a set of metacharacters which help in defining patterns. These patterns can be anything from finding a particular word in a paragraph to finding any kind of URL in a given text.

Python has great support for regular expressions. It provides a built-in module called re to work with them. The re module has many functions that help find patterns in a given text or text file.

Related: Learn more about the re module.

Metacharacters are the most important part of regular expressions. Without metacharacters, a regular expression is just a normal string. Metacharacters are ordinary Unicode characters, but the regex engine treats them differently from other characters: used in a prescribed way, they describe the text to be matched instead of standing for themselves, and that is what turns a normal string into a regular expression. Some common metacharacters and special sequences are \d, $, *, +, etc.

The list of metacharacters is long. Let alone memorizing them, just understanding what each one does is quite a task. Because of this, people often get scared of regular expressions: when they see one, all they see is a complex string of seemingly random characters that doesn't make any sense to them.

So to solve this problem, in this article we’re going to see what the metacharacter \1 means. We’re going to understand it, use it in some examples, and finally check out some other metacharacters similar to it.

Related: Learn more about metacharacters.

What is `"\1"` in regex?

When we write regular expressions, we often group parts of them using parentheses. These groups are called capturing groups, because they capture the text matched by the part of the pattern inside them. Each pair of parentheses represents a single capturing group. Capturing groups are numbered starting from 1, unlike arrays and strings, where indexing starts from 0 (in Python, index 0 is reserved for the entire match).

So a regular expression (cg1)(cg2)(cg3) has 3 capturing groups, with cg1 indexed as 1, cg2 indexed as 2 and cg3 indexed as 3. It’s important to note that each of them sits in its own pair of parentheses. The parentheses play an important role, as they separate one capturing group from another.
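
The numbering can be checked with a minimal sketch like the one below. The sample string 'cg1cg2cg3' is only an illustrative input, chosen so that each group matches its own name.

```python
import re

# Illustrative input: each group matches the literal text of its own name.
match = re.match(r'(cg1)(cg2)(cg3)', 'cg1cg2cg3')

whole = match.group(0)       # index 0 is the entire match, not a capturing group
first = match.group(1)       # first pair of parentheses
second = match.group(2)      # second pair of parentheses
third = match.group(3)       # third pair of parentheses
all_groups = match.groups()  # tuple of groups 1..3 (group 0 is not included)

print(whole, first, second, third, all_groups)
```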

Now you must be asking, “What’s the point of all this indexing and stuff?” The point is that it allows us to refer to an earlier group again, later in the same regular expression. For example, suppose you have 3 capturing groups in the regular expression and the 4th part is supposed to be the same as the first one. Instead of writing it again, we can simply refer to the first capturing group. So if you’re writing really big regular expressions where groups repeat, you can use this method to reduce the complexity of the expression.

This way of referring to a capturing group again inside the regular expression itself is called a backreference. The metacharacter "\1" simply refers to the first capturing group and nothing more. So if we take the above example, (cg1)(cg2)(cg3), and add \1 to it, making it (cg1)(cg2)(cg3)\1, it would match the text cg1cg2cg3cg1. Note that \1 matches the exact text that the first group captured; it does not run the group's pattern a second time.
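
Here is a small sketch of that behaviour. The second pattern, (\d+)-\1, and its sample inputs are assumptions added for illustration; they show that a backreference repeats the captured text, not the pattern.

```python
import re

# (cg1)(cg2)(cg3)\1 expects the text of group 1 again at the end.
pattern = re.compile(r'(cg1)(cg2)(cg3)\1')
literal_match = pattern.fullmatch('cg1cg2cg3cg1')

# \1 repeats the text that was captured, not the pattern \d+ itself.
digits = re.compile(r'(\d+)-\1')
same = digits.fullmatch('12-12')       # matches: both sides are "12"
different = digits.fullmatch('12-34')  # no match: "34" is digits, but not "12"

print(literal_match is not None, same is not None, different is None)
```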

Enough with the theory, now let’s get our hands dirty and try using the backreferencing metacharacter "\1" in the regular expression and check out its applications.

Uses of the `"\1"` metacharacter

To thoroughly understand the uses of the "\1" metacharacter, we will check out some examples of regular expressions where we use the "\1" metacharacter.

Example 1:- Matching repeated strings

In this example, we’re going to match repeated words, for example “Python Python”. To create a regular expression for this, we have to understand the pattern first: a word made of one or more word characters, then some whitespace, and then the same word again. The metacharacter for a word character is “\w” (roughly [A-Za-z0-9_]; in Python 3 it also matches Unicode letters and digits by default). “+” means one or more occurrences of the preceding item, so “\w+” represents a word. Similarly, “\s+” represents one or more whitespace characters. So “(\w+)\s+” represents a captured word followed by whitespace. If we now add “\1” to the end, we get our desired regular expression, (\w+)\s+\1, since “\1” matches the same text that the first pair of parentheses captured.

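import re

string = "python python"
regex = r'(\w+)\s+\1'

# With one capturing group, findall returns the captured text of that
# group (the repeated word), not the whole matched phrase.
matches = re.findall(regex, string)
print(matches[0])  # python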
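
As a hedged extension of this example, the sketch below runs the same idea over a longer, made-up sentence. It adds \b word boundaries, which are an assumption beyond the pattern above, so that the repeat is only detected on whole words.

```python
import re

# Made-up sentence with two accidental repetitions.
text = 'This is is a test test sentence'

# \b anchors both ends on word boundaries so only whole words repeat.
repeated = re.compile(r'\b(\w+)\s+\1\b')
phrases = [m.group(0) for m in repeated.finditer(text)]  # whole matched phrases
words = repeated.findall(text)                           # only the captured words

# Without \b, the "is" inside "This" pairs up with the following "is".
loose = re.search(r'(\w+)\s+\1', 'This is a test')
strict = re.search(r'\b(\w+)\s+\1\b', 'This is a test')

print(phrases, words, loose.group(0), strict)
```

finditer is used here because each match object still gives access to the whole phrase through group(0), while findall only returns the group.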

Example 2:- swap words

Just as we use "\1" to backreference the first capturing group, we can backreference the second or third capturing group with "\2" or "\3".

In this example, we’re going to swap two words in a string using regular expressions. Let’s consider the input string “hello world”. To create a regular expression for swapping words, we need to define the pattern. We want to capture two words, separated by whitespace. The pattern (\w+)\s+(\w+) can be broken down as follows:

(\w+): This captures the first word, represented by \w+, which matches one or more word characters ([A-Za-z0-9_]). The parentheses create a capturing group for the first word.
\s+: This matches one or more whitespace characters.
(\w+): This captures the second word, following the whitespace, also represented by \w+. Again, the parentheses create a capturing group for the second word.

To swap the captured words, we can use the replacement string \2 \1, where \2 represents the second capturing group (second word) and \1 represents the first capturing group (first word). Let’s try to code it now.

import re

string = "hello world"
regex = r'(\w+)\s+(\w+)'
# In the replacement string, \2 and \1 insert the captured words
replacement = r'\2 \1'
output_string = re.sub(regex, replacement, string)
print(output_string)  # world hello
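
A few variations on the same swap, as a rough sketch. The inputs are invented for illustration; the \g<1> form is the re module's alternative spelling of \1 in replacement strings.

```python
import re

# Invented input in "Last, First" form.
swapped_name = re.sub(r'(\w+),\s*(\w+)', r'\2 \1', 'Lovelace, Ada')

# re.sub replaces every non-overlapping match, so each pair is swapped.
swapped_pairs = re.sub(r'(\w+)\s+(\w+)', r'\2 \1', 'hello world foo bar')

# \g<1> is an unambiguous way to write \1 in a replacement string.
# It matters when a digit follows: r'\10' would be read as group 10.
padded = re.sub(r'(\d+)', r'\g<1>0', 'x5')

print(swapped_name)
print(swapped_pairs)
print(padded)
```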

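`"\0"`

In several regex flavors, "\0" refers to the entire matched string, independent of any parentheses or capturing groups. Python's re module works differently: in a pattern, \0 is parsed as an octal escape for the null character, not as a backreference. To refer to the whole match in Python, use \g<0> in a replacement string, or group(0) on a match object.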

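A minimal sketch of referring to the whole match in Python, assuming a made-up input string. It uses \g<0> in the replacement and group(0) on the match object; no capturing groups are needed for either.

```python
import re

text = 'hello world'

# \g<0> in a replacement string stands for the entire match.
bracketed = re.sub(r'\w+', r'[\g<0>]', text)

# On a match object, group(0) returns the entire match.
first = re.search(r'\w+', text)
whole_match = first.group(0)

print(bracketed)
print(whole_match)
```
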
Named backreferences

It’s really good that we can refer back to earlier groups within the regular expression itself, but numeric indexes are not that convenient in some cases. Therefore, we have named backreferences. Named backreferences allow you to name a group and refer back to it by that name when you need it. Let’s see how it works.

`"?P"`

The "?P" syntax is used to name a group that we’ll be calling later in a regular expression. The syntax for naming a group is (?P<group-name>pattern). Once we’ve named a group, we can refer back to it later in the expression with (?P=group-name).
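
As a quick sketch of the naming syntax on its own, the example below uses hypothetical group names (year, month) and a made-up input string. A named group is still numbered, so it can be read by name or by index.

```python
import re

# "year" and "month" are hypothetical group names chosen for illustration.
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2023-07')

by_name = m.group('year')  # look the group up by its name
by_index = m.group(1)      # the same group is also group number 1
as_dict = m.groupdict()    # all named groups as a dictionary

print(by_name, by_index, as_dict)
```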

To understand the named backreferencing better, we’re going to use the same example as before, where we matched the repeated words.

Matching repeated words with named backreferencing

In this example, we’re going to match repeated words with named backreferencing. Let’s consider the input string “Python Python”.

To create a regular expression for this, we first need to understand the pattern. The pattern consists of a string containing any number of word characters ([A-Za-z0-9_]), followed by some whitespace, and then the same word is repeated.

The metacharacter \w represents a word character, and the metacharacter + indicates one or more occurrences. So, \w+ represents a word. Similarly, \s+ represents one or more whitespace characters.

Putting it together, the pattern (?P<word>\b\w+\b)\s+(?P=word) can be broken down as follows:

(?P<word>\b\w+\b): This captures the first word as a whole word (\b marks a word boundary) and stores it in a capturing group named “word”.
\s+: This matches one or more whitespace characters.
(?P=word): This is a named backreference that matches the exact content captured by the named group “word”, ensuring that the same word is repeated. It plays the same role that \1 played in the numbered version.

Let’s see it in code.

import re

string = "Python Python"
regex = r'(?P<word>\b\w+\b)\s+(?P=word)'

# findall returns the text captured by the named group "word"
matches = re.findall(regex, string)
print(matches[0])  # Python
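
Named groups can also be used on the replacement side. The sketch below, with an invented sentence, collapses each doubled word into one by writing \g<word> in the replacement string.

```python
import re

# Invented sentence with doubled words.
text = 'Python Python is is fun'

# (?P=word) is the backreference inside the pattern;
# \g<word> is the matching reference inside the replacement string.
deduped = re.sub(r'\b(?P<word>\w+)\s+(?P=word)\b', r'\g<word>', text)

print(deduped)
```

This sketch only handles a word that appears twice in a row; a word repeated three or more times would need a repeated group around the backreference.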

Why do we need named backreferencing?

Named backreferences don’t do anything that numbered backreferences can’t. Then why do we need them? They help in writing clean code. When a pattern has many capturing groups, keeping track of which number belongs to which group gets confusing, and adding or removing a group shifts the numbers of all the groups after it. Names avoid that confusion and make the regular expression more readable.
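
To make the readability point concrete, here is a hedged side-by-side sketch. The key="value" pattern and the group names are assumptions for illustration; both versions require the closing quote to be the same character as the opening one.

```python
import re

# Numbered version: the reader has to count parentheses to know what \2 is.
numbered = re.compile(r'(\w+)=(["\'])(.*?)\2')

# Named version: the same pattern, but each part says what it is.
named = re.compile(r'(?P<key>\w+)=(?P<quote>["\'])(?P<value>.*?)(?P=quote)')

value_by_number = numbered.fullmatch('name="Ada"').group(3)
value_by_name = named.fullmatch('name="Ada"').group('value')

# Mismatched quotes are rejected by the backreference in both versions.
mismatched = named.fullmatch('name="Ada\'')

print(value_by_number, value_by_name, mismatched)
```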

Application of backreferencing

Backreferencing has applications in a lot of fields, including data analysis, natural language processing, web development (to match URL patterns), database management (to match records), and many more. It’s really important to know where you can use backreferencing, because it makes a lot of tedious tasks quite simple. Some of the use cases of backreferencing are:

To validate data
To parse URLs
To rewrite text
Finding palindromes
Validating email address
Converting HTML tags
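
Two of these use cases can be sketched in a few lines. The inputs are made up, and the patterns are deliberately simple: the palindrome pattern only handles five-letter words, and the tag pattern is not a real HTML parser (it ignores attributes and nesting).

```python
import re

# Paired tags: </\1> must close the same tag name that group 1 opened.
html = '<b>bold</b> <i>oops</b> <em>fine</em>'
paired = re.findall(r'<(\w+)>(.*?)</\1>', html)

# Five-letter palindromes: characters 4 and 5 mirror characters 2 and 1.
five_letter = re.compile(r'\b(\w)(\w)\w\2\1\b')
palindromes = [m.group(0) for m in five_letter.finditer('level radar hello')]

print(paired)
print(palindromes)
```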

Conclusion

Regular expressions are all about the metacharacters and how you use them. There are often many ways to achieve a pattern in regular expressions. Knowing as many metacharacters and functionalities as you can helps you in creating short, readable and clear regular expressions. So it’s always good to learn new metacharacters. Make sure that you know how to use them. And the most important part is practice. If you don’t practice enough, you’ll keep forgetting what you’ve learned. So it’s important that you practice yourself each time you learn a new thing.

References

Official Python Documentation.

Stack Overflow thread for the same question.
