Regular Expression in Python

import re

# Placeholder addresses for illustration
text = "Contact us at support@example.com or sales@example.org"
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'sales@example.org']

That code extracts email addresses from text using Python regex. The pattern looks cryptic at first, but each piece serves a specific purpose in matching the structure of an email address. Regular expressions give you a tiny language for describing text patterns, and Python’s re module makes them practical to use.

Understanding Python regex syntax

Python regex patterns combine ordinary characters with special metacharacters that represent broader matching rules. The sequence \d matches any digit from 0 to 9. The sequence \w matches word characters (letters, digits, and underscores). The . matches any single character except a newline.

import re

# Match the first run of three digits
pattern = r'\d{3}'
text = "My PIN is 4829"
match = re.search(pattern, text)
print(match.group())  # 482 (the first three digits of 4829)
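
To make the three metacharacters above concrete, here is a minimal sketch. The sample string and variable names are invented for illustration; re.DOTALL is the standard flag that lets . match newlines as well.

```python
import re

# Hypothetical sample text with an embedded newline
sample = "Order 66: ship_to=Bay 7\nnext line"

digits = re.findall(r'\d', sample)       # every single digit
words = re.findall(r'\w+', sample)       # runs of letters, digits, underscores
no_newline = re.search(r'7.', sample)    # '.' refuses to match the '\n' after '7'
with_dotall = re.search(r'7.', sample, flags=re.DOTALL)

print(digits)      # ['6', '6', '7']
print(words)       # ['Order', '66', 'ship_to', 'Bay', '7', 'next', 'line']
print(no_newline)  # None
print(repr(with_dotall.group()))  # '7\n'
```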

The \b anchor matches at a word boundary, the position between a word character and a non-word character, which helps you match complete words instead of partial strings. This becomes critical when you need to find “cat” but not “category.”

# Without word boundaries
pattern = r'cat'
text = "The cat in the category"
matches = re.findall(pattern, text)
print(matches)  # ['cat', 'cat']

# With word boundaries
pattern = r'\bcat\b'
matches = re.findall(pattern, text)
print(matches)  # ['cat']

Character classes let you define sets of acceptable characters using square brackets. The pattern [aeiou] matches any single vowel. You can specify ranges like [a-z] for lowercase letters or [0-9] for digits.

# Match the hexadecimal digits that follow '#'
# (without the '#', the search would stop at the 'C' in "Color")
pattern = r'#([0-9A-Fa-f]+)'
text = "Color code: #FF5733"
match = re.search(pattern, text)
print(match.group(1))  # FF5733
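
Character classes can also be negated with a leading ^, and a hyphen placed last in the brackets is taken literally. The following is a small sketch with made-up sample strings, not a complete tour of the syntax.

```python
import re

text = "Regex is fun 2024"  # hypothetical sample

vowels = re.findall(r'[aeiou]', text)        # any single lowercase vowel
digits_only = re.sub(r'[^0-9]', '', text)    # [^...] means "anything except"
hyphenated = re.findall(r'[a-z-]+', "well-known fact")  # trailing '-' is literal

print(vowels)       # ['e', 'e', 'i', 'u']
print(digits_only)  # 2024
print(hyphenated)   # ['well-known', 'fact']
```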

Quantifiers control how many times a pattern element should repeat. The * matches zero or more occurrences. The + requires at least one occurrence. The ? makes the preceding element optional (zero or one occurrence).

# Match phone numbers with optional parentheses and separators
pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
numbers = [
    "(555) 123-4567",
    "555-123-4567",
    "5551234567"
]

for num in numbers:
    if re.match(pattern, num):
        print(f"{num} is valid")
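
For a side-by-side view of the quantifiers, the sketch below runs ?, +, * and a bounded {m,n} repeat against a few invented strings. It uses re.fullmatch() so that each sample has to match in its entirety.

```python
import re

samples = ["color", "colour", "colouur"]  # hypothetical test words

optional_u = [s for s in samples if re.fullmatch(r'colou?r', s)]    # zero or one 'u'
one_or_more = [s for s in samples if re.fullmatch(r'colou+r', s)]   # at least one 'u'
zero_or_more = [s for s in samples if re.fullmatch(r'colou*r', s)]  # any number of 'u'
bounded = re.findall(r'\b\d{2,3}\b', "7 42 365 2024")               # two or three digits

print(optional_u)    # ['color', 'colour']
print(one_or_more)   # ['colour', 'colouur']
print(zero_or_more)  # ['color', 'colour', 'colouur']
print(bounded)       # ['42', '365']
```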

Core Python regex functions

The re.search() function scans through a string looking for the first location where the pattern matches. It returns a match object when successful or None when no match exists.

import re

text = "Python 3.12 was released in 2024"
pattern = r'\d+\.\d+'
match = re.search(pattern, text)

if match:
    print(f"Found version: {match.group()}")  # Found version: 3.12
    print(f"Position: {match.start()}-{match.end()}")  # Position: 7-11

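The re.findall() function returns all non-overlapping matches as a list. When the pattern has no capturing groups, the list contains the matched strings; with one group it contains that group's text, and with several groups it contains tuples. This becomes your go-to when you need to extract multiple instances of a pattern from text.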

# Extract all URLs from text
text = """
Visit https://example.com for docs.
Try http://test.org for testing.
"""

pattern = r'https?://[^\s]+'
urls = re.findall(pattern, text)
print(urls)  # ['https://example.com', 'http://test.org']
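
What re.findall() returns depends on how many capturing groups the pattern contains. A short sketch of the three cases, using an invented key=value string:

```python
import re

text = "width=80 height=24"  # hypothetical input

whole = re.findall(r'\w+=\d+', text)       # no groups: full matches
values = re.findall(r'\w+=(\d+)', text)    # one group: just that group's text
pairs = re.findall(r'(\w+)=(\d+)', text)   # several groups: tuples

print(whole)   # ['width=80', 'height=24']
print(values)  # ['80', '24']
print(pairs)   # [('width', '80'), ('height', '24')]
```

If you need grouping only for structure and still want full matches back, a non-capturing group (?:...) keeps the first behavior.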

The re.sub() function substitutes matches with replacement text. You can use backreferences with \1, \2 to reference captured groups from the pattern.

# Format phone numbers consistently
text = "Call 555-1234 or 555.9876"
pattern = r'(\d{3})[-.](\d{4})'
formatted = re.sub(pattern, r'(\1) \2', text)
print(formatted)  # Call (555) 1234 or (555) 9876
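
The replacement argument of re.sub() can also be a function that receives each match object and returns the replacement string. The sketch below assumes a made-up price list and a hypothetical helper named double_price.

```python
import re

def double_price(match):
    # Hypothetical helper: group(1) holds the number without the dollar sign
    return f"${float(match.group(1)) * 2:.2f}"

text = "Coffee $3.50, tea $2.25"
doubled = re.sub(r'\$(\d+\.\d{2})', double_price, text)
first_only = re.sub(r'\$(\d+\.\d{2})', double_price, text, count=1)

print(doubled)     # Coffee $7.00, tea $4.50
print(first_only)  # Coffee $7.00, tea $2.25
```

The count argument limits how many matches are replaced, which is handy when only the first occurrence should change.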

The re.match() function checks if the pattern matches at the beginning of the string. This differs from re.search(), which looks anywhere in the string.

text = "Python is great"

# match() only checks the start
match = re.match(r'Python', text)
print(match.group() if match else "No match")  # Python

# This won't match because 'great' isn't at the start
match = re.match(r'great', text)
print(match)  # None

Compiling Python regex patterns

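When you use the same pattern repeatedly, compiling it once gives you a reusable, named pattern object and skips the pattern lookup on every call. The re module already caches recently used patterns, so the speedup is usually modest, but the code often reads better. The re.compile() function creates a pattern object with methods like search(), findall(), and sub().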

import re

# Compile once, use many times
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

texts = [
    "Email me at user@example.com",
    "Support: support@example.org",
    "No email here"
]

for text in texts:
    matches = email_pattern.findall(text)
    if matches:
        print(f"Found: {matches[0]}")

Flags modify matching behavior. You can pass them to re.compile() or directly to functions such as re.search(). The re.IGNORECASE flag makes matching case-insensitive.

# Case-insensitive matching
pattern = re.compile(r'python', re.IGNORECASE)
texts = ["Python", "PYTHON", "python"]

for text in texts:
    if pattern.search(text):
        print(f"{text} matches")  # All three match
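
Flags can be combined with | and passed straight to the module-level functions. As a rough sketch with invented input, re.MULTILINE makes ^ and $ work per line, and re.VERBOSE lets a pattern carry whitespace and comments.

```python
import re

notes = "todo: write tests\nTODO: update docs\ndone: ship"  # hypothetical notes

tasks = re.findall(r'^todo: (.+)$', notes, flags=re.IGNORECASE | re.MULTILINE)
print(tasks)  # ['write tests', 'update docs']

# re.VERBOSE ignores whitespace in the pattern and allows inline comments
version = re.compile(r"""
    (\d+)   # major version
    \.      # literal dot
    (\d+)   # minor version
""", re.VERBOSE)

parts = version.search("Python 3.12").groups()
print(parts)  # ('3', '12')
```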

Working with groups in Python regex

Parentheses create capturing groups that extract specific parts of a match. Each group gets assigned a number starting from 1, and you can access them through the match object.

import re

# Parse log entries
log = "2024-01-15 14:30:22 ERROR Database connection failed"
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)'

match = re.search(pattern, log)
if match:
    date = match.group(1)
    time = match.group(2)
    level = match.group(3)
    message = match.group(4)
    
    print(f"Date: {date}")
    print(f"Level: {level}")
    print(f"Message: {message}")

Named groups make your code more readable by letting you access groups by name instead of number. Use the syntax (?P<name>pattern) to create a named group.

# Extract structured data
pattern = r'(?P<protocol>https?)://(?P<domain>[^/]+)(?P<path>/.*)?'
url = "https://api.example.com/v1/users"

match = re.search(pattern, url)
if match:
    print(f"Protocol: {match.group('protocol')}")  # https
    print(f"Domain: {match.group('domain')}")      # api.example.com
    print(f"Path: {match.group('path')}")          # /v1/users
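
Named groups are also available as a dictionary through groupdict(), and a replacement string can refer to them with \g<name>. A minimal sketch, assuming an ISO-style date in a made-up sentence:

```python
import re

date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
sentence = "Released on 2024-01-15"  # hypothetical input

parts = date_pattern.search(sentence).groupdict()
reordered = date_pattern.sub(r'\g<day>/\g<month>/\g<year>', sentence)

print(parts)      # {'year': '2024', 'month': '01', 'day': '15'}
print(reordered)  # Released on 15/01/2024
```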

The re.finditer() function returns an iterator of match objects, which gives you both the matched text and position information for each match.

# Find all numbers with their positions
text = "Scores: 85, 92, 78, 95"
pattern = r'\d+'

for match in re.finditer(pattern, text):
    print(f"Found {match.group()} at position {match.start()}-{match.end()}")
# Found 85 at position 8-10
# Found 92 at position 12-14
# Found 78 at position 16-18
# Found 95 at position 20-22

Handling special characters in Python regex

Many characters have special meaning in regex patterns. The characters . ^ $ * + ? { } [ ] \ | ( ) all need escaping with a backslash if you want to match them literally.

import re

# Match prices in dollars
text = "Items cost $19.99 and $5.50"
pattern = r'\$\d+\.\d{2}'
prices = re.findall(pattern, text)
print(prices)  # ['$19.99', '$5.50']

Raw strings (prefixed with r) prevent Python from interpreting backslashes as escape sequences. This makes regex patterns cleaner and less error-prone.

# Without raw string (confusing)
pattern = '\\bword\\b'

# With raw string (clear)
pattern = r'\bword\b'

The re.escape() function automatically escapes all special characters in a string, which helps when you need to match user input literally.

user_input = "How much is $5.00?"
# Escape special characters
safe_pattern = re.escape(user_input)
pattern = re.compile(safe_pattern)

text = "FAQ: How much is $5.00? Answer: Yes."
if pattern.search(text):
    print("Found exact match")
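
To see why escaping matters, compare the same input used raw and escaped. This is a sketch with an invented input string; the point is that an unescaped + is read as a quantifier rather than a plus sign.

```python
import re

user_input = "1+1=2"  # hypothetical text typed by a user
text = "We all know 1+1=2, and 111=2 is wrong"

unescaped = re.findall(user_input, text)            # '+' means "one or more 1s"
escaped = re.findall(re.escape(user_input), text)   # '+' is matched literally

print(unescaped)  # ['111=2']
print(escaped)    # ['1+1=2']
```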

Advanced Python regex techniques

Lookahead and lookbehind assertions let you match patterns based on what comes before or after, without including those parts in the match. Positive lookahead uses (?=pattern) and negative lookahead uses (?!pattern). Positive lookbehind uses (?<=pattern) and negative lookbehind uses (?<!pattern).

# Match passwords with specific requirements
# Must contain at least one digit and one letter
pattern = r'^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$'

passwords = ["pass1234", "12345678", "abcdefgh", "Test1234"]

for pwd in passwords:
    if re.match(pattern, pwd):
        print(f"{pwd} is valid")
    else:
        print(f"{pwd} is invalid")
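
Lookbehind works the same way in the other direction, with (?<=pattern) and (?<!pattern). The sketch below uses an invented price string; note that the re module requires lookbehind patterns to have a fixed width.

```python
import re

text = "Price: $40, discount: 15%, total: $34"  # hypothetical input

# Digits that directly follow a dollar sign (the sign itself is not captured)
dollar_amounts = re.findall(r'(?<=\$)\d+', text)
# Digits not preceded by a dollar sign or by another digit
not_dollar = re.findall(r'(?<![\$\d])\d+', text)
# Whole numbers that are not followed by a percent sign
not_percent = re.findall(r'\b\d+\b(?!%)', text)

print(dollar_amounts)  # ['40', '34']
print(not_dollar)      # ['15']
print(not_percent)     # ['40', '34']
```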

The re.split() function splits strings based on pattern matches instead of just fixed delimiters. This handles complex splitting scenarios that basic string methods can’t.

# Split on multiple delimiters
text = "one,two;three:four|five"
parts = re.split(r'[,;:|]', text)
print(parts)  # ['one', 'two', 'three', 'four', 'five']

# Split on whitespace but keep quoted strings together
text = 'field1 "field 2" field3'
parts = re.split(r'\s+(?=(?:[^"]*"[^"]*")*[^"]*$)', text)
print(parts)  # ['field1', '"field 2"', 'field3']
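
Two more re.split() behaviors are worth knowing: wrapping the delimiter in a capturing group keeps the delimiters in the result, and maxsplit caps the number of splits. A brief sketch with made-up inputs:

```python
import re

# Capturing group: the operators stay in the output list
tokens = re.split(r'([+*])', "3+4*2")
print(tokens)  # ['3', '+', '4', '*', '2']

# maxsplit=1: split only at the first comma (with optional surrounding spaces)
head_tail = re.split(r'\s*,\s*', "a , b,c ,d", maxsplit=1)
print(head_tail)  # ['a', 'b,c ,d']
```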

Non-greedy quantifiers change the matching behavior to prefer shorter matches. Add a ? after *, +, or {m,n} to make them non-greedy.

# Greedy vs non-greedy matching
html = "<div>Content 1</div><div>Content 2</div>"

# Greedy: matches everything between first < and last >
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<div>Content 1</div><div>Content 2</div>']

# Non-greedy: matches smallest possible strings
non_greedy = re.findall(r'<.*?>', html)
print(non_greedy)  # ['<div>', '</div>', '<div>', '</div>']

Practical Python regex examples

Validating email addresses requires careful pattern design. This pattern handles most common email formats while rejecting obviously invalid addresses.

import re

def validate_email(email):
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    # fullmatch() requires the whole string to match, so no ^ or $ is needed
    return bool(re.fullmatch(pattern, email))

emails = [
    "user@example.com",
    "first.last@example.co.uk",
    "invalid@",
    "@invalid.com"
]

for email in emails:
    status = "valid" if validate_email(email) else "invalid"
    print(f"{email}: {status}")
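
One subtlety when validating with anchors: by default $ also matches just before a trailing newline, so re.match() with a ^...$ pattern accepts input that ends in "\n". The sketch below contrasts that with re.fullmatch(); the EMAIL constant and the sample address are placeholders.

```python
import re

# Same character classes as the validator above, written without anchors
EMAIL = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
address = "user@example.com\n"  # trailing newline, e.g. a line read from a file

anchored_match = bool(re.match('^' + EMAIL + '$', address))
full = bool(re.fullmatch(EMAIL, address))

print(anchored_match)  # True: '$' matches just before the final newline
print(full)            # False: the newline is not part of a valid address
```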

Extracting data from structured text becomes straightforward with Python regex. You can parse log files, configuration files, or any text with predictable patterns.

# Parse Apache access log entries
log_line = '192.168.1.1 - - [15/Jan/2024:14:30:22 +0000] "GET /index.html HTTP/1.1" 200 1234'

pattern = r'(?P<ip>[\d.]+) - - \[(?P<date>[^\]]+)\] "(?P<method>\w+) (?P<path>[^\s]+) HTTP/[\d.]+" (?P<status>\d+) (?P<size>\d+)'

match = re.search(pattern, log_line)
if match:
    log_data = match.groupdict()
    print(f"IP: {log_data['ip']}")
    print(f"Path: {log_data['path']}")
    print(f"Status: {log_data['status']}")

Sanitizing user input helps prevent security issues and data corruption. Python regex helps you strip unwanted characters or validate input formats.

def clean_username(username):
    # Remove anything that's not alphanumeric, underscore, or hyphen
    cleaned = re.sub(r'[^\w-]', '', username)
    # Limit to 20 characters
    return cleaned[:20]

usernames = ["user@123", "valid_user", "test--user!", "a" * 30]

for username in usernames:
    cleaned = clean_username(username)
    print(f"{username} -> {cleaned}")
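
For str patterns, \w matches Unicode word characters, so accented letters survive the cleanup above. If usernames must be plain ASCII, the re.ASCII flag narrows \w to [a-zA-Z0-9_]. The function name clean_username_ascii below is a hypothetical variant, shown only as a sketch.

```python
import re

def clean_username_ascii(username):
    # Hypothetical variant: re.ASCII restricts \w to [a-zA-Z0-9_]
    cleaned = re.sub(r'[^\w-]', '', username, flags=re.ASCII)
    return cleaned[:20]

name = "jos\u00e9_99!"  # 'josé_99!' with a precomposed é

unicode_aware = re.sub(r'[^\w-]', '', name)
ascii_only = clean_username_ascii(name)

print(unicode_aware)  # josé_99
print(ascii_only)     # jos_99
```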

Common Python regex pitfalls

Greedy quantifiers can match more than you expect. The pattern .* will consume everything until the last possible match point, which often isn’t what you want.

# Problem: greedy matching
text = "Start {content} middle {more} end"
# This matches from first { to last }
wrong = re.search(r'\{.*\}', text).group()
print(wrong)  # {content} middle {more}

# Solution: non-greedy matching
right = re.findall(r'\{.*?\}', text)
print(right)  # ['{content}', '{more}']
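
An alternative to the non-greedy quantifier is a negated character class that cannot run past the closing delimiter. This sketch reuses the same sample text; it assumes the braces are never nested.

```python
import re

text = "Start {content} middle {more} end"

# [^}]* matches anything except a closing brace, so it stops at the first '}'
with_braces = re.findall(r'\{[^}]*\}', text)
inner_only = re.findall(r'\{([^}]*)\}', text)

print(with_braces)  # ['{content}', '{more}']
print(inner_only)   # ['content', 'more']
```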

Forgetting to escape special characters causes patterns to fail mysteriously. Always escape characters that have regex meaning when you want to match them literally.

# Problem: unescaped period matches any character
pattern = r'file.txt'
print(bool(re.search(pattern, 'fileXtxt')))  # True (oops!)

# Solution: escape the period
pattern = r'file\.txt'
print(bool(re.search(pattern, 'fileXtxt')))  # False
print(bool(re.search(pattern, 'file.txt')))  # True

Not using raw strings leads to double-escaping confusion. Python interprets backslashes before the regex engine sees them.

# Problem: double escaping needed without raw strings
pattern = '\\bword\\b'  # Confusing

# Solution: use raw strings
pattern = r'\bword\b'   # Clear

The difference between match() and search() trips up many developers. Remember that match() only checks the start of the string while search() looks anywhere.

text = "The answer is 42"

# match() fails because '42' isn't at the start
match_result = re.match(r'\d+', text)
print(match_result)  # None

# search() succeeds because it looks anywhere
search_result = re.search(r'\d+', text)
print(search_result.group())  # 42

Python regex gives you precise control over text processing. Start with simple patterns and build complexity gradually. Test your patterns against real data, including edge cases. Python’s re module handles most text manipulation needs once you understand the core concepts and common patterns.