How To Extract Emails From a Text File Using Python

Extract Email Addresses From A Text File

In this article, we are going to see how to extract email addresses from a text file using Python. To make this easier, we will use regular expressions: patterns built from special characters that have been used for string manipulation for decades.

Using RegEx with Python

Regular expressions are invaluable when we need to search or manipulate strings and shape our output into a consistent format. The “re” module is a built-in module in Python. In the following sub-sections, we will cover the basic operations and then move on to the main topic.

Applications of Regular Expressions

To get a clearer idea, here are some of the applications:

  1. Finding a specific pattern in a string.
  2. Matching a particular keyword or letter in a sentence.
  3. Extraction of useful symbols or patterns from a long text.
  4. Performing complex string operations.

A small tutorial on the Python RegEx library

A regular expression allows us to match a specific pattern in a given text, so it helps to know the basics before tackling this topic. Regular expressions are used not only for email extraction but also, for a long time, in ETL (Extract, Transform, and Load) processing of text in big data pipelines.

There are four basic functions to perform four basic operations on strings:

  1. match(): Matches a pattern only at the beginning of the text.
  2. search(): Finds the first occurrence of a pattern anywhere in the text.
  3. findall(): Returns all the matching strings in the whole text as a list.
  4. finditer(): Finds all the matches and returns them as an iterator of match objects.
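As a quick illustration, here is a minimal sketch that calls all four functions on one compiled pattern. The sample strings and variable names are made up for this example.

```python
import re

text = "python is fun, and python is popular"
pattern = re.compile(r"python")

# match(): succeeds only if the pattern occurs at the very start of the text
first = pattern.match(text)
at_start = first.group() if first else None

# search(): returns the first occurrence anywhere in the text
found = pattern.search("I like python")
found_span = found.span() if found else None

# findall(): returns every non-overlapping match as a list of strings
all_matches = pattern.findall(text)

# finditer(): yields one match object per occurrence
positions = [m.start() for m in pattern.finditer(text)]

print(at_start, found_span, all_matches, positions)
```

Note that match() and search() return a single match object (or None), while findall() and finditer() cover every occurrence.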

Limitations of matching for special characters

There is a set of special characters (metacharacters) that are not matched literally; instead, they help describe complex patterns in a string. Here is a list of them:

  1. Square brackets: [ ]
  2. Round brackets: ( )
  3. Curly braces: { }
  4. The pipe: |
  5. The backslash: \
  6. Question mark: ?
  7. Plus sign: +
  8. The dot operator: .
  9. The caret: ^
  10. The dollar sign: $
  11. The asterisk or star operator: *

Point to remember: whenever we write a pattern, we should specify it as a raw string by placing the letter “r” before the opening quote. This stops Python from interpreting backslashes as string escape sequences before the pattern reaches the RegEx engine. Ex: myPattern = r"myString".
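The difference a raw string makes can be seen in a small sketch like the one below. The sample sentence and the price string are assumptions chosen only for illustration.

```python
import re

# In a normal string "\b" is a backspace character; in a raw string it
# stays as backslash + b, which the regex engine reads as a word boundary.
normal = "\bcat\b"
raw = r"\bcat\b"

print(len(normal), len(raw))

sentence = "the cat sat on the catalog"
raw_hits = re.findall(raw, sentence)        # whole word "cat" only
normal_hits = re.findall(normal, sentence)  # looks for backspace characters

# A metacharacter is matched literally when it is escaped with a backslash
price = re.search(r"\$\d+\.\d+", "total: $12.50")
print(raw_hits, normal_hits, price.group())
```

Without the “r” prefix the pattern silently searches for something else, which is why the mistake is easy to miss.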

Compiling a regular expression

Before we start, we compile our expression. This creates a pattern object on which we can call the above four functions. To compile an expression, we pass our pattern to the re.compile() function. Here we also set the flag to re.UNICODE; in Python 3 this is already the default for string patterns, so the flag is optional.

Code:

import re
myPattern = re.compile("python", flags = re.UNICODE)
print(type(myPattern)) 

Output:

<class 're.Pattern'>

Now we have successfully created a pattern object. We will use this object to call the functions and perform all the operations.
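Compiling is optional: the module-level functions accept the pattern directly. The sketch below, using made-up sample text, suggests that both routes give the same result and that re.UNICODE is set by default for string patterns in Python 3.

```python
import re

sample = "Python, PYTHON, python"

# Route 1: compile once, then call methods on the pattern object
compiled = re.compile(r"python", flags=re.IGNORECASE)
via_object = compiled.findall(sample)

# Route 2: pass the pattern string to the module-level function
via_module = re.findall(r"python", sample, flags=re.IGNORECASE)

# For str patterns, the UNICODE flag is present even when not requested
uses_unicode = bool(re.compile("python").flags & re.UNICODE)

print(via_object, via_module, uses_unicode)
```

A compiled object is mainly convenient when the same pattern is reused many times, for example once per line of a file.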

The match() function

This function returns a match object if the beginning of the string matches the pattern.

Code:

match = myPattern.match("python")  
print(match.group())

Output:

python

Calling the group() function returns the text that matched. Thus, when the pattern matches our sample string, a match object is created. We can check the matching index positions using the span() function.

print("The pattern matches up to {0}".format(match.span()))

# output
The pattern matches up to (0, 6)

Please remember that if the function does not find a match, no object is created and None is returned. Calling group() on that result then raises an AttributeError. The span() function returns the start and end index positions of the match as a tuple. The match() method of a compiled pattern also accepts two extra parameters, namely:

  1. pos: Index in the string where matching starts.
  2. endpos: Index in the string where matching stops; characters from this index onward are ignored.

Example:

match = myPattern.match("hello python", pos = 6)
print(match.group())
print("The pattern matches up to {0}".format(match.span()))

# output
python
The pattern matches up to (6, 12)
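A short sketch of the None case and of the endpos parameter may help here. The sample strings are assumptions for illustration.

```python
import re

myPattern = re.compile("python")

# match() returns None when the text does not start with the pattern,
# so check the result before calling group() or span()
result = myPattern.match("hello python")
if result is None:
    print("No match at the start of the text")

# endpos limits how far the engine may look: only "pyth" is visible here
cut_short = myPattern.match("python", 0, 4)

# pos and endpos together restrict matching to the slice from 6 up to 12
window = myPattern.match("hello python rocks", pos=6, endpos=12)
print(cut_short, window.span())
```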

Advanced matching entities

Sometimes our string may contain digits, spaces, alphanumeric characters, etc. To handle these reliably, re provides a set of special sequences. We need to specify them in our raw strings.

  1. \d: Matches any digit character from 0 to 9.
  2. \D: Matches any character that is not a digit.
  3. \s: Matches any whitespace character, such as “\n”, “\t”, “\r” and the space.
  4. \S: Matches any non-whitespace character.
  5. \w: Matches alphanumeric characters and the underscore.
  6. \W: Matches any character that is not alphanumeric or an underscore.
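The sequences above can be tried out with a sketch like this one. The sample string is invented for the example.

```python
import re

sample = "Order 66 shipped\tto user_42 on 2024-01-15!"

digits = re.findall(r"\d+", sample)        # runs of digits
words = re.findall(r"\w+", sample)         # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)         # each whitespace character
symbols = re.findall(r"[^\w\s]", sample)   # neither word nor whitespace

# \D removes everything that is not a digit
digits_only = re.sub(r"\D", "", "2024-01-15")

print(digits, words, spaces, symbols, digits_only)
```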

Flags for the matching functions:

Flags lend an extra helping hand when we perform more complex text analysis. Below is a list of some flags:

  1. re.ASCII or re.A: Makes \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only instead of the full Unicode range.
  2. re.DEBUG: Displays debug information about the compiled expression.
  3. re.IGNORECASE or re.I: This flag performs case-insensitive matching.
  4. re.MULTILINE or re.M: Makes ^ and $ match at the start and end of every line, not only at the start and end of the whole string.

For more info about flags please go through this link: https://docs.python.org/3/library/re.html#flags
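To see what these flags change in practice, here is a minimal sketch. The two-line sample text and the word "café" are assumptions picked to make the differences visible.

```python
import re

text = "Python is easy\npython is popular"

# Without a flag the match is case-sensitive
case_sensitive = re.findall(r"python", text)
# re.IGNORECASE matches regardless of letter case
any_case = re.findall(r"python", text, flags=re.IGNORECASE)

# re.MULTILINE lets ^ match at the start of every line, not only the first
first_only = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# re.ASCII restricts \w to ASCII letters, digits and the underscore
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)

print(case_sensitive, any_case, first_only, line_starts, unicode_word, ascii_word)
```

Several flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.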

The search() function

The search function looks for a specific pattern/word/letter/character anywhere in a string and returns a match object for the first occurrence it finds. If nothing matches, it returns None.

import re

# The text to search in (not a pattern, so no raw string is needed)
text = "rain rain come soon, come fast, make the land green"
# search() scans the whole string and returns the first match, or None
mySearch = re.search(r"rain", text, re.IGNORECASE)
print("Successfully found", mySearch.group(), "from", mySearch.start(), "to", mySearch.end())

#output
Successfully found rain from 0 to 4

Extracting the email using RegEx module

Now that we have covered the basics, it’s time for a bigger challenge. Let us combine file reading and regular expressions in one script and extract some email addresses from a file.

Sample file:

Hello my name is Tom the cat.
I like to play and work with my dear friend jerry mouse. 
We both have our office and email addresses also. 
They are [email protected], [email protected]. 
Our friend spike has also joined us in our company.
His email address is [email protected]. 
We all entertain the children through our show.

Here is a simple file that contains three email addresses. The addresses are mixed in with ordinary text, which makes the task harder, but our code will handle that. Using the above knowledge of regex, we are ready to implement it.

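The regular expression for this is: “[0-9a-zA-Z]+@[0-9a-zA-Z]+\.[0-9a-zA-Z]+”

Code:

import re

try:
    # The with statement closes the file once the loop is done
    with open("data.txt") as file:
        for line in file:
            # Remove leading and trailing whitespace, including the newline
            line = line.strip()
            # Raw string keeps the backslash in "\." intact for the regex engine
            emails = re.findall(r"[0-9a-zA-Z]+@[0-9a-zA-Z]+\.[0-9a-zA-Z]+", line)
            if emails:
                print(emails)

except FileNotFoundError as e:
    print(e)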

Explanation:

  1. The pattern says: extract text that starts with one or more alphanumeric characters, followed by an “@” symbol, then more alphanumeric characters, then a dot “.”, and after the dot the same type of characters again.
  2. Do not use the dot on its own; escape it with a backslash “\.” to tell the Python regex engine that we mean a literal dot. An unescaped dot matches any character except a newline.
  3. Save the sample text in a file named data.txt.
  4. Open the file in reading mode.
  5. Implement a for loop with a line variable. It reads the text one line at a time.
  6. Strip the line to remove leading and trailing whitespace, including the newline at the end.
  7. Call re.findall() with our pattern expression as the first argument and the line variable as the second. It returns a list of every part of the line that matches the pattern.
  8. If the list is not empty, print it.
  9. The outer code is a try-except block that handles the error raised when the file does not exist.
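The steps above can also be wrapped in a reusable function. The following is only a sketch under some assumptions: the function name extract_emails and the broader pattern are illustrative, not part of the script above. The pattern also allows dots, underscores, plus signs and hyphens before the “@” and multi-part domains after it, but it is not a complete validator for every legal email address.

```python
import re
from pathlib import Path

# Illustrative pattern (an assumption, not a full address validator)
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
)


def extract_emails(path):
    """Return the unique addresses found in the file, in order of appearance."""
    text = Path(path).read_text(encoding="utf-8")
    seen = []
    for match in EMAIL_PATTERN.finditer(text):
        address = match.group()
        if address not in seen:
            seen.append(address)
    return seen


if __name__ == "__main__":
    try:
        print(extract_emails("data.txt"))
    except FileNotFoundError as error:
        print(error)
```

Reading the whole file at once keeps the code short; for very large files, the line-by-line loop shown earlier uses less memory.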

Output:

Conclusion

We have implemented a short script that extracts email addresses from a given text file using only a few lines of code.