Regular Expression for a String With a Certain Condition

Regular expressions (regex) in Python are a tool for searching and manipulating strings or text. A regular expression is a sequence of characters that defines a search pattern, and it is often used to validate text against that pattern. Regular expressions play an important role in natural language processing (NLP) and text processing.

With regular expressions, we can also split a string into substrings and build new strings from them. Python implements regular expressions in the re module.

In this tutorial, we are going to learn how to use regular expressions and also create one for a string with a certain condition to be satisfied.

What Is a Regular Expression?

As mentioned above, regular expressions are patterns used to detect whether a string contains a specific pattern. Regular expressions are used in automata theory, the theory of computation (TOC), and programming languages. In automata theory, a regular expression describes exactly the set of strings that some finite automaton (FA) accepts. These pattern-matching systems have many other applications.

Applications of Regular Expressions

We tend to use regex in our daily life. Wonder how? Think of the find and replace features in a text editor like MS Word. When you search for a word, the editor scans the entire document and shows the word if it is present; otherwise, it displays a message. For replace, you give the word you want to replace and the new word to put in its place. These operations are a simple form of the pattern matching that regex provides.

Let us take a look at the general applications of regex.

Data Validation
Text Processing/Pre-Processing
Web Scraping
Parsing
Data Wrangling

Coming to regex in Python, you can do the following:

Email Extraction
Text Pre-Processing (NLP)
Extracting Date-Time
RSS Feeds

Let us get started with the basic operations of the regular expressions.

re.search()

As its name suggests, this function searches a given string for a pattern and returns a match object for the first occurrence. If the pattern is not found, it returns None.

The syntax of this function is given below.

re.search(pattern, string, flags=0)

Let us see an example.

import re
def findword(text, word):
    # Return a match object for the first occurrence of word, or None
    m = re.search(word, text)
    return m
text = 'Python is a popular programming language founded by Guido Van Rossum in 1991. It is the most sought after programming language for new technologies like Artificial Intelligence and Machine Learning'
word = 'programming'
res = findword(text, word)
if res is None:
    print("Word not found!!")
else:
    print("Search Success!!")
    # group() returns the text that matched
    print("The searched word is:", res.group())

In the very first line, we import the re module. Next, we use the def keyword to create a function called findword, which takes the string and the word to search for as its arguments. Inside it, we call re.search and store the result in m.

Next, we initialize the variable text with the string to search and the variable word with the word to look for. We call findword with both and store the result in a variable called res.

Finally, an if-else statement prints the result. If the given word is not present in the string, the function returns None and Word not found!! is printed on the screen. If the word is present, Search Success!! is printed along with the matched word, which we read from the match object with group().

Here is the output.

Search Success!!
The searched word is: programming
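The value returned by re.search is a match object rather than the matched text itself. The following is a minimal sketch of the most common ways to read it. The names sentence and match are chosen here only for illustration.

```python
import re

sentence = "Python is a popular programming language"
match = re.search("programming", sentence)

if match is not None:
    print(match.group())  # the text that matched
    print(match.start())  # index where the match begins
    print(match.end())    # index just past the last matched character
    print(match.span())   # (start, end) as a tuple
```

Checking for None before calling these methods matters, because a failed search returns None and None has no group() method.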

re.match()

The match function of the re module is similar to the search function, but it checks for the pattern only at the beginning of the string. If the pattern occurs elsewhere in the string but not at the start, match returns None.

Let us see an example.

import re
def findmatch(s, p):
    # re.match only looks for p at the start of s
    m = re.match(p, s)
    return m
# Python appears at the beginning of s
s = 'Python,Java,C,C#,Perl,PHP,Ruby....'
p = 'Python'
res = findmatch(s, p)
if res is None:
    print("Match not found in the beginning of string s")
else:
    print("Match found at the beginning of string s!!")
    print(res)
# Python appears in s1, but not at the beginning
s1 = 'Java,C,C#,Python,Perl,PHP,Ruby...'
res1 = findmatch(s1, p)
if res1 is None:
    print("Match not found in the beginning of string s1")
else:
    print("Match found at the beginning of string s1!!")
    print(res1)

In the above code, we have created two strings, s and s1. The word Python occurs at the beginning of s but in the middle of s1. As in the previous example, we have created a function that matches the word against the string. We store the results of the two match operations in res and res1.

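Let us take a look at the output. The match object is shown the way Python 3.7 or later prints it.

Match found at the beginning of string s!!
<re.Match object; span=(0, 6), match='Python'>
Match not found in the beginning of string s1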
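To make the difference between the two functions concrete, the short sketch below runs re.match and re.search on the same string. The variable names are illustrative and not part of the example above.

```python
import re

languages = "Java,C,C#,Python,Perl,PHP,Ruby"

at_start = re.match("Python", languages)   # only tries index 0
anywhere = re.search("Python", languages)  # scans the whole string

print(at_start)          # None, because the string starts with "Java"
print(anywhere.start())  # index at which "Python" was found
```

Use match when the position matters, for example when validating a prefix, and search when the pattern may appear anywhere.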

re.findall()

The findall function of the re module returns all non-overlapping matches of the pattern in the given string as a list. Let us see an example.

import re
def find_all(text, word):
    # findall returns a list of every non-overlapping match
    m = re.findall(word, text)
    return m
text = 'Python is a popular programming language founded by Guido Van Rossum in 1991. It is the most sought after programming language for new technologies like Artificial Intelligence and Machine Learning'
word = 'programming'
res = find_all(text, word)
# findall returns an empty list, never None, when nothing matches
if not res:
    print("Word not found!!")
else:
    print("Search Success!!")
    print("The searched word is:", res)

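In the above example, we create a function called find_all that takes a string and a pattern or word and returns every occurrence of it in the string. Since programming appears twice in the string, the result is a list with two entries. When there is no match, findall returns an empty list rather than None.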
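findall becomes more useful with a real pattern instead of a fixed word. As a rough sketch of the email extraction use case listed earlier, the example below pulls email-like tokens out of a sentence. The pattern EMAIL is a deliberately simplified assumption and does not cover every valid address, and the sample addresses are made up.

```python
import re

# Simplified, illustrative pattern: local part, "@", then dot-separated labels.
# (?:...) is a non-capturing group, so findall returns the whole match.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

message = "Write to alice@example.com or bob.smith@mail.example.org."
emails = re.findall(EMAIL, message)
print(emails)

# No match gives an empty list, not None
print(re.findall(EMAIL, "no addresses here"))
```

The non-capturing group is a deliberate choice: when a pattern contains capturing groups, findall returns the contents of the groups instead of the full matches.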

re.split()

This function is used to split the string or text based on a certain pattern.

Let us see an example.

import re
text = "Python is a popular programming language founded by Guido Van Rossum in 1991. It is the most sought after programming language for new technologies like Artificial Intelligence and Machine Learning."
# \s+ matches one or more whitespace characters
wd = re.split(r"\s+", text)
print("The words of the string split on white space are:")
print(wd)

In this example, we split the string on runs of whitespace (\s+) and store the resulting list of words in a variable called wd. The result is then printed in the next line.

Those are the basic functions of the re module. Now let us dive into the problem statement.
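re.split is not limited to whitespace. The sketch below, with a made-up input string, splits on commas or semicolons with optional surrounding spaces and also shows the maxsplit argument, which limits the number of splits.

```python
import re

languages = "Python, Java;C ,  Ruby"

# Split on "," or ";" together with any whitespace around the delimiter
parts = re.split(r"\s*[,;]\s*", languages)
print(parts)

# maxsplit=1 splits only at the first delimiter
head_and_rest = re.split(r"\s*[,;]\s*", languages, maxsplit=1)
print(head_and_rest)
```

Including the surrounding whitespace in the delimiter pattern saves a separate strip() call on every piece.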

Regex for a String With a Certain Condition

Let us consider a pattern. We need a regular expression for strings made up of the characters [a-zA-Z0-9-], but with a condition: dashes are allowed anywhere in the string except at the start and at the end.

Let us understand the problem statement. The regular expression should accept strings that start and end with a letter or a digit and can contain dashes anywhere in between. So there should be no dashes at the start or the end.

While the string abc-12a-def can be accepted by the regular expression, the string -123-- should not be accepted.

Let us take a look at the regular expression first and understand it.

But before we start off, we need to be clear about some of the modifiers of regular expressions.

Modifier	Definition
*	Matches zero or more occurrences of the preceding character or group
+	Matches one or more occurrences of the preceding character or group
?	Matches zero or one occurrence of the preceding character or group
$	Matches at the end of the string
^	Matches at the beginning of the string
Modifiers of Regular Expressions
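The table can be checked with a few quick experiments. The sketch below is only an illustration: the helper matches and the sample patterns are invented for this purpose, and each row pairs a pattern with a string to test.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found in the text."""
    return re.search(pattern, text) is not None

checks = [
    (r"^ab*$", "a"),          # * allows zero b's
    (r"^ab*$", "abbb"),       # ... or many
    (r"^ab+$", "a"),          # + needs at least one b
    (r"^ab+$", "ab"),
    (r"^colou?r$", "color"),  # ? makes the u optional
    (r"^colou?r$", "colour"),
    (r"^Py", "Python"),       # ^ anchors at the beginning
    (r"on$", "Python"),       # $ anchors at the end
    (r"^on", "Python"),       # "on" is not at the beginning
]

for pattern, text in checks:
    print(f"{pattern!r:<14} {text!r:<10} {matches(pattern, text)}")
```

Only the third and the last rows print False: ab+ requires at least one b, and Python does not begin with on.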

The regular expression that should work is:

^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$

The ^ anchor makes the match start at the beginning of the string. [a-zA-Z0-9]+ then matches one or more alphanumeric characters, so the string cannot begin with a dash. The group (-[a-zA-Z0-9]+) matches a single dash followed by one or more alphanumeric characters, and the * after it lets this group repeat zero or more times. Because every dash must be followed by at least one alphanumeric character, a dash can never be the last character. The $ anchor makes the match stop at the end of the string. Note that this pattern also rejects two dashes in a row, as in a--b.
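One way to see how the pieces fit together is to assemble the same expression from a named building block and try it on a few strings. This is a sketch: the names ALNUM and PATTERN and the list of candidates are chosen for illustration.

```python
import re

ALNUM = "[a-zA-Z0-9]+"  # one or more letters or digits
# An alphanumeric run, then zero or more "dash + alphanumeric run" groups
PATTERN = re.compile(f"^{ALNUM}(-{ALNUM})*$")

candidates = ["abc-12a-def", "abc", "a", "a--b", "-123--", ""]
results = {}
for candidate in candidates:
    results[candidate] = PATTERN.match(candidate) is not None
    verdict = "accepted" if results[candidate] else "rejected"
    print(f"{candidate!r}: {verdict}")
```

The first three candidates are accepted. The string a--b is rejected because each dash must be followed directly by a letter or digit, and the empty string is rejected because the leading + requires at least one character.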

Let us take a look at some examples.

Dashes in Between the Alphanumerics

Let us take user input that contains dashes somewhere in the middle of the string and check whether the regular expression accepts it.

import re
pattern = r"^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$"
text = input("Enter any string:")
if re.match(pattern,text):
    print("It is a Match!")
else:
    print("Sorry:( it does not match...")

In the first line, we are importing the re module.

Next, we have assigned the regular expression we discussed earlier to a variable called pattern. Then, we use the match function to check if the input string matches the pattern.

If it is a match, the print statement prints It is a Match!. Else, the statement in the else clause is displayed.

For example, for the input abc-12a-def, the program prints It is a Match!. The dashes appear only in between the alphanumeric characters, so the string is accepted.
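One detail is worth knowing when the text does not come from input(): the $ anchor also matches just before a trailing newline, so re.match with the pattern above accepts a string such as "abc-1\n". The sketch below shows re.fullmatch from the standard re module as a stricter alternative. The helper name is_valid is made up for this example.

```python
import re

CORE = r"[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*"

def is_valid(text):
    """Return True only if the entire string fits the pattern."""
    return re.fullmatch(CORE, text) is not None

# The anchored pattern with re.match tolerates one trailing newline
anchored = re.match("^" + CORE + "$", "abc-1\n") is not None
print(anchored)

# re.fullmatch must consume every character, including the newline
print(is_valid("abc-1\n"))
print(is_valid("abc-1"))
```

With re.fullmatch the ^ and $ anchors are not needed, because the function itself requires the whole string to match.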

Now let us see what happens if the input string contains dashes at the beginning.

Dashes at the Beginning of the String

What happens if the input string contains dashes at the beginning? The string is not accepted, because the regex we created does not allow a dash as the first character. For example, -123-- is rejected.

Dashes at the End of the String

Now let us put dashes at the end of the string. As you may have guessed already, the input does not match the pattern, because the last character must be alphanumeric.
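Both rejected cases can be reproduced without typing input by hand. The following sketch wraps the same pattern in a small function and feeds it a few sample strings. The function name check and the sample strings are assumptions made for this illustration.

```python
import re

pattern = r"^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$"

def check(text):
    """Return the same messages as the interactive example."""
    if re.match(pattern, text):
        return "It is a Match!"
    return "Sorry:( it does not match..."

# leading dash, trailing dash, both, and a valid string for comparison
for text in ["-abc-123", "abc-123-", "-123--", "abc-123"]:
    print(f"{text!r} -> {check(text)}")
```

Only the last string is accepted. The other three fail because their first or last character is a dash.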

Conclusion

We have reached the end of the tutorial. To recapitulate, we have learned about regular expressions, how they are used in general, and what we can do with regex in Python. Regular expressions are used in automata theory to describe the strings accepted by a finite automaton.

We also happen to use regex-style matching in our daily life in the form of find and replace operations in text editors like MS Word.

We then discussed the popular functions of the re module: search, match, findall, and split.

After that, we worked through the problem statement, which asks for a regular expression that allows dashes only in between the characters of a string and not at the beginning or the end.

We have seen three examples of the placement of the dashes in the string and observed the outputs. The regex we built accepts strings that have dashes in between alphanumeric characters and rejects strings that have dashes at the start or the end.