What is RegEx? Regular Expression in Python & Meta Characters
A regular expression (regex, regexp, or re) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. Regular expressions are widely used in the UNIX world.
Now let's understand a simple regular expression through the following image.
The caret sign (^) serves two purposes. Inside square brackets, as in this figure, it negates the character class: the pattern matches any character that is not an uppercase letter, lowercase letter, digit, underscore, or space. In short, it simply matches special characters in the given string. Outside the square brackets, the caret anchors the match to the start of the string.
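The two roles of the caret can be sketched as follows. The exact character class is an assumption based on the description above (the figure itself is not reproduced here), and the sample text is made up for illustration.

```python
import re

text = "Hello_World 42!"  # hypothetical sample string

# Inside brackets, a leading ^ negates the class: match anything that is
# NOT a letter, digit, underscore, or space (assumed class, see lead-in).
special_chars = re.findall(r"[^a-zA-Z0-9_ ]", text)

# Outside brackets, ^ anchors the match to the start of the string.
starts_with_hello = re.search(r"^Hello", text) is not None
starts_with_world = re.search(r"^World", text) is not None

print(special_chars)      # ['!']
print(starts_with_hello)  # True
print(starts_with_world)  # False
```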
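The example below (like the one in the exercise) uses lookaheads to run a simple check on user input. Note that it is not a full email validator: it only verifies that the input contains at least one digit, one lowercase letter, and one non-word character (such as @ or .).
import re
input_user = input("enter your email address: ")
# Each (?=...) lookahead must succeed, starting from the beginning of the input
m = re.match(r'(?=.*\d)(?=.*[a-z])(?=.*\W)', input_user)
if m:
    print("Email is valid:")
else:
    print("email is not valid:")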
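For comparison, a pattern that looks at the overall shape of an address (name@domain.tld) might be sketched as below. This is a deliberately simplified illustration, not a standards-compliant validator; the names EMAIL_SHAPE and looks_like_email are hypothetical.

```python
import re

# Deliberately simplified shape check: a local part, "@", then a domain with
# at least one dot. Real-world addresses allow more than this pattern covers.
EMAIL_SHAPE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def looks_like_email(text):
    """Return True if the whole string has a rough name@domain.tld shape."""
    return EMAIL_SHAPE.fullmatch(text) is not None

print(looks_like_email("user@example.com"))  # True
print(looks_like_email("abc1!"))             # False
```

Using fullmatch() rather than match() means trailing junk after the domain also causes a rejection.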
The most common usages of regular expressions are:
- Search a string (search and match)
- Find all matches in a string (findall)
- Break a string into substrings (split)
- Replace part of a string (sub)
'Re' Module
The module 're' provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
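As a minimal sketch of that exception in action, the helper below (is_valid_pattern is a hypothetical name, not part of the re module) tries to compile a pattern and reports whether re.error was raised.

```python
import re

def is_valid_pattern(pattern):
    """Return True if `pattern` compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid_pattern(r"[a-z]+"))  # True
print(is_valid_pattern(r"[a-z"))    # False: the character set is never closed
```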
The 're' module provides an interface to the regular expression engine that lets you compile REs into objects and then perform matches with them. A regular expression is simply a sequence of characters that defines a search pattern. Python's built-in 're' module provides excellent support for regular expressions, with a modern and complete regex flavor.
Now, let's understand regular expressions and how they can be used in Python. The very first step is to import the 're' module, which provides all the necessary functionality. It can be done with the following statement in any IDE.
import re
Meta Characters
Metacharacters are characters, or sequences of characters, that hold a special meaning in a computing application, just like '*' in wildcards. Some characters may be used to represent other characters, such as an unprintable character, or a logical operation. They are also known as "operators" and are mostly used to build an expression that can represent a required character in a string or a file.
Below is the list of metacharacters used in regular expressions:
. ^ $ * + ? { } [ ] \ | ( )
The first metacharacters we will explain are [ and ]. They are used to specify a character class, which is a set of characters that you wish to match.
Characters can be listed individually, or a range of characters can be indicated by giving two characters separated by a '-'. For instance, [abc] will match any of the characters a, b, or c; the same set can also be written as the range [a-c]. If you wanted to match only lowercase letters, your RE would be [a-z].
Let's look at what these character classes mean:
Here, [abc] will match if the string you are trying to match contains any of a, b, or c.
You can also specify a range of characters using - inside square brackets.
- [a-e] is the same as [abcde].
- [1-4] is the same as [1234].
- [0-9] is the same as [0123456789].
You can complement (invert) the character set by using the caret ^ symbol at the start of a square-bracket.
- [^abc] means any character except a or b or c.
- [^0-9] means any non-digit character.
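The character classes above can be tried out with re.findall(), which returns every match as a list. The sample string is made up for illustration.

```python
import re

sample = "cab 1234 xyz"  # hypothetical sample string

listed = re.findall(r"[abc]", sample)     # any of a, b, or c
ranged = re.findall(r"[1-4]", sample)     # any digit from 1 to 4
negated = re.findall(r"[^0-9 ]", sample)  # anything except digits and spaces

print(listed)   # ['c', 'a', 'b']
print(ranged)   # ['1', '2', '3', '4']
print(negated)  # ['c', 'a', 'b', 'x', 'y', 'z']
```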
The basic usages of commonly used metacharacters are shown in the following table:
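For example, \$a matches if a string contains $ followed by a. Here, the backslash keeps the RegEx engine from interpreting $ in a special way. Unescaped, $ anchors the match to the end of the string, as in the example below. In this and the following examples, the output shown is the matched text, which you can read from the returned match object with .group().
s = re.search(r'\w+$','789Welcome67 to python')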
Output:
'python'
\ is used to escape a character that has a special meaning so that it is matched literally. For example, '\.' matches '.', '\+' matches '+', etc.
To match a backslash itself, use '\\'. Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode code point, and \Uhhhhhhhh for an 8-digit Unicode code point.
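A short sketch of escaping in practice: metacharacters can be escaped by hand, or re.escape() can add the backslashes for a literal string. The sample text is hypothetical.

```python
import re

price_text = "Total: $5.00 (a+b)"  # hypothetical sample string

# Escaping by hand: \$ and \. match the literal characters '$' and '.'.
price = re.search(r"\$\d+\.\d+", price_text)
print(price.group())  # $5.00

# re.escape() backslash-escapes the special characters in a literal string.
literal = "(a+b)"
escaped_pattern = re.escape(literal)
found = re.search(escaped_pattern, price_text)
print(found.group())  # (a+b)
```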
The following code example shows how '.' matches any single character (except a newline):
s = re.match('........[a-zA-Z0-9]','Welcome to python')
Output:
'Welcome t'
Other Special Sequences
There are some Special sequences that make commonly used patterns easier to write. Below is a list of such special sequences:
Understanding special sequences with examples
\A - Matches if the specified characters are at the start of a string.
s = re.search(r'\A\d','789Welcome67 to python')
Output:
'7'
\b - Matches if the specified characters are at the beginning or end of a word.
a = re.findall(r'\baa\b', "bbb aa \\bash\baaa")
Output:
['aa']
\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.
# Match a 'B' that has word characters on both sides
a = re.findall(r'\BB\B', "BBB \\Bash BaBe BasketBall")
Output:
['B', 'B', 'B']
\d - Matches any decimal digit. Equivalent to [0-9]
a = re.match(r'\d','1Welco+me to python11')
Output:
'1'
\D - Matches any non-digit character. Equivalent to [^0-9]
a = re.match(r'\D','Wel12co+me to python11')
Output:
'W'
\s - Matches where a string contains any white space character. Equivalent to [ \t\n\r\f\v].
a = re.match(r'\s',' Wel12co+me to python11')
Output:
' '
\S - Matches where a string contains any non-white space character. Equivalent to [^ \t\n\r\f\v].
a = re.match(r'\S','W el12co+me to python11')
Output:
'W'
\w - Matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_]. Note that the underscore _ is also treated as a word character.
a = re.match(r'[\w]','1Welco+me to python11')
Output:
'1'
\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]
s = re.match(r'[\W]','@@Welcome to python')
Output:
'@'
\Z - Matches if the specified characters are at the end of a string.
s = re.search(r'\w\Z','789Welcome67 to python')
Output:
'n'
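One practical point about these sequences is worth a quick sketch: in an ordinary Python string literal, "\b" means a backspace character, so patterns that use \b should be written as raw strings (r"..."). The sample text below is made up for illustration.

```python
import re

text = "aa baab aa"  # hypothetical sample string

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw_hits = re.findall(r"\baa\b", text)

# Ordinary string: Python turns "\b" into a backspace character first,
# so the pattern looks for backspace + "aa" + backspace and finds nothing.
plain_hits = re.findall("\baa\b", text)

print(raw_hits)    # ['aa', 'aa']
print(plain_hits)  # []
```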
Module-Level Functions
The 're' module provides many top-level functions; the most commonly used are: match(), search(), findall(), sub(), split(), and compile().
These functions generally take the regular expression pattern as the first argument and the string to apply it to as a later argument. match() and search() return either None or a match object; the other functions return lists, strings, or pattern objects, as described below. The module caches compiled patterns so that later calls using the same regular expression do not need to parse the pattern again.
We will explain some of these functions in the section below.
1. re.match() - The match() function matches only at the beginning of the string. In the following example, match() matches the first character of the given string if it is a digit, a lowercase or uppercase letter, an underscore, or a hyphen.
a = re.match('[0-9_a-zA-Z-]','Welcome to programming')
Output:
'W'
If we add '+' after the character set, it matches one or more repetitions of the characters in the set. In the following example, '+' matches one or more uppercase letters, lowercase letters, digits, underscores, or hyphens (white space excluded), so the match stops at the first space.
a = re.match('[_0-9A-Za-z-]+','Welcome to programming')
Output:
'Welcome'
'*' is a quantifier that matches the element preceding it zero or more times. Let's understand it via the example below. In the given string ('Welcome to programming'), '*' matches the characters in the set as many times as possible.
a = re.match('[_A-Z0-9a-z-]*','Welcome to programming')
Output:
'Welcome'
If we put '*' inside the character set, it loses its special meaning and stands for a literal '*'. The set in the following example therefore matches a single character that is a letter, digit, underscore, hyphen, or '*'. Since the string begins with 'W', the result is 'W'.
a = re.match('[_A-Z0-9a-z-*]','Welcome to programming')
Output:
'W'
The quantifier '?' matches zero or one occurrence of whatever precedes it. In the following example, '?' matches at most one uppercase or lowercase letter, underscore, or hyphen at the beginning of the string.
a = re.match('[_A-Za-z-]?','Welcome to programming')
Output:
'W'
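The three quantifiers can be compared side by side on one made-up string. In this sketch each pattern starts with a literal 'a', and only the quantifier applied to 'b' changes.

```python
import re

sample = "a ab abbb"  # hypothetical sample string

results = {
    "ab*": re.findall(r"ab*", sample),  # 'a' followed by zero or more b's
    "ab+": re.findall(r"ab+", sample),  # 'a' followed by one or more b's
    "ab?": re.findall(r"ab?", sample),  # 'a' followed by zero or one b
}

for pattern, matches in results.items():
    print(pattern, matches)
# ab* ['a', 'ab', 'abbb']
# ab+ ['ab', 'abbb']
# ab? ['a', 'ab', 'ab']
```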
The 're' module also offers functions that search a string for a match anywhere, not only at the beginning. Let's see what these functions do.
2. re.search() - It searches for a pattern in a text. The function re.search() takes a regex pattern and a string and looks for the first location where the pattern matches within the string. If the search is successful, search() returns a match object; otherwise it returns None. The syntax of re.search() is as follows:
a = re.search(pattern, string)
The following example illustrates this.
a = re.search('come', 'welcome to programming')
Output:
<re.Match object; span=(3, 7), match='come'>
3. re.findall() - Returns a list containing all matches. Use re.findall() when you want every non-overlapping match in a string, for example while iterating over the lines of a file; it returns all the matches in a single step. The string is scanned left to right, and matches are returned in the order found. The syntax of re.findall() is as follows:
a = re.findall(pattern, string)
Below is an example of the re.findall() function.
a = re.findall('prog','welcome to programming')
Output:
['prog']
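To see how the three lookup functions covered so far differ, the sketch below runs the same pattern through match(), search(), and findall(). The sample sentence is made up for illustration.

```python
import re

text = "welcome to programming, a program for programmers"

# match() only looks at the start of the string, so it finds nothing here.
at_start = re.match(r"prog\w*", text)
# search() returns the first match anywhere in the string.
first_hit = re.search(r"prog\w*", text)
# findall() returns every non-overlapping match as a list of strings.
all_hits = re.findall(r"prog\w*", text)

print(at_start)           # None
print(first_hit.group())  # programming
print(all_hits)           # ['programming', 'program', 'programmers']
```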
4. re.split() - Returns a list where the string has been split at each match of the pattern. The syntax of re.split() is given below:
a = re.split(pattern, string)
Look at the following examples of the re.split() function:
a = re.split(r'[\W]+','welcome to programming')
Output:
['welcome', 'to', 'programming']
a = re.split(r'\s','Hello how are you')
Output:
['Hello', 'how', 'are', 'you']
b = re.split(r'\d','hello1i am fine')
Output:
['hello', 'i am fine']
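Two optional behaviours of re.split() may be useful alongside the examples above: the maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the separators in the result. A minimal sketch:

```python
import re

# maxsplit=1 stops after the first split; the rest of the string is kept whole.
limited = re.split(r"\s", "Hello how are you", maxsplit=1)

# Wrapping the pattern in parentheses keeps the matched separator in the list.
with_separator = re.split(r"(\d)", "hello1i am fine")

print(limited)         # ['Hello', 'how are you']
print(with_separator)  # ['hello', '1', 'i am fine']
```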
5. re.sub() - Replaces one or more matches with a string. It is used to replace substrings: every match of the pattern in the string is replaced with the replacement value. The syntax of re.sub() is as follows:
a = re.sub(pattern, replacement, string)
The following example replaces all the digits in the given string with an empty string.
m = re.sub('[0-9]','','Welcome to python1234. Coding3456.')
Output:
'Welcome to python. Coding.'
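A few variations on the same substitution, sketched with the string from the example above: replacing each run of digits with a placeholder, limiting the number of replacements with count, and using re.subn() to also get the number of replacements made.

```python
import re

text = "Welcome to python1234. Coding3456."

# Replace each run of digits with a single '#'.
masked = re.sub(r"[0-9]+", "#", text)

# count=2 replaces only the first two matches (here, the digits '1' and '2').
first_two = re.sub(r"[0-9]", "", text, count=2)

# subn() returns a tuple: (new string, number of replacements made).
cleaned, replacements = re.subn(r"[0-9]", "", text)

print(masked)                 # Welcome to python#. Coding#.
print(first_two)              # Welcome to python34. Coding3456.
print(cleaned, replacements)  # Welcome to python. Coding. 8
```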
6. re.compile() - The function re.compile() compiles a pattern into a pattern object, which has methods for operations such as searching for pattern matches or performing string substitutions.
In the following example, re.compile() compiles the regex, and the code then asks the user to enter a name. If the input contains any digit or other special character, the compiled pattern finds a match and the code asks the user for input again. It continues doing this until the user enters a name containing only letters, spaces, and periods.
name_check = re.compile(r"[^A-Za-z\s.]")
name = input("Please, enter your name: ")
# Keep asking while the name contains a disallowed character
while name_check.search(name):
    print("please enter your name correctly!")
    name = input("Please, enter your name: ")
print("Name is provided correctly")
The output of the code above is as follows:
Please, enter your name: 1234
please enter your name correctly!
Please, enter your name: 5678
please enter your name correctly!
Please, enter your name: john
Name is provided correctly
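The same check can be sketched without input(), which makes it easier to reuse and test. This assumes the character class is meant to allow letters, whitespace, and periods (the \s is an assumption); NAME_INVALID and is_valid_name are hypothetical names.

```python
import re

# Compiled once, reused for every call. The class lists the allowed
# characters and the leading ^ negates it, so a hit means "invalid".
# Assumption: letters, whitespace (\s), and periods are allowed.
NAME_INVALID = re.compile(r"[^A-Za-z\s.]")

def is_valid_name(name):
    """Return True if `name` contains no disallowed characters."""
    return NAME_INVALID.search(name) is None

for candidate in ("1234", "john", "J. R. Smith"):
    print(candidate, is_valid_name(candidate))
# 1234 False
# john True
# J. R. Smith True
```

Note that an empty string also passes this check, since it contains no disallowed characters; a real form would probably reject it separately.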
Wrapping Up
Now that we have a rough understanding of what RegEx is and how it works in Python, we can move on to something more technical. It's time to get a small project up and running.