Unicode In Python – The unicodedata Module Explained


Hey guys! In this tutorial, we will learn about Unicode in Python and the character properties of Unicode. So, let’s get started.

What is Unicode?

Unicode assigns each character and symbol a unique number called a code point. It covers the world’s writing systems, so text in any combination of languages can be stored, retrieved, and combined reliably.

A code point is an integer in the range 0 to 0x10FFFF (written here in hexadecimal).

To begin using Unicode characters in Python, we need to understand how the string module interprets characters.

How to interpret ASCII and Unicode in Python?

Python provides a string module that contains constants and helper tools for working with strings. Its character constants cover only the ASCII character set.

import string

print(string.ascii_lowercase) 
print(string.ascii_uppercase)
print(string.ascii_letters)
print(string.digits)
print(string.hexdigits)
print(string.octdigits)
print(string.whitespace)  
print(string.punctuation)

Output:

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
0123456789abcdefABCDEF
01234567
 	
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

We can create a one-character Unicode string with the built-in chr() function. It takes a single integer, the code point, and returns the character at that code point.

Similarly, ord() is a built-in function that takes a one-character Unicode string as input and returns its code point value.

chr(57344)
ord('\ue000')

Output:

'\ue000'
57344
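As a small sketch of how the two functions relate, the snippet below round-trips one code point and then tries a value just past the 0x10FFFF limit. The euro sign is an arbitrary sample, and the variable names are only illustrative.

```python
# Round trip: code point -> character -> code point.
code_point = 0x20AC            # EURO SIGN, chosen arbitrarily for illustration
char = chr(code_point)
print(char, ord(char) == code_point)

# chr() accepts 0 through 0x10FFFF; anything outside raises ValueError.
try:
    chr(0x110000)
    out_of_range = False
except ValueError:
    out_of_range = True
print(out_of_range)
```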

What does character encoding mean in Python?

A string is a sequence of Unicode codepoints. These codepoints are converted into a sequence of bytes for efficient storage. This process is called character encoding.

There are many encodings, such as UTF-8, UTF-16, and ASCII.

By default, Python 3 uses UTF-8: it is the default source file encoding and the default for str.encode() and bytes.decode().

What is UTF-8 Encoding?

UTF-8 is the most widely used character encoding. UTF stands for Unicode Transformation Format, and the ‘8’ means that the encoding works in 8-bit units.

It has largely replaced ASCII (American Standard Code for Information Interchange) because it can represent every Unicode character and so works for languages around the world, whereas ASCII is limited to 128 characters, essentially unaccented English letters, digits, and punctuation.

The first 128 Unicode code points are the ASCII characters, and UTF-8 encodes each of them as the same single byte that ASCII uses. A character in UTF-8 is 1 to 4 bytes long.
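To make the 1 to 4 byte range concrete, here is a minimal sketch that encodes one sample character of each length. The four characters are an illustrative choice, not a complete test of the encoding.

```python
# Sample characters picked to cover each UTF-8 length (an illustrative choice):
# 'a', 'ö' (U+00F6), '€' (U+20AC) and an emoji (U+1F600).
samples = ['a', '\u00f6', '\u20ac', '\U0001F600']

byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
for ch, n in byte_lengths.items():
    print(f'U+{ord(ch):04X} -> {n} byte(s)')
```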

Encoding Characters in UTF-8 Using the Python encode() function

The encode() method converts a string into a bytes object using the given encoding. The syntax of the encode function is as shown below:

string.encode(encoding='UTF-8',errors='strict')

Parameters:

  • encoding – the encoding to use; it must be one that Python supports.
  • errors – how characters that cannot be encoded are handled. The available handlers are listed below.
  1. strict – The default. Raises UnicodeEncodeError on failure.
  2. ignore – Drops the unencodable character from the result.
  3. replace – Replaces the unencodable character with ‘?’.
  4. xmlcharrefreplace – Inserts an XML character reference in place of the unencodable character.
  5. backslashreplace – Inserts a backslash escape sequence (\xNN, \uNNNN or \UNNNNNNNN) in place of the unencodable character.
  6. namereplace – Inserts a \N{…} escape sequence in place of the unencodable character.
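The strict handler is the only one that raises instead of substituting something. A minimal sketch of catching that failure is shown here; the sample word and variable names are illustrative.

```python
text = 'wei\u00df'   # 'weiß', sample text with one non-ASCII character

bad_part = None
try:
    text.encode('ascii')           # errors='strict' is the default
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # exc.start and exc.end mark the slice that could not be encoded
    bad_part = text[exc.start:exc.end]

print(failed, bad_part)
```

Catching UnicodeEncodeError like this is useful when you want to report the offending character rather than silently dropping or replacing it.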

How to use Unicode in Python with the encode() function?

Let’s now see how the string encode() function turns Unicode strings into bytes in Python.

1. Encode a string to UTF-8 encoding

string = 'örange'
print('The string is:',string)
string_utf=string.encode()
print('The encoded string is:',string_utf)

Output:

The string is: örange
The encoded string is: b'\xc3\xb6range'
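Going the other way, bytes.decode() turns the bytes back into a string. The short sketch below round-trips the same sample word; the variable names are illustrative only.

```python
text = '\u00f6range'                 # 'örange'
encoded = text.encode('utf-8')       # str -> bytes
decoded = encoded.decode('utf-8')    # bytes -> str

# The string has 6 characters, but 'ö' needs two bytes in UTF-8.
print(len(text), len(encoded))
print(decoded == text)
```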

2. Encoding with error parameter

Let us encode the German word weiß, which means white.

string = 'weiß'

x = string.encode(encoding='ascii',errors='backslashreplace')
print(x)

x = string.encode(encoding='ascii',errors='ignore')
print(x)

x = string.encode(encoding='ascii',errors='namereplace')
print(x)

x = string.encode(encoding='ascii',errors='replace')
print(x)

x = string.encode(encoding='ascii',errors='xmlcharrefreplace')
print(x)

x = string.encode(encoding='UTF-8',errors='strict')
print(x)

Output:

b'wei\\xdf'
b'wei'
b'wei\\N{LATIN SMALL LETTER SHARP S}'
b'wei?'
b'wei&#223;'
b'wei\xc3\x9f'

The unicodedata module to work with Unicode in Python

The unicodedata module gives us access to the Unicode Character Database (UCD), which defines the character properties of all Unicode characters.

Let’s look at all the functions defined within the module with a simple example to explain their functionality. We can efficiently use Unicode in Python with the use of the following functions.

1. unicodedata.lookup(name)

This function looks up a character by its name. If the name is found, the corresponding character is returned. If not, KeyError is raised.

import unicodedata 
   
print (unicodedata.lookup('LEFT CURLY BRACKET')) 
print (unicodedata.lookup('RIGHT SQUARE BRACKET')) 
print (unicodedata.lookup('ASTERISK'))
print (unicodedata.lookup('EXCLAMATION MARK'))

Output:

{
]
*
!
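When the name might not exist, the lookup can be wrapped so a failure does not stop the program. This is only a sketch: safe_lookup is a hypothetical helper written for this example, not something unicodedata provides.

```python
import unicodedata

def safe_lookup(name, fallback='?'):
    """Return the character called `name`, or `fallback` if the name is unknown.

    safe_lookup is a hypothetical helper, not part of the unicodedata module.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return fallback

print(safe_lookup('ASTERISK'))
print(safe_lookup('NO SUCH CHARACTER NAME'))
```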

2. unicodedata.name(chr[, default])

This function returns the name assigned to the character chr as a string. If no name is defined, it returns default if one is given; otherwise it raises ValueError.

import unicodedata 
   
print (unicodedata.name(u'%')) 
print (unicodedata.name(u'|')) 
print (unicodedata.name(u'*')) 
print (unicodedata.name(u'@'))

Output:

PERCENT SIGN
VERTICAL LINE
ASTERISK
COMMERCIAL AT
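The default argument matters for characters that have no name, such as control characters. Here is a minimal sketch using the newline character; the 'UNNAMED' placeholder is an arbitrary choice.

```python
import unicodedata

# Control characters such as '\n' have no name in the database,
# so the default is returned instead.
fallback_name = unicodedata.name('\n', 'UNNAMED')
print(fallback_name)

# Without a default, the same call raises ValueError.
try:
    unicodedata.name('\n')
    raised = False
except ValueError:
    raised = True
print(raised)
```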

3. unicodedata.decimal(chr[, default])

This function returns the decimal value assigned to the character chr as an integer. If no such value is defined, default is returned if given; otherwise ValueError is raised, as shown in the example below.

import unicodedata
   
print (unicodedata.decimal(u'6'))
print (unicodedata.decimal(u'b')) 

Output:

6
Traceback (most recent call last):
  File "D:\DSCracker\DS Cracker\program.py", line 4, in <module>
    print (unicodedata.decimal(u'b')) 
ValueError: not a decimal
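Passing a default avoids the exception. The sketch below also shows that decimal() is not limited to ASCII digits; the -1 sentinel and the Arabic-Indic sample are illustrative choices.

```python
import unicodedata

# 'b' has no decimal value, so the default (-1 here) is returned.
missing = unicodedata.decimal('b', -1)
print(missing)

# Digits from other scripts work too: U+0663 is ARABIC-INDIC DIGIT THREE.
arabic_three = unicodedata.decimal('\u0663')
print(arabic_three)
```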

4. unicodedata.digit(chr[, default])

This function returns the digit value assigned to the character chr as an integer. One thing to note is that this function takes a single character as input. In the last line of this example, I’ve passed “20”, and the function throws an error stating that it only accepts a single character, not a longer string.

import unicodedata 
   
print (unicodedata.digit(u'9')) 
print (unicodedata.digit(u'0')) 
print (unicodedata.digit(u'20'))

Output:

9
0
Traceback (most recent call last):
  File "D:\DSCracker\DS Cracker\program.py", line 5, in <module>
    print (unicodedata.digit(u'20'))
TypeError: digit() argument 1 must be a unicode character, not str

5. unicodedata.numeric(chr[, default])

This function returns the numeric value assigned to the character chr as a float. If no value is defined, default is returned if given; otherwise ValueError is raised. Like digit(), it accepts only a single character.

import unicodedata 
   
print (unicodedata.numeric(u'1'))
print (unicodedata.numeric(u'8'))
print (unicodedata.numeric(u'123'))

Output:

1.0
8.0
Traceback (most recent call last):
  File "D:\DSCracker\DS Cracker\program.py", line 5, in <module>
    print (unicodedata.numeric(u'123')) 
TypeError: numeric() argument 1 must be a unicode character, not str
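The three functions decimal(), digit() and numeric() differ in which characters they accept. A small sketch comparing them on an ordinary digit, a superscript two, and the one-half fraction follows. None is passed as the default so that a missing value does not raise; the sample characters are illustrative.

```python
import unicodedata

# '5', SUPERSCRIPT TWO (U+00B2) and VULGAR FRACTION ONE HALF (U+00BD)
samples = ['5', '\u00b2', '\u00bd']

results = {}
for ch in samples:
    # None is the default returned when the character has no such value.
    results[ch] = (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )
    print(repr(ch), results[ch])
```

The superscript has a digit value but no decimal value, and the fraction has only a numeric value, which is why numeric() returns a float.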

6. unicodedata.category(chr)

This function returns the general category assigned to the character chr as a two-letter string. The first letter is the major class and the second the subclass: for example, ‘Lu’ is an uppercase letter and ‘Ll’ is a lowercase letter.

import unicodedata 
   
print (unicodedata.category(u'P')) 
print (unicodedata.category(u'p')) 

Output:

Lu
Ll
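Letters are only one major class. As a rough sketch, the snippet below prints the category of a few other kinds of characters; the samples are arbitrary.

```python
import unicodedata

# A letter pair, a digit, a space, a punctuation mark and a currency sign.
samples = ['P', 'p', '7', ' ', '!', '\u20ac']

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(repr(ch), cat)
```

Here Nd is a decimal number, Zs a space separator, Po other punctuation, and Sc a currency symbol.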

7. unicodedata.bidirectional(chr)

This function returns the bidirectional class assigned to the character chr as a string. An empty string is returned by this function if no such value is defined.

For example, AL denotes an Arabic letter, AN an Arabic number, and L a left-to-right character.

import unicodedata 
   
print (unicodedata.bidirectional(u'\u0760'))

print (unicodedata.bidirectional(u'\u0560')) 

print (unicodedata.bidirectional(u'\u0660')) 


Output:

AL
L
AN

8. unicodedata.combining(chr)

This function returns the canonical combining class assigned to the given character chr as an integer. It returns 0 if there is no combining class defined.

import unicodedata 
   
print (unicodedata.combining(u"\u0317"))

Output:

220

9. unicodedata.mirrored(chr)

This function returns the mirrored property assigned to the given character chr as an integer. It returns 1 if the character is identified as ‘mirrored’ in bidirectional text; otherwise it returns 0.

import unicodedata 
   
print (unicodedata.mirrored(u"\u0028"))
print (unicodedata.mirrored(u"\u0578"))

Output:

1
0

10. unicodedata.normalize(form, unistr)

This function returns the normal form of the Unicode string unistr. The valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

from unicodedata import normalize 
   
print ('%r' % normalize('NFD', u'\u00C6')) 
print ('%r' % normalize('NFC', u'C\u0367')) 
print ('%r' % normalize('NFKD', u'\u2760')) 

Output:

'Æ'
'Cͧ'
'❠'
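The characters above come back unchanged because none of them has a decomposition. For contrast, here is a minimal sketch with an accented letter, where normalization does change the string. strip_accents is a hypothetical helper written for this example, and the sample strings are arbitrary.

```python
from unicodedata import normalize, combining

composed = '\u00e9'       # 'é' as a single code point
decomposed = 'e\u0301'    # 'e' followed by COMBINING ACUTE ACCENT

# The two look the same on screen but are different code point sequences.
print(composed == decomposed)
print(normalize('NFC', decomposed) == composed)
print(normalize('NFD', composed) == decomposed)

# Compatibility forms also fold presentation variants, e.g. the 'fi' ligature.
print(normalize('NFKC', '\ufb01'))

def strip_accents(text):
    """Hypothetical helper: decompose with NFD, then drop combining marks."""
    return ''.join(ch for ch in normalize('NFD', text) if not combining(ch))

print(strip_accents('caf\u00e9'))
```

Normalizing both sides to the same form before comparing strings is the usual way to avoid mismatches between visually identical text.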

Conclusion

In this tutorial, we learned about Unicode and the unicodedata module, which exposes the properties of Unicode characters. Hope you all enjoyed it. Stay tuned 🙂
