Converting Bytes to ASCII or Unicode


Converting bytes to ASCII or Unicode text is helpful when you are working with data that is stored in binary form. Raw bytes are hard for a human to read, so decoding a byte string into readable text is often necessary.

As prerequisites for this tutorial, we are going to learn what a byte string is, how to create one, what ASCII and Unicode are, and how the two differ.

Finally, we are going to see how to convert a byte string to readable ASCII or Unicode text.

What Is a Byte?

Computer memory uses the byte as its fundamental unit of storage. A byte is the basic unit of digital information and is generally made up of 8 bits, each of which is either 0 or 1. Sequences of bytes represent characters, numbers, and even images in binary form.
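
To make the idea of 8 bits concrete, here is a minimal sketch in Python. The variable names (data, value, bits) are chosen for illustration and are not part of the original tutorial.

```python
# A bytes object holding a single byte: the letter "A".
data = b"A"

# Indexing a bytes object returns an integer between 0 and 255.
value = data[0]

# Format that integer as 8 binary digits to see the individual bits.
bits = format(value, "08b")

print(value)  # 65
print(bits)   # 01000001
```

Each of the eight digits printed is one bit, and together they store the number 65, which is the code for the letter A.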


How to Create a Byte String?

Any character or string can be turned into a bytes object in two ways: by prefixing a string literal with the letter b, or by calling the bytes() constructor.

Let us see these two approaches.

Using the b Prefix

Any string literal prefixed with the letter b, in single or double quotes, is created as a bytes object. Such a literal may contain only ASCII characters; other byte values are written with escapes such as \xe4.

An example is given below.

#creating a byte form
str1=b'This is a message'
print("The byte form is:",str1)
print(type(str1))

The output is given below.

The b prefix

As observed from the output, the type of the object stored in the variable str1 is bytes, which means we have successfully created the message as a byte string.
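
As a small sketch of what a bytes literal accepts, the snippet below compares the two quote styles and shows how a non-ASCII byte has to be written. The word used and the variable names are illustrative assumptions, and compile() is used only so that the syntax error can be caught while the script runs.

```python
# Single and double quotes produce the same bytes object.
same = b'This is a message' == b"This is a message"

# Non-ASCII data must be written with escapes: these are the two
# UTF-8 bytes of the letter "é" after the ASCII letters "caf".
cafe = b'caf\xc3\xa9'
print(len(cafe))  # 5

# Typing a non-ASCII character directly inside a bytes literal is a
# SyntaxError. compile() is used so the error can be caught at run time.
try:
    compile("b'caf\u00e9'", "<example>", "eval")
    literal_rejected = False
except SyntaxError:
    literal_rejected = True

print(same, literal_rejected)  # True True
```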

Using the bytes() Constructor

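The bytes() constructor creates an immutable bytes object from a string (together with an encoding), an iterable of integers in the range 0 to 255, or another bytes-like object. Immutable means the object cannot be modified in place after it is created; to get different content, you have to build a new object.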
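
Before moving on, immutability can be illustrated with a short sketch. The values and variable names here are assumptions made for the example; bytearray is Python's built-in mutable counterpart to bytes.

```python
data = bytes([72, 105])      # built from integers: b'Hi'

# Item assignment is not allowed on an immutable bytes object.
try:
    data[0] = 74
    modified_in_place = True
except TypeError:
    modified_in_place = False

# A bytearray is the mutable counterpart and does allow it.
buffer = bytearray(data)
buffer[0] = 74               # 74 is the code for "J"

print(modified_in_place)     # False
print(data, bytes(buffer))   # b'Hi' b'Ji'
```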

Let us take a look at an example.


#using the bytes() to create byte object
s1="Hey! How you doin'?"
print("Original string:",s1)
s2 = bytes(s1, encoding='utf-8')
print("Encoded string:",s2)
print(type(s1))
print(type(s2))

To explain the code briefly, we create two variables, s1 and s2, to store the original string and the encoded string respectively. We print both on the screen using the print() function.

The bytes() constructor is used to create a byte object for the original string with an encoding format called utf-8.

UTF-8 stands for “Unicode Transformation Format, 8-bit”.

Next, we print the types of the objects stored in s1 and s2. While type(s1) gives str (a string), type(s2) gives bytes.

The output is given below.

Using The Bytes Constructor
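
As a side note, the same bytes can also be produced with the string method encode(). The sketch below compares the two routes; the variable names via_constructor, via_method, and needs_encoding are illustrative.

```python
s1 = "Hey! How you doin'?"

# str.encode() produces the same bytes as bytes(s1, encoding='utf-8').
via_constructor = bytes(s1, encoding="utf-8")
via_method = s1.encode("utf-8")
print(via_constructor == via_method)  # True

# Calling bytes() on a str without an encoding raises TypeError.
try:
    bytes(s1)
    needs_encoding = False
except TypeError:
    needs_encoding = True
print(needs_encoding)  # True
```

Which form to use is largely a matter of taste; encode() is the more common spelling when the input is already known to be a string.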

What Is ASCII?

ASCII, which stands for American Standard Code for Information Interchange, is a standard character-encoding format used for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices.

The ASCII code of a character can be obtained using Python's ord() function.


Let us see a sample code to find out the ASCII values of the English alphabet.

#print ascii values of small alphabets
#for capital alphabets, use the string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for ch in 'abcdefghijklmnopqrstuvwxyz! ':
    asc = ord(ch)
    print(ch,':',asc)

In this code, we use a for loop to run through every letter from a to z and print the ASCII value of each. The variable ch is the loop variable that takes each character in turn. The character set in this code is the lowercase alphabet plus ‘!’ and a space.

The ord() function returns the Unicode code point of a character, which for ASCII characters is the same as the ASCII code. The variable ch is passed as an argument to this function and the result is stored in another variable called asc.

Inside the same for loop, we print the result, which follows the pattern a : 97.

Let us see the output.

ASCII Values
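
Going the other way is just as easy. The sketch below uses the built-in chr() function, which is the inverse of ord(), and checks that every character from the loop above fits in the 7-bit ASCII range. The variable names are illustrative, and str.isascii() assumes Python 3.7 or later.

```python
# chr() is the inverse of ord(): it maps a code back to its character.
code = ord("a")
char = chr(code)
print(code, char)  # 97 a

# Every ASCII code fits in 7 bits, i.e. lies in the range 0 to 127.
all_in_range = all(ord(ch) < 128 for ch in "abcdefghijklmnopqrstuvwxyz! ")
print(all_in_range)  # True

# str.isascii() (Python 3.7+) performs the same check on a whole string.
print("hello!".isascii())  # True
print("안녕".isascii())  # False
```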

What Is Unicode?

If you know even a bit about ASCII and Unicode, you might be wondering, aren’t they both the same?

Not quite. The two agree on the 128 characters that ASCII defines, so English letters, digits, and basic punctuation have the same values in both. Unicode goes much further: it assigns a code point to characters from almost every writing system known to mankind.

Isn’t it interesting to know that we can find the code points for a language other than English?

Let us see one example!

text = "Hello,안녕하세요!"
print(text)
for ch in text:
    print("Value for", ch, "is", ord(ch))

In this example, we have taken a variable called text to store English text as well as the Korean word for hello. We avoid naming the variable str because that would shadow Python's built-in str type. The string is printed to the screen with the help of print() in the next line.

We used a for loop and an iterator variable to go through every character in the string.

Inside the loop, we print the Unicode code point of each character in the string. The variable ch represents each character in the string, and the function ord() takes this variable as an argument and returns the number associated with the character.

You can experiment with any language of your choice to see what code is assigned to it. Sounds fun, right?

Unicode Values
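
Code points are usually written in the form U+ followed by hexadecimal digits. As a rough sketch, the snippet below prints that notation along with the official character name, using the standard library module unicodedata. The two sample characters and the variable names are chosen for illustration.

```python
import unicodedata

for ch in "H안":
    code_point = ord(ch)
    # Code points are conventionally written as U+ followed by hex digits.
    label = f"U+{code_point:04X}"
    # unicodedata.name() looks up the official Unicode character name.
    print(ch, code_point, label, unicodedata.name(ch))
```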

Differences between ASCII and Unicode

Let us learn a few differences between the two encoding techniques.

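ASCII is limited to English letters, digits, and basic punctuation, whereas Unicode is far more diverse and supports scripts such as Hebrew, Greek, Latin, and the many Indian scripts.

The next difference is that ASCII uses a single 7-bit scheme (128 characters), while Unicode has multiple encoding schemes such as UTF-8, UTF-16, and UTF-32, which are built from 8-bit, 16-bit, and 32-bit units respectively.

We can say that ASCII is a subset of Unicode: the first 128 Unicode code points are exactly the ASCII characters.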
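
These differences can be checked with a short sketch. The sample strings and variable names are assumptions made for the example; the "-le" codec names are used so that Python does not prepend a byte order mark, which would inflate the lengths.

```python
text = "A"

# The same character takes a different number of bytes in each scheme.
utf8_len = len(text.encode("utf-8"))       # 1
utf16_len = len(text.encode("utf-16-le"))  # 2
utf32_len = len(text.encode("utf-32-le"))  # 4
print(utf8_len, utf16_len, utf32_len)

# For pure ASCII text, the ASCII and UTF-8 encodings are identical bytes.
ascii_matches_utf8 = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(ascii_matches_utf8)  # True

# UTF-8 is variable-length: a Korean syllable needs three bytes.
korean_len = len("안".encode("utf-8"))
print(korean_len)  # 3
```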

Now that we have knowledge about byte objects, ASCII and Unicode, let us learn how to convert byte objects into ASCII and Unicode.

Converting Bytes to ASCII

The decode() method can be used to convert a bytes object into an ASCII string, which can then be used to obtain the ASCII value of each character in the string.

Let us see the code.

bsr = bytes("HelloWorld!", encoding='utf-8')
print("The encoded byte object is:", bsr)
asr = bsr.decode("ascii")
print("The ascii form of the string:", asr)
for ch in asr:
    asc = ord(ch)
    print("The ascii form of", ch, "is", asc)

Let us look at the explanation of the above code.

In the first line, we create a variable called bsr to store the bytes produced by encoding the string with the bytes() constructor. The utf-8 encoding scheme is used in this code; for text made up only of ASCII characters, utf-8 and ascii produce identical bytes. Other encodings can be used as well, as long as the bytes are decoded with a compatible one. The resulting byte string is then printed.

Next, we create another variable, asr, to store the string decoded with the decode() method. Again, this string is also printed on the screen.

Just like we did in the previous example of ASCII, we are running a for loop to go through every character of the string to find the ASCII values.

The ord function is used to get the ASCII values of the characters.

The output would be something like The ascii form of H is 72.

Converting Bytes To Ascii
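
The example above decodes cleanly because every byte is a valid ASCII value. When that is not guaranteed, decode() accepts an optional errors argument that controls what happens to bytes outside the ASCII range. The following is only a sketch: the sample bytes and variable names are made up for illustration, and bytes.isascii() assumes Python 3.7 or later.

```python
data = b"Hi \xe4\xbd\xa0"   # ASCII "Hi " followed by three non-ASCII bytes

# bytes.isascii() reports whether every byte is below 128.
print(data.isascii())  # False

# errors="ignore" drops the bytes that are not valid ASCII.
ignored = data.decode("ascii", errors="ignore")
print(repr(ignored))  # 'Hi '

# errors="replace" substitutes U+FFFD, the replacement character, for each one.
replaced = data.decode("ascii", errors="replace")
print(len(replaced))  # 6
```

Both options lose information, so they are best kept for cases such as logging or display where an approximate result is acceptable.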

Converting Bytes to Unicode

We can use the decode() method to convert bytes to Unicode too. In the previous example, we saw how to create a bytes object using the bytes() constructor. In this example, we are going to use the “b” prefix.

bsr = b'\xe4\xbd\xa0\xe5\xa5\xbd'
usr = bsr.decode('utf-8')
print(usr)
for u in usr:
    uv = ord(u)
    print("The Unicode point of", u, "is", uv)

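We have used a variable called bsr, just like in the previous example, but this time it stores bytes that are already encoded: the six bytes are the UTF-8 encoding of the Chinese greeting 你好.

We use a new variable, usr, to store the string obtained by decoding the bytes with the utf-8 encoding.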

We can use the print() function to print the string.

The next three lines are used to obtain the code points of the characters.

Converting Bytes To Unicode
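
As a quick check on this example, the sketch below confirms that decoding and then encoding again returns the original bytes. The names round_trip and code_points are illustrative additions.

```python
bsr = b'\xe4\xbd\xa0\xe5\xa5\xbd'
usr = bsr.decode('utf-8')

# Six bytes decode to two characters, since each one uses three bytes in UTF-8.
print(len(bsr), len(usr))  # 6 2

# Encoding the text again gives back exactly the original bytes.
round_trip = usr.encode('utf-8')
print(round_trip == bsr)  # True

# Code points of the two characters.
code_points = [ord(u) for u in usr]
print(code_points)  # [20320, 22909]
```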

Let us see if we can decode the same bytes as ASCII. As you may have guessed, this raises an error (a UnicodeDecodeError), because ASCII only covers byte values from 0 to 127 and every byte in this string is larger than that.

Error In Converting Bytes To Ascii
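
The failing conversion can be reproduced, and handled, along these lines. This is a sketch rather than the exact code from the tutorial; the try/except wrapper and the variable name error are assumptions added so that the script keeps running.

```python
bsr = b'\xe4\xbd\xa0\xe5\xa5\xbd'

try:
    bsr.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded.
    error = exc
    print("Cannot decode as ASCII:", exc.reason, "at position", exc.start)
```

Catching the exception this way lets a program fall back to another encoding, such as utf-8, instead of stopping.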

Conclusion

To conclude, we have seen why byte objects need to be converted to ASCII or Unicode text. We have learned what a byte is and how to create a bytes object with the help of two approaches, which we discussed in detail.

The two approaches are using the b prefix and the bytes() constructor.

We have understood what ASCII is and how to use the ord() function to return the ASCII value of any character. As an example, we have seen code that prints the ASCII values of the lowercase letters.

We have also understood Unicode and how it has code points for characters in almost every language.

We have observed the major differences between ASCII and Unicode.

Lastly, coming to our main topic, we have converted a bytes object into ASCII as well as Unicode text. We have also observed that some bytes cannot be decoded as ASCII because they encode characters outside the ASCII range.
