How to Convert Binary Data to UTF-8 in Python

Dealing with binary data and text encodings can be tricky in any programming language. In Python, you may encounter binary data when reading files opened in binary mode, interfacing with network sockets, or using libraries that return binary buffers.

Python’s str type, on the other hand, holds Unicode text. So you need to explicitly encode and decode whenever you move between binary data and text.


What is Binary Data?

Computers ultimately store all data as binary – sequences of 1’s and 0’s, grouped into bytes. This representation is compact to store and transmit, but raw binary data is not human-readable. So we have various encoding schemes that define how bytes map to text characters and vice-versa. Here are some examples of binary data you may encounter in Python:

Contents read from a file opened in binary mode
Image data from files or camera devices
Audio buffers from microphone input
Network packets received from a socket
Output of a C library function

Binary data can contain any arbitrary sequence of bytes. Text, by contrast, has structure: an encoding defines which byte sequences are valid and which human-readable character each one represents.
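To make "any arbitrary sequence of bytes" concrete, here is a minimal sketch. The values and the variable name raw are only illustrative.

```python
# A bytes object can hold any values from 0 to 255, printable or not.
raw = bytes([0, 72, 105, 255])

print(raw)        # b'\x00Hi\xff'
print(list(raw))  # [0, 72, 105, 255]
print(raw[1])     # 72, because indexing a bytes object returns an int
```

Only the two middle bytes happen to correspond to printable ASCII characters. The rest show up as \x escapes.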

What is Text Encoding?

Text encoding schemes provide rules to map binary byte sequences to text characters that humans can read and write.

Some examples of text encodings are:

ASCII – Maps the values 0 to 127 to English letters, digits, symbols, and control characters
UTF-8 – Unicode encoding that can represent every Unicode code point, using one to four bytes per character
Latin-1 – Single-byte encoding (ISO-8859-1) for Western European languages

The same binary data, when decoded with different text encodings, can result in different text output, or fail to decode at all.

So dealing with binary data and text requires you to be aware of encodings.
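As a quick illustration, the sketch below decodes the same two bytes with two different encodings. The byte values are chosen for the example.

```python
data = b"\xc3\xa9"  # the same two bytes in both cases

as_utf8 = data.decode("utf-8")      # UTF-8 reads them as one character
as_latin1 = data.decode("latin-1")  # Latin-1 reads them as two characters

print(as_utf8, len(as_utf8))      # é 1
print(as_latin1, len(as_latin1))  # Ã© 2
```

Neither call raises an error, which is why a wrong encoding guess often goes unnoticed until someone reads the output.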

Python String and Bytes Types

Python has two main types for representing binary data and text:

bytes – Immutable sequence of integers in the range 0 <= x < 256. Used for representing binary data.
str – Immutable sequence of Unicode codepoints. Used for representing text.

You need to convert between these types using .encode() and .decode() methods when interfacing with binary data in Python.

Here is an example:

# Text string
text = "Hello World" 

# Encode text to binary data
data = text.encode("utf-8") 

# Decode binary data back to text string
text = data.decode("utf-8")  

print(data)  # b'Hello World'
print(text)  # Hello World
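With pure ASCII text like "Hello World", the str and bytes forms look almost identical. The difference is easier to see with a non-ASCII character, as in this small sketch (the sample word is arbitrary):

```python
text = "café"                # 4 Unicode code points
data = text.encode("utf-8")  # 5 bytes, because é takes two bytes in UTF-8

print(type(text), len(text))  # <class 'str'> 4
print(type(data), len(data))  # <class 'bytes'> 5
print(data)                   # b'caf\xc3\xa9'
print(data[0])                # 99, the integer value of the byte for "c"
```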

Now let’s go through different techniques to convert binary data to UTF-8 encoded text in Python.

Convert Binary Data to UTF-8 String

This section provides various methods to decode binary data to UTF-8 properly.

Method 1: Decode with UTF-8

If your binary data is already UTF-8 encoded, you can simply decode it to a text string:

data = b"Some binary data"
text = data.decode("utf-8")

Note that in Python 3, UTF-8 is the default encoding for decode(), so you can omit it:

text = data.decode() # UTF-8 default

This handles the majority of the UTF-8 decoding use cases.
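A common place this comes up is reading a file in binary mode. The following sketch writes a small temporary file so that it is self-contained. The file contents are made up for the example.

```python
import os
import tempfile

# Create a small UTF-8 encoded file so the example can run on its own.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("naïve café".encode("utf-8"))
    path = f.name

# Read the raw bytes back, then decode them explicitly.
with open(path, "rb") as f:
    raw = f.read()

text = raw.decode("utf-8")
print(text)  # naïve café

os.remove(path)  # clean up the temporary file
```

If you control how the file is opened, open(path, encoding="utf-8") in text mode does the same decoding for you.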


Method 2: Decode Latin-1 Before UTF-8

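For binary data with unknown encoding, you can first decode it as Latin-1. Latin-1 maps each of the 256 possible byte values to the Unicode code point with the same number, so this decode never raises an error.

Then, if you need UTF-8 bytes, you can encode the resulting text as UTF-8:

data = b"\x12\xab" # Binary data with unknown encoding

text = data.decode("latin-1")      # str, one character per byte
utf8_bytes = text.encode("utf-8")  # bytes, now valid UTF-8

Caution: This never fails, but the text is only meaningful if the data really is Latin-1. Bytes written in any other encoding will silently decode to the wrong characters. Only use this method if you are sure your binary data is Latin-1, or if you just need a lossless way to carry arbitrary bytes inside a str.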
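The sketch below shows what the Latin-1 step actually does to each byte, and how the UTF-8 bytes differ from the input. The sample bytes are arbitrary.

```python
data = b"\x12\xab\xff"

# Latin-1 maps byte N to code point U+00NN, so decoding cannot fail.
text = data.decode("latin-1")
print([hex(ord(ch)) for ch in text])  # ['0x12', '0xab', '0xff']

# Encoding back as Latin-1 restores the original bytes exactly.
assert text.encode("latin-1") == data

# Encoding as UTF-8 gives different bytes: code points above 0x7F take two bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'\x12\xc2\xab\xc3\xbf'
```

So the round trip through Latin-1 is lossless, but the UTF-8 result is a re-encoding of the Latin-1 interpretation, not of whatever the bytes originally meant.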

Method 3: Base64 Encode Before Decoding

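Another technique that works for unknown binary data is to base64 encode it first. Base64 output uses only ASCII characters, so it always decodes cleanly as UTF-8. Keep in mind that the result is a text representation of the bytes, not the text the bytes might contain.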

import base64

data = b"\x12\xab" 

# Base64 encode 
b64_data = base64.b64encode(data)  

# b64encode() returns ASCII-only bytes, so decoding them as UTF-8 always succeeds
text = b64_data.decode("utf-8")   

print(text)
# Output: Eqs=

The output base64 text will be safe for UTF-8 decoding.

Base64 makes the data about 33% larger, and you must call base64.b64decode() to get the original bytes back. In exchange, it reliably turns even non-text binary data into a UTF-8 safe string.
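To recover the original bytes later, reverse the two steps. A minimal round-trip sketch, with illustrative variable names:

```python
import base64

data = b"\x12\xab"

# bytes -> base64 text
text = base64.b64encode(data).decode("ascii")

# base64 text -> original bytes
restored = base64.b64decode(text)

print(text)              # Eqs=
print(restored == data)  # True
```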

Method 4: Hex Encode Before Decoding UTF-8

Similar to base64 encoding, you can hex encode the binary data first. The output hex text will be UTF-8 friendly:

import binascii

data = b"\x12\xab"

hex_data = binascii.hexlify(data) 

text = hex_data.decode("utf-8")  
print(text)
# Output: 12ab

So hex encoding is another method to reliably represent arbitrary binary data as UTF-8 text, at the cost of doubling its size.
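In Python 3 the bytes type also has built-in hex helpers, which avoid the binascii import and the extra decode step. A short sketch of the round trip:

```python
data = b"\x12\xab"

text = data.hex()               # bytes -> hex str
restored = bytes.fromhex(text)  # hex str -> original bytes

print(text)              # 12ab
print(restored == data)  # True
```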

Handling Encoding Errors

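Decoding as UTF-8 fails with a UnicodeDecodeError when the data contains bytes that are not valid UTF-8. For example, b"\x12\xab".decode("utf-8") raises:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 1: invalid start byte

This means your binary data is not valid UTF-8. Note that \x12 is not the problem here: it is a valid ASCII control character. There are a few ways to handle such errors: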
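Before choosing a strategy, it can help to catch the exception and look at what went wrong. The sketch below uses the start, end, object, and reason attributes that UnicodeDecodeError provides. The variable names are illustrative.

```python
data = b"\x12\xab"
error = None

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; "exc" is cleared after the except block
    bad_bytes = exc.object[exc.start:exc.end]
    print(exc.reason)            # invalid start byte
    print(exc.start, bad_bytes)  # 1 b'\xab'
```

Logging the offset and the offending bytes makes it much easier to guess what encoding the data was really in.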

Ignore Errors

Pass errors="ignore" to skip undecodable bytes:

data = b"Some \x12invalid utf-8\xab" 

text = data.decode("utf-8", errors="ignore")
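print(repr(text)) # 'Some \x12invalid utf-8'

The invalid byte \xab is omitted from the text output. The byte \x12 is kept, because it is a valid (though unprintable) ASCII control character.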

Replace Errors

Replace bad bytes with a placeholder like � using errors="replace":

data = b"Some \x12invalid utf-8\xab" 

text = data.decode("utf-8", errors="replace")
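print(repr(text)) # 'Some \x12invalid utf-8�'

The replacement character (U+FFFD) shows the reader where an invalid byte sequence was found. Here only \xab is replaced, since \x12 is valid UTF-8.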
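Python ships a few more error handlers besides "ignore" and "replace". Two that are useful when you do not want to lose information are "backslashreplace" and "surrogateescape". This is a small sketch of both on the same sample bytes:

```python
data = b"Some \x12invalid utf-8\xab"

# backslashreplace keeps the bad byte visible as an escape sequence.
escaped = data.decode("utf-8", errors="backslashreplace")
print(repr(escaped))  # 'Some \x12invalid utf-8\\xab'

# surrogateescape carries the bad byte through as a lone surrogate,
# so the original bytes can be restored when encoding again.
carried = data.decode("utf-8", errors="surrogateescape")
restored = carried.encode("utf-8", errors="surrogateescape")
print(restored == data)  # True
```

A string produced with surrogateescape cannot be encoded with the default strict handler, so it is best kept internal and converted back with the same handler.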

Use Fallback Encoding

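If your data is in a legacy encoding like Latin-1 (ISO-8859-1), decode it with that encoding to get a str. Encode the result as UTF-8 only if you need UTF-8 bytes:

data = b"Some \x12invalid utf-8\xab"

text = data.decode("iso-8859-1")   # never fails; \xab becomes «
utf8_bytes = text.encode("utf-8")  # every str with no lone surrogates encodes cleanly

Every byte maps to some Latin-1 character, so no bytes are dropped. The text is only correct, however, if the data really was Latin-1.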
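A fallback is often written as "try UTF-8 first, then the legacy encoding". The helper below is a hypothetical sketch of that idea. The function name decode_with_fallback and the default encoding list are assumptions for illustration, not a standard API.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


print(decode_with_fallback("café".encode("utf-8")))  # ('café', 'utf-8')
print(decode_with_fallback(b"caf\xe9"))              # ('café', 'latin-1')
```

Order matters: Latin-1 accepts any byte sequence, so it should come last, and a successful decode still does not prove the guess was right.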

Summary

We went over various techniques to handle binary data to UTF-8 text conversion in Python:

For UTF-8 encoded input, decode it directly
For unknown input, decode as Latin-1, or represent the bytes as Base64 or hex text
Handle encoding errors by ignoring, replacing, or using fallbacks

Some best practices are:

Know your input encoding – Explicitly decode with the required encoding
Handle errors gracefully – Don’t let unmappable bytes crash your program
Validate decoded text – Spot check output if corruption is unacceptable
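For the validation step, one simple spot check is to decode leniently and count how many replacement characters appear. The helper name count_replacements is hypothetical, and this is only a rough sketch of one possible check.

```python
REPLACEMENT = "\ufffd"  # the character inserted by errors="replace"


def count_replacements(data, encoding="utf-8"):
    """Decode leniently and report how many replacement characters resulted."""
    text = data.decode(encoding, errors="replace")
    return text, text.count(REPLACEMENT)


text, bad = count_replacements(b"Some \x12invalid utf-8\xab")
print(bad)  # 1
```

A count of zero means the data decoded cleanly, unless the source text itself legitimately contained U+FFFD, in which case the count overstates the damage.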

Converting between binary data and text is unavoidable. I hope these practical examples give you a toolkit to handle these scenarios in your Python applications.

The key ideas are being aware of text encodings and using the appropriate decoding method based on your binary data source and use case requirements.