Unicode encodings in Python

So what Python 2 called “str” is called “bytes” in Python 3, and what Python 2 called “unicode” is called “str” in Python 3.
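A quick way to see the Python 3 naming is to check the types directly. This is only a small sketch; the sample word and the variable names `text` and `raw` are made up for illustration.

```python
# In Python 3, a plain string literal is str (Unicode text).
text = 'caf\u00e9'            # 4 code points: c, a, f, e-acute

# Encoding the text produces bytes, the Python 3 name for raw byte strings.
raw = text.encode('utf-8')    # e-acute takes 2 bytes in UTF-8

print(type(text).__name__, len(text))
print(type(raw).__name__, len(raw))

# Decoding with the same encoding gives back the original text.
assert raw.decode('utf-8') == text
```

The length differs because `len()` counts code points for str and bytes for bytes.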

Oh boy, that’s why I have been so confused working on both Py2 and Py3 on various projects.

It’s 2020, so from this point on it’s all Py3. The rule of thumb: all I/O is byte strings, and inside Python try to keep everything as unicode (str). IO_byte_string.decode() -> unicode_string, and unicode_string.encode() -> IO_byte_string. So:

with open(filename, 'rb') as f:
  byte_string = f.read()  # binary
  # external knowledge: data encoded in utf-8
  my_string = byte_string.decode('utf-8')
  # my_string is a str, i.e. a sequence of Unicode "code points"
  # Output, say, as ISO-8859-1 (Latin-1)
  output_byte_string = my_string.encode('iso-8859-1')
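The decode-then-encode step can be wrapped up in one small function. This is a sketch under my own assumptions: `transcode` is a hypothetical helper name, not something from the standard library, and the sample strings are invented. It also shows what happens when the text contains a character that Latin-1 cannot represent.

```python
def transcode(data: bytes, src: str = 'utf-8', dst: str = 'latin-1',
              errors: str = 'strict') -> bytes:
    """Hypothetical helper: decode bytes from src to str, then encode to dst."""
    return data.decode(src).encode(dst, errors=errors)


utf8_bytes = 'caf\u00e9'.encode('utf-8')
latin1_bytes = transcode(utf8_bytes)   # e-acute is a single byte in Latin-1

# The euro sign (U+20AC) is not part of Latin-1, so strict encoding fails.
euro_bytes = '\u20ac'.encode('utf-8')
try:
    transcode(euro_bytes)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='replace' substitutes '?' for anything that cannot be encoded.
fallback = transcode(euro_bytes, errors='replace')
```

Whether to fail loudly or replace characters depends on the data; strict is the default in Python and is usually the safer starting point.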


# external knowledge: data encoded in utf-8
with open(filename, 'r', encoding='utf-8') as f:
  my_string = f.read()  # code points

# Output, say, as ISO-8859-1 (Latin-1)
with open(filename2, 'w', encoding='iso-8859-1') as f:
  f.write(my_string)  # encoded to Latin-1 bytes on the way out
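To check that letting open() handle the encoding gives the same bytes as encoding by hand, here is a small round trip through a temporary directory. It is a sketch only: the file names and the sample text are made up, and the text deliberately contains no newlines so that text-mode newline translation does not affect the byte comparison.

```python
import os
import tempfile

text = 'caf\u00e9'

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'in_utf8.txt')      # hypothetical file names
    dst_path = os.path.join(tmp, 'out_latin1.txt')

    # Create a UTF-8 input file by writing raw bytes.
    with open(src_path, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Text mode with encoding= decodes on read, so we get a str back.
    with open(src_path, 'r', encoding='utf-8') as f:
        my_string = f.read()

    # Text mode with encoding= encodes on write.
    with open(dst_path, 'w', encoding='iso-8859-1') as f:
        f.write(my_string)

    # Read the output back in binary to see the actual bytes on disk.
    with open(dst_path, 'rb') as f:
        latin1_bytes = f.read()

assert latin1_bytes == my_string.encode('iso-8859-1')
```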

