Unicode encodings in Python

So “str” in Python 2 is now called “bytes,” and “unicode” in Python 2 is now called “str”.
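A quick way to see the two Python 3 types side by side. This is only a minimal sketch; the sample word and the variable names are mine, picked for illustration.

```python
text = "caf\u00e9"             # str: a sequence of Unicode code points ("café")
data = text.encode("utf-8")    # bytes: the raw byte values used for I/O

print(type(text).__name__, len(text))   # str 4   (four code points)
print(type(data).__name__, len(data))   # bytes 5 (é takes two bytes in UTF-8)
print(data[0])                          # 99: indexing bytes gives an int
```

The lengths differ because one code point can take more than one byte, which is why the two types should not be mixed.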

Oh boy, that’s why I have been so confused working on both Py2 and Py3 on various projects.

It’s 2020, so from this point on it’s all Py3. The rule: anything that crosses an I/O boundary is a byte string, and everything inside Python should stay unicode (str). Decode on the way in, encode on the way out: IO_byte_string.decode() -> unicode_string, unicode_string.encode() -> IO_byte_string. So:

with open(filename, 'rb') as f:
  byte_string = f.read()  # binary: a bytes object
  # external knowledge: data encoded in utf-8
  my_string = byte_string.decode('utf-8')
  # my_string is a str: a sequence of Unicode "code points"
  # Output, say, using ISO 8859-1 (Latin-1); '8859-1' alone is not a codec name Python recognizes
  output_byte_string = my_string.encode('iso-8859-1')
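The same steps without a file, as a rough sketch: the sample word is my own stand-in for whatever the file would contain. It shows that one string becomes different bytes under different encodings.

```python
# Stand-in for bytes read from a UTF-8 encoded file ("naïve")
utf8_bytes = "na\u00efve".encode("utf-8")

my_string = utf8_bytes.decode("utf-8")       # bytes -> str (code points)
latin1_bytes = my_string.encode("latin-1")   # str -> bytes, different encoding

print(utf8_bytes)    # b'na\xc3\xafve'  (ï is two bytes in UTF-8)
print(latin1_bytes)  # b'na\xefve'      (ï is one byte in Latin-1)

# Decoding with the matching codec gives back the same string
assert latin1_bytes.decode("latin-1") == my_string
```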

OR

# external knowledge: data encoded in utf-8
with open(filename, 'r', encoding='utf-8') as f:
  my_string = f.read()  # code points

# Output, say, using ISO 8859-1 (Latin-1); open in 'w' mode to write
with open(filename2, 'w', encoding='iso-8859-1') as f:
  f.write(my_string)
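One caveat when the output encoding is Latin-1: it only covers 256 code points, so some strings cannot be encoded at all. Here is a small sketch of what happens, assuming a sample string of my own; the errors argument is a standard parameter of str.encode().

```python
my_string = "price: 5\u20ac"   # the euro sign has no Latin-1 byte

try:
    my_string.encode("latin-1")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# errors="replace" substitutes "?" instead of raising (this loses data)
lossy = my_string.encode("latin-1", errors="replace")
print(lossy)  # b'price: 5?'
```

Whether to fail loudly or replace characters depends on what reads the output. open() takes the same errors argument.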

ref:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://nedbatchelder.com/text/unipain.html
