So “str” in Python 2 is now called “bytes,” and “unicode” in Python 2 is now called “str”.
Oh boy, that’s why I have been so confused working on both Py2 and Py3 on various projects.
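To make the renaming concrete, here is a minimal sketch of the two Python 3 types side by side. The sample string "café" is just an illustration, not something from a real project.

```python
# Python 3: text and binary data are separate types.
text = "café"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: the UTF-8 encoding of that text

print(type(text).__name__)    # str
print(type(data).__name__)    # bytes
print(len(text), len(data))   # 4 5 ("é" is one code point but two bytes in UTF-8)
print(data.decode("utf-8") == text)  # True
```

The length mismatch is the easiest way to see that str counts code points while bytes counts encoded bytes.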
It’s 2020. From this point on, it’s all Py3. So all I/O is byte strings, and I try to keep unicode/str inside Python: IO_byte_string.decode() -> unicode_string, unicode_string.encode() -> IO_byte_string. So:
    with open(filename, 'rb') as f:
        byte_string = f.read()  # binary
    # external knowledge: data encoded in utf-8
    my_string = byte_string.decode('utf-8')  # my_string is a sequence of "code points"
    # Output, say, using ISO 8859-1 (Latin-1)
    output_byte_string = my_string.encode('iso-8859-1')
OR
    # external knowledge: data encoded in utf-8
    with open(filename, 'r', encoding='utf-8') as f:
        my_string = f.read()  # code points
    # Output, say, using ISO 8859-1 (Latin-1); note mode 'w' for writing
    with open(filename2, 'w', encoding='iso-8859-1') as f:
        f.write(my_string)
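Here is a self-contained sketch of the text-mode approach that can be run as-is. It assumes a throwaway temp directory, made-up file names, and a sample string whose characters all exist in Latin-1. Reading both files back in binary mode shows how the same text ends up as different bytes on disk.

```python
import os
import tempfile

text = "café"  # sample text; every character here exists in Latin-1

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in_utf8.txt")      # hypothetical file names
    dst = os.path.join(tmp, "out_latin1.txt")

    # Create a UTF-8 input file to stand in for the external data.
    with open(src, "wb") as f:
        f.write(text.encode("utf-8"))

    # Text mode: Python decodes on read and encodes on write.
    with open(src, "r", encoding="utf-8") as f:
        my_string = f.read()
    with open(dst, "w", encoding="iso-8859-1") as f:
        f.write(my_string)

    # Inspect the raw bytes of both files.
    with open(src, "rb") as f:
        utf8_bytes = f.read()
    with open(dst, "rb") as f:
        latin1_bytes = f.read()

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'
```

If the text contained a character outside Latin-1, the write would raise UnicodeEncodeError, so the output encoding has to be able to represent everything in the string.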
ref:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://nedbatchelder.com/text/unipain.html