So “str” in Python 2 is now called “bytes,” and “unicode” in Python 2 is now called “str”.
Oh boy, that’s why I have been so confused working on both Py2 and Py3 on various projects.
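A quick sanity check of the Py3 naming, as a minimal sketch (the sample string is just something I made up for illustration):

```python
text = "caf\u00e9"             # str in Py3: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes in Py3: the encoded form of that text

print(type(text).__name__, len(text))   # length counts code points
print(type(data).__name__, len(data))   # length counts bytes
```

The two lengths differ because the accented letter is one code point but takes two bytes in UTF-8.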
It’s 2020, so from this point on it’s all Py3. The rule: everything that crosses an I/O boundary is a byte string, and inside Python you try to keep everything as unicode/str. IO_byte_string.decode() -> unicode_string, and unicode_string.encode() -> IO_byte_string. So:
[code lang="python"]
with open(filename, 'rb') as f:
    byte_string = f.read()  # binary

# external knowledge: data encoded in utf-8
my_string = byte_string.decode('utf-8')
# my_string is a sequence of "code points"

# Output say, using ISO 8859-1 (Latin-1)
output_byte_string = my_string.encode('iso-8859-1')
[/code]
OR
[code lang="python"]
# external knowledge: data encoded in utf-8
with open(filename, 'r', encoding='utf-8') as f:
    my_string = f.read()  # code points

# Output say, using ISO 8859-1 (Latin-1); mode must be 'w' because we are writing
with open(filename2, 'w', encoding='iso-8859-1') as f:
    f.write(my_string)
[/code]
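One thing the snippets above skip: Latin-1 cannot represent every code point, so the encode step can fail. Here is a rough, self-contained sketch of the read-then-write version with that handled. The `transcode` helper, the file names, and the sample text are all my own inventions for illustration, and `errors="replace"` is just one possible policy.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding="utf-8", dst_encoding="iso-8859-1"):
    """Read text as src_encoding and write it back out as dst_encoding."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()  # str: code points
    # errors="replace" writes "?" for any character the target encoding lacks,
    # instead of raising UnicodeEncodeError.
    with open(dst_path, "w", encoding=dst_encoding, errors="replace") as dst:
        dst.write(text)
    return text


tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt")
dst_path = os.path.join(tmp_dir, "out.txt")

# Create a UTF-8 input file containing a euro sign, which Latin-1 does not have.
with open(src_path, "wb") as f:
    f.write("caf\u00e9 \u20ac5".encode("utf-8"))

transcode(src_path, dst_path)

# Read the result back as raw bytes to see what actually landed on disk.
with open(dst_path, "rb") as f:
    latin1_bytes = f.read()
print(latin1_bytes)
```

Leaving `errors` at its default (`"strict"`) is the safer choice when silently losing characters is not acceptable; the write then raises instead of substituting.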
ref:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://nedbatchelder.com/text/unipain.html