Does var automatically decode in UTF-8?

Leo · January 2020

Hi,

I have encountered some strange behavior when reading and decoding text files with var. When I use var.variable instead of a plain variable, decoding with UTF-8 raises:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

I have reproduced this with a txt file containing only the word 'text' and the following inline script in OpenSesame, using the xpyriment backend:

import string
#this works
with open(pool[u'text.txt']) as file:
   myString = file.read()
var.myString=myString.decode(u'UTF-8-sig')
print(var.myString)
#this does not work
with open(pool[u'text.txt']) as file:
   var.myString = file.read()
var.myString=var.myString.decode(u'UTF-8-sig') #error is here
print(var.myString)

#some testing without line 10
print(myString) #works
#print(myString.encode(u'UTF-8')) #error
print(var.myString) #works
print(var.myString.encode(u'UTF-8')) #works

The error message indicates that the error is located in line 10, and indeed the script works without line 10. Moreover, myString cannot be encoded again using UTF-8, whereas var.myString can be. Without line 10, the script also works when non-ASCII characters are added to the txt file. This seems to indicate that reading the file into var.myString automatically decodes it as UTF-8.

Is there some automatic decoding/encoding when using var and is this intended? And which way should be used in a program?

Best

Leo

sebastiaan · January 2020

Hi Leo,

Yes, you're correct. In Python 2, the var_store automatically decodes str objects to unicode objects. So this:

var.myString.decode(u'UTF-8-sig')

is calling decode() on a unicode object. What happens in that case is a bit strange: Python 2 first calls encode() implicitly to create a str object, assuming ascii encoding, and only then calls decode(). The encode step is where it goes wrong.

In any case, it's not necessary to decode, because everything is already unicode!
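The implicit step can be made explicit in a short sketch. This is not OpenSesame code: already_decoded is a hypothetical stand-in for what var.myString holds after a BOM-prefixed file has been decoded automatically, and the explicit encode('ascii') call mimics what Python 2 does behind the scenes when decode() is called on a unicode object.

```python
# Hypothetical stand-in for var.myString: text that has already been decoded
# from a UTF-8 file starting with a byte order mark (BOM, U+FEFF).
already_decoded = u'\ufefftext'

error_name = None
failing_position = None

# Python 2 runs this ASCII encode implicitly when decode() is called on a
# unicode object; it is written out here so the failing step is visible.
try:
    already_decoded.encode('ascii')
except UnicodeEncodeError as err:
    error_name = type(err).__name__
    failing_position = err.start  # index of the character that cannot be encoded

print(error_name)        # UnicodeEncodeError
print(failing_position)  # 0
```

The failing position 0 matches the original error message: the BOM character at the start of the string is what the ascii codec cannot encode.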

Cheers!

Sebastiaan

Leo · January 2020

Hi Sebastiaan,

Thank you, that explains a lot. But it seems var always converts using UTF-8, so I guess it's a little safer to decode manually (since sadly some applications automatically save as UTF-8 with BOM). See e.g. this example using a UTF-8-BOM encoded file. The second part also doesn't work for ANSI-encoded files, and the error message suggests that var always decodes using UTF-8.

import string
#reading to variable
with open(pool[u'text.txt']) as file:
  myString = file.read()  
var.myString=myString.decode(u'UTF-8')
print(var.myString[0]==u't') #false
var.myString=myString.decode(u'UTF-8-sig')
print(var.myString[0]==u't') #true
#reading to var.variable
with open(pool[u'text.txt']) as file:
  var.myString = file.read()
print(var.myString[0]==u't') #false

Error message using ANSI:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 4: invalid continuation byte
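The same three cases can be checked on raw bytes, outside OpenSesame. This is a minimal sketch, assuming the file content is the word 'text' preceded by a UTF-8 BOM; the cp1252 ("ANSI") sample is made up for illustration and is not the actual file from the experiment.

```python
import codecs

# Assumed content of a UTF-8-with-BOM file holding the word 'text'.
bom_bytes = codecs.BOM_UTF8 + b'text'

plain = bom_bytes.decode('utf-8')         # the BOM survives as U+FEFF
stripped = bom_bytes.decode('utf-8-sig')  # the BOM is removed

print(plain[0] == u't')     # False
print(stripped[0] == u't')  # True

# Illustrative "ANSI" (cp1252) bytes: 0xe4 is a-umlaut in cp1252, but in this
# position it does not form a valid UTF-8 sequence.
ansi_bytes = u'Text\xe4t'.encode('cp1252')

ansi_error_position = None
try:
    ansi_bytes.decode('utf-8')
except UnicodeDecodeError as err:
    ansi_error_position = err.start  # index of the offending byte

print(ansi_error_position)  # 4
```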

Best

Leo

sebastiaan · January 2020

Hi Leo,

Thanks for bringing this to my attention, because this definitely should be clarified in the documentation. But yes, it's as you say: the var object automatically decodes byte strings (str objects in Python 2) to unicode, assuming utf-8 encoding.
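Given that behavior, one way to stay in control of the encoding is to decode while reading, so that var only ever receives unicode and presumably has nothing left to decode. A minimal sketch under stated assumptions: read_text is a hypothetical helper, the temporary file stands in for pool[u'text.txt'], and the assignment to var appears only as a comment because var exists only inside OpenSesame.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf-8-sig'):
    """Hypothetical helper: read a file and return decoded text.

    'utf-8-sig' strips a leading BOM if present and otherwise behaves
    like plain 'utf-8'.
    """
    with io.open(path, encoding=encoding) as f:
        return f.read()


# Stand-in for a pool file: a temporary UTF-8-with-BOM file containing 'text'.
handle, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(handle, 'wb') as f:
    f.write(b'\xef\xbb\xbftext')

my_string = read_text(path)
os.remove(path)

print(my_string == u'text')  # True
# In an OpenSesame inline script: var.myString = read_text(pool[u'text.txt'])
```

For ANSI files, passing encoding='cp1252' to the helper would be the analogous choice, assuming that is the encoding the file was actually saved in.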

Cheers!

Sebastiaan
