Does var automatically decode in UTF-8?
Hi,
I have encountered some strange behavior when reading and decoding text files and using var. When using var.variable instead of variable, decoding with UTF-8 produces an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
I have reproduced this using a txt file containing only the word 'text' and the following inline script in OpenSesame, using the xpyriment backend:
import string
#this works
with open(pool[u'text.txt']) as file:
    myString = file.read()
var.myString=myString.decode(u'UTF-8-sig')
print(var.myString)
#this does not work
with open(pool[u'text.txt']) as file:
    var.myString = file.read()
var.myString=var.myString.decode(u'UTF-8-sig') #error is here
print(var.myString)
#some testing without line 10
print(myString) #works
#print(myString.encode(u'UTF-8')) #error
print(var.myString) #works
print(var.myString.encode(u'UTF-8')) #works
The error message indicates that the error is located in line 10.
And indeed, this works without line 10. Moreover, myString cannot be encoded again using UTF-8, whereas var.myString can be. Without line 10, the script also works when non-ASCII characters are added to the txt file. This seems to indicate that reading the file into var.myString automatically decodes it as UTF-8.
Is there some automatic decoding/encoding when using var and is this intended? And which way should be used in a program?
Best
Leo
Comments
Hi Leo,
Yes, you're correct. In Python 2, the var_store automatically decodes str objects to unicode objects. So the line marked '#error is here' in your script is calling decode() on a unicode object. What happens in that case is a bit strange. Python 2 will automatically call encode() to first create a str object, assuming ascii encoding, and only then call decode(). And the encode step is where it goes wrong.
In any case, it's not necessary to decode, because everything is already unicode!
Cheers!
Sebastiaan
Check out SigmundAI.eu for our OpenSesame AI assistant!
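The two-step behavior described above can be made visible with a small sketch. This is written for Python 3, where str has no decode() method, so the hidden ASCII encode that Python 2 performs is spelled out explicitly. The helper name py2_style_decode is made up for illustration and is not part of OpenSesame or Python.

```python
# Python 3 sketch that spells out the step Python 2 performs implicitly
# when decode() is called on a unicode object.
# py2_style_decode is a hypothetical helper, not an OpenSesame or Python API.

text = u"\ufefftext"  # already-decoded text that still carries the BOM character


def py2_style_decode(value, encoding):
    """Mimic Python 2's unicode.decode(): encode to ASCII first, then decode."""
    as_bytes = value.encode("ascii")  # the hidden step; fails on non-ASCII text
    return as_bytes.decode(encoding)


try:
    py2_style_decode(text, "utf-8-sig")
    error_name = None
except UnicodeEncodeError as exc:
    # The failure comes from the encode step, before any decoding happens.
    error_name = type(exc).__name__
    print(error_name, exc)

# Pure ASCII text survives the round trip, which is why the problem only
# shows up once a BOM or another non-ASCII character is present.
ascii_only = py2_style_decode(u"text", "utf-8-sig")
print(ascii_only)
```

The error is an encode error even though the script only asked for a decode, which matches the message in the original post.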
Hi Sebastiaan,
Thank you, that explains a lot. But it seems var always converts using UTF-8, so I guess it's a little safer to decode manually (since sadly some applications automatically use UTF-8-BOM). See e.g. this example using a UTF-8-BOM encoded file. The second part also doesn't work for ANSI-encoded files, and the error message suggests that the decoding of var is always performed using UTF-8.
Error message using ANSI:
Best
Leo
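The difference between the encodings mentioned here can be checked outside OpenSesame with plain Python. The sketch below assumes that "ANSI" means Windows-1252 (cp1252), which is a common but not universal meaning, and uses the made-up sample word "café" to get a non-ASCII byte.

```python
import codecs

# Bytes as an editor that saves "UTF-8 with BOM" would write them.
bom_bytes = codecs.BOM_UTF8 + u"text".encode("utf-8")

# Plain utf-8 keeps the BOM as a leading u'\ufeff' character.
plain = bom_bytes.decode("utf-8")

# utf-8-sig strips the BOM if it is present.
stripped = bom_bytes.decode("utf-8-sig")

# Assumption: "ANSI" is taken to mean cp1252 here.
ansi_bytes = u"caf\xe9".encode("cp1252")  # b'caf\xe9'

try:
    ansi_bytes.decode("utf-8")
    ansi_failed = False
except UnicodeDecodeError as exc:
    # A lone 0xE9 byte is not valid UTF-8, so a UTF-8 decode cannot succeed.
    ansi_failed = True
    print(exc)

# Decoding with the encoding the file was actually saved in works.
ansi_text = ansi_bytes.decode("cp1252")
```

So a decode that always assumes UTF-8 leaves the BOM in place for UTF-8-BOM files and fails outright for cp1252 files that contain non-ASCII characters.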
Hi Leo,
Thanks for bringing this to my attention, because this definitely should be clarified in the documentation. But yes, it's as you say: the var object automatically decodes byte strings (str) to unicode, assuming utf-8 encoding.
Cheers!
Sebastiaan
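Given the above, one way to stay in control of the encoding is to decode while reading the file, so that only unicode text is ever assigned to var. The following is a minimal sketch, not an official OpenSesame recipe. The helper read_text and the temporary demo file are made up for illustration, and the assignment to var is shown only as a comment because var exists only inside an OpenSesame inline script.

```python
import codecs
import io
import os
import tempfile


def read_text(path, encoding="utf-8-sig"):
    """Read a text file and return decoded text (hypothetical helper).

    io.open accepts an encoding argument on both Python 2 and Python 3,
    and utf-8-sig drops a leading BOM if the file has one.
    """
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Create a small UTF-8-BOM file so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "text.txt")
with open(demo_path, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + u"text".encode("utf-8"))

my_string = read_text(demo_path)
print(my_string)

# In an inline script the path would come from the file pool and the result
# would be stored on var, for example:
#     var.myString = read_text(pool[u'text.txt'])
# The value is already decoded at that point, so no further decode() call
# should be needed.
```

For a file saved in another encoding, the encoding argument would have to be changed to match, for example cp1252 for files saved as Windows "ANSI".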