Does var automatically decode in UTF-8?

LeoLeo
edited January 15 in OpenSesame

Hi,


I have encountered some strange behavior when reading and decoding text files in combination with var. When using var.variable instead of a plain variable, decoding with UTF-8 produces the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

I have reproduced this with a txt file containing only the word 'text' and the following inline script in OpenSesame, using the xpyriment backend:

import string
# this works
with open(pool[u'text.txt']) as f:
    myString = f.read()
var.myString = myString.decode(u'UTF-8-sig')
print(var.myString)
# this does not work
with open(pool[u'text.txt']) as f:
    var.myString = f.read()
var.myString = var.myString.decode(u'UTF-8-sig')  # error is here
print(var.myString)

# some testing without line 10
print(myString)  # works
# print(myString.encode(u'UTF-8'))  # error
print(var.myString)  # works
print(var.myString.encode(u'UTF-8'))  # works

The error message indicates that the error is located in line 10.

And indeed, the script works without line 10. Moreover, myString cannot be encoded again using UTF-8, whereas var.myString can be. Without line 10, the script also works when non-ASCII characters are added to the txt file. This seems to indicate that reading the file into var.myString automatically decodes it as UTF-8.


Is there some automatic decoding/encoding when using var, and is this intended? And which approach should be used in a program?


Best

Leo
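
A minimal sketch of one way to avoid the ambiguity, assuming the file is UTF-8 with or without a BOM: decode while reading, so that an already decoded (unicode) value is what gets assigned. It uses io.open, which accepts an encoding argument in both Python 2 and Python 3. The temporary file is only a stand-in for pool[u'text.txt'], and the assignment to var.myString is left as a comment because var exists only inside OpenSesame.

```python
import codecs
import io
import os
import tempfile

# Stand-in for pool[u'text.txt']: a UTF-8 file that starts with a BOM
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'text')

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if there is one
with io.open(path, encoding='utf-8-sig') as f:
    my_string = f.read()
os.remove(path)

print(my_string == u'text')
# In an inline script this would be followed by: var.myString = my_string
```

Because the value is decoded before it is assigned, no further decode() call is needed afterwards.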

Comments

  • Hi Leo,

    Yes, you're correct. In Python 2, the var_store automatically decodes str objects to unicode objects. So this:

    var.myString.decode(u'UTF-8-sig')
    

    Is calling decode() on a unicode object. What happens in that case is a bit strange: Python 2 will automatically call encode() first to create a str object, assuming ASCII encoding, and only then call decode(). The encode step is where it goes wrong.

    In any case, it's not necessary to decode, because everything is already unicode!

    Cheers!

    Sebastiaan

  • LeoLeo
    edited January 16

    Hi Sebastiaan,


    Thank you, that explains a lot. But it seems var always converts using UTF-8, so I guess it's a little safer to decode manually (since, sadly, some applications automatically use UTF-8-BOM). See, for example, the script below, which uses a UTF-8-BOM encoded file. The second part also doesn't work for ANSI-encoded files, and the error message suggests that the decoding done by var is always performed using UTF-8.

    import string
    # reading to variable
    with open(pool[u'text.txt']) as f:
        myString = f.read()
    var.myString = myString.decode(u'UTF-8')
    print(var.myString[0] == u't')  # false
    var.myString = myString.decode(u'UTF-8-sig')
    print(var.myString[0] == u't')  # true
    # reading to var.variable
    with open(pool[u'text.txt']) as f:
        var.myString = f.read()
    print(var.myString[0] == u't')  # false
    

    Error message using ANSI:

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 4: invalid continuation byte
    

    Best

    Leo

  • Hi Leo,

    Thanks for bringing this to my attention, because this definitely should be clarified in the documentation. But yes, it's as you say: the var object automatically decodes byte strings (str objects in Python 2) to unicode, assuming UTF-8 encoding.

    Cheers!

    Sebastiaan
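
The behavior discussed in this thread can be reproduced outside OpenSesame. The sketch below is not the var implementation; it only mimics the steps described above using plain byte strings. raw_bom and raw_ansi are made-up sample data, and the latter assumes that "ANSI" means Windows-1252.

```python
import codecs

raw_bom = codecs.BOM_UTF8 + b'text'        # bytes of a UTF-8-BOM file
raw_ansi = u'text\xe4!'.encode('cp1252')   # assumed "ANSI" bytes, 0xe4 at index 4

# Plain UTF-8 keeps the BOM as the first character; UTF-8-sig removes it
as_utf8 = raw_bom.decode('utf-8')
as_sig = raw_bom.decode('utf-8-sig')
print(as_utf8[0] == u't')
print(as_sig[0] == u't')

# The hidden step behind the UnicodeEncodeError: Python 2 ASCII-encodes a
# unicode object before decoding it, and the BOM character is not ASCII
try:
    as_utf8.encode('ascii')
    encode_error = None
except UnicodeEncodeError as e:
    encode_error = e
print(encode_error)

# Bytes that are not valid UTF-8 cannot be decoded under a UTF-8 assumption
try:
    raw_ansi.decode('utf-8')
    decode_error = None
except UnicodeDecodeError as e:
    decode_error = e
print(decode_error)
```

The explicit encode('ascii') call stands in for the implicit one, so the snippet also runs on Python 3, where str objects no longer have a decode() method.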
