Accent and mask conversion: ANSI or UTF-8 format?

chris · February 2019

Hello everyone,

I'm editing an OpenSesame script and I have a problem.
In this task we present 4 French words followed by a mask.
The material is in a pool directory containing 4 .txt files.

The mask depends on the size of the words so that there are as many '#' as letters when masking:

mask = ''
mask = len(word1)*'#'+' '+len(word2)*'#'+' '+len(word3)*'#'+' '+len(word4)*'#'
exp.input_canvas.text(mask)
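
For reference, the same mask can be built for any number of words with a small helper. This is only a sketch: `make_mask` is a hypothetical name, the sample words are placeholders, and it simply trusts `len()`, so it gives one '#' per letter only when the words are unicode strings.

```python
def make_mask(words):
    """Return one run of '#' per word, separated by single spaces."""
    # len() counts characters for unicode strings, but bytes for Python 2 byte strings
    return ' '.join('#' * len(word) for word in words)

# Placeholder words, not taken from the actual stimulus files
mask = make_mask([u'alan', u'chat', u'lune', u'pain'])
print(mask)
```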

It works for words without accents.
For words with accents, this adds one character, e.g. 'poussé' -> '#######'

So I made a list:

mask = ''
acc_word = ["poussé","....","...."]

if word1 in acc_word:
    mask = (len(word1)-1)*'#'+' '+len(word2)*'#'+' '+len(word3)*'#'+' '+len(word4)*'#'
    [...]

That works well!

The problem is with the first word of each file.
The word is displayed correctly, but switching to the mask generates 3 extra '#',
e.g. 'alan' -> '#######'

This only occurs for the word at the beginning of each file.
Since there are 4 files, it affects 4 trials in the task.

So I wrote the material myself in a txt file in ANSI format.

That works well!

The first word of the file is now correctly coded
e.g. 'alan' -> '####'

But now this message is displayed for words with accents

exception type: UnicodeDecodeError
exception message: 'utf8' codec can't decode byte 0xea in position 15: invalid continuation byte

Finally, I tried CSV files, but I get the same message.

Could anyone help me?

Thanks,

sebastiaan · February 2019

Salut Chris,

The devil lies in the details when it comes to character encoding. You mention that you read from a text file. To make things easy, I would ensure that the text file is saved in utf-8 encoding. (If possible, also indicate that there should not be a BOM [byte order mark], which often shows up as an extraneous invisible character at the start of the file.)
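
To illustrate, both symptoms from the question can be reproduced in a few lines. This is a sketch under assumptions: "ANSI" is taken to mean the Windows cp1252 code page, and 'bêtes' is just a sample word.

```python
import codecs

# 'alan' as stored at the start of a UTF-8 file that was saved with a BOM
raw_with_bom = codecs.BOM_UTF8 + u'alan'.encode('utf-8')
bom_extra = len(raw_with_bom) - len(u'alan')  # the BOM accounts for the extra bytes

# u'b\xeates' is 'bêtes'; here it is encoded the way an "ANSI" editor
# would save it (cp1252 is an assumption about the editor)
raw_ansi = u'b\xeates'.encode('cp1252')
try:
    raw_ansi.decode('utf-8')
    ansi_error = None
except UnicodeDecodeError as e:
    # In cp1252, 'ê' is the single byte 0xea, which is not valid UTF-8 here
    ansi_error = e.reason

print(bom_extra)
print(ansi_error)
```

So the BOM explains the extra '#' on the first word of each file, and reading an ANSI file as utf-8 explains the UnicodeDecodeError.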

Then, when you read in the text, convert everything to unicode as soon as possible and only then do stuff with it. The following script will do this in Python 2, which is what OpenSesame uses by default. (In Python 3, things are easier.)

with open('my_file.txt') as f:
    for line in f:
        line = line.decode('utf-8')
        # Now do something with the line

So in general, that's the flow you want to use. That's also what OpenSesame will do for you if you use a .csv file as a source for the loop table.
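
As an alternative sketch (not something OpenSesame requires), `io.open` with the 'utf-8-sig' codec decodes while reading and also skips a BOM if there is one. It behaves the same in Python 2 and 3. The helper name `read_words` and the temporary demo file are made up for the example.

```python
import codecs
import io
import os
import tempfile

def read_words(path):
    """Read one word per line as unicode text, skipping a leading BOM if any."""
    # 'utf-8-sig' decodes UTF-8 and drops a BOM at the start of the file
    with io.open(path, encoding='utf-8-sig') as f:
        return [line.strip() for line in f if line.strip()]

# Demo: a temporary file that starts with a BOM, like the files in this thread
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.write(fd, codecs.BOM_UTF8 + u'alan\npouss\xe9\n'.encode('utf-8'))
os.close(fd)

words = read_words(demo_path)
os.remove(demo_path)
print([len(w) for w in words])
```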

Cheers,
Sebastiaan

chris · February 2019

And if you gaze long enough into an abyss, the abyss will gaze back into you.

Thank you Sebastiaan! It was indeed a txt file in UTF-8 with BOM.
I downloaded a good editor that can convert files to UTF-8 only, without the BOM.
Once that was done, the first word of each file no longer had three extra characters, but
accented words were still counted with one extra character.

But things were clearer now, so I made a list of the accented words:

acc_words = ["purée","bêtes", "..."]
lw1, lw2, lw3, lw4 = len(word1), len(word2), len(word3), len(word4)

if word1 in acc_words:
    lw1 = lw1-1

# I did that for each word

mask = lw1*'#'+' '+lw2*'#'+' '+lw3*'#'+' '+lw4*'#'

Now it works!
Thanks again
Christophe

chris · February 2019

Salut Sebastiaan,

First of all, thank you for answering.
They were indeed text files saved in UTF-8 with BOM (which adds 3 bytes to the beginning of the file).
So I re-saved the text files as UTF-8 without BOM using the Sublime Text editor, and the problem with the 3 extra characters at the beginning of the file was gone.
However, the problem with accented words persisted. It was now easier to handle, though: I created a list of accented words and simply adjusted the mask lengths:

awo = ["purée","bêtes","œufs","...."]

lw1, lw2, lw3, lw4 = len(word1), len(word2), len(word3), len(word4)

if word1 in awo:
    lw1 = lw1-1
if word2 in awo:
    lw2 = lw2-1
if word3 in awo:
    lw3 = lw3-1
if word4 in awo:
    lw4 = lw4-1
mask = lw1*'#'+' '+lw2*'#'+' '+lw3*'#'+' '+lw4*'#'
exp.input_canvas.text(mask)

It works now!!
Thank you again, it was a good clue.
Chris

sebastiaan · February 2019

Hi Chris,

When defining literal text in an inline_script, it's best to define unicode strings directly by prefixing them with a u (e.g. u'poussé'). For unicode strings, the length is the number of characters. Otherwise (as in your case) they will be byte strings by default, and the length will be the number of bytes, which doesn't have to match the number of characters!

This, by the way, is only true for Python 2.
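
A minimal sketch of the difference, using 'poussé' from earlier in the thread. The coding line at the top matters only for Python 2 and assumes the script itself is saved as UTF-8.

```python
# -*- coding: utf-8 -*-
word = u'poussé'                  # unicode string: one item per character
as_bytes = word.encode('utf-8')   # byte string, like a line read from a UTF-8 file in Python 2

print(len(word))
print(len(as_bytes))
```

The byte string is one longer because 'é' takes two bytes in UTF-8. That is also why the subtract-one workaround is fragile: a word with two accented letters would come out two bytes longer, not one.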

Cheers,
Sebastiaan

chris · February 2019

Hi Sebastiaan,

The people working on this project are reporting problems to me.
Could this have something to do with your last message?

It's the same message that was displayed when I tried your suggestion with:

path = exp.get_file('sentences1')
with open(path) as f:
    # Decode each of the first 23 lines to unicode right after reading (Python 2)
    stimuli1_ord = [line.decode('utf-8') for line in f.readlines()[0:23]]

The program would stop and display a message about an empty list, as if a file were empty.

When I tried it myself, the error displayed was: "Python seems to have crashed. This should not happen. If Python crashes often, please report it on the OpenSesame forum."

Details : item-stack: ``

That seems to be happening at the end of the task.

Could it be related to the file format?
Or to the way the stimuli are read in?

For information: the file "sentences1" was edited with Sublime Text 3, and I did not want to rename it with a ".txt" extension because it worked well as it was.

Thanks,
Chris

chris · February 2019

Problem solved!

The number of cycle repetitions was set incorrectly in the loop.
There was no problem with the file format.

Have a good day,
Chris
