Data loss ("bad request") in JATOS

yvonne · September 2020

Hi all,

I am currently running a lab.js study on JATOS and 25% of my participants are reporting that they clicked on submit at the end of the study, but got the error message: "bad request". Data is also missing for these participants. This has happenend to maybe 1-3 participants in earlier studies, but never to this extent.

I have tested the study many times on different browsers (firefox, chrome, edge), but could not replicate the error. In additoin, I have run other studies on the same server without problems before. I would be thankful for any pointers on what is going on and how I could reduce data loss for my study.

More info: JATOS version 3.4.1 on DigitalOcean with Traefik, individual data size 5.0MB

Thank you,

yvonne

kri · September 2020

Hi Yvonne,

Someone asked a similar question on Github (https://github.com/JATOS/JATOS/issues/204). I think there we couldn't find the reason for the error due to time constrains.

But one possible cause was that somehow the 'srid' parameter in the URL is overwritten. This experiment had a feedback form in the end and the text that was entered in there was then put into the URL query string to be passed on. Problem here is that there is already the 'srid' parameter and if JATOS gets an request without 'srid' it returns a 'bad request' ('srid' means study result ID and is the identifier of a study run).

So, is it possible that you overwrite this 'srid' parameter?

Best,

Kristian

yvonne · September 2020

Hi Kristian,

thank you for your reply! I checked the GitHub exchange and I can at least say that I was hosting very few participants in parallel (max 20) and I think I have large enough server size (2vCPUs, 2GB memory, 3TB transfer, 25GB SSD). The content is also less than in the issue on GitHub (5.0MB).

I do not access or write the URL parameters at all and there is just plain vanilla keyboard input, text input and buttons. So I don't think it is likely that the srid is overwritten in any way. I did use mousetracking though (with the mousetrap plugin).

I got some feedback from my participants too. Apparently, the error occurs on all common browsers (firefox, chrome, safari, edge), which are also up to date, and on all operating systems (mac, windows). Almost all participants took exactly the estimated time of 45 minutes too (so no super long outliers).

I'll try to check the logs as soon as possible and will comment if I find any errors.

Best,

Yvonne

Edit: I attached what seemed to be the relevant part from the logs.

JATOS_bad_request_log.txt

kri · October 2020

Hi Yvonne,

Sorry for taking so long to answer. Apart from the 'bad request' error you also have an 'EntityStreamException' which can or can not have something to do with the first error. This 'EntityStreamException' just states that the request wasn't completely received: parts are missing. To say what happened here is difficult: often it means that some network problems occurred. This might just be temporary and can happen in the Internet. Another reason I can think of is the request size: some of your requests might be too large to handle for JATOS or other Internet-handling-thingies between your participant's browser and your JATOS server. You said already that 'individual data size 5.0MB' - can you elaborate on this? Is it always the same? Are there some bigger requests (and those get the 'bad request' error)?

Now that I'm writing about this: I remember that proxies usually have a config field to set up the max allowed request size. I know Apache and Nginx have. But you use Traefik: I found https://doc.traefik.io/traefik/v2.3/middlewares/buffering/. You can try to increase the field maxRequestBodyBytes to e.g. 100000000 (100 MB) in your traefik.toml.

Best,

Kristian

yvonne · October 2020

Dear Kristian,

thank you for your reply! The individual data size can vary, depending on the mouse-clicking behavior of the participants. It usually is around 3-5MB. However, there are also participants with only 2.6MB or as much as 9.0MB. Unfortunately, I don't know which participants did get the "bad request" error, as I only have the "survivors" on my server. The only thing I know is that the participants with the error message did not take longer than the other participants (all around 45min).

Thank you for the hint with the request size. Do you know if I need to adjust the threshold from which the request will be buffered on disk instead of in memory (memRequestBodyBytes) too?

Best,

Yvonne

kri · October 2020

I'm not an expert in Traefik, just using it every now and then. But from there documentation (https://doc.traefik.io/traefik/middlewares/buffering/) I'd say it's enough to set maxRequestBodyBytes.

yvonne · October 2020

Thank you Kristian! I adjusted the maximum request size. I will let you know if the problem persists or not.

Best,

Yvonne

yvonne · November 2020

Dear Kristian,

here is the final update after data collection has ended:

the data loss did not decrease after adjusting maximum request size
213 participants indicated that they completed the study, but we only have 140 data sets on the server, indicating a data loss of 34%

Personally, I would say this is pretty bad. If you have any other clues on what have might been the problem, I would be grateful.

In the meantime, I will probably stick to studies with smaller data output and shorter duration.

Thank you and best regards,

Yvonne

kri · November 2020

Hi Yvonne!

I'm sorry to hear your study continues to have data loss.

It's difficult to say what went wrong if there is no reliable way to replicate this problem.

Can you check the JATOS log for errors (or send them to me and I check them)? And I mean all logs from JATOS: all logs from the last 30 days are in your JATOS installation folder on the server in logs/ starting with 'application-log'.

And can you send me your study? I'd like to have a look at it and check how the network traffic actually looks like.

Best,

Kristian

yvonne · November 2020

Dear Kristian,

yes, unfortunately, the study is relatively long (45min - 1h) and attempts to replicate the error are therefore quite time consuming.

Concerning the logs: I can neither find the "application-log" nor JATOS itself on my server, since I am running it with Docker and Traefik. Would you happen to know in which directory I can find the JATOS installation in such a case? I tried to use "find", but nothing ever happens when I type it in the console.

I also cannot send you the study because it has a .jzip ending and the forum does not allow me to upload such a file. Could I send it to you in another way?

Anyways, thank you a lot for taking the time to take a look at this. Since I am not running another study in the immediate future, there is no rush either.

Best,

Yvonne

kri · December 2020

Hi Yvonne,

Let me answer one by one:

yes, unfortunately, the study is relatively long (45min - 1h) and attempts to replicate the error are therefore quite time consuming.

If the study takes that long it's difficult to figure out what went wrong. I simple won't have the time to do the study.

Concerning the logs: I can neither find the "application-log" nor JATOS itself on my server, since I am running it with Docker and Traefik. Would you happen to know in which directory I can find the JATOS installation in such a case? I tried to use "find", but nothing ever happens when I type it in the console.

If you are using Docker than it is more difficult. You first would have to go into your JATOS Docker container with `docker exec -it running-jatos-container-id /bin/bash`. In there you have a different system separate from you server's system. In there JATOS is in /opt/docker.

I also cannot send you the study because it has a .jzip ending and the forum does not allow me to upload such a file. Could I send it to you in another way?

The jzip is just a zip and you can share them with Google Drives or similar tools. Email or this forum won't work because they assume the content, the JavaScript files, are a potential security threat.

Anyways, thank you a lot for taking the time to take a look at this. Since I am not running another study in the immediate future, there is no rush either.

I would only have a look at your logs then since replicating your error is difficult.

Best,

Kristian

yvonne · December 2020

Dear Kristian,

I am not sure if you still need it, but below is the study I exported from JATOS.

Concerning the logs: I finally managed to get to the application-logs, but I am lost on how I should download them or send them to you (I have very limited knowledge of ubuntu servers and docker elements). I googled a lot, but I don't even know if I am on the right track (e.g. curl, wget, mail, mutt?). Could you give me a pointer on how I could do this?

In the case that anyone struggles with the same. Here is what I did to get to the logs:

Use 'docker container ls' to show the containers and their content.
Run the command 'docker exec -it running-jatos-container-id /bin/bash' and replace "running-jatos-container-id" with the ID found in the first column.

Now, I am stuck in the container it seems and cannot exit it anymore (without remvoing it?). So any pointers on how to get out of the container would be greatly appreciated too!

Best,

Yvonne

https://www.dropbox.com/s/9aufkpzkakri13c/mousetrack_final_long_2.jzip?dl=0

kri · December 2020

Hi Yvonne,

You got already somewhere :). Docker can be confusing.

Docker has a cp command to copy files between container and host. Have a look at this SO question: https://stackoverflow.com/questions/22049212/copying-files-from-docker-container-to-host

Now, I am stuck in the container it seems and cannot exit it anymore (without remvoing it?). So any pointers on how to get out of the container would be greatly appreciated too!

You can leave the Shell that you opened in the container with docker exec -it running-jatos-container-id /bin/bash with the just exit. And since the Shell is not the actual running program in the container (this is JATOS) it should not stop the running of the container.

Best,

Kristian

yvonne · December 2020

Dear Kristian,

I apologize for the delayed answer, pre-christmas stress caught up with me. Thank you very much for pointing me in the right direction! :) I was finally able to access and download the logs (after some additional failed attempts). I hope there is still enough data relevant to the submission problems in the past 30 days.

Thanks also for helping me exit the container. In the case that anyone with my level of knowledge ever wants to download the logs on a docker server themselves, here is how I did it:

Copy the logs from the docker to the host (outer thing) like this (because you cannot easily access the logs otherwise). I still did this in the web console provided by DigitalOcean. Command: 'docker cp running-jatos-container-id:/opt/docker/logs .' Again, replace "running-jatos-container-id" with the ID from your jatos container. You can find this by running 'docker container ls' and check the first column. The stuff after the ID indicates the path were the logs are located.
I changed from the web console to PuTTY/Windows console here. Here is a short info on how to use PuTTY to access a droplet. PuTTY pscp helps with saving files and folders on your local computer. It can be downloaded here. After saving the pscp.exe somewhere on the computer, I started the Windows console (type 'cmd' in your search bar). From there on, I followed this tutorial (How to Get Files Using PuTTY). However, this did not quite work because the port was not correctly defined, but adding '-P 22' did the trick. Also, to download the whole directory, I had to add '-r' too. So the final command run (in the folder where pscp.exe is) was: 'pscp.exe -r -P 22 username@x.x.x.x:/logs C:\Users\Yvonne\Downloads'. Again, replace "username@x.x.x.x" with your username and IP (e.g. root@124.192.143) and the "C:\Users\Yvonne\Downloads" with the location you want to save the logs on your local computer.

Coming back to the problem: please let me know if there is any addtional information that could be useful to figure out what is causing the issue. I plan to run a similar (but reduced) experiment in the future sometime, and it would be great if that was possible.

Either way, thanks a lot for guiding me through this!

Best,

Yvonne

https://www.dropbox.com/sh/hav7irv7l1ymnyh/AAA_8_erEd0wT1rE3UtR_JJma?dl=0

kri · December 2020

Hi Yvonne!

Great you managed to get the logs. And great you put the way you did it here so others can profit from it.

I had a look at your logs. Lets start with the easy things:

There is no indication that your server had problems. E.g. there are no errors about out-of-memory or disk full.
In the logs are 7 times: Internal JATOS error: java.util.concurrent.CompletionException: akka.http.scaladsl.model.EntityStreamException: Entity stream truncation. Those are usual from network problems and often there is not much that you can do.
Then you had on 4 different days and altogether 8 times: g.ErrorHandler - PublixException during call /publix/97/start?batchId=130&tfafebyZhygvcyf: Unknown worker type. This is a weird one and I can't really say what happened here. Someone tried to start your study with a broken link. Usually the link is /publix/97/start?batchId=130&generalMultiple but someone changed the suffix to tfafebyZhygvcyf. Can't say why. It's weird that this was tried on 4 different days.

And then there are 15 of these JATOS isn't allowed to run a study with the study result ID xxxx. Some of those are just because someone tried to start your study a second time although it was finished shortly before. E.g. have a lock at this one: first finishStudy and then 2 min later start again:

2020-11-19 12:28:23,556 [INFO] - publix_access - GET /publix/97/end?srid=2362&successful=true
2020-11-19 12:28:23,556 [INFO] - c.p.Publix - .finishStudy: studyId 97, studyResultId 2362, successful true, message 'null'
2020-11-19 12:30:24,337 [INFO] - publix_access - GET /publix/97/97/start?srid=2362
2020-11-19 12:30:24,337 [INFO] - g.ErrorHandler - PublixException during call /publix/97/97/start?srid=2362: JATOS isn't allowed to run a study with the study result ID 2362.

But there are also these ones that I don't really understand. First it looks like a normal study start with 'startStudy' and it redirects to start the first component. But something breaks and we get an error: 'JATOS isn't allowed ...'.

application-log-2020-11-22-2020-11-22 11:43:44,830 [INFO] - publix_access - GET /publix/97/start?batchId=130&generalMultiple
application-log-2020-11-22-2020-11-22 11:43:44,830 [INFO] - c.p.w.GeneralMultiplePublix - .startStudy: studyId 97, batchId 130
application-log-2020-11-22-2020-11-22 11:43:44,834 [INFO] - c.p.w.GeneralMultiplePublix - .startStudy: study (study ID 97, batch ID 130) assigned to worker with ID 2385
application-log-2020-11-22:2020-11-22 11:43:44,918 [INFO] - publix_access - GET /publix/97/97/start?srid=2387
application-log-2020-11-22:2020-11-22 11:43:44,919 [INFO] - g.ErrorHandler - PublixException during call /publix/97/97/start?srid=2387: JATOS isn't allowed to run a study with the study result ID 2387.

It somehow looks as if the 'startStudy' wasn't successfully written to the database. It's just a guess - but it's my only guess right now. What database runs with your JATOS? Do you use an external MySQL or the embedded H2? Is the database on the same host as JATOS?

Best,

Kristian

yvonne · December 2020

Dear Kristian,

thank you for taking a look at the logs! Great to hear that it was probably not a server issue. The wrong link seems a little weird, but since the study was conducted with students, maybe someone forwarded a wrong link to their peers (for whatever reason). I also don't have an explanation for people restarting the study after finishing. It's not possible for students to get course credit twice, so that seems a little weird.

Concerning the problems with 'startStudy': Since I do not know what an external MySQL database is, I assume I am running it with the embedded H2. I only have one host, so I assume that they run on the same one (I followed this tutorial here to create the server).

I am not sure if it is relevant, but I noticed that the errors above seem to happen in a relatively short timeframe. From the students' report, they could start the study and complete all the elements and the error appeared at the end only (bad request), after they tried to submit their data (after around 45 min). However, maybe the problem was already there at the start and only resurfaced after they reached submission.

Thanks a lot and best regards,

Yvonne

kri · December 2020

Hm. Strange, from what you describe, ' they could start the study and complete all the elements and the error appeared at the end only (bad request), after they tried to submit their data (after around 45 min)', the only error matching this behavior is the EntityStreamException. All others happened right after starting the study and the participant would not have seen anything of the actual study, but just an error message. Then the database is also not a likely culprit. Then you say your server is on DigitalOcean. They have reliable machines (experiments.jatos.org also runs there). So I don't assume the server machine has a problem with the network. Now I'm out of ideas. Is it possible your students have unstable internet?

Kristian

yvonne · December 2020

Dear Kristian,

while it might be possible, I think the probability is relatively low. The internet connection in Switzerland should be ok for the most part (except for remote mountain villages, but almost no one lives there).

Could the EntityStreamException be the culprit then?

I also tried to set up a little error factory by making the study advance by itself and running it on my tablet and phone in many simulatenous windows, but none of the 25 instances lead to an error (even though they all took around 40 minutes). Looks like the problem is not the duration, but something that happens during the study.

I will try to force an error when I am back in office in January. I hope the logs of such an instance will be more informative.

Best,

Yvonne

kri · December 2020

It's mysterious :). I'm looking forward to any news.

yvonne · February 2021

Hi Kristian,

thanks to the help of someone in the lab who helped me click through the study repeatedly, we could reproduce the error (see below). Attached is the log file of the day. The error happenend at the end of a study that was started at around 07:10 and ended at 08:06. Based on the log file, the error seems to be the EntityStreamException. I hope this helps to pinpoint the problem.

Best,

Yvonne

application-log-2021-02-06.gz

kri · March 2021

Hi Yvonne!

Sorry for the late reply. Things are chaotic with a newborn at home right now.

Let me try to summarize the issue (mostly to remind myself since this thread is already a little older). Please correct me if I'm wrong:

You have your own JATOS server (v3.4.1) with H2 database on DO (2vCPUs, 2GB memory, 3TB transfer, 25GB SSD)
You run a lab.js study that takes 45min - 1h and produces result data of usually 3-5MB but sometimes up to 9MB. The result data are sent once in the end of the study with jatos.submitResultData.
Around 34% of your main batch of 213 participants had some kind of issues
Most worrying are the `EntityStreamException` often accompanied with a `bad request: Missing parameter: srid`
Now you were able to reproduce the error in the lab.

I hope this helps to pinpoint the problem.

To be honest: no, not really. I try to wrap my head around it and try to figure out the cause. But there are so many potential reasons. E.g. I searched again and found this StackOverflow question: https://stackoverflow.com/questions/63120901/alpakka-s3-file-download-problems-with-play-framework-2-8. One of the answers has a list of potential reasons:

The server is too slow (improve server performance)
The network is too slow (improve network performance)
The client issues requests too fast (slow down creation of requests if possible)
There’s high latency between client and server (use more concurrent connections to side latency with parallelism)
There are peaks in the request rate (prevent peaks by tuning the client application or increase max-open-requests to buffer short-term peaks)
Response entities were not read or discarded (see Implications of the streaming nature of Http entities)
Some requests are slower than others blocking the connections of a pool for other requests (see below)

All seem to hint at network / server performance issues.

But I have an idea. For me the problem is still that we can't reliably reproduce the problem. If I can reproduce the problem I can try to find the cause. If you could give me an user on your JATOS then I would do some load/stress tests and see if I can reproduce the error (without having to run your study several times for 1h each). I'd prefere an admin user so I can see the logs right away - but if this isn't possible a normal user would do and you can post me the error logs somehow.

Best,

Kristian

yvonne · April 2021

Hi Kristian,

sorry for the late reply too, I got absorbed in other projects.

Your summary is correct except for

34% of participants having a complete loss of data (the number of people with any type of issue is higher)
The EntityStreamException not necessarily being the most worrying but the most likely culprit of the problem (as it occured in the data set from the study and in the data set in which we could successfully reproduce the error).

Concerning the potential reasons, I don't see one sticking out that could be the problem. I have worked with this server and JATOS before, testing many more participants in parallel and never had this issue (I also increased the maximum request size). Based on this, I think that performance issues are unlikely (I don't understand the response entity issue though).

I have some more info though. A collaborator of mine did a new study (lab.js + JATOS) in which they used eye-tracking (webcam based). Same as mouse-tracking, the coordinates (and timestamps) are saved in long brackets: [1244,1242,1241,1242,1241,1239,1239,1235,1239,1260,1290,1294] and same as in our study, they experienced unexplicable data loss with participants reaching the end of the study but the data not being saved on the server. So my guess would be that the problem lies somewhere with these brackets.

Concerning the login details: I cannot share them in this forum, would you be willing to provide your email so I can contact you directly?

Best,

Yvonne

yvonne · June 2022

In case anyone else stumbled across this thread, someone helped me find a solution in the lab.js Slack channel. Basically, the idea is to add more time for the data to submit. Here is the advice:

Add a sequence with a “data transmission” template on the last page, followed by a page (which should be the last one in the experiment) where you put a button and the script

“this.options.events[‘click button’] = () => this.end()”

in “before:prepare” mode. Add the information that participants have to wait about 5 seconds after pressing the button for the data to be sent (add a countdown).

I implemented this, and it works much better now. I only had 2 missing data files for 70 participants, which could be due to participants aborting the study too.

kri · June 2022

Thanks for posting this! Hopefully it helps others with the same problem.

logan · August 2024

Hello Yvonne!

I believe I've been running into the exact same problem. The lab.js slack group has been unavailable for several months, so this thread seems to be my only hope.

If you remember, could you please help me understand the solution you used?

On the second to last page, I should have a sequence with a "data transmission" template. I don't entirely understand what this means - would you mind explaining what this looks like or what the script you used is?

On the final page, I understand that I should include a button with that line of script in "before:prepare", and instructions on the screen to wait a few seconds after pressing the button. Does this mean that participants can click the button, wait a few seconds, and then just close out of the experiment? I've been using a "complete study" button to end the experiment, with JATOS redirecting participants to a completion url. I'm wondering if the data will still save correctly if my "complete study" button is used after the 5 second wait, as the "complete study" button redirects to the completion url immediately.

Thank you so much - if it's easier for you I would also be infinitely grateful if you could share the last couple of pages of your experiment with me. I'd be happy to share my email with you if this is an option.

Kind regards,

Logan

Howdy, Stranger!

Categories

Data loss ("bad request") in JATOS

Comments

Howdy, Stranger!

Quick Links

Categories

Data loss ("bad request") in JATOS

Comments