Saving screen at end of experiment hangs under load

mcuos · January 30

We are running JATOS 3.8.5 with an Apache reverse proxy on a RHEL 9 VM (4 vCores and 4GB RAM), connected to a MariaDB database held on another VM.

One of our academics is using JATOS in their lab sessions, and is experiencing issues saving data at the end of the experiment which seems to be related to the number of active users:

The problem is not running the experiment (at least I can get 25 pc's going without it crashing), it is the saving of the data at the end of the experiment. The saving screen just hangs and the finish screen never seems to appear. However, when I close the PCs, leaving around 5 going, then it will save.

Looking at the monitoring information for the VM, the CPU load is generally nil with some spikes to 20%. There is noticeably more memory load, although that stays below 70%:

It’s probably worth adding that we previously had java.lang.OutOfMemoryError exceptions where Java wasn't all the free memory, so JATOS’s systemd file has been modified to add -J-Xmx3840M to the ExecStart command.

Looking at the Apache error log, there’s a total of 14 errors during the lab session which seem to arrive in 7 pairs of the form:

[09:21:09.498025 2025] [proxy_http:error] [pid
1466961:tid 1466961] (104)Connection reset by peer: [client <ip redacted>.247:52283]
AH01102: error reading status line from remote server localhost:9000, referer:
https://<domain>/publix/<guid redacted>-c7f94b481d42/<guid
redacted>-e6fcd8817d6c/start

[09:21:09.499741 2025] [proxy:error] [pid 1466961:tid
1466961] [client <ip redacted>.247:52283] AH00898: Error reading from
remote server returned by /publix/<guid redacted>-c7f94b481d42/<guid
redacted>-e6fcd8817d6c/jspsych_dist_7.3.1ANTimage/plugin-image-keyboard-response.js,
referer: https://<domain>/publix/<guid redacted>-c7f94b481d42/<guid
redacted>-e6fcd8817d6c/start

Looking at the application log, there’s 104 pairs of entries which take one of the two forms below:

[INFO] - publix_access - POST /publix/<guid redacted>-f45467e24078/<guid redacted>-e6fcd8817d6c/resultData
[INFO] - g.ErrorHandler - PublixException during call /publix/<guid redacted>-f45467e24078/<guid redacted>-e6fcd8817d6c/resultData: Study result <guid redacted>-f45467e24078 doesn't exist.

[INFO] - publix_access - POST /publix/<guid redacted>-f45467e24078/heartbeat
[INFO] - g.ErrorHandler - PublixException during call /publix/<guid redacted>-f45467e24078/heartbeat: Study result <guid redacted>-f45467e24078 doesn't exist.

Looking at the StudyResult GUID in the error, there are other references to it in the log:

[INFO] - publix_access - POST /publix/<guid redacted>-f45467e24078/<guid redacted>-e6fcd8817d6c/resultData
[INFO] - c.p.PublixInterceptor - .submitOrAppendResultData: studyResultId 5526, componentId 180

There aren’t any other lines in the log with “warn” or “error” in them. Any suggestions on where to look or what to do gratefully received!

kri · January 30

Thank you for the nice detailed post with logs and CPU/memory monitoring. It makes answering easier. Let's take this apart.

First, after reading "at least I can get 25 pc's going without it crashing" and "However, when I close the PCs, leaving around 5 going, then it will save.", I have a question: How does the academic start the study runs in JATOS? It sounds like they are testing it by starting all 25 study runs in the same browser? Problem here is that JATOS allows only 10 study runs at the same time in the same browser. If there are more then 10 study runs JATOS closes the oldest one and starts a new one. If you want to know, JATOS tracks the running studies with cookies (named JATOS_IDS + index). This can lead to exactly the described problem: study run starts perfectly but then can't finish (and send result data) because JATOS finished them automatically. A way to bypass this is to use different browsers and/or incognito mode: you can have up to 10 study runs in each browser at the same time.

Then in the Apache error log you have 7x an error reading study assets from JATOS. It seems to be always the same file ".../jspsych_dist_7.3.1ANTimage/plugin-image-keyboard-response.js". Are there instances where the same file gets loaded without error? If not then I would assume there is something wrong with this particular file, e.g. wrong filename.

Finally you have those 104x errors in the JATOS application log all stating that a certain study result doesn't exist although it seems to exist since it is used in other requests without an error. This is indeed strange. Could you please send me the whole application log? If you don't want to send me the whole log you can filter for all lines that contain "/publix/<guid redacted>-f45467e24078/" (with the real UUID).

Best,

Kristian

mcuos · January 31

Hi Kristian,

Thanks for getting back to me. 😀

To answer your questions:

How does the academic start the study runs in JATOS? It sounds like they are testing it by starting all 25 study runs in the same browser?

They were running 4 workshops throughout the day, each with 40 students in it. The setup has been each student gets their own PC, and they are all completing the experiment at the same time (i.e. in parallel).

The experiment is constructed as 3 blocks, with the data being saved on completion of the last one. Apologies if that sounds inexact - I'm being a bit of a relay here, as the experiment was written by one person and the session run by another (I'm just responsible for the server/error reports).

Are there instances where the same file gets loaded without error?

It looks like a transient error, as there are other instances of it loading correctly. A quick search suggests that it's an issue with Apache failing to get a response from the proxied application (possibly due to taking too long to respond)..?

Could you please send me the whole application log?

I've attached the JATOS and Apache logs. I've split the log files up so that they only contain entries recorded whilst the workshops were being run, and split them up into one file for each workshop in order to keep them manageable (the JATOS logs are 20,000 lines each; the Apache ones are under 10 lines each). I've redacted out the IP addresses from the Apache logs and username from the JATOS log, but otherwise left them untouched (they were all the same value, pre-redaction).

mcuos · February 3

I had some extra information from the academic - I don't know if it contains anything helpful:

I can run up to 25 PCs at a time. However, it won't save the data. At least my students can run through the experiment and just close the tab. It looks like JATOS can save data when 4-5 PCs are running at the same time. For this experiment, it is just too chunky for our JATOS server.

The experiment is fast (trials are completed under a second) and has up to 200 trials per participant. I didn't have this problem for an experiment that was slow (trials are completed under 2 seconds) and about half the size. I would love to be able to run some complex experiments in my workshops but I think for that we would need a lot of memory!

elisa · February 3

Sorry to jump in with what is probably a silly question, but your last description made me think of this:

Are the data being sent by any chance after every trial? If it's jsPsych, there's a chance this is the case, and that would put unnecessary load/traffic on the server.

mcuos · February 4

Apparently, the experiment does use jsPsych and the data is sent after each trial (they are looking at changing it to after each block).

Looking at the monitoring data for networking, it seems to be a reasonably high number of connections, but not that much data being sent/received (a peak of 420 connections, and data rate of about 300-400KB but peaking at 1.5MB):

kri · February 7

Hi mcuos!

I had a look at your logs. You seem to have had two issues with your study:

First issue

About the many errors "Study result xxx doesn't exist". Cautiously I'd like to suggest that they all seem to belong to study runs where the study result was deleted during the study run in JATOS GUI. At least this is what the application.log suggests.

An example is the study run with result UUID "837e7b7a-117d-49ef-9d05-f45467e24078". It got started at 09:51:35 and assigned the mentioned UUID:

2025-01-28 09:51:35,117 [INFO] - c.p.w.GeneralMultiplePublix - .startStudy: studyCode 2CWNnUzrsqD, studyResultId5526, studyId 62, batchId 76, workerId 5245
2025-01-28 09:51:35,146 [INFO] - publix_access - GET /publix/837e7b7a-117d-49ef-9d05-f45467e24078/00ea56fd-fd13-443c-aa52-e6fcd8817d6c/start

Then the study runs as expected with many result data being sent, many lines like the following:

2025-01-28 09:52:00,121 [INFO] - publix_access - POST /publix/837e7b7a-117d-49ef-9d05-f45467e24078/00ea56fd-fd13-443c-aa52-e6fcd8817d6c/resultData

Then there is a line in the logs at 09:54:41 that deletes a study result. Unfortunately JATOS doesn't log the Study Result ID here because potentially it can be many results at the same time:

2025-01-28 09:54:41,998 [INFO] - gui_access - PUT /jatos/studyResult/delete (<study owner e-mail address>)

From now on the sending of result data fails:

2025-01-28 09:54:44,999 [INFO] - g.ErrorHandler - PublixException during call /publix/837e7b7a-117d-49ef-9d05-f45467e24078/00ea56fd-fd13-443c-aa52-e6fcd8817d6c/resultData: Study result 837e7b7a-117d-49ef-9d05-f45467e24078 doesn't exist.

I can see this pattern with other study results too, e.g. "a118139d-5484-44b0-b706-cba5567becd3", "6c0cba1f-21ac-4d23-b56c-e537e66acfe2", or "52394a67-7dac-459a-bdac-cc913df9f33c", although there it isn't that obvious in the logs anymore because they get with log messages from the running study including the many errors.

Second issue

The second issue seems to be a choking of JATOS due to more parallel participants it can handle in the current configuration. The study sends quite a lot of result data at the same time, often more then 10 per s. I know your monitoring doesn't show low CPU or memory. It seems to be an issue with JATOS under higher load with many parallel result data requests. I want to investigate this issue further but it needs some time since I have to set up a test environment where I can do some profiling under load. My current hunch is that the internal thread pool isn't large enough. In the mean time you can do things to not choke your JATOS:

Do as you already wrote in your previous post: send the result data not after each trial but change it to after each block.
Turn off your study logs (https://www.jatos.org/JATOS_Configuration.html#enabledisable-study-logging) (the hash calculation takes a lot of CPU)
Do not run a high number of participants at the same time. Split your study runs in batches.

I hope this helps for now.

Best,

Kristian

mcuos · February 10

Hi Kristian,

Thanks - that's very helpful. 😀

I'll pass the information on to the academic in question, and look at the remediation steps you suggested.

Howdy, Stranger!

Categories

Saving screen at end of experiment hangs under load

Comments

First issue

Second issue

Howdy, Stranger!

Quick Links

Categories

Saving screen at end of experiment hangs under load

Comments

First issue

Second issue