Saving screen at end of experiment hangs under load
We are running JATOS 3.8.5 with an Apache reverse proxy on a RHEL 9 VM (4 vCores and 4GB RAM), connected to a MariaDB database held on another VM.
One of our academics is using JATOS in their lab sessions, and is experiencing issues saving data at the end of the experiment which seems to be related to the number of active users:
The problem is not running the experiment (at least I can get 25 pc's going without it crashing), it is the saving of the data at the end of the experiment. The saving screen just hangs and the finish screen never seems to appear. However, when I close the PCs, leaving around 5 going, then it will save.
Looking at the monitoring information for the VM, the CPU load is generally nil with some spikes to 20%. There is noticeably more memory load, although that stays below 70%:
It’s probably worth adding that we previously had java.lang.OutOfMemoryError
exceptions where Java wasn't all the free memory, so JATOS’s systemd file has been modified to add -J-Xmx3840M
to the ExecStart
command.
Looking at the Apache error log, there’s a total of 14 errors during the lab session which seem to arrive in 7 pairs of the form:
[09:21:09.498025 2025] [proxy_http:error] [pid 1466961:tid 1466961] (104)Connection reset by peer: [client <ip redacted>.247:52283] AH01102: error reading status line from remote server localhost:9000, referer: https://<domain>/publix/<guid redacted>-c7f94b481d42/<guid redacted>-e6fcd8817d6c/start [09:21:09.499741 2025] [proxy:error] [pid 1466961:tid 1466961] [client <ip redacted>.247:52283] AH00898: Error reading from remote server returned by /publix/<guid redacted>-c7f94b481d42/<guid redacted>-e6fcd8817d6c/jspsych_dist_7.3.1ANTimage/plugin-image-keyboard-response.js, referer: https://<domain>/publix/<guid redacted>-c7f94b481d42/<guid redacted>-e6fcd8817d6c/start
Looking at the application log, there’s 104 pairs of entries which take one of the two forms below:
[INFO] - publix_access - POST /publix/<guid redacted>-f45467e24078/<guid redacted>-e6fcd8817d6c/resultData [INFO] - g.ErrorHandler - PublixException during call /publix/<guid redacted>-f45467e24078/<guid redacted>-e6fcd8817d6c/resultData: Study result <guid redacted>-f45467e24078 doesn't exist.
[INFO] - publix_access - POST /publix/<guid redacted>-f45467e24078/heartbeat [INFO] - g.ErrorHandler - PublixException during call /publix/<guid redacted>-f45467e24078/heartbeat: Study result <guid redacted>-f45467e24078 doesn't exist.
Looking at the StudyResult GUID in the error, there are other references to it in the log:
[INFO] - publix_access - POST /publix/<guid redacted>-f45467e24078/<guid redacted>-e6fcd8817d6c/resultData [INFO] - c.p.PublixInterceptor - .submitOrAppendResultData: studyResultId 5526, componentId 180
There aren’t any other lines in the log with “warn” or “error” in them. Any suggestions on where to look or what to do gratefully received!
Comments
Thank you for the nice detailed post with logs and CPU/memory monitoring. It makes answering easier. Let's take this apart.
First, after reading "at least I can get 25 pc's going without it crashing" and "However, when I close the PCs, leaving around 5 going, then it will save.", I have a question: How does the academic start the study runs in JATOS? It sounds like they are testing it by starting all 25 study runs in the same browser? Problem here is that JATOS allows only 10 study runs at the same time in the same browser. If there are more then 10 study runs JATOS closes the oldest one and starts a new one. If you want to know, JATOS tracks the running studies with cookies (named JATOS_IDS + index). This can lead to exactly the described problem: study run starts perfectly but then can't finish (and send result data) because JATOS finished them automatically. A way to bypass this is to use different browsers and/or incognito mode: you can have up to 10 study runs in each browser at the same time.
Then in the Apache error log you have 7x an error reading study assets from JATOS. It seems to be always the same file ".../jspsych_dist_7.3.1ANTimage/plugin-image-keyboard-response.js". Are there instances where the same file gets loaded without error? If not then I would assume there is something wrong with this particular file, e.g. wrong filename.
Finally you have those 104x errors in the JATOS application log all stating that a certain study result doesn't exist although it seems to exist since it is used in other requests without an error. This is indeed strange. Could you please send me the whole application log? If you don't want to send me the whole log you can filter for all lines that contain "/publix/<guid redacted>-f45467e24078/" (with the real UUID).
Best,
Kristian
Hi Kristian,
Thanks for getting back to me. 😀
To answer your questions:
How does the academic start the study runs in JATOS? It sounds like they are testing it by starting all 25 study runs in the same browser?
They were running 4 workshops throughout the day, each with 40 students in it. The setup has been each student gets their own PC, and they are all completing the experiment at the same time (i.e. in parallel).
The experiment is constructed as 3 blocks, with the data being saved on completion of the last one. Apologies if that sounds inexact - I'm being a bit of a relay here, as the experiment was written by one person and the session run by another (I'm just responsible for the server/error reports).
Are there instances where the same file gets loaded without error?
It looks like a transient error, as there are other instances of it loading correctly. A quick search suggests that it's an issue with Apache failing to get a response from the proxied application (possibly due to taking too long to respond)..?
Could you please send me the whole application log?
I've attached the JATOS and Apache logs. I've split the log files up so that they only contain entries recorded whilst the workshops were being run, and split them up into one file for each workshop in order to keep them manageable (the JATOS logs are 20,000 lines each; the Apache ones are under 10 lines each). I've redacted out the IP addresses from the Apache logs and username from the JATOS log, but otherwise left them untouched (they were all the same value, pre-redaction).