Crashed Digital Ocean Server, Lost Data, and Computational Problems
Hello,
I've been running JATOS on a DigitalOcean droplet since around March 2023. I have two issues I would like some advice on.
First, recently (~31/10/2023), I went to my URL and found that it threw a 404 error. I hadn't made any changes to my droplet in some months, so this was strange. When I tried to log into the server via the DigitalOcean console, I got the message {"id": "forbidden", "message": "you are not authorized to perform this operation" }. So I opened the recovery console and still did not have access to some information. I powered down the droplet and containers and restarted, only to find that the original JATOS instance was gone. Unfortunately, I did not have a backup of the server; thankfully, though, I had backups of the data that was stored on it. My question: what might have happened here? I have since started a new JATOS instance and am now doing weekly server backups.
Second, a couple of days after the crash, I tried to collect data from a large number of participants (~100). They did not all complete the experiment at the same time, but data collection (through Prolific) occurred within a couple of hours and had several concurrent users (~40). Traffic was therefore high (CPU usage ~100%), and many people messaged me on Prolific to say that the "submitting data" bar kept spinning after experiment completion but nothing was happening, so I paused the experiment about 3/4 of the way through. Each complete data file from JATOS runs about 866 kB (2100 or so Excel lines). I believe the participants finished the experiment as they said they did, but many data files were incomplete (53/75 participants): they did not log all the information as expected (i.e., < 866 kB). This is definitely the most load-intensive experiment I've run so far. My question: is this lost data due to the previous crash, or, as I think is more likely, to the computational load?
My droplet plan was originally:
- Machine Type: Basic
- CPU Type: Regular Intel
- 1 vCPU
- 2 GB RAM
- 50 GB SSD
- 2 TB Transfer
Stress tests for this one show:
```
Concurrency Level:      100
Time taken for tests:   34.597 seconds
Complete requests:      1000
Failed requests:        0
Non-2xx responses:      1000
Total transferred:      472000 bytes
HTML transferred:       0 bytes
Requests per second:    28.90 [#/sec] (mean)
Time per request:       3459.653 [ms] (mean)
Time per request:       34.597 [ms] (mean, across all concurrent requests)
Transfer rate:          13.32 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       34 1849  649.6   1865    3902
Processing:    91 1475  822.9   1313    5440
Waiting:       81 1473  823.0   1312    5440
Total:        139 3324 1095.6   3037    8118

Percentage of the requests served within a certain time (ms)
  50%   3037
  66%   3609
  75%   3798
  80%   3914
  90%   4695
  95%   5363
  98%   7020
  99%   7239
 100%   8118 (longest request)
```
I've since upgraded it to:
- Machine Type: Basic
- CPU Type: Regular Intel
- 4 vCPUs
- 8 GB RAM
- 50 GB SSD
- 5 TB Transfer
Stress tests show:
```
Concurrency Level:      100
Time taken for tests:   9.886 seconds
Complete requests:      1000
Failed requests:        0
Non-2xx responses:      1000
Total transferred:      472000 bytes
HTML transferred:       0 bytes
Requests per second:    101.15 [#/sec] (mean)
Time per request:       988.593 [ms] (mean)
Time per request:       9.886 [ms] (mean, across all concurrent requests)
Transfer rate:          46.63 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       38  653  242.2    627    1602
Processing:    11  267  154.0    232     895
Waiting:       10  247  147.0    209     891
Total:         59  921  279.7    905    2091

Percentage of the requests served within a certain time (ms)
  50%    905
  66%   1039
  75%   1107
  80%   1135
  90%   1281
  95%   1390
  98%   1567
  99%   1710
 100%   2091 (longest request)
```
Will that be enough to handle ~40 concurrent users, do you think?
I appreciate any help given as I am not a server expert. If I am asking these questions in the wrong place, please let me know where to go with these questions. Thank you again!
Regards,
Riss
Comments
Hi Riss!
Sorry to hear about your crashed server. It's good that you had your data backed up beforehand. But it's difficult to say what happened to your server. Servers crash sometimes; DigitalOcean has an uptime guarantee of 99.99%. The weird thing in your case is that the Docker containers were gone afterwards. I'd expect that after a restart everything would be there as before. Maybe your server got hacked? Again, it's nearly impossible to say in hindsight without more information.
Then about your second question: yes, when your CPU usage maxes out at 100% over some time (minutes to hours, not just a short spike), then you should increase your CPU (like you did). But when you say ~40 concurrent users: are those users really accessing JATOS at the same time? Or is it more that you started a batch on Prolific with ~40 users and the users then started your study within a short time period, but not all at the exact same time (maybe just a couple at the same time)? If the users are spread out over some time period, the CPU load is lower. How much JATOS uses your server's resources also depends on your study and what it does, e.g. how much result data is sent, whether it is sent multiple times, whether you load many large files, etc.
That said, your new host with 4 vCPUs and 8 GB memory should be more than enough for what you want to do, even if you have a challenging study. For comparison, jatos.mindprobe.eu uses the same droplet specs and is usually under 10% CPU; it seldom goes over 50%.
About your stress test, I can't say much because I don't know the details. Stress tests are difficult to do, since the different requests that JATOS allows can put tremendously different loads on the server. E.g. a simple request of the GUI's home page puts nearly no load on the server, while repeatedly sending large result data or result files can put a heavy load on it. In your case it would be ideal to design the stress test to be as close to your study as possible (but this is difficult).
Best,
Kristian
Hi Kristian,
Thanks so much for the swift reply.
It would be wild if my server got hacked. Does that happen often to small research servers, I wonder? I believe I put standard security precautions in place, but again, I'm not a server expert, so it is entirely possible. I don't have much more information about it because basic diagnostics show nothing amiss, and DigitalOcean support staff don't have access to server information to tell me more. Either way, it is running now, I have my previous data, and I'm now making server backups.
For the ~40 concurrent users, no, they probably didn't start the study at the exact same time, but they were running the study at the same time. My experiment code updates the data file essentially after every jsPsych trial is complete, and there are a lot of trials. I don't know for sure whether this would impact ongoing CPU usage, but given what you said about what the study does, I suspect it might. It is also very possible my code is not perfectly optimized and thus contributes to CPU usage. The only time I was running participants was during the window between 14:00 and 16:30 (as seen in the image), which shows constant ~100% CPU usage the whole time. So I don't see any "smoothing out."
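To illustrate the pattern, this is roughly what my setup does (a simplified sketch assuming jsPsych 7 and jatos.js, not my exact code):

```javascript
// Sketch of the per-trial submission pattern (hypothetical, simplified).
const jsPsych = initJsPsych({
  // on_data_update fires after every single trial, so with many trials and
  // ~40 concurrent participants this produces a steady stream of requests
  on_data_update: function () {
    jatos.appendResultData(jsPsych.data.getLastTrialData().json());
  },
  on_finish: function () {
    // the data is already on the server, so just end the study run
    jatos.endStudy();
  }
});
```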
And what you said about stress tests is true: it is very difficult to simulate real life usage. But it sounds like the server specs I have now should work. I will report back after running my study and let you know how it went.
Thank you again for all the help!
Riss
It would be wild if my server got hacked.
I agree. I've never had a JATOS server get hacked, although I do see a lot of attempts in the logs. We try to make JATOS as safe as possible, and luckily JATOS is not a major target for hackers (I guess. Why would it be?).
Either way, it is running now, I have my previous data, and I'm now making server backups.
Excellent.
For the ~40 concurrent users, no, they probably didn't start the study at the exact same time, but they were running the study at the same time.
I see. So they indeed are running concurrently.
My experiment code updates the data file essentially after every jsPsych trial is complete, and there are a lot of trials.
Depending on the number of trials and how long one trial takes, this can add up to a lot of requests for sending result data. E.g. if you have 40 concurrent participants and each participant sends data every 10 seconds, then you have one request every quarter second on average. Since JATOS does a lot with the data (calculating hashes, storing it in the database), it puts a lot of load on your host's resources. There are ways to reduce the load, like using an external database, turning off the hash calculation, or optimizing your study's code - or, as you did, giving it more resources. If it works, it works. :)
Best,
K
Hi Kri,
Good news and bad news!
In my latest run of participants, I still encountered the issue where participants were stuck waiting at the "submitting data" bar, to the point where some of them timed out; however, this occurred at a significantly lower frequency (12/55), probably because of the increased server resources. I did some more digging and found a comment you made to another researcher who was encountering the same problem (https://forum.cogsci.nl/discussion/8518/fail-to-fully-transfer-data):
"That might be the reason if you sent the data often. Sending data often can overload the network and when you participants network connection is weak it can lead to exactly this behavior ("transfering data" for a long time). A little bit of background about how JATOS handles data submission. jatos.js, JATOS' JavaScript library, puts all data submissions in a queue and sends them one after another. This ensures data arriving in the correct order at the server-side. It also guarantees if the data is submitted successful you can rely on it being stored in JATOS' database (in your case this sometimes does not happen and therefore your studies sometimes do not finish). But if your study tries to submit large data in short succession this queue might just fill up and the "transfering data" is shown."
I think my issue might just be that some participants have a weaker network connection and my code definitely has a lot of load-intensive updates, so these participants either time out or exit before the transfer can complete. I will look into my code to reduce the required computational load and see if that makes a difference (by not having my code submit data on_update, just on_finish).
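Concretely, the plan is roughly this (again a sketch assuming jsPsych 7 and jatos.js, not my final code): drop the per-trial handler and send everything once at the end.

```javascript
// Planned change (sketch): no per-trial submissions, one submission at the end.
const jsPsych = initJsPsych({
  // no on_data_update handler any more, so nothing is sent during the study
  on_finish: function () {
    // send the complete data set once, then end the study run in JATOS
    jatos.submitResultData(jsPsych.data.get().json(), jatos.endStudy);
  }
});
```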
Regards,
Riss
Hi Riss!
In my latest run of participants, I still encountered the issue where participants were stuck waiting at the "submitting data" bar, to the point where some of them timed out; however, this occurred at a significantly lower frequency (12/55), probably because of the increased server resources.
Did you check the CPU load? Was it again at 100%? If it was, then I'd strongly suggest looking at your experiment's code and reducing how often you send result data.
I think my issue might just be that some participants have a weaker network connection and my code definitely has a lot of load-intensive updates, so these participants either time out or exit before the transfer can complete.
I agree, that can also be the reason.
I will look into my code to reduce the required computational load and see if that makes a difference (by not having my code submit data on_update, just on_finish).
That is a good strategy for both possible reasons.
I wish you good luck!
Kristian
Hi Kristian,
Yep, smooth sailing with both the decreased number of updates to JATOS in my code and the increased resources. Not a single person (0/32) timed out! CPU load went from about 50% (with constant updates) to 5% (without constant updates).
My reasoning for the frequent updates was that if something were to happen to a participant mid-study, at least I would have partial data. But then I realized I can't use partial data anyway, at least not in my experiment. So I feel a little silly for only figuring it out now, but hey, that's how we learn. The moral of the story is (1) optimize your code and (2) use appropriate server resources for best results; and perhaps there is something to be said for asking ourselves (as experimenters) why we make certain decisions.
Anyways, thanks Kristian for all your help. It is much appreciated!
Regards,
Riss
Hi Riss,
I'm glad you figured it out. In my experience this is a common approach: first store as much as possible because, hey, it's better to have data than not - and then reduce the submissions because the server's resources aren't enough, the participant's browser is overloaded, or the network is just not strong enough. I'm glad that our discussion is here in the forum, so others can learn from it and spend less time figuring out why their experiments sometimes fail.
Best,
Kristian