Patch Notes: The Case of the Mysterious Crashing Server

I’ve been kicking around a few ideas about what topics or experiences to write about as I chronicle my journey into systems administration/architecture. This week I had the fortunate misfortune to come across the perfect situation for the first entry in this blog series when I received an email from one of the math professors at UP about our WebWork server.

A bit of background on WebWork: it’s an open-source program for delivering math homework via the web. We’ve run it at UP for a number of years, and recently migrated to AWS, using a single instance running Ubuntu.

Back to our emailing professor. She had been using the system over the holiday break and noticed intermittent lock-ups. She could use the site normally for about half an hour before things went sideways and it became unresponsive; if she waited a few hours and came back it would be working again, but only for about 30 minutes at a time.

After getting the email I checked the server, and yeah, it was totally unresponsive via both the web and SSH. Time for a hard reboot via the AWS console. After rebooting everything seemed fine, but before long the prof let me know she was still experiencing the same issues. This time I was able to SSH in before things went totally haywire – but trying to execute any commands returned an error:

-bash: fork: Cannot allocate memory

Aha! A clue. I rebooted again and installed htop:

sudo apt-get install htop

htop is a CLI program for monitoring system usage, including memory. I hadn’t used it before, so I took advantage of the Lynda.com account I have through work and watched a good htop tutorial video to get a feel for it. When WebWork was in use, I could see in htop that processes owned by the Apache server (www-data) were eating up all of the available memory, of which we had a paltry 4GB. (I suspect the original sysadmin who built the server intended to use an auto-scaling group that would spin up additional resources on demand, but never got it working correctly.)
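
If you want a quick look without htop’s interactive UI, a few stock commands give a similar picture. This is just a sketch; it assumes Ubuntu’s apache2 process name:

free -h                                                  # overall memory and swap usage
ps -C apache2 -o pid,rss,etime,cmd --sort=-rss | head    # largest Apache workers by resident memory
sudo dmesg | grep -iE "out of memory|oom"                # has the kernel's OOM killer been firing?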

A bit of research turned up lots of forum posts discussing WebWork’s nasty habit of eating up system memory and failing to give it back, which over time can exhaust the available RAM and result in crashes.

Through these posts, I learned about the Apache MaxRequestWorkers and MaxConnectionsPerChild parameters. MaxRequestWorkers caps how many worker processes Apache will run at once, and MaxConnectionsPerChild determines how many requests each worker serves before it is shut down and replaced; together they can be tweaked to keep WebWork from running too many memory-obliterating tasks at once and to reclaim memory from bloated workers. It’s a balancing act, though: allow too few simultaneous requests and recycle workers too aggressively, and unnecessary lag is introduced as requests queue up and Apache constantly spawns new processes while memory sits unused. A quick visit to the appropriate Apache config file at /etc/apache2/mods-available/mpm_prefork.conf confirmed that the server was still at the defaults: 150 MaxRequestWorkers and a MaxConnectionsPerChild of 0, which means unlimited.

StartServers             5 
MinSpareServers          5 
MaxSpareServers          10 
MaxRequestWorkers        150 
MaxConnectionsPerChild   0
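
As a sanity check before editing that file, it’s worth confirming Apache is actually running the prefork MPM so you’re tuning the config that matters. Something like this should work on Ubuntu (a sketch, not from the WeBWorK docs):

sudo a2query -M                 # on Debian/Ubuntu, prints the enabled MPM, e.g. "prefork"
apachectl -V | grep -i "mpm"    # shows "Server MPM: prefork" for a prefork build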

A bit more research turned up the install guide for WebWork on Ubuntu and “a rough rule of thumb” of 5 MaxRequestWorkers per 1 GB of memory and a MaxConnectionsPerChild value of 50.
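
That rule of thumb turns into a quick back-of-the-envelope calculation. Here’s a small sketch (my own, not from the install guide) that derives a suggested MaxRequestWorkers from the machine’s RAM:

# ~5 MaxRequestWorkers per GB of RAM, MaxConnectionsPerChild of 50.
# Note: `free` reports usable memory, slightly under the nominal instance size, so round to the nearest GB.
mem_gb=$(free -m | awk '/^Mem:/ {printf "%d", ($2 + 512) / 1024}')
echo "MaxRequestWorkers        $((mem_gb * 5))"
echo "MaxConnectionsPerChild   50"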

This gave me a formula for determining the optimal Apache settings, but I also wanted to increase the system RAM, since 4GB still seemed likely to be insufficient for any heavy use. This was easy to do in AWS, as the WebWork server was a single Elastic Block Store (EBS)-backed EC2 instance built from an Amazon Machine Image (AMI).

In the EC2 AWS console (the same steps, scripted with the AWS CLI, are sketched after this list):

  • Take a snapshot of the root volume attached to the instance (just in case)
  • Instance state -> Stop
  • Actions -> Change instance type (in my case I changed from t2.medium with 4GB RAM to a t2.large with 8GB RAM)
  • Instance state -> Start
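
Those console steps can also be scripted with the AWS CLI. A rough sketch, with placeholder instance and volume IDs standing in for the real ones:

# Placeholder IDs below; substitute the real instance and root volume IDs.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "WebWork root volume before resize"
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"t2.large\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0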

I logged back in and tuned the Apache server for 8GB of RAM with the following settings in the mpm_prefork.conf file (8GB × 5 = 40):

StartServers             5 
MinSpareServers          3 
MaxSpareServers          5 
MaxRequestWorkers        40 
MaxConnectionsPerChild   50

This was followed with a quick restart of Apache to apply the changes:

sudo apachectl restart
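
Two cheap sanity checks around the restart (my own suggestion, nothing WeBWork-specific): verify the edited config parses before restarting, then keep an eye on memory and the process count afterward.

sudo apachectl configtest   # should print "Syntax OK" before you restart
watch -n 5 free -m          # after the restart, watch memory get reclaimed over time
pgrep -c apache2            # rough count of Apache processes (parent plus workers)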

I headed to the site, logged into a test course, and tried out some searches in the Library Browser. Things were improved: I could still see in htop that processes were eating up memory, but they were quickly killed off and the memory returned to the system. I may have to tweak things again once students start logging in and hitting the server with lots of simultaneous small requests, but for now, so far so good.

Photo by Paxson Woelber on Unsplash