Our Lemmy instance stopped processing new activity sometime on the morning of Monday 2024-11-18.
The root cause remains unknown. All services were online and the database was responsive.
The Lemmy server logs showed incoming ActivityPub requests and no errors, but no response was being returned to the sender. The system was restarted on 2024-11-19, and processing of requests resumed.
Luckily, the protocol allows for some caching of requests across all servers, so after 30 minutes of heavy load, our server had mostly caught up.
I was away on Monday, and I did notice the issue, but I initially thought it was a problem with my mobile app (recently moved to Boost). I normally view Lemmy sorted by “Top - Last Twelve Hours” and on Tuesday this returned zero results, which prompted a closer look.
I have added monitoring that checks the age of the latest post. I will now receive an alert if no new post has been received for 15 minutes. This may produce some false positives when Lemmy is quiet, so I may adjust the threshold in the future.
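For anyone wanting to set up something similar, here is a minimal sketch of the staleness check in Python. It is not the actual monitoring configuration used here. The function names are made up for illustration, and it assumes the timestamp of the newest post is already available as an ISO 8601 string (from the API or the database, whichever suits) and that a timestamp without a timezone is in UTC.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the post: alert when no new post has arrived for 15 minutes.
STALE_AFTER = timedelta(minutes=15)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a naive value is treated as UTC (assumption)."""
    # Older Python versions do not accept a trailing "Z", so normalise it first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def post_age(latest_published: str, now: datetime) -> timedelta:
    """Return how long ago the newest post was published."""
    return now - parse_timestamp(latest_published)


def should_alert(
    latest_published: str,
    now: datetime,
    stale_after: timedelta = STALE_AFTER,
) -> bool:
    """Return True when the newest post is older than the threshold."""
    return post_age(latest_published, now) > stale_after
```

A scheduled job could call `should_alert(timestamp, datetime.now(timezone.utc))` every few minutes. Passing a larger `stale_after` is one way to cut down on false positives during quiet periods.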
A status page is available at https://status.lazysoci.al/