Overview

lazysoci.al was offline for 3h 15m today following apparent database corruption. The server is back online and federated data is flowing again.

Details

I moved the server to its own dedicated host this morning to improve both performance and security (it now sits on a dedicated VLAN). It should have been a simple case of moving the virtual disk holding the Lemmy data to the new VM and spinning up the new Docker image.

The Docker logs didn’t show any issues at first, but every UPDATE query against the database failed with ERROR: relation "approvals" does not exist.

After some troubleshooting, I concluded the database was corrupted and started a restore from last night’s backup. The restore took approx. 2h 30m.

The same issue persisted after the restore. I then updated to the latest beta, which resolved it.

This has highlighted one problem. I use proxmox-backup-server and proxmox-virtual-environment, and you can’t easily restore a single disk from a VM backup into a ZFS volume. If you use the web interface, you have to restore the entire VM. The restore therefore took much longer than it needed to, as several other disks had to be restored first.

Improvements

  • The new setup will restore more quickly, as the VM now holds only the OS and the Lemmy data.
  • Backups now run every 2 hours instead of nightly.
  • A new script takes a snapshot every hour and retains snapshots for 24 hours. Snapshots can be restored almost instantly.
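
For anyone wanting something similar, the hourly snapshot rotation described above could look roughly like the Python sketch below. This is not the actual script. It assumes the snapshots are ZFS snapshots, and the dataset name tank/lemmy, the hourly- name prefix and the timestamp format are all made up for illustration. It only builds the zfs snapshot and zfs destroy commands, so the plan can be checked before anything is run.

```python
from datetime import datetime, timedelta, timezone

PREFIX = "hourly-"        # hypothetical prefix for snapshots made by this script
STAMP = "%Y%m%dT%H%M%SZ"  # assumed UTC timestamp format inside the snapshot name


def snapshot_name(now: datetime) -> str:
    """Build a snapshot name from a timezone-aware time."""
    return PREFIX + now.astimezone(timezone.utc).strftime(STAMP)


def expired(names: list[str], now: datetime, keep_hours: int = 24) -> list[str]:
    """Return the hourly snapshot names taken more than keep_hours before now."""
    cutoff = now - timedelta(hours=keep_hours)
    old = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # never touch snapshots this script did not create
        try:
            taken = datetime.strptime(name[len(PREFIX):], STAMP)
        except ValueError:
            continue  # unparseable name: leave it alone
        if taken.replace(tzinfo=timezone.utc) < cutoff:
            old.append(name)
    return old


def plan(dataset: str, names: list[str], now: datetime) -> list[list[str]]:
    """Commands for one hourly run: take a new snapshot, then prune old ones."""
    cmds = [["zfs", "snapshot", f"{dataset}@{snapshot_name(now)}"]]
    cmds += [["zfs", "destroy", f"{dataset}@{n}"] for n in expired(names, now)]
    return cmds


if __name__ == "__main__":
    # Dry run with a hypothetical dataset and a hand-written snapshot list.
    existing = ["hourly-20230701T110000Z", "manual-before-upgrade"]
    for cmd in plan("tank/lemmy", existing, datetime.now(timezone.utc)):
        print(" ".join(cmd))
```

Run hourly (from cron or a systemd timer, for example), each command list could be handed to subprocess.run once the output looks right. Snapshots without the prefix are skipped on purpose, so manual snapshots are never pruned.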

I’ll also do some testing with the CLI proxmox-backup-client so I can work out how to restore a single disk into a zvol.

  • @lazyadmin
    110 months ago

    Unintended side effect: mobile apps don’t support the upgraded version yet, but it’s not possible to revert at this point. We’ll need to wait for the Android apps to catch up.