Data Lost and Found in a Distributed System

HYPHA

February 7, 2022

Healing data loss with Matrix’s federated protocols

Matrix is a “decentralized real time chat” standard and software, an alternative to existing closed services like Slack that is positioned as having “no single points of control or failure.” We’ve used Matrix extensively within Hypha for our internal chat, and while we’re fans, there are still moments when we are surprised by how a design goal like federation plays out in practice. One such moment came when a failed server upgrade led to data loss and the federated Matrix protocol “healed” some of that data.

Software upgrades are tricky

Hypha uses a Matrix server for internal and external communication. Our Matrix server is hosted by Toronto Mesh members who are also part of Hypha, using resources donated by a former PittMesh member. In April 2021, Synapse, one of Matrix’s software components, dropped support for our operating system, Debian 9. That led to a planned server upgrade to Debian 10 and a sequence of additional upgrades and migrations to ensure ongoing software support.

What sounded like routine maintenance became a bigger undertaking than we expected. The plan was to update Synapse at the same time because security advisories had been issued for the version we were running. After upgrading Debian, we needed to upgrade PostgreSQL from 9.6 to 11. This required copying the databases from the 9.6 installation to the 11 installation, because the existing databases were not compatible with the new version of PostgreSQL. While the large database was being copied, the Toronto Mesh member doing the upgrade signed off for the night and handed off to another member, who would pick up where the upgrade left off once the copy finished.

The next morning, on June 6, 2021, another Toronto Mesh member continued the upgrade process. However, unbeknownst to them, the system had run out of disk space and the database migration had failed. The next step in the guide was to run the cleanup command, which deletes all the old databases without verifying that all data has been moved over. So the cleanup was performed and the machine rebooted with an incomplete database. Soon enough, users started noticing missing chat rooms and messages. Toronto Mesh members tried to restore the machine from backup, but we realized that the most recent backup image was dated January 4, 2021, five months earlier. We learned that the backup hardware had been removed to make room for other customers in the datacentre, with a plan for offsite backup, but that setup was delayed. After working with the server provider to restore the machine to its state as of January 4, 2021, members upgraded the machine again to a working Debian 10 with the latest version of Synapse.
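
The two checks that were missing here (enough free space before the copy, and a completeness check before the cleanup) can be sketched in a few lines of Python. This is a minimal illustration, not the upgrade guide’s procedure: the function names, the 20% safety margin, and the idea of comparing per-table row counts are all assumptions, and the table names in the usage example are placeholders.

```python
import shutil


def has_room_for_copy(source_bytes, target_path, margin=1.2):
    """Return True if target_path has room for source_bytes plus a safety margin.

    The margin of 1.2 is an arbitrary assumption, not a PostgreSQL recommendation.
    """
    free = shutil.disk_usage(target_path).free
    return free >= source_bytes * margin


def migration_problems(old_counts, new_counts):
    """Compare per-table row counts and describe every table that is missing or short.

    Both arguments map table name -> row count, gathered however you like
    (for example with SELECT count(*) on each side).
    """
    problems = []
    for table, expected in sorted(old_counts.items()):
        actual = new_counts.get(table)
        if actual is None:
            problems.append(f"{table}: missing in new database")
        elif actual < expected:
            problems.append(f"{table}: {actual} of {expected} rows copied")
    return problems


def safe_to_clean_up(old_counts, new_counts):
    """Only allow deleting the old databases when nothing is missing or short."""
    return not migration_problems(old_counts, new_counts)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    old = {"events": 1000, "users": 40}
    new = {"events": 612}
    for problem in migration_problems(old, new):
        print(problem)
    print("safe to clean up:", safe_to_clean_up(old, new))
```

A row-count comparison is a coarse check, since tables that are still being written to will drift, but it would have flagged a copy that stopped partway through before anything was deleted.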

Once the system was updated to the latest software, the Toronto Mesh members in charge of the server started looking at the impact of the five-month gap since the last backup. Private messages, some room avatars, user accounts created during the affected period, and password changes in that period had been lost. We already had a policy of purging chat messages more than three months old, and a norm of treating our chat as ephemeral rather than as an authoritative record of decisions. These limited the impact on our operations, although the data loss was still a disruption. We have since addressed the larger questions of why our internal processes failed and developed mitigation strategies to prevent a recurrence. For instance, we now monitor our daily backup processes to ensure that we have current snapshots to fall back on in case of failure.
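
As an illustration of what monitoring a daily backup can look like, the sketch below checks the age of the newest file in a backup directory. It is an assumption-laden example rather than a description of our actual setup: the function names, the directory layout (one file per snapshot), and the 26-hour threshold are all hypothetical.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, now=None):
    """Return the age in hours of the most recently modified file, or None if there is none."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600


def backup_is_current(backup_dir, max_age_hours=26, now=None):
    """Return True only if a backup exists and is newer than max_age_hours.

    26 hours is an assumed threshold: a daily job plus some slack.
    """
    age = newest_backup_age_hours(backup_dir, now)
    return age is not None and age <= max_age_hours
```

Run from a scheduler, a check like this would raise an alert as soon as a day’s snapshot is missing, instead of the gap being discovered months later during a restore.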

So, where’s the data?

But there was a surprise: thanks to the federated protocols Matrix is built on, not all data from that period was lost. Public rooms (those indexed globally) created prior to the last backup had their messages and files resynchronized and restored from other Matrix servers, whereas files and messages in unfederated rooms were lost. Also, if a user’s account was lost and they re-registered with the same username, they were automatically rejoined to the public channels they were once in, because those other Matrix servers had a record of them being members. Federated rooms created during the period of data loss did not initially appear on Toronto Mesh’s Matrix server. However, once users registered on the Toronto Mesh server were invited into a ‘lost room’ by users from other parts of the federation, the @tomesh.net users’ access was restored. This made the process of rebuilding Hypha’s lost Matrix communications channels much easier than we’d originally anticipated.
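
The pattern we observed can be illustrated with a toy model. This is not the Matrix federation API or Synapse’s implementation; the `Server` class, the `send`, `snapshot`, and `backfill` functions, and the room names are all hypothetical, and real servers exchange far richer state than sets of event IDs.

```python
import copy


class Server:
    """A toy homeserver that maps room_id -> set of event IDs it holds."""

    def __init__(self, name):
        self.name = name
        self.rooms = {}

    def store(self, room_id, event_id):
        self.rooms.setdefault(room_id, set()).add(event_id)


def send(room_id, event_id, participants):
    """Deliver an event to every server participating in the room."""
    for server in participants:
        server.store(room_id, event_id)


def snapshot(server):
    """Return an independent copy of a server, standing in for a backup image."""
    return copy.deepcopy(server)


def backfill(restored, peers):
    """Copy missing events from peers, but only for rooms the restored server knows about.

    Returns a dict of room_id -> number of events recovered.
    """
    recovered = {}
    for room_id, events in restored.rooms.items():
        for peer in peers:
            missing = peer.rooms.get(room_id, set()) - events
            if missing:
                events |= missing
                recovered[room_id] = recovered.get(room_id, 0) + len(missing)
    return recovered


if __name__ == "__main__":
    home, peer = Server("home"), Server("peer")
    send("public", "e1", [home, peer])      # federated room, before the backup
    send("local-only", "m1", [home])        # unfederated room, before the backup
    backup = snapshot(home)
    send("public", "e2", [home, peer])      # after the backup
    send("local-only", "m2", [home])        # after the backup, exists nowhere else
    send("new-room", "n1", [home, peer])    # room created after the backup

    restored = snapshot(backup)             # roll the home server back
    print(backfill(restored, [peer]))
    print(sorted(restored.rooms))
```

In this model the federated room that predates the backup recovers its later event, the unfederated room stays at its backed-up state, and the room created after the backup is absent until something (in our case, a fresh invite) tells the restored server that it exists.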

The Bigger (Federated) Picture

These reflections have us returning to Moxie Marlinspike’s post from 2016 “The ecosystem is moving.” While we are committed to prefiguring alternatives in the tool choices we make, we recognize his point: running services is hard, and while federation allows for collective control, it’s more challenging to improve services and protocols. If you need to make significant changes and be adaptable, centralized offerings allow for rapid rollout in a top-down manner.

Would this have happened if we used a SaaS product such as Slack? This amount of data loss is extremely unlikely, although a server crash at a centralized service is always possible. Avoiding the headaches of running your own infrastructure and ensuring appropriate backups is part of the reason people choose SaaS providers. On the flip side, Matrix’s federated model means our federated data exists on other servers on the network and could be restored once the ‘weak link’ in the federated system – in this case, a failed server upgrade – was fixed.

Federated models offer interesting use cases for data recovery when faced with censorship or similar challenges. As we demonstrate here, repopulating lost data from other nodes in the network can make recovery possible.

However, federated networks also pose challenges: the retention of public room chats on other servers is not intuitive, and not something that all users would expect. While Matrix allows for private rooms, private messages, and an option to encrypt those, there is still the chance that some data may be retained on servers against the intent of members in that room. As with Mastodon and other self-hosted social media services, each server can set its own policies on data retention. As mentioned above, at Hypha we purge messages older than three months, and we have personal relationships with the people who host the server where that data lives.

In addition to there being no single point of control or failure, there is also no single point of view over the chat rooms for all users on the same server or even across servers. Instead, we all have partial views, and while that may conform to the partial perspective of our lived experiences (and the one called for by feminist science studies scholars), it is a departure from the totalizing view presented by a lot of contemporary social media and chat platforms. These possibilities offer sites for fruitful experimentation as we seek alternatives to existing approaches to technology, despite a few stumbles along the way.