The SCT is wrong about the predecessor field of the foundation office

First of all please excuse the click-baity nature of the title. I tried to find a better title, but I couldn't come up with anything else short and descriptive (and it isn't even short!). But I do intend to have a more nuanced statement in this post if you are willing to follow along.

For context, the SCT (Spec Core Team) recently published a statement about the predecessor field of the create event in the Office of the Matrix Foundation room. You can find their statement here. This became necessary because several servers have refused to join the room since it was recently upgraded. These servers refuse to join because the m.room.create event contains a predecessor field that is a string instead of an object, as the spec requires. Opinions on what this requirement means are split. The topic has come up several times, since several people are unable to join the room and nobody has so far stepped up to either change the server behaviour or upgrade the affected room.
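To make the difference concrete, here is a minimal sketch of the two shapes and a strict check. The room and event IDs are made up, the helper name is hypothetical, and I am assuming that a well-formed predecessor is an object carrying at least a string room_id. I don't reproduce the actual value from the office room here.

```python
from typing import Any

# Hypothetical content of a well-formed m.room.create event (IDs are made up).
valid_content = {
    "room_version": "10",
    "predecessor": {
        "room_id": "!old:example.org",
        "event_id": "$someevent",
    },
}

# The problematic shape discussed in this post: predecessor is a plain string.
broken_content = {
    "room_version": "10",
    "predecessor": "!old:example.org",
}


def predecessor_is_well_formed(content: dict[str, Any]) -> bool:
    """Return True if predecessor is absent or an object with a string room_id."""
    if "predecessor" not in content:
        return True  # the field is optional
    pred = content["predecessor"]
    # Assumption: only room_id is checked here; a real check may need more.
    return isinstance(pred, dict) and isinstance(pred.get("room_id"), str)
```

A strict server would refuse the second shape, while a lenient one has to decide what to do with the rest of the event.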

As a result, Synapse servers can join the office room, but servers from the Dendrite and Conduit families can't.

Now I hold the strong opinion that the above statement of the SCT is incorrect, both in that it isn't based on rules written in the Matrix specification and in where it leads Matrix as a whole. Since that is obviously a strong statement, let me explain myself. But to do that, we need to take a short detour.

What is Matrix

Yes, we are starting here, but bear with me. First to quote the matrix.org website:

Matrix is an open protocol for decentralised, secure communications.

Later it also outlines some trade-offs on https://matrix.org/foundation/about/:

(partial quote)

What is the SCT

SCT is short for Spec Core Team. To quote again:

The contents and direction of the Matrix Spec is governed by the Spec Core Team; a set of experts from across the whole Matrix community, representing all aspects of the Matrix ecosystem. The Spec Core Team acts as a subcommittee of the Foundation.
Members of the Spec Core Team pledge to act as a neutral custodian for Matrix on behalf of the whole ecosystem and uphold the Guiding Principles of the project as outlined above. In particular, they agree to drive the adoption of Matrix as a single global federation, an open standard unencumbered from any proprietary IP or software patents, minimising fragmentation (whilst encouraging experimentation), evolving rapidly, and prioritising the long-term success and growth of the overall network over individual commercial concerns.

In practice this means the SCT is usually responsible for merging Matrix Spec Changes, doing spec releases, writing clarifications and often also other pull requests to the spec.

Technical inaccuracies in the SCT's position

Now that we have laid out some of the definitions, we can look at the position of the SCT in more detail.

The first problem is that the SCT compares a field in the m.room.create event with an arbitrary, optional field of another state event (they do mention that at the very end). The m.room.create event is very different from other state events. Specifically:

The redaction protection isn't that relevant to the statement by the SCT, but some arguments were made during the discussion that servers should only validate the redacted events, and I wanted to get those arguments out of the way first.

A significant part of the SCT's statement talks about how validating an invalid event during state resolution would cause a room split, since different servers would reach different conclusions about the event's validity. Since the create event does not take part in state resolution, I don't consider that argument to be very useful. It might be useful for events in general, but the create event is already special.

Beyond that, if servers are supposed to interoperate, parsing the create event under the same rules on every server is somewhat important, since that is where the room version is persisted (although a server also receives the room version out of band when trying to join a room). While the SCT has said that such an event should be accepted, it doesn't define exactly what that would look like. The homeserver can't ignore the whole event, because it might need the sender or the room version of the create event. As such, a server and every client has to provide very fine-grained error handling and consider every field potentially invalid every time it handles an event. Most clients and servers don't currently do that. Instead they often discard whole events if they fail to parse them, or only read fields on demand and then run into unpredictable behaviour later down the line.
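As a rough illustration of what such per-field handling could look like, here is a minimal sketch. The CreateInfo type and the function name are hypothetical, and I am assuming that a missing room_version means version "1". It is not how any particular server does it.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CreateInfo:
    sender: str
    room_version: str
    predecessor_room_id: Optional[str]  # None if absent or malformed


def parse_create_leniently(event: dict[str, Any]) -> CreateInfo:
    """Read each field on its own; fail only if an essential field is unusable."""
    sender = event.get("sender")
    if not isinstance(sender, str):
        raise ValueError("create event has no usable sender")

    content = event.get("content")
    if not isinstance(content, dict):
        content = {}

    # Assumption: a missing room_version is treated as version "1".
    room_version = content.get("room_version", "1")
    if not isinstance(room_version, str):
        raise ValueError("create event has no usable room_version")

    # A malformed predecessor is silently treated as "no predecessor".
    pred = content.get("predecessor")
    pred_room = None
    if isinstance(pred, dict) and isinstance(pred.get("room_id"), str):
        pred_room = pred["room_id"]

    return CreateInfo(sender, room_version, pred_room)
```

Note how even this small sketch has to make three separate decisions (fail, default, ignore), and none of them is spelled out anywhere for implementers to agree on.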

Now the SCT rightfully points out that servers and clients are supposed to validate events before they use them. Quoting the spec again:

Event bodies are considered untrusted data. This means that any application using Matrix must validate that the event body is of the expected shape/schema before using the contents verbatim.
It is not safe to assume that an event body will have all the expected fields of the expected types.

However, the specification only defines in a few places what this could look like:

As such, validating the response to the send_join request is not currently forbidden by the specification (according to my reading; the SCT has not provided any links to back up their interpretation). The joining homeserver isn't part of the room, so it can only send join events, which are filled out by the remote server. If it refuses to join the room, the room is not split-brained, apart from the joining homeserver having sent a join event that it then doesn't keep around. This behaviour is no different from a server rolling back to a database backup, crashing during the join or purging a room from the local database. The only noticeable impact is on the users, because they can't communicate.

However, this only applies in cases where a server has already violated the specification by accepting an invalid event over the /send/ endpoint (or equivalent). As such, fixing the homeserver that originally sent this event is another valid option to resolve this problem. Under normal circumstances such an event would never be received, since clients are supposed to follow the specification when sending their events. In this specific case, clients usually use the upgrade endpoint to send the event, in which case the server should create the correct event. Rooms with a create event broken in this manner are therefore very few and could easily be upgraded to resolve the issue, especially since the issue only appears when someone upgraded the room in the first place.

Furthermore, I believe that the above quote actually says you should validate events even more, since you can't rely on others to do it for you. But it doesn't define what that would look like.

The SCT also claims that the predecessor field was only introduced in spec version r0.5.0. While that is true, this is also the version that introduced the upgrade endpoint. As such, the only room versions where this could have an effect on validation should have been room version 1 (which nobody should have upgraded to) and room version 2. The latter would have needed manual upgrades, and people shouldn't have used the predecessor field for those, since it didn't exist yet and people shouldn't use new fields without namespacing them (although I think that requirement was also only added later). The SCT focuses on a view of the spec that doesn't reflect reality.

The directional mistakes by the SCT on this issue

Accepting such a "broken" event means that implementations don't notice bugs in their behaviour, and the features that people wanted from sending the event are broken. The event has a predecessor, but no client or server will understand it. As such you will never get a beautiful upgrade experience for that room, where you can just scroll up to see the old room's history, because no client or server can guess what the predecessor is. This could have been avoided by notifying the user of their mistake early. And afterwards this could have been resolved quickly by upgrading the room, which would have made the impact of the issue minor and reduced frustration. Instead members of the SCT decided that this should become a controversial topic by making sure people run into it regularly.

This was in my opinion the first mistake by the foundation or the SCT. The discussion around this issue has been emotionally draining for both sides of the argument and could have been avoided entirely by someone upgrading the one affected room a second time. Instead the responsible people chose to lock out everyone on a server implementation that isn't Synapse from the foundation office. This has provided a frequent reminder for people about this issue and set a bad precedent for the experience users can expect from Matrix.

Additionally, it became obvious during the discussion that every server that implemented this feature after reading the spec chose a different interpretation from the one the SCT decided on in the end. The only server (to my knowledge) dealing with the event "correctly" is Synapse, which implemented room joins before the spec existed. Specifically the SCT says:

However, the event is now part of the room, and was added legally, so the spec requires other servers to accept it over federation.

As I pointed out above, neither of those claims is written in a clear manner in the Matrix specification. At least in my opinion, the SCT didn't provide sufficient links to the specification to back up their decision. So my conclusion is that the spec is not clear enough on this issue and needs to be clarified in either direction, especially since we have more examples of people reading the specification differently than the SCT. If people misunderstand (empirically 100% of the time) what you wrote down, then you can't come to the conclusion that the specification requires one behaviour or another without sufficient asterisks attached.

The SCT then goes on to ask for an MSC to change this behaviour. While it is a small nuance, I think they should have asked for an MSC to define a behaviour. We have multiple independent implementations that reached different conclusions. If we value interoperability, then we need to define one behaviour and do that in a clear and obvious manner.

However, I think it is worth questioning what interoperability we actually want to achieve. Instead of trying to attach meaning to the existing specification, we should look at implementations and at where we want to go in the future.

The SCT has defined their interoperability as being able to join a room no matter how broken it is. This in my opinion has drastic consequences. For one, it makes issues much harder to notice. Instead of failing at the earliest possible point and returning an error, we get implementation-defined behaviour at every point the event travels to. Because the specification doesn't define what "gracefully degrading" on invalid events means, the ecosystem has formed very different opinions on it. Some clients ignore the event entirely, which implicitly disables some functionality for the room, as the client can't know what the room version looks like. Others just fail to merge the room timelines. And a third client might decide to start supporting such event contents as if they were valid. If I implemented a server or client, I would want to get an error from my server if I send it malformed events of a type that it knows about. This is much more helpful than having to figure out why things work incorrectly with other clients a few months later!
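Failing early could be as simple as the following sketch. The handler, the validator registry and the exact error text are hypothetical, and I am assuming that answering with HTTP 400 and the M_BAD_JSON error code is an acceptable way to report this. Event types the server doesn't know about pass through untouched.

```python
from typing import Any, Callable

Validator = Callable[[dict[str, Any]], list[str]]


def validate_create(content: dict[str, Any]) -> list[str]:
    """Collect human-readable problems with m.room.create content."""
    errors = []
    if "predecessor" in content and not isinstance(content["predecessor"], dict):
        errors.append("predecessor must be an object")
    return errors


# Hypothetical registry: only event types the server knows about get checked.
VALIDATORS: dict[str, Validator] = {"m.room.create": validate_create}


def handle_send(event_type: str, content: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    """Return (HTTP status, response body) for an event a client wants to send."""
    validator = VALIDATORS.get(event_type)
    errors = validator(content) if validator else []
    if errors:
        # Tell the sender right away instead of letting the event spread.
        return 400, {"errcode": "M_BAD_JSON", "error": "; ".join(errors)}
    return 200, {}
```

With something like this in place, the person sending the broken predecessor would have seen an error on the spot, long before anyone on another server tried to join.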

This has a significant impact on how people perceive Matrix. While some people celebrate how many things you can break, other users are just frustrated by the experience. They can't rely on what they try to do in their client having the same impact on someone in a different client or on a different server. As such the SCT focuses on a very narrow definition of interoperability. I instead would prefer that interoperability means that things actually work (which is how I understand "operate").

Additionally, implementing proper "graceful" fallbacks is a game of whack-a-mole. The specification doesn't cover edge cases. As such every client and server developer gets to have the fun of figuring those out on their own. This isn't as easy as ignoring any field that doesn't fit some schema. You may need to ignore a field because it violates the schema in one of its arbitrarily deeply nested subfields. In many cases you need to think up custom solutions to problems.
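To show why this gets fiddly, here is a toy sketch of dropping a top-level field because something deep inside it is wrong. The mini-schema format (a dict maps required keys to sub-schemas, a type means "value must be of this type") and both function names are made up for this post, and the field names are only illustrative.

```python
from typing import Any


def matches(value: Any, schema: Any) -> bool:
    """Check a value against the toy schema, descending into nested objects."""
    if isinstance(schema, dict):
        return isinstance(value, dict) and all(
            key in value and matches(value[key], sub)
            for key, sub in schema.items()
        )
    return isinstance(value, schema)


def drop_invalid_fields(content: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Keep unknown fields; drop known fields that fail anywhere in their subtree."""
    return {
        key: value
        for key, value in content.items()
        if key not in schema or matches(value, schema[key])
    }


schema = {"info": {"thumbnail_info": {"w": int}}}
content = {"body": "cat.png", "info": {"thumbnail_info": {"w": "wide"}}}
cleaned = drop_invalid_fields(content, schema)
```

Here the whole info object disappears because of one bad width two levels down. Whether that is the right call, or whether only thumbnail_info should go, is exactly the kind of decision every implementation currently has to invent for itself.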

This decision by the SCT is, in my opinion, another step to making the experience on Matrix more frustrating for its users and developers.

My definition of interoperability would include having clear rules on behaviour and asking implementations to fix their behaviour if they violate it. In this specific case the client sending the broken predecessor should be fixed as well as Synapse to validate that content on all its client APIs.

That would trade short-term incompatibilities for a more reliable ecosystem in the long term. You will always have implementations that are unwilling to fix their behaviour, or users who don't get updated. Element iOS for example still refuses to display stickers with optional fields missing, even 5 years after that issue was filed. But I think designing the specification around that would be a big mistake.

HTML is an example that might be brought up as a counterpoint. It has historically been very accepting of malformed input. But in its case this mostly involves only 2 participants: the author and the clients rendering it. Matrix has a lot more participants in its ecosystem and as such needs to be stricter with its spec to reach the same level of reliability. HTML also doesn't have that great of a track record on the reliability front so far, since you are basically required to test against all implementations to get even a temporarily stable result.

In my opinion the spec shouldn't be clarified to focus on the lowest common denominator. We should instead allow stricter implementations that make the ecosystem easier to debug and more interoperable in the long term. Otherwise we gain more fragmentation in the long term, which means that instead of valuing "interoperability over fragmentation" we get more of the latter with a very narrow definition of the former in the end. And instead of finding the perfect solution to validating all events, we should focus on fixing the small interoperability issues we notice as quickly as possible to get a more stable and enjoyable ecosystem long term.

Currently the spec does mention that events should be validated. The specification mentions required fields and error codes. But in practice none of them can be relied on. Any required field is allowed to be missing and, according to the SCT, the event has to be accepted anyway. The only benefit is that we allow events that slipped in to proliferate, even though they cause problems for users and developers. So what do "validation" and "required" even mean? I would say every developer has fallen into the trap of assuming a different meaning than the SCT does. The specification did add the section quoted above to limit the impact of that, but in practice that makes no difference because of the issues outlined above.

Another issue I take with the SCT's direction is that a lot of the frustration users and developers experience with Matrix and its lackluster approach to validation has been brought up in the discussion, but has not been acknowledged by the SCT. Many developers have left the ecosystem because dealing with Matrix can be so frustrating. You can never know if you covered all edge cases or not, and I personally have written a hundred times more lines of tests for just figuring out the body of an event than it took me to actually implement it. At the same time users are frustrated with Matrix, since it just isn't reliable in their experience. I feel like the SCT should have at least acknowledged these concerns and outlined some solution for those people. It currently feels like the frustration in the ecosystem is falling on deaf ears, even though that might only be an issue in communication. I do strongly believe that the ecosystem is aware of these frustrations and empathizes with them, but just forgets to mention it.

In Summary

To quickly summarize the above rambling:

This isn't to say I feel hostile towards the SCT. This blog post is only intended to provide arguments that I feel have been overlooked. But the SCT seems to have reached their conclusion behind closed doors, and it is hard to argue with a long document without a long document of your own, since it looks like you are just picking on individual points when commenting on it. As such I wanted to provide my position on it, and I hope the SCT will change their stance to make sure Matrix becomes a more enjoyable experience long term, instead of forcing implementations to interoperate for the sake of presumably a single broken room. I have already put a lot of effort into trying to move the specification in that direction and I would like to continue to do so. But at times it feels like it is much easier to write an MSC for a new feature than to simplify existing aspects of the specification, and like Matrix doesn't want to prevent frustration and instead embraces complexity.
