If you are using Matrix, you might have seen this:
Dammit, Matrix betrayed us! You can't read the message and everything is terrible! But why does this happen? And what can be done to fix it? There are a lot of possible causes. Let me see if I can shed some light on a few of them.
Why and how are messages encrypted?
Before we go into why something fails to decrypt, maybe it makes sense to briefly explain how the messages are encrypted in the first place.
What is End-to-End Encryption (E2EE)? There are lots of different kinds of encryption. You are probably viewing this through a TLS-encrypted HTTPS connection. You can encrypt a file or a whole filesystem. Basically, encryption makes a message readable for people who have the key for it, but unreadable to people who don't.
End-to-End specifically refers to who receives the keys. When you send a message to someone, only you and the other person receive the keys to decrypt the message. It is like old phones (if they were secure). You have your telephone on your end and a telephone on the other end, and only those devices should be able to decrypt the message. No one in the middle should be able to, even when tapping into the telephone wire. For that reason the ends are called "devices" in Matrix. Some clients call those sessions, because you can have multiple clients or even multiple instances of a client on a physical device, so the name device can be confusing. But in the following text we will consistently call the ends devices, as session is a loaded term and can also mean other things. It is also a bit easier to imagine physical devices, in my opinion.
Message events in a room are encrypted using Megolm, which is a combination of AES with a ratchet (state events are unencrypted). The ratchet is like a zip tie: it only moves in one direction. It ticks forward by one step every time you encrypt a message with it. If you send someone a Megolm key at a specific ratchet position, they can decrypt all future messages with it, but not older ones. This makes encryption pretty efficient, since you only need to send the encryption key once to all people and they can decrypt future messages with it too. Now obviously if someone leaves the room, you don't want them to be able to decrypt messages anymore, so in that case you create a new key. Similarly, if 7 days have passed or you have already used the key for 100 messages, clients will also generate a new Megolm key. This means an attacker can't just read all messages by compromising a single key.
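To make the zip tie a bit more concrete, here is a minimal toy sketch in Python. It is not the real Megolm ratchet (which is more involved and specified separately); it only assumes a plain SHA-256 hash chain to show the one-way property. All names (`advance`, `state_at`, `needs_rotation`) are made up for this illustration, and the rotation thresholds are simply the 7 days and 100 messages mentioned above.

```python
import hashlib

# Thresholds taken from the text above; real clients read them from room settings.
ROTATION_MAX_MESSAGES = 100
ROTATION_MAX_AGE_SECONDS = 7 * 24 * 60 * 60


def advance(ratchet_state: bytes) -> bytes:
    """One tick of the toy ratchet: easy to compute forwards, infeasible to reverse."""
    return hashlib.sha256(b"toy-ratchet" + ratchet_state).digest()


def state_at(shared_state: bytes, shared_index: int, wanted_index: int) -> bytes:
    """Derive the state for wanted_index from a state that was shared at shared_index."""
    if wanted_index < shared_index:
        raise ValueError("cannot wind the ratchet backwards")
    state = shared_state
    for _ in range(wanted_index - shared_index):
        state = advance(state)
    return state


def needs_rotation(messages_sent: int, age_seconds: float, member_left: bool) -> bool:
    """Decide whether the sender should start a fresh key (toy policy)."""
    return (
        member_left
        or messages_sent >= ROTATION_MAX_MESSAGES
        or age_seconds >= ROTATION_MAX_AGE_SECONDS
    )
```

Someone who received the state at position 5 can call `state_at(shared, 5, 8)` to reach message 8, but asking for position 4 raises an error, which is exactly the "too new key" situation.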
Now you might ask, how do people get those Megolm keys? Obviously you can't send them in plain text. So they are instead sent Olm encrypted directly between clients.
Directly in Matrix terms. It might still go through one or two servers, but it will only be delivered to a single device and only be decryptable by that device.
Now if Olm was the same as Megolm, you would have another encryption key that you would need to send to the other party, and we would need another encrypted channel to send that. The good thing is that Olm and Megolm are different. It is not actually encryption all the way down; Olm is enough. Olm works using 4 keys to create an encryption key. It uses the identity keys of the devices on each end, let's call them Bob and Alice. If it only used those two keys, however, every Olm session would be the same. For that reason it also uses a one-time key (OTK) from the other party and a locally generated one-time key. Olm sessions have a ratchet too. They actually have 2 ratchets, one for each participant. They also differ from Megolm in that they are used by both parties to encrypt messages. Both Alice and Bob use the same Olm session to communicate between their 2 devices (a different one for other devices though), while each of them uses their own Megolm key to encrypt messages in rooms. After each encrypted to-device message they send, that side ratchets their session forward. That way, if an Olm session gets compromised, both sides can immediately notice, because the ratchet will mismatch, either by being too old or by being incremented differently, which causes an undecryptable to-device message.
One last thing to understand about encryption in Matrix is the extra ways to share and back up Megolm keys. If for some reason a Megolm key is missing, a device can request the key from other devices by sending a key request. Usually that request is sent to your own devices and the original sending device (as those are most likely to have the key and be willing to share it). Keys can also be backed up, either by manually exporting all Megolm keys on the local device from the settings or by storing them online. You might remember having a recovery key or passphrase. That is a key to decrypt a key that can then be used to decrypt keys from the online backup. (Some indirection there, so that the single key can be used for multiple things securely.) An offline backup is usually encrypted using a password specific to that file.
Why are Olm sessions not backed up? Olm sessions are device-specific. It makes no sense to back up something that won't work on another device.
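For the curious, a key request is just a small to-device event. The sketch below builds one in Python. The field names follow my reading of the `m.room_key_request` event in the Matrix spec, so double-check them there before relying on this. The helper names (`build_key_request`, `address_key_request`) and the example IDs are made up for illustration.

```python
import uuid


def build_key_request(room_id: str, session_id: str, requesting_device_id: str) -> dict:
    """Content of an m.room_key_request to-device event asking for one Megolm key."""
    return {
        "action": "request",
        "body": {
            "algorithm": "m.megolm.v1.aes-sha2",
            "room_id": room_id,
            "session_id": session_id,
        },
        "request_id": uuid.uuid4().hex,  # lets us cancel the request later
        "requesting_device_id": requesting_device_id,
    }


def address_key_request(content: dict, own_user_id: str,
                        sender_user_id: str, sender_device_id: str) -> dict:
    """Address the request to all of our own devices ("*") and the original sending device."""
    messages = {own_user_id: {"*": content}}
    messages.setdefault(sender_user_id, {})[sender_device_id] = content
    return messages
```

The result of `address_key_request` has the `user -> device -> content` shape that to-device messages use. Whether anyone answers is a different story, as the rest of this post shows.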
When encryption fails
Now that we have at least a very shallow understanding of how the encryption works, we can take a look at why it fails. If you cannot decrypt a message, you usually don't have the key for it. But there are different reasons why you might not have this key. Sometimes you do have a key, but it is only valid for newer messages, because the other side thinks you weren't in the room at the time the message was sent and only joined later.
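Those two cases (no key at all, or a key that starts too late) can be told apart with a tiny check. This is only a sketch under my own assumptions: `known_sessions` is a hypothetical map from a Megolm session ID to the earliest ratchet position the device holds, and the returned strings are made up rather than taken from any client.

```python
from typing import Dict


def diagnose(known_sessions: Dict[str, int], session_id: str, message_index: int) -> str:
    """Explain why a message can or cannot be decrypted (toy model)."""
    if session_id not in known_sessions:
        return "missing key"
    if message_index < known_sessions[session_id]:
        # We hold the key, but only from a later ratchet position onwards.
        return "key too new"
    return "ok"
```

With `known_sessions = {"abc": 3}`, message 2 of session `abc` reports "key too new", message 3 is fine, and anything from another session reports "missing key".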
Let's look at some reasons, why you might not have the Megolm key for a message.
You just logged in and didn't restore your backup yet
If you just signed in, you won't be able to read any of your old encrypted messages. You need to first feed your client with the keys. You can do that a number of ways:
- Use the online backup. Just enter your recovery key or recovery passphrase and by the power of magic, you should slowly get access to your messages!
- Restore an offline backup. Just go to your settings, select a backup that has all your Megolm keys in it and by the power of lesser magic, you will have access to your encrypted messages after the import! AMAZING!
- Now comes the real magic. Clients can request keys from each other by sending a key request. That way you can request the keys to all unreadable messages either from one of your devices or the original sender. This however has limitations. What if your other clients don't exist or are offline? Or the original sender isn't online? Well, then no one can send you keys, so you will never receive them. Sorry, make sure to keep a backup next time!
Basically, you need some way to recover old keys. If there is none, you won't be able to read the messages. You can recover the keys also from other people and devices, but they need to be online and willing to share.
You logged out all your devices and can't decrypt any message received since your last login
If you look at the graphic above, the decryption keys are sent to all devices. A device only exists while it is signed in. So if you log out all your devices, there is nothing to send the keys to and you won't be able to decrypt any messages sent in the meantime!
This is actually fairly common. People sign out of their browser session and then, when they log back in, they can't read messages received in the meantime. An easy way to avoid that is to always keep a device signed in. In the long term dehydrated devices should fix that. That basically dries and pickles a device and stores it in your secure storage. Then once you sign in, that device gets rehydrated and used, and you can fetch all the messages received in the meantime.
Sender signed out
It was mentioned before already, but in some cases only the original sender has the keys to send to you. Usually they do that immediately, but sometimes that fails (because of networking, OTK exhaustion, or you not having a device at the time). Only the original sending device keeps track of who that particular Megolm key should go to. So if they sign out, it will never arrive and you will never be able to decrypt the messages sent by that device.
This is usually only an issue with short-lived logins of your communication partner. In most cases you should have received the keys, which allows you to share them freely between your devices, store them in a backup and more. But sometimes this does happen and there is no good plan yet to avoid that. Maybe who should receive the keys could be stored in the encryption keys (dMLS maybe?) or some form of transitive trust could be used for key forwarding. But currently there is nothing. If you send an encrypted message, make sure the keys were sent before you sign out.
To improve security most Matrix clients allow you to opt into sending encryption keys only to verified devices. People often don't understand all the implications of that, but this means if the sender didn't verify you, you will not be able to read their messages. In most cases this should be obvious from the error message. The message should say something about keys being withheld. But if you experience encryption issues, it can be a good idea to at least ask if that option is on or off.
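Roughly, the sender's client splits your devices into two piles before sharing a key. Here is a toy version in Python. The function name and input shape are made up, and the `m.unverified` code is how I remember the `m.room_key.withheld` event from the spec, so treat it as an assumption rather than gospel.

```python
from typing import Dict, List, Tuple


def plan_key_sharing(devices: Dict[str, bool],
                     only_verified: bool) -> Tuple[List[str], Dict[str, str]]:
    """devices maps device ID -> "has the sender verified it?".

    Returns the devices that get the Megolm key and, for the rest,
    the withheld code they would be told instead.
    """
    share_with: List[str] = []
    withheld: Dict[str, str] = {}
    for device_id, verified in devices.items():
        if only_verified and not verified:
            withheld[device_id] = "m.unverified"
        else:
            share_with.append(device_id)
    return share_with, withheld
```

With the option on, an unverified device ends up in the `withheld` pile and will show an error about withheld keys instead of the message. With the option off, every device gets the key.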
To be able to send you the Megolm key securely, clients need to be able to start an Olm session with you. To be able to do that your device needs to have a One-Time-Key, that the sender can use. OTKs can be used one time. So the receiving device needs to upload new ones after they got used. If they don't do that, no new Olm session can be created and as such no Megolm keys can be received anymore.
This is fairly hard to diagnose and usually shouldn't happen if both sides are online. Olm sessions can usually be reused indefinitely, so once a device has sent you a single Megolm key, you can be fairly certain that it can do so in the future. As such, this usually only affects a single device of a single person. It should go away automatically once the receiver signs on. Also, Matrix 1.2 introduced fallback keys, which are used once all other keys are used up and can be used multiple times. But not all clients support them.
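The one-time key behaviour described above can be sketched like this. It is a toy model of what the server hands out when someone claims a key for your device, not real server code; the `KeyStore` class and its method names are invented for this example.

```python
from typing import Iterable, Optional


class KeyStore:
    """Toy model of the keys a device has uploaded to its server."""

    def __init__(self, one_time_keys: Iterable[str], fallback_key: Optional[str] = None):
        self.one_time_keys = list(one_time_keys)
        self.fallback_key = fallback_key

    def claim(self) -> Optional[str]:
        """Hand out a key so a sender can start a new Olm session."""
        if self.one_time_keys:
            # A one-time key is removed as soon as it is handed out.
            return self.one_time_keys.pop(0)
        # Nothing left: the fallback key (if any) can be handed out repeatedly.
        return self.fallback_key
```

A store without a fallback key returns `None` once the keys run out, which is the point where no new Olm session can be created. With a fallback key, `claim()` keeps returning that same key until the device uploads fresh one-time keys.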
To recover from this you can send a key request to the sender, or ask the sender to start a new Megolm session using /discardsession after signing into the device that can't read the encrypted messages.
If you see a message like "corrupted secure channel", the Olm session became corrupted. This can happen for a few reasons, but it is always a server or client bug. For example, if a client sends multiple Olm messages at the same ratchet position, the Olm session will be out of sync and messages will become undecryptable. Or if the server reorders the to_device messages sent between 2 devices, the ratchet will also be confused and refuse to cooperate. Please report those issues to your client and/or server developer. These issues can be very hard to debug, so the more people report them the better!
Device list desync
To be able to send you the Megolm keys, the other party opens an Olm session with all your devices. An obvious reason why this would fail is when the other side can't see what devices you have or has an outdated list. Then some or all of your devices can't receive the Megolm keys. The other side doesn't know they should receive them, after all!
It is very easy to verify that this is happening. If the other party opens your profile and can't see those devices, then... it can't see your devices. An easy way to fix that in the past was changing the display name of your device. That forced a resync and would usually resolve the issue. But it didn't always work, and nowadays device names are hidden over federation. Another option is to (temporarily) sign in a new device. That should also update the device list.
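If you and the other party compare notes, the desync boils down to a set difference. A minimal sketch, with a made-up function name and made-up device IDs:

```python
from typing import Iterable, List


def devices_missing_keys(actual_devices: Iterable[str],
                         senders_cached_devices: Iterable[str]) -> List[str]:
    """Devices you really have that the sender does not know about.

    These are the devices that will never be sent the Megolm key.
    """
    return sorted(set(actual_devices) - set(senders_cached_devices))
```

For example, `devices_missing_keys(["LAPTOP", "PHONE"], ["LAPTOP"])` gives `["PHONE"]`: the phone is the one stuck with undecryptable messages.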
Device list federation is a fickle thing. It is a known problem and maybe at some point it will be changed to be more reliable or the concept of a device will be changed to something else. But that will probably take years. So for now we can only hope for any bugs in that code to get fixed.
Federation latency/relativity theory/state res
In modern physics, things "happening at the same time" is a somewhat complicated concept. If A and B happen at the same time in different locations, it takes some time for the information to travel between those 2 locations. As such, each side will see the distant event as happening after the closer one. (Everything is limited by the speed of light!)
The same thing applies to Matrix. It should eventually be consistent and all events should arrive, but depending on how long the events take to travel, they might be in a different order. This can cause some small issues for encryption, where a user might have thought you were not in the room, when they sent a message. This should usually resolve with the next message, unless there is a lot of lag going on.
Device identity key reuse
This one is fun, because it is indistinguishable from an attack in many cases. If someone deletes their device, clients remove that identity key and mark it as not being allowed to be reused as it could have been compromised. But sometimes clients mess up and mark keys as reused by accident, because they failed to fetch the new device list or fetched an old one by accident and think it is empty now. Or the server messed up and forgot to include a device in the device list for a bit.
Whatever the reason, that device is now marked as malicious and clients will hide it from you and refuse to communicate with it. The only way to recover from this is by signing out either the "marked" device or the "marking" device. Usually you should not have to do this. But bugs happen and when they do, this could be one of them.
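Here is a toy model of how a single bad device list can get a perfectly fine device marked. It is not how any particular client implements this; the `DeviceTracker` class is invented purely to illustrate the failure mode described above.

```python
from typing import Dict, List, Set


class DeviceTracker:
    """Toy model of a client tracking another user's devices."""

    def __init__(self) -> None:
        self.active: Dict[str, str] = {}  # device ID -> identity key
        self.retired: Set[str] = set()    # identity keys of deleted devices

    def update(self, device_list: Dict[str, str]) -> List[str]:
        """Apply a freshly fetched device list and return the devices flagged for key reuse."""
        for device_id, key in self.active.items():
            if device_id not in device_list:
                # The device vanished, so its key must never be trusted again.
                self.retired.add(key)
        flagged = [d for d, k in device_list.items() if k in self.retired]
        self.active = {d: k for d, k in device_list.items() if k not in self.retired}
        return flagged
```

Feed it `{"PHONE": "key1"}`, then an accidentally empty list `{}`, then `{"PHONE": "key1"}` again, and the last call returns `["PHONE"]`: the device never went away, but it now looks like someone is reusing a retired key.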
This is a small list of how things can go wrong. There are many more things that could possibly happen because of client-side or server-side bugs. For example, if 2 servers can't talk to each other, you won't be receiving any keys. Or if clients just delete their database by accident, stuff will also obviously break. Maybe you can find some more things to add to this list, but maybe it already gave you enough information to troubleshoot such issues in the future. Feel free to bother us in the #nheko:nheko.im room, in the #e2e:matrix.org room or me personally on Mastodon and ask questions if you are still having trouble. (I might set up a dedicated room for my blog some day.)
I tried to put some things into a small flowchart for reference below:
Have a nice day!