Messaging apps like Signal or WhatsApp offer a contact discovery service that allow you to discover which of the contacts in your address book are also a member. Depending on the implementation, this creates certain privacy risks. Even though Signal uses a protocol involving a trusted execution environment to mitigate the risk, an inherent risk of (one-sided) contact discovery remains: if someone (a patient, an ex) has your phone number in their address book, they will be notified when you join Signal. Not necessarily what you expect when joining a messaging service that focuses on privacy. This motivated me to study mutual contact discovery, where users of a messaging app can only discover each other when both have each other’s phone number in their address book.
Mutual contact discovery allows a potentially very large, dynamic, set of users from an existing social graph, that is represented by locally maintained contact lists, to discover each other when joining a new service. Alice and Bob are mutual contacts if Alice is a contact of Bob and Bob is a contact of Alice. In the case of messaging, the underlying social graph is based on the existing mobile phone network, where the identifiers are mobile phone numbers and the social graph is given by the contact lists that users maintain on their mobile phones.
Members can communicate directly, securely, and anonymously with a separate matching server, and can communicate with each other only through this server. (Sender anonymity is a crucial assumption in many privacy friendly protocols, actually). A mutual contact discovery protocol should satisfy the following requirements.
We make no assumptions about the behaviour of adversarial users (or the matching server). We do assume that the matching server runs independently of the messaging server, in particular we assume that the matching server has no information about membership.
In this blog post we discuss the ‘one-shot’ case that allows a set of members to discover each other exactly once. This is not a real limitation, as such a protocol can be executed in successive rounds (e.g. every day) to implement real, dynamic, mutual contact discovery in practice. See the full paper for details.
The basic protocol uses a key derivation function KDF. This is a cryptographic hash function that is not only hard to invert but also takes a `significant’ time to compute. This means that in practice the function cannot be inverted over relatively small, low entropy, domains (because computing a dictionary is prohibitively expensive).
Let A, B etc. denote users as well as their associated identifiers (i.e. phone numbers) stored in the contact lists. Let us write X(A) for A’s contacts. Define vAB = KDF(A∥B) (where ∥ denotes concatenation). Then the protocol does the following.
For security and privacy, this protocol relies on the key derivation function, applied to the concatenation of two identifiers. This makes it more secure than using hashing or key derivation functions for normal, one-sided, contact discovery for the following reason.
Let us consider the Netherlands as an example. Let’s say there are 224 ≈ 16, 7 million possible phone numbers. Then (because of concatenation) the input domain to the KDF contains at least 248 possible elements. Let us assume that the average customer hardware is a million (220) times slower than the fastest hardware available to the attackers. And let us assume that a very connected user has a contact list of at most several hundred entries (say 28), which each need to be processed by the KDF. Then the effort for an adversary to create a complete dictionary (which would actually be huge to begin with: 248 bits, which equals roughly 32 TB) is 248/(220×28) = 220 times higher than the effort for a user to participate in the protocol. Using a KDF thus (in practice) limits parties without any prior knowledge to try all possible inputs and discover relevant ones. Parties with (specific enough) prior knowledge are less constrained and may thus break the privacy properties of the scheme.
How well does this simple protocol fulfil the requirements?
To reason about privacy, it is crucial to assume that the sender of a submission or a query is anonymous.
Contact privacy protects against testing a suspicion of being a contact. This is a rather strong notion: without any suspicion, an adversary has only limited resources to find some contacts and the server will most certainly not be able to reconstruct the full social graph. In that sense, even this simple mutual contact discovery protocol protects privacy better than one-sided contact discovery. But this comes at a cost: members can now test membership suspicions (which is typically not possible in one-sided contact discovery because information about A’s contacts has no bearing on the information exchanged between any other member and the server).
Note that many of the weaknesses in this protocol are caused by the fact that any member can compute and submit/query vAB for arbitrary A and B. Can this be prevented?
Indeed, it can. The crucial observation is that in order to prevent the adversary to verify a guess, the values used to discover a connection need to involve a secret that the adversary cannot know. The idea here is to give users A a certificate C(A) for their identity that is needed to construct a token TAB used to discover another user B. In essence we use the same techniques as Boneh and Franklin’s system for Identity Based Encryption. The details are as follows.
Then TAB = TBA using the bilinear properties of the pairing function e.
This is the case because TAB = e(C(A),QB) = e(sQA,QB) = e(saP,bP) = e(P,P)sab = e(sbP,aP) = e(C(B),QA) for some (unknown) a, b such that QA = aP and QB = bP.
The intuition is that in order to compute TAB you need to know either A’s or B’s certificate. It is assumed that honest parties keep their certificates secret.
The protocol runs as follows (with parameters as described above). It uses an additional hash function H2 : G2 × G1 ↦ {0, 1}n for some large n.
The protocol is correct, discovering mutual contacts for honest users, because of the following. If A and B are mutual contacts, then
We know TAB = TBA so in the query phase, when A sends ⟨H2(TAB,P), H2(TAB,A)⟩ to the matching server again, it will receive ⟨H2(TBA,P), H2(TBA,B)⟩ in response.
Security and privacy rely on the hardness of the bilinear Diffie-Hellman (BDH) problem (as does the Boneh-Franklin IBE scheme). We will not go into any details in this blog post, but simple mention this assumption allows us to prove that only A and B can in fact compute TAB. Embedding this token with a second hash function H2() subsequently ensures that no other party even sees (or can reconstruct) this token, and therefore prevents it from creating valid tuples used in the submission or query phase.
Security then follows because the server needs to respond with H2(TAB,B) in the query phase, so the server (or any other member) cannot trick A into believing B is a member. Similarly, contact and membership privacy follow because no party can construct a valid token to test a suspicion of membership or contact.
Note that we assume that members anonymously submit to or query the matching server. We also assume the server learns the total number of contacts (this is the size of the set S) and the total number of mutual contacts (this equals the number of tuples in S that are equal on their first component), and that this information is in fact public knowledge. It is for this reason that the server does not need to pad its response to a query, as an empty response reveals a non-mutual contact (and a non-empty response reveals a mutual contact). Although in both cases the identities of the (mutual) contacts is of course not revealed.
(The actual proof is much more involved than this.)
The full paper, with full protocol descriptions and security and privacy analysis, can be found here.