I have a problem setting where I want to contact a user by phone, and where I need to protect the integrity of the phone call as much as possible. I'm wondering about how to design the interaction in a way that best achieves this.
I am worried about a very specific security risk: the person I'm contacting might be using a smartphone and might have unknowingly installed a malicious third-party app, and this malicious app might try to tamper with the audio of the call, for instance by injecting false audio into it. I don't need to protect the confidentiality of the conversation -- the call won't contain anything secret -- but I care a lot about the integrity of the phone call.
My question: How can I minimize this risk as much as possible?
More elaboration on the problem. I have quite a few degrees of freedom available to me:
I can arrange for the phone call to be placed in either direction (either I can call the user; or I can arrange for the user to call me).
I control the contents of the phone call, so I can incorporate various "CAPTCHA-like" mechanisms to try to test that I'm speaking with a human instead of with malware, or I can have the user repeat back what I said to him and vice versa as a form of confirmation, if that helps.
If the audio channel in one direction can be protected more effectively than in the other direction, I can design my interaction with the user around that. For instance, if it is possible to protect the integrity of the audio channel in the direction from the user to me (what the user is saying to me), but not the reverse direction, I can live with that. I can live with the other way, too -- as long as I know which direction can be protected.
My primary focus is on defense against a malicious app that cannot break out of its application sandbox (e.g., it can't get root). Let's assume the user's phone is not rooted, not jailbroken, etc. Also, let's assume that the malicious app is a third-party app that is restricted by the application sandbox, e.g., it is limited to using whatever APIs are available to third-party apps. (If it is possible to also defend against apps that break out of the sandbox using some privilege escalation exploit, that would be a nice bonus, but I'm guessing there's no good defense against that threat, hence my focus on apps that stay within the sandbox.) For my application, it would be enough to detect tampering (though of course if it can be wholly prevented, that's even better).
So, what is the best defense against this threat?
Constraints. The user is an average member of the public. They'll have whatever phone they have; I can't force them to use a different phone or give them a different phone. I suspect it won't be practical to require some special app for encrypting their phone calls (and I'm not sure if this would help, anyway...). I'm going to need to be able to contact many such users, so any solution must scale. I would like the solution to be as usable for the user as possible.
The research I've done. I've looked at what a malicious app might be able to do on Android and on iOS, using documented APIs.
On iOS, I haven't been able to find any way for a malicious third-party app to tamper with the contents of phone calls. I realize that's no guarantee it is impossible; I just haven't found a way to do it.
On Android, there appears to be no way to protect the integrity of an outgoing call placed by the user, in this threat model. A malicious app can observe when the user places an outgoing call, cancel the call, take control of the microphone and speaker, display a fake dialer screen, and make the user think that they are speaking to me when they are actually speaking to the malware. The malicious app will need at least the PROCESS_OUTGOING_CALLS and RECORD_AUDIO permissions (and possibly MODIFY_AUDIO_SETTINGS), but this could plausibly happen if one of the user's installed apps is malicious.
However, in contrast, on Android it looks like calls placed to the user might be safe -- or, at least, might be made safe if the conversation with the user is structured appropriately. Apps can detect that there's an incoming call. On old versions of Android, it was possible for an app to block the incoming call and prevent the phone from ringing or showing any sign of the incoming call. However, more recent versions of Android have removed that capability: there appears to be no programmatic way for an app to block the incoming call without the user realizing it. Moreover, if the user accepts the incoming call, then there doesn't seem to be any way for a third-party app to get access to the contents of the call or modify it. The situation is a little bit tricky, though. I think it is possible for an app to mute the audio channel in the direction from me to the user and play an audio clip (this might require the MODIFY_AUDIO_SETTINGS permission, but let's assume the malicious app has that, too), thus faking the audio in that direction: fooling the user into thinking that the audio clip came from me, when it actually came from the malicious app. However, I haven't found any way for the malicious app to eavesdrop on the contents of the call, so if we introduce enough randomness into the call, we might make it hard for the malicious app to guess exactly when to mount this attack; and if the malicious app guesses wrong, it might become apparent that something is wrong. So it seems at least plausible to me that we might be able to design some interaction script that makes it hard for a malicious app to fool me.
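To get a rough feel for how much this randomness buys, here is a back-of-the-envelope simulation (plain Python; all timing parameters are made-up illustrations, not measurements). The model: the attacker cannot hear the call, so it must blindly guess when the sensitive information is spoken; if its fake clip doesn't land on that moment, the user hears the caller cut out mid-conversation and the tampering is likely noticed.

```python
import random

def trial(chitchat_max=60.0, tolerance=0.5):
    """One simulated call. Returns True if the attacker overwrites the
    sensitive audio without the user noticing. Parameters are assumed:
    the sensitive info is spoken at a uniformly random time in the first
    `chitchat_max` seconds, and the attack goes unnoticed only if the
    fake clip starts within `tolerance` seconds of that moment."""
    info_time = random.uniform(0.0, chitchat_max)    # when I speak X
    attack_time = random.uniform(0.0, chitchat_max)  # attacker's blind guess
    return abs(attack_time - info_time) < tolerance

trials = 100_000
wins = sum(trial() for _ in range(trials))
print(f"undetected tampering rate: {wins / trials:.3%}")
```

With these (arbitrary) numbers the attacker succeeds on the order of 1-2% of calls; the rate scales roughly as tolerance/chitchat_max, which suggests the randomized window should be long relative to how precisely the clip must be timed.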
If the user is using a feature phone, they can't install third-party apps, so this concern goes away.
A candidate straw man solution. This research leads me to a candidate protocol/script for communicating information X to the user:
I call the user. (Don't call me, I'll call you.)
I make some idle chitchat with the user, for a random amount of time.
I speak the information X to the user.
I ask the user to confirm by repeating this information X back to me. The user obligingly says X to me.
I thank the user, say goodbye, and hang up.
I have some randomly selected music playing softly in the background throughout the call (i.e., the same song is playing throughout the entire phone call, and the user can hear it in the background throughout the call).
The purpose of the random-duration chitchat at the beginning of the call is to randomize the time when the information X is communicated, so that a malicious app can't "overwrite" it by muting the audio channel from me to the user and playing an audio clip (the malicious app won't know at what time to do this, because I've randomized the time at which X was communicated). The purpose of having the user confirm back to me is as an extra fallback defense in case the malicious app is trying to spoof me by muting me and playing an audio clip. The purpose of the music is so that the user stands a chance of noticing if part of the audio from me is replaced at any point in time.
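The straw-man script above can be written down as caller-side logic. Here is a minimal Python sketch; speak and listen are hypothetical stand-ins for the live audio interaction (they are not real telephony APIs), and the timing range is an assumption:

```python
import random

def run_call(x, speak, listen):
    """Caller-side script for the straw-man protocol. `x` is the
    information to convey; `speak`/`listen` are hypothetical hooks for
    the audio channel. Returns True if the call looks untampered."""
    # Background music would start here and run for the entire call
    # (not modeled in this sketch).
    # Idle chitchat for a random duration, so a malicious app cannot
    # predict when X will be spoken.
    chitchat_seconds = random.uniform(10, 60)
    speak(f"(chitchat for {chitchat_seconds:.0f} seconds)")
    # Communicate X.
    speak(f"The information is: {x}.")
    # Ask the user to repeat X back; the user-to-caller direction is
    # assumed harder for a sandboxed app to tamper with.
    speak("Please repeat that back to me.")
    echoed = listen()
    # A mismatch suggests the user heard something other than X.
    return echoed.strip() == str(x)

# Sanity check with stub channels: a faithful user echoes X back.
ok = run_call("4721", speak=lambda line: None, listen=lambda: "4721")
print(ok)  # True
```

The final comparison is the detection step: if malware overwrote X in the me-to-user direction, the user would (hopefully) echo back the wrong value, and the mismatch would flag the call.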
This is just a candidate protocol that occurs to me. I mention it only as a starting point and as a straw man for you to critique. Hopefully you can find a better solution. Maybe you'll spot a serious problem with this protocol. That's fine. I'm looking for the best scheme I can come up with, and I'm not committed to any particular protocol or style of interaction -- you tell me what protocol I should use.
The application setting. If you care, the application setting is remote voting, where I want to use the phone channel to confirm the user's votes and prevent malware from changing the user's vote without detection. However, a good solution might be useful in other settings as well, e.g., phone confirmation of online banking transactions or other high-security settings that use phone confirmation as one step in the transaction.
Related. See also "Can malicious phone software mount a MITM attack on a phone call?". However, that question focuses on the threat model of malicious code that breaks out of the app sandbox and gets root-level access to the phone; things look hopeless in that threat model. This question focuses on a slightly different threat model, where the user does have a malicious app installed but the malware isn't able to break out of the app sandbox.