I have a problem setting where I want to contact a user by phone, and where I need to protect the integrity of the phone call as much as possible. I'm wondering about how to design the interaction in a way that best achieves this.
I am worried about a very specific security risk: the person I'm contacting might be using a smartphone and might have unknowingly installed a malicious third-party app, and this malicious app might try to tamper with the audio of the call, for instance by injecting false audio into it. I don't need to protect the confidentiality of the conversation -- the call won't contain anything secret -- but I care a lot about the integrity of the phone call.
My question: How can I minimize this risk as much as possible?
More elaboration on the problem. I have quite a few degrees of freedom available to me:
I can arrange for the phone call to be placed in either direction (either I can call the user; or I can arrange for the user to call me).
I control the contents of the phone call, so I can incorporate various "CAPTCHA-like" mechanisms to try to test that I'm speaking with a human instead of with malware, or I can have the user repeat back what I said to him and vice versa as a form of confirmation, if that helps.
If the audio channel in one direction can be protected more effectively than in the other direction, I can design my interaction with the user around that. For instance, if it is possible to protect the integrity of the audio channel in the direction from the user to me (what the user is saying to me), but not the reverse direction, I can live with that. I can live with the other way, too -- as long as I know which direction can be protected.
My primary focus is on defense against a malicious app that cannot break out of its application sandbox (e.g., it can't get root). Let's assume the user's phone is not rooted, not jailbroken, etc. Also, let's assume that the malicious app is a third-party app that is restricted by the application sandbox, e.g., it is limited to using whatever APIs are available to third-party apps. (If it is possible to also defend against apps that break out of the sandbox using some privilege escalation exploit, that would be a nice bonus, but I'm guessing there's no good defense against that threat, hence my focus on apps that stay within the sandbox.) For my application, it would be enough to detect tampering (though of course if it can be wholly prevented, that's even better).
So, what is the best defense against this threat?
Constraints. The user is an average member of the public. They'll have whatever phone they have; I can't force them to use a different phone or give them a different phone. I suspect it won't be practical to require some special app for encrypting their phone calls (and I'm not sure if this would help, anyway...). I'm going to need to be able to contact many such users, so any solution must scale. I would like the solution to be as usable for the user as possible.
The research I've done. I've looked at what a malicious app might be able to do on Android and on iOS, using documented APIs.
On iOS, I haven't been able to find any way for a malicious third-party app to tamper with the contents of phone calls. I realize that's no guarantee it is impossible; I just haven't found a way to do it.
On Android, there appears to be no way to protect the integrity of an outgoing call placed by the user, in this threat model. A malicious app can observe when the user places an outgoing call, cancel the call, take control of the microphone and speaker, display a fake dialer screen, and make the user think that they are speaking to me when they are actually speaking to the malware. The malicious app will need at least the PROCESS_OUTGOING_CALLS and RECORD_AUDIO permissions (and possibly MODIFY_AUDIO_SETTINGS), but this could plausibly happen if one of the user's installed apps is malicious.
However, in contrast, on Android it looks like calls placed to the user might be safe -- or, at least, might be made safe if the conversation with the user is structured appropriately. Apps can detect that there's an incoming call. On old versions of Android, it was possible for an app to block the incoming call and prevent the phone from ringing or showing any sign of the incoming call. However, more recent versions of Android have removed that capability: there appears to be no programmatic way for an app to block the incoming call without the user realizing it. Moreover, if the user accepts the incoming call, then there doesn't seem to be any way for a third-party app to get access to the contents of the call or modify it. The situation is a little bit tricky, though. I think it is possible for an app to mute the audio channel in the direction from me to the user and play an audio clip (this might require the MODIFY_AUDIO_SETTINGS permission, but let's assume the malicious app has that, too), thus faking the audio in that direction: fooling the user into thinking that the audio clip came from me, when it actually came from the malicious app. However, I haven't found any way for the malicious app to eavesdrop on the contents of the call, so if we introduce enough randomness into the call, we might make it hard for the malicious app to guess exactly when to mount this attack; and if the malicious app guesses wrong, it might become apparent that something is wrong. So it seems at least plausible to me that we might be able to design some interaction script that makes it hard for a malicious app to fool me.
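To get a rough feel for how much this randomness buys, here is a back-of-the-envelope simulation (plain Python; all timing parameters are made-up illustrations, not measurements). The model: the attacker cannot hear the call, so it must blindly guess when the sensitive information is spoken; if its fake clip doesn't land on that moment, the user hears the caller cut out mid-conversation and the tampering is likely noticed.

```python
import random

def trial(chitchat_max=60.0, tolerance=0.5):
    """One simulated call. Returns True if the attacker overwrites the
    sensitive audio without the user noticing. Parameters are assumed:
    the sensitive info is spoken at a uniformly random time in the first
    `chitchat_max` seconds, and the attack goes unnoticed only if the
    fake clip starts within `tolerance` seconds of that moment."""
    info_time = random.uniform(0.0, chitchat_max)    # when I speak X
    attack_time = random.uniform(0.0, chitchat_max)  # attacker's blind guess
    return abs(attack_time - info_time) < tolerance

trials = 100_000
wins = sum(trial() for _ in range(trials))
print(f"undetected tampering rate: {wins / trials:.3%}")
```

With these (arbitrary) numbers the attacker succeeds on the order of 1-2% of calls; the rate scales roughly as tolerance/chitchat_max, which suggests the randomized window should be long relative to how precisely the clip must be timed.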
If the user is using a feature phone, they can't install third-party apps, so this concern goes away.
A candidate straw man solution. This research leads me to a candidate protocol/script for communicating information X to the user:
I call the user. (Don't call me, I'll call you.)
I make some idle chitchat with the user, for a random amount of time.
I speak the information X to the user.
I ask the user to confirm by repeating this information X back to me. The user obligingly says X to me.
I thank the user, say goodbye, and hang up.
I have some randomly selected music playing softly in the background throughout the call (i.e., the same song is playing throughout the entire phone call, and the user can hear it in the background throughout the call).
The purpose of the random-duration chitchat at the beginning of the call is to randomize the time when the information X is communicated, so that a malicious app can't "overwrite" it by muting the audio channel from me to the user and playing an audio clip (the malicious app won't know at what time to do this, because I've randomized the time at which X was communicated). The purpose of having the user confirm back to me is as an extra fallback defense in case the malicious app is trying to spoof me by muting me and playing an audio clip. The purpose of the music is so that the user stands a chance of noticing if part of the audio from me is replaced at any point in time.
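The straw-man script above can be written down as caller-side logic. Here is a minimal Python sketch; speak and listen are hypothetical stand-ins for the live audio interaction (they are not real telephony APIs), and the timing range is an assumption:

```python
import random

def run_call(x, speak, listen):
    """Caller-side script for the straw-man protocol. `x` is the
    information to convey; `speak`/`listen` are hypothetical hooks for
    the audio channel. Returns True if the call looks untampered."""
    # Background music would start here and run for the entire call
    # (not modeled in this sketch).
    # Idle chitchat for a random duration, so a malicious app cannot
    # predict when X will be spoken.
    chitchat_seconds = random.uniform(10, 60)
    speak(f"(chitchat for {chitchat_seconds:.0f} seconds)")
    # Communicate X.
    speak(f"The information is: {x}.")
    # Ask the user to repeat X back; the user-to-caller direction is
    # assumed harder for a sandboxed app to tamper with.
    speak("Please repeat that back to me.")
    echoed = listen()
    # A mismatch suggests the user heard something other than X.
    return echoed.strip() == str(x)

# Sanity check with stub channels: a faithful user echoes X back.
ok = run_call("4721", speak=lambda line: None, listen=lambda: "4721")
print(ok)  # True
```

The final comparison is the detection step: if malware overwrote X in the me-to-user direction, the user would (hopefully) echo back the wrong value, and the mismatch would flag the call.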
This is just a candidate protocol that occurs to me. I mention it only as a starting point and as a straw man for you to critique. Hopefully you can find a better solution. Maybe you'll spot a serious problem with this protocol. That's fine. I'm looking for the best scheme I can come up with, and I'm not committed to any particular protocol or style of interaction -- you tell me what protocol I should use.
The application setting. If you care, the application setting is remote voting, where I want to use the phone channel to confirm the user's votes and prevent malware from changing the user's vote without detection. However, a good solution might be useful in other settings as well, e.g., phone confirmation of online banking transactions or other high-security settings that use phone confirmation as one step in the transaction.
Related. See also "Can malicious phone software mount a MITM attack on a phone call?". However, that question focuses on the threat model of malicious code that breaks out of the app sandbox and gets root-level access to the phone; things look hopeless in that threat model. This question focuses on a slightly different threat model, where the user does have a malicious app installed but the malware isn't able to break out of the app sandbox.