OAuth access token/API key patterns for large web sites

Question

First off, let me preface this post by saying I'm not a security expert.

I'm trying to build regular expressions to find OAuth 2.0 access tokens and API Keys for common web sites such as Google, Twitter, Facebook, Slack etc. that may have been embedded in source code.

I couldn't find the token formats of the large sites documented in any one place, so I carried out my own research:

OAuth 2.0

| Site      | Regex                                         | Reference                                                                     | 
| --------- | --------------------------------------------- | ----------------------------------------------------------------------------- |
| Slack     | xox.-[0-9]{12}-[0-9]{12}-[0-9a-zA-Z]{24}      | https://api.slack.com/docs/oauth                                              |
| Google    | random opaque string up to 256 bytes          | https://developers.google.com/identity/protocols/OAuth2                       |
| Twilio    | JWT [1]                                       | https://www.twilio.com/docs/iam/access-tokens                                 |
| Instagram | [0-9a-fA-F]{7}\.[0-9a-fA-F]{32}               | https://www.instagram.com/developer/authentication/                           |
| Facebook  | [A-Za-z0-9]{125} (counting letters [2])       | https://developers.facebook.com/docs/facebook-login/access-tokens/            |
| Linkedin  | undocumented/random opaque string             | https://developer.linkedin.com/docs/v2/oauth2-client-credentials-flow#        |
| Heroku    | [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} | https://devcenter.heroku.com/articles/oauth                                   |
| Github    | [0-9a-fA-F]{40}                               | https://developer.github.com/apps/building-oauth-apps/authorizing-oauth-apps/ |

API Keys

| Site   | Regex                                                                       | Reference                                                     | 
| ------ | --------------------------------------------------------------------------- | ------------------------------------------------------------- |
| GCP    | [A-Za-z0-9_]{21}--[A-Za-z0-9_]{8}                                           | undocumented (obtained by generating token)                   |
| Heroku | [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} | https://devcenter.heroku.com/articles/platform-api-quickstart |
| Slack  | xox.-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-zA-Z0-9]{32}                          | https://api.slack.com/custom-integrations/legacy-tokens       |

Based on this research:

We can map credentials to web sites with some level of accuracy based on the position of hyphens, periods, etc.
A more robust way to detect credentials might be to skip the regex altogether and use Shannon entropy to find "unusually random" strings, as illustrated in this blog: http://blog.dkbza.org/2007/05/scanning-data-for-entropy-anomalies.html
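As a rough illustration of the entropy idea, the sketch below computes the Shannon entropy (in bits per character) of each long run of token-like characters and keeps the ones above a threshold. The candidate character set, the minimum length of 20, and the threshold of 3.0 are all assumptions picked for this example, not values taken from the linked blog or from any particular tool.

```python
import math
import re
from collections import Counter

# Runs of at least 20 "token-like" characters (assumed character set).
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")


def shannon_entropy(s):
    """Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def high_entropy_strings(text, threshold=3.0):
    """Return candidate strings whose entropy is at or above threshold."""
    return [t for t in CANDIDATE.findall(text)
            if shannon_entropy(t) >= threshold]


if __name__ == "__main__":
    sample = ('$key = "e72e16c7e42f292c6912e7710c838347ae178b4a"\n'
              '$pad = "aaaaaaaaaaaaaaaaaaaaaaaa"')
    print(high_entropy_strings(sample))
```

A hex string can carry at most 4 bits per character and a base64 string at most 6, so a single threshold is a compromise; a real scanner would likely need a separate threshold per alphabet.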

I also stumbled on truffleHog, which has its own completely different regexes that really confused me at first:

    ...
    "Facebook Oauth": "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].{0,30}['\"\\s][0-9a-f]{32}['\"\\s]",
    "Twitter Oauth": "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].{0,30}['\"\\s][0-9a-zA-Z]{35,44}['\"\\s]",
    "GitHub": "[g|G][i|I][t|T][h|H][u|U][b|B].{0,30}['\"\\s][0-9a-zA-Z]{35,40}['\"\\s]",
    "Google Oauth": "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")",
    "Heroku API Key": "[h|H][e|E][r|R][o|O][k|K][u|U].{0,30}[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
    ...

Applying these rules to find a GitHub token would not match the token from GitHub's documentation on its own (e72e16c7e42f292c6912e7710c838347ae178b4a), but it would match the assignment of this value to a suspiciously named variable, e.g.:

    $github_key = "e72e16c7e42f292c6912e7710c838347ae178b4a"

    # ... but would _not_ match
    $key = "e72e16c7e42f292c6912e7710c838347ae178b4a"
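The behaviour described above can be checked with a short script. This is only a sketch: it takes the "GitHub" rule exactly as quoted from truffleHog, removes the JSON string escaping, and runs it with Python's `re` module; the `matches` helper and the variable names are made up for this example.

```python
import re

# The "GitHub" rule quoted above, with the JSON string escaping removed.
GITHUB_RULE = re.compile(
    r"""[g|G][i|I][t|T][h|H][u|U][b|B].{0,30}['"\s][0-9a-zA-Z]{35,40}['"\s]"""
)

TOKEN = "e72e16c7e42f292c6912e7710c838347ae178b4a"


def matches(line):
    """True if the rule finds a hit anywhere in the line."""
    return GITHUB_RULE.search(line) is not None


if __name__ == "__main__":
    print(matches('$github_key = "' + TOKEN + '"'))  # keyword + quoted token
    print(matches('$key = "' + TOKEN + '"'))         # no "github" keyword
    print(matches(TOKEN))                            # bare token
```

So the rule is keyed on the word "github" appearing within 30 characters of a quoted or whitespace-delimited value, rather than on the shape of the token alone. Note also that a class such as `[g|G]` matches a literal `|` as well as the two letters, which is probably not intended.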

My question is:

Does anyone have a more comprehensive list of credential formats?
Does anyone have additional/better detection patterns?

[1] I implemented JWT detection as its own rule, which matches a three-section block of text delimited by "." where the first two sections start with a base64-encoded "{":

e(y|w)[^.]+\\.e(y|w)[^.]+\\.[^.]+
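A quick way to sanity-check this rule is to build a throwaway JWT-shaped string and run the pattern over it. The sketch below assumes the doubled backslashes come from string escaping, so the regex uses a single `\.`; the `b64url` and `make_sample_jwt` helpers are hypothetical and the "signature" is just placeholder bytes, not a real signature.

```python
import base64
import json
import re

# The rule above, with the string-escaped "\\." reduced to "\.".
JWT_RULE = re.compile(r"e(y|w)[^.]+\.e(y|w)[^.]+\.[^.]+")


def b64url(data):
    """Base64url-encode bytes without padding, as JWT sections are."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_sample_jwt():
    """Build a JWT-shaped string; the third section is placeholder bytes."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"sub": "example"}).encode())
    signature = b64url(b"not-a-real-signature")
    return header + "." + payload + "." + signature


if __name__ == "__main__":
    sample = make_sample_jwt()
    print(sample)
    print(JWT_RULE.fullmatch(sample) is not None)
    print(JWT_RULE.search("www.example.com") is not None)
```

The `ey` prefix arises because a section starting with `{"` encodes to `ey...`, while `ew` covers a `{` followed by whitespace such as a newline.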

[2] https://www.youtube.com/watch?v=_hF099c0A9M (skip to 1:35)

OAuth 2 tokens are not standardized in any way. Every provider is free to choose their own format, and as such trying to create a regular expression to find any OAuth 2 token is meaningless. — , Aug 22 '19 at 08:33
@MechMK1 thanks for looking at this. Based on what I saw in the trufflehog source code this looked to be the way to go but there's nothing to stop a service completely changing its key format with zero notice. I guess the answer is if you find something in source code that looks like a key, that needs to be fixed - doesn't matter where that key gets used. — Geoff Williams, Aug 22 '19 at 10:59
Yes and no. It would invalidate all currently existing keys, requiring developers to update their applications or they'd not function anymore. Furthermore, these tokens are designed to be a black-box, meaning they are just a "thing" given to you, and a "thing" you give back. And in order to find them, it's easier to just search for something like "token", "secret", "key", etc... — , Aug 22 '19 at 11:11
