First off, let me preface this post by saying I'm not a security expert.
I'm trying to build regular expressions to find OAuth 2.0 access tokens and API Keys for common web sites such as Google, Twitter, Facebook, Slack etc. that may have been embedded in source code.
I couldn't find the token formats of the large sites documented in any one place, so I carried out my own research:
OAuth 2.0
| Site | Regex | Reference |
| --------- | --------------------------------------------- | ----------------------------------------------------------------------------- |
| Slack | xox.-[0-9]{12}-[0-9]{12}-[0-9a-zA-Z]{24} | https://api.slack.com/docs/oauth |
| Google | random opaque string up to 256 bytes | https://developers.google.com/identity/protocols/OAuth2 |
| Twilio | JWT [1] | https://www.twilio.com/docs/iam/access-tokens |
| Instagram | [0-9a-fA-F]{7}\.[0-9a-fA-F]{32} | https://www.instagram.com/developer/authentication/ |
| Facebook | [A-Za-z0-9]{125} (counting letters [2]) | https://developers.facebook.com/docs/facebook-login/access-tokens/ |
| LinkedIn | undocumented/random opaque string | https://developer.linkedin.com/docs/v2/oauth2-client-credentials-flow# |
| Heroku | [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} | https://devcenter.heroku.com/articles/oauth |
| GitHub | [0-9a-fA-F]{40} | https://developer.github.com/apps/building-oauth-apps/authorizing-oauth-apps/ |
API Keys
| Site | Regex | Reference |
| ------ | --------------------------------------------------------------------------- | ------------------------------------------------------------- |
| GCP | [A-Za-z0-9_]{21}--[A-Za-z0-9_]{8} | undocumented (obtained by generating token) |
| Heroku | [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} | https://devcenter.heroku.com/articles/platform-api-quickstart |
| Slack | xox.-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-zA-Z0-9]{32} | https://api.slack.com/custom-integrations/legacy-tokens |
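To make the tables concrete, here is a minimal sketch of how a few of these patterns could be applied to source text. The `PATTERNS` dict, the `scan` helper and the added `\b` word boundaries are my own assumptions for illustration, and the Heroku value in the usage note is a made-up placeholder rather than a real key.

```python
import re

# A subset of the patterns from the tables above (names are illustrative).
PATTERNS = {
    "Slack OAuth": r"xox.-[0-9]{12}-[0-9]{12}-[0-9a-zA-Z]{24}",
    "Instagram OAuth": r"[0-9a-fA-F]{7}\.[0-9a-fA-F]{32}",
    "GitHub OAuth": r"[0-9a-fA-F]{40}",
    "Heroku API key": (
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
}

# Assumption: word boundaries are added so that a pattern does not fire
# inside a longer alphanumeric run (e.g. 40 hex chars inside a SHA-256).
COMPILED = {name: re.compile(r"\b" + p + r"\b") for name, p in PATTERNS.items()}


def scan(text):
    """Return (line_number, rule_name, matched_text) for every hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, rx in COMPILED.items():
            for m in rx.finditer(line):
                hits.append((lineno, name, m.group(0)))
    return hits
```

For example, `scan('token = "e72e16c7e42f292c6912e7710c838347ae178b4a"\nHEROKU=01234567-89ab-cdef-0123-456789abcdef')` reports a GitHub hit on line 1 and a Heroku hit on line 2. Note that the 40-hex-character GitHub pattern also matches any SHA-1 hash, such as a git commit ID, so format alone gives false positives.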
Based on this research:
- We can map credentials to websites with some level of accuracy based on the position of hyphens, periods, etc.
- A more robust way to detect credentials might be to skip the regex altogether and use Shannon entropy to find "unusually random" strings, as illustrated in this blog: http://blog.dkbza.org/2007/05/scanning-data-for-entropy-anomalies.html
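The entropy idea from the second point could look roughly like the sketch below. It is not the method from the linked blog; the candidate character class, `MIN_LENGTH` and `THRESHOLD` are hypothetical values I picked for illustration and would need tuning against real code.

```python
import math
import re
from collections import Counter

# Hypothetical tuning knobs, not taken from any tool or from the blog post.
MIN_LENGTH = 20   # ignore short strings
THRESHOLD = 3.0   # bits per character

# Candidate "words": long runs of characters that commonly appear in tokens.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_\-]{%d,}" % MIN_LENGTH)


def shannon_entropy(s):
    """Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def find_high_entropy_strings(text, threshold=THRESHOLD):
    """Return candidate strings whose entropy is at or above the threshold."""
    return [
        m.group(0)
        for m in CANDIDATE.finditer(text)
        if shannon_entropy(m.group(0)) >= threshold
    ]
```

A 40-character hex string can reach at most 4 bits per character, so a threshold for hex-looking tokens has to sit below that, while a run of one repeated character scores 0 and is ignored.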
I also stumbled on truffleHog, which has its own, completely different regexes. These really confused me at first:
...
"Facebook Oauth": "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].{0,30}['\"\\s][0-9a-f]{32}['\"\\s]",
"Twitter Oauth": "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].{0,30}['\"\\s][0-9a-zA-Z]{35,44}['\"\\s]",
"GitHub": "[g|G][i|I][t|T][h|H][u|U][b|B].{0,30}['\"\\s][0-9a-zA-Z]{35,40}['\"\\s]",
"Google Oauth": "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")",
"Heroku API Key": "[h|H][e|E][r|R][o|O][k|K][u|U].{0,30}[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
...
Applying these rules to find a GitHub token would not match the token in GitHub's documentation on its own, because the rule also requires the word "github" shortly before the quoted value: e72e16c7e42f292c6912e7710c838347ae178b4a
It would, however, match an assignment of this value to a suspiciously named variable, e.g.:
$github_key = "e72e16c7e42f292c6912e7710c838347ae178b4a"
# ... but would _not_ match
$key = "e72e16c7e42f292c6912e7710c838347ae178b4a"
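This behaviour can be checked with a short script. The pattern below is the "GitHub" rule quoted above with the JSON string escaping removed; the `samples` dict and its labels are just my own test harness.

```python
import re

# The truffleHog "GitHub" rule from the listing, un-escaped from JSON.
GITHUB_RULE = re.compile(
    r"""[g|G][i|I][t|T][h|H][u|U][b|B].{0,30}['"\s][0-9a-zA-Z]{35,40}['"\s]"""
)

# Example token from GitHub's documentation.
TOKEN = "e72e16c7e42f292c6912e7710c838347ae178b4a"

samples = {
    "bare token": TOKEN,
    "github variable": '$github_key = "%s"' % TOKEN,
    "generic variable": '$key = "%s"' % TOKEN,
}

# True where the rule fires, False where it stays silent.
results = {label: bool(GITHUB_RULE.search(line)) for label, line in samples.items()}

for label, matched in results.items():
    print(label, "->", "match" if matched else "no match")
```

Only the `$github_key` line matches. So the rule is keyed on the surrounding context (the word "github" followed by a quoted or whitespace-delimited string) rather than on the token format alone. As a side note, `[g|G]` is a character class, so it also accepts a literal `|`; a case-insensitive flag would probably express the intent more directly.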
My questions are:
- Does anyone have a more comprehensive list of credential formats?
- Does anyone have additional/better detection patterns?
[1] I implemented JWT detection as its own rule, which matches a three-section block of text delimited by "." where the first two sections start with "{" (base64 encoded):
e(y|w)[^.]+\\.e(y|w)[^.]+\\.[^.]+
[2] https://www.youtube.com/watch?v=_hF099c0A9M (skip to 1:35)
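For the JWT rule in footnote [1], here is a small sketch showing why the sections start with "ey" or "ew" and that the pattern matches a JWT-shaped string. The pattern is written with single backslashes, as it would appear outside a JSON file. The token is built locally from placeholder values and its signature is not real; `b64url` is a helper I added for illustration.

```python
import base64
import json
import re

# The JWT rule from footnote [1].
JWT_RULE = re.compile(r"e(y|w)[^.]+\.e(y|w)[^.]+\.[^.]+")


def b64url(data):
    """Base64url-encode bytes without padding, as used in JWT sections."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


# A "{" followed by a double quote encodes to "ey"; a "{" followed by a
# newline (pretty-printed JSON) encodes to "ew".
prefix_compact = base64.b64encode(b'{"')[:2].decode("ascii")
prefix_pretty = base64.b64encode(b"{\n")[:2].decode("ascii")

# Placeholder header and payload; the third section is not a valid signature.
header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode())
payload = b64url(json.dumps({"sub": "1234567890"}, separators=(",", ":")).encode())
fake_jwt = "%s.%s.not-a-real-signature" % (header, payload)

is_jwt_shaped = JWT_RULE.fullmatch(fake_jwt) is not None
```

Because `[^.]+` also accepts whitespace and newlines, the rule can over-match when run with `search` on arbitrary text; restricting the sections to the base64url alphabet might be a tighter variant.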