Patterns for searching a source repository for private data

Question

I recently discovered a case where a colleague had accidentally committed their login credentials (host, username, and password) to a local source code repository, and then pushed these changes to a public repository on GitHub. Of course, this was not an isolated incident – a few years back, GitHub killed its full-code search feature after people discovered hundreds of private keys and other credentials in public repositories.

I'd like to make sure that this sort of thing hasn't happened in the past with any of our other public-facing repositories (and, in case it has, to scrub the private data, change the exposed passwords, revoke the exposed keys, etc.). It's no problem for me to cobble together a shell script to pull past commits to a given Git or Subversion repository so that I can scan them for private data. But what sort of filename and text patterns should I use? For example, I want to catch files whose name suggests that they contain private keys or credentials (password.txt, id_dsa, id_rsa, secring.gpg, .netrc, and probably several more standard ones that I'm forgetting or am not even aware of). Is there a list somewhere covering the most common cases? Similarly, I'd like to scan the contents of text and source files for patterns that indicate hard-coded login credentials. Perhaps someone has already produced a list of regular expressions to start from?

[This Google search](https://www.google.nl/search?q=scanning+for+files+containing+passwords) will give you lots of suggestions for setting up patterns to search for *in* files. — , Jun 22 '16 at 09:13
I don't think there can be any definitive answer for such questions asking for lists. Nevertheless, Google Hacking Database's section about [password searches](https://www.exploit-db.com/google-hacking-database/9/) may be a good start. — WhiteWinterWolf, Jun 22 '16 at 12:25

score 2 · Answer 1 · answered Jun 22 '16 at 08:34

The files that matter vary by programming language and environment. For example, nginx ignores .htaccess files, so they won't affect the behaviour of the server there. However, those same files could really mess things up if someone loaded your application into an Apache environment. Therefore, you need to customise any list to your own needs.

There are some files which are probably always considered sensitive though:

Private keys (id_rsa, id_dsa, *.pfx)
Shadow files (/etc/shadow) - if you're checking these into source control without a very good reason, you're doing something wrong!
History files (.bash_history and similar) - these often contain passwords that were mistyped at a prompt or passed on the command line to tools
Log files (/var/log/*) - again, these often contain sensitive details that you might not think to look for

More specific files that shouldn't be in source control:

.htaccess, .htpasswd - Apache directory-specific configuration files
web.config - IIS directory-specific config file
wp-config.php - WordPress config
sites/*/*settings*.php - Drupal config files
*.jks - Keystore files
and so on...

GitHub has a good collection of sample .gitignore files, although these also cover things that are kept out of source control for other reasons (e.g. compiled output, which isn't source).
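As a rough illustration of how the filename lists above could be turned into a check, here is a minimal Python sketch. The pattern lists and the `is_suspicious` helper are hypothetical names made up for this example, the patterns are only the ones mentioned in this answer, and matching is done case-insensitively as an assumption; you would extend it for your own stack.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Illustrative basename patterns taken from the lists in this answer.
BASENAME_PATTERNS = [
    "id_rsa", "id_dsa", "*.pfx", "*.jks",
    "shadow", ".bash_history",
    ".htaccess", ".htpasswd", "web.config", "wp-config.php",
]

# Patterns matched against the whole repository-relative path.
PATH_PATTERNS = ["sites/*/*settings*.php"]


def is_suspicious(path: str) -> bool:
    """Return True if a repository-relative path matches any pattern."""
    lowered = path.lower()
    name = PurePosixPath(lowered).name
    if any(fnmatchcase(name, pat) for pat in BASENAME_PATTERNS):
        return True
    # Accept the path pattern at the repository root or below any directory.
    return any(
        fnmatchcase(lowered, pat) or fnmatchcase(lowered, "*/" + pat)
        for pat in PATH_PATTERNS
    )
```

You could feed this the file names from each past commit (however your own script lists them) and review the hits by hand, since a name match only suggests that a file deserves a look.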

Also take a look at [Github help - Remove sensitive data](https://help.github.com/articles/remove-sensitive-data/) — , Jun 22 '16 at 09:15

score 1 · Answer 2 · answered Jun 22 '16 at 12:35

There is an application called "OpenDLP" (data loss prevention) that can be used to scour your network for sensitive data. It is based on regular expressions, so you can configure it to search for whatever you'd like: passwords, keywords in intellectual property, social security numbers, credit cards. This would definitely help minimize the occurrences of data leakage.
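To give an idea of what regex-based content scanning looks like, here is a minimal Python sketch. It is not OpenDLP's configuration or rule set; the three patterns and the `scan_text` helper are hypothetical starting points chosen for this example, and they will produce both false positives and misses.

```python
import re

# Hypothetical starter patterns; tune them to your own code base.
CONTENT_PATTERNS = {
    # PEM-style headers such as "-----BEGIN RSA PRIVATE KEY-----".
    "pem_private_key": re.compile(
        r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY(?: BLOCK)?-----"
    ),
    # AWS access key IDs: "AKIA" followed by 16 upper-case letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted literal assigned to something named password/passwd/pwd.
    "password_assignment": re.compile(
        r"(?i)(?:password|passwd|pwd)[\"']?\s*[:=]\s*[\"'][^\"'\s]{4,}[\"']"
    ),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in CONTENT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

Running `scan_text` over the contents of each file revision gives a list of line numbers to review manually; the pattern names make it easy to triage which kind of secret may have leaked.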

Whenever I perform pentesting, I enjoy finding data in repositories. Human error is always the number one cause of a breach. During a pentest, I run OpenDLP to help me scour shares, file servers, you name it, for anything that may be credentials or passwords. It is not only public-facing systems that need to be addressed but also internal ones, where an admin may leave a configuration file containing credentials with little protection on it. An attacker who got in via a client-side attack could then skip other attacks against credentials altogether (why crack passwords if they're giving them to me?).

Other than that, you really can't solve a social problem (forgetful employees) with technology. Training and awareness can only go so far. Enforcing that training and testing it is what matters most. Take the time to get your employees to understand: "Before you submit/upload/change/deploy your work, take a moment to ensure you have not divulged sensitive information. Anyone not following the procedure is subject to a warning, followed by suspension, followed by termination."
