GitHub Repositories Leak Thousands of Secrets, Study Shows

In case you were ever in doubt about how well users are protecting credentials in GitHub repositories and other code repositories, researchers at North Carolina State University recently discovered many thousands of leaked secrets and credentials.

I spoke with CyberArk Labs’ Security Research Team Lead Lavi Lazarovitz (@__Curi05ity__) to get a better understanding of how serious this situation is, who it will affect and whether we’re looking at the beginning of a larger trend. But first, some background:

The North Carolina State University researchers Michael Meli, Matthew McNeice and Bradley Reaves scanned billions of GitHub files as part of an academic study that found that over 100,000 of the service’s code repositories contain exposed authentication secrets, such as cryptographic keys and API tokens, and thousands more repositories are leaking new, unique secrets every day. Researchers scanned nearly 13% of GitHub’s public repositories to collect this information.

The researchers used two approaches to identify leaked secrets. The first method was querying a GitHub repository search engine for a period of almost six months. According to the paper, this was a means of discovering in real time 99% of newly committed files containing secrets. The second method used BigQuery – a web service for analyzing massive datasets – to query a weekly snapshot of GitHub activity, which is what provided the researchers with their scan of 13% of GitHub’s public repositories.

In their scan, the researchers found 85,311 unique Google API keys, 37,781 unique RSA private keys and 47,814 unique Google OAuth IDs. The researchers also estimated that a portion of the secrets enabled access to sensitive systems or data, and that their exposure caused real risk.
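Secrets of this kind tend to have recognizable formats, which is what makes large-scale scanning practical. The following is a minimal Python sketch of pattern-based detection. The three regular expressions are commonly cited approximations of these key formats and are assumptions for illustration; they are not the expressions the researchers used, and the function name `find_secrets` is hypothetical.

```python
import re

# Illustrative patterns only: common approximations of each key format,
# not the study's actual expressions.
PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return a sorted list of (line_number, pattern_name) for each match."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return sorted(hits)
```

A scan like this only flags candidates. A match still has to be checked before anyone can say it is a live credential, so a real tool would need a validation step on top of it.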

How serious a problem is this?

Lazarovitz: It clearly highlights a huge and very real problem. While the researchers didn’t release the names of specific organizations with exposed secrets, they did share that they were dealing with some large, prominent organizations. This included AWS credentials for a site used by millions of college applicants in the U.S. and AWS secrets for a major government agency in a Western European country.

The other significant finding from the study was that 81% of the secrets discovered took two weeks or longer to be removed. This means that the developers using these sensitive secrets probably weren’t aware that their secrets were exposed, or drastically underestimated the risk this poses. Furthermore, even developers who were aware of the exposed keys and the need to keep them safe might delete the key but fail to wipe the entire GitHub repository. In this case, the keys could still be found in the GitHub repository’s commit history, which holds changes made to the GitHub repository – including deleted keys.
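To see why deleting a key from the current files is not enough, consider how commit history is stored: every change, including the line that removed the key, remains in the diff. The Python sketch below scans the output of `git log -p --all` for key-like strings on added or removed lines. The AWS-style pattern and the function names are assumptions for illustration, not a specific tool’s behavior.

```python
import re
import subprocess

# Assumed AWS-style access key ID format, used here only as an example.
KEY_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def secrets_in_patch(patch_text):
    """Return key-like strings found on added (+) or removed (-) diff lines."""
    found = set()
    for line in patch_text.splitlines():
        if line.startswith(("+", "-")):
            found.update(KEY_PATTERN.findall(line))
    return sorted(found)


def scan_repo_history(repo_path):
    """Scan the patches of every commit on every ref of a local clone."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True,
        text=True,
        check=True,
    )
    return secrets_in_patch(result.stdout)
```

A key that was committed once and later deleted shows up twice in that output, once on a `+` line and once on a `-` line. That is why an exposed key should be revoked and replaced rather than only deleted from the code.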

Why are secrets and credentials exposed in GitHub and leaked? What are the repercussions of GitHub repositories leaking secrets?

Lazarovitz: Developers write code and scripts which become part of the build or manage the build. The code and scripts typically need secrets and other credentials to do their work: interacting with other applications and with other tool chains in the CI/CD pipeline, accessing cloud resources and so on. The code and scripts are, of course, version controlled and stored in code repositories. DevOps processes require speed and agility, which effectively makes GitHub repositories and other code repositories an essential part of the CI/CD pipeline. But that’s not the problem; the problem is that there is very little security or oversight for how these credentials are managed or protected. It’s too easy for developers to hardcode credentials and, once the code is marked public, the organization’s cloud access keys are in the public domain. It’s an easy mistake for a developer to make when their primary focus is getting the next new feature out to customers.

What do you recommend GitHub users do to keep their keys secure? What behaviors and technologies are necessary to use GitHub responsibly?

Lazarovitz: There are some easy fixes, and many organizations do seem to be proactive. First and foremost, I’d strongly recommend never embedding credentials in code. Hardcoding secrets and credentials in code is a terrible business practice and the easiest thing to mitigate. Hardcoded credentials are one of the most common ways for credentials to leak to a repo. Instead, secure and manage credentials in a vault and use API calls or other mechanisms to securely use them. Ideally, secrets would never be exposed in an application. Other important controls to limit the privileged attack surface include changing or managing credentials on a regular basis or after use, or using short-lived or temporary tokens that are valid for a limited time. There are many valid approaches to take control.
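As a rough illustration of two of these controls, here is a minimal Python sketch. It assumes the credential is injected into the process environment at run time (a stand-in for a vault lookup, whose API is not shown here) and models a short-lived token as a value with an expiry time. The names `load_credential`, `TemporaryToken` and `issue_token` are hypothetical.

```python
import os
import secrets
import time
from dataclasses import dataclass


def load_credential(name):
    """Read a credential from the environment instead of from source code."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"credential {name!r} is not set in the environment")
    return value


@dataclass(frozen=True)
class TemporaryToken:
    value: str
    expires_at: float  # seconds since the epoch

    def is_valid(self, now=None):
        """A token is usable only before its expiry time."""
        now = time.time() if now is None else now
        return now < self.expires_at


def issue_token(ttl_seconds, now=None):
    """Create a random token that expires ttl_seconds after 'now'."""
    now = time.time() if now is None else now
    return TemporaryToken(secrets.token_urlsafe(32), now + ttl_seconds)
```

With this arrangement the repository holds only the name of the credential, never its value, and a token that does leak stops working once its time limit passes.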

Is this an example of a larger problem or trend? Is this kind of thing something we should expect to see more of?

Lazarovitz: As more and more organizations turn to external services, repositories and resources, whether as part of their digital transformation or because they are cloud native, we should, unfortunately, expect credential leakage to grow. Basically, if credentials are hardcoded, the cost of a misconfigured code repository, build pipeline or other tool in the tool chain is exposed IP and credentials. Every organization that migrates code or apps using a GitHub repository or other code repository, DevOps or similar processes therefore adds to the probability that credentials will leak.

What can attackers who get their hands on these keys do with them? What are they most likely to do with them?

Lazarovitz: Public repos are available to the whole world and don’t require great skill to scrape – hacking tools are readily available for attackers to put to misuse. And they do – attackers regularly troll GitHub, for example, for cloud access keys and other easily monetized credentials. They can also look at the history of code commits to find exposed credentials. The first attackers to find unprotected credentials are mostly opportunistic attackers that use the cloud access keys to take over compute resources to run crypto miners for immediate profit. In other cases, the attackers might sell the data or encrypt the data and ask for ransom.

What are your key takeaways? Should organizations avoid code repositories?

Lazarovitz: Of course not. GitHub repositories and other code repositories, whether private or public, are an essential part of the application development process. Instead, organizations need to establish processes and approaches to avoid hardcoding credentials in the first place, or even storing credentials on GitHub. Basically, as the research highlights, organizations and individuals are, for whatever reason, exposing potentially valuable credentials to the public and to attackers. Please don’t, as many organizations have found that it does not end well.