87

I depend on PHP CLI for all kinds of personal and (hopefully, soon) professional/mission-critical "business logic". (This could be any other language and the exact same problem would still stand; I'm just stating what I personally use for the sake of context.)

To the furthest possible extent, I always code everything on my own. Only when absolutely necessary do I, reluctantly, resort to a third-party library; some things, such as e-mail parsing and other very complicated tasks, simply require it.

For managing such third-party libraries, I use Composer, the de facto dependency manager for PHP. It downloads libraries and their dependencies, and updates them with commands similar to those of other "package managers". In practice, this is much nicer than keeping track of everything manually, downloading ZIP files, unpacking them, and dealing with all the problems that follow. It at least saves a lot of practical headaches.
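
For example, my whole setup boils down to a `composer.json` along these lines (the package name here is a placeholder, not one of the libraries I actually use):

```json
{
    "require": {
        "example/mail-parser": "^2.1"
    }
}
```

Running `composer install` then fetches the library and everything it depends on into the `vendor/` directory, and `composer update` later pulls in whatever newer versions the constraints allow.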

However, the most fundamental security problem still persists: I have no idea what this "installed" code contains, nor do I know what is added or changed with every update. Any one of the libraries' authors could be compromised one day, and the next time Composer fetches updates, my PHP CLI scripts could suddenly start sending my Bitcoin wallet.dat to some remote server, installing a RAT/trojan on my machine, or worse. In fact, it could already have happened, and I would be none the wiser. I simply have no idea. I logically cannot have any idea.

My own code base is about 15,000 lines in total. It takes me over a year to painstakingly go through that code base. And that's code that I have written and which I know intimately...

My "Composer" directory tree currently is at over 120,000 lines of code. And that's for the minimal number of crucial PHP libraries that I need. I use very few, but they have various dependencies and tend to overall be very bloated/inflated compared to my own code.

How am I ever supposed to "vet" all this?! It's simply not going to happen. I "zone out" very shortly after even attempting. I don't even know how I'm going to make it through another "vet round" of my own code -- let alone this 10x larger one, coded by other people.

When people say that it's a "must" to "vet third-party code", what exactly do they mean? I also agree that it's a "must", but then there's the pesky reality. I will simply never have the time and energy to do this. Also, I obviously don't have the money to pay somebody else to do it.

I spent countless hours trying to learn about Docker to see whether there were some way I could "encapsulate" these untrusted third-party libraries, but it's a losing battle. I found it utterly impossible to get that going, or to get any of my many questions about it answered. I don't even think it's possible in the way that I imagine it.

Peter Mortensen
  • 877
  • 5
  • 10
Paranoid Android
  • 711
  • 1
  • 4
  • 4
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/102194/discussion-on-question-by-paranoid-android-how-am-i-ever-going-to-be-able-to-ve). – Rory Alsop Dec 14 '19 at 12:30

6 Answers

140

You can't vet every individual line of code. You'll just die trying.

At some point, you have to trust someone else. In 1984, Ken Thompson, one of the co-creators of Unix, wrote a short article on exactly this problem, "Reflections on Trusting Trust". At some point, you do have to trust other people: you have to trust that whoever wrote your text editor isn't hiding some Trojan code that the PHP interpreter will faithfully execute, turning your scripts into Bitcoin-stealing malware.

You have to do a cost-benefit analysis to prioritize what you vet.

For the most part, you should do the best you can to vet the authors of the code, the project's internal security practices, and how the code reaches you. Actually reviewing the code is expensive and hard, so it should be reserved for the parts that you consider most important for your project.

Is the library a popular one, used by lots of people, with a respectable company or a well-known project lead behind it? Does the project have proper project-management processes in place? Does the library have a good track record on past security issues, and how were they handled? Does it have tests to cover the behaviors it needs to handle? Does it pass its own tests? If so, the risk of the library being compromised without anyone noticing is reduced.

Take a few sample files for deeper vetting. Do you see anything concerning there? If the files you sampled have major issues, you can probably infer that the rest of the codebase has similar problems; if they look good, that raises your confidence that the rest of the codebase is similarly well written. Note that in very large codebases, different areas of the code will have varying levels of quality.
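
A quick, unscientific way to pick such a sample (assuming GNU coreutils, and Composer's default `vendor/` directory):

```sh
# pull five PHP files at random out of the dependency tree for a spot check
find vendor/ -name '*.php' | shuf -n 5
```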

Does your package manager's repository check package signatures? Is there a pre-vetting process required to register a package in the repository, or is registration open to anyone? Do you receive the library as source code or as a precompiled binary? All of these affect how much you can trust the library, what the risk factors are, and how you can further improve trust.
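
For instance, when a project publishes signatures or checksums for its releases, verifying them costs almost nothing (the file names here are illustrative):

```sh
# check the maintainer's signature and the published checksum of a release
gpg --verify library-1.2.3.tar.gz.asc library-1.2.3.tar.gz
sha256sum --check library-1.2.3.tar.gz.sha256
```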

You also have to consider the application and the execution environment it will run in. Is this code for national security? Is it part of an e-commerce or banking system handling credit card numbers? Is it running as a superuser? Is it life- or safety-critical? Do you have compensating controls to isolate the code and run it with reduced privileges (e.g. containers, VMs, user permissions)? Or is this a weekend side project? How you answer those questions should let you define a budget for how much you can invest in vetting code, and therefore how to prioritize which libraries need vetting, at what level, and which ones are fine with lower trust.
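
As one example of such a compensating control, something along these lines runs an untrusted PHP CLI script with no network and no write access to the host; the paths and image tag are illustrative:

```sh
# no network (nothing can phone home), read-only filesystem, unprivileged user
docker run --rm --network none --read-only --user nobody \
    -v "$PWD":/app:ro php:cli php /app/script.php
```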

schroeder
  • 123,438
  • 55
  • 284
  • 319
Lie Ryan
  • 31,089
  • 6
  • 68
  • 93
  • 1
    That article on trust isn't really relevant in the days where you can use a formally verified compiler to bootstrap a more useful toolchain without worrying about compiler-inserted backdoors. – forest Dec 09 '19 at 08:19
  • 57
    @forest: it's still just as relevant then as it is now. Who does the formal verification and what tools do they use for that verification? What if the formally verified compiler and the proof assistant used to verify that compiler are both also backdoored? It's turtles all the way down. – Lie Ryan Dec 09 '19 at 08:26
  • 2
    At a very minimum, you could do the verification manually to compile a very lightweight compiler that understands a subset of C. You could also manually vet the assembly of a very small compiler (or write it yourself). Assembly is simple enough that you could do it by hand without needing to trust an assembler. – forest Dec 09 '19 at 08:28
  • 27
    @forest: how do you know that the text editor/disassembler/hex dump tool you used aren't backdoored as well? – Lie Ryan Dec 09 '19 at 08:31
  • 8
    Store it on a medium simple enough to use an SPI reader on? At the point when you're worried that your logic analyzer hardware is backdoored, you're running into the sci-fi realm of [Coding Machines](https://www.teamten.com/lawrence/writings/coding-machines/). – forest Dec 09 '19 at 08:34
  • 26
    @forest unlikeliness of a backdoored SPI reader notwithstanding, your arguments don't change the point of the answer which is that *at some point you have to trust somebody else*. Your point is essentially that you *should* trust someone else. In any case, using an SPI reader to verify the assembly of a small compiler then use that to compile a full compiler then use that to compile the full toolchain, is unlikely to be considered a practical solution to most people. – Jon Bentley Dec 09 '19 at 15:27
  • 34
    @forest How do you know your graphics card isn't mining bitcoin on you? How do you know your network card isn't logging all of your packets and sending them somewhere else? The trust goes all the way down to the hardware level. Even if you build your own network card, what do you really know about the primitive components that those are built on? Are you going to rediscover 50 years of technological advancement? – Cruncher Dec 09 '19 at 21:28
  • 5
    @JonBentley Processor microcode. Physical transistors (the Intel HRNG is backdoored). – chrylis -cautiouslyoptimistic- Dec 09 '19 at 22:16
  • 1
    @Cruncher even if you built your own network card from silicon you mined yourself, how do you know some guy at your ISP isn't sniffing your packets? I'd go further than saying you have to trust somebody else. In some sense, you already are trusting someone else. If you use the internet at all, you're implicitly trusting however many billions of people also use the internet. Every single one of them could ping you after all. – Ryan_L Dec 10 '19 at 02:30
  • 1
    @Cruncher I can monitor the power usage of my graphics card (I actually do, but for other reasons). And my network card is not part of my TCB. Anyway, my overall point is that you don't strictly _need_ to trust a compiler, not that you can verify that no backdoors exist _anywhere_. – forest Dec 10 '19 at 10:16
  • 2
    @Ryan_L It's even worse; ethernet is designed for trusted networks. If you're on a local network with someone else, you're implicitly trusting everyone on the network (technically, token rings were closer to that - on ethernet, you're really trusting the switch; but switches are usually very "dumb", so you end up having to trust everyone). If the network is connected to another network through a router, you're trusting that router. This is especially fun when dealing with private pseudo-clouds, where a single malicious server can deny service to everyone, without even being easy to find. – Luaan Dec 10 '19 at 10:47
  • 2
    Of interest to this line of reasoning may be the validation of seL4 ARM assembly. The seL4 team has done formal proofs of their OS code *and* its compiled assembly code (if compiled by a "sane" ARM compiler). You can see how much work it took, and then you can look at their document describing [their assumptions](https://sel4.systems/Info/FAQ/proof.pml), and compare them to the concerns we have today for security. I like pointing to them because somebody ELSE did the work, and I can point and say "See, we don't want to have to do that!" – Cort Ammon Dec 10 '19 at 19:38
  • @LieRyan as a PhD student in formal methods, my answer to your “who does the verification ...” question would be, “almost no one.” There are at best a few thousand people in the world doing FM. – Max von Hippel Dec 12 '19 at 14:03
47

My "Composer" directory tree currently is at over 120,000 lines of code. And that's for the minimal number of crucial PHP libraries that I need.

Your mistake is in trying to vet third-party code as if it were your own. You cannot and should not try to do that.

You haven't mentioned any of the libraries by name, but I'm going to assume that a fair chunk of that code is there because you're using one of the larger frameworks, such as Laravel or Symfony. Frameworks like this, as with other major libraries, have their own security teams; issues are patched quickly, and installing updates is trivial (as long as you're on a supported release).

Rather than trying to vet all that code yourself, you need to let go and trust that the vendor has done - and continues to do - that vetting for you. This is, after all, one of the reasons you use third-party code.

Realistically, you should treat third-party PHP libraries exactly the same way you would treat third-party libraries in a compiled environment like .NET or Java. On those platforms, the libraries come as DLL files or similar, and you may never see the source code. You can't vet them and you wouldn't try. If your attitude toward a PHP library is any different, you need to ask yourself why. Just because you can read the code doesn't mean you gain anything from doing so.

Where this all falls down, of course, is if your third-party libraries include smaller ones that are unsupported or have no security policy. So this is the question you need to ask of every library you use: is it fully supported, and does it have a security policy you are comfortable with? For any that do not, you may want to consider finding an alternative. But that still doesn't mean you should try to vet them yourself, unless you actually intend to take over support for them.
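
As a cheap first check on "fully supported", Composer itself can tell you how far behind your direct dependencies are:

```sh
# list direct dependencies that have newer releases available
composer outdated --direct
```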

One thing I will add, however: if you want to do a security audit on your PHP code, I strongly recommend the RIPS scanner. It's not cheap, but if you have strong security requirements, it's easily the best automated security analysis tool you can get for PHP. Definitely run it on your own code; you'll likely be surprised how many issues it picks up. You could, of course, run it on your third-party libraries as well if you're paranoid enough. It'll cost you a lot more, though, and my points above still stand: you really should be trusting your third-party vendors to do this kind of thing for themselves.

Peter Mortensen
  • 877
  • 5
  • 10
Spudley
  • 541
  • 3
  • 6
  • 45
    +1, also if you are not willing to trust a major well-known framework then you have bigger problems because you also shouldn't trust your OS, your software, your firmware, your hardware, etc. – Jon Bentley Dec 09 '19 at 15:31
  • 4
    @FrankHopkins Not necessarily. If you re-invent the wheel for those dependencies which are merely "convenient to use", you run the risk of introducing security flaws which were not present in the third party library (which is potentially developed by more experienced developers and has had more scrutiny). – Jon Bentley Dec 10 '19 at 15:03
  • @JonBentley that's why I say minimize the dependencies to what you need. If you do crypto, you definitely do need those crypto libraries. But you likely don't need the big framework that - on top of a lot of other stuff - gives you convenient database access. Perhaps there is a library that already gives you nearly as convenient database tooling, then that is likely the better choice. Unless it's so unknown you're the only one that uses it. Etc. It's always a weighing one against the other problem, but big frameworks 'always' have some overlooked issue that will become exploitable somewhere. – Frank Hopkins Dec 10 '19 at 23:40
27

Welcome to the new paradigm of coding: you're using libraries on top of libraries. You're hardly alone, but you also need to understand that anytime you bring in code you didn't write, you bring in some risk.

Your actual question is: how can I manage that risk?

Understand what your software is supposed to be doing

Too often, package managers become a convenient way to slap in code that "just works", without anyone ever bothering to understand, even at a high level, what it is supposed to be doing. Thus, when your trusted library code does bad things, you're caught flat-footed, wondering what happened. This is where unit testing can help, as it pins down what the code is supposed to be doing.
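
For instance, a minimal PHPUnit test can pin down the behavior you rely on, so that an update which silently changes it fails loudly. (`Acme\Slugger` below is a hypothetical library class, not a real package.)

```php
<?php
use PHPUnit\Framework\TestCase;

// Pin the behavior we depend on from a third-party helper; if an update
// changes it, this test fails and we get to ask why before shipping.
final class SluggerContractTest extends TestCase
{
    public function testProducesTheSlugWeRelyOn(): void
    {
        $slugger = new \Acme\Slugger();
        $this->assertSame('hello-world', $slugger->slugify('Hello, World!'));
    }
}
```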

Know your sources

Composer (or any package manager) can install from any source you specify, including a library rolled up yesterday by a completely unknown author. I've willingly installed packages from vendors who publish SDKs, because the vendor is a highly trusted source. I've also used packages from sources that do other trusted work (e.g. someone on the PHP project who maintains a library repo). Blindly trusting any source can get you in trouble.
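
For example, a `repositories` entry in `composer.json` will happily pull straight from any VCS URL you give it, and the burden of trusting that source is entirely on you (the URL and package name below are made up):

```json
{
    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/example/some-library"
        }
    ],
    "require": {
        "example/some-library": "^1.0"
    }
}
```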

Accept that there is some risk you can never fully mitigate

In 2016, a single developer crippled a ton of NodeJS packages when, after a naming dispute, he unpublished all of his npm libraries. One of them, the 11-line left-pad, was a dependency - direct or transitive - of thousands of other packages. Or maybe the infrastructure wasn't built to handle package distribution, so it fails randomly. The Internet has gotten so good at "making things work" in the distributed software development world that people tend to be upset or confused when it just stops working.

When PHP 7.0 came out, I had to do a ton of work to make an open-source third-party software package we use function in the 7.0 environment. It took significant time on my part, but I was able to help that package's author work through some issues and make it usable under 7.0. The alternative was replacing it... which would have taken even more time. It's a risk we accept because that package is quite useful.

Machavity
  • 3,766
  • 1
  • 14
  • 29
3

> However, the most fundamental security problem still persists: I have no idea what this "installed" code contains, nor do I know what is added or changed with every update. Any one of the libraries' authors could be compromised one day, and the next time Composer fetches updates, my PHP CLI scripts could suddenly start sending my Bitcoin wallet.dat to some remote server, installing a RAT/trojan on my machine, or worse. In fact, it could already have happened, and I would be none the wiser. I simply have no idea. I logically cannot have any idea.

Look up Heartbleed, the massive security hole in OpenSSL. A missing bounds check in the TLS "heartbeat" extension gave anyone who knew about it an easy, unlogged way to connect remotely and read up to 64 KB of the server's memory per request - memory that could contain recent plaintext of supposedly encrypted traffic, passwords, session data, and even the server's private key. By that time, OpenSSL was protecting the vast majority of self-hosted websites and a huge number of banks and even government intelligence services.

Then look up Meltdown and Spectre, massive hardware vulnerabilities baked right into modern CPUs (Meltdown chiefly affecting Intel; Spectre affecting processors from several vendors). By abusing speculative execution, they let code read memory across the very protection boundaries the CPU is supposed to enforce, and, being independent of the OS, they are exploitable on every operating system.

Years and years ago, a worm called Blaster (a.k.a. MSBlaster) exploited a buffer overflow in a Windows XP background service - the DCOM RPC service - which had no business being reachable from the network by default: it would only be actively used by a vast minority of Windows users, and then only known about by IT departments. This finally drove ISPs to issue hardware firewalls built into their modem devices, and drove Microsoft to turn on the operating system's built-in software firewall by default. Around that same time, a backdoor was very nearly slipped into the source code of the allegedly "virus-proof" Linux platform itself, through a compromised source-code mirror.

As others have said: you have to trust somebody at some point. Both accidents and malice cause problems. I'm like you - a big fan of The X-Files and Uplink (TRUST NO ONE!) - but the reality is that your SSL cryptography engine and your physical hardware are just as likely to contain security holes as your PHP libraries, and far more likely to cause mission-critical failures when they do.

If you're serious about going that extra mile to reinvent the Composer wheel for your and your users' security, then be serious about going that extra mile: engineer your own CPU, mainboard, RAM, HDD, and optical drives. Write your own OS and hardware drivers. Make your own compilers, too. And forget about PHP, because there could be problems in the interpreter - in fact, forget about C and C++ too, because there could be problems in the compiler, and don't even think about assembly language with an assembler somebody else wrote. Write all your own software from the ground up in machine instructions, with a hex editor.

Or you could act like a member of the industry. Subscribe to the Composer/PHP/your-Linux-distro update newsletters, maybe add some independent security newsletters, and get a subscription to Wired. Review your system logs. Periodically capture your network traffic and review it to ensure there are no unauthorized streams either in or out. Be proactive about monitoring plausible threats, not paranoid about things that haven't happened yet.
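
For the packet-capture part, even something as simple as this (interface name illustrative; needs root) gives you a file you can review later in Wireshark:

```sh
# capture this machine's traffic for later review
tcpdump -i eth0 -w audit.pcap
```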

schroeder
  • 123,438
  • 55
  • 284
  • 319
user116960
  • 155
  • 4
  • 2
    Well, a good chunk of your first paragraph is devoted to that incorrect understanding, and isn't really related to the question anyway. The bottom line is good advice, but the answer would benefit from being edited down a bit. In the end, it doesn't matter _why_ the various vulnerabilities you mention existed; the reason they're relevant is _where_ they existed, so cutting down on the speculation about back doors would make that clearer. – IMSoP Dec 10 '19 at 17:52
  • This really doesn't answer the question. Your first 5 paragraphs appear to be complete tangents. The last paragraph is closer to being on-topic, but it's a tangent as well. It's not about how to review code (prevention) but how to detect a very specific type of action from a specific threat. – schroeder Dec 11 '19 at 07:32
  • I don't see how a subscription to wired, a tech news magazine would help. – ave Dec 12 '19 at 10:07
2

As an intermediate to advanced level developer, I've considered the same problem. Some points to consider:

  • Prioritize reviewing code that is critical for security purposes. Obviously that includes things like authentication and login code, permissions validation, and payment-processor integrations - anything that asks for sensitive information or makes network calls. (A rough first-pass grep, sketched after this list, can help surface such call sites.)
  • Visually skim things like styling libraries - you should be able to quickly determine that they are only doing styling - and things like utility functions: uppercasing strings, whitespace substitutions, reordering arrays... a quick read should show that they're not doing anything unexpected.
  • Even if you don't fully reverse-engineer the code as if it were your own, you should be able to glance at the source and determine whether it was written to be friendly towards review. Code should be documented with helpful comments; variable and method names should be relevant and useful; functions and implementations should not be too long, too complex, or full of unnecessary functionality. Code that is pleasant to the eye is certainly not the preferred attack vector for malicious hackers.
  • Confirm that the code has an established and mature user base. You want to gravitate towards projects that profitable and well-known companies are known to use.
  • Confirm the real world identities of lead contributors. For large-scale projects, the lead developer will be glad to take credit for their work. You should be able to find blog posts, social media accounts, and probably a resume or a marketing page for consulting work. Contact me! etc.
  • Confirm that open-source code is actively maintained with recent bugfixes. Look at outstanding bug reports -- there are bound to be a few -- and don't trust claims that a particular tool or library is bug-free. That's a delusional claim.
  • Avoid "freeware" sites with excessive ads. Avoid projects that don't have a demo site available, or where the demo is "ugly", badly maintained, or frequently offline. Avoid projects that are over-hyped, or excessive buzzwords, make untested claims of superior performance. Avoid downloading from anonymous blogs. Etc.
  • Think maliciously. If you wanted to break your site, what would you try? If you wanted to sneak unsafe code into a widely-used library, how would you do it? (Don't actually try this, obviously.)
  • Fork open-source projects, or download backups. Never trust that the official repo of the open-source project you like will remain online indefinitely.
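
As a starting point for that first-pass skim, a crude grep over the dependency tree will at least surface the call sites worth a closer look (the pattern is illustrative, not exhaustive):

```sh
# flag PHP constructs that deserve scrutiny in third-party code
grep -rnE "eval\(|exec\(|proc_open|base64_decode|curl_exec" vendor/
```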

So instead of attempting to read and understand every single line of code individually, just get an idea of what each library does, and why you believe it does that. I really think that, if your work is profitable, there is no upper limit to how big a project can be; this way you can "vet" 1,200,000+ lines of code, or 120,000,000+ lines of code!

0

Composer can work with a composer.lock file and, by default, downloads packages via https://packagist.org/ (note the HTTPS). So you have a huge package repository and a secure download, with an accompanying SHA1 checksum or pinned commit reference to ensure that you download exactly what was originally resolved. That alone helps you quite a lot.
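
Each package entry in `composer.lock` pins down exactly what will be installed; a trimmed entry looks roughly like this (names, URL, and hashes are illustrative):

```json
{
    "name": "example/some-library",
    "version": "1.2.3",
    "dist": {
        "type": "zip",
        "url": "https://api.github.com/repos/example/some-library/zipball/abc123def456",
        "reference": "abc123def456",
        "shasum": "2aae6c35c94fcfb415dbe95f408b9ce91ee846ed"
    }
}
```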

If you stay on the conservative side of dependency updates, you can also expect that the package versions have seen production use.
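
For instance, a tilde constraint in `composer.json` keeps `composer update` on the patch series you have already seen in production (the package name is illustrative):

```json
{
    "require": {
        "example/some-library": "~1.2.3"
    }
}
```

Here `~1.2.3` allows anything from 1.2.3 up to, but not including, 1.3.0, so you pick up bugfixes without silently jumping to a new minor release.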

In the end, though, you will have to trust someone. You can either trust yourself to write exploit-free code, or you can, like others, trust community projects used by thousands and seen by even more.

Besides, I don't think you have a choice. If others are "flying blindly", i.e. without the security audits that you want to do, and take "your" customers with lower prices and faster feature releases, no one will ever benefit from your secure, self-written application anyway.

knallfrosch
  • 857
  • 5
  • 6
  • 1
    Encrypted transfers and checksums don't tell you anything about what the code actually does, and who has audited it. I could put a package on Packagist in about 5 minutes, tag it as v2.5.9 to look maintained, but fill it with code that uploaded as much data as it could access to a server of my choice. – IMSoP Dec 10 '19 at 17:56