To add to the previous answers:
Use a deployment strategy that allows for rollbacks and instant switching. Capistrano, or pretty much any other deployment system, will help with this. Things like database snapshots and code symlinks let you quickly revert to a previous state.
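As a rough illustration of the code-symlink idea, here is a minimal sketch. It assumes a Capistrano-like layout that I am making up for the example: each release lives in its own directory under `releases/`, and the web server serves whatever a `current` symlink points to. The function name `activate_release` and the directory names are hypothetical, and the atomic swap relies on POSIX rename semantics.

```python
import os
from pathlib import Path


def activate_release(releases_dir: Path, release: str, current_link: Path) -> None:
    """Point `current_link` at releases_dir/release in one atomic step."""
    target = releases_dir / release
    if not target.is_dir():
        raise FileNotFoundError(f"release not found: {target}")

    # Create the new symlink under a temporary name first.
    tmp_link = current_link.with_name(current_link.name + ".tmp")
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    os.symlink(target, tmp_link)

    # Renaming over the old link replaces it atomically on POSIX systems,
    # so requests see either the old release or the new one, never neither.
    os.replace(tmp_link, current_link)
```

Rolling back is then the same call with the previous release name, for example `activate_release(releases, "v1", current)` after a bad `v2`. Database changes are not covered by this and would still need snapshots or reversible migrations.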
Use complete configuration management; don't leave anything managed manually. SaltStack, Ansible and Puppet are examples. They can be applied to Docker container configurations and Vagrant boxes as well.
Use HA (high availability) to make sure you can hand off requests when upgrading a node. If the upgrade fails, simply take the node down; once it's rolled back, bring it back up and your HA solution will notice and route requests to that node again. HAProxy is an example, but nginx works just fine as well.
Make sure the application can handle concurrent instances, and use central, versioned data repositories for non-code data that needs to be stored on disk, such as cache. This way, you will never have an upgraded application run into cache files from a different version. This would be done on top of purging caches and doing cache warmups, of course. (Caching is just an example.)
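One simple way to get versioned on-disk data is to put the application version in the path. The sketch below is only an assumption about how that could look; `cache_dir_for` and the version strings are names invented for this example.

```python
from pathlib import Path


def cache_dir_for(base: Path, app_version: str) -> Path:
    """Return a cache directory that only one application version uses."""
    # Reject values that would escape or collapse the base directory.
    if app_version in ("", ".", "..") or "/" in app_version or "\\" in app_version:
        raise ValueError(f"unsafe version string: {app_version!r}")
    path = base / app_version
    path.mkdir(parents=True, exist_ok=True)
    return path
```

An instance running 1.4.0 and an instance running 1.5.0 then read and write separate directories, so old and new nodes can run side by side during an upgrade. Directories of retired versions can be deleted once no node uses them.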
I usually set up workflows where team managers can approve merge requests to a special branch that does all the normal CI stuff, but as an additional last step also starts pushing to production nodes. What you basically do is run a manual CI deploy to a single production instance. If that instance doesn't generate invalid responses, break, or do weird things to your data, you then mass-upgrade all nodes using your CI solution of choice. This way, if one deployment works, you know all deployments will work for a specific tag/commit.
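The "one instance first, then everything else" flow can be sketched roughly as below. This is not tied to any particular CI product: `deploy`, `healthy` and `rollback` are hypothetical callbacks that you would back with your own tooling, and the manual approval step is not modelled.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    nodes: Sequence[str],
    ref: str,
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy `ref` node by node; the first node acts as the canary."""
    upgraded: List[str] = []
    for node in nodes:
        deploy(node, ref)
        if not healthy(node):
            # Revert the failing node and stop before touching the others.
            rollback(node)
            raise RuntimeError(
                f"{node} unhealthy after deploying {ref}; "
                f"already upgraded: {upgraded}"
            )
        upgraded.append(node)
    return upgraded
```

If the first node fails its check, no other node is touched, which is the point of the single-instance deploy. Nodes upgraded before a later failure stay on the new version in this sketch; whether to roll those back too is a policy decision.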
Right now, it sounds as if you are running a production application on a single node, with one deployment process, one source and one target. In practice, this means that every single step in that workflow is a point of failure that by itself can break the website. Ensuring that such a thing cannot happen is the basis of all the CI, HA and failover processes. Don't run just one node, don't run just one HA process, don't run on just one IP address, don't run just one CDN, etc. It might sound expensive, but putting a duplicate of what you already have in a rack on a server with its own connection usually costs less than one hour of downtime on a business site.