
I am a fullstack engineer in a small company, and I am responsible for everything related to technology, from infrastructure to code.

I arrived a few months ago and quickly realised that there were many issues, both code-related and infrastructure-related, as the company has not done any updates or upgrades since the first tech guy who built the whole infrastructure left (2 years ago).

As I am planning to stay in the company, I understand that it is mandatory to upgrade the servers, but infrastructure is not my strong suit, so I am looking for any advice you could give me here.

As the question is quite broad, here is how I'll reduce the scope. This is what I have now:

  • 3 servers on Ubuntu 16.04.5: 2 act as backends and one acts as a load balancer
  • The servers are part of a MongoDB cluster (if it matters)
  • The apache2 version currently used on all of them is 2.4.18 (built 2018/06/07)
  • The applications are deployed on the servers using Ansible, which is a good thing

But many things that were set up 2 years ago are now broken, such as Nagios, Jenkins, or SonarQube, and the issue is that there is no documentation on how/when/why these things were set up.

So for the question:

I am looking into cleaning up the useless libraries, fixing/installing all the necessary monitoring tools, patching the vulnerabilities by upgrading, and so on, but without disrupting the apps running on the servers.

Have any of you been in a similar situation? What would you advise? Do you have any useful guides/tools/commands to go through/use?

P.S.: I know the question is quite broad. As I am overwhelmed by the subject, I do not know how to properly split it up into concise questions yet, but if I could get an answer that points me in a global direction, I will probably write more precise follow-up questions on each issue.

Youri

2 Answers


The good news is that the OS is still supported, though not for much longer (Ubuntu 16.04 goes end of life in April 2021).

I'd start by making sure all the systems are up to date.
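On Ubuntu that boils down to something like this, run on one node at a time (a sketch; review the upgradable list before committing, since services get restarted during package updates):

```shell
sudo apt update                  # refresh package lists
apt list --upgradable            # review what would change before committing
sudo apt upgrade                 # apply updates (expect service restarts)
sudo apt autoremove --purge      # drop packages nothing depends on anymore
# Ubuntu flags pending kernel/libc reboots in this file:
[ -f /var/run/reboot-required ] && echo "Reboot needed"
```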

Then I'd make sure I have a working backup, including testing of recovery procedures (Ideally this would be the first point, but after 2 years without updates chances are high that you won't be able to install a backup solution easily).
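As a minimal sanity check of the "test the backup" step (the /backup path here is purely hypothetical; substitute wherever the existing solution actually writes to):

```shell
# Verify the most recent archive is at least readable end to end.
# /backup/latest.tar.gz is a placeholder path - adapt to your setup.
tar -tzf /backup/latest.tar.gz > /dev/null && echo "archive readable"
```

A readable archive is only the first bar; a real recovery test means restoring onto a scratch machine and starting the application there.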

Next I'd go over each service (Nagios, Jenkins, etc.), check its state, and decide if I want to keep using it or if I want to switch to something I'm more familiar with. If it's the former I'd fix it; if it's the latter (or if it's in an extremely bad state) I'd just reinstall the system with a current version of Ubuntu and reinstall what I need from scratch.
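A rough triage pass could look like this (the service and package names are guesses based on the question; adjust to what is actually installed):

```shell
systemctl list-units --type=service --state=failed   # anything broken right now
systemctl status jenkins nagios                      # state of the known suspects
ss -tlnp                                             # which ports are actually being listened on
dpkg -l | grep -Ei 'nagios|jenkins|sonar'            # whether they came in via apt at all
```

Services that show up in `ss` but not in `dpkg -l` were probably installed by hand (tarball, docker, etc.), which changes how you'd rebuild them.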

Afterwards I'd start updating the systems to a newer version of the OS, ideally up to 20.04.
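Ubuntu only supports upgrading one LTS step at a time, so this is 16.04 → 18.04 → 20.04. Roughly (assuming update-manager-core is installed, which provides do-release-upgrade):

```shell
sudo apt update && sudo apt full-upgrade   # be fully patched on the current release first
sudo reboot                                # boot the new kernel before the big jump
sudo do-release-upgrade                    # interactive upgrade to the next LTS (18.04)
# ...verify every service is healthy, then repeat do-release-upgrade for 20.04
```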


Some things to plan for:

  • running system updates will cause short outages (services get restarted during package updates, reboots will be required). If you can, schedule this for times of low usage
  • for systems that form a cluster, make sure that you have enough time between nodes so the cluster can recover (can't be more specific as I'm not familiar with MongoDB).
  • select a maintenance window and communicate that to your users, so they know beforehand that there will be outages.
  • if you can, set up a testing environment to test the major upgrades before you run them on production systems.
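For the cluster point above, MongoDB exposes replica set health via rs.status(); a quick way to watch it from any node before and after taking a member down (assuming the mongo client is installed and auth permits it):

```shell
# Print each replica set member and its state (PRIMARY/SECONDARY/RECOVERING/...)
mongo --quiet --eval 'rs.status().members.forEach(function (m) { print(m.name, m.stateStr); })'
# Only take the next node down once every member is back to PRIMARY or SECONDARY.
```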
Gerald Schneider
  • Thanks for your answer Gerald, it sounds easy and clear when you lay it out that way, but I'm still troubled/scared by the fact that this would mean interrupting the service, and I don't know how long it would take me to set everything back up, or whether I would manage to do it without issues. Do you see a way to make it a bit less of a "leap of faith"? – Youri Sep 22 '20 at 09:51
  • Also, what do you mean by "all the systems are up to date"? Wouldn't it be better to start by upgrading to 20.04 first before setting/fixing everything back up? Sorry if my questions sound a bit silly, I am really new to this, thanks for the help – Youri Sep 22 '20 at 09:56
  • You are going to do two major upgrades of the operating system, first to 18.04, then to 20.04. From experience I can tell you there WILL be problems after these upgrades. New versions of the packages with changes in configuration files, failing to start until the configuration file is corrected and stuff like that. It might be counterintuitive to do it in that order, but you will reduce unexpected downtime by making sure everything is running smoothly with the current versions first. – Gerald Schneider Sep 22 '20 at 10:03
  • Great, I see, makes sense! So you would also upgrade apache2 beforehand? And a major system upgrade would mess up the global configuration files, but not the services? Lastly, do you have in mind any tool that would allow me to quickly revert the servers to a working state in case I mess up badly? Anyway, I'll start messing with a test environment as soon as possible to find out the best way to proceed on the production server afterwards! – Youri Sep 22 '20 at 10:17
  • Product recommendations are off topic. If you don't have any backups in place, this would be the first thing to address. – Gerald Schneider Sep 22 '20 at 10:19
  • Right again, there might be one already, or not. I'll look into it and start by setting up a backup tool before going any further! Thanks a lot again, I'll wait a day or two before accepting your answer! – Youri Sep 22 '20 at 10:27

If possible, take a snapshot of the servers, start up your favourite virtualization software, and try any upgrade on that first.

You say some of it was installed via Ansible. Assuming you can get a VM to test on, try the Ansible code (with --check first).
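Something along these lines (the playbook name and host limit are placeholders; use whatever the repository actually contains):

```shell
# Dry-run: report what would change without changing anything.
# site.yml and "testvm" are hypothetical names - use the real playbook and host.
ansible-playbook site.yml --limit testvm --check --diff
```

--check performs the dry run and --diff shows the file changes it would have made, which is a quick way to see how far the old playbooks have drifted from the current state of the servers.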

Timothy c