
After months of neglect, e-mail flames and management battles, our current sysadmin was fired and handed "the server credentials" over to me. Those credentials consist of a root password and nothing else: no procedures, no documentation, no tips, nothing.

My question is: assuming he left booby traps behind, how do I gracefully take over the servers with as little downtime as possible?

Here are the details:

  • one production server located in a server farm in the basement; probably Ubuntu Server 9.x with grsec patches (a rumour I heard the last time I asked the admin)
  • one internal server that contains all internal documentation, the file repository, wikis, etc. Again Ubuntu Server, a few years old.

Assume both servers are patched and up to date, so I'd rather not try to hack my way in unless there's a good reason (i.e. one that can be explained to upper management).

The production server hosts a few websites (standard Apache-PHP-MySQL), an LDAP server, a Zimbra e-mail suite/server and, as far as I can tell, a few VMware virtual machines. I have no idea what's happening inside those. Probably one of them is the LDAP master, but that's a wild guess.

The internal server runs an internal wiki/CMS, an LDAP slave that replicates the credentials from the production server, a few more VMware virtual machines, and the backups.

I could just go to the server farm's admins, point at the server, tell them 'sudo shut down that server please', log in in single-user mode and have my way with it. Same for the internal server. Still, that would mean downtime, upper management getting upset, the old sysadmin firing back at me with 'see? you can't do my job' and other nuisances, and, most importantly, potentially losing a few weeks of unpaid time.

On the other end of the spectrum I could just log in as root and inch through the servers, trying to build an understanding of what's happening, with all the risks of triggering whatever surprises were left behind.

I am looking for a solution in the middle: try to keep everything running as it is, while understanding what's happening and how, and most importantly avoiding triggering any booby traps left behind.

What are your suggestions?

So far I have thought about 'practicing' with the internal server: disconnecting the network, rebooting with a live CD, dumping the root file system onto a USB drive, and loading it into a disconnected, isolated virtual machine to understand the former sysadmin's way of thinking (à la 'know your enemy'). I could pull the same feat with the production server, but a full dump would make somebody notice. Perhaps I could just log in as root, check the crontabs, check .profile for any commands that get launched, dump the lastlog, and whatever else comes to mind.
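For example, a first read-only pass might look something like this (just the commands that come to mind, nothing that writes to disk):

    # root's own crontab plus the system-wide ones
    crontab -l
    cat /etc/crontab
    ls -la /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly /etc/cron.monthly
    # anything sourced on root's login
    cat /root/.profile /root/.bashrc
    # who logged in, and when
    lastlog
    last -20
    # what is actually running right now
    ps auxww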

And that's why I'm here. Any hint, no matter how small, would be greatly appreciated.

Time is also an issue: there could be triggers happening in a few hours, or a few weeks. Feels like one of those bad Hollywood movies, doesn't it?

lorenzog
  • Why was the sysadmin fired? This looks like a no-win situation. If you are not sure what to do and what exactly is on the servers, this will not end well. – cstamas Jun 18 '11 at 21:19
  • @cstamas the sysadmin was fired because for every request we made (e.g. add a user to a mailing list, or create an e-mail alias) the time it took was a random variable between t = 1 day and t = 2 months (inclusive). And he never admitted that. Plus a bunch of other bad behaviours that I won't go into detail about here. – lorenzog Jun 19 '11 at 08:32
  • @lorenzog now it makes sense. Looks like it will not be an easy task. There are great answers already. Good luck! – cstamas Jun 19 '11 at 16:41
  • Something which is not part of a helpful answer: this whole story sounds like you are responsible for some infrastructure, hired some junior admin, and he turned out to be not quite good enough. Now you're left with undocumented systems and (to be honest) if you are indeed responsible for the infrastructure you don't deserve better. You should have been persistent enough that your admins did proper documentation (yes, harsh and opinionated -- but by no means take it personally; I'd say this to everyone in that situation). – Martin M. Jun 19 '11 at 18:35
  • @serverhorror: no, they simply hired him before I joined this company, and now he has turned out not to be good enough. Since I knew him from before, I had the task of 'dealing with him'. Careful with your assumptions. – lorenzog Jun 20 '11 at 10:19
  • @lorenzog: This isn't about you. The point is that it actually is the manager's fault (whoever that is) that this situation of undocumented infrastructure could even happen -- as I said: no offense, just an observation (granted, a subjective one). – Martin M. Jun 20 '11 at 12:15
  • @serverhorror: You're right. And given my reaction, I suppose there are indeed more personal feelings involved than I thought... I'd say lesson learned for the future. – lorenzog Jun 21 '11 at 08:34
  • @lorenzog I realize this won't help you much, but I don't see how you can avoid downtime if you want to be completely safe. Booby traps come in all kinds of shapes and forms, from alias ls='rm -rf /' to a process that watches logins and does that rm -rf / when root logs in. The only completely safe way would be a complete reinstall, I'd guess. – Tuncay Göncüoğlu Nov 30 '12 at 18:19

5 Answers


As others have said, this looks like a lose-lose situation.

(Starting at the end)

  • Completely new deployment

Of course you can't just take the servers down and let the installer do its magic.

General Process

  • Get budget for a backup server (backup as in storage for the data)
  • create snapshots of the data and place them there before doing anything
  • Get that signed off by management!
  • Gather a list of requirements (is the wiki needed, who is using the VMWare instances, ...)
    • From Management and
    • From Users
  • Get that signed off by management!
  • Shut down unlisted services for a week (one service at a time - iptables may be your friend if you want to shut down just the external side of a service but suspect it might still be used by an application on the same host; see the sketch after this list)
    • No reaction? -> final backup, remove from server
    • Reaction? -> Talk to the users of the service
    • Gather new requirements and get that signed off by management!
  • all unlisted services down for a month and no reaction? -> rm -rf $service (sounds harsh, but what I mean is: decommission the service)
  • get budget for a spare server
  • migrate one service at a time to the spare
  • get that signed off by management!
  • shut down the migrated server (power off)
  • find that more people come screaming at you -> yay, you just found the leftovers
  • gather new requirements
  • start up again and migrate services
  • repeat the last 4 steps until nobody has come after you for a month
  • redeploy the server (and get that signed off by management!)
  • rinse and repeat the whole process.
    • the redeployed server is your new spare
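For the iptables part, a rough sketch (assuming, purely as an example, that the suspect service listens on TCP port 8080) could look like this:

    # log outside access attempts, then drop them (LOG first, since LOG does not terminate)
    iptables -A INPUT -p tcp --dport 8080 ! -i lo -j LOG --log-prefix "quiesced-svc: "
    iptables -A INPUT -p tcp --dport 8080 ! -i lo -j DROP
    # if an earlier ACCEPT rule already matches that port, insert (-I) at the right position instead
    # a week later, check whether anything still tried to reach it
    grep "quiesced-svc:" /var/log/syslog

That way you don't just kill the port silently: the LOG rule tells you whether something on the network still depends on the service before you decommission it.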

What did you gain?

  • Inventory of all services (for you and management)
  • Documentation (after all you need to write something down for management, why not do it properly and make something for you and management)

Been there done that, it's no fun at all :(

Why do you need to get it signed off by management?

  • Make the problems visible
  • Be sure you won't get fired
  • Opportunity to explain risks
    • It's fine if they don't want you to do it, but after all it's their decision to make, once they have enough input to judge whether the investment is worth it.

Oh, and present the overall plan to them before you start, with some estimates about what will happen in the worst and best case.

It will cost a lot of time regardless of redeployment if you don't have documentation. There's no need to even think about backdoors: IMHO, if you don't have documentation, a rolling migration is the only way to reach a sane state that will deliver value for the company.

Martin M.
  • That is a very good perspective. Thank you. I will certainly follow your advice re: getting things signed off by management and doing a slow redeployment of servers. It will hurt, but it sounds like the most reasonable course of action. – lorenzog Jun 19 '11 at 08:38
  • By proper documentation I suggest this: http://serverfault.com/questions/25404/documentation-as-a-manual-vs-documentation-as-a-checklist/25535#25535 (also see the general topic) works very well (at least for me) – Martin M. Jun 19 '11 at 14:30

First of all, if you're going to invest extra time in this I'd advise you to actually get paid for it. It seems you've accepted unpaid overtime as a fact, judging from your words - it shouldn't be that way, in my opinion, and especially not when you're in such a pinch because of someone else's fault (be it management's, the old sysadmin's, or probably a combination of both).

Take the servers down and boot into single-user mode (init=/bin/sh or 1 at GRUB) to check for commands that run on root's login. Downtime is necessary here; make it clear to management that there's no choice but to accept some downtime if they want to be sure they get to keep their data.
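For the record, that means interrupting the boot at the GRUB menu, editing the kernel line and appending one of the following (the exact syntax depends on the GRUB version):

    ... ro single          # classic single-user mode (still runs some init scripts)
    ... ro init=/bin/sh    # bypass init entirely, so nothing set up by the old admin gets started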

Afterwards look over all cronjobs, even if they look legit. Also perform full backups as soon as possible - even if this means downtime. You can turn your full backups into running VMs if you want.

Then, if you can get your hands on new servers or capable VMs, I would actually migrate the services to new, clean environments one by one. You can do this in several stages so as to minimize perceived downtime. You'll gain much needed in-depth knowledge of the services while restoring your confidence in the base systems.

In the meantime you can check for rootkits using tools such as chkrootkit. Run Nessus against the servers to look for security holes that the old admin might exploit.
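Something along these lines (both tools are in the Ubuntu repositories; expect a few false positives and read the output critically):

    apt-get install chkrootkit rkhunter
    chkrootkit                 # scan for known rootkit signatures
    rkhunter --check --sk      # --sk skips the interactive 'press enter' prompts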

Edit: I guess I didn't address the "gracefully" part of your question as well as I could. The first step (going into single-user mode to check for login traps) can probably be skipped - the old sysadmin giving you the root password and then setting up the login to do an rm -rf / would be pretty much the same as deleting all the files himself, so there's probably no point in doing that. As for the backup part: try using an rsync-based solution so you can do most of the initial backup online and minimize downtime.
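A minimal sketch of that, assuming you can pull from the servers over SSH as root (the host name and destination path are placeholders):

    # first pass while everything is still running - this is the slow one
    rsync -aAXH --numeric-ids --delete \
        --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/tmp \
        root@prodserver:/ /backups/prodserver/
    # then stop the services briefly and re-run the exact same command:
    # the second pass only transfers what changed, so the downtime window stays small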

Eduardo Ivanec

Do you have reason to believe that the previous admin left something bad behind, or do you just watch a lot of movies?

I'm not asking to be facetious; I'm trying to get an idea of what sort of threat you think is there and how probable it is. If you think the chances really are very high that some sort of seriously disruptive problem exists, then I'd suggest treating it as if it were a successful network intrusion.

In any case, your bosses don't want the disruption of downtime while you deal with this - so find out what their attitude is to planned downtime to tidy the systems up vs. unplanned downtime if there is a fault in the system (whether a real fault or a rogue admin), and whether that attitude is realistic given your assessment of how probable it is that you really have a problem here.

Whatever else you do, consider the following:

Take an image of the systems right now, before you do anything else. In fact, take two: put one aside and don't touch it again until you know what, if anything, is happening with your systems - it is your record of how they were when you took them over.
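To be clear, by "image" I mean a block-level copy, e.g. something like this (device name and destination are placeholders; run it from a rescue environment if you can, or accept a slightly inconsistent live image):

    # stream a compressed copy of the whole disk to another machine
    dd if=/dev/sda bs=4M conv=sync,noerror | gzip -c | \
        ssh backup@backuphost 'cat > /images/prod-sda.img.gz'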

Restore the "2nd" set of images to some virtual machines and use these to probe what is going on. If you're worried about things being triggered after a certain date then set the date forward a year or so in the virtual machine.
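Inside that isolated copy you can simply wind the clock forward, for instance:

    # inside the test VM only, never on the real box
    date --set="2012-07-01 09:00"       # jump roughly a year ahead
    run-parts --test /etc/cron.daily    # list what cron would run, without actually running it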

Rob Moir
  • I do have reasons to suspect there might be something lurking, since we did not part on the best of terms. The previous sysadmin was a good friend; we were roommates during college and I "taught him" many of the tricks he later used to become a sysadmin, while I took the path of software development and project management. Because there are personal feelings involved (he accused me of having managed to get him fired) I can't expect reasonable behaviour. Take it as a father/son relationship, where the son wants to prove his worth to the father, to some extent. – lorenzog Jun 19 '11 at 08:37

I'd invest time in learning what apps run on those servers. Once you know what is what, you can install a new server at any time. If you feel there may be some backdoor, it would be a good idea to just boot into single-user mode or put a firewall between the servers and the external net.
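For example, a very restrictive ruleset on each server (or the equivalent on a filtering box in front of them) might look like this - the address and ports are only placeholders:

    # open only what is known to be needed, then default-deny everything else
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp -s 203.0.113.10 --dport 22 -j ACCEPT       # SSH from your own IP only
    iptables -A INPUT -p tcp -m multiport --dports 25,80,443 -j ACCEPT  # the known public services
    iptables -P INPUT DROP    # flip the default policy last, so an SSH session is not cut off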

silviud

You are getting paranoid about security; there is no need to (I say that because you talk about booby traps). Go through the list of installed software. See what services are running (netstat, ps, etc.) and check the cron jobs. Disable the previous sysadmin's user account without deleting it (easily done by pointing its shell to nologin). Go through the log files. With these steps, plus your knowledge of the company's needs (from which you can guess what the servers are used for), I think you should be able to maintain them without any major goof-ups.
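For example (the user name is just a placeholder, and the nologin path may be /sbin/nologin on some systems):

    dpkg -l > installed-packages.txt        # list of installed software
    netstat -tulpn                          # listening services and the processes that own them
    ps aux --forest                         # running processes with parent/child relationships
    ls /var/spool/cron/crontabs/            # per-user crontabs
    usermod -s /usr/sbin/nologin oldadmin   # point the old admin's shell to nologin...
    passwd -l oldadmin                      # ...and lock the password, without deleting the account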

bagavadhar
  • I agree it's not about security in the first place (otherwise they shouldn't have hired the old admin at all). But it is about how much value one can add. I completely disagree with all the rest. There just is no sane way to manage things without some kind of inventory. Users will come and hit you after some time because something you never heard of before stopped working. After all, there's quite some infrastructure behind every user-visible service. And there isn't even documentation about those services... – Martin M. Jun 18 '11 at 22:12