62

Not a technical question, but a valid one nonetheless. Scenario:

HP ProLiant DL380 Gen 8 with 2 x 8-core Xeon E5-2667 CPUs and 256GB RAM running ESXi 5.5. Eight VMs for a given vendor's system. Four VMs for test, four VMs for production. The four servers in each environment perform different functions, e.g.: web server, main app server, OLAP DB server and SQL DB server.

CPU shares configured to stop the test environment from impacting production. All storage on SAN.

We've had some queries regarding performance, and the vendor insists that we need to give the production system more memory and vCPUs. However, we can clearly see from vCenter that the existing allocations aren't being touched, e.g.: a monthly view of CPU utilization on the main application server hovers around 8%, with the odd spike up to 30%. The spikes tend to coincide with the backup software kicking in.

Similar story on RAM - the highest utilization figure across the servers is ~35%.

So, we've been doing some digging, using Process Monitor (Microsoft SysInternals) and Wireshark, and our recommendation to the vendor is that they do some TNS tuning in the first instance. However, this is beside the point.

My question is: how do we get them to acknowledge that the VMware statistics that we've sent them are evidence enough that more RAM/vCPU won't help?

--- UPDATE 12/07/2014 ---

Interesting week. Our IT management have said that we should make the change to the VM allocations, and we're now waiting for some downtime from the business users. Strangely, the business users are the ones saying that certain aspects of the app are running slowly (compared to what, I don't know), but they're going to "let us know" when we can take the system down (grumble, grumble!).

As an aside, the "slow" aspect of the system is apparently not the HTTP(S) element, i.e.: the "thin app" used by most of the users. It sounds like it's the "fat client" installs, used by the main finance bods, that is apparently "slow". This means that we're now considering the client and the client-server interaction in our investigations.

As the initial purpose of the question was to seek assistance as to whether to go down the "poke it" route, or just make the change, and we're now making the change, I'll close it using longneck's answer.

Thank you all for your input; as usual, serverfault has been more than just a forum - it's kind of like a psychologist's couch as well :-)

ewwhite
Simon Catlin
  • 5
    LART / Clue-by-four? (http://www.catb.org/jargon/html/L/LART.html) (http://www.catb.org/jargon/html/C/clue-by-four.html) – Christopher Karel Jul 08 '14 at 20:17
  • 2
    Clue by four - PMSL :-) – Simon Catlin Jul 08 '14 at 21:54
  • 1
    Typo in your question, I think. Should "give the production system for memory and vCPU" be "give the production system more memory and vCPU"? I got very puzzled the first time through and it was only the penultimate sentence that actually clarified what the problem was. :) – Chris Jul 09 '14 at 10:59
  • 5
    This remains my preferred LART: http://laughingsquid.com/cat-5-o-nine-tails-ethernet-cable-whip/ It's for network diagnostics. Honest. – Sobrique Jul 09 '14 at 12:00
  • 17
    Out of interest, have you checked storage performance? Asking for more CPU/RAM might just be a layman's response to poor performance, which could easily be caused by high disk queue depth. Seems like a lot of folks forgot about SQL storage best practices when virtualisation came in. – Ashigore Jul 09 '14 at 12:43
  • 7
    *grumble*. That's right, blame storage! But more seriously - it's a good point. If there's a problem and RAM/CPU isn't helping, then it might be IO. Especially if we're talking VMWare, because it's not uncommon for ... well, the storage performance side of a system to be almost entirely ignored - whilst forgetting that you intrinsically get a massive bottleneck if you feed a lot of VMs on a limited number of HBAs. – Sobrique Jul 09 '14 at 14:16
  • 1
    Maybe the vendor has a point? Do you overcommit pCPUs, by chance? See this: http://www.zdnet.com/virtual-cpus-the-overprovisioning-penalty-of-vcpu-to-pcpu-ratios-4010025185/ – mustaccio Jul 09 '14 at 16:34
  • 6
    Is HP your vendor in this case? Because I work there. I can confirm we don't care. – Christopher Wirt Jul 09 '14 at 22:05
  • As noted in comment from @Benubird below, can you clarify "vendor"? Not so much the identity, but at least the category. They supplied the server? An application that you run on the server? Specific hardware components? That is, are you being asked to spend more money to get more "stuff" from the same vendor? Or are they pushing blame away from their product onto your hardware? Need to know the specific area this vendor should have appropriate competence, especially the individuals you're dealing with in the Support group. – user2338816 Jul 10 '14 at 23:19
  • Inside a VM the usual tools (taskmanager, top etc) can act weird. Sometimes IMHO they appear to report on the resources the VM is currently given by the host, not necessarily what has been granted by the host. That aside, look closely at the peaks when they happen (a monthly CPU view is only a meaningless average), check IO and SAN. Check they configured their database and application properly to use more cores and memory? Performance problems are difficult to trace, I always start with the network cables and work up from there. A half duplex network connection or faulty cable can do it. – jqa Jul 11 '14 at 03:50
  • 1
    Thanks for comments folks. I will try to answer them in order. a) Have been looking at storage this week following above comments, and it appears to be OK (occasional disk queue blip, but generally healthy). b) We do have a 2:1 ratio of vCPU vs CPU, but, if the dev system isn't being used, it's 1:1. Hopefully, the CPU shares would limit the impact of any clash anyway. c) HP not involved, the vendor is a "systems integrator" of a large Oracle based system. When I say large, I mean an unnecessarily complex mix of Oracle middleware and "financial" products. – Simon Catlin Jul 12 '14 at 08:26
  • 2
    `"systems integrator"` Ah. Not specialists, but generalists. (_Specialist:_ Knows more and more about less and less until eventually knowing everything about nothing. _Generalist:_ Knows less and less about more and more until eventually knowing nothing about everything.) Probably a maze working through Support before getting to an appropriate person (if exists). Still, are vendor recommendations to purchase more (memory, etc.) from them? And did they provide (and recommend) the current configuration? I.e., are they now saying they were not accurate (wrong) earlier? – user2338816 Jul 12 '14 at 10:09
  • 1
    If an operating system resource gets saturated (locking on a file or similar) you might still see sluggish responses with spare cpu-power and memory. Adding more memory and cpu may alter the configuration enough for saturation not to happen. In other words, your diagnosis _may_ be wrong. – Thorbjørn Ravn Andersen Jul 13 '14 at 20:12
  • ServerFault is not a forum! – Lightness Races in Orbit Jul 14 '14 at 10:39
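
The 2:1 vCPU-to-pCPU ratio mentioned in the comments above is easy to sanity-check with a few lines of arithmetic. The sketch below is illustrative only: the per-VM vCPU counts are hypothetical (the thread never states them) and were chosen so the totals match the 2:1 and 1:1 figures quoted by the asker. Only the physical core count (2 x 8) comes from the question, and hyper-threading is ignored.

```python
# Physical cores on the host: two 8-core Xeon E5-2667 CPUs (from the question).
PHYSICAL_CORES = 2 * 8

# HYPOTHETICAL per-VM vCPU allocations; the thread does not give these numbers.
vcpus = {
    "prod": {"web": 2, "app": 8, "olap": 2, "sql": 4},
    "test": {"web": 2, "app": 8, "olap": 2, "sql": 4},
}


def overcommit_ratio(environments, physical_cores):
    """Return total allocated vCPUs divided by the number of physical cores."""
    total_vcpus = sum(sum(vms.values()) for vms in environments.values())
    return total_vcpus / physical_cores


all_envs = overcommit_ratio(vcpus, PHYSICAL_CORES)
prod_only = overcommit_ratio({"prod": vcpus["prod"]}, PHYSICAL_CORES)
print(f"all VMs: {all_envs:.1f}:1, production only: {prod_only:.1f}:1")
```

The point of writing it down is that adding vCPUs to production raises the ratio whenever test is also busy, so "more vCPU" is not automatically free.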

10 Answers

94

I suggest that you make the adjustments they have requested. Then benchmark the performance to show them that it made no difference. You could even go so far as to benchmark it with LESS memory and vCPU to make your point.

Also, "We're paying you to support the software with actual solutions, not guesswork."
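
A before/after benchmark only persuades anyone if the same operation is timed the same way on both sides of the change. As a rough sketch (not the answerer's method), assuming you have collected response times in seconds for one repeatable user action, something like the following compares the median and a nearest-rank 95th percentile. The function names and the sample timings are made up for illustration.

```python
import math
import statistics


def summarize(samples):
    """Return the median and nearest-rank 95th percentile of timings (seconds)."""
    ordered = sorted(samples)
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {"median": statistics.median(ordered), "p95": ordered[p95_index]}


def percent_change(before, after):
    """Percentage change in each summary figure; negative means faster."""
    b, a = summarize(before), summarize(after)
    return {name: (a[name] - b[name]) / b[name] * 100 for name in b}


# HYPOTHETICAL timings for the same "slow" fat-client action, in seconds.
before_change = [4.1, 3.9, 4.3, 4.0, 9.8, 4.2, 4.1, 4.0, 3.8, 4.4]
after_change = [4.0, 4.1, 4.2, 3.9, 9.6, 4.3, 4.0, 4.1, 3.9, 4.3]

for name, change in percent_change(before_change, after_change).items():
    print(f"{name}: {change:+.1f}%")
```

If both figures move by only a few percent in either direction, that is the "it made no difference" evidence to send back to the vendor.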

longneck
  • 10
    ...wise words. I reckon this might be the way forward, as much as it pains us to make the change. The good (?) thing is that the changes will require a reboot, and we can be clear to our business users that this is due to the vendor's request... which will almost certainly prove to be pointless. Sounds like I'm getting petty, but we're growing tired of the vendor's apparent lack of proper troubleshooting. – Simon Catlin Jul 08 '14 at 22:00
  • 6
    It's not unusual for vendors to play this sort of stunt. I think it's partly down to service level metrics - fob off, ask for more information and suggest a (pointless) workaround, because at least some of the time, the problem goes away/gets fixed in the meantime. If you've 'pull' with the vendor, having a chat with the account manager might do the trick. But don't hold your breath. – Sobrique Jul 09 '14 at 12:03
  • 1
    Had a similar situation once with a SQL server for SCCM (system center config mgr) 4 CPU 30% util avg. Console terribly slow. Bumped to 8 CPU still 30% util, console finally responds in a normal manner. – Clayton Jul 09 '14 at 14:32
  • 2
    Excellent suggestion. There's nothing quite like data to shut people up. "We will make the change you suggest. If it doesn't give the projected improvement, you eat the cost." Not sure how many systems are impacted here but your time proving them wrong QUICKLY becomes more expensive than plugging in some extra RAM. – Floris Jul 12 '14 at 01:05
67

Provided you are confident you are within the system specs they document, they should be able to back up any claim that more RAM or CPU is required. As the experts in their system, they are the ones I hold to account on this.

Ask them specifics.

  • What information provided on the system indicates more RAM is needed and how did you interpret this?

  • What information provided on the system indicates more CPU is needed and how did you interpret this?

  • The data I have - at first glance - contradicts what you are telling me. Can you explain to me why I may be interpreting this incorrectly?

  • I am interpreting this [obvious series of data] to mean [obvious interpretation]. Can you confirm I am interpreting it correctly with regards to my problem?

Having dealt with support in the past I have asked the same questions. Sometimes I was right and they were not focusing their attention on my problem properly. Other times however, I was wrong and I was interpreting the data incorrectly, or failing to include other data which was important in my analysis.

In any case, both of these situations were a net benefit to me: either I learnt something I did not know before, or I got their support team to think harder about my problem and find a decent root cause.

If the support team is unable to expand on their argument logically to a point you can be satisfied with (you need to keep an open mind yourself and be ready to accept that your interpretation of the data is wrong), then that should become very apparent in their response. Even in the worst-case scenario you can use this as a basis for escalating the problem.

Matthew Ife
  • 10
    +1 for the recognition that human error can go two ways (and making support squirm a little when they have indeed tried to "fob off"). – Cosmic Ossifrage Jul 09 '14 at 17:28
17

The big thing is to be able to prove that you are using best practices for your system allocation, notably RAM and CPU reservations for your SQL server.

All this being said, the easiest thing is to make the adjustments requested, at least temporarily. If nothing else, it tends to get vendors past their foot-dragging. I can't count the number of times I've needed to do something crazy like this to satisfy a tech on the other end of the line that it really is their software not behaving.

Tim Brigham
17

For this specific situation (where you have VMware and application developers or a third party who does not understand resource allocation), I use a week's worth of metrics obtained from vCenter Operations Manager (vCops - download a demo if needed) to pinpoint the real constraints, bottlenecks and sizing requirements of the application's VM(s).

Sometimes, I've been able to satisfy the more stubborn consumers by modifying VM reservations or changing priorities to handle contention scenarios; "If RAM|CPU are tight, YOUR VM will take precedence!". Bad-bad things have happened when I've allowed software vendors to dictate their requirements on my vSphere clusters without real analysis.

But in general, numbers and data should win out.


An example of something I used to justify VM sizing to the developer of a Tomcat application:

Dev: The VM needs MOAR cpu!

Me: Well, memory is your biggest constraint, and here's a heat map of your performance versus time... Wednesdays at 6pm are the most stressful periods, so we can spec around that peak period. Oh, and here's a sizing recommendation based on the past 6 weeks of production metrics...

[Three screenshots, not reproduced here]

ewwhite
  • 9
    I should add that analysis based on averages might lead to wrong results. There are conditions where peak performance is important but you don't see the peaks in load statistics when they are significantly shorter than your collection / averaging interval. So you might have a nice colorful "your overall utilization is <60%" stats graph but see severe performance degradation in 1-minute peaks occurring 8 times an hour at the same time. – the-wabbit Jul 09 '14 at 15:33
  • Maybe I've completely misread the question, but isn't this the *opposite* of what the OP asked? I thought they were the dev, who knew they didn't need more cpu, which the vendor was trying to sell them - it sounds like you are describe the inverse, where a dev is asking for more cpu that they don't need. – Benubird Jul 09 '14 at 15:38
  • 1
    I'm using a convenient example. I take the same approach with vendors who have rigid requirements (4 vCPU and 16GB RAM), as well as to identify undersized systems that need resources. In terms of monitoring granularity, you can go back to the host-level statistics to deal with peaks... – ewwhite Jul 09 '14 at 15:46
  • Thanks for this. We don't have vCops, but I reckon our vSphere "estate" is now mature enough to require this level of detail. I'll add this to our Capex wish list for next year. – Simon Catlin Jul 12 '14 at 08:32
  • 2
    @SimonCatlin you don't need to buy it. You can download the demo for free and use it for 60 days. It's perfect for this type of situation. – ewwhite Jul 12 '14 at 08:34
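
The warning in the comments above about averages hiding short peaks can be shown with a small simulation. This is a sketch with synthetic data, not real vCenter output: it assumes a CPU that idles at 10% and hits 100% for one minute, eight times an hour (the pattern the-wabbit describes), and then averages the hour into a single figure the way a coarse roll-up would.

```python
def rollup(samples, interval):
    """Average consecutive samples into buckets of `interval` samples each."""
    buckets = []
    for start in range(0, len(samples), interval):
        chunk = samples[start:start + interval]
        buckets.append(sum(chunk) / len(chunk))
    return buckets


# SYNTHETIC per-minute CPU readings (percent) for one hour:
# a 10% baseline with a one-minute 100% spike every 8 minutes (8 spikes).
per_minute = [100 if minute % 8 == 0 else 10 for minute in range(60)]

hourly = rollup(per_minute, 60)
print(f"peak at 1-minute resolution: {max(per_minute)}%")
print(f"hourly average: {hourly[0]:.0f}%")
```

At one-minute resolution the host is pegged eight times an hour, yet the hourly figure is a comfortable 22%. A monthly chart built from averaged intervals can therefore look healthy while users still hit the spikes, which is why the finest-grained statistics available are the ones worth sending to the vendor.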
10

I used to work in support, and part of what you're asking sounds highly rational (and probably is), but there are a few questions to ask yourself prior to just doing the "performance enhancement" they're requesting:

  • are you running at least at the vendor's stated minimum system requirements already?
  • if you're at least at minimum sysreqs, are you already at their "recommended" system settings?

Vendors will 99 times out of 100 (in my experience - both on the support side and the customer/field side) not even deal with performance-related issues until/unless the systems match what their documentation calls for. Maybe it's a system that runs fine 99.5% of the time with 1 CPU and 512M RAM - but if the system requirements say 4 CPUs and 4G RAM and you've only got 2 CPUs and 1G RAM, they're well within their rights to demand more resources be assigned*.

It is probable that they're asking you to increase system resources because of something they found in the lab/development wherein an issue magically disappears if you cross a specific threshold; if this is the case, yes it's an example of potentially-poor debugging on their end, but keep in mind they don't have time to eliminate every possible bug/issue that arises - some just need to be worked-around, and if that is the case here, just go with it.

There's also a not-insignificant chance that the issues you're seeing aren't even part of "their" software, but a component they rely on from some other source (vendor, OSS library, etc). I ran into this exact situation related to swap size, BEA WebLogic, and the Sun JRE at a customer a few years ago.

tl;dr:

In short, work with their support team, escalating as needed, until you find a resolution - but don't be surprised when some of the suggestions/debugging steps/fixes sound off-the-wall or pointless.


*If it truly doesn't "need" those extra resources, you're likely in a place to be able to file a doc bug / RFE for future versions - but don't push that route until you've demonstrated it's not the issue at hand
^an eBook I wrote you may find helpful on the topic: Debugging and Supporting Software Systems

warren
  • 2
    Anything performance related takes a lot of time and resources to troubleshoot and diagnose. After all, there's nothing that's _broken_ so you have to trace through painfully. – Sobrique Jul 09 '14 at 14:14
  • 1
    @Sobrique absolutely - and they're usually in pretty remotely-related (even apparently unrelated) segments of the product at hand – warren Jul 09 '14 at 14:21
  • That's a good point, a lot of debugging steps can be very counter-intuitive, although I don't think it would be unreasonable to insist that they provide a reason for doing it. If they can't say what benefit doing something will provide (even if it's just "to see if it affects X") then either they're working through a checklist that they don't understand, or they have no idea and are making wild guesses, or they are hiding something - none of these are very encouraging. – Benubird Jul 09 '14 at 15:48
  • @Benubird - sadly some of these things come down to gut instinct or "it fixed it somewhere else ..." :( – warren Jul 09 '14 at 16:30
  • 2
    "it fixed it somewhere else" is a terrible reason to do something. True, sometimes there isn't time to properly debug a problem, and you have to go by gut instinct, but the thought of it still makes me shudder. I've seen plenty of bugs that "appeared" to be fixed by doing X, only to discover later that the problem was actually in something seemingly totally unrelated, which went onto cause more problems elsewhere until we figured it out. – Benubird Jul 09 '14 at 16:49
  • @Benubird - no argument ... just saying that's what happens :) – warren Jul 09 '14 at 16:53
8

Either ask to escalate the ticket or ask for a different rep. Depending on which vendor it is escalation may help if you say that you feel that the current level of support doesn't adequately address the issue. If they will not escalate then asking for a different rep may help because that requires much less "justification" since all it needs is to not be happy with the current one.

If it's a large vendor then simply closing the ticket and opening a new one on the same issue may work as it may be routed to a different rep, but I'd advise against it because it's poor form.

You could also stand your ground and ask for a rationale as to how more RAM/vCPU will help, or you could just give it more RAM/vCPU to prove that it won't help.

Reality Extractor
4

I'll throw in my two cents. We've been pretty successful with this approach -- much better outcomes and less frustration on everyone's part. It requires a lot more effort than the blame-game and blindly adding resources, but it also has better chances of finding the underlying problem.

When we have serious issues with our on-premise apps that are backed by vendor support contracts, and the vendors begin their dodge shuffling dance (which always seems to include outlandish non-data-driven demands for more CPU or RAM), we tend to do these 3 things:

  1. Escalate the priority to system-down equivalent -- they usually balk, but usually back down when you explain it is effectively unusable even if it's technically "working". Treat it as a serious problem for them to solve. Around here we refer to that as a tiger team, which meets daily to get status updates from all the stakeholders. Usually the vendor will be asking you to change stuff. If it's a prod system, that's problematic, but if you want them to help, you will need to accept the responsibility to help them isolate the problem, so it helps if you've got a dev/staging environment where you can run tests.

  2. Tell the vendor you want them to replicate your environment, so that THEY can isolate the problem in their lab. They can even host stuff in some cloud environment if need be. It does not have to be an exact match of your environment, although that would be ideal. The point is that you want the VENDOR to be actively trying to replicate your problem, so that they can test their guesswork on their system instead of yours. Ask them for the diagrams, specs, etc of that replicated environment to make sure they are doing it.

  3. Provide them (under NDA of course) with your actual dataset so that they can run/replay it for real instead of guessing. In our case, most of our vendor-provided app issues (both transient and chronic) frequently turn out to be issues with the accompanying vendor-provided databases. I cannot count the number of times we've done this and they have finally pinpointed the problem down to something unexpected in the actual data -- weird artifacts from app upgrades 2 years ago where something didn't convert cleanly; stale records exposing a problem with the GC settings; queries not working quite right because OUR data values were breaking some transmog routine in the vendor code, etc. Stuff we would never be able to identify on our own.

We've done this with quite a few vendors over the last few years, and they are initially very resistant to doing it our way. However, after it works, it always comes up as a positive highlight in the quarterly reviews we hold with our vendors. And it helps cement our technical relationship with those vendors. They don't want vague problems. They do want specific problems that they can analyze to improve their products.

Hope the suggestion helps. I know it's not a one-size-fits-all approach, but if you can swing it I think you'll find it worthwhile.

pdapel
3

The real question is, who is in charge here? If you can't realistically switch to an alternative vendor, then they have the power, and all you can really do is go along with whatever they say and hope it will work out. Not a happy situation! Otherwise, I suggest you ask for another rep (as others have said), but make it clear you are not happy with the service and will look elsewhere if they cannot do the job.

Don't just "make the adjustments they suggested" if you're sure they won't work, as that is setting up a pattern for your relationship that will hurt you in the long run. You are paying them to provide you a service, and they shouldn't be able to dictate your actions any more than someone I hire to paint my house can dictate what colour it will be.

This may sound drastic, as it sounds like this is not a hugely critical issue, but the fact is that if they are messing you around on something minor, they will likely do the same for something big, and the last thing you want is to run into some sort of horrible charlie foxtrot six months down the line and have the same trouble with the vendor then.

Make sure that whatever steps you take to resolve the issue now, will work equally well when you're two days from a deadline and everything breaks...

Benubird
  • 4
    I'd have thought it'd give ammunition in a counter argument - you asked us to do this nonsensical thing last time; we did as a gesture of goodwill. This time we want some more detail as to your reasoning why this will make any difference. – Sobrique Jul 09 '14 at 12:04
  • @Sobrique That makes sense, and it might work out that way - I don't know enough psychology to say one way or another. My instinct though, is that if you've done something now just because they said to - effectively admitting they know more than you - they'll expect the same in the future. Either way, if you're having to argue with them (ammunition or not) you're already wasting time that could be spent solving the problem. – Benubird Jul 09 '14 at 15:35
  • "We did it your way last time. You were wrong. Are you prepared to accept that you might be wrong again? We do have precedent here." – Sobrique Jul 09 '14 at 15:38
3

I'm going to post a view from the vendor's side.

We had this customer that had this recurrent problem where the performance of the software would drop off every few hours or so to some truly abysmal rate then come back a few hours later.

The built-in profiler in the system indicated the system CPU (or possibly memory) speed was disgustingly slow, something like 100 MHz rather than the expected 2 GHz. Doubling the CPU provided by the VM didn't change the symptom and they thought we were being wasteful.

As they couldn't get a faster CPU (more CPUs wasn't going to help), we then tried swapping TEST and PROD VMs. The problem then showed up on TEST the next day. Then we tried promoting one of the clients to a standalone (serverless) instance. No problem on that workstation while the server was choking.

They produced reports from the VM host indicating no performance problems and tried again to claim it was an application problem.

Finally I [an engineer] (I had zero support from those in dedicated support roles) asked specifically for a physical box. The customer screamed bloody murder but with nobody having any other potential solution they did it. What do you know, the problem magically disappeared.

We never did find out what the problem was. All benchmark programs showed normal but the application profiler was telling us computing resources simply weren't adequate. There's kind of a specific signature we look for in the profiler now. If we see it, we know before we get any farther the problem is VM interaction, but it just wasn't known at the time.

They sure thought I was full of it. I wasn't. I was out of options.

EDIT, Update from years later:

With more and more customers wanting to run on VMs and management willing to attempt to solve the problem at all costs, we got good VM hardware. I was able to construct a specialized VM burn program that ran in userspace (and required no privileges) on two single-core VMs with 512 MB RAM, and that was able to drain 1/3 of the memory performance out of another single-core VM, with only 4 total cores out of 16 in use on the VM host and most of its RAM still free. The program raised no alarms, and showed nothing out of the ordinary on the VM host or any of the guests, except that memory access was slow.

Now we can tell customers we know that there is a problem with VMs, and it's not our software. We still get customer requests from time to time for VM compatible software. I wonder why management doesn't let support tell them we were able to develop a piece of software that slows down every other VM on the same host.

The scary thing is that the technique involved is a simple transform of a well-known programming technique involving lock-free synchronization. Hundreds of software vendors could have this VM drainer in their software and not know it. Getting an atomic instruction lock that hotly contested should be rare, but it is not impossible. The amusing part of it all is that I was getting the lock to contest ACROSS VMs.

joshudson
-3

I would suggest a very different approach to the ones mentioned so far. Before arguing with the vendor, why not look more closely at the problem reported and see what that tells you?

What are the actual problems being reported, and what are the users' expectations? If a user is saying something "takes too long", ask them exactly what 'it' is (so you can reproduce it), how long they think it should take, and why they think it should take that long. If their expectations are reasonable, measure the actual performance and system impact of what they are trying to do. The fact that your system shows a 30% spike over a month does not mean it is not running at >100% when the user is trying their query. If you can demonstrate to your vendor that CPU and memory are not being strained by the problematic task, then you can ask the vendor to justify recommendations that will cost you money.

Paul Smith
  • 1
    The whole first half of your suggestion seems to have been done already. The whole second half is exactly what the OP is asking. – Chris S Jul 10 '14 at 17:52
  • I would disagree. There is no evidence presented of problem analysis and the cpu and mem figures quoted are monthly aggregations that have no apparent relevance to the issue at hand. – Paul Smith Jul 11 '14 at 14:13