Oracle 10g on RHEL - All memory gets eaten up nightly at 20.15

Question

Okay, to start things off, I'm brand new to Oracle DB. Much experience with Microsoft products (boo) and Ubuntu linux, but RHEL and Oracle are very new to me.

The environment ... Oracle DB 10g Standard v10.2.0.1.0 - 64bit, on RedHat Enterprise Linux 5. Total database size is less than 8GB.

The issue ... Every night at 20.15 on the dot, the server's memory drops to ZERO. We previously had 4GB of RAM in the server, so our memory statistics always showed ~100MB available. We never knew there was an issue besides insufficient RAM until recently. The other day, we upgraded 4GB -> 12GB, and saw some legitimate memory available. Please reference the attached images for details.

Daily Memory Usage : This graph shows a period of 24 hours, where the beginning until 20.15 shows how memory usage is during the day (when users are heavily pounding the server). At 20.15, after all users have been gone from the building for over 2 hours, the memory pretty much disappears until the server is rebooted.

Weekly Memory Usage : This graph shows the period of time before we upgraded the memory, and the three days since upgrading. As you can see, there is essentially no physical RAM available at all before the upgrade, though there is memory being used for cache. After the upgrade, we have a huge chunk of available RAM and cache -- until 20.15, when both disappear. Each time you see the amount of RAM jump is immediately after a reboot. This available RAM lasts until, you guessed it, 20.15.

The vendor who built this server has absolutely no answer for us. In fact, what they tell us is absolutely ridiculous, and makes them look grossly incapable. They truly have no idea about anything, and it's obvious. So there's no way we're going to get an answer that way. About a week ago, we were assured that the server did not need more RAM and had more than sufficient resources. It was also built with only two physical hard disks (2x146GB 15K RPM), I believe set up as a RAID1.

I have checked (what I believe to be) all scheduler jobs, all cron jobs, and any other timed tasks/jobs I can possibly find. I have disconnected all idle database sessions to no avail. The only evidence I can find pointing to the culprit is an Oracle process that begins taking up ~50% of the CPU after 20.15. During the day, there are a few dozen (~40+) Oracle processes that show about 2.2GB worth of VM usage each -- this holds true immediately after a reboot as well, and also after the 20.15 "event".

I'm beyond stumped. And our software/hardware vendor is worthless.

Any suggestions or help would be greatly appreciated! Thank you! ~Laz Peterson

You said the vendor told you something ridiculous. What was it? — Michael Hampton, Mar 16 '13 at 22:12
The exact quote: "All process or automation will run accordingly at their designated time which in most cases, between 9 to midnight." That is the closest answer we have received to the problem (out of about 5 independent questions). Other answers they have given us are truly irrelevant, and not worth wasting the space to post. They have yet to acknowledge what is sucking up all the resources. — Laz Peterson, Mar 16 '13 at 22:15
@vonbrand: You're right, that answer isn't too ridiculous. I'm just super-frustrated, as that's the best answer they've given. The other answers have usually detracted from the question, such as them telling us that there is more than enough memory, etc. — Laz Peterson, Mar 17 '13 at 01:47
@vonbrand: Regarding the memory usage, we have users complaining that their application (an electronic medical records app) freezes often while saving, sometimes up to 30-45 seconds. I'm guessing this has something to do with the disks and the caching. But my biggest concern is why the memory disappears every day at 8.15p, and why the maker of the server is unable to tell me why. Thanks for your comments and info. — Laz Peterson, Mar 17 '13 at 01:49
You have to get the insights from the database at that time. Since you have SE, the AWR is not an option, however, you still can use good old Statspack. See here to get started: http://www.akadia.com/services/ora_statspack_survival_guide.html — grassroot, Mar 17 '13 at 07:26
Here's an additional image showing CPU usage, CPU average load, swap IO, swap bytes, and memory usage. Root/data drive (not included) IO and bytes during this time are trivial. Pictures speak a lot more than my words: [All Stats](http://paravis.net/oracle/oracle-memory-issue.jpg) -- you can see where it all starts. Maybe I'm just looking too deep in, but logically, it seems there should at least be an answer? Will try out Statspack. Thanks again. — Laz Peterson, Mar 17 '13 at 17:02
Since you know the time of occurrence of issue, can you collect /proc/meminfo over a period of 10 minutes before the issue. A simple for loop with some sleep should do it. And then pastebin the output. — Soham Chakraborty, Mar 18 '13 at 17:22
@Soham: thank you, I will put together a script and check the meminfo too. — Laz Peterson, Mar 19 '13 at 19:24
@grassroot: I'm all set to run some Statspack snapshots this evening. Hoping for some good info! Thanks again everyone. — Laz Peterson, Mar 19 '13 at 19:24
Related http://serverfault.com/questions/85470/meaning-of-the-buffers-cache-line-in-the-output-of-free?lq=1 — kubanczyk, Sep 29 '13 at 21:52

score 1 · Answer 1 · answered Mar 16 '13 at 22:54

First off, what the Oracle people say isn't a "ridiculous answer", it is just that Oracle is running some housekeeping tasks at that time, probably through cron. If that time is inconvenient, you should ask how to reschedule them.

Your memory usage graphs look fine. When Linux loads data into memory, it stays there until the space is needed for something else (that is what free(1) reports as cache/buffers). The logic behind this is that deleting it is explicit work; if the data is needed again, it is available for free. There are 2GiB really free there. Unless this changes (i.e., memory leaks somewhere) I would not worry for now.

score 1 · Answer 2 · answered Mar 18 '13 at 12:21

Take two Statspack snapshots, 20:00 and 00:00. Generate a report between them. It will tell you what causes the high load. Especially look for "Top 5 Timed Events" and then SQL queries causing the misbehaviour in question. Additionally take a look at Oracle's memory "Advisories" for additional suggestions.

Based on details currently available, I believe the memory consumption itself is not an issue. Something load consuming transactions start at 20:15, which cause buffered data to be fetched from the swap.

Take a look at Red Hat's recommendations for Oracle in http://www.redhat.com/f/pdf/rhel/Oracle-10-g-recommendations-v1_2.pdf. Especially set the 'swappiness' to 0 (or close) and take Huge Pages into use.

Oracle 10g on RHEL - All memory gets eaten up nightly at 20.15

2 Answers2