What's causing AppCrash and BSOD events, general instability?

SOLUTION: It was the RAM settings all along :-| It never occurred to me that the stock settings on a stock board with stock RAM would be so far off that it'd cause system instability. I've never done any overclocking, so I never looked very closely at those settings. Once I chose the DOCP profile that matched my RAM, everything cleared up, and it's even a little faster. Thanks to Twisty Impersonator for the process guide and to magicandre1981 for the suggestion that prompted me to check the settings. Hopefully, this will save someone else 2 years of frustration.

EDIT: Well, I think the cause has become clear. After replacing ALL the hardware, and STILL seeing a problem, I decided to go back to the hardware idea. In short: if I run with two sticks of RAM, everything is fine. It doesn't matter which two sticks. If I put in all four, I start having problems. This seems like a pretty clear indication of a bad motherboard.

The Symptoms:

For the last several years my machine has been generally unstable, off and on. It typically manifests as BSODs with varying stop codes.

  • Upgrading the RAM improved the stability for a while.
  • Upgrading the motherboard improved the stability for a while.
  • Replacing the C: drive improved the stability for a while.
  • Refreshing or reinstalling the OS has occasionally been necessary, and usually improves stability for a while.

I have replaced literally every functional component in the system, except the CPU and Blu-ray drive. I have not ruled out the CPU, but there is still a vast swath of software-"things" that might also be at fault.

Each time, the problem has returned after a few months.


Most recently, the symptoms have changed slightly. I am open to the possibility that this is a completely unrelated problem, but it seems too similar to the problems I have been battling the whole time, to be mere coincidence.

A few weeks ago I rebooted my computer to update, and it would not POST. I fussed with it for a while (checking connections, MemOK! button, disconnect power, TPU on/off, EPU on/off, etc.) and got it to POST, but the OS would not load. I forget the exact presentation of symptoms, but IIRC it would just sit and spin.

Reinstalled the OS and things were quiet for a week or so, until apps began crashing. At first, it seemed like all the apps that were crashing were installed on the same SSD. Without room to move things around and test, I upgraded to the new Samsung drives. But apps are still crashing.

  • Flashed latest BIOS update. No change.
    • Turns out, you have to reset the CMOS when you flash the BIOS. Potential symptoms are much like mine. I reset the CMOS. No change.
  • It was generally high-demand applications that would crash (Dishonored 2, Diablo III, ESO, etc). But crashes are happening between 35°C-45°C for CPU and GPU - So probably not temperature.
  • It is not running out of RAM.
  • MemTest has never shown any problems. I have run it dozens of times.
  • No CPU test has ever shown any issues, except at high temperatures.
  • No GPU test has ever shown any issues, except at high temperatures.
  • I've reinstalled my video drivers a few dozen times.
  • I had Task Manager crash while I was watching yesterday.
  • Tried to install a Windows Store App. Some background process crashed. Had to try again. Worked fine.
  • Event Viewer has just AppCrash events

AppCrash events are being produced by a wide range of applications. Varying sizes, locations, demands, etc. It is typically once a day, maybe less. But high-resource applications crash pretty reliably within 30 minutes or so.

I should clarify that these are not "Windows is looking for a solution" AppHang events. The application just vanishes, like I closed it, and Windows has nothing to say about it except the AppCrash event in the Event Viewer. Less often, there is a BSOD. Lately, I have seen IRQL_NOT_LESS_OR_EQUAL, and others that I cannot remember... (I don't have any memory dumps anymore? That's weird...).

System specs:

  • OS: Windows 10 Pro (upgraded from Win7 during free upgrade period)
  • CPU: AMD Phenom II 1090 (no overclocking)
  • Cooling: CoolerMaster 150mm CPU fans, several case fans
  • Mainboard: ASUS M4A99X EVO R2.0
  • RAM: G.Skill 16GB(4x4) DDR3-1333
  • GPU: MSI GTX 970 (no overclocking)
  • PSU: Corsair CX750M
  • System drive: Samsung 850 EVO 500GB
  • Other drives: Samsung 850 EVO 500GB, other conventional drives, optical drive
  • A/V: Windows Defender, no other AV

Crash dump:

Prompted by this post: https://superuser.com/questions/1281659/possible-to-determine-which-core-a-faulting-application-was-on-when-it-crashed

Hit a new BSOD while it was idling last night. Details from WhoCrashed below:

Crash dump directory: C:\WINDOWS\Minidump
Crash dumps are enabled on your computer.

On Wed 1/3/2018 9:00:13 AM GMT your computer crashed
crash dump file: C:\WINDOWS\Minidump\010318-12546-01.dmp
This was probably caused by the following module: ntoskrnl.exe (nt+0x1640E0)
Bugcheck code: 0x1E (0xFFFFFFFFC0000005, 0xFFFFF8019CED183E, 0xFFFF968442FBEB68, 0xFFFF968442FBE3B0)
Error: KMODE_EXCEPTION_NOT_HANDLED
file path: C:\WINDOWS\system32\ntoskrnl.exe
product: Microsoft® Windows®
Operating System company: Microsoft Corporation
description: NT Kernel & System
Bug check description: This indicates that a kernel-mode program generated an exception
which the error handler did not catch. This appears to be a typical software driver bug
and is not likely to be caused by a hardware problem.  The crash took place in the Windows
kernel. Possibly this problem is caused by another driver that cannot be identified at this time. 

On Wed 1/3/2018 9:00:13 AM GMT your computer crashed
crash dump file: C:\WINDOWS\memory.dmp
This was probably caused by the following module: ntdll.sys (ntdll!ZwFlushBuffersFile+0x14)
Bugcheck code: 0x1E (0xFFFFFFFFC0000005, 0xFFFFF8019CED183E, 0xFFFF968442FBEB68, 0xFFFF968442FBE3B0)
Error: KMODE_EXCEPTION_NOT_HANDLED
Bug check description: This indicates that a kernel-mode program generated an exception
which the error handler did not catch. This appears to be a typical software driver bug
and is not likely to be caused by a hardware problem.  A third party driver was identified
as the probable root cause of this system error. It is suggested you look for an update for
the following driver: ntdll.sys.
Google query: ntdll.sys KMODE_EXCEPTION_NOT_HANDLED
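
For anyone reading the report above, the bugcheck line carries more than the name. For bug check 0x1E, Microsoft documents the first parameter as the unhandled exception code and the second as the address where the exception occurred. Here is a minimal Python sketch that pulls those out of the line. The function names and the two tiny lookup tables are my own for illustration and are far from complete.

```python
import re

# Tiny, deliberately incomplete lookup tables (illustrative only).
BUGCHECK_NAMES = {0x1E: "KMODE_EXCEPTION_NOT_HANDLED"}
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def parse_bugcheck(line):
    """Split a 'Bugcheck code:' line into the code and its list of parameters."""
    match = re.search(r"Bugcheck code:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)", line)
    if match is None:
        raise ValueError("not a bugcheck line")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe(line):
    """Name the bugcheck and decode the first two parameters of a 0x1E report."""
    code, params = parse_bugcheck(line)
    # The report shows the exception code sign-extended to 64 bits;
    # NTSTATUS values are 32 bits wide, so mask off the upper half.
    status = params[0] & 0xFFFFFFFF
    return {
        "bugcheck": BUGCHECK_NAMES.get(code, hex(code)),
        "exception": NTSTATUS_NAMES.get(status, hex(status)),
        "fault_address": hex(params[1]),
    }


line = ("Bugcheck code: 0x1E (0xFFFFFFFFC0000005, 0xFFFFF8019CED183E, "
        "0xFFFF968442FBEB68, 0xFFFF968442FBE3B0)")
print(describe(line))
```

So both reports describe the same event: an access violation (0xC0000005) raised in kernel mode at the same address. That says what happened, not which component caused it.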

Memory dumps (full and mini) will be here, as they are available: https://1drv.ms/f/s!AhSzRvnavkrXhPpNy8Qjhaj6LbbTwQ


@magicandre1981 recommended chkdsk /f based on the results of my memory dump. C: is the only drive for which a pagefile is enabled (it's system managed), so that's the one I ran it on. Here are the results:

Checking file system on C: The type of the file system is NTFS.

A disk check has been scheduled.
Windows will now check the disk.                         

Stage 1: Examining basic file system structure ...
  605184 file records processed.
File verification completed.
Deleting orphan file record segment 699DD.
  10717 large file records processed.
  0 bad file records processed.
Stage 2: Examining file name linkage ...
  14846 reparse records processed.
  704776 index entries processed.
Index verification completed.
  0 unindexed files scanned.
  0 unindexed files recovered to lost and found.
  14846 reparse records processed.
Stage 3: Examining security descriptors ...
Cleaning up 1426 unused index entries from index $SII of file 0x9.
Cleaning up 1426 unused index entries from index $SDH of file 0x9.
Cleaning up 1426 unused security descriptors.
Security descriptor verification completed.
  49797 data files processed.
CHKDSK is verifying Usn Journal...
  37651904 USN bytes processed.
Usn Journal verification completed.
CHKDSK discovered free space marked as allocated in the
master file table (MFT) bitmap.
CHKDSK discovered free space marked as allocated in the volume bitmap.

Windows has made corrections to the file system.
No further action is required.

 487284001 KB total disk space.
 209659436 KB in 259738 files.
    162276 KB in 49798 indexes.
         0 KB in bad sectors.
    729085 KB in use by the system.
     65536 KB occupied by the log file.
 276733204 KB available on disk.

      4096 bytes in each allocation unit.
 121821000 total allocation units on disk.
  69183301 allocation units available on disk.

Internal Info:
00 3c 09 00 f0 b8 04 00 7e 93 08 00 00 00 00 00  .<......~.......
98 05 00 00 66 34 00 00 00 00 00 00 00 00 00 00  ....f4..........

Windows has finished checking your disk.
Please wait while your computer restarts.

No luck. Even after chkdsk fixed these issues, I'm still having the same crashes, though no new BSODs yet.
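
As a sanity check on the summary chkdsk printed, the space figures can be added up. This is only arithmetic on the numbers in the log above, sketched in Python. The variable names are mine, and the log file is left out of the sum on the assumption that it is already counted under "in use by the system" (the totals only balance that way).

```python
# Figures copied from the chkdsk summary (sizes in KB unless noted).
total_kb = 487284001
in_files_kb = 209659436
in_indexes_kb = 162276
bad_sectors_kb = 0
system_kb = 729085            # assumed to already include the 65536 KB log file
available_kb = 276733204

bytes_per_unit = 4096         # allocation unit size in bytes
units_available = 69183301    # free allocation units

used_kb = in_files_kb + in_indexes_kb + bad_sectors_kb + system_kb

# Used plus free space should account for the whole volume.
print("used + available == total:", used_kb + available_kb == total_kb)

# Free allocation units converted to KB should match the reported free space.
print("free units match free KB:",
      units_available * bytes_per_unit // 1024 == available_kb)
```

Both checks hold for these numbers, so the volume accounting is consistent after the repair and there are no bad sectors in the tally.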


Another BSOD as I was opening the browser to update this question. Memdumps available once they finish uploading.

But the original reason I came to update is that I found a whole crapton (51 to be precise) of events that look exactly the same. It looks like they happened about every half-hour, starting right after I left for work (7:30am) until about 8:30pm. They might still be happening. They all look like exactly this:

Fault bucket 0x1E_c0000005_fltmgr!FltpPreFsFilterOperation, type 0
Event Name: BlueScreen
Response: Not available
Cab Id: 0

Problem signature:
P1: 1e
P2: ffffffffc0000005
P3: fffff8019ced183e
P4: ffff968442fbeb68
P5: ffff968442fbe3b0
P6: 10_0_16299
P7: 0_0
P8: 256_1
P9: 
P10: 

Attached files:
\\?\C:\WINDOWS\Minidump\010318-12546-01.dmp
\\?\C:\WINDOWS\TEMP\WER-18531-0.sysdata.xml
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER5795.tmp.WERInternalMetadata.xml
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER57A5.tmp.csv
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER57B6.tmp.txt
\\?\C:\Windows\Temp\WER8F12.tmp.WERDataCollectionStatus.txt

These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\Kernel_1e_b49232881f44bde28acca17f0ad8bac3b4fbb67_00000000_cab_031c57c4

Analysis symbol: 
Rechecking for solution: 0
Report Id: 3c2abe43-d7d6-4561-9b0d-2adf1f40c745
Report Status: 388
Hashed bucket: 
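
With 51 of these, it helps to confirm they really are the same event rather than eyeballing each one. Below is a rough Python sketch that pulls the fault bucket and the non-empty P1..P10 fields out of an event's text and tallies the buckets. It assumes the event text has already been copied out of Event Viewer as plain strings. The function name and the sample list are hypothetical.

```python
import re
from collections import Counter


def parse_wer_event(text):
    """Extract the fault bucket and the non-empty P1..P10 fields from one event."""
    bucket = re.search(r"^Fault bucket (.*), type \d+\s*$", text, re.MULTILINE)
    # Only keep signature fields that actually have a value after the colon.
    fields = {
        name: value.strip()
        for name, value in re.findall(r"^(P\d+):[ \t]*(\S.*)$", text, re.MULTILINE)
    }
    return {"bucket": bucket.group(1) if bucket else None, "signature": fields}


sample = "\n".join([
    "Fault bucket 0x1E_c0000005_fltmgr!FltpPreFsFilterOperation, type 0",
    "Event Name: BlueScreen",
    "Problem signature:",
    "P1: 1e",
    "P2: ffffffffc0000005",
    "P9: ",
    "P10: ",
])

# Stand-in for the real list of event texts copied from Event Viewer.
events = [sample] * 3
counts = Counter(parse_wer_event(event)["bucket"] for event in events)
for bucket, count in counts.most_common():
    print(count, bucket)
```

If every event lands in a single bucket with the same P2 to P5 values as the original bugcheck, they are most likely repeated reports of the one crash rather than 51 separate crashes.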

I have a hard time believing that the CPU would have this issue for so long, and the computer still be functional. I haven't had much success exploring software/configuration issues.

Any ideas?


Almost 3 weeks later.... After MUCH shenanigans, I finally acquired a new CPU (upgraded from Phenom II to FX-8350). Replacement was easy enough. I then probed the common problem areas, and apps are still crashing.

As soon as I posted "sad-face," Windows told me something about a "Device Health Report." It reports trouble with a driver. Unfortunately, but unsurprisingly, the Troubleshooter was unable to detect any kind of problem. I uninstalled the two "USB Root Hub" devices in error state from the Device Manager.

It rhymes with Pool

Does this provide any additional clues? I'm really at a loss, now...


Here is a list of driver information...? https://docs.google.com/spreadsheets/d/1xAliAOt1s8rQ_ePX5OwTRVFPB3kFYgc3-1HRUznMpR0/edit?usp=sharing

mHurley

Posted 2018-01-03T02:35:32.620

Reputation: 163

share the dmp files so that we can debug them – magicandre1981 – 2018-01-03T17:09:12.797

Will do! I'll add links to the main question as soon as they're available. – mHurley – 2018-01-03T18:02:48.710

Thanks for the extensive edit, @flolilolilo That's much easier to read, now. – mHurley – 2018-01-03T18:13:05.537

analyzing the dump shows it crashes while doing volume shadow operation (CVssQueuedVolume::OnOpenVolumeHandle). so run chkdsk /f to check HDD file system for errors. – magicandre1981 – 2018-01-03T18:58:10.010

Excellent information! I'm still mystified that people can get that kind of information out of that file. Running chkdsk will certainly help with that problem, but is there good evidence to show that this is the cause of ALL the AppCrash events I've been seeing recently? – mHurley – 2018-01-03T23:58:31.337

Added chkdsk results to my question. – mHurley – 2018-01-05T03:26:30.920

ok, chkdsk fixed NTFS issues. now wait if you get new crashes (BSOD or app crashes) – magicandre1981 – 2018-01-05T11:40:19.097

No luck :-( Still crashing. – mHurley – 2018-01-07T22:20:50.100

what crashes? BSOD or app crash? Which process? – magicandre1981 – 2018-01-08T16:35:17.497

AppCrash - Diablo III. I suspect there were others (browser acting weird, apps seemed not to load when I launched them), but I haven't had a chance to track them down, yet. – mHurley – 2018-01-08T21:20:48.837

Looks like there are many other Application Error events, but it's hard to tell. I literally opened the browser to update this question, when I had another BSOD. New MemDumps coming soon. Also, I have a lot of "info" events about the BlueScreen. I'll post an example in the question. – mHurley – 2018-01-09T03:12:54.490

could be HW issue, last dump shows (IP_MISALIGNED, MODULE_NAME: hardware). so yes, it could be the AMD Phenom(tm) II X6 1090T that fails. – magicandre1981 – 2018-01-09T17:34:44.690

:-( Sadface-making – mHurley – 2018-01-09T18:21:59.633

Alright, just BSOD. Last set of memdumps, just to see if there's anything interesting. Available at the usual link, once they've uploaded. – mHurley – 2018-01-11T04:41:37.267

last dump shows this : " *** Memory manager detected 1 instance(s) of page corruption, target is likely to have memory corruption." so still HW issue – magicandre1981 – 2018-01-11T16:21:36.857

Well... I guess that does it. If it's hardware, really the most likely candidate is the CPU. It's not really the answer I was hoping for, but I'm more confident now that it's the real answer. Thanks for all your help, guys. – mHurley – 2018-01-12T04:17:29.340

if you already changed motherboard and RAM the CPU could be the issue. look on ebay if you can find a x6 replacement CPU for small amount of money – magicandre1981 – 2018-01-12T16:32:15.420

:-( New CPU. Still crashing. No BSOD, yet, so IDK what kind of information I can get. – mHurley – 2018-01-26T02:11:13.957

...and suddenly, there's new information. See edit. – mHurley – 2018-01-26T02:17:29.877

share the new dumps – magicandre1981 – 2018-01-26T16:21:07.380

No BSOD, so no new dumps. Just AppCrash. I've been running verifier for the last 36 hours. Standard settings, no BSODs from that either. – mHurley – 2018-01-26T23:19:52.257

if you have no BSOD this is good. app crashes can be caused by a lot of other things. look in eventlog / Reliability Monitor for details about which applications crash: https://lifehacker.com/how-to-troubleshoot-windows-10-with-reliability-monitor-1745624446 – magicandre1981 – 2018-01-27T07:25:05.980

New info... see edit. – mHurley – 2018-02-02T04:15:33.390

ok, so your board has stability issues when using all RAM modules. try to increase voltage of RAM a bit (only a small amount otherwise you kill the RAM) – magicandre1981 – 2018-02-02T05:20:05.273

Interesting... it never occurred to me that stock RAM on a stock board might need a voltage tweak. Could this be true even if I'm not overclocking anything? I've always been intimidated by these settings, before. What do you mean by "small?" – mHurley – 2018-02-03T14:16:34.480

only a small voltage increase – magicandre1981 – 2018-02-03T16:34:31.837

Answers

2

Divide & Conquer

First, you must try to determine if this is hardware or software issue. Sometimes it involves both, but initially it's best to assume not.

In my experience, the most effective way to determine which camp is at fault is to boot to a second, completely different OS (without changing any hardware, mind you) and attempt to reproduce the problem. It's best to use an OS that doesn't use any of the same code as the suspect OS. For example, if your suspect system runs Windows, you could use Ubuntu for your test OS. Live CDs are good for this.

With intermittently occurring problems this can be challenging, but however you go about it, you need to know if:

  • Both OSes are affected, meaning you have a hardware issue, or
  • Only your suspect OS is affected, meaning you may have either:

    • A software issue, or
    • An incompatibility between a hardware component and specific software (which is almost always a 3rd party driver).
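
The decision above can be written down as a small function, which makes the "inconclusive" case explicit. This is only a sketch of the reasoning in Python. The function name and the wording of the results are my own.

```python
def classify(fault_in_suspect_os, fault_in_test_os):
    """Map what was observed in each OS to the most likely class of cause."""
    if fault_in_test_os:
        # A fault in an OS that shares no code with the suspect OS
        # points at the hardware underneath both.
        return "hardware"
    if fault_in_suspect_os:
        return "software, or a hardware/driver incompatibility"
    # Nothing reproduced anywhere: keep testing before concluding anything.
    return "inconclusive"


print(classify(fault_in_suspect_os=True, fault_in_test_os=True))
print(classify(fault_in_suspect_os=True, fault_in_test_os=False))
print(classify(fault_in_suspect_os=False, fault_in_test_os=False))
```

With an intermittent fault, a clean run in the test OS is weak evidence, so treat the second result as provisional until the test OS has had enough run time.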

If you think it's hardware

You've already tested and replaced a lot of components. If the unwanted behavior manifests itself in your test OS, you are armed with conclusive evidence that something you've not yet replaced is at fault. For those components that don't lend themselves to comprehensive testing (e.g. the motherboard), you'll probably want to try replacing other, less costly components first, but eventually you may have no choice but to swap the more expensive components as well.

If you think it's software

If the test OS doesn't trigger the fault, you can be more confident there's a problem with the software in your target OS. However, if the failure has historically not been able to be produced on-demand or otherwise occurs only intermittently, there remains a chance it's still a hardware issue that simply wasn't triggered in the test OS. Don't dwell on this; just keep it in mind when testing your tentative solutions.

When sorting out what code is at fault, you obviously want to follow up on specific error messages, such as Windows' bugcheck codes, errors logged in the event logs, or in application-specific logs. I'll skip over these steps based on the assumption you've exhausted those leads and need a more general approach.

When it's unclear what software is at fault, your weapon of choice is to remove the software from the equation and run the system long enough to give the problem a chance to occur, if it's going to. You can do this in any of the following ways:

  1. Uninstall the software.
  2. Disable it using a tool such as Microsoft AutoRuns.
  3. Disable it by booting into Safe Mode.
  4. Create a second Windows installation without the software in question (useful if you really need the software for day-to-day use and want to be able to easily switch between "testing" and "production" mode).

When doing this I like to categorize the system's software as follows and troubleshoot accordingly:

  1. Windows own code and inbox drivers. Least likely to be at fault. Easily confirmed by testing the system using a pristine install (one without any 3rd party code).
  2. Third party drivers. Always causing trouble. Usually crash in non-random ways such that a pattern emerges. Test by using different driver versions, or by swapping out hardware components.
  3. Third party system-level software (e.g. security software). Troublesome. These are rarely required for proper system operation and can be completely uninstalled in order to test their influence.
  4. User applications. Highly variable crash behavior. On modern versions of Windows these rarely crash or lock up the entire system. Failures only occur when the application is running, so it's easy to track failures and correlate them with programs that were running at the time. Watch out for user applications that have an always-on component such as startup items or system services.

Keep a semi-detailed work log

Final thought here. Keep a log of all the problems you encounter and troubleshooting steps you take. With a difficult and drawn-out problem like this one it's easy to forget details. Being able to review this as you work may help you rule out causes or make connections between facts that otherwise might be lost in the struggle.
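
A text file or spreadsheet is plenty for this. If you prefer something scriptable, here is a minimal Python sketch that appends timestamped entries to a CSV file and reads them back. The column names, function names, and file name are all my own choices, not a standard.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

LOG_FIELDS = ["timestamp", "kind", "description"]


def log_entry(path, kind, description):
    """Append one timestamped row, writing the header if the file is new or empty."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(LOG_FIELDS)
        stamp = datetime.now().isoformat(timespec="seconds")
        writer.writerow([stamp, kind, description])


def read_log(path):
    """Return every entry as a dict keyed by the header names."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# A temporary folder keeps this demo self-cleaning;
# point `log` at a permanent file for real use.
with tempfile.TemporaryDirectory() as folder:
    log = Path(folder) / "troubleshooting_log.csv"
    log_entry(log, "symptom", "AppCrash in Diablo III")
    log_entry(log, "action", "Ran chkdsk /f on C:")
    for row in read_log(log):
        print(row["timestamp"], row["kind"], row["description"])
```

Separating symptoms from actions in the "kind" column makes it easier to see later which change preceded which crash.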


Anecdotal Story

I worked on a system that reminds me of your situation. It was a laptop (which limited my hardware swapping options) that would lock up randomly. It would do it 10 seconds after power-on, then not for days, and then after being on for hours. I updated everything, tested and replaced every hardware component I could, and reinstalled Windows (at least once, if not twice).

It ended up being the motherboard. After it was replaced, the laptop ran for many years without further trouble.

I say Reinstate Monica

Posted 2018-01-03T02:35:32.620

Reputation: 21 477

Thanks for the input. The CPU is the ONLY component that could even remotely cause these kinds of problems, and hasn't been completely replaced... and I agree, it's less likely than other problems. Running another OS is an interesting approach I haven't considered before. I've dual-booted Ubuntu before, but I'm concerned I won't be able to put enough hours in it to generate an observable fault. Also... a negative test (which I think is likely) doesn't get me very much at all: it's something to do with the other install... I'll keep at it. – mHurley – 2018-01-03T05:20:39.547

With any luck, one of these days Google will apply their Deep Mind to analyze event and application logs to diagnose these kinds of things... – mHurley – 2018-01-03T05:23:11.230

Why do you say the CPU is the only component able to cause these problems? – I say Reinstate Monica – 2018-01-03T12:49:59.720

Of the components I haven't already replaced, the CPU is the only one that could cause apps to crash and BSODs. Literally, the only other things I haven't replaced yet, are some case fans, the case itself, and a Bluray drive I rarely use. As I've said before, I'm still open to the possibility, but I'm becoming less and less confident that this is a hardware issue at all. ....or I'm just really unlucky and one of the replacement parts I got is also bad in the same way as the original. – mHurley – 2018-01-03T13:47:30.283

@mHurley I must disagree. RAM is equally able to cause such faults (and in my 20 years experience, does so with much greater frequency). While I think in your case there's an elevated possibility it's the CPU, I'd encourage you to consider trying different RAM. And forget the RAM tests. RAM frequently exhibits intermittent failure that's near impossible to catch with a RAM test. – I say Reinstate Monica – 2018-01-03T14:03:58.080

I certainly agree that RAM could cause the issue, but I'm less suspicious of it because I've replaced it all recently. But, your testing info is new (and disappointing) information. Should it be sufficient to Warranty return the RAM? Is the manufacturer (reasonably) guaranteed to detect any defects, or could that be just as undependable as a session of MemTest? – mHurley – 2018-01-03T18:00:02.427

@mHurley I would try to warranty return the RAM. I don't know how they would handle the return, but most returns I do they take your due diligence as sufficient evidence. A memory manufacturer knows not all RAM failures can be detected with tests (by end user, at least). – I say Reinstate Monica – 2018-01-03T18:58:23.797