Computer BSODs ONLY when launching Overwatch

Okay, so I posted about this on the OW forums, though it seems like no one there cares. So I'm posting here just in case it's a hardware problem and not an Overwatch problem, since I seem to be an unusual case.

So I built a gaming rig to suit all my gaming needs, and it has for close to 2 1/2 years. I've been playing OW for about the same time, and everything was fine until the recent OW patch (which is why I think it's a problem on their end). Now, I play many games that are more graphically intensive than OW and I've never had a crash with them; DOOM, Fallout 4 and Witcher 3 are just a few examples.

The crash occurs ONLY when I launch OW: it hangs on a black screen, and if I have music on in the background it holds a note until the computer BSODs and restarts. The most recent BSOD said something along the lines of "clock" and something about my second core?

Things I've tried: Memcheck, uninstalling and reinstalling the game, updating the BIOS, updating the graphics drivers, and even reinstalling Windows.

Not sure if this is related, but I recently got a new Razer Ornata keyboard. Could this be affecting it? I'll run a trial and error and update this post. Update: unplugged the keyboard, no change.

Specs in attached image.

https://i.gyazo.com/23e5bf70eed481bb45678be16da44915.png

Most recent minidump: https://www.filehosting.org/file/details/758289/092618-20607-01.rar

Help a guy out? Hopefully this problem doesn't make me look as dumb as my last one.

surazaL

Posted 2018-09-27T03:16:01.113

Reputation: 13

Answers

The minidump says that the bugcheck code is WHEA_UNCORRECTABLE_ERROR.

WHEA = Windows Hardware Error Architecture. (i.e. you've experienced a hardware problem.) The bugcheck parameters reported in the minidump are:

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. 
Parameter 1 identifies the type of error source that reported the error. 
Parameter 2 holds the address of the WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa80070778f8, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000000, Low order 32-bits of the MCi_STATUS value.

In brief, the CPU raised an exception called a "machine check". Such are always fatal to the OS, as far as I know. The minidump says you have an AMD CPU; the AMD processor architecture manual says that the processor will raise a machine check exception in these circumstances:

Cache errors associated with reading and writing data, probing, cache-line fills, and cache-line writebacks. [note that these are all inside-the-CPU things. Has nothing to do with e.g. the Windows file cache. -jeh]
Parity errors associated with the caches and TLBs. [also inside-the-CPU -jeh]
ECC errors associated with the caches and DRAM. [ECC errors in the caches are inside the CPU. You are very unlikely to be running ECC RAM so I'll assume that doesn't apply. -jeh]
Bus errors associated with reading and writing on the processor external bus. [like it says - "external bus", not inside the CPU -jeh]

We can get more information about this by formatting the WHEA_ERROR_RECORD structure, whose address Windows conveniently put in bugcheck argument 2.

1: kd> !errrec fffffa80`070778f8
===========================================================================
Common Platform Error Record @ fffffa80070778f8
---------------------------------------------------------------------------
Record Id     : 01d45625295c3b26
Severity      : Fatal (1)
Length        : 928
[...]
Error         : BUSLG_GENERIC_ERR_*_TIMEOUT_ERR (Proc 1 Bank 0)
  Status      : 0xb880000000020f0f

So we had a timeout on a bus, i.e. a transaction on the bus was started but did not complete in time. The "bus" was probably PCI Express.
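The error name printed by the debugger can be cross-checked against the raw Status value. The following is a minimal sketch, assuming the standard x86 machine-check layout: flag bits in the top of MCi_STATUS and, in the low 16 bits, a compound error code of the bus/interconnect form `0000 1PPT RRRR IILL`. The function and table names are made up for illustration, and the label "OTHER" for II = 3 is my own wording, not the debugger's.

```python
# Raw MCi_STATUS value copied from the "Status" line of the !errrec output.
STATUS = 0xB880000000020F0F

# Architectural flag bits in the upper part of MCi_STATUS.
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

# Field encodings for the bus/interconnect compound error code.
PP = {0: "SRC", 1: "RES", 2: "OBS", 3: "GENERIC"}
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
II = {0: "M", 1: "reserved", 2: "IO", 3: "OTHER"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_flags(status):
    """Return the architectural status flags as booleans."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


def decode_bus_error(code):
    """Decode a 16-bit MCA error code of the form 0000 1PPT RRRR IILL.

    Returns None when the code is not a bus/interconnect error.
    """
    if (code >> 11) != 0b00001:
        return None
    return {
        "participation": PP[(code >> 9) & 0b11],
        "timeout": bool((code >> 8) & 1),
        "request": RRRR.get((code >> 4) & 0xF, "reserved"),
        "memory_or_io": II[(code >> 2) & 0b11],
        "level": LL[code & 0b11],
    }


if __name__ == "__main__":
    print(decode_flags(STATUS))
    print(decode_bus_error(STATUS & 0xFFFF))
```

Under those assumptions the low word 0x0F0F decodes to a generic-participation, generic-level bus error with the timeout bit set, which agrees with the BUSLG_GENERIC_ERR_*_TIMEOUT_ERR name above, and the UC flag marks the error as uncorrected.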

Given the circumstances you describe, this does strongly point to the graphics card.

But first I would try swapping your power supply for a better/more powerful one, particularly one with more current on the 12V rail. Modern GPUs are very power-hungry.

Here is a Microsoft page that goes into more detail on interpreting this type of memory dump (that is, bugcheck code 0x124).

There is not much more info available from the minidump. The only thing that can be seen is the current thread info; that thread is dedicated to reporting WHEA errors so it has no information about what was happening in other threads, maybe on other logical processors, at the time, and the dump doesn't contain any of that. Usually I would try e.g. !running, !ready, etc., but here the debugger just says "unable to read from fffff800030b9000". That's because of info that's missing from the minidump - which is typical for WHEA errors. If you enabled kernel or automatic dumps and reproduced the problem it is possible that the larger dump file might have more information, but it looks to me as though you have a clear path to follow without that, i.e. hardware swaps. Sorry about that.
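To see which kind of dump the machine is currently set to write, one option is to read the `CrashDumpEnabled` value under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`. This is a rough sketch, assuming the commonly documented values (0 none, 1 complete, 2 kernel, 3 small, 7 automatic); the helper names are invented for illustration, and the registry read only works on Windows.

```python
# Commonly documented meanings of the CrashDumpEnabled registry value.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}


def describe_dump_setting(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, f"unknown ({value})")


def has_more_than_minidump(value):
    """True if the setting produces a dump larger than a minidump."""
    return value in (1, 2, 7)


def read_crash_dump_enabled():
    """Read the current setting from the registry (Windows only)."""
    import winreg  # imported here so the helpers above work on any OS

    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__":
    import sys

    if sys.platform == "win32":
        setting = read_crash_dump_enabled()
        print(describe_dump_setting(setting))
        print("larger dump available:", has_more_than_minidump(setting))
    else:
        print("registry check skipped: not running on Windows")
```

If this reports a small memory dump, the setting can be changed under System Properties, Advanced, Startup and Recovery before reproducing the crash.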

Jamie Hanrahan

Posted 2018-09-27T03:16:01.113

Reputation: 19 777

Thank you for your answer! I don't suppose the fact that I JUST bought a new PSU like a month ago (an EVGA 600W) affects your solution? I'm almost certain it's a GPU problem now, which I guess is understandable given its advanced age. I just find it odd that it only melts down on one program. – surazaL – 2018-09-28T19:19:30.300

If you just bought a good-quality PSU, then it's probably not a suspect... unless it's an "infant mortality" situation. Re "only on one program" (and you say that program is not the heaviest, gfx-wise): it's not always about the amount of activity. Sometimes one program will just exercise a component in ways that others don't. Maybe you can find a less-expensive gfx card to swap in as a troubleshooting step. – Jamie Hanrahan – 2018-09-28T23:10:15.153

Would disabling my GPU and using the onboard graphics work the same way? – surazaL – 2018-09-29T14:35:56.520

It's something to try, and it won't cost you anything. But I was thinking of a lower-cost version of the same card, using the same drivers, if at all possible. What card is it? – Jamie Hanrahan – 2018-09-29T14:41:25.470

Wait. I don't think I have integrated graphics; it's not showing up in my display adapters, at least. Well, I guess the only option is to go buy one. Even if it turns out to be another problem, I needed an upgrade anyway. Hope it's not the board. – surazaL – 2018-09-29T14:53:56.350

Yeah... the FX 6100 doesn't have integrated graphics. – Jamie Hanrahan – 2018-09-29T15:00:05.387

Alright man, thanks for your help and interpretation of the bug report. I'll eventually scrape together the change for a new GPU and then update, but that's probably going to be a while. Regardless, thanks again. – surazaL – 2018-09-29T15:08:08.537

I'm only guessing here, but based on the amount of stuff you've tried (even reinstalling Windows), I'm tempted to say that your GPU has suffered a small, localized hardware failure. A tiny part of the GPU itself, the board, or the VRAM is defective in such a way that only specific sequences of graphics draw calls cause it to manifest. It's entirely possible for only a single game to do this.

I had a similar problem about 10 years ago with a much older Nvidia card that was widely known to suffer partial failure effects with age; one specific MMO would display artifacts then crash, but other MMOs and FPS games would run fine.

If your GPU is 2.5 years old, it's definitely old enough to start deteriorating in some "early failure" type of way. This is usually more common on laptops, where the chips consistently run hotter than on desktops (for example, MacBook Pros have had short-lived GPUs for years), but maybe you just got unlucky.

As a gross generalization I believe that this sort of issue is very rare on modern desktop graphics cards, but that doesn't mean it can't happen. The only reason it has gotten less frequent is that, for the last few generations, Nvidia and AMD have been investing more in QA and stress testing of their products than they used to, in order to ensure long-term reliability. Of course, if you run a chip too hot, eventually it will break -- it's just a question of when.

Ultimately, without very specialized equipment (most likely a scanning electron microscope and/or x-ray microscope, as well as thousands of dollars of additional microelectronics equipment) there is no way to know for certain what the problem is with your GPU (if one exists) and how/why it happened.

For a usual consumer, unfortunately, the alternative is a simple but often expensive one: when you suspect a part to be "bad", replace it with a new (or at least different but known-working) device with equivalent functionality.

For example, if you had a GTX 970 that you suspect is bad, you could borrow a friend's GTX 960 (one they've tested and know works) and install that in your system just to see if the problem goes away. If the crash still happens, the problem is something else. If the crash stops, then your GTX 970 is bad.

Repeat this process for every imaginable component: motherboard, CPU, RAM, conceivably even something like a WiFi card.

If you don't have any friends willing to let you borrow computer parts, you may have to buy them to do these tests. Or, if you know a friendly local computer repair shop, they might let you troubleshoot with their spare hardware and might only charge a small diagnostic fee (if anything), which is much cheaper than buying a new GPU. You could also take your chances on the used market if you want.

Once you identify bad hardware, all you can do is replace it. In most cases it is not economical to take a broken GPU and try to fix it, because the time required for a professional with high-end equipment to actually find and fix the problem will exceed the value of the GPU -- unless it's brand new. And if it's brand new, you have a warranty and you should send it back to the manufacturer for repair or replacement. GPUs depreciate too quickly for out-of-warranty repair to be economical, sadly.

If swapping hardware doesn't fix your problem, then it could still be a software problem -- but given that you've completely reinstalled Windows (and, one assumes, Overwatch), my bet is that you will eventually find defective hardware rather than something software related. Besides, your typical data corruption type of error doesn't cause a BSOD.

This issue could be tough to diagnose, nearly impossible to root cause, and likely expensive to fix if it's what I think it is. Most GPUs have a 1 or 2 year warranty, not 2.5+ years, so it's almost definitely out of warranty unless you have a very good manufacturer who commits to a longer warranty. If you're not covered under warranty and you determine that the problem is with the GPU hardware, you're going to need to buy a new GPU.

allquixotic

Posted 2018-09-27T03:16:01.113

Reputation: 32 256

As allquixotic indicates, there could be a lot of reasons for your problem and it may be difficult to isolate. One suggestion, based on your comment about 'something about my second core', would be to try out different core affinities with your exe (OW): Task Manager > Details > OW exe > Set Affinity, just to see if you have any issues with your CPU. – reben – 2018-09-27T12:04:19.150
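Task Manager can only change affinity once the process is already running. One alternative is to apply the affinity at launch with the `start /affinity` command, which takes a hexadecimal bitmask of logical processors. The sketch below only builds the mask and the command line; the executable path is a placeholder, since the thread does not give one, and this may not apply if the game is normally started through a launcher.

```python
def affinity_mask(cores):
    """Build a bitmask with one bit set per allowed logical processor."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core
    return mask


def start_command(exe_path, cores):
    """Build a 'cmd /c start' command that launches exe_path with affinity.

    The empty string is the window title that 'start' expects before
    its options; the mask is passed as hexadecimal digits.
    """
    hex_mask = format(affinity_mask(cores), "X")
    return ["cmd", "/c", "start", "", "/affinity", hex_mask, exe_path]


if __name__ == "__main__":
    # Hypothetical path: replace with the real location of the game exe.
    exe = r"C:\Games\Overwatch\Overwatch.exe"
    # Six-core CPU, skipping logical processor 1 (the "second core").
    print(start_command(exe, [0, 2, 3, 4, 5]))
```

On Windows the resulting list could be passed to `subprocess.run`. Trying a few different masks, each leaving out a different core, would show whether the crash follows one particular core.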

But what if I replace my GPU only to find out it's a board problem? Also reben, how can I set affinity if the program crashes as soon as it opens? – surazaL – 2018-09-27T20:37:07.477

If you replace the GPU and it isn’t a GPU problem, you’d either have to return the new GPU for a refund (if allowed), keep it, or sell it. It’s very much a trial and error process. – allquixotic – 2018-09-28T05:42:00.897

"I had a similar problem about 10 years ago with a much older Nvidia card [...]" Yep, I had that issue on an nVidia mobile card in my main work laptop. The laptop was still what I needed otherwise, so swapped the motherboard for one with Intel graphics. – Jamie Hanrahan – 2018-09-29T15:21:22.287

(so *I swapped...) However those symptoms were much worse than described by the OP here: started with flickering, went to massively disrupted display, then blackness. Never had an OS crash. – Jamie Hanrahan – 2018-09-29T19:40:08.090