Windows/Linux hang issue - BSOD


System Details:

[image]

Dell Latitude 3540, dual-booting Windows 7 and Ubuntu Linux.


Observation:

I frequently see the error below in the Windows 7 system log. Both Windows and Linux hang about 5 minutes after I switch on the laptop and start working.

Interestingly, while either OS is connected to high-bandwidth Wi-Fi, it keeps working properly. On low-bandwidth Wi-Fi, it hangs within 5 minutes.

In other words, Ubuntu and Windows 7 work properly indefinitely, until some application or browser connects to the Internet over low-bandwidth Wi-Fi.

memtest86+ passes all tests, and the CPU temperature during testing is 50°C to 53°C.

Wi-Fi adapter details, taken in Ubuntu:

mohet01@mohet01-ubuntu:~$ lspci -knn | grep Net
02:00.0 Network controller [0280]: Qualcomm Atheros QCA9565 / AR9565 Wireless Network Adapter [168c:0036] (rev 01)
        Subsystem: Dell QCA9565 / AR9565 Wireless Network Adapter [1028:020c]


Consequence:

The OS hangs consistently, and I need to restart within 5 minutes.

Windows 7 produced a crash dump with a blue screen once (shown below):

/**************************************************/

Problem signature:
  Problem Event Name:   BlueScreen
  OS Version:   6.1.7601.2.1.0.768.3
  Locale ID:    4105

Additional information about the problem:
  BCCode:   124
  BCP1: 0000000000000000
  BCP2: FFFFFA8007067028
  BCP3: 00000000BE000000
  BCP4: 0000000000800400
  OS Version:   6_1_7601
  Service Pack: 1_0
  Product:  768_1

Files that help describe the problem:
  C:\Windows\Minidump\031717-25412-01.dmp
  C:\Users\....\AppData\Local\Temp\WER-78016-0.sysdata.xml

/**************************************************/
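For reference, the bug check parameters above can be read mechanically. Microsoft's documentation for bug check 0x124 (WHEA_UNCORRECTABLE_ERROR) states that when parameter 1 is 0 (a machine check exception), parameters 3 and 4 carry the high and low 32 bits of the MCi_STATUS register of the bank that reported the error. The following is a minimal Python sketch that recombines them; the function and variable names are only illustrative.

```python
# Bug check 0x124 parameters copied from the problem signature above.
BCP1 = 0x0000000000000000  # 0 = machine check exception
BCP3 = 0x00000000BE000000  # high 32 bits of MCi_STATUS
BCP4 = 0x0000000000800400  # low 32 bits of MCi_STATUS


def combine_status(high: int, low: int) -> int:
    """Rebuild the 64-bit MCi_STATUS value from its two 32-bit halves."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


status = combine_status(BCP3, BCP4)
mca_code = status & 0xFFFF  # architectural MCA error code (bits 15:0)

print(f"MCi_STATUS     = {status:#018x}")
print(f"valid (bit 63) = {bool((status >> 63) & 1)}")
print(f"UC    (bit 61) = {bool((status >> 61) & 1)}")
print(f"MCA error code = {mca_code:#06x}")
```

Note that the MCA error code recovered here (0x0400) is not the same as the low 16 bits of the MciStat value in the event log entry (0x110a), so the blue screen and the logged event may describe different machine checks.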

[image]


Log details in XML format:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
  <Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{C26C4F3C-3F66-4E99-8F8A-39405CFED220}" /> 
  <EventID>18</EventID> 
  <Version>0</Version> 
  <Level>2</Level> 
  <Task>0</Task> 
  <Opcode>0</Opcode> 
  <Keywords>0x8000000000000000</Keywords> 
  <TimeCreated SystemTime="2017-03-13T16:14:23.559321500Z" /> 
  <EventRecordID>14505</EventRecordID> 
  <Correlation ActivityID="{0B7D462B-CDAA-4CD9-B8D9-C9532A2A345F}" /> 
  <Execution ProcessID="1796" ThreadID="1968" /> 
  <Channel>System</Channel> 
  <Computer>mohet01-w7</Computer> 
  <Security UserID="S-1-5-19" /> 
  </System>
  <EventData>
  <Data Name="ErrorSource">3</Data> 
  <Data Name="ApicId">0</Data> 
  <Data Name="MCABank">6</Data> 
  <Data Name="MciStat">0xae0000000040110a</Data> 
  <Data Name="MciAddr">0xffb07540</Data> 
  <Data Name="MciMisc">0x238a0000086</Data> 
  <Data Name="ErrorType">9</Data> 
  <Data Name="TransactionType">2</Data> 
  <Data Name="Participation">256</Data> 
  <Data Name="RequestType">0</Data> 
  <Data Name="MemorIO">256</Data> 
  <Data Name="MemHierarchyLvl">2</Data> 
  <Data Name="Timeout">256</Data> 
  <Data Name="OperationType">256</Data> 
  <Data Name="Channel">256</Data> 
  <Data Name="Length">928</Data> 
  <Data Name="RawData">435045521002FFFFFFFF03000100000002000000A0030000110D10000D0311140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131FE6FF5E89C91C54CBA8865ABE14913BB055163B7149CD20102000000000000000000000000000000000000000000000058010000C00000000102000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000001000000000000000000000000000000000000000000000018020000800000000102000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000001000000000000000000000000000000000000000000000098020000080100000102000000000000011D1E8AF94257459C33565E5CC3F7E8000000000000000000000000000000000100000000000000000000000000000000000000000000007F0100000000000000020100000200005106040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000070000000000000000000000000000005106040000081000BFFBDA7FFFFBEBBF0000000000000000000000000000000000000000000000000000000000000000F50157A5EFE3DE43AC72249B573FAD2C03000000000000009F008206000000004075B0FF000000000000000000000000000000000000000000000000000000000100000001000000609E6CB9149CD2010000000000000000000000000000000000000000060000000A114000000000AE4075B0FF00000000860000A0380200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data> 
  </EventData>
  </Event>
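If many such events need to be compared, the named fields can be pulled out of the XML with a short script. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; it works on a trimmed copy of the event above (only a few `Data` elements are kept), and the helper name `event_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <Event> element in the log entry.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event; values are taken from the entry above.
EVENT_XML = """\
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-WHEA-Logger" />
    <EventID>18</EventID>
  </System>
  <EventData>
    <Data Name="MCABank">6</Data>
    <Data Name="MciStat">0xae0000000040110a</Data>
    <Data Name="MciAddr">0xffb07540</Data>
    <Data Name="MemHierarchyLvl">2</Data>
  </EventData>
</Event>
"""


def event_fields(xml_text: str) -> dict:
    """Return the EventData entries as a {Name: text} dictionary."""
    root = ET.fromstring(xml_text)
    return {
        data.get("Name"): (data.text or "").strip()
        for data in root.findall("e:EventData/e:Data", NS)
    }


root = ET.fromstring(EVENT_XML)
event_id = int(root.findtext("e:System/e:EventID", namespaces=NS))
fields = event_fields(EVENT_XML)

print(event_id, fields["MCABank"], fields["MciStat"])
```

The same function should work on the full event once it is saved as well-formed XML, that is, without the leading "- " markers that the XML viewer adds.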

Process ID 1796:

[image]

All levels of Dell support for this laptop expired in 2016.


Diagnostics:

I ran the diagnostics below on this laptop's Intel processor, and the results look fine.

[image]


Question:

Is this a hardware issue?

or

Is a Wi-Fi driver issue on both OSes (Linux and Windows 7) triggering this problem?

overexchange

Posted 2017-03-13T16:34:24.967

Reputation: 135

also post the details as XML; this is easier to read – magicandre1981 – 2017-03-14T16:08:54.127

@magicandre1981 Yup, updated – overexchange – 2017-03-14T19:19:02.690

have you tried using a LAN cable instead of Wi-Fi? Do you also see crashes then or not? – magicandre1981 – 2017-03-21T16:33:42.353

@magicandre1981 1) 16 crash dump files were created in c:/windows/MiniDump/ before I tested on high-bandwidth Wi-Fi. Now there are no crash dumps, BSODs, or hangs. I have no LAN cable available, but I updated the Wi-Fi and Ethernet drivers last night from the Dell support website. Since last night there has been no glitch, even on low-bandwidth (fluctuating on/off) Wi-Fi. I will keep observing this week. 2) How often do laptop components need to be cleaned? I have used this laptop for 3 years at home, in open air.

– overexchange – 2017-03-21T17:09:44.903

I have no idea why it helped. maybe the driver now no longer uses the damaged parts of the cache – magicandre1981 – 2017-03-21T17:19:31.300

@magicandre1981 Processor passed the intel processor stress tool test, as updated in query with results. – overexchange – 2017-03-21T17:26:51.173

share the dmps with the 0x124 code, so that I can see why your PC crashed – magicandre1981 – 2017-03-22T16:34:53.537

copy the minidump folder to desktop, create a zip , upload the zip to a cloud service (onedrive, dropbox) and post a share link here – magicandre1981 – 2017-03-22T16:54:35.740

@magicandre1981 Here are dump files

– overexchange – 2017-03-22T16:54:58.270

I looked at the dump, the cpu is damaged. You have L2 cache issues. replace the CPU with a new one. look on ebay to buy the same cpu again – magicandre1981 – 2017-03-22T17:42:05.980

@magicandre1981 1) Can you share the dump analysis? 2) Is this the CPU to buy? I can't find it on eBay

– overexchange – 2017-03-22T18:19:12.517

I posted the windbg output as an edit in my answer. if you can't find a cpu, you're out of luck. time to buy a new device – magicandre1981 – 2017-03-23T15:41:23.830

@magicandre1981 I have uploaded all 16 dumps, here. Are they showing same error?

– overexchange – 2017-03-24T20:38:15.983

yes, I checked the latest 2. both show GCACHEL2_ERR_ERR. open the dmp in windbg and run !errrec valueFromArg2 to dump the data for the 0x124 crash – magicandre1981 – 2017-03-25T07:49:32.443

@magicandre1981 I installed windbg and opened the dump, but I do not get a command prompt to type the command. I selected the Command option in the View menu, but no luck – overexchange – 2017-03-25T14:20:57.590

Answers


An event with ID 18 is a fatal error. This is how Event 18 is declared in Windows:

<event channel="chidSystem" level="win:Error" message="$(string.EventMessage.XpfMcaErr)" symbol="WHEALOGR_XPF_MCA_ERROR" template="tidMachineCheck" value="18" />

<string id="EventMessage.XpfMcaErr" value="A fatal hardware error has occurred.%n%nReported by component: Processor Core%nError Source: %1%nError Type: %7%nProcessor APIC ID: %2%n%nThe details view of this entry contains further information." />

A cache hierarchy error could be an issue with the L1 or L2 cache of your CPU. Run a CPU stress tool that accesses the CPU cache very heavily. Maybe it reports errors. I think you need to replace the CPU, like this user who sees the same message.

The 0x124 dump shows L2 Cache issues:

===============================================================================
Common Platform Error Record @ fffffa8006d0d8f8
-------------------------------------------------------------------------------
Record Id     : 01d2a2c0cacd2425
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 3/22/2017 4:00:10 (UTC)
Flags         : 0x00000002 PreviousError

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa8006d0d978
Section       @ fffffa8006d0da50
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Generic
Flags         : 0x00
Level         : 2
CPU Version   : 0x0000000000040651
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa8006d0d9c0
Section       @ fffffa8006d0db10
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : 51 06 04 00 00 08 10 00 - bf fb da 7f ff fb eb bf
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa8006d0db10

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa8006d0da08
Section       @ fffffa8006d0db90
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : GCACHEL2_ERR_ERR (Proc 0 Bank 6)
  Status      : 0xae0000000040110a
  Address     : 0x00000000ffb07540
  Misc.       : 0x000001f8a0000086

So the running code sometimes hits the faulty part of the L2 cache and crashes, and sometimes does not. The dump only shows a generic error, so I have no real idea what exactly is wrong.
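The GCACHEL2_ERR_ERR mnemonic can be reproduced from the Status value by hand. Below is a minimal Python sketch, assuming the compound "cache hierarchy error" layout from the machine-check chapter of the Intel SDM (low 16 bits of the form 000F 0001 RRRR TTLL) and the architectural flag bits 63 to 57. The helper names are illustrative, and the model-specific bits 31:16 are not interpreted.

```python
STATUS = 0xAE0000000040110A  # "Status" from Section 2 of the record

# Architectural flag bits in the upper part of MCi_STATUS.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Sub-field encodings of a cache hierarchy error code.
TT = {0: "I", 1: "D", 2: "G"}              # transaction type
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}  # memory hierarchy level
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}  # request type


def set_flags(status: int) -> list:
    """Names of the flag bits that are set, highest bit first."""
    return [name for bit, name in FLAGS.items() if (status >> bit) & 1]


def cache_mnemonic(status: int):
    """Return e.g. 'GCACHEL2_ERR_ERR', or None if not a cache hierarchy error."""
    code = status & 0xFFFF
    # Bit 12 (the F bit) is not part of the type pattern, so it is masked out.
    if (code & 0xEF00) != 0x0100:
        return None
    tt = TT.get((code >> 2) & 0x3, "?")
    ll = LL[code & 0x3]
    rrrr = RRRR.get((code >> 4) & 0xF, "?")
    return f"{tt}CACHE{ll}_{rrrr}_ERR"


print(cache_mnemonic(STATUS))
print(set_flags(STATUS))
```

For this value the transaction type and request type both decode to their generic encodings, which is why the record gives no further detail beyond the cache level.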

Buy the same CPU again on ebay and replace it.
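To check which processor a replacement has to match, the "CPU Version" field of the record (0x40651, the CPUID signature) can be split into family, model and stepping. This is a minimal sketch of the standard CPUID signature layout; the function name is hypothetical, and mapping the result to a marketing name still requires a lookup in Intel's documentation.

```python
def decode_signature(sig: int):
    """Split a CPUID signature (EAX of CPUID leaf 1) into (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model only applies to base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


family, model, stepping = decode_signature(0x40651)
print(f"family {family}, model {model:#x}, stepping {stepping}")
```

The same signature appears, in little-endian byte order, as the first four bytes (51 06 04 00) of the "CPU Id" field in Section 1.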

magicandre1981

Posted 2017-03-13T16:34:24.967

Reputation: 86 560

Do you think an airflow issue in the laptop could be one cause? – overexchange – 2016-03-16T04:51:08.747

did you have fan/heat issues? – magicandre1981 – 2017-03-16T16:28:43.903

The laptop feels OK to the touch. But anyway, do you want me to check the fan's condition and clean the dust out of the laptop? – overexchange – 2017-03-16T21:02:52.853

if the heat already damaged the cpu it is too late. try to clean the dust from the cpu cooler (but don't use a vacuum cleaner, this can damage the fan by causing it to rotate too fast) – magicandre1981 – 2017-03-17T15:46:42.387

But if the CPU is damaged, how does the kernel boot? Now Windows hangs within 5 minutes after boot – overexchange – 2017-03-17T15:48:19.413

not all programs cause the same usage of cpu resources (cache). you should back up all data before you can no longer boot the laptop. – magicandre1981 – 2017-03-17T16:07:28.510

How can I detect whether the CPU is gone? Can I check it physically by opening the laptop? – overexchange – 2017-03-17T16:10:16.763

run this Intel tool: https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool

– magicandre1981 – 2017-03-17T16:28:39.717

Ran the intel tool, Query updated – overexchange – 2017-03-18T01:12:45.180

run other stress test tools like Prime95. but if you now also get BSODs, the CPU seems to be gone. 0x124 = fatal hw error: https://msdn.microsoft.com/de-de/library/windows/hardware/ff557321(v=vs.85).aspx . according to this picture (http://www.laptoprepairworld.com/wp-content/uploads/2016/06/Dell-Latitude-3540-Motherboard.jpg) the CPU is socket based and you could change it. look for a cpu on ebay

– magicandre1981 – 2017-03-18T08:01:40.660

memtest86+ shows a CPU temperature of 50-53 degrees centigrade. All tests passed – overexchange – 2017-03-21T16:27:16.407