Diary of war with bugs: The bug is on the table

A few months ago there was an interesting story in the company where I work. One guy from a neighboring department asked me to help him. He encountered one crash after another over the past few days. Unusual in these crashes was that they occurred in different processes and under different circumstances, but within one computer on the test bench (consisted of 5 machines) and with identical symptoms. In all cases it was an access violation caused by illegal instruction pointer.

Here is an example of one of the call stacks:

0:000> knL

 # ChildEBP RetAddr  

00 0012d7f8 7c90e9ab ntdll!KiFastSystemCallRet

01 0012d7fc 7c8094f2 ntdll!ZwWaitForMultipleObjects+0xc

02 0012d898 7c809c86 kernel32!WaitForMultipleObjectsEx+0x12c

03 0012d8b4 6945763c kernel32!WaitForMultipleObjects+0x18

04 0012e248 694582b1 faultrep!StartDWException+0x5df

05 0012f2bc 7c863059 faultrep!ReportFault+0x533

06 0012f530 7c36e289 kernel32!UnhandledExceptionFilter+0x4cf

07 0012f54c 0041b84f msvcr71!_XcptFilter+0x15f

08 0012f558 7c363943 radar!WinMainCRTStartup+0x1d7

09 0012f580 7c9037bf msvcr71!_except_handler3+0x61

0a 0012f5a4 7c90378b ntdll!ExecuteHandler2+0x26

0b 0012f654 7c90eafa ntdll!ExecuteHandler+0x24

0c 0012f654 fc910d93 ntdll!KiUserExceptionDispatcher+0xe

WARNING: Frame IP not in any known module. Following frames may be wrong.

0d 0012f950 1f4d5318 0xfc910d93

0e 0012fa10 7c363593 0x1f4d5318

0f 0012fa58 087e6827 msvcr71!free+0xc3

10 0012fa8c 087e6f41 NaviRadarRendererPlain!std::vector,std::allocator > >::_Insert_n+0x147

11 0012fafc 087e46f1 NaviRadarRendererPlain!TScanlines::calculate+0x5d1

12 0012fc2c 087e418e NaviRadarRendererPlain!TRadarToScreen_::RenderBlock+0x521

13 0012fc64 087e2378 NaviRadarRendererPlain!TRadarToScreen_::Render+0x16e

14 0012fc88 087c4cf0 NaviRadarRendererPlain!n_d3d_render::renderer_t::render+0x68

15 0012fce0 03d5c5e7 NaviRadarLayer!TRadarLayerImpl::Draw+0x360

16 0012fd1c 03d55c8e TotUser!TRadarManager_::Update+0x77

17 0012fd4c 087c821a TotUser!TCrtPanel_::InvalidateRadar+0xbe

18 0012fd54 10ac8959 NaviRadarLayer!TRadarLayerImpl::OnVideoBlockReceived+0x7a

19 0012fd7c 10ac8ab6 TkRadar20!IAdviseHostImpl::TypedForEach+0x49

1a 0012fd90 10ac8e88 TkRadar20!RadarConnector::VideoBlockReceived+0x26

1b 0012fda0 77d48709 TkRadar20!detail::WndProc+0x48

1c 0012fdcc 77d487eb user32!InternalCallWinProc+0x28

1d 0012fe34 77d489a5 user32!UserCallWinProcCheckWow+0x150

1e 0012fe94 77d4bccc user32!DispatchMessageWorker+0x306

1f 0012fea4 7c1b1645 user32!DispatchMessageA+0xf

20 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x3e

21 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54

22 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30

23 0012ff18 0041b7fd mfc71!AfxWinMain+0x68

24 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185

25 0012fff0 00000000 kernel32!BaseProcessStart+0x23

When I looked at the call stack for the first time I made several assumptions about IP corruption reason (those that I considered more likely):
1. Failed IP address was sent to the call instruction.
2. Wrong return address was restored by the ret instruction.

To conform or refute each of them I started from the stack investigation.

Content of the stack around the current stack pointer was the following:

0:000> .frame /c 0d
0d 0012f950 1f4d5318 0xfc910d93
eax=000000f2 ebx=00000000 ecx=0000acce edx=1f4b7330 esi=1f4b6ec0 edi=01f20000
eip=fc910d93 esp=0012f954 ebp=0012fa10 iopl=0         nv up ei ng nz na po cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000             efl=00010283
fc910d93 ??              ???

0:000> dds @esp-50 @esp-4
0012f904 00000000
0012f908 00000000
0012f90c 00000000
0012f910 00000000
0012f914 00000000
0012f918 00000000
0012f91c 00000000
0012f920 00000000
0012f924 00000000
0012f928 00000000
0012f92c 00000000
0012f930 00000000
0012f934 00000000
0012f938 00000000
0012f93c 00000000
0012f940 00000000
0012f944 00000000
0012f948 00000000
0012f94c 00000000
0012f950 00000000

0:000> dds @esp @esp+150
0012f954 1f4d5318
0012f958 1f4b6ec8
0012f95c 0000008e
0012f960 11200000
0012f964 7c397a63 msvcr71!_87except+0xc4
0012f968 1f4b6ec0
0012f96c 00000008
0012f970 00000258
0012f974 00000258
0012f978 7c397a97 msvcr71!_87except+0xf8
0012f97c 0000037f
0012f980 01f20178
0012f984 00000000
0012f988 00000000
0012f98c 00000000
0012f990 00000000
0012f994 00000000
0012f998 1f4d5318
0012f99c 00000000
0012f9a0 000000d4
0012f9a4 00000000
0012f9a8 00000000
0012f9ac 00000000
0012f9b0 00000470
0012f9b4 1f4d5ac8
0012f9b8 00000000
0012f9bc 1f4d5310
0012f9c0 00000000
0012f9c4 00000000
0012f9c8 00000000
0012f9cc 1f4d5310
0012f9d0 000006a0
0012f9d4 1f4d5318
0012f9d8 01f20000
0012f9dc 01f20178
0012f9e0 000006a0
0012f9e4 00000000
0012f9e8 00000000
0012f9ec 00000000
0012f9f0 0101ffff
0012f9f4 000000f2
0012f9f8 0012f954
0012f9fc 0012f578
0012fa00 0012fa48
0012fa04 7c90ee18 ntdll!_except_handler3
0012fa08 7c910570 ntdll!CheckHeapFillPattern+0x64
0012fa0c ffffffff
0012fa10 0012fa58
0012fa14 7c363593 msvcr71!free+0xc3
0012fa18 01f20000
0012fa1c 00000000
0012fa20 1f4b6ec8
0012fa24 1f4d5318
0012fa28 0012fbf4
0012fa2c 0000008e
0012fa30 00000000
0012fa34 0012fa80
0012fa38 7c3638e2 msvcr71!_except_handler3
0012fa3c 7c39a3c0 msvcr71!`string'+0xc
0012fa40 0012fa24
0012fa44 0012f578
0012fa48 0012fa80
0012fa4c 7c3638e2 msvcr71!_except_handler3
0012fa50 7c39f150 msvcr71!`string'+0x24
0012fa54 ffffffff
0012fa58 0012fa8c
0012fa5c 087e6827 NaviRadarRendererPlain!std::vector,std::allocator > >::_Insert_n+0x147
0012fa60 1f4b6ec8
0012fa64 0012fbf4
0012fa68 00000384
0012fa6c 0000033b
0012fa70 0000033b
0012fa74 00000384
0012fa78 1f4d5318
0012fa7c 0012fa64
0012fa80 0012fc24
0012fa84 087e91b0 NaviRadarRendererPlain!_dllonexit+0x1d4
0012fa88 ffffffff
0012fa8c 0012fafc
0012fa90 087e6f41 NaviRadarRendererPlain!TScanlines::calculate+0x5d1
0012fa94 1f4b7330
0012fa98 00000698
0012fa9c 1f4d5780
0012faa0 02023f10
0012faa4 02022ff8

And there were few remarkable facts about it:

1. The stack above the stack pointer (that is the stack which has just been released) was empty.
Actually it could be result of exception handling code execution. Anyway this is a sad fact because of information we could get from this part of the stack could be very useful.

2. The last value on the stack is 0x1f4d5318
It’s seems unremarkable, but if we look at the call stack in the point of failure we’ll see that debugger consider this value as a return address from the last function call. It seems the debugger assumes that corrupted instruction pointer was sent to the call instruction. Then the preceding value on the stack is return address. But since this not the only option, I decided to check this assumption first.

Let’s look at the address 0x1f4d5318 more closely.

0:000> !address 1f4d5318
Usage:
Allocation Base:        1f200000
Base Address:           1f200000
End Address:            1fa00000
Region Size:            00800000
Type:                   00020000    MEM_PRIVATE
State:                  00001000    MEM_COMMIT
Protect:                00000004    PAGE_READWRITE

It’s not inside any module!

Hence the debugger’s assumption (as well as my first one) is probably not right. And this means that the frame number 0D does not exist.

3. But the frame number 0C looks much more suitable
Thus the obvious next step I thought about is to figure out what function has created this frame. Knowing the frame return address we can find out a call instruction leading to the frame creation.

0:000> ub 7c363593 L1
msvcr71!free+0xbd [f:\vs70builds\6030\vc\crtbld\crt\src\free.c @ 101]:
7c36358d ff1580a0397c call dword ptr [msvcr71!_imp__HeapFree (7c39a080)]

Let’s look where the table value _imp__HeapFree points to.

0:000> ln poi(msvcr71!_imp__HeapFree)
(7c91043d) ntdll!RtlFreeHeap | (7c91058d) ntdll!RtlpQueryDepthSList
Exact matches:
ntdll!RtlFreeHeap

Well, all I knew at that moment is that our failure was somehow related to the ntdll!RtlFreeHeap function and it wasn’t consequence of the call instruction.

Thus the only assumption remained is the assumption regarding a frame return address. There are several reasons why the return address can recover erroneous. Most common of them is different assumption of callee and caller about function calling convention or function signature (e.g. as a result of “dll hell”). But can we talk about inappropriate function calling convention or function signature regarding to ntdll.dll function? I really don’t think so!

Well, the situation became more and more intriguing :) Unfortunately the evidences I had wasn’t enough to make it clearer. I definitely needed more information and it would be great if I could get the part of the stack we just released as well. For that purpose I decided to attach the debugger to a number of relevant processes. Once the debugger receives an access violation event we will have a good chance to get the stack untouched.

While I was waiting for a dump with untouched released stack I had examined several another dumps from the same computer.

Here is some of the call stacks:

0:033> knL
# ChildEBP RetAddr
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 0d58fed8 0a37285e 0xfc36225d
01 0d58fee8 04249d3f 0xa37285e
02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf
03 0d58ff74 03a46463 AtomDB!TSubscriberNotification_::Thread+0x23f
04 0d58ff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23
05 0d58ffb4 7c80b50b msvcr71!_threadstartex+0x6f
06 0d58ffec 00000000 kernel32!BaseThreadStart+0x37

0:070> knL
# ChildEBP RetAddr
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 067cf2e4 01199be0 0xfc3c1e8d
01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150
02 067cf434 00bfec52 FBP_VideoTranseiver!n_rlt_video_source::TFBPVideoTransmitter::OnNewData+0xbe
03 067cf478 00bfe34b MFRLT_Scanner!TVideoHdrAccum::NotifySubscribers+0xf2
04 067cf4a0 00bfc5a7 MFRLT_Scanner!TVideoHdrAccum::OnNewData+0xdb
05 067cfd10 00d99753 MFRLT_Scanner!n_rlt_scanner::TMFRLT_Scanner::SetData+0xc7
06 067cfd78 00d99004 RIBSupportProxy!n_rlt_video_source::TRibSupportProxy_::SetData+0x93
07 067cfdb0 10040aed RIBSupportProxy!n_rlt_video_source::TVideoSourceSeqServer_::OnCall+0x84
08 067cfdb8 10040cf5 ETL!TSeqServer_::OnCall+0x1d
09 067cff74 10046463 ETL!TSeqServer_::Thread+0xd5
0a 067cff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23
0b 067cffb4 7c80b50b msvcr71!_threadstartex+0x6f
0c 067cffec 00000000 kernel32!BaseThreadStart+0x37

Despite the fact that all call stacks seemed to be different I found something common in they failure course. Look at the instruction pointers which leaded to access violation:

0d 0012f950 1f4d5318 0xfc910d93
0e 0012fa10 7c363593 0x1f4d5318
0f 0012fa58 087e6827 msvcr71!free+0xc3

00 0d58fed8 0a37285e 0xfc36225d
01 0d58fee8 04249d3f 0xa37285e
02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf

00 067cf2e4 01199be0 0xfc3c1e8d
01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150

They all have the same value of the high byte. And needless to say that in 32th bit Windows OS there is no valid addresses with high bit set. But let’s look at those addresses without high bit:

0:000> ln 0x7c910d93
(7c91043d) ntdll!RtlFreeHeap+0x3a7 | (7c91058d) ntdll!RtlpQueryDepthSList

0:001> ln 0x7c36225d
(7c362247) msvcr71!strncmp+0x16 | (7c36227f) msvcr71!__sse2_available_init

0:070> ln 0x7c3c1e8d
(7c3c1e8d) msvcp71!std::basic_string,std::allocator >::~basic_string,std::allocator > | (7c3c1e9b) msvcp71!std::basic_string,std::allocator >::_Copy
Exact matches:
msvcp71!std::basic_string,std::allocator >::~basic_string,std::allocator > (void)

Now let’s try to restore the call stack to the point of failure (last known call in black, restored instruction pointer in red):

7c910d93 3d00fe0000 cmp eax,0FE00h (ntdll!RtlFreeHeap+0x3a7)
7c36358d ff1580a0397c call dword ptr [msvcr71!_imp__HeapFree (7c39a080)] (msvcr71!free+0xc3)

7c362247 55 push ebp (msvcr71!strncmp+0x16)
04249d39 ff152c852904 call dword ptr [AtomDB!_imp__strncmp (0429852c)] (AtomDB!TPatternAtomFilter_::Filter+0xdf)

7c3c1e8d 6a00 push 0 (msvcp71!std::basic_string,std::allocator >::~basic_string,std::allocator > (void))
01199bda ff15d8401a01 call dword ptr [FBP_VideoTranseiver!_imp_??1?$basic_stringDU?$char_traitsDstdV?$allocatorD (011a40d8)] (FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150)

Seems true, isn’t it? But to prove or disprove this conjecture I need to have untouched released stack anyway.

And here it is:

0:000> knL
# ChildEBP RetAddr
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 0012fe78 004a6310 0xfc1ac1c9
01 0012fe94 7c1b15e8 0x4a6310
02 0012fe9c 7c1b14f7 mfc71!CWinThread::PreTranslateMessage+0x9
03 0012fea4 7c1b1632 mfc71!AfxPreTranslateMessage+0x15
04 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x2b
05 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54
06 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30
07 0012ff18 0041b7fd mfc71!AfxWinMain+0x68
08 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185
09 0012fff0 00000000 kernel32!BaseProcessStart+0x23

0:000> dds @esp-50 @esp-4
0012fe2c 00000004
0012fe30 004a6310
0012fe34 00010b52
0012fe38 004a7664
0012fe3c 0012fe5c
0012fe40 7c20080e mfc71!_AfxForceVectorDelete+0x17fc
0012fe44 ffffffff
0012fe48 0012fe68
0012fe4c 7c1d7572 mfc71!AfxGetModuleThreadState+0x16
0012fe50 7c1d73e9 mfc71!CThreadLocal::CreateObject
0012fe54 7c1abc18 mfc71!afxMapHWND+0x10
0012fe58 00010b52
0012fe5c 0012fefc
0012fe60 7c20088d mfc71!_AfxForceVectorDelete+0x187b
0012fe64 ffffffff
0012fe68 00000000
0012fe6c 004a6310
0012fe70 00010b52
0012fe74 7c1ac1b6 mfc71!CWnd::WalkPreTranslateTree+0x10
0012fe78 00010b52

0:000> dds @esp @esp+50
0012fe7c 004a6310
0012fe80 1f2874f0
0012fe84 7c1b14b1 mfc71!AfxInternalPreTranslateMessage+0x3b
0012fe88 00010b58
0012fe8c 004a6310
0012fe90 004a6310
0012fe94 004a62e0
0012fe98 7c1b15e8 mfc71!CWinThread::PreTranslateMessage+0x9
0012fe9c 004a6310
0012fea0 7c1b14f7 mfc71!AfxPreTranslateMessage+0x15
0012fea4 004a6310
0012fea8 7c1b1632 mfc71!AfxInternalPumpMessage+0x2b
0012feac 004a6310
0012feb0 004a6310
0012feb4 004338f8 radar!theApp
0012feb8 7c1b1357 mfc71!CWinThread::Run+0x54
0012febc 004338f8 radar!theApp
0012fec0 004338f8 radar!theApp
0012fec4 0012ff08
0012fec8 ffffffff
0012fecc 00000001

Let’s check our assumption first. Last known call in this case is

7c1b14ac e8f5acffff call mfc71!CWnd::WalkPreTranslateTree (7c1ac1a6)

Thus, if I was right, the last instruction should be somewhere in the mfc71!CWnd::WalkPreTranslateTree function.

0:000> ln 0x7c1ac1c9
(7c1ac1a6) mfc71!CWnd::WalkPreTranslateTree+0x23 | (7c1ac1e8) mfc71!CWnd::SendChildNotifyLastMsg

Oh, yes! That is it!

But what the instruction is?

mfc71!CWnd::WalkPreTranslateTree+0x23
3134 7c1ac1c9 3b74240c cmp esi,dword ptr [esp+0Ch]
3134 7c1ac1cd 740d je mfc71!CWnd::WalkPreTranslateTree+0x36 (7c1ac1dc)

More than strange isn’t it? How can we corrupt the instruction pointer executing cmp instruction? How the hell is this possible?!!!

The only thing I knew for sure – this is hardware problem. This explained a lot. The only thing left to do is to find the culprit.

I placed the suspects in the following order (in order of increasing probability):

1.    RAM corruption
I considered it’s not very likely because in the case of memory cell corruption we would have problems in the arbitrary pieces of data and code, but not only for instruction pointers values.

2.    Hard drive corruption
I considered it’s not very likely for the same reasons as the previous issue. We would have failures either in the same place (in the case of executable module corruption) or failures due to different reasons (in the case of page file corruption). Besides, I never heard that the surface of the hard drive corrupted only for one bit ?

3.    Motherboard corruption
Well, can’t say anything about it. Perhaps there is a mechanism that could lead to such problems, but I know nothing about it.

4.    CPU
The most likely candidate in my opinion. In the case the error is somewhere in instruction pointer increasing mechanism or somehow related to thread context restoring it may well be.

Yes, I know. If you hear “somewhere” and “somehow” too often there is “something” wrong :)

Thus, I needed evidence. I decided to replace the hardware components one by one (fortunately we have absolutely identical computers on the test bench). Replacement of RAM didn’t lead to any result. As well as replacement of HD. Unfortunately we couldn’t replace motherboard and CPU separately (because of our industrial PCs architecture), so we replaced them together.

And from this moment we hadn’t received any crashes at all!!!

Turns out we spent a lot of time and efforts to find the bug that was not in our (or some one else) code but “stood on the table” all this time!