Saturday, June 26, 2010

The bug is on the table

A few months ago there was an interesting story in the company where I work. One guy from a neighboring department asked me to help him. He encountered one crash after another over the past few days. Unusual in these crashes was that they occurred in different processes and under different circumstances, but within one computer on the test bench (consisted of 5 machines) and with identical symptoms. In all cases it was an access violation caused by illegal instruction pointer.

Here is an example of one of the call stacks:

0:000> knL
 # ChildEBP RetAddr 
00 0012d7f8 7c90e9ab ntdll!KiFastSystemCallRet
01 0012d7fc 7c8094f2 ntdll!ZwWaitForMultipleObjects+0xc
02 0012d898 7c809c86 kernel32!WaitForMultipleObjectsEx+0x12c
03 0012d8b4 6945763c kernel32!WaitForMultipleObjects+0x18
04 0012e248 694582b1 faultrep!StartDWException+0x5df
05 0012f2bc 7c863059 faultrep!ReportFault+0x533
06 0012f530 7c36e289 kernel32!UnhandledExceptionFilter+0x4cf
07 0012f54c 0041b84f msvcr71!_XcptFilter+0x15f
08 0012f558 7c363943 radar!WinMainCRTStartup+0x1d7
09 0012f580 7c9037bf msvcr71!_except_handler3+0x61
0a 0012f5a4 7c90378b ntdll!ExecuteHandler2+0x26
0b 0012f654 7c90eafa ntdll!ExecuteHandler+0x24
0c 0012f654 fc910d93 ntdll!KiUserExceptionDispatcher+0xe
WARNING: Frame IP not in any known module. Following frames may be wrong.
0d 0012f950 1f4d5318 0xfc910d93
0e 0012fa10 7c363593 0x1f4d5318
0f 0012fa58 087e6827 msvcr71!free+0xc3
10 0012fa8c 087e6f41 NaviRadarRendererPlain!std::vector,std::allocator > >::_Insert_n+0x147
11 0012fafc 087e46f1 NaviRadarRendererPlain!TScanlines::calculate+0x5d1
12 0012fc2c 087e418e NaviRadarRendererPlain!TRadarToScreen_::RenderBlock+0x521
13 0012fc64 087e2378 NaviRadarRendererPlain!TRadarToScreen_::Render+0x16e
14 0012fc88 087c4cf0 NaviRadarRendererPlain!n_d3d_render::renderer_t::render+0x68
15 0012fce0 03d5c5e7 NaviRadarLayer!TRadarLayerImpl::Draw+0x360
16 0012fd1c 03d55c8e TotUser!TRadarManager_::Update+0x77
17 0012fd4c 087c821a TotUser!TCrtPanel_::InvalidateRadar+0xbe
18 0012fd54 10ac8959 NaviRadarLayer!TRadarLayerImpl::OnVideoBlockReceived+0x7a
19 0012fd7c 10ac8ab6 TkRadar20!IAdviseHostImpl::TypedForEach+0x49
1a 0012fd90 10ac8e88 TkRadar20!RadarConnector::VideoBlockReceived+0x26
1b 0012fda0 77d48709 TkRadar20!detail::WndProc+0x48
1c 0012fdcc 77d487eb user32!InternalCallWinProc+0x28
1d 0012fe34 77d489a5 user32!UserCallWinProcCheckWow+0x150
1e 0012fe94 77d4bccc user32!DispatchMessageWorker+0x306
1f 0012fea4 7c1b1645 user32!DispatchMessageA+0xf
20 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x3e
21 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54
22 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30
23 0012ff18 0041b7fd mfc71!AfxWinMain+0x68
24 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185
25 0012fff0 00000000 kernel32!BaseProcessStart+0x23



When I looked at the call stack for the first time I made several assumptions about IP corruption reason (those that I considered more likely):
1.    Failed IP address was sent to the call instruction.
2.    Wrong return address was restored by the ret instruction.

To conform or refute each of them I started from the stack investigation.

Content of the stack around the current stack pointer was the following:


0:000> .frame /c 0d
0d 0012f950 1f4d5318 0xfc910d93
eax=000000f2 ebx=00000000 ecx=0000acce edx=1f4b7330 esi=1f4b6ec0 edi=01f20000
eip=fc910d93 esp=0012f954 ebp=0012fa10 iopl=0         nv up ei ng nz na po cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010283
fc910d93 ??              ???

0:000> dds @esp-50 @esp-4
0012f904  00000000
0012f908  00000000
0012f90c  00000000
0012f910  00000000
0012f914  00000000
0012f918  00000000
0012f91c  00000000
0012f920  00000000
0012f924  00000000
0012f928  00000000
0012f92c  00000000
0012f930  00000000
0012f934  00000000
0012f938  00000000
0012f93c  00000000
0012f940  00000000
0012f944  00000000
0012f948  00000000
0012f94c  00000000
0012f950  00000000

0:000> dds @esp @esp+150
0012f954  1f4d5318
0012f958  1f4b6ec8
0012f95c  0000008e
0012f960  11200000
0012f964  7c397a63 msvcr71!_87except+0xc4
0012f968  1f4b6ec0
0012f96c  00000008
0012f970  00000258
0012f974  00000258
0012f978  7c397a97 msvcr71!_87except+0xf8
0012f97c  0000037f
0012f980  01f20178
0012f984  00000000
0012f988  00000000
0012f98c  00000000
0012f990  00000000
0012f994  00000000
0012f998  1f4d5318
0012f99c  00000000
0012f9a0  000000d4
0012f9a4  00000000
0012f9a8  00000000
0012f9ac  00000000
0012f9b0  00000470
0012f9b4  1f4d5ac8
0012f9b8  00000000
0012f9bc  1f4d5310
0012f9c0  00000000
0012f9c4  00000000
0012f9c8  00000000
0012f9cc  1f4d5310
0012f9d0  000006a0
0012f9d4  1f4d5318
0012f9d8  01f20000
0012f9dc  01f20178
0012f9e0  000006a0
0012f9e4  00000000
0012f9e8  00000000
0012f9ec  00000000
0012f9f0  0101ffff
0012f9f4  000000f2
0012f9f8  0012f954
0012f9fc  0012f578
0012fa00  0012fa48
0012fa04  7c90ee18 ntdll!_except_handler3
0012fa08  7c910570 ntdll!CheckHeapFillPattern+0x64
0012fa0c  ffffffff
0012fa10  0012fa58
0012fa14  7c363593 msvcr71!free+0xc3
0012fa18  01f20000
0012fa1c  00000000
0012fa20  1f4b6ec8
0012fa24  1f4d5318
0012fa28  0012fbf4
0012fa2c  0000008e
0012fa30  00000000
0012fa34  0012fa80
0012fa38  7c3638e2 msvcr71!_except_handler3
0012fa3c  7c39a3c0 msvcr71!`string'+0xc
0012fa40  0012fa24
0012fa44  0012f578
0012fa48  0012fa80
0012fa4c  7c3638e2 msvcr71!_except_handler3
0012fa50  7c39f150 msvcr71!`string'+0x24
0012fa54  ffffffff
0012fa58  0012fa8c
0012fa5c  087e6827 NaviRadarRendererPlain!std::vector,std::allocator > >::_Insert_n+0x147
0012fa60  1f4b6ec8
0012fa64  0012fbf4
0012fa68  00000384
0012fa6c  0000033b
0012fa70  0000033b
0012fa74  00000384
0012fa78  1f4d5318
0012fa7c  0012fa64
0012fa80  0012fc24
0012fa84  087e91b0 NaviRadarRendererPlain!_dllonexit+0x1d4
0012fa88  ffffffff
0012fa8c  0012fafc
0012fa90  087e6f41 NaviRadarRendererPlain!TScanlines::calculate+0x5d1
0012fa94  1f4b7330
0012fa98  00000698
0012fa9c  1f4d5780
0012faa0  02023f10
0012faa4  02022ff8

And there were few remarkable facts about it:

1.    The stack above the stack pointer (that is the stack which has just been released) was empty.
Actually it could be result of exception handling code execution. Anyway this is a sad fact because of information we could get from this part of the stack could be very useful.

2.    The last value on the stack is 0x1f4d5318
It’s seems unremarkable, but if we look at the call stack in the point of failure we’ll see that debugger consider this value as a return address from the last function call. It seems the debugger assumes that corrupted instruction pointer was sent to the call instruction. Then the preceding value on the stack is return address. But since this not the only option, I decided to check this assumption first.

Let’s look at the address 0x1f4d5318 more closely.

0:000> !address 1f4d5318
Usage:                 
Allocation Base:        1f200000
Base Address:           1f200000
End Address:            1fa00000
Region Size:            00800000
Type:                   00020000    MEM_PRIVATE
State:                  00001000    MEM_COMMIT
Protect:                00000004    PAGE_READWRITE

It’s not inside any module!

Hence the debugger’s assumption (as well as my first one) is probably not right. And this means that the frame number 0D does not exist.

3.    But the frame number 0C looks much more suitable
Thus the obvious next step I thought about is to figure out what function has created this frame. Knowing the frame return address we can find out a call instruction leading to the frame creation.

0:000> ub 7c363593 L1
msvcr71!free+0xbd [f:\vs70builds\6030\vc\crtbld\crt\src\free.c @ 101]:
7c36358d ff1580a0397c    call    dword ptr [msvcr71!_imp__HeapFree (7c39a080)]

Let’s look where the table value _imp__HeapFree points to.

0:000> ln poi(msvcr71!_imp__HeapFree)
(7c91043d)   ntdll!RtlFreeHeap   |  (7c91058d)   ntdll!RtlpQueryDepthSList
Exact matches:
    ntdll!RtlFreeHeap

Well, all I knew at that moment is that our failure was somehow related to the ntdll!RtlFreeHeap function and it wasn’t consequence of the call instruction.

Thus the only assumption remained is the assumption regarding a frame return address. There are several reasons why the return address can recover erroneous. Most common of them is different assumption of callee and caller about function calling convention or function signature (e.g. as a result of “dll hell”). But can we talk about inappropriate function calling convention or function signature regarding to ntdll.dll function? I really don’t think so!

Well, the situation became more and more intriguing :)  Unfortunately the evidences I had wasn’t enough to make it clearer. I definitely needed more information and it would be great if I could get the part of the stack we just released as well. For that purpose I decided to attach the debugger to a number of relevant processes. Once the debugger receives an access violation event we will have a good chance to get the stack untouched.

While I was waiting for a dump with untouched released stack I had examined several another dumps from the same computer.

Here is some of the call stacks:

0:033> knL
# ChildEBP RetAddr 
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 0d58fed8 0a37285e 0xfc36225d
01 0d58fee8 04249d3f 0xa37285e
02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf
03 0d58ff74 03a46463 AtomDB!TSubscriberNotification_::Thread+0x23f
04 0d58ff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23
05 0d58ffb4 7c80b50b msvcr71!_threadstartex+0x6f
06 0d58ffec 00000000 kernel32!BaseThreadStart+0x37

0:070> knL
 # ChildEBP RetAddr 
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 067cf2e4 01199be0 0xfc3c1e8d
01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150
02 067cf434 00bfec52       FBP_VideoTranseiver!n_rlt_video_source::TFBPVideoTransmitter::OnNewData+0xbe
03 067cf478 00bfe34b MFRLT_Scanner!TVideoHdrAccum::NotifySubscribers+0xf2
04 067cf4a0 00bfc5a7 MFRLT_Scanner!TVideoHdrAccum::OnNewData+0xdb
05 067cfd10 00d99753 MFRLT_Scanner!n_rlt_scanner::TMFRLT_Scanner::SetData+0xc7
06 067cfd78 00d99004 RIBSupportProxy!n_rlt_video_source::TRibSupportProxy_::SetData+0x93
07 067cfdb0 10040aed      RIBSupportProxy!n_rlt_video_source::TVideoSourceSeqServer_::OnCall+0x84
08 067cfdb8 10040cf5 ETL!TSeqServer_::OnCall+0x1d
09 067cff74 10046463 ETL!TSeqServer_::Thread+0xd5
0a 067cff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23
0b 067cffb4 7c80b50b msvcr71!_threadstartex+0x6f
0c 067cffec 00000000 kernel32!BaseThreadStart+0x37

Despite the fact that all call stacks seemed to be different I found something common in they failure course. Look at the instruction pointers which leaded to access violation:

0d 0012f950 1f4d5318 0xfc910d93
0e 0012fa10 7c363593 0x1f4d5318
0f 0012fa58 087e6827 msvcr71!free+0xc3

00 0d58fed8 0a37285e 0xfc36225d
01 0d58fee8 04249d3f 0xa37285e
02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf

00 067cf2e4 01199be0 0xfc3c1e8d
01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150

They all have the same value of the high byte. And needless to say that in 32th bit Windows OS there is no valid addresses with high bit set. But let’s look at those addresses without high bit:

0:000> ln 0x7c910d93
(7c91043d)   ntdll!RtlFreeHeap+0x3a7   |  (7c91058d)   ntdll!RtlpQueryDepthSList

0:001> ln 0x7c36225d
 (7c362247)   msvcr71!strncmp+0x16   |  (7c36227f)   msvcr71!__sse2_available_init

0:070> ln 0x7c3c1e8d
 (7c3c1e8d)   msvcp71!std::basic_string,std::allocator >::~basic_string,std::allocator >   |  (7c3c1e9b)   msvcp71!std::basic_string,std::allocator >::_Copy
Exact matches:
    msvcp71!std::basic_string,std::allocator >::~basic_string,std::allocator > (void)

Now let’s try to restore the call stack to the point of failure (last known call in black, restored instruction pointer in red):

7c910d93 3d00fe0000   cmp     eax,0FE00h (ntdll!RtlFreeHeap+0x3a7)
7c36358d ff1580a0397c call    dword ptr [msvcr71!_imp__HeapFree (7c39a080)] (msvcr71!free+0xc3)

7c362247 55              push    ebp (msvcr71!strncmp+0x16)
04249d39 ff152c852904    call    dword ptr [AtomDB!_imp__strncmp (0429852c)] (AtomDB!TPatternAtomFilter_::Filter+0xdf)

7c3c1e8d 6a00            push    0 (msvcp71!std::basic_string,std::allocator >::~basic_string,std::allocator > (void))
01199bda ff15d8401a01    call    dword ptr [FBP_VideoTranseiver!_imp_??1?$basic_stringDU?$char_traitsDstdV?$allocatorD (011a40d8)] (FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150)

Seems true, isn’t it? But to prove or disprove this conjecture I need to have untouched released stack anyway.

And here it is:

0:000> knL
# ChildEBP RetAddr 
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 0012fe78 004a6310 0xfc1ac1c9
01 0012fe94 7c1b15e8 0x4a6310
02 0012fe9c 7c1b14f7 mfc71!CWinThread::PreTranslateMessage+0x9
03 0012fea4 7c1b1632 mfc71!AfxPreTranslateMessage+0x15
04 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x2b
05 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54
06 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30
07 0012ff18 0041b7fd mfc71!AfxWinMain+0x68
08 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185
09 0012fff0 00000000 kernel32!BaseProcessStart+0x23

0:000> dds @esp-50 @esp-4
0012fe2c  00000004
0012fe30  004a6310
0012fe34  00010b52
0012fe38  004a7664
0012fe3c  0012fe5c
0012fe40  7c20080e mfc71!_AfxForceVectorDelete+0x17fc
0012fe44  ffffffff
0012fe48  0012fe68
0012fe4c  7c1d7572 mfc71!AfxGetModuleThreadState+0x16
0012fe50  7c1d73e9 mfc71!CThreadLocal::CreateObject
0012fe54  7c1abc18 mfc71!afxMapHWND+0x10
0012fe58  00010b52
0012fe5c  0012fefc
0012fe60  7c20088d mfc71!_AfxForceVectorDelete+0x187b
0012fe64  ffffffff
0012fe68  00000000
0012fe6c  004a6310
0012fe70  00010b52
0012fe74  7c1ac1b6 mfc71!CWnd::WalkPreTranslateTree+0x10
0012fe78  00010b52

0:000> dds @esp @esp+50
0012fe7c  004a6310
0012fe80  1f2874f0
0012fe84  7c1b14b1 mfc71!AfxInternalPreTranslateMessage+0x3b
0012fe88  00010b58
0012fe8c  004a6310
0012fe90  004a6310
0012fe94  004a62e0
0012fe98  7c1b15e8 mfc71!CWinThread::PreTranslateMessage+0x9
0012fe9c  004a6310
0012fea0  7c1b14f7 mfc71!AfxPreTranslateMessage+0x15
0012fea4  004a6310
0012fea8  7c1b1632 mfc71!AfxInternalPumpMessage+0x2b
0012feac  004a6310
0012feb0  004a6310
0012feb4  004338f8 radar!theApp
0012feb8  7c1b1357 mfc71!CWinThread::Run+0x54
0012febc  004338f8 radar!theApp
0012fec0  004338f8 radar!theApp
0012fec4  0012ff08
0012fec8  ffffffff
0012fecc  00000001

Let’s check our assumption first. Last known call in this case is

7c1b14ac e8f5acffff      call    mfc71!CWnd::WalkPreTranslateTree (7c1ac1a6)

Thus, if I was right, the last instruction should be somewhere in the mfc71!CWnd::WalkPreTranslateTree function.

0:000> ln 0x7c1ac1c9
 (7c1ac1a6)   mfc71!CWnd::WalkPreTranslateTree+0x23   |  (7c1ac1e8)   mfc71!CWnd::SendChildNotifyLastMsg

Oh, yes! That is it!

But what the instruction is?

mfc71!CWnd::WalkPreTranslateTree+0x23
 3134 7c1ac1c9 3b74240c        cmp     esi,dword ptr [esp+0Ch]
 3134 7c1ac1cd 740d            je      mfc71!CWnd::WalkPreTranslateTree+0x36 (7c1ac1dc)

More than strange isn’t it? How can we corrupt the instruction pointer executing cmp instruction? How the hell is this possible?!!!

The only thing I knew for sure – this is hardware problem. This explained a lot. The only thing left to do is to find the culprit.

I placed the suspects in the following order (in order of increasing probability):

1.    RAM corruption
I considered it’s not very likely because in the case of memory cell corruption we would have problems in the arbitrary pieces of data and code, but not only for instruction pointers values.

2.    Hard drive corruption
I considered it’s not very likely for the same reasons as the previous issue. We would have failures either in the same place (in the case of executable module corruption) or failures due to different reasons (in the case of page file corruption). Besides, I never heard that the surface of the hard drive corrupted only for one bit ?

3.    Motherboard corruption
Well, can’t say anything about it. Perhaps there is a mechanism that could lead to such problems, but I know nothing about it.

4.    CPU
The most likely candidate in my opinion. In the case the error is somewhere in instruction pointer increasing mechanism or somehow related to thread context restoring it may well be.

Yes, I know. If you hear “somewhere” and “somehow” too often there is “something” wrong :)

Thus, I needed evidence. I decided to replace the hardware components one by one (fortunately we have absolutely identical computers on the test bench). Replacement of RAM didn’t lead to any result. As well as replacement of HD. Unfortunately we couldn’t replace motherboard and CPU separately (because of our industrial PCs architecture), so we replaced them together.

And from this moment we hadn’t received any crashes at all!!!

Turns out we spent a lot of time and efforts to find the bug that was not in our (or some one else) code but “stood on the table” all this time!

No comments:

Post a Comment