Here is an example of one of the call stacks:
0:000> knL # ChildEBP RetAddr 00 0012d7f8 7c90e9ab ntdll!KiFastSystemCallRet 01 0012d7fc 7c8094f2 ntdll!ZwWaitForMultipleObjects+0xc 02 0012d898 7c809c86 kernel32!WaitForMultipleObjectsEx+0x12c 03 0012d8b4 6945763c kernel32!WaitForMultipleObjects+0x18 04 0012e248 694582b1 faultrep!StartDWException+0x5df 05 0012f2bc 7c863059 faultrep!ReportFault+0x533 06 0012f530 7c36e289 kernel32!UnhandledExceptionFilter+0x4cf 07 0012f54c 0041b84f msvcr71!_XcptFilter+0x15f 08 0012f558 7c363943 radar!WinMainCRTStartup+0x1d7 09 0012f580 7c9037bf msvcr71!_except_handler3+0x61 0a 0012f5a4 7c90378b ntdll!ExecuteHandler2+0x26 0b 0012f654 7c90eafa ntdll!ExecuteHandler+0x24 0c 0012f654 fc910d93 ntdll!KiUserExceptionDispatcher+0xe WARNING: Frame IP not in any known module. Following frames may be wrong. 0d 0012f950 1f4d5318 0xfc910d93 0e 0012fa10 7c363593 0x1f4d5318 0f 0012fa58 087e6827 msvcr71!free+0xc3 10 0012fa8c 087e6f41 NaviRadarRendererPlain!std::vector 11 0012fafc 087e46f1 NaviRadarRendererPlain!TScanlines::calculate+0x5d1 12 0012fc2c 087e418e NaviRadarRendererPlain!TRadarToScreen_ 13 0012fc64 087e2378 NaviRadarRendererPlain!TRadarToScreen_ 14 0012fc88 087c4cf0 NaviRadarRendererPlain!n_d3d_render::renderer_t::render+0x68 15 0012fce0 03d5c5e7 NaviRadarLayer!TRadarLayerImpl::Draw+0x360 16 0012fd1c 03d55c8e TotUser!TRadarManager_::Update+0x77 17 0012fd4c 087c821a TotUser!TCrtPanel_::InvalidateRadar+0xbe 18 0012fd54 10ac8959 NaviRadarLayer!TRadarLayerImpl::OnVideoBlockReceived+0x7a 19 0012fd7c 10ac8ab6 TkRadar20!IAdviseHostImpl 1a 0012fd90 10ac8e88 TkRadar20!RadarConnector::VideoBlockReceived+0x26 1b 0012fda0 77d48709 TkRadar20!detail::WndProc+0x48 1c 0012fdcc 77d487eb user32!InternalCallWinProc+0x28 1d 0012fe34 77d489a5 user32!UserCallWinProcCheckWow+0x150 1e 0012fe94 77d4bccc user32!DispatchMessageWorker+0x306 1f 0012fea4 7c1b1645 user32!DispatchMessageA+0xf 20 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x3e 21 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54 22 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30 23 0012ff18 0041b7fd mfc71!AfxWinMain+0x68 24 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185 25 0012fff0 00000000 kernel32!BaseProcessStart+0x23 |
When I looked at the call stack for the first time I made several assumptions about IP corruption reason (those that I considered more likely):
1. Failed IP address was sent to the call instruction.
2. Wrong return address was restored by the ret instruction.
To conform or refute each of them I started from the stack investigation.
Content of the stack around the current stack pointer was the following:
0:000> .frame /c 0d 0d 0012f950 1f4d5318 0xfc910d93 eax=000000f2 ebx=00000000 ecx=0000acce edx=1f4b7330 esi=1f4b6ec0 edi=01f20000 eip=fc910d93 esp=0012f954 ebp=0012fa10 iopl=0 nv up ei ng nz na po cy cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010283 fc910d93 ?? ??? 0:000> dds @esp-50 @esp-4 0012f904 00000000 0012f908 00000000 0012f90c 00000000 0012f910 00000000 0012f914 00000000 0012f918 00000000 0012f91c 00000000 0012f920 00000000 0012f924 00000000 0012f928 00000000 0012f92c 00000000 0012f930 00000000 0012f934 00000000 0012f938 00000000 0012f93c 00000000 0012f940 00000000 0012f944 00000000 0012f948 00000000 0012f94c 00000000 0012f950 00000000 0:000> dds @esp @esp+150 0012f954 1f4d5318 0012f958 1f4b6ec8 0012f95c 0000008e 0012f960 11200000 0012f964 7c397a63 msvcr71!_87except+0xc4 0012f968 1f4b6ec0 0012f96c 00000008 0012f970 00000258 0012f974 00000258 0012f978 7c397a97 msvcr71!_87except+0xf8 0012f97c 0000037f 0012f980 01f20178 0012f984 00000000 0012f988 00000000 0012f98c 00000000 0012f990 00000000 0012f994 00000000 0012f998 1f4d5318 0012f99c 00000000 0012f9a0 000000d4 0012f9a4 00000000 0012f9a8 00000000 0012f9ac 00000000 0012f9b0 00000470 0012f9b4 1f4d5ac8 0012f9b8 00000000 0012f9bc 1f4d5310 0012f9c0 00000000 0012f9c4 00000000 0012f9c8 00000000 0012f9cc 1f4d5310 0012f9d0 000006a0 0012f9d4 1f4d5318 0012f9d8 01f20000 0012f9dc 01f20178 0012f9e0 000006a0 0012f9e4 00000000 0012f9e8 00000000 0012f9ec 00000000 0012f9f0 0101ffff 0012f9f4 000000f2 0012f9f8 0012f954 0012f9fc 0012f578 0012fa00 0012fa48 0012fa04 7c90ee18 ntdll!_except_handler3 0012fa08 7c910570 ntdll!CheckHeapFillPattern+0x64 0012fa0c ffffffff 0012fa10 0012fa58 0012fa14 7c363593 msvcr71!free+0xc3 0012fa18 01f20000 0012fa1c 00000000 0012fa20 1f4b6ec8 0012fa24 1f4d5318 0012fa28 0012fbf4 0012fa2c 0000008e 0012fa30 00000000 0012fa34 0012fa80 0012fa38 7c3638e2 msvcr71!_except_handler3 0012fa3c 7c39a3c0 msvcr71!`string'+0xc 0012fa40 0012fa24 0012fa44 0012f578 0012fa48 0012fa80 0012fa4c 7c3638e2 msvcr71!_except_handler3 0012fa50 7c39f150 msvcr71!`string'+0x24 0012fa54 ffffffff 0012fa58 0012fa8c 0012fa5c 087e6827 NaviRadarRendererPlain!std::vector 0012fa60 1f4b6ec8 0012fa64 0012fbf4 0012fa68 00000384 0012fa6c 0000033b 0012fa70 0000033b 0012fa74 00000384 0012fa78 1f4d5318 0012fa7c 0012fa64 0012fa80 0012fc24 0012fa84 087e91b0 NaviRadarRendererPlain!_dllonexit+0x1d4 0012fa88 ffffffff 0012fa8c 0012fafc 0012fa90 087e6f41 NaviRadarRendererPlain!TScanlines::calculate+0x5d1 0012fa94 1f4b7330 0012fa98 00000698 0012fa9c 1f4d5780 0012faa0 02023f10 0012faa4 02022ff8 |
And there were few remarkable facts about it:
1. The stack above the stack pointer (that is the stack which has just been released) was empty.
Actually it could be result of exception handling code execution. Anyway this is a sad fact because of information we could get from this part of the stack could be very useful.
2. The last value on the stack is 0x1f4d5318
It’s seems unremarkable, but if we look at the call stack in the point of failure we’ll see that debugger consider this value as a return address from the last function call. It seems the debugger assumes that corrupted instruction pointer was sent to the call instruction. Then the preceding value on the stack is return address. But since this not the only option, I decided to check this assumption first.
Let’s look at the address 0x1f4d5318 more closely.
0:000> !address 1f4d5318 Usage: Allocation Base: 1f200000 Base Address: 1f200000 End Address: 1fa00000 Region Size: 00800000 Type: 00020000 MEM_PRIVATE State: 00001000 MEM_COMMIT Protect: 00000004 PAGE_READWRITE |
It’s not inside any module!
Hence the debugger’s assumption (as well as my first one) is probably not right. And this means that the frame number 0D does not exist.
3. But the frame number 0C looks much more suitable
Thus the obvious next step I thought about is to figure out what function has created this frame. Knowing the frame return address we can find out a call instruction leading to the frame creation.
0:000> ub 7c363593 L1 msvcr71!free+0xbd [f:\vs70builds\6030\vc\crtbld\crt\src\free.c @ 101]: 7c36358d ff1580a0397c call dword ptr [msvcr71!_imp__HeapFree (7c39a080)] |
Let’s look where the table value _imp__HeapFree points to.
0:000> ln poi(msvcr71!_imp__HeapFree) (7c91043d) ntdll!RtlFreeHeap | (7c91058d) ntdll!RtlpQueryDepthSList Exact matches: ntdll!RtlFreeHeap |
Well, all I knew at that moment is that our failure was somehow related to the ntdll!RtlFreeHeap function and it wasn’t consequence of the call instruction.
Thus the only assumption remained is the assumption regarding a frame return address. There are several reasons why the return address can recover erroneous. Most common of them is different assumption of callee and caller about function calling convention or function signature (e.g. as a result of “dll hell”). But can we talk about inappropriate function calling convention or function signature regarding to ntdll.dll function? I really don’t think so!
Well, the situation became more and more intriguing :) Unfortunately the evidences I had wasn’t enough to make it clearer. I definitely needed more information and it would be great if I could get the part of the stack we just released as well. For that purpose I decided to attach the debugger to a number of relevant processes. Once the debugger receives an access violation event we will have a good chance to get the stack untouched.
While I was waiting for a dump with untouched released stack I had examined several another dumps from the same computer.
Here is some of the call stacks:
0:033> knL # ChildEBP RetAddr WARNING: Frame IP not in any known module. Following frames may be wrong. 00 0d58fed8 0a37285e 0xfc36225d 01 0d58fee8 04249d3f 0xa37285e 02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf 03 0d58ff74 03a46463 AtomDB!TSubscriberNotification_::Thread+0x23f 04 0d58ff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23 05 0d58ffb4 7c80b50b msvcr71!_threadstartex+0x6f 06 0d58ffec 00000000 kernel32!BaseThreadStart+0x37 |
0:070> knL # ChildEBP RetAddr WARNING: Frame IP not in any known module. Following frames may be wrong. 00 067cf2e4 01199be0 0xfc3c1e8d 01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150 02 067cf434 00bfec52 FBP_VideoTranseiver!n_rlt_video_source::TFBPVideoTransmitter::OnNewData+0xbe 03 067cf478 00bfe34b MFRLT_Scanner!TVideoHdrAccum::NotifySubscribers+0xf2 04 067cf4a0 00bfc5a7 MFRLT_Scanner!TVideoHdrAccum::OnNewData+0xdb 05 067cfd10 00d99753 MFRLT_Scanner!n_rlt_scanner::TMFRLT_Scanner::SetData+0xc7 06 067cfd78 00d99004 RIBSupportProxy!n_rlt_video_source::TRibSupportProxy_::SetData+0x93 07 067cfdb0 10040aed RIBSupportProxy!n_rlt_video_source::TVideoSourceSeqServer_::OnCall+0x84 08 067cfdb8 10040cf5 ETL!TSeqServer_::OnCall+0x1d 09 067cff74 10046463 ETL!TSeqServer_::Thread+0xd5 0a 067cff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23 0b 067cffb4 7c80b50b msvcr71!_threadstartex+0x6f 0c 067cffec 00000000 kernel32!BaseThreadStart+0x37 |
Despite the fact that all call stacks seemed to be different I found something common in they failure course. Look at the instruction pointers which leaded to access violation:
0d 0012f950 1f4d5318 0xfc910d93 0e 0012fa10 7c363593 0x1f4d5318 0f 0012fa58 087e6827 msvcr71!free+0xc3 |
00 0d58fed8 0a37285e 0xfc36225d 01 0d58fee8 04249d3f 0xa37285e 02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf |
00 067cf2e4 01199be0 0xfc3c1e8d 01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150 |
They all have the same value of the high byte. And needless to say that in 32th bit Windows OS there is no valid addresses with high bit set. But let’s look at those addresses without high bit:
0:000> ln 0x7c910d93 (7c91043d) ntdll!RtlFreeHeap+0x3a7 | (7c91058d) ntdll!RtlpQueryDepthSList |
0:001> ln 0x7c36225d (7c362247) msvcr71!strncmp+0x16 | (7c36227f) msvcr71!__sse2_available_init |
0:070> ln 0x7c3c1e8d (7c3c1e8d) msvcp71!std::basic_string Exact matches: msvcp71!std::basic_string |
Now let’s try to restore the call stack to the point of failure (last known call in black, restored instruction pointer in red):
7c910d93 3d00fe0000 cmp eax,0FE00h (ntdll!RtlFreeHeap+0x3a7) 7c36358d ff1580a0397c call dword ptr [msvcr71!_imp__HeapFree (7c39a080)] (msvcr71!free+0xc3) |
7c362247 55 push ebp (msvcr71!strncmp+0x16) 04249d39 ff152c852904 call dword ptr [AtomDB!_imp__strncmp (0429852c)] (AtomDB!TPatternAtomFilter_::Filter+0xdf) |
7c3c1e8d 6a00 push 0 (msvcp71!std::basic_string 01199bda ff15d8401a01 call dword ptr [FBP_VideoTranseiver!_imp_??1?$basic_stringDU?$char_traitsDstdV?$allocatorD (011a40d8)] (FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150) |
Seems true, isn’t it? But to prove or disprove this conjecture I need to have untouched released stack anyway.
And here it is:
0:000> knL # ChildEBP RetAddr WARNING: Frame IP not in any known module. Following frames may be wrong. 00 0012fe78 004a6310 0xfc1ac1c9 01 0012fe94 7c1b15e8 0x4a6310 02 0012fe9c 7c1b14f7 mfc71!CWinThread::PreTranslateMessage+0x9 03 0012fea4 7c1b1632 mfc71!AfxPreTranslateMessage+0x15 04 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x2b 05 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54 06 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30 07 0012ff18 0041b7fd mfc71!AfxWinMain+0x68 08 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185 09 0012fff0 00000000 kernel32!BaseProcessStart+0x23 |
0:000> dds @esp-50 @esp-4 0012fe2c 00000004 0012fe30 004a6310 0012fe34 00010b52 0012fe38 004a7664 0012fe3c 0012fe5c 0012fe40 7c20080e mfc71!_AfxForceVectorDelete+0x17fc 0012fe44 ffffffff 0012fe48 0012fe68 0012fe4c 7c1d7572 mfc71!AfxGetModuleThreadState+0x16 0012fe50 7c1d73e9 mfc71!CThreadLocal 0012fe54 7c1abc18 mfc71!afxMapHWND+0x10 0012fe58 00010b52 0012fe5c 0012fefc 0012fe60 7c20088d mfc71!_AfxForceVectorDelete+0x187b 0012fe64 ffffffff 0012fe68 00000000 0012fe6c 004a6310 0012fe70 00010b52 0012fe74 7c1ac1b6 mfc71!CWnd::WalkPreTranslateTree+0x10 0012fe78 00010b52 |
0:000> dds @esp @esp+50 0012fe7c 004a6310 0012fe80 1f2874f0 0012fe84 7c1b14b1 mfc71!AfxInternalPreTranslateMessage+0x3b 0012fe88 00010b58 0012fe8c 004a6310 0012fe90 004a6310 0012fe94 004a62e0 0012fe98 7c1b15e8 mfc71!CWinThread::PreTranslateMessage+0x9 0012fe9c 004a6310 0012fea0 7c1b14f7 mfc71!AfxPreTranslateMessage+0x15 0012fea4 004a6310 0012fea8 7c1b1632 mfc71!AfxInternalPumpMessage+0x2b 0012feac 004a6310 0012feb0 004a6310 0012feb4 004338f8 radar!theApp 0012feb8 7c1b1357 mfc71!CWinThread::Run+0x54 0012febc 004338f8 radar!theApp 0012fec0 004338f8 radar!theApp 0012fec4 0012ff08 0012fec8 ffffffff 0012fecc 00000001 |
Let’s check our assumption first. Last known call in this case is
7c1b14ac e8f5acffff call mfc71!CWnd::WalkPreTranslateTree (7c1ac1a6) |
Thus, if I was right, the last instruction should be somewhere in the mfc71!CWnd::WalkPreTranslateTree function.
0:000> ln 0x7c1ac1c9 (7c1ac1a6) mfc71!CWnd::WalkPreTranslateTree+0x23 | (7c1ac1e8) mfc71!CWnd::SendChildNotifyLastMsg |
Oh, yes! That is it!
But what the instruction is?
mfc71!CWnd::WalkPreTranslateTree+0x23 3134 7c1ac1c9 3b74240c cmp esi,dword ptr [esp+0Ch] 3134 7c1ac1cd 740d je mfc71!CWnd::WalkPreTranslateTree+0x36 (7c1ac1dc) |
More than strange isn’t it? How can we corrupt the instruction pointer executing cmp instruction? How the hell is this possible?!!!
The only thing I knew for sure – this is hardware problem. This explained a lot. The only thing left to do is to find the culprit.
I placed the suspects in the following order (in order of increasing probability):
1. RAM corruption
I considered it’s not very likely because in the case of memory cell corruption we would have problems in the arbitrary pieces of data and code, but not only for instruction pointers values.
2. Hard drive corruption
I considered it’s not very likely for the same reasons as the previous issue. We would have failures either in the same place (in the case of executable module corruption) or failures due to different reasons (in the case of page file corruption). Besides, I never heard that the surface of the hard drive corrupted only for one bit ?
3. Motherboard corruption
Well, can’t say anything about it. Perhaps there is a mechanism that could lead to such problems, but I know nothing about it.
4. CPU
The most likely candidate in my opinion. In the case the error is somewhere in instruction pointer increasing mechanism or somehow related to thread context restoring it may well be.
Yes, I know. If you hear “somewhere” and “somehow” too often there is “something” wrong :)
Thus, I needed evidence. I decided to replace the hardware components one by one (fortunately we have absolutely identical computers on the test bench). Replacement of RAM didn’t lead to any result. As well as replacement of HD. Unfortunately we couldn’t replace motherboard and CPU separately (because of our industrial PCs architecture), so we replaced them together.
And from this moment we hadn’t received any crashes at all!!!
Turns out we spent a lot of time and efforts to find the bug that was not in our (or some one else) code but “stood on the table” all this time!
No comments:
Post a Comment