Diary of war with bugs<br />
This blog is about debugging applications on the Windows platform. Almost every day I deal with tasks related to debugging. Some of them are challenging, some are instructive, and some are just funny. In this blog I will write about the most challenging, instructive or funny of them. I hope you enjoy it!<br />
<br />
Invalid Bus Driver (2010-12-23)<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_Yv7YTiqUCn4/TROCvIBMxtI/AAAAAAAAAA4/BV3H6h-klZ0/s1600/busdriver.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="150" src="http://2.bp.blogspot.com/_Yv7YTiqUCn4/TROCvIBMxtI/AAAAAAAAAA4/BV3H6h-klZ0/s200/busdriver.jpeg" width="200" /></a></div><br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Several weeks ago we encountered a very interesting crash in one of our product’s processes. After analyzing the dump we found that exceptions had been raised in two threads simultaneously.<br />
<br />
<br />
The first one brought up the debugger:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0355f0f0 7c3627e4 ntdll!RtlAllocateHeap+0x655<br />
<span style="color: red;">cmp edi,dword ptr [eax+4] ds:0023:8851e6cc=????????</span><br />
0355f130 7c36280c msvcr71!_heap_alloc+0xe0<br />
0355f138 7c362829 msvcr71!_nh_malloc+0x10<br />
0355f144 7c3eb633 msvcr71!malloc+0xf<br />
0355f154 7c3c1f0e msvcp71!operator new+0x21<br />
0355f9cc 7c3c4f9e msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::_Copy+0x73<br />
0355f9e0 7c3c55df msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::_Grow+0x22<br />
0355f9fc 7c3c6752 msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::assign+0x4e<br />
0355fa10 00595239 msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;+0x20<br />
0355fc4c 00595387 SCServer_dll71!_ConvertBinsToChars+0xb9<br />
0355fdac 007335ea SCServer_dll71!TReadResourceHandler::HandleEvent+0x127<br />
0355ff74 10046453 SCLib71!TReactor::Thread+0xda<br />
0355ff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23<br />
0355ffb4 7c80b50b msvcr71!_threadstartex+0x6f<br />
0355ffec 00000000 kernel32!BaseThreadStart+0x37</td></tr>
</tbody></table><br />
<br />
And the second one was pending:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0012e45c 7c90e9ab ntdll!KiFastSystemCallRet<br />
0012e460 7c8633d5 ntdll!ZwWaitForMultipleObjects+0xc<br />
0012e7a0 7c36e289 kernel32!UnhandledExceptionFilter+0x82d<br />
0012e7bc 0040c860 msvcr71!_XcptFilter+0x15f<br />
0012e7c8 7c363943 SCServer71!WinMainCRTStartup+0x1d7<br />
0012e7f0 7c9037bf msvcr71!_except_handler3+0x61<br />
0012e814 7c90378b ntdll!ExecuteHandler2+0x26<br />
0012e8c4 7c90eafa ntdll!ExecuteHandler+0x24<br />
0012e8c4 00409c1c ntdll!KiUserExceptionDispatcher+0xe<br />
0012ebc4 7c1adc5b SCServer71!CMonitoringDlg::OnTimer+0x1c<br />
<span style="color: red;">call dword ptr [edx+8] ds:0023:00000008=????????</span> <br />
0012ec54 7c1a9f01 mfc71!CWnd::OnWndMsg+0x46b<br />
0012ec74 00422d16 mfc71!CWnd::WindowProc+0x22<br />
[…]<br />
0012ef74 77d487eb user32!InternalCallWinProc+0x28<br />
0012efdc 77d489a5 user32!UserCallWinProcCheckWow+0x150<br />
0012f03c 77d4bccc user32!DispatchMessageWorker+0x306<br />
0012f04c 7c1b1645 user32!DispatchMessageA+0xf<br />
0012f05c 7c1ab833 mfc71!AfxInternalPumpMessage+0x3e<br />
0012f080 7c1aeeed mfc71!CWnd::RunModalLoop+0xca<br />
0012f0bc 00424726 mfc71!CDialog::DoModal+0xf3<br />
0012f0f0 0040142b NSGuiCtl10!CNSGDialog::DoModal+0xc6<br />
0012ff08 7c1ae5d0 SCServer71!CScServerApp::InitInstance+0x9b<br />
0012ff18 0040c80e mfc71!AfxWinMain+0x47<br />
0012ffc0 7c816d4f SCServer71!WinMainCRTStartup+0x185<br />
0012fff0 00000000 kernel32!BaseProcessStart+0x23</td></tr>
</tbody></table><br />
<br />
<a name='more'></a><br />
<br />
In both cases the heap was corrupted. Analyzing the contents of the corrupted memory blocks, we found that in the second case the block was filled with zeros, and in the first case the picture was even more interesting: we found a string not far from the corrupted address inside the heap structure.<br />
<br />
It looked like the following:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0:000> du 033beea0<br />
033beea0 "\Device\HarddiskVolume1\Program " <br />
033beee0 "Files\VirCOM\VSPort.exe"</td></tr>
</tbody></table><br />
<br />
This string could hardly be a part of our process. Obviously, it was the path to the VSPort.exe image expressed in terms of kernel devices. It is just as obvious that a user mode application has no reason to generate a path in that form, which refers to a kernel device object. Moreover, the path itself was well known: it was the path to the COM-port emulator application supplied by the vendor of the computer on which the crashes had occurred.<br />
<br />
At the same time we found out that the size of the data we received from the serial port through the GetOverlappedResult function was much greater than the size of the buffer we sent to the ReadFile function.<br />
<br />
All of these facts clearly pointed to the COM-port emulator driver. The only thing we had to do to prove this hypothesis was to catch the buffer overrun. <br />
<br />
The first thing we did was to set a user mode processor breakpoint on data access. We set it just after the last byte of the buffer… and caught nothing. The only scenario in which this could happen was the buffer being overrun from kernel mode. We increased the buffer size and placed a guard area right behind the buffer to make sure that our hypothesis was right and the buffer had been overrun. After receiving a wrong data size from GetOverlappedResult, we checked the guard and found it completely destroyed.<br />
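<br />
The guard check can be sketched roughly as follows. This is only an illustration in Python, not the code we actually used; the fill pattern, the guard size and all function names are hypothetical.<br />
<br />
```python
GUARD_BYTE = 0xA5   # hypothetical fill pattern for the guard area
GUARD_SIZE = 64     # hypothetical guard length in bytes


def make_guarded_buffer(payload_size):
    """Allocate payload_size bytes followed by a guard area with a known pattern."""
    return bytearray(payload_size) + bytearray([GUARD_BYTE] * GUARD_SIZE)


def guard_is_intact(buf, payload_size):
    """Return True if nothing has written past the payload area."""
    guard = buf[payload_size:payload_size + GUARD_SIZE]
    return guard == bytearray([GUARD_BYTE] * GUARD_SIZE)


def careless_write(buf, data):
    """Simulate a writer that ignores the payload size and copies all it has."""
    buf[0:len(data)] = data
```
<br />
A writer that stays within the payload leaves the pattern untouched, while anything longer damages the guard, which is what we observed after the bogus size came back from GetOverlappedResult.<br />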
<br />
In fact, this information was enough to conclude that the source of the problem was the COM-port emulator driver (along with the VSPort.exe utility). We simply switched to another emulator and made sure that the problem was gone. But I wondered how it could happen that a driver destroys data in our user memory space.<br />
<br />
Let’s recall the relationships between the user application, the I/O manager and the device driver.<br />
<br />
When a user application calls the ReadFile function, the I/O manager (which ultimately handles this request) does the following:<br />
<ol><li>Creates the IRP (I/O request packet)</li>
<li>Allocates a buffer in kernel memory space and attaches it to the IRP</li>
<li>Finds the read routine in the device driver object and passes a pointer to the IRP to the read routine </li>
</ol><br />
When the device driver processes the request, it in turn:<br />
<ol><li> Copies data to the buffer created by the I/O manager</li>
<li>Calls the IoCompleteRequest routine, returning the IRP to the I/O manager</li>
</ol><br />
After the I/O manager receives the IRP back, it has to switch to the context of the process which requested the read operation (we are interested in asynchronous operations only, since our code requests asynchronous operations). It accomplishes this by queuing an APC to the thread which requested the I/O. When the APC routine executes, it checks the status code returned by the driver, and if the code doesn’t indicate an error, it copies the data from the kernel buffer to the user buffer and sets the I/O event.<br />
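<br />
The completion step can be modelled with a few lines of Python. This is a toy model and not the real I/O manager code; the function name and parameters are hypothetical, and I assume that the copy length is the byte count reported by the driver and that only statuses with both top bits set count as errors.<br />
<br />
```python
def complete_buffered_read(kernel_memory, system_offset,
                           user_memory, user_offset,
                           status, information):
    """Toy model of the completion APC for a buffered read.

    Assumption: the number of bytes copied is the count reported by the
    driver (information), not the length the caller originally asked for.
    """
    # Assumption: statuses with both top bits set are errors, nothing is copied.
    if (status >> 30) == 3:
        return 0
    src = kernel_memory[system_offset:system_offset + information]
    user_memory[user_offset:user_offset + len(src)] = src
    return len(src)
```
<br />
If the driver reports more bytes than the buffer holds, the copy runs past the end of the user buffer and drags whatever lies behind the kernel buffer along with it. That would be consistent with a kernel-side string turning up in our heap.<br />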
<br />
So, according to our hypothesis, the I/O manager’s APC routine receives a wrong data size from the driver and overruns our buffer. This time I set the data breakpoint just behind the buffer in kernel mode. <br />
<br />
It fired on the following instruction:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>f677991c 804febb5 820de048 f6779968 f677995c nt!IopCompleteRequest+0x92 <br />
f677996c 80502b35 00000000 00000000 00000000 nt!KiDeliverApc+0xb3 </td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>804f48d0 8b4b1c mov ecx,dword ptr [ebx+1Ch]<br />
804f48d3 8b730c mov esi,dword ptr [ebx+0Ch]<br />
804f48d6 8b7b3c mov edi,dword ptr [ebx+3Ch]<br />
804f48d9 8bc1 mov eax,ecx<br />
804f48db c1e902 shr ecx,2<br />
<span style="color: red;">804f48de f3a5 rep movs dword ptr es:[edi],dword ptr [esi]</span> </td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0: kd> ? @eax<br />
Evaluate expression: 10051637 = 00996035 </td></tr>
</tbody></table><br />
<br />
The value of the eax register looked very strange, considering that the buffer we had passed to the I/O manager was 4000h (that is, 16K) bytes long.<br />
<br />
Let’s check the data inside the IRP:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0: kd> dt @ebx _IRP<br />
ntdll!_IRP<br />
+0x000 Type : 0n6<br />
+0x002 Size : 0x190<br />
+0x004 MdlAddress : (null) <br />
+0x008 Flags : 0x970<br />
+0x00c AssociatedIrp : __unnamed<br />
+0x010 ThreadListEntry : _LIST_ENTRY [ 0x82303230 - 0x81522e80 ] <br />
+0x018 IoStatus : _IO_STATUS_BLOCK<br />
+0x020 RequestorMode : 1 ''<br />
+0x021 PendingReturned : 0x1 ''<br />
+0x022 StackCount : 2 ''<br />
+0x023 CurrentLocation : 4 ''<br />
+0x024 Cancel : 0 ''<br />
+0x025 CancelIrql : 0 ''<br />
+0x026 ApcEnvironment : 0 ''<br />
+0x027 AllocationFlags : 0xc ''<br />
+0x028 UserIosb : 0x00d291a8 _IO_STATUS_BLOCK<br />
+0x02c UserEvent : 0x81ec5328 _KEVENT<br />
+0x030 Overlay : __unnamed<br />
+0x038 CancelRoutine : (null) <br />
+0x03c UserBuffer : 0x02b7c4e0 Void<br />
+0x040 Tail : __unnamed<br />
<br />
0: kd> dt @ebx+18 _IO_STATUS_BLOCK<br />
ntdll!_IO_STATUS_BLOCK<br />
<span style="color: red;">+0x000 Status : 0n258</span><br />
+0x000 Pointer : 0x00000102 Void<br />
<span style="color: red;">+0x004 Information : 0x996035</span></td></tr>
</tbody></table><br />
<br />
There are two things to note:<br />
<ol><li>The status code is 102h (which means <span style="font-family: "Courier New",Courier,monospace;">STATUS_TIMEOUT</span>)</li>
<li>The data size of 996035h bytes can’t be true</li>
</ol><br />
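Status 102h (STATUS_TIMEOUT in ntstatus.h) is worth a closer look, because the two top bits of an NTSTATUS value encode its severity. The small Python sketch below only illustrates that encoding; the function name is mine.<br />
<br />
```python
SEVERITY_NAMES = ("success", "informational", "warning", "error")


def ntstatus_severity(status):
    """Return the severity class encoded in the two top bits of an NTSTATUS."""
    return SEVERITY_NAMES[(status % 2**32) >> 30]
```
<br />
For 102h this gives "success", so as far as a check of the kind "the code doesn’t indicate an error" is concerned, the timed-out read looks like a normal completion and the bogus data size gets used.<br />
<br />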
So, the task was to find out where these values came from. And it was time to debug the driver. I had no symbols for it, so it promised to be a fun task.<br />
<br />
One way to get to the place where the data size is calculated is to find the IRPs which have not been processed yet (!irpfind) and set a data breakpoint on the field where the data size should be stored.<br />
<br />
Here we are:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>F8ac5fa0 8050165f vserial+0x8bdf<br />
f8ac5fcc 80544e5f nt!KiTimerExpiration+0xb1<br />
f8ac5ff4 805449cb nt!KiRetireDpcList+0x61<br />
f8ac5ff8 f5b508e8 nt!KiDispatchInterrupt+0x2b </td></tr>
</tbody></table><br />
<br />
The data size is calculated by the following instructions:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>f6301bcf 2b90c0000000 sub edx,dword ptr [eax+0C0h] <br />
f6301bd5 8bb0bc000000 mov esi,dword ptr [eax+0BCh]<br />
<span style="color: red;">f6301bdb 8d5432ff lea edx,[edx+esi-1]</span><br />
f6301bdf 89511c mov dword ptr [ecx+1Ch],edx</td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0: kd> ? @edx<br />
Evaluate expression: 2125373441 = 7eaea001<br />
0: kd> ? @esi<br />
Evaluate expression: -2125373440 = 81516000 </td></tr>
</tbody></table><br />
<br />
The <span style="font-family: "Courier New",Courier,monospace;">esi </span>and <span style="font-family: "Courier New",Courier,monospace;">edx </span>registers should contain the carriage address (the current position inside the buffer) and the negated pointer to the IRP buffer respectively (negated because, when the first instruction executes, the edx register holds the size of the IRP buffer (4000h bytes) and the pointer is subtracted from it).<br />
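<br />
The arithmetic of these instructions can be replayed in Python with 32-bit wraparound. This is a sketch under assumptions: the register dump is taken to show the state at the highlighted lea instruction, and the buffer end value 81519FFFh used in the check is derived from that dump rather than read from the driver.<br />
<br />
```python
def data_size(buffer_size, buffer_end, carriage):
    """Replay the size calculation with 32-bit wraparound.

    buffer_size: value of edx before the first instruction (4000h here)
    buffer_end:  dword at DevEx+C0h
    carriage:    dword at DevEx+BCh
    """
    edx = (buffer_size - buffer_end) % 2**32   # sub edx, dword ptr [eax+0C0h]
    esi = carriage                             # mov esi, dword ptr [eax+0BCh]
    return (edx + esi - 1) % 2**32             # lea edx, [edx+esi-1]
```
<br />
With edx = 7eaea001 and esi = 81516000 the result is zero: the carriage sits at the very start of the buffer, so this particular sample is a healthy one.<br />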
<br />
The source of the problem was probably related to these addresses. To check this, I attached a trace print to this instruction which prints the data carriage pointer, the base address of the buffer and the resulting size of the data in the buffer, and stops execution if the data size exceeds the user buffer size (that is, 16K).<br />
<br />
The breakpoint fired on the following:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>[…]<br />
[811f5000,811f5035] - 35<br />
[8182b000,8182b000] - 0<br />
[822ea000,822ea000] - 0<br />
[811ef000,811ef035] - 35<br />
[8182b000,8182b000] - 0<br />
[822ea000,822ea000] - 0<br />
[811f5000,811f5035] - 35<br />
[8182b000,8182b000] - 0<br />
[822ea000,822ea000] - 0<br />
[811ef000,811ef035] - 35<br />
[8182b000,8182b000] - 0<br />
[822ea000,822ea000] - 0<br />
[811f5000,811f5035] - 35<br />
[8182b000,8182b000] - 0<br />
[822ea000,822ea000] - 0<br />
[811ef000,811ef035] - 35<br />
[8182b000,8182b000] - 0<br />
[822ea000,822ea000] - 0<br />
[811f5000,<span style="color: red;">820b7035</span>] - <span style="color: red;">ec2035</span> </td></tr>
</tbody></table><br />
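<br />
A trace like this one is easy to check mechanically. The Python sketch below assumes each line has the form "[base,carriage] - size" with all values in hexadecimal, which is how I read the output above; the function names are hypothetical.<br />
<br />
```python
def parse_trace_line(line):
    """Parse '[base,carriage] - size' (hexadecimal) into three integers."""
    pointers, size = line.split(" - ")
    base, carriage = pointers.strip().strip("[]").split(",")
    return int(base, 16), int(carriage, 16), int(size, 16)


def carriage_in_buffer(line, buffer_size=0x4000):
    """Return True if the carriage lies inside the buffer that starts at base."""
    base, carriage, _size = parse_trace_line(line)
    return base <= carriage <= base + buffer_size
```
<br />
Every line except the last passes the check. In the last one the reported size is still exactly carriage minus base (820b7035 - 811f5000 = ec2035), which suggests that the arithmetic itself is fine and one of the two pointers belongs to a different buffer.<br />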
<br />
Now I had to find the source of the pointer values and the reason why the carriage pointer doesn’t correspond to its buffer base address. But before we continue, let’s get some more information about the driver and its structure, since it will help us later.<br />
<br />
<br />
Knowing the IRP we can get the name of device this IRP was sent to:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0: kd> !irp 820de048-40<br />
Irp is active with 2 stacks 4 is current (= 00000000)<br />
No Mdl: System buffer=81510000: Thread 82303054: Irp is completed. Pending has been returned<br />
cmd flg cl Device File Completion-Context<br />
[ 0, 0] 0 0 00000000 00000000 00000000-00000000 <br />
<br />
Args: 00000000 00000000 00000000 00000000<br />
[ 3, 0] 0 0 82211950 00000000 00000000-00000000 <br />
*** ERROR: Module load completed but symbols could not be loaded for vserial.sys<br />
<div style="color: red;">\Driver\vserial</div>Args: 00000000 00000000 00000000 00000000<br />
<br />
0: kd> !devobj 82211950<br />
Device object (82211950) is for:<br />
VSerial0 \Driver\vserial DriverObject 82248970<br />
Current Irp 00000000 RefCount 1 Type 0000001b Flags 0000204c<br />
Dacl e1401cac DevExt 82211a08 DevObjExt 82211fd0 <br />
ExtensionFlags (0000000000) <br />
AttachedTo (Lower) 81eb1760*** ERROR: Module load completed but symbols could not be loaded for vsb.sys<br />
\Driver\vsbus<br />
Device queue is not busy.</td></tr>
</tbody></table><br />
<br />
Knowing the device name, we can get the driver object. More precisely, we are interested in its table of dispatch routines:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>1: kd> !drvobj vserial 7<br />
Driver object (820c27e0) is for:<br />
\Driver\vserial<br />
Driver Extension List: (id , addr)<br />
<br />
Device Object list:<br />
820b5950 821ed040 82182950 8218f880<br />
821c3950 821c8628 <br />
<br />
DriverEntry: f5e8e442 vserial<br />
DriverStartIo: 00000000 <br />
DriverUnload: f5e87270 vserial<br />
AddDevice: f5e87cf0 vserial<br />
<br />
Dispatch routines:<br />
<span style="color: red;">[00] IRP_MJ_CREATE f5e87680 vserial+0x3680</span><br />
[01] IRP_MJ_CREATE_NAMED_PIPE 804f4282 nt!IopInvalidDeviceRequest <br />
[02] IRP_MJ_CLOSE f5e8c740 vserial+0x8740<br />
<span style="color: red;">[03] IRP_MJ_READ f5e8d420 vserial+0x9420</span><br />
<span style="color: red;">[04] IRP_MJ_WRITE f5e8e310 vserial+0xa310</span><br />
[05] IRP_MJ_QUERY_INFORMATION f5e89a40 vserial+0x5a40<br />
[06] IRP_MJ_SET_INFORMATION f5e89b00 vserial+0x5b00<br />
[07] IRP_MJ_QUERY_EA 804f4282 nt!IopInvalidDeviceRequest<br />
[…]</td></tr>
</tbody></table><br />
Now that we know the addresses of the key driver routines, we are ready for further investigation.<br />
<br />
<br />
Let’s return to the data size calculation and look closely at the marked data fields:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>f6301bcf 2b90c0000000 sub edx,dword ptr [<span style="color: red;">eax+0C0h</span>]<br />
f6301bd5 8bb0bc000000 mov esi,dword ptr [<span style="color: red;">eax+0BCh</span>] <br />
f6301bdb 8d5432ff lea edx,[edx+esi-1]<br />
f6301bdf 89511c mov dword ptr [ecx+1Ch],edx</td></tr>
</tbody></table><br />
<br />
The <span style="font-family: "Courier New",Courier,monospace;">eax </span>register contains the address of the device extension, which is created for each instance of the serial port device (let’s call this address <span style="font-family: "Courier New",Courier,monospace;">DevEx </span>for short):<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>nt!_DEVICE_OBJECT<br />
+0x000 Type : 0n3<br />
+0x002 Size : 0x680<br />
+0x004 ReferenceCount : 0n1<br />
+0x008 DriverObject : 0x82144f38 _DRIVER_OBJECT<br />
+0x00c NextDevice : 0x8218f490 _DEVICE_OBJECT<br />
+0x010 AttachedDevice : (null) <br />
+0x014 CurrentIrp : (null) <br />
+0x018 Timer : (null) <br />
+0x01c Flags : 0x204c<br />
+0x020 Characteristics : 0x100<br />
+0x024 Vpb : (null) <br />
<span style="color: red;">+0x028 DeviceExtension : 0x82028a08 Void</span><br />
+0x02c DeviceType : 0x1b<br />
+0x030 StackSize : 2 ''<br />
+0x034 Queue : __unnamed<br />
+0x05c AlignmentRequirement : 0<br />
+0x060 DeviceQueue : _KDEVICE_QUEUE<br />
+0x074 Dpc : _KDPC<br />
+0x094 ActiveThreadCount : 0<br />
+0x098 SecurityDescriptor : 0xe13f8168 Void<br />
+0x09c DeviceLock : _KEVENT<br />
+0x0ac SectorSize : 0<br />
+0x0ae Spare1 : 0<br />
+0x0b0 DeviceObjectExtension : 0x82028fd0 _DEVOBJ_EXTENSION <br />
+0x0b4 Reserved : (null)</td></tr>
</tbody></table><br />
<br />
By setting breakpoints on these memory addresses we can easily find the instructions which fill in the fields.<br />
<br />
Here is one such instruction sequence:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>f5fed3ae 8b501c mov edx,dword ptr [eax+1Ch]<br />
f5fed3b1 03d1 add edx,ecx<br />
<span style="color: red;">f5fed3b3 8996bc000000 mov dword ptr [esi+0BCh],edx</span><br />
ds:0023:81ff3354=8121c000<br />
f5fed3b9 8b4860 mov ecx,dword ptr [eax+60h]<br />
f5fed3bc 8b5104 mov edx,dword ptr [ecx+4]<br />
f5fed3bf 8b8eb4000000 mov ecx,dword ptr [esi+0B4h]<br />
f5fed3c5 8d540aff lea edx,[edx+ecx-1]<br />
<span style="color: red;">f5fed3c9 8996c0000000 mov dword ptr [esi+0C0h],edx </span></td></tr>
</tbody></table><br />
<br />
This instruction sequence executes as part of the read routine of the device object (this is where knowing the addresses of the driver routines comes in handy). The driver fills in these fields after the data from the port buffer has been copied to the buffer attached to the IRP.<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">DevEx+BCh</span> now contains the address of the IRP buffer carriage (address of the buffer + number of bytes copied) and <span style="font-family: "Courier New",Courier,monospace;">DevEx+C0h</span> now points to the end of the IRP buffer (address of the buffer + size of the buffer (4000h bytes) - 1, judging by the lea instruction above).<br />
<br />
<br />
Another instruction sequence is the following:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>f5fec106 40 inc eax<br />
f5fec107 5f pop edi<br />
f5fec108 8986bc000000 mov dword ptr [esi+0BCh],eax ds:0023:82028ac4=81fe102e<br />
f5fec10e 5e pop esi<br />
f5fec10f c20800 ret 8</td></tr>
</tbody></table><br />
<br />
This instruction sequence executes as part of the write completion routine, which runs at DPC IRQL (DISPATCH_LEVEL). Since the read routine runs at PASSIVE_LEVEL, the write completion routine can interrupt the read routine. Moreover, there is no synchronization to keep this data in a consistent state.<br />
<br />
The write completion routine copies data to the port buffer byte by byte and uses the <span style="font-family: "Courier New",Courier,monospace;">DevEx+BCh </span>data field to save the pointer to the carriage of the port buffer, which is a completely different buffer from the one <span style="font-family: "Courier New",Courier,monospace;">DevEx+C0h</span> points to. Thus, if the write completion routine happens to interrupt the read routine, it can change the value of <span style="font-family: "Courier New",Courier,monospace;">DevEx+BCh</span> to an address inside another buffer, which subsequently leads to a crash of the client process.<br />
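<br />
The effect of sharing one field between the two routines can be reproduced with a deterministic toy model in Python. Everything here is a simplification: the class and function names are hypothetical, and the pointer values are borrowed from the trace earlier in the article purely for illustration.<br />
<br />
```python
class DeviceExtension:
    """Toy stand-in for the per-device state, using offsets from the disassembly."""

    def __init__(self):
        self.field_bc = 0   # DevEx+BCh: the shared carriage pointer
        self.field_c0 = 0   # DevEx+C0h: last byte of the IRP buffer


def read_routine_store(dev, irp_buffer, bytes_copied, buffer_size):
    """Read routine: remember the IRP buffer carriage and the buffer end."""
    dev.field_bc = irp_buffer + bytes_copied
    dev.field_c0 = irp_buffer + buffer_size - 1


def write_completion_store(dev, port_buffer_carriage):
    """Write completion routine: stores another carriage in the same field."""
    dev.field_bc = port_buffer_carriage


def timer_compute_size(dev, buffer_size):
    """Size calculation from the timer routine, with 32-bit wraparound."""
    return (buffer_size - dev.field_c0 + dev.field_bc - 1) % 2**32
```
<br />
If the timer routine runs right after the read routine, the size comes out as 35h. If the write completion routine slips in between and stores a pointer such as 820b7035, the same calculation yields ec2035h.<br />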
<br />
To be fair, I have to say that this wasn’t the first time I had met the vserial driver. A number of utilities use this driver with no problems observed. But when I checked how the driver works with several of those utilities, I found that they use the driver in a completely different way than the VSPort utility does, and the driver code discussed in this article is not executed.<br />
<br />
Also, it seems that the bug is the consequence of a simple slip. Logically, the two pointers for the read and write routines should just be saved in different data fields.<br />
<br />
Grabbing the dumps from the customer's machine (2010-10-13)<br />
<br />
Occasionally I have to help a customer get a dump of a crashed process. My company develops software for marine navigation systems, and as you can guess there are some problems with the internet connection on the open sea. In fact, most vessels have internet access aboard, but it’s very slow due to the high price of satellite connections. In addition, on most vessels you can’t connect directly to the bridge environment due to security and technical restrictions. So the only means of communication that remains is e-mail.<br />
<br />
To save my time and the customers’ peace of mind, I wrote a trivial batch script which reduces the amount of explanation to a minimum:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>@echo off<br />
<br />
<span style="color: #38761d;">rem The processes of interest</span><br />
set PROCESSES=(IBSSvc TBService scserver71)<br />
<br />
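<span style="color: #38761d;">rem Make sure the output folder exists</span><br />
if not exist dumps mkdir dumps<br />
<br />
<span style="color: #38761d;">rem List all processes in the system</span><br />
tlist /v > dumps/processes.txt<br />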
<br />
<span style="color: #38761d;">rem Grab all dumps and debugger output for the processes of interest</span><br />
for %%i in %PROCESSES% do cdb -pv -pn "%%i.exe" -logo dumps/%%i.txt -c ".dump /ma dumps/%%i.dmp;q"</td></tr>
</tbody></table><br />
<br />
All the tools needed to run the script can be found in the “Debugging Tools for Windows” suite.<br />
<br />
Here is the list of modules:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>* cdb.exe<br />
* dbgeng.dll<br />
* dbghelp.dll <br />
* ext.dll<br />
* tlist.exe<br />
* uext.dll</td></tr>
</tbody></table><br />
After that, I just pack all that stuff into a self-extracting archive and send it to the customer. After the customer unpacks the archive and runs the script, he sends the contents of the “dumps” folder back to me and I can analyze them.<br />
<br />
You can modify the script any way you like and use it where appropriate.<br />
<br />
Believe not all that you see (2010-10-04)<br />
<br />
Last week there was a very interesting story that may be useful for debugger users. While debugging an application (using WinDbg) we ran into a consistently reproducible crash. The source of the crash was found easily, but it confused us completely. The problem was in a single processor instruction: the instruction <span style="color: purple; font-family: "Courier New",Courier,monospace;">add eax,10h</span> ended up subtracting <span style="color: purple; font-family: "Courier New",Courier,monospace;">34h </span>from the <span style="color: purple; font-family: "Courier New",Courier,monospace;">eax </span>register.<br />
<br />
We checked it several times, but the result was always the same. After that, we made several experiments with the following results:<br />
<br />
1. Whatever we put in place of <span style="color: purple; font-family: "Courier New",Courier,monospace;">10h</span>, the result was the subtraction of <span style="color: purple; font-family: "Courier New",Courier,monospace;">34h </span>from the eax register.<br />
<br />
2. After we moved this instruction three bytes above or below its current address (just filling the gap where the instruction had been with nops), the effect disappeared immediately. But when we put it back in its place, the effect returned as well.<br />
<br />
After puzzling over it for some time, I looked at the breakpoint list and noticed that there was a breakpoint set at the faulting instruction. But the instruction wasn’t highlighted in the debugger. When I checked the address of the breakpoint, I found that the breakpoint was set not at the first byte of the instruction (as it should be) but at the third byte, that is, in place of the <span style="color: purple; font-family: "Courier New",Courier,monospace;">10h </span>operand. Of course this “breakpoint” didn’t act like an actual breakpoint; the processor just executed the <span style="color: purple; font-family: "Courier New",Courier,monospace;">add eax,CCh </span>instruction instead, and since the one-byte operand CCh is sign-extended, that amounts to subtracting 34h. And because the debugger hides all its breakpoints before it passes control to the user, we always saw the <span style="color: purple; font-family: "Courier New",Courier,monospace;">add eax,10h</span> instruction instead of the real one.<br />
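<br />
The numbers are easy to verify. The short three-byte encoding of add eax,10h is 83 C0 10, where the last byte is an 8-bit immediate that the processor sign-extends. I assume this encoding here because the breakpoint landed on the third byte, exactly where the operand is. The Python sketch below decodes only this one instruction form; the function names are mine.<br />
<br />
```python
def sign_extend_imm8(value):
    """Interpret an 8-bit immediate as a signed number."""
    return value - 0x100 if value >= 0x80 else value


def decode_add_eax_imm8(code):
    """Decode the 3-byte form 83 C0 ib (add eax, imm8); return the signed immediate."""
    opcode, modrm, imm8 = code
    if opcode != 0x83 or modrm != 0xC0:
        raise ValueError("not an 'add eax, imm8' instruction")
    return sign_extend_imm8(imm8)


def apply_add(eax, code):
    """Return eax after executing the instruction, with 32-bit wraparound."""
    return (eax + decode_add_eax_imm8(code)) % 2**32
```
<br />
With the int 3 opcode (CCh) written over the operand, the instruction bytes become 83 C0 CC, and the decoded immediate is -34h, which matches what we saw.<br />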
<br />
So, the obvious question is: why did the debugger set this breakpoint?<br />
<br />
And the answer is simple. The unresolved breakpoint had been set during another debugging session of this process, and since the process binaries and symbols had changed since then, the debugger worked out the wrong address for the breakpoint (in fact, a completely wrong one :)). And because the debugger still treated this breakpoint as unresolved, we kept running into this crash over and over again after each restart of the process.<br />
<br />
Congratulate me, I’m the winner! (2010-09-18)<br />
<br />
<a href="http://1.bp.blogspot.com/_Yv7YTiqUCn4/TJT0LNT6geI/AAAAAAAAAAM/Ij6nFu2kMZI/s1600/DBG_DebugAwards.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="http://1.bp.blogspot.com/_Yv7YTiqUCn4/TJT0LNT6geI/AAAAAAAAAAM/Ij6nFu2kMZI/s200/DBG_DebugAwards.jpg" width="177" /></a>Several articles from this blog took part in the <a href="http://www.dumpanalysis.org/debugging-story-annual-competition">Tell Your Windows Debugging Story Annual Competition</a>. A few days ago Dmitry Vostokov announced <a href="http://www.dumpanalysis.org/debugging-competition-2010">the winners of the competition</a>. I’m glad to tell you that this blog is among them!<br />
<br />
The way to find a hotspot (2010-09-18)<br />
<br />
Today I want to describe a way to find a hotspot without profilers. I often use this approach and find it effective and simple enough. The only tools you need are a Windows debugger (I prefer WinDbg, but you can use your favorite one), which should always be at hand if you decide to investigate a performance issue (or any other problem with your software), and a tool like Process Explorer, which is available for free from Windows Sysinternals. 
You can use this method without a profiler at all, or in conjunction with a profiler to clarify the results of the profiler’s work.<br />
<br />
Before explaining the approach, I want to define the term “hotspot”. There are different definitions which apply in different contexts. In this article we consider a hotspot to be a limited set of instructions whose execution takes significant CPU time, or more CPU time than you expect from it.<br />
<br />
The process is divided into three steps. If you have used a profiler and have a call graph, you can skip the first two steps and jump to step three to verify the profiler’s result and locate the hotspot more precisely. Nevertheless, I recommend going through all the steps, because the results returned by profilers can sometimes be very obscure, and you will lose time wandering around a call graph or other related information. Even if your profiler was absolutely right, you’ll have one more confirmation of its result. Sometimes it’s better to check twice :)<br />
<br />
<a name='more'></a><br />
<br />
<b>1. Determine the thread which spends significant CPU time</b><br />
<br />
This is the easiest step because you don’t have to do anything by hand: you can just rely on the results of one of the tools which show CPU usage for each thread in your process. There are a number of good tools. I prefer to use Process Explorer from Sysinternals because it’s free, it’s powerful and it can be useful in step two (but again, you can use your favorite).<br />
<br />
<br />
<b>2. Determine the call stack relevant to the hotspot</b><br />
<br />
Once you have found the thread which loads the CPU beyond your expectations, you can find the call stack relevant to your hotspot. The idea is to capture the call stack of the thread several times (the more, the better). If the thread loads the CPU heavily, it must spend most of its time executing instructions which are part of the hotspot. In fact, profilers use this method as well. It’s known as sampling, and it’s the most common method of hotspot analysis when you can’t use more advanced instrumentation methods for some reason. Don’t forget to set up your symbol paths, otherwise the call stacks you get could be far from reality.<br />
<br />
Once you have some call stacks, analyze them and choose the one which occurs most often and/or which, in your opinion, is most likely to be relevant to the hotspot. <br />
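<br />
If you collect more than a handful of samples, a few lines of script can do the counting for you. The Python sketch below assumes each sample is already available as a list of function names, innermost frame first; how you get them out of the debugger is up to you, and the function names here are hypothetical.<br />
<br />
```python
from collections import Counter


def most_common_stack(samples):
    """Return the call stack seen most often, together with its count."""
    return Counter(tuple(stack) for stack in samples).most_common(1)[0]


def frame_hit_counts(samples):
    """Count in how many samples each function appears at least once."""
    counts = Counter()
    for stack in samples:
        counts.update(set(stack))
    return counts
```
<br />
The first function picks the candidate stack, and the second one shows which functions are present in nearly every sample, which can be a useful hint when the stacks differ only in their innermost frames.<br />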
<br />
<br />
<b>3. Find out the hotspot location</b><br />
<br />
Once you have a suspect, it’s time to find the location of the hotspot and prove your assumption. This is the trickiest part, so I’ll describe it in detail.<br />
<br />
Imagine we have the following call stack:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>ChildEBP RetAddr <br />
0012f588 0041349f Hotspot!write_string+0x33<br />
0012f85c 00411e50 Hotspot!_output+0xc6f<br />
0012f8a0 00411ca3 Hotspot!sprintf+0x90<br />
0012fab8 00411d43 Hotspot!Func1+0x93<br />
0012fb8c 00411d93 Hotspot!Func2+0x23<br />
0012fc60 00411ed3 Hotspot!Func3+0x23<br />
0012fd34 00412383 Hotspot!Func4+0x23<br />
0012fe08 004123c3 Hotspot!Func5+0x23<br />
0012fedc 004122e0 Hotspot!main+0x23<br />
0012ffc0 7c816fe7 Hotspot!mainCRTStartup+0x170<br />
0012fff0 00000000 kernel32!BaseProcessStart+0x23</td></tr>
</tbody></table><br />
<br />
Since we have to find the set of instructions which occupies too much CPU time, our first task is to figure out where these instructions execute relative to the call stack. More precisely, we should find the function which is the initial point of the hotspot. That is, if that function no longer executes, our thread will no longer spend too much CPU time.<br />
<br />
The way to find this function is very straightforward: we exclude one call after another from the execution path and check the CPU load. We can start from the top or from the bottom of the call stack, or, if the call stack is too long and we have no idea where the hotspot can be, we can use a more advanced technique to reduce the amount of work. Generally it depends on the situation, and you should use your knowledge of the application and your intuition to choose the better approach.<br />
<br />
In this case I prefer to start from the middle of the call stack and exclude all instructions starting from the function <span style="font-family: "Courier New",Courier,monospace;">Hotspot!Func3</span>. For this purpose I’ll change the function code so that the function executes no instructions and just returns.<br />
<br />
To do this we need to find the function start address and the stack size occupied by the function parameters (the latter is not needed for the <span style="font-family: "Courier New",Courier,monospace;">__cdecl</span> calling convention because in that case the caller pops the function arguments from the stack).<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>Hotspot!Func3<br />
<span style="color: red;">00411d70 </span>55 push ebp<br />
00411d71 8bec mov ebp,esp<br />
00411d73 81ecc0000000 sub esp,0C0h<br />
00411d79 53 push ebx<br />
00411d7a 56 push esi<br />
00411d7b 57 push edi<br />
00411d7c 8dbd40ffffff lea edi,[ebp-0C0h]<br />
00411d82 b930000000 mov ecx,30h<br />
00411d87 b8cccccccc mov eax,0CCCCCCCCh<br />
00411d8c f3ab rep stos dword ptr es:[edi]<br />
00411d8e b804000000 mov eax,4<br />
00411d93 e8f8570000 call Hotspot!_chkstk (00417590)<br />
00411d98 e825f4ffff call Hotspot!ILT+445(?Func2YAXXZ) (004111c2)<br />
00411d9d 8da534ffffff lea esp,[ebp-0CCh]<br />
00411da3 5f pop edi<br />
00411da4 5e pop esi<br />
00411da5 5b pop ebx<br />
00411da6 8be5 mov esp,ebp<br />
00411da8 5d pop ebp<br />
00411da9 c20800 ret <span style="color: red;">8</span></td></tr>
</tbody></table><br />
<br />
In the case of function <span style="font-family: "Courier New",Courier,monospace;">Hotspot!Func3</span> the start address is <span style="font-family: "Courier New",Courier,monospace;">00411d70 </span>and the parameters size is <span style="font-family: "Courier New",Courier,monospace;">8 </span>bytes.<br />
<br />
Having the address of the first instruction of the function, we replace this instruction with the appropriate return instruction (in our case <span style="font-family: "Courier New",Courier,monospace;">ret 8</span>).<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>0:000> eb 00411d70 c2 08 00<br />
<br />
0:000> uf Hotspot!Func3<br />
Hotspot!Func3<br />
25 00411d70 c20800 ret 8</td></tr>
</tbody></table><br />
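<br />
The bytes typed into the debugger follow from the x86 encoding of the near return: opcode C3 for a plain ret, and opcode C2 followed by a 16-bit little-endian byte count for ret imm16. The sketch below shows how the patch bytes can be derived for any parameter size; MakeReturnPatch is a hypothetical helper written for this post, not a debugger or Windows API.<br />
<br />
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the machine code for an immediate return on 32-bit x86.
// paramBytes is the number of argument bytes the callee must pop
// (0 for __cdecl, where the caller cleans the stack).
std::vector<std::uint8_t> MakeReturnPatch(std::uint16_t paramBytes)
{
    assert(paramBytes % 4 == 0);   // stack arguments occupy whole 4-byte slots

    if (paramBytes == 0)
        return { 0xC3 };                                     // ret

    return { 0xC2,                                           // ret imm16
             static_cast<std::uint8_t>(paramBytes & 0xFF),   // low byte first
             static_cast<std::uint8_t>(paramBytes >> 8) };   // high byte
}
```
<br />
For 8 bytes of parameters this yields c2 08 00, the same bytes that were written with the eb command above. Note that such a stub does not set a return value, so the caller will see whatever happens to be in eax.<br />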
<br />
<i>Don’t forget to restore the original first instruction of the function before you continue your investigation :)</i><br />
<br />
Now we should run the process again and analyze the CPU usage of our thread. If it’s still high, at least part of our hotspot executes before the call to the <span style="font-family: "Courier New",Courier,monospace;">Hotspot!Func3</span> function and we should continue to exclude functions down the stack. Otherwise, if CPU usage returns to normal, we can conclude that we have cut out the whole set of hotspot instructions, and the next step is to make sure that <span style="font-family: "Courier New",Courier,monospace;">Hotspot!Func3</span> is really the initial point of the hotspot. That is, we should exclude the next function up the stack (in our case, the function <span style="font-family: "Courier New",Courier,monospace;">Hotspot!Func2</span>) and make sure that CPU usage increases. If it doesn’t, go up the stack until you reach a function after whose exclusion CPU usage rises again, that is, the function which contains a part of the hotspot.<br />
<br />
Thus, moving up or down the call stack, you can figure out where your hotspot is. If you have examined the whole call stack and still have no idea about the hotspot location, it’s better to return to step number two and make sure that you chose the right call stack to examine.<br />
<br />
Although the description of the process takes a few pages, in reality it takes several minutes of work. For me, that’s fast enough not to need a profiler for this task in most cases.<br />
<br />
Now the only thing left is to find the cause of the hotspot. Depending on the nature of the hotspot, its cause can be located inside one function (for example, an infinite loop) or can be distributed across several function calls, several call paths, several threads, processes or even computers. Unfortunately, there is no common algorithm for finding the cause of a hotspot even when you know where the hotspot is, and there are no profilers which can help you. Here you should again rely on your knowledge and intuition.<br />
<br />
<a href="http://www.codeproject.com/script/Articles/BlogArticleList.aspx?afid=1463" rel="tag" style="display: none;">CodeProject</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6413230975439051212.post-64198661356196614462010-09-09T20:35:00.009+04:002010-12-23T21:15:09.410+03:00The bug in MFC tool tip controlYesterday, I encountered a bug in MFC’s tool tip control implementation.<br />
The bug lurks in the following code:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>LRESULT CToolTipCtrl::OnAddTool(WPARAM wParam, LPARAM lParam)<br />
{<br />
<span style="color: red;">TOOLINFO ti = *(LPTOOLINFO)lParam;</span> <span style="color: blue;"><----- HERE</span><br />
if ((ti.hinst == NULL) && (ti.lpszText != LPSTR_TEXTCALLBACK)<br />
&& (ti.lpszText != NULL))<br />
{<br />
void* pv;<br />
if (!m_mapString.Lookup(ti.lpszText, pv))<br />
m_mapString.SetAt(ti.lpszText, NULL);<br />
// set lpszText to point to the permanent memory associated<br />
// with the CString<br />
VERIFY(m_mapString.LookupKey(ti.lpszText, ti.lpszText));<br />
}<br />
return DefWindowProc(TTM_ADDTOOL, wParam, (LPARAM)&ti);<br />
}</td></tr>
</tbody></table><br />
<br />
<a name='more'></a><br />
<br />
If you look closely at the definition of the <span style="font-family: "Courier New",Courier,monospace;">TOOLINFO </span>structure, you’ll find that this structure has a variable size which depends on which compilation macros are defined:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>typedef struct {<br />
UINT cbSize;<br />
UINT uFlags;<br />
HWND hwnd;<br />
UINT_PTR uId;<br />
RECT rect;<br />
HINSTANCE hinst;<br />
LPTSTR lpszText;<br />
#if (_WIN32_IE >= 0x0300)<br />
LPARAM lParam;<br />
#endif <br />
#if (_WIN32_WINNT >= 0x0501)<br />
void *lpReserved;<br />
#endif <br />
} TOOLINFO, *PTOOLINFO, *LPTOOLINFO;</td></tr>
</tbody></table><br />
<br />
That is, the structure pointed to by <span style="font-family: "Courier New",Courier,monospace;">lParam </span>can be from 40 to 48 bytes long depending on the compilation macros defined, so it can’t simply be copied into a structure created inside the MFC module. When dealing with variable-size structures, you can copy only the common part of the structure or check its version first (that is what the cbSize field was made for!).<br />
<br />
We can easily figure out the size of <span style="font-family: "Courier New",Courier,monospace;">TOOLINFO</span> expected by MFC:<br />
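<br />
A size-aware copy could look roughly like the sketch below. It is not MFC code: InfoV1, InfoV2 and CopyVersioned are hypothetical stand-ins for an older and a newer layout of a structure like TOOLINFO, and the only assumption is that the first field holds the size the caller actually used.<br />
<br />
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical older layout: what a caller built without the newest macros sees.
struct InfoV1
{
    unsigned cbSize;
    unsigned uFlags;
    long     lParam;
};

// Hypothetical newer layout: the same fields plus one appended at the end.
struct InfoV2
{
    unsigned cbSize;
    unsigned uFlags;
    long     lParam;
    void*    lpReserved;
};

// Copies no more bytes than the caller declared in cbSize and zero-fills the rest.
InfoV2 CopyVersioned(const void* src)
{
    assert(src != nullptr);

    unsigned callerSize = 0;
    std::memcpy(&callerSize, src, sizeof(callerSize));   // cbSize is the first field

    const std::size_t copied = std::min<std::size_t>(callerSize, sizeof(InfoV2));

    InfoV2 local;
    std::memset(&local, 0, sizeof(local));
    std::memcpy(&local, src, copied);                    // never reads past the caller's structure
    local.cbSize = static_cast<unsigned>(copied);        // records which fields are valid
    return local;
}
```
<br />
With this approach the bytes beyond the caller’s structure are never touched, so it doesn’t matter whether they are readable.<br />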
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>mfc71!CToolTipCtrl::OnAddTool:<br />
7c1a21e5 55 push ebp<br />
7c1a21e6 8bec mov ebp,esp<br />
<span style="color: red;">7c1a21e8 83ec30 sub esp,30h</span><br />
7c1a21eb 53 push ebx<br />
7c1a21ec 56 push esi<br />
7c1a21ed 8b750c mov esi,dword ptr [ebp+0Ch]<br />
7c1a21f0 57 push edi<br />
7c1a21f1 8bd9 mov ebx,ecx<br />
<span style="color: red;">7c1a21f3 6a0c push 0Ch</span><br />
<span style="color: red;">7c1a21f5 59 pop ecx</span><br />
7c1a21f6 8d7dd0 lea edi,[ebp-30h]<br />
<span style="color: red;">7c1a21f9 f3a5 rep movs dword ptr es:[edi],dword ptr [esi]</span></td></tr>
</tbody></table><br />
<br />
The size of <span style="font-family: "Courier New",Courier,monospace;">TOOLINFO </span>inside the MFC module is equal to 0xc*4 = 0x30 = 48 bytes. And the size of <span style="font-family: "Courier New",Courier,monospace;">TOOLINFO </span>structure inside our product’s module was equal to 44 bytes!<br />
<br />
Of course, in most cases nothing bad will happen. <span style="font-family: "Courier New",Courier,monospace;">CToolTipCtrl::OnAddTool</span> just copies 4 extra bytes and never even uses them. The only case when a problem may arise is when those 4 extra bytes are not readable. Though this situation is hardly possible if your <span style="font-family: "Courier New",Courier,monospace;">TOOLINFO</span> structure resides on the stack, what about heap memory? With a probability far from zero, your structure can occupy the last bytes of a memory page while the next page is not committed yet. This is exactly the case I encountered. The result is an access violation.<br />
<br />
Note that the problem can appear only if you allocate memory for the <span style="font-family: "Courier New",Courier,monospace;">TOOLINFO </span>structure inside your own module, that is, if you use the <a href="http://msdn.microsoft.com/en-us/library/bb760338%28VS.85%29.aspx">SendMessage approach</a>.<br />
<br />
I found the bug in the MFC 7.1 library, but it’s still not fixed even in MFC 10.0.<br />
<br />
<br />
<b>Workarounds</b><br />
<br />
There are several obvious ways to work around the problem:<br />
<br />
1. Do not use the <a href="http://msdn.microsoft.com/en-us/library/bb760338%28VS.85%29.aspx">SendMessage approach</a> to register a tool with a tool tip control. Use the <a href="http://msdn.microsoft.com/en-us/library/s2y2wf56%28VS.80%29.aspx"><span style="font-family: "Courier New",Courier,monospace;">CToolTipCtrl::AddTool</span> method</a> instead.<br />
<br />
If you cannot avoid the SendMessage approach (for static controls, for example), you can do the following:<br />
<br />
2. Compile the module that uses the tool tip control with all the relevant macros defined (<span style="font-family: "Courier New",Courier,monospace;">_WIN32_IE</span> and <span style="font-family: "Courier New",Courier,monospace;">_WIN32_WINNT</span>).<br />
<br />
3. Reserve extra bytes (4 or 8, depending on your options) which can be copied without impact.<br />
<br />
For example like this:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>// Extend TOOLINFO by 8 bytes<br />
struct TOOLINFO_EX : public TOOLINFO<br />
{<br />
DWORD m_dwReserve1;<br />
DWORD m_dwReserve2;<br />
};<br />
<br />
// Allocate memory from the heap (remember, it’s just an example :))<br />
TOOLINFO* pti = new TOOLINFO_EX;<br />
<br />
// Initialization<br />
<br />
// Register a tool<br />
toolTipCtrl.SendMessage(TTM_ADDTOOL, 0, (LPARAM) pti);</td></tr>
</tbody></table><br />
<a href="http://www.codeproject.com/script/Articles/BlogArticleList.aspx?afid=1463" rel="tag" style="display: none;">CodeProject</a>Unknownnoreply@blogger.com7tag:blogger.com,1999:blog-6413230975439051212.post-47801452810827980792010-09-03T15:28:00.001+04:002010-12-23T21:41:46.224+03:00To untangle a ropeLast weekend I was flying a kite with my daughter and her friend. It was great fun, but when I was bringing the kite down its string got tangled. I spent about 30 minutes (maybe even more, time flies fast :)) untangling the string. I followed the string loops, analyzed the types of knots, imagined what would happen to the string if I passed it through one loop or another. I went through a lot of different combinations in my mind and devised more and more new techniques. When I finally finished, I felt satisfaction. I felt no tiredness or irritation; furthermore, I had the feeling that I had been untangling strings all my life. When I thought about it, I realized that it’s not far from the truth. The process of untangling a string (as well as a rope, a chain and so on) is very similar to the process of researching software defects (the term was first introduced by <a href="http://www.dumpanalysis.org/">Dmitry Vostokov</a>). So if you are in doubt whether to become a software defect researcher or not, try to untangle a rope first :)Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6413230975439051212.post-44844071796692892812010-09-02T21:24:00.003+04:002011-01-21T19:35:11.364+03:00Can lightning strike the same place twice? (Or can a bomb hit an elephant?)An old Soviet movie about World War II tells the story of an old professor of mathematics. When Leningrad was being bombed, he never went down to the bomb shelter with the other people. He said that he had calculated the probability of a bomb hitting his house, and the probability was too small to hide from bombs. But one day his neighbors saw him in the bomb shelter and asked why he was hiding with the others. 
He replied that on that day a bomb had hit the only elephant in the city’s zoo, and the probability of hitting the elephant was exactly equal to the probability of him being hit.<br />
<br />
Why did I remember this story? Because during the last week I twice encountered failures that resulted from circumstances whose probability was very low. As people say: “If a gun hangs on the wall, it will fire one day or another”. <br />
<br />
<a name='more'></a><br />
<br />
The first hero of this post is a synchronization error related to smart pointers.<br />
<br />
Let’s look at the following code example:<br />
<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td><br />
// Thread safe structure<br />
struct Day : public IRefCounted<br />
{<br />
virtual void AddRef()<br />
{<br />
// Thread safe implementation<br />
}<br />
<br />
virtual void Release()<br />
{<br />
// Thread safe implementation<br />
}<br />
};<br />
<br />
struct Month<br />
{<br />
typedef boost::intrusive_ptr&lt;Day&gt; DayPtr;<br />
<br />
// Calls from thread 1<br />
void OnNewDay()<br />
{<br />
m_spCurrentDay = new Day();<br />
}<br />
<br />
// Calls from thread 2<br />
DayPtr GetCurrentDay()<br />
{<br />
return m_spCurrentDay;<br />
}<br />
<br />
DayPtr m_spCurrentDay;<br />
}; </td></tr>
</tbody></table><br />
Suppose that the Day structure is thread safe and Month’s methods OnNewDay() and GetCurrentDay() are called from different threads. Is this code thread safe?<br />
<br />
Let’s look at what happens inside the OnNewDay() function:<br />
<br />
1. A new Day object is created.<br />
<br />
2. A pointer to the Day object is passed to DayPtr::operator=().<br />
<br />
3. Inside DayPtr::operator=() a temporary DayPtr object is created (it now holds the pointer to the new Day). The reference count of the new Day object is incremented by 1.<br />
<br />
4. The temporary DayPtr object and m_spCurrentDay are swapped.<br />
<br />
5. The temporary DayPtr object goes out of scope and is destroyed. The reference count of the old Day object is decremented by 1. If it drops to 0, the old Day object is deleted.<br />
<br />
Here is the swapping code:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td> void swap(intrusive_ptr & rhs) <br />
{<br />
T * tmp = px;<br />
px = rhs.px;<br />
rhs.px = tmp;<br />
}</td></tr>
</tbody></table><br />
And its last line in machine code (we will need it later):<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td> //rhs.px = tmp;<br />
<br />
mov eax,dword ptr [rhs]<br />
mov ecx,dword ptr [tmp]<br />
<span style="color: red;"> mov dword ptr [eax],ecx </span><br />
</td></tr>
</tbody></table><br />
Now let’s look at what happens inside the GetCurrentDay() function:<br />
<br />
1. The copy constructor of DayPtr is called with m_spCurrentDay as its argument.<br />
<br />
2. Inside the copy constructor the pointer to Day is copied into the new DayPtr and the reference count of Day is incremented.<br />
<br />
Here is the code of the intrusive_ptr copy constructor:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td> intrusive_ptr(intrusive_ptr const & rhs): px( rhs.px )<br />
{<br />
if( px != 0 ) intrusive_ptr_add_ref( px );<br />
}</td></tr>
</tbody></table><br />
And here is the machine code:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td>// px( rhs.px )<br />
mov eax,dword ptr [this] <br />
mov ecx,dword ptr [rhs] <br />
mov edx,dword ptr [ecx] <br />
mov dword ptr [eax],edx <br />
<br />
// if( px != 0 ) intrusive_ptr_add_ref( px );<br />
<span style="color: #38761d;"> mov eax,dword ptr [this] </span><br />
<span style="color: #38761d;"> cmp dword ptr [eax],0 </span><br />
<span style="color: #38761d;">je boost::intrusive_ptr&lt;Day&gt;::intrusive_ptr&lt;Day&gt;+43h (412473h) </span><br />
<span style="color: #38761d;"> mov eax,dword ptr [this] </span><br />
<span style="color: #38761d;"> mov ecx,dword ptr [eax] </span><br />
<span style="color: #38761d;"> push ecx </span><br />
 call intrusive_ptr_add_ref (4116CCh) <br />
</td></tr>
</tbody></table><br />
Now let’s imagine the following situation:<br />
<br />
1. Thread number 2 calls GetCurrentDay() and is preempted after the pointer to Day has been copied into its DayPtr but before the reference count of the Day object has been incremented (see the instructions marked in green).<br />
<br />
2. Now thread number 1 resumes and calls OnNewDay() (or maybe resumes already inside OnNewDay(), but before the instruction marked in red). It stores a pointer to a new Day object in m_spCurrentDay and decrements the reference count of the old Day object. If the reference count drops to 0, the old Day object is deleted. After that thread #1 yields control to the system.<br />
<br />
3. When thread #2 gets control back, it increments the reference count by 1, BUT the object no longer exists.<br />
<br />
Thus we have an intrusive_ptr to an already deleted object. This is definitely not what we expect when using a smart pointer. If you look at the copy constructors of boost’s other smart pointers, you’ll find a similar situation.<br />
<br />
<br />
<table bgcolor="#fbfbca"><tbody>
<tr><td><i>Please note: a single boost smart pointer instance is not safe to read in one thread while another thread modifies it! </i></td></tr>
</tbody></table><br />
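<br />
One straightforward way to close this race is to guard every access to the shared pointer member with a lock. The sketch below is only an illustration under stated assumptions: it uses std::shared_ptr and std::mutex instead of boost::intrusive_ptr so that it is self-contained, and the number field is added just to tell the Day objects apart.<br />
<br />
```cpp
#include <memory>
#include <mutex>

struct Day
{
    int number = 0;   // illustrative payload, not part of the original example
};

class Month
{
public:
    using DayPtr = std::shared_ptr<Day>;

    // Called from thread 1.
    void OnNewDay(int number)
    {
        DayPtr fresh = std::make_shared<Day>();
        fresh->number = number;
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_spCurrentDay.swap(fresh);   // the swap can no longer interleave with a reader
        }
        // 'fresh' now holds the previous Day and releases it here, outside the lock.
    }

    // Called from thread 2.
    DayPtr GetCurrentDay() const
    {
        std::lock_guard<std::mutex> lock(m_lock);
        return m_spCurrentDay;   // pointer copy and count increment both happen under the lock
    }

private:
    mutable std::mutex m_lock;
    DayPtr m_spCurrentDay;
};
```
<br />
The reader can no longer be interrupted between copying the raw pointer and incrementing the reference count, because the writer has to wait for the same lock before it swaps the member.<br />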
<br />
Another good example of synchronization problems is the initialization of the Platform Utils environment of the Xerces XML parser. Although Xerces is thread safe in the sense that you can create and use a parser instance in any application thread (but you can’t use one parser instance in different threads without synchronization!), there is an important exception to this rule: the Platform Utils environment. <br />
<br />
Before using the Xerces parser you should initialize the Platform Utils environment by calling the XMLPlatformUtils::Initialize() method (and deinitialize it after use through XMLPlatformUtils::Terminate()). The XMLPlatformUtils class uses a global reference count to control the objects’ lifetime and to ensure one-time initialization and deinitialization.<br />
<br />
Let’s look at the code:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td> void XMLPlatformUtils::Initialize(const char* const locale<br />
, const char* const nlsHome<br />
, PanicHandler* const panicHandler<br />
, MemoryManager* const memoryManager<br />
, bool toInitStatics)<br />
{<br />
if (gInitFlag == LONG_MAX)<br />
return;<br />
<br />
gInitFlag++;<br />
<br />
if (gInitFlag > 1)<br />
return;<br />
<br />
<...><br />
}<br />
<br />
void XMLPlatformUtils::Terminate()<br />
{<br />
if (gInitFlag == 0)<br />
return;<br />
<br />
gInitFlag--;<br />
<br />
if (gInitFlag > 0)<br />
return;<br />
<br />
<...><br />
}</td></tr>
</tbody></table><br />
It’s absolutely clear that the only candidates for a race condition are the gInitFlag++ and gInitFlag-- statements. If we consider them atomic, everything works just fine: the initialization and deinitialization code executes only once no matter which threads call XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate(). But are they really atomic?<br />
<br />
Let’s look at the assembly code:<br />
<br />
<table bgcolor="#eeeeee" style="font-family: "Courier New",Courier,monospace;"><tbody>
<tr><td> // gInitFlag++;<br />
1 mov eax,dword ptr [gInitFlag (9C2AA54h)] <br />
2 add eax,1 <br />
3 mov dword ptr [gInitFlag (9C2AA54h)],eax <br />
<br />
// gInitFlag--;<br />
1 mov eax,dword ptr [gInitFlag (9C2AA54h)] <br />
2 sub eax,1 <br />
3 mov dword ptr [gInitFlag (9C2AA54h)],eax</td></tr>
</tbody></table><br />
What will happen if several threads execute the increment or decrement code simultaneously?<br />
<br />
Let’s consider the situation with the gInitFlag++ statement:<br />
<br />
1. The reference count is 0.<br />
<br />
2. Thread #1 executes instruction #1 and is preempted. The reference count is still 0.<br />
<br />
3. Thread #2 executes all 3 instructions and is preempted. The reference count is now 1.<br />
<br />
4. Thread #1 executes instructions #2 and #3. The reference count is still 1.<br />
<br />
<br />
<br />
Ooops! We called XMLPlatformUtils::Initialize() twice, but our reference count is still equal to 1! This means that the Platform Utils environment will be destroyed by the first call to XMLPlatformUtils::Terminate(), which will most likely result in a crash when the second thread tries to use it. <br />
<br />
And what will happen with the gInitFlag-- statement in the same situation? The reference count will be larger than it should be, and this will result in a resource leak.<br />
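<br />
The interleaving from the numbered steps can be replayed deterministically in a single thread, which makes the lost update easy to see. This is only a model: IncrementingThread and ReplayLostUpdate are hypothetical names, and each method stands for one of the three machine instructions shown above.<br />
<br />
```cpp
// Models one thread running the three-instruction increment of a shared counter.
struct IncrementingThread
{
    long eax = 0;   // the thread's private register

    void Load(const long& shared)  { eax = shared; }   // mov eax,dword ptr [gInitFlag]
    void Add()                     { eax += 1; }       // add eax,1
    void Store(long& shared) const { shared = eax; }   // mov dword ptr [gInitFlag],eax
};

// Replays the interleaving described in the steps above and returns the final counter.
long ReplayLostUpdate()
{
    long gInitFlag = 0;
    IncrementingThread t1, t2;

    t1.Load(gInitFlag);         // thread #1 executes instruction #1 and is preempted

    t2.Load(gInitFlag);         // thread #2 executes all three instructions
    t2.Add();
    t2.Store(gInitFlag);

    t1.Add();                   // thread #1 resumes with its stale value
    t1.Store(gInitFlag);

    return gInitFlag;           // two increments were requested, only one is visible
}
```
<br />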
<br />
<table bgcolor="#fbfbca"><tbody>
<tr><td><i>So it’s better to synchronize access to the XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate() methods if you use them in a multithreaded environment!</i></td></tr>
</tbody></table><br />
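<br />
A simple way to follow this advice is to route all initialization and termination calls through a small wrapper that holds a lock. The sketch below is an assumption-laden illustration rather than Xerces code: GuardedEnvironment, EnvInitCalls and EnvTermCalls are hypothetical names, and the two counters merely stand in for the places where the real XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate() calls would go.<br />
<br />
```cpp
#include <mutex>

// Hypothetical stand-ins that count how often the real setup and teardown would run.
inline int& EnvInitCalls() { static int n = 0; return n; }
inline int& EnvTermCalls() { static int n = 0; return n; }

class GuardedEnvironment
{
public:
    // Safe to call from any thread; the real setup runs only for the first user.
    static void Initialize()
    {
        std::lock_guard<std::mutex> lock(Mutex());
        if (Users()++ == 0)
            ++EnvInitCalls();   // the XMLPlatformUtils::Initialize() call would go here
    }

    // The real teardown runs only when the last user leaves; extra calls are ignored.
    static void Terminate()
    {
        std::lock_guard<std::mutex> lock(Mutex());
        if (Users() > 0 && --Users() == 0)
            ++EnvTermCalls();   // the XMLPlatformUtils::Terminate() call would go here
    }

private:
    static std::mutex& Mutex() { static std::mutex m; return m; }
    static long& Users()       { static long n = 0; return n; }
};
```
<br />
Because the check and the update of the user count happen under one lock, two threads can no longer both read the same stale value. The price is that every module of the application has to use the wrapper instead of calling the library directly.<br />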
Of course the probability of the situations discussed is very low, but they do happen. Believe me :)Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6413230975439051212.post-73953399511167383042010-08-02T17:56:00.004+04:002010-12-23T21:17:29.131+03:00Why it's not crashing? (conclusion)Well, today I received the final message from MS, in which they told me once again that nothing can be done on their side and advised me to catch the exception in user code and call MiniDumpWriteDump after that. I think the story can be considered finished.<br />
<br />
C’est la vie, mes amis. C’est la vie.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-6413230975439051212.post-37174695587868465762010-08-01T14:48:00.006+04:002010-12-23T21:33:35.113+03:00In search of Dr. WatsonA week ago I returned from a business trip to Italy. It was a trip to a yacht that had real problems with the stability of its navigation system. Once or twice a day all navigation data from the sensors froze, and the navigation equipment was inoperative until the whole bridge (5 stations) was rebooted.<br />
<br />
Just before the trip I received some information concerning the problem from our service engineer.<br />
There were several observations:<br />
<br />
1. All navigation data froze simultaneously, but the navigation system (radar, cartographic system and other subsystems) continued to work without any other noticeable problems.<br />
<br />
2. Sometimes a message box with a “Pure virtual function call” notification was observed after the data froze.<br />
<br />
3. The problem was first observed just after the last update of the product.<br />
<br />
4. In the end I even received HD images from all stations of the bridge (they were created after the last update and after the problem had appeared for the first time).<br />
<br />
<br />
The description of the problem clearly pointed to the fact that the cause was the crashing (or freezing) of the navigation server processes at all stations simultaneously (each computer runs its own navigation server, and the described problem is only possible if no servers are alive). The only thing that confused me was the “Pure virtual function call” error mentioned earlier. This error is not very common and rarely occurs in its pure form. The most common cause of the error is an improper sequence of deinitialization and destruction of a chain of objects, as well as a poorly thought-out object ownership strategy (of course I’m not taking into account errors such as calling a pure virtual method of a child class inside a parent class destructor or deleting an object twice; I believe that our code is better than that :). One of the classic situations is caused by an assertion message box triggered during object destruction (usually in debug mode). The message loop of the message box can provoke a call (through a pure virtual interface) to a partially destroyed object. As a result we get a clean pure virtual function call.<br />
<br />
<a name='more'></a><br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0012df14 004137f4 MSVCR71D!_purecall+0x19</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012dff4 7c22f7b6 PureCall!CMyDlg::OnTimer+0x34</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e0f8 7c22efde MFC71D!CWnd::OnWndMsg+0x7a6</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e118 7c22c820 MFC71D!CWnd::WindowProc+0x2e</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e190 7c22ccfe MFC71D!AfxCallWndProc+0xe0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e1b0 7c29d8ea MFC71D!AfxWndProc+0x9e</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e1e0 7e418734 MFC71D!AfxWndProcBase+0x4a</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e20c 7e418816 USER32!InternalCallWinProc+0x28</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e274 7e4189cd USER32!UserCallWinProcCheckWow+0x150</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e2d4 7e418a10 USER32!DispatchMessageWorker+0x306</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e2e4 7e42dbbf USER32!DispatchMessageW+0xf</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e31c 7e42593f USER32!DialogBox2+0x15a</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e344 7e43a91e USER32!InternalDialogBox+0xd0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e604 7e43a284 USER32!SoftModalMessageBox+0x938</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e754 7e4661d3 USER32!MessageBoxWorker+0x2ba</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e7ac 7e466278 USER32!MessageBoxTimeoutW+0x7a</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e7e0 7e450617 USER32!MessageBoxTimeoutA+0x9c</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e800 7e4505cf USER32!MessageBoxExA+0x1b</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e81c 1020c2aa USER32!MessageBoxA+0x45</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012e854 102093a1 MSVCR71D!__crtMessageBoxA+0x16a</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ebb8 00413a84 MSVCR71D!_assert+0x5b1</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012eca4 00413beb PureCall!TObjectBase::~TObjectBase+0x44</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ed84 004136cb PureCall!TObject::~TObject+0x2b</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ee64 00418598 PureCall!TObject::`scalar deleting destructor'+0x2b</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ef64 7c22f7a4 PureCall! CMyDlg::OnClose+0x58</span></td></tr>
</tbody></table><br />
The situation aboard was not the same because the navigation servers crashed during their normal execution, not during exit. Or something had to be forcing them to exit.<br />
<br />
I hoped that the HD images would clarify the cause of the pure virtual function call. First of all I checked the images for the presence of Dr. Watson dumps. Nothing was found. Then we installed the images on our test bench. Nothing was found within a few days. <br />
<br />
There was no stable internet connection aboard and no possibility of asking someone from the crew to run some tools before or after the next occurrence of the problem. Furthermore, there were only a few days to solve the problem, so we decided that investigating the problem aboard would be the most efficient approach. <br />
<br />
Of course the first thing I did on board was to check for Dr. Watson dumps again. Imagine my surprise when I found a few dumps on two different stations with two different timestamps. It was really very strange for several reasons:<br />
<br />
1. There were no dumps inside the HD images (remember, the images were created after the problem had appeared for the first time).<br />
<br />
2. According to the problem description there should have been 5 dumps (if any) with nearly identical timestamps.<br />
<br />
3. Again, according to the problem description nobody saw the Dr. Watson UI. So it’s not clear where the dumps came from.<br />
<br />
There was no time for experiments irrelevant to the problem, so I decided to postpone a deep investigation of this issue and proceed to analyzing the crash dumps. The culprit was found very easily. It was a classic case of buffer overflow. <br />
<br />
Here is the call stack:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">WARNING: Frame IP not in any known module. Following frames may be wrong.</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be68 09c00cac <span style="color: red;">0x54205344</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be6c 09c00ff7 0x9c00cac</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be70 45564944 0x9c00ff7</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be74 41205352 0x45564944</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be78 53495353 0x41205352</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be7c 20444554 0x53495353</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be80 42205942 0x20444554</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be84 45475241 0x42205942</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be88 4e412053 0x45475241</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be8c 55542044 0x4e412053</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be90 4f422047 0x55542044</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be94 20535441 0x4f422047</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><…></span></td></tr>
</tbody></table><br />
And here is the raw stack memory:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">09e9be6c 09c00cac 09c00ff7 45564944 41205352 ........DIVERS A</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be7c 53495353 20444554 42205942 45475241 SSISTED BY BARGE</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be8c 4e412053 55542044 4f422047 20535441 S AND TUG BOATS </span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9be9c 4c4c4957 20454220 4b2a4f2a 20474e49 WILL BE *O*KING </span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9beac 54204e49 202a4948 41455241 4548542e IN THI* AREA.THE</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bebc 52414220 20534547 20444e41 20475554 BARGES AND TUG </span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9becc 54414f42 414d2053 45422059 46454c20 BOATS MAY BE LEF</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bedc 4e492054 45485420 4f424120 4d204556 T IN THE ABOVE M</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9beec 49544e45 44454e4f 45524120 55442041 ENTIONED AREA DU</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9befc 474e4952 45485420 472a4e20 4d2e5448 RING THE N*GHT.M</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf0c 4e495241 20535245 20455241 54534e49 ARINERS ARE INST</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf1c 54435552 54204445 20544148 49544e55 RUCTED THAT UNTI</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf2c 5546204c 45485452 4f4e2052 45434954 L FURTHER NOTICE</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf3c 562a4e2c 54414749 204e4f49 54204e49 ,N*VIGATION IN T</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf4c 41204548 45564f42 45522a20 53492041 HE ABOVE *REA IS</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf5c 544f4e20 52455020 5454494d 542e4445 NOT PERMITTED.T</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf6c 20594548 205a5241 2a204f42 47495641 HEY ARZ BO *AVIG</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf7c 20455441 482a4957 55414320 4e4f4954 ATE WI*H CAUTION</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf8c 444e4120 4f4c5320 50532057 20444545 AND SLOW SPEED </span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bf9c 4e454857 204e4920 20454854 49434956 WHEN IN THE VICI</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bfac 5954494e 444e4120 204f5420 45564947 NITY AND TO GIVE</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bfbc 45485420 444e4920 54414349 41204445 THE INDICATED A</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bfcc 20414552 49572041 42204544 48545245 REA A WIDE BERTH</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bfdc 52414d2e 52454e49 52412053 4c412045 .MARINERS ARE AL</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bfec 41204f53 532a5644 54204445 4f46204f SO ADV*SED TO FO</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9bffc 574f4c4c 594e2a20 534e4920 43555254 LLOW *NY INSTRUC</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c00c 4e4f4954 2a472053 2a4e2a56 5320592a TIONS G*V*N**Y S</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c01c 4f505055 43205452 54464152 204e4920 UPPORT CRAFT IN </span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c02c 20454854 412a5241 444e4120 4c2a5620 THE AR*A AND V*L</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c03c 5454454c 4f502041 43205452 52544e4f LETTA PORT CONTR</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c04c 57204c4f 4d204f48 49205941 45555353 OL WHO MAY ISSUE</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c05c 594e4120 57454e20 52415720 474e494e ANY NEW WARNING</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c06c 49572053 52204854 52414745 <span style="color: red;">54205344 </span>S WITH REGAR<span style="color: red;">DS T</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">09e9c07c 4854204f 4241202a 2e45564f 00000000 O TH* ABOVE.....</span></td></tr>
</tbody></table><br />
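The address highlighted in red can be checked by hand: x86 is little-endian, so a 32-bit value read from the stack holds its four text bytes in reverse order. The following is a minimal sketch of that decoding; the function name <span style="font-family: "Courier New",Courier,monospace;">dword_to_ascii</span> is made up for illustration and is not part of any debugger.<br />

```cpp
#include <cstdint>
#include <string>

// Convert one 32-bit value taken from a little-endian memory dump into the
// four characters it occupies in memory (lowest byte first).
std::string dword_to_ascii(std::uint32_t value) {
    std::string text;
    for (int i = 0; i < 4; ++i) {
        const unsigned char byte =
            static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
        // Same convention as the dump above: non-printable bytes become '.'.
        text += (byte >= 0x20 && byte < 0x7F) ? static_cast<char>(byte) : '.';
    }
    return text;
}
```

Applied to the faulting instruction pointer <span style="font-family: "Courier New",Courier,monospace;">0x54205344</span>, this gives the bytes 44 53 20 54, that is, the characters "DS T" from the word "REGARDS": the return address was overwritten with message text.<br />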
And here is the culprit:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td style="font-family: "Courier New",Courier,monospace;">::memcpy(sentence.Sentence, pMessData, nMessSize);</td></tr>
</tbody></table><br />
The yacht received an overly long message, presumably from Malta, once or twice a day, and it crashed the system. The <span style="font-family: "Courier New",Courier,monospace;">sentence.Sentence</span> buffer was 512 bytes in size, and the message was clearly longer.<br />
<br />
By the way, we didn’t receive this message again during the three days I was aboard. So the dumps I found were the only chance to catch the problem.<br />
<br />
When I came home, I finally had a chance to find the answers to the following questions:<br />
<br />
1. It’s absolutely clear that the navigation servers failed because of an “Access violation” exception. But why was the result of the failure a “Pure virtual function” message box instead of the Dr. Watson UI (and logs along with dumps)?<br />
<br />
2. Where did the dumps that I found on the bridge come from?<br />
<br />
3. And why did they have different timestamps?<br />
<br />
First of all I decided to analyze the behavior of the <span style="font-family: "Courier New",Courier,monospace;">UnhandledExceptionFilter </span>function. <br />
<br />
<table bgcolor="#fbfbca"><tbody>
<tr><td><i>NOTE:</i><br />
<i></i><br />
<i>It is worth keeping in mind that <span style="font-family: "Courier New",Courier,monospace;">UnhandledExceptionFilter </span>checks whether the process is being debugged, so that it can send a debug event to the debugger rather than handle the exception the way it does for non-debugged processes.</i><br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">7c862bf9 89bddcfeffff mov dword ptr [ebp-124h],edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862bff 57 push edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c00 6a04 push 4</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Set pointer to DWORD_PTR value that is the port number of the debugger for the process. A nonzero value indicates that the process is being run under the control of a ring 3 debugger.</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7c862c02 8d85dcfeffff lea eax,[ebp-124h]</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7c862c08 50 push eax</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c09 6a07 push 7</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c0b e8fdb3faff call kernel32!GetCurrentProcess (7c80e00d)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c10 50 push eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c11 ff15ac10807c call dword ptr <span style="color: red;">[kernel32!_imp__NtQueryInformationProcess (7c8010ac)]</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c17 85c0 test eax,eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c19 0f8ca2000000 jl kernel32!UnhandledExceptionFilter+0x137 (7c862cc1)</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Check if process is being run under debugger</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7c862c1f 39bddcfeffff cmp dword ptr [ebp-124h],edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c862c25 0f8496000000 je kernel32!UnhandledExceptionFilter+0x137 (7c862cc1)</span></td></tr>
</tbody></table><br />
<i>That is, to make <span style="font-family: "Courier New",Courier,monospace;">UnhandledExceptionFilter</span> behave as it does for a process that is not being debugged, we have to either overwrite the debugger port value with 0 just before <span style="font-family: "Courier New",Courier,monospace;">UnhandledExceptionFilter </span>checks it or use a kernel-mode debugger.</i></td></tr>
</tbody></table><br />
To save myself the trouble, I chose the second way. <br />
<br />
I found that after checking various conditions (exception flags, exception code, number of parameters and so on) <span style="font-family: "Courier New",Courier,monospace;">UnhandledExceptionFilter </span>checks the error mode value.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">7c862ce5 e8eb7dfaff call kernel32!GetErrorMode (7c80aad5)</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Check the SEM_NOGPFAULTERRORBOX flag</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7c862cea a802 test al,2</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">7c862cec 0f8563070000 jne kernel32!UnhandledExceptionFilter+0x8ad (7c863455)</span></td></tr>
</tbody></table><br />
If the <span style="font-family: "Courier New",Courier,monospace;">SEM_NOGPFAULTERRORBOX</span> flag is set, <span style="font-family: "Courier New",Courier,monospace;">UnhandledExceptionFilter </span>just returns <span style="font-family: "Courier New",Courier,monospace;">EXCEPTION_EXECUTE_HANDLER</span> to the exception handler.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">kernel32!UnhandledExceptionFilter+0x8ad:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c863455 33c0 xor eax,eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c863457 40 inc eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c863458 8da59cfdffff lea esp,[ebp-264h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c86345e 8b4de4 mov ecx,dword ptr [ebp-1Ch]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c863461 e8ab62faff call kernel32!__security_check_cookie (7c809711)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c863466 e8a0f0f9ff call kernel32!_SEH_epilog (7c80250b)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c86346b c20400 ret 4</span></td></tr>
</tbody></table><br />
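The branch shown in the two listings above can be summarized in a few lines of C++. This is a simplified model, not the real function: the enum and the function name are invented for illustration, the constant is defined locally with its documented value so the sketch builds without <span style="font-family: "Courier New",Courier,monospace;">windows.h</span>, and everything the real filter does on the other path is collapsed into a single outcome.<br />

```cpp
#include <cstdint>

// Documented value of the flag, defined locally for a portable sketch.
constexpr std::uint32_t kSemNoGpFaultErrorBox = 0x0002;

// Hypothetical names for the two outcomes discussed in the post.
enum class FilterPath {
    ReturnExecuteHandler,     // xor eax,eax / inc eax / ret 4  ->  returns 1
    ContinueToFaultReporting  // the path that can end in Dr. Watson
};

// Models "test al,2 / jne ...+0x8ad": only bit 1 of the error mode matters.
FilterPath filter_path_for_error_mode(std::uint32_t error_mode) {
    if ((error_mode & kSemNoGpFaultErrorBox) != 0) {
        return FilterPath::ReturnExecuteHandler;
    }
    return FilterPath::ContinueToFaultReporting;
}
```

With the flag set, the filter never reaches the fault reporting code, which matches the absence of the Dr. Watson UI in the problem description.<br />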
In my case the exception handler was <span style="font-family: "Courier New",Courier,monospace;">MSVCR71!_except_handler3</span>.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">090cba48 7c36e289 kernel32!UnhandledExceptionFilter</span><br />
<span style="font-family: "Courier New",Courier,monospace;">090cba64 7c36b398 MSVCR71!_XcptFilter+0x15f</span><br />
<span style="font-family: "Courier New",Courier,monospace;">090cba70 7c363943 MSVCR71!_threadstartex+0x86</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">090cba98 7c9037bf MSVCR71!_except_handler3+0x61</span><br />
<span style="font-family: "Courier New",Courier,monospace;">090cbabc 7c90378b ntdll!ExecuteHandler2+0x26</span><br />
<span style="font-family: "Courier New",Courier,monospace;">090cbb6c 7c90eafa ntdll!ExecuteHandler+0x24</span><br />
<span style="font-family: "Courier New",Courier,monospace;">090cbb6c 27534e41 ntdll!KiUserExceptionDispatcher+0xe</span><br />
<span style="font-family: "Courier New",Courier,monospace;">WARNING: Frame IP not in any known module. Following frames may be wrong.</span><br />
<span style="font-family: "Courier New",Courier,monospace;">090cbe68 4c2d4f4b 0x27534e41</span></td></tr>
</tbody></table><br />
The exception handler just unwinds the stack and returns control to <span style="font-family: "Courier New",Courier,monospace;">MSVCR71!_threadstartex</span>, which in turn calls the <span style="font-family: "Courier New",Courier,monospace;">MSVCR71!_exit</span> function.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">MSVCR71!_except_handler3+0x61:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363943 5d pop ebp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363944 5e pop esi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363945 8b5d0c mov ebx,dword ptr [ebp+0Ch]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363948 0bc0 or eax,eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36394a 743f je MSVCR71!_except_handler3+0xa9 (7c36398b)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36394c 7848 js MSVCR71!_except_handler3+0xb4 (7c363996)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36394e 8b7b08 mov edi,dword ptr [ebx+8]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363951 53 push ebx</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363952 e88c000000 call MSVCR71!__global_unwind2 (7c3639e3)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363957 83c404 add esp,4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36395a 8d6b10 lea ebp,[ebx+10h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36395d 56 push esi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36395e 53 push ebx</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36395f e8c1000000 call MSVCR71!__local_unwind2 (7c363a25)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363964 83c408 add esp,8</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363967 8d0c76 lea ecx,[esi+esi*2]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36396a 6a01 push 1</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36396c 8b448f08 mov eax,dword ptr [edi+ecx*4+8]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363970 e83a010000 call MSVCR71!_NLG_Notify (7c363aaf)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363975 8b048f mov eax,dword ptr [edi+ecx*4]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363978 89430c mov dword ptr [ebx+0Ch],eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36397b 8b448f08 mov eax,dword ptr [edi+ecx*4+8]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36397f 33db xor ebx,ebx</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363981 33c9 xor ecx,ecx</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363983 33d2 xor edx,edx</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363985 33f6 xor esi,esi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c363987 33ff xor edi,edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="color: red;">7c363989 ffd0 call eax</span> <span style="color: #38761d;">// {MSVCR71!_threadstartex+0x89 (7c36b39b)}</span></span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">MSVCR71!_threadstartex+0x89</span><br />
<span style="font-family: "Courier New",Courier,monospace;">001b:7c36b39b 8b65e8 mov esp,dword ptr [ebp-18h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">001b:7c36b39e ff75e4 push dword ptr [ebp-1Ch]</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">001b:7c36b3a1 e8cef9ffff call MSVCR71!_exit (7c36ad74)</span></td></tr>
</tbody></table><br />
Thus, the first question was almost answered. The “Pure virtual function call” message box was caused by an improper sequence of object deinitialization and destruction. The owner of a pointer to a pure virtual interface tried to call a method of an already destroyed object (just as I expected).<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td style="font-family: "Courier New",Courier,monospace;">090cfbc8 0b0135e1 MSVCR71!_purecall+0x12<br />
090cfcf8 0b056232 VdsProtocols!NVdsProt::IVdsProtMngrImpl::CloseSession+0x31<br />
090cfd24 0b06430d VdsProtocols!NVdsProt::CVdsTargetsWProt::Deinit+0xc2<br />
090cfd4c 0b07ce76 VdsProtocols!NVdsProt::CVdsTargetsWProt::~CVdsTargetsWProt+0x3d<br />
090cfd50 0b07cf39 VdsProtocols!_CRT_INIT+0x95<br />
090cfd8c 7c9011a7 VdsProtocols!_DllMainCRTStartup+0x9a<br />
090cfdac 7c923f31 ntdll!LdrpCallInitRoutine+0x14<br />
090cfe30 7c81ca3e ntdll!LdrShutdownProcess+0x14f<br />
090cff24 7c81cab6 kernel32!_ExitProcess+0x42<br />
090cff38 7c3638e1 kernel32!ExitProcess+0x14<br />
090cff40 7c3638c4 MSVCR71!__crtExitProcess+0x2e<br />
090cff70 7c36ad81 MSVCR71!doexit+0xab<br />
090cff80 7c36b3a5 MSVCR71!_exit+0xd<br />
090cffb4 7c80b50b MSVCR71!_threadstartex+0x93<br />
090cffec 00000000 kernel32!BaseThreadStart+0x37</td></tr>
</tbody></table><br />
The only thing left unclear was where this flag came from. I set a breakpoint on the <span style="font-family: "Courier New",Courier,monospace;">kernel32!SetErrorMode</span> function to see who sets the error mode flags and when. The result was a bit unexpected. A lot of different components, including system modules, set the flags one after another (especially while modules are being loaded). Fortunately, when I investigated the code more closely, I realized that all of them use the <span style="font-family: "Courier New",Courier,monospace;">SetErrorMode </span>function in the typical way. They just set the flags to the desired value and then restore the old value after their code has executed.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">kernel32!GetLongPathNameW+0x3d:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c813396 6801800000 push 8001h</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c81339b e87f78ffff call kernel32!SetErrorMode (7c80ac1f)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c8133a0 898568fdffff mov dword ptr [ebp-298h],eax</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">kernel32!GetLongPathNameW+0x3a0:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c813656 ffb568fdffff push dword ptr [ebp-298h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c81365c e8be75ffff call kernel32!SetErrorMode (7c80ac1f)</span></td></tr>
</tbody></table><br />
So, all that is necessary to find the culprit is to find the code that sets the SEM_NOGPFAULTERRORBOX flag for the first time.<br />
<br />
It wasn’t hard. The first call to the <span style="font-family: "Courier New",Courier,monospace;">SetErrorMode </span>function set the error mode value to <span style="font-family: "Courier New",Courier,monospace;">80007h</span>. The culprit was our own module, which hosted the server that I was debugging. <br />
<br />
Finally the picture was absolutely clear, and the answers to the second and third questions were obvious. If at the moment of the crash some thread happens to have temporarily changed the error mode value, the OS calls Dr. Watson for the process; otherwise we see the familiar message box with the “Pure virtual function call” notification.<br />
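<br />
The timing window can be illustrated with a small model. It is only a sketch: <span style="font-family: "Courier New",Courier,monospace;">ErrorModeModel</span> is a hypothetical stand-in for the process-wide error mode, not the real API, and the two values used with it (<span style="font-family: "Courier New",Courier,monospace;">80007h</span> and <span style="font-family: "Courier New",Courier,monospace;">8001h</span>) are the ones seen in the debugger above.<br />

```cpp
#include <cstdint>

// Documented value of the flag, defined locally for a portable sketch.
constexpr std::uint32_t kSemNoGpFaultErrorBox = 0x0002;

// Hypothetical model of the process-wide error mode. set() mirrors the
// SetErrorMode contract: install the new mode and return the previous one.
class ErrorModeModel {
public:
    explicit ErrorModeModel(std::uint32_t initial) : mode_(initial) {}

    std::uint32_t set(std::uint32_t new_mode) {
        const std::uint32_t previous = mode_;
        mode_ = new_mode;
        return previous;
    }

    std::uint32_t get() const { return mode_; }

private:
    std::uint32_t mode_;
};

// A crash is reported through Dr. Watson only while the flag is clear.
bool crash_reaches_dr_watson(const ErrorModeModel& process) {
    return (process.get() & kSemNoGpFaultErrorBox) == 0;
}
```

Starting from <span style="font-family: "Courier New",Courier,monospace;">80007h</span> the model reports no Dr. Watson. While another thread is between <span style="font-family: "Courier New",Courier,monospace;">set(0x8001)</span> and the call that restores the old value, the flag is clear, and a crash that lands in that window produces a dump. That explains both the rare dumps and their different timestamps.<br />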
<br />
The conclusion from the story is very simple: try to avoid the <span style="font-family: "Courier New",Courier,monospace;">SEM_NOGPFAULTERRORBOX</span> flag where it’s not necessary (e.g. in service processes).<br />
<br />
Why it's not crashing? (published 2010-07-30, updated 2010-12-23)<br />
<br />
Several months ago, my colleagues encountered very strange behavior in one of our processes after an exception was thrown. They told me they knew for certain that the process threw an exception that their code was not expected to handle; that is, the exception should have gone unhandled and the process should have been terminated immediately. But nothing like that happened. The process just continued its execution. At first, I thought that there must be some code down the stack that registers an exception handler which handles any exception indiscriminately. Therefore, the first thing I did was search for the <span style="font-family: "Courier New",Courier,monospace;">try{}catch(…){}</span> pattern down the stack. When I found nothing, I looked at the list of exception handlers at the point where the exception was thrown. <br />
<br />
It looked as follows:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0012fd14: USER32!_except_handler3+0 (7e440457)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> CRT scope 0, func: USER32!UserCallWinProc+10a (7e44aa1c)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fd6c: USER32!_except_handler3+0 (7e440457)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> CRT scope 0, filter: USER32!DispatchMessageWorker+113 (7e440712)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="color: black;"> func: USER32!DispatchMessageWorker+126 </span>(7e44072a)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ffb0: VanishExcept!ILT+375(__except_handler3)+0 (0041117c)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ffe0: kernel32!_except_handler3+0 (7c839af0)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> CRT scope 0, filter: kernel32!BaseProcessStart+29 (7c84377a)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> func: kernel32!BaseProcessStart+3a (7c843790)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Invalid exception stack at ffffffff</span></td></tr>
</tbody></table><br />
There were no user exception handlers at all, but there was an exception handler I had not seen before:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0012fd14: USER32!_except_handler3+0 (7e440457)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> CRT scope 0, func: USER32!UserCallWinProc+10a (7e44aa1c)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fd6c: USER32!_except_handler3+0 (7e440457)</span><br />
<span style="color: red;"><span style="font-family: "Courier New",Courier,monospace;"> CRT scope 0, filter: USER32!DispatchMessageWorker+113 (7e440712)</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="color: red;"> func: USER32!DispatchMessageWorker+126 (7e44072a)</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ffb0: VanishExcept!ILT+375(__except_handler3)+0 (0041117c)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012ffe0: kernel32!_except_handler3+0 (7c839af0)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> CRT scope 0, filter: kernel32!BaseProcessStart+29 (7c84377a)</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> func: kernel32!BaseProcessStart+3a (7c843790)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Invalid exception stack at ffffffff</span></td></tr>
</tbody></table><br />
<a name='more'></a><br />
So I decided to check whether this was the handler that handled our exception. Surprisingly, it was the handler I was looking for. It was absolutely unclear to me why the OS decided to handle an exception that has no meaning for it instead of reporting a serious problem and terminating the program. This behavior is, without doubt, strange and dangerous! The only right thing to do in this situation is to leave the exception unhandled and let the system run Dr. Watson or whatever tool has been configured instead. In the case of Dr. Watson, the user will have an opportunity to send a report to WER. This is exactly why that service was created.<br />
<br />
The next thing I did was set a breakpoint at the <span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+113</span> instruction in order to figure out why <span style="font-family: "Courier New",Courier,monospace;">DispatchMessageWorker </span>decides to handle the exception. I discovered that <span style="font-family: "Courier New",Courier,monospace;">DispatchMessageWorker</span>’s exception filter decides whether to handle an exception depending on the 17th bit of the <span style="font-family: "Courier New",Courier,monospace;">_teb.Win32ClientInfo.dwCompatFlags2</span> flag. If the bit is not set, the filter returns <span style="font-family: "Courier New",Courier,monospace;">EXCEPTION_EXECUTE_HANDLER</span>. In our case not a single bit in <span style="font-family: "Courier New",Courier,monospace;">dwCompatFlags2</span> was set, thus the filter reported that <span style="font-family: "Courier New",Courier,monospace;">DispatchMessageWorker</span>’s handler can handle the exception no matter what the exception is.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">USER32!GetAppCompatFlags2:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418ed6 8bff mov edi,edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418ed8 55 push ebp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418ed9 8bec mov ebp,esp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418edb 64a118000000 mov eax,dword ptr fs:[00000018h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418ee1 83784000 cmp dword ptr [eax+40h],0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418ee5 0f842b550100 je USER32!GetAppCompatFlags2+0x11 (7e42e416)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418eeb 64a118000000 mov eax,dword ptr fs:[00000018h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418ef1 0fb74d08 movzx ecx,word ptr [ebp+8]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418ef5 3b88d4060000 cmp ecx,dword ptr [eax+6D4h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418efb 0f8238ae0000 jb USER32!GetAppCompatFlags2+0x2e (7e423d39)</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Get the pointer to TEB (ntdll!_TEB)</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e418f01 64a118000000 mov eax,dword ptr fs:[00000018h]</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Get the value of _CLIENTINFO.dwCompatFlags2 flag</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e418f07 8b80dc060000 mov eax,dword ptr [eax+6DCh]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418f0d 5d pop ebp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418f0e c20400 ret 4</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+0x113:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e440712 6800040000 push 400h</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e440717 e8ba87fdff call USER32!GetAppCompatFlags2 (7e418ed6)</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Check for the 17th bit</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e44071c c1e810 shr eax,10h</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e44071f f7d0 not eax</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e440721 83e001 and eax,1</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Return EXCEPTION_EXECUTE_HANDLER == 1</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e440724 c3 ret</span></td></tr>
</tbody></table><br />
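The three highlighted instructions of the filter (<span style="font-family: "Courier New",Courier,monospace;">shr eax,10h</span>, <span style="font-family: "Courier New",Courier,monospace;">not eax</span>, <span style="font-family: "Courier New",Courier,monospace;">and eax,1</span>) translate directly into C++. The function name below is made up for illustration, and the constants are defined locally with their documented values so the sketch builds without <span style="font-family: "Courier New",Courier,monospace;">windows.h</span>.<br />

```cpp
#include <cstdint>

// Documented filter return values, defined locally for a portable sketch.
constexpr int kExceptionExecuteHandler = 1;
constexpr int kExceptionContinueSearch = 0;

// Equivalent of "shr eax,10h / not eax / and eax,1" applied to the value
// returned by GetAppCompatFlags2: the result is 1 when bit 16 (the 17th
// bit, counting from one) is clear and 0 when it is set.
int dispatch_message_filter(std::uint32_t compat_flags2) {
    return static_cast<int>(~(compat_flags2 >> 16) & 1u);
}
```

For a process with no compatibility flags at all, <span style="font-family: "Courier New",Courier,monospace;">dispatch_message_filter(0)</span> yields <span style="font-family: "Courier New",Courier,monospace;">kExceptionExecuteHandler</span>, so every exception that reaches this frame is swallowed; only a value with bit 16 set, such as <span style="font-family: "Courier New",Courier,monospace;">0x00010000</span>, yields <span style="font-family: "Courier New",Courier,monospace;">kExceptionContinueSearch</span>.<br />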
Sounds great, doesn’t it?!<br />
<br />
Now let’s look at what the handler does:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+0x126:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e44072a 8b65e8 mov esp,dword ptr [ebp-18h]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e44072d 33c0 xor eax,eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e44072f e96090fdff jmp USER32!DispatchMessageWorker+0x12b (7e419794)</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+0x12b:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419794 834dfcff or dword ptr [ebp-4],0FFFFFFFFh</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419798 e948f2ffff jmp USER32!DispatchMessageWorker+0x378 (7e4189e5)</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+0x378:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e4189e5 e816fcffff call USER32!_SEH_epilog (7e418600)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e4189ea c20800 ret 8</span></td></tr>
</tbody></table><br />
Nothing, in fact. It just returns from <span style="font-family: "Courier New",Courier,monospace;">DispatchMessageWorker</span>.<br />
<br />
It’s absolutely obvious that if <span style="font-family: "Courier New",Courier,monospace;">DispatchMessageWorker </span>registered this exception handler every time it dispatched a message, this problem would be well known and most likely already fixed (because I consider this behavior improper). So there must be some special circumstances under which the handler is registered.<br />
<br />
The exact point where <span style="font-family: "Courier New",Courier,monospace;">DispatchMessageWorker </span>registers the handler is:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+0xbf:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419742 393d8c00477e cmp dword ptr [USER32!gfServerProcess (7e47008c)],edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419748 0f8527f2ffff jne USER32!DispatchMessageWorker+0x134 (7e418975)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e41974e 50 push eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e41974f ff7608 push dword ptr [esi+8]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419752 ff36 push dword ptr [esi]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419754 e849000000 call USER32!NtUserValidateTimerCallback (7e4197a2)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419759 85c0 test eax,eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e41975b 0f84e3420200 je USER32!DispatchMessageWorker+0x17c (7e43da44)</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Set the block counter to 0</span><br />
<span style="font-family: "Courier New",Courier,monospace;"><span style="color: red;">7e419761 897dfc mov dword ptr [ebp-4],edi</span> <span style="color: #38761d;">// edi == 0</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419764 3bdf cmp ebx,edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419766 0f85ed590100 jne USER32!DispatchMessageWorker+0xe1 (7e42f159)</span></td></tr>
</tbody></table><br />
And the conditions necessary for this instruction to execute are the following:<br />
<br />
1. <span style="font-family: "Courier New",Courier,monospace;">WM_TIMER</span> message<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e4188f1 6a1c push 1Ch</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e4188f3 68f089417e push offset USER32!MessageTable+0x440 (7e4189f0)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e4188f8 e8c3fcffff call USER32!_SEH_prolog (7e4185c0)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e4188fd 33ff xor edi,edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e4188ff 897de4 mov dword ptr [ebp-1Ch],edi</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Set MSG* param to esi</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e418902 8b7508 mov esi,dword ptr [ebp+8]</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+0x6e:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418954 0f85947d0200 jne USER32!DispatchMessageWorker+0x70 (7e4406ee)</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// Get MSG.message value</span><br />
<span style="color: black; font-family: "Courier New",Courier,monospace;">7e41895a 8b5604 mov edx,dword ptr [esi+4]</span><br />
<br />
<span style="color: #38761d; font-family: "Courier New",Courier,monospace;">// and check if it’s equal to WM_TIMER</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e41895d 81fa13010000 cmp edx,113h</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e418963 0f84f0000000 je USER32!DispatchMessageWorker+0xa1 (7e418a59)</span></td></tr>
</tbody></table><br />
2. Valid timer callback function<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">USER32!DispatchMessageWorker+0xbf:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419742 393d8c00477e cmp dword ptr [USER32!gfServerProcess (7e47008c)],edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419748 0f8527f2ffff jne USER32!DispatchMessageWorker+0x134 (7e418975)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e41974e 50 push eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e41974f ff7608 push dword ptr [esi+8]</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419752 ff36 push dword ptr [esi]</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e419754 e849000000 call USER32!NtUserValidateTimerCallback (7e4197a2)</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">7e419759 85c0 test eax,eax</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e41975b 0f84e3420200 je USER32!DispatchMessageWorker+0x17c (7e43da44)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419761 897dfc mov dword ptr [ebp-4],edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419764 3bdf cmp ebx,edi</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7e419766 0f85ed590100 jne USER32!DispatchMessageWorker+0xe1 (7e42f159)</span></td></tr>
</tbody></table><br />
I think these conditions are hardly unusual for code that uses the <span style="font-family: "Courier New",Courier,monospace;">SetTimer </span>function.<br />
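<br />
To make the offsets in the disassembly easier to follow, here is a small sketch of the 32-bit <span style="font-family: "Courier New",Courier,monospace;">MSG</span> layout together with the two conditions listed above. The structure names and the <span style="font-family: "Courier New",Courier,monospace;">takes_guarded_path</span> helper are illustrative only. The result of the kernel-side callback validation is passed in as a plain boolean because it cannot be reproduced outside Windows.<br />
<br />
```python
import ctypes

WM_TIMER = 0x0113  # the value compared against edx in the listing


class POINT32(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]


class MSG32(ctypes.Structure):
    # 32-bit layout: handles and pointer-sized fields are 4 bytes wide.
    _fields_ = [
        ("hwnd", ctypes.c_uint32),     # [esi]
        ("message", ctypes.c_uint32),  # [esi+4]
        ("wParam", ctypes.c_uint32),   # [esi+8], the timer ID for WM_TIMER
        ("lParam", ctypes.c_int32),    # callback address for WM_TIMER
        ("time", ctypes.c_uint32),
        ("pt", POINT32),
    ]


def takes_guarded_path(msg: MSG32, callback_validated: bool) -> bool:
    """Model of the two conditions: a WM_TIMER message and a callback
    that passed validation (hypothetical stand-in for the kernel check)."""
    return msg.message == WM_TIMER and callback_validated


timer_msg = MSG32(hwnd=0x1234, message=WM_TIMER, wParam=1, lParam=0x401000)
print(MSG32.message.offset, MSG32.wParam.offset, takes_guarded_path(timer_msg, True))
```
<br />
The field offsets line up with the <span style="font-family: "Courier New",Courier,monospace;">[esi]</span>, <span style="font-family: "Courier New",Courier,monospace;">[esi+4]</span> and <span style="font-family: "Courier New",Courier,monospace;">[esi+8]</span> operands in the listings.<br />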
<br />
I was so taken aback that I assumed there must be something I didn’t know, so I posted the issue to the MSDN forum (<a href="http://social.msdn.microsoft.com/Forums/en-US/windowsgeneraldevelopmentissues/thread/c1e34bd0-8ea4-4869-bb20-4e4d0ed586fe">http://social.msdn.microsoft.com/Forums/en-US/windowsgeneraldevelopmentissues/thread/c1e34bd0-8ea4-4869-bb20-4e4d0ed586fe</a>). I described the issue and asked the community the following questions:<br />
<br />
<br />
1. Why does DispatchMessageWorker behave this way (it looks like a patch)?<br />
<br />
2. What compatibility flag should I set to bypass this behavior?<br />
<br />
And silence was the answer :)<br />
<br />
Finally, I sent the issue to Microsoft technical support. After several months of correspondence they acknowledged that the behavior of the <span style="font-family: "Courier New",Courier,monospace;">DispatchMessageWorker </span>function is abnormal and suggested that I catch all exceptions in the timer function and call the <span style="font-family: "Courier New",Courier,monospace;">MiniDumpWriteDump </span>function inside my exception handler. When I explained to the escalation engineer why this idea is not very good and sometimes even inapplicable, he promised to try to find another solution.<br />
<br />
I’m looking forward to some results. To be continued…<br />
<br />
<a href="http://www.codeproject.com/script/Articles/BlogArticleList.aspx?afid=1463" rel="tag" style="display: none;">CodeProject</a>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-6413230975439051212.post-89680457250532982872010-06-26T22:18:00.005+04:002010-12-23T21:23:39.501+03:00The bug is on the tableA few months ago an interesting story unfolded at the company where I work. A guy from a neighboring department asked me for help. Over the previous few days he had run into one crash after another. What was unusual about these crashes was that they occurred in different processes and under different circumstances, but always on the same computer of the test bench (which consisted of 5 machines) and with identical symptoms. In every case it was an access violation caused by an illegal instruction pointer.<br />
<br />
Here is an example of one of the call stacks:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td style="font-family: "Courier New",Courier,monospace;">0:000> knL<br />
# ChildEBP RetAddr <br />
00 0012d7f8 7c90e9ab ntdll!KiFastSystemCallRet<br />
01 0012d7fc 7c8094f2 ntdll!ZwWaitForMultipleObjects+0xc<br />
02 0012d898 7c809c86 kernel32!WaitForMultipleObjectsEx+0x12c<br />
03 0012d8b4 6945763c kernel32!WaitForMultipleObjects+0x18<br />
04 0012e248 694582b1 faultrep!StartDWException+0x5df<br />
05 0012f2bc 7c863059 faultrep!ReportFault+0x533<br />
06 0012f530 7c36e289 kernel32!UnhandledExceptionFilter+0x4cf<br />
07 0012f54c 0041b84f msvcr71!_XcptFilter+0x15f<br />
08 0012f558 7c363943 radar!WinMainCRTStartup+0x1d7<br />
09 0012f580 7c9037bf msvcr71!_except_handler3+0x61<br />
0a 0012f5a4 7c90378b ntdll!ExecuteHandler2+0x26<br />
0b 0012f654 7c90eafa ntdll!ExecuteHandler+0x24<br />
0c 0012f654 fc910d93 ntdll!KiUserExceptionDispatcher+0xe<br />
<span style="color: red;">WARNING: Frame IP not in any known module. Following frames may be wrong.</span><br />
<span style="color: red;">0d 0012f950 1f4d5318 0xfc910d93</span><br />
<span style="color: red;">0e 0012fa10 7c363593 0x1f4d5318</span><br />
0f 0012fa58 087e6827 msvcr71!free+0xc3<br />
10 0012fa8c 087e6f41 NaviRadarRendererPlain!std::vector&lt;std::pair&lt;int,int&gt;,std::allocator&lt;std::pair&lt;int,int&gt; &gt; &gt;::_Insert_n+0x147<br />
11 0012fafc 087e46f1 NaviRadarRendererPlain!TScanlines::calculate+0x5d1<br />
12 0012fc2c 087e418e NaviRadarRendererPlain!TRadarToScreen_&lt;unsigned short&gt;::RenderBlock+0x521<br />
13 0012fc64 087e2378 NaviRadarRendererPlain!TRadarToScreen_&lt;unsigned short&gt;::Render+0x16e<br />
14 0012fc88 087c4cf0 NaviRadarRendererPlain!n_d3d_render::renderer_t::render+0x68<br />
15 0012fce0 03d5c5e7 NaviRadarLayer!TRadarLayerImpl::Draw+0x360<br />
16 0012fd1c 03d55c8e TotUser!TRadarManager_::Update+0x77<br />
17 0012fd4c 087c821a TotUser!TCrtPanel_::InvalidateRadar+0xbe<br />
18 0012fd54 10ac8959 NaviRadarLayer!TRadarLayerImpl::OnVideoBlockReceived+0x7a<br />
19 0012fd7c 10ac8ab6 TkRadar20!IAdviseHostImpl&lt;iradarconnectorsink&gt;::TypedForEach&lt;void,videodata &amp;&gt;+0x49<br />
1a 0012fd90 10ac8e88 TkRadar20!RadarConnector::VideoBlockReceived+0x26<br />
1b 0012fda0 77d48709 TkRadar20!detail::WndProc+0x48<br />
1c 0012fdcc 77d487eb user32!InternalCallWinProc+0x28<br />
1d 0012fe34 77d489a5 user32!UserCallWinProcCheckWow+0x150<br />
1e 0012fe94 77d4bccc user32!DispatchMessageWorker+0x306<br />
1f 0012fea4 7c1b1645 user32!DispatchMessageA+0xf<br />
20 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x3e<br />
21 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54<br />
22 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30<br />
23 0012ff18 0041b7fd mfc71!AfxWinMain+0x68<br />
24 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185<br />
25 0012fff0 00000000 kernel32!BaseProcessStart+0x23</td> </tr>
</tbody></table><br />
<a name='more'></a><br />
<br />
When I first looked at the call stack, I came up with two hypotheses about the cause of the IP corruption (the ones I considered most likely):<br />
1. A bad target address was passed to a call instruction.<br />
2. A wrong return address was popped by a ret instruction.<br />
<br />
To confirm or refute each of them, I started by investigating the stack.<br />
<br />
The contents of the stack around the current stack pointer were as follows:<br />
<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> .frame /c 0d</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0d 0012f950 1f4d5318 0xfc910d93</span><br />
<span style="font-family: "Courier New",Courier,monospace;">eax=000000f2 ebx=00000000 ecx=0000acce edx=1f4b7330 esi=1f4b6ec0 edi=01f20000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">eip=fc910d93 esp=0012f954 ebp=0012fa10 iopl=0 nv up ei ng nz na po cy</span><br />
<span style="font-family: "Courier New",Courier,monospace;">cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010283</span><br />
<span style="font-family: "Courier New",Courier,monospace;">fc910d93 ?? ???</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">0:000> dds @esp-50 @esp-4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f904 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f908 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f90c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f910 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f914 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f918 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f91c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f920 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f924 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f928 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f92c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f930 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f934 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f938 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f93c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f940 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f944 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f948 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f94c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f950 00000000</span><br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">0:000> dds @esp @esp+150</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f954 1f4d5318</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f958 1f4b6ec8</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f95c 0000008e</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f960 11200000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f964 7c397a63 msvcr71!_87except+0xc4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f968 1f4b6ec0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f96c 00000008</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f970 00000258</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f974 00000258</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f978 7c397a97 msvcr71!_87except+0xf8</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f97c 0000037f</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f980 01f20178</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f984 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f988 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f98c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f990 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f994 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f998 1f4d5318</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f99c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9a0 000000d4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9a4 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9a8 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9ac 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9b0 00000470</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9b4 1f4d5ac8</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9b8 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9bc 1f4d5310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9c0 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9c4 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9c8 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9cc 1f4d5310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9d0 000006a0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9d4 1f4d5318</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9d8 01f20000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9dc 01f20178</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9e0 000006a0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9e4 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9e8 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9ec 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9f0 0101ffff</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9f4 000000f2</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9f8 0012f954</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012f9fc 0012f578</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa00 0012fa48</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa04 7c90ee18 ntdll!_except_handler3</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa08 7c910570 ntdll!CheckHeapFillPattern+0x64</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa0c ffffffff</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa10 0012fa58</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa14 7c363593 msvcr71!free+0xc3</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa18 01f20000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa1c 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa20 1f4b6ec8</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa24 1f4d5318</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa28 0012fbf4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa2c 0000008e</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa30 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa34 0012fa80</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa38 7c3638e2 msvcr71!_except_handler3</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa3c 7c39a3c0 msvcr71!`string'+0xc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa40 0012fa24</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa44 0012f578</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa48 0012fa80</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa4c 7c3638e2 msvcr71!_except_handler3</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa50 7c39f150 msvcr71!`string'+0x24</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa54 ffffffff</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa58 0012fa8c</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa5c 087e6827 NaviRadarRendererPlain!std::vector&lt;std::pair&lt;int,int&gt;,std::allocator&lt;std::pair&lt;int,int&gt; &gt; &gt;::_Insert_n+0x147</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa60 1f4b6ec8</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa64 0012fbf4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa68 00000384</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa6c 0000033b</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa70 0000033b</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa74 00000384</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa78 1f4d5318</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa7c 0012fa64</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa80 0012fc24</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa84 087e91b0 NaviRadarRendererPlain!_dllonexit+0x1d4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa88 ffffffff</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa8c 0012fafc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa90 087e6f41 NaviRadarRendererPlain!TScanlines::calculate+0x5d1</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa94 1f4b7330</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa98 00000698</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fa9c 1f4d5780</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012faa0 02023f10</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012faa4 02022ff8</span></td> </tr>
</tbody></table><br />
And there were a few remarkable facts about it:<br />
<br />
1. The stack above the stack pointer (that is, the part of the stack that had just been released) was zeroed out.<br />
This could well be a side effect of the exception handling code running. Either way it is a pity, because the information in this part of the stack could have been very useful.<br />
<br />
2. The last value on the stack is <span style="font-family: "Courier New",Courier,monospace;">0x1f4d5318</span><br />
It seems unremarkable, but if we look at the call stack at the point of failure, we’ll see that the debugger treats this value as the return address of the last function call. The debugger apparently assumes that the corrupted instruction pointer was passed to a call instruction, in which case the value on top of the stack would be the return address. But since this is not the only possibility, I decided to check this assumption first.<br />
<br />
Let’s look at the address <span style="font-family: "Courier New",Courier,monospace;">0x1f4d5318 </span>more closely.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> !address 1f4d5318</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Usage: &lt;unclassified&gt;</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Allocation Base: 1f200000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Base Address: 1f200000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">End Address: 1fa00000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Region Size: 00800000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Type: 00020000 MEM_PRIVATE</span><br />
<span style="font-family: "Courier New",Courier,monospace;">State: 00001000 MEM_COMMIT</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Protect: 00000004 PAGE_READWRITE</span></td></tr>
</tbody></table><br />
It’s not inside any module!<br />
<br />
Hence the debugger’s assumption (as well as my first one) is probably wrong: a return address should point into executable module code, not into private read-write memory. And this means that frame number 0D does not exist.<br />
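<br />
The same reasoning can be written down as a quick check on the numbers from the <span style="font-family: "Courier New",Courier,monospace;">!address</span> output. This is only a sketch: the constants are the standard values from winnt.h, while the <span style="font-family: "Courier New",Courier,monospace;">looks_like_module_code</span> helper is a name I made up for illustration.<br />
<br />
```python
# Memory type and protection constants as defined in winnt.h.
MEM_PRIVATE = 0x00020000
MEM_IMAGE = 0x01000000
PAGE_READWRITE = 0x04
PAGE_EXECUTE = 0x10
PAGE_EXECUTE_READ = 0x20
PAGE_EXECUTE_READWRITE = 0x40
PAGE_EXECUTE_WRITECOPY = 0x80
EXECUTABLE_MASK = (PAGE_EXECUTE | PAGE_EXECUTE_READ |
                   PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)


def looks_like_module_code(region_type: int, protect: int) -> bool:
    """True only for executable pages that belong to a mapped image."""
    return region_type == MEM_IMAGE and bool(protect & EXECUTABLE_MASK)


# Values copied from the !address output for 0x1f4d5318.
base, size, end = 0x1F200000, 0x00800000, 0x1FA00000
suspect = 0x1F4D5318
in_region = base <= suspect < end

print(base + size == end, in_region,
      looks_like_module_code(MEM_PRIVATE, PAGE_READWRITE))
```
<br />
The address sits inside a private, non-executable region, so it is much more likely to be a data pointer (a heap block, for example) than a return address.<br />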
<br />
3. But the frame number <span style="font-family: inherit;">0E </span>(the one with return address 7c363593) looks much more plausible<br />
So the obvious next step was to figure out which function created this frame. Knowing the frame’s return address, we can find the call instruction that led to its creation.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> ub 7c363593 L1</span><br />
<span style="font-family: "Courier New",Courier,monospace;">msvcr71!free+0xbd [f:\vs70builds\6030\vc\crtbld\crt\src\free.c @ 101]:</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36358d ff1580a0397c call dword ptr [msvcr71!_imp__HeapFree (7c39a080)]</span></td></tr>
</tbody></table><br />
Let’s see where the import table entry <span style="font-family: "Courier New",Courier,monospace;">_imp__HeapFree</span> points.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> ln poi(msvcr71!_imp__HeapFree)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">(7c91043d) ntdll!RtlFreeHeap | (7c91058d) ntdll!RtlpQueryDepthSList</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Exact matches:</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> <span style="color: red;">ntdll!RtlFreeHeap</span></span></td></tr>
</tbody></table><br />
Well, all I knew at that moment was that our failure was somehow related to the <span style="font-family: "Courier New",Courier,monospace;">ntdll!RtlFreeHeap</span> function and that it was not the result of a bad call target.<br />
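<br />
For readers who want to double-check the <span style="font-family: "Courier New",Courier,monospace;">ub</span> output by hand, here is a minimal sketch that decodes the six bytes of the indirect call and confirms that the instruction ends exactly at the return address found on the stack. The function name is hypothetical; the bytes and addresses come from the listing above.<br />
<br />
```python
def decode_indirect_call(address: int, encoding: str):
    """Decode `call dword ptr [abs32]` (opcode FF, ModRM 15).

    Returns (address of the pointer slot, address of the next instruction).
    """
    raw = bytes.fromhex(encoding)
    if len(raw) != 6 or raw[:2] != b"\xff\x15":
        raise ValueError("expected a 6-byte FF 15 call")
    slot = int.from_bytes(raw[2:], "little")  # absolute 32-bit operand
    return slot, address + len(raw)


slot, return_address = decode_indirect_call(0x7C36358D, "ff1580a0397c")
print(f"slot={slot:08x} return={return_address:08x}")
```
<br />
The slot is the <span style="font-family: "Courier New",Courier,monospace;">_imp__HeapFree</span> entry at 7c39a080, and the next instruction is at 7c363593, the very value stored at 0012fa14 on the stack.<br />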
<br />
Thus the only remaining hypothesis was the one about a wrong return address. There are several reasons why a wrong return address can be restored. The most common is that the caller and the callee disagree about the calling convention or the function signature (e.g. as a result of “dll hell”). But can we seriously talk about a wrong calling convention or function signature when it comes to an ntdll.dll function? I really don’t think so!<br />
<br />
Well, the situation was getting more and more intriguing :) Unfortunately, the evidence I had wasn’t enough to make it any clearer. I definitely needed more information, and it would be great to get the just-released part of the stack as well. For that purpose I decided to attach the debugger to a number of relevant processes. Once the debugger receives an access violation event, we have a good chance of getting the stack untouched.<br />
<br />
While I was waiting for a dump with the released stack untouched, I examined several other dumps from the same computer.<br />
<br />
Here are some of the call stacks:<br />
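<br />
To illustrate why a calling convention mismatch ends up in a bad return address, here is a toy model of the stack pointer around a single call with two 4-byte arguments. It is a deliberately simplified sketch (no real stack contents, hypothetical function name): it only tracks how far esp drifts when the caller and the callee disagree about who removes the arguments.<br />
<br />
```python
ARG_BYTES = 8  # two 4-byte arguments, as with `ret 8`


def esp_drift(callee_cleans: bool, caller_cleans: bool) -> int:
    """Net change of esp after one call; 0 means the stack is balanced."""
    esp = 0
    esp -= ARG_BYTES      # caller pushes the arguments
    esp -= 4              # call pushes the return address
    esp += 4              # ret pops the return address
    if callee_cleans:
        esp += ARG_BYTES  # stdcall-style callee: `ret 8`
    if caller_cleans:
        esp += ARG_BYTES  # cdecl-style caller: `add esp, 8`
    return esp


for callee, caller in [(True, False), (False, True), (True, True), (False, False)]:
    print(callee, caller, esp_drift(callee, caller))
```
<br />
Only the two combinations where exactly one side cleans up leave the stack balanced. In the other two, esp is off by 8 bytes, so a later pop or ret in the caller may read the wrong slot and jump to garbage.<br />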
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:033> knL</span><br />
<span style="font-family: "Courier New",Courier,monospace;"># ChildEBP RetAddr </span><br />
<span style="font-family: "Courier New",Courier,monospace;">WARNING: Frame IP not in any known module. Following frames may be wrong.</span><br />
<span style="font-family: "Courier New",Courier,monospace;">00 0d58fed8 0a37285e 0xfc36225d</span><br />
<span style="font-family: "Courier New",Courier,monospace;">01 0d58fee8 04249d3f 0xa37285e</span><br />
<span style="font-family: "Courier New",Courier,monospace;">02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf</span><br />
<span style="font-family: "Courier New",Courier,monospace;">03 0d58ff74 03a46463 AtomDB!TSubscriberNotification_::Thread+0x23f</span><br />
<span style="font-family: "Courier New",Courier,monospace;">04 0d58ff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23</span><br />
<span style="font-family: "Courier New",Courier,monospace;">05 0d58ffb4 7c80b50b msvcr71!_threadstartex+0x6f</span><br />
<span style="font-family: "Courier New",Courier,monospace;">06 0d58ffec 00000000 kernel32!BaseThreadStart+0x37</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:070> knL</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> # ChildEBP RetAddr </span><br />
<span style="font-family: "Courier New",Courier,monospace;">WARNING: Frame IP not in any known module. Following frames may be wrong.</span><br />
<span style="font-family: "Courier New",Courier,monospace;">00 067cf2e4 01199be0 0xfc3c1e8d</span><br />
<span style="font-family: "Courier New",Courier,monospace;">01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150</span><br />
<span style="font-family: "Courier New",Courier,monospace;">02 067cf434 00bfec52 FBP_VideoTranseiver!n_rlt_video_source::TFBPVideoTransmitter::OnNewData+0xbe</span><br />
<span style="font-family: "Courier New",Courier,monospace;">03 067cf478 00bfe34b MFRLT_Scanner!TVideoHdrAccum::NotifySubscribers+0xf2</span><br />
<span style="font-family: "Courier New",Courier,monospace;">04 067cf4a0 00bfc5a7 MFRLT_Scanner!TVideoHdrAccum::OnNewData+0xdb</span><br />
<span style="font-family: "Courier New",Courier,monospace;">05 067cfd10 00d99753 MFRLT_Scanner!n_rlt_scanner::TMFRLT_Scanner::SetData+0xc7</span><br />
<span style="font-family: "Courier New",Courier,monospace;">06 067cfd78 00d99004 RIBSupportProxy!n_rlt_video_source::TRibSupportProxy_::SetData+0x93</span><br />
<span style="font-family: "Courier New",Courier,monospace;">07 067cfdb0 10040aed RIBSupportProxy!n_rlt_video_source::TVideoSourceSeqServer_::OnCall+0x84</span><br />
<span style="font-family: "Courier New",Courier,monospace;">08 067cfdb8 10040cf5 ETL!TSeqServer_::OnCall+0x1d</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09 067cff74 10046463 ETL!TSeqServer_::Thread+0xd5</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0a 067cff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0b 067cffb4 7c80b50b msvcr71!_threadstartex+0x6f</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0c 067cffec 00000000 kernel32!BaseThreadStart+0x37</span></td></tr>
</tbody></table><br />
Although all the call stacks looked different, I found something common in the way they failed. Look at the instruction pointers that led to the access violation:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0d 0012f950 1f4d5318 <span style="color: red;">0xfc910d93</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">0e 0012fa10 7c363593 0x1f4d5318</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0f 0012fa58 087e6827 msvcr71!free+0xc3</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">00 0d58fed8 0a37285e <span style="color: red;">0xfc36225d</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">01 0d58fee8 04249d3f 0xa37285e</span><br />
<span style="font-family: "Courier New",Courier,monospace;">02 0d58ff20 0425199f AtomDB!TPatternAtomFilter_::Filter+0xdf</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">00 067cf2e4 01199be0 <span style="color: red;">0xfc3c1e8d</span></span><br />
<span style="font-family: "Courier New",Courier,monospace;">01 067cf3b8 0119d7be FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150</span></td></tr>
</tbody></table><br />
They all have the same value of the high byte, 0xfc. Needless to say, in 32-bit Windows there are by default no valid user-mode addresses with the high bit set. But let’s look at those addresses with the high bit cleared:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> ln 0x7c910d93</span><br />
<span style="font-family: "Courier New",Courier,monospace;">(7c91043d) ntdll!RtlFreeHeap+0x3a7 | (7c91058d) ntdll!RtlpQueryDepthSList</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:001> ln 0x7c36225d</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> (7c362247) msvcr71!strncmp+0x16 | (7c36227f) msvcr71!__sse2_available_init</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:070> ln 0x7c3c1e8d</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> (7c3c1e8d) msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::~basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt; | (7c3c1e9b) msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::_Copy</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Exact matches:</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::~basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt; (void)</span></td></tr>
</tbody></table><br />
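The same check can be done outside the debugger. Below is a minimal Python sketch, assuming nothing beyond the three faulting instruction pointers shown above; the helper names clear_high_bit and differing_bits are illustrative and not part of any debugger API.<br />

```python
# The three faulting instruction pointers reported in the dumps above.
FAULTING_IPS = [0xFC910D93, 0xFC36225D, 0xFC3C1E8D]

HIGH_BIT = 0x80000000  # bit 31 of a 32-bit address


def clear_high_bit(address: int) -> int:
    """Return the 32-bit address with bit 31 cleared."""
    return address & ~HIGH_BIT & 0xFFFFFFFF


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which two 32-bit values differ."""
    return bin((a ^ b) & 0xFFFFFFFF).count("1")


for ip in FAULTING_IPS:
    fixed = clear_high_bit(ip)
    # Each corrupted pointer differs from a plausible code address by one bit.
    print(f"{ip:08x} -> {fixed:08x} (bits changed: {differing_bits(ip, fixed)})")
```

Each corrupted value is exactly one bit away from an address inside a loaded module, which is what makes a single flipped bit a reasonable working hypothesis.<br />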
Now let’s try to restore the call stack to the point of failure (last known call in black, restored instruction pointer in red):<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="color: red; font-family: "Courier New",Courier,monospace;">7c910d93 3d00fe0000 cmp eax,0FE00h (ntdll!RtlFreeHeap+0x3a7)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">7c36358d ff1580a0397c call dword ptr [msvcr71!_imp__HeapFree (7c39a080)] (msvcr71!free+0xc3)</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="color: red; font-family: "Courier New",Courier,monospace;">7c362247 55 push ebp (msvcr71!strncmp+0x16)</span><br />
<span style="font-family: "Courier New",Courier,monospace;">04249d39 ff152c852904 call dword ptr [AtomDB!_imp__strncmp (0429852c)] (AtomDB!TPatternAtomFilter_::Filter+0xdf)</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="color: red; font-family: "Courier New",Courier,monospace;">7c3c1e8d 6a00 push 0 (msvcp71!std::basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt;::~basic_string&lt;char,std::char_traits&lt;char&gt;,std::allocator&lt;char&gt; &gt; (void))</span><br />
<span style="font-family: "Courier New",Courier,monospace;">01199bda ff15d8401a01 call dword ptr [FBP_VideoTranseiver!_imp_??1?$basic_stringDU?$char_traitsDstdV?$allocatorD (011a40d8)] (FBP_VideoTranseiver!TScopedTimer::~TScopedTimer+0x150)</span></td></tr>
</tbody></table><br />
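One way to sanity-check these pairs: a call pushes the address of the instruction that follows it, so the call-site address plus the instruction length must equal the return address recorded in the frame. Here is a small Python sketch of that arithmetic, with the return addresses copied from the RetAddr column of the frames shown earlier; the helper name return_address is hypothetical.<br />

```python
def return_address(call_site: int, instruction_hex: str) -> int:
    """Address of the instruction after the call, i.e. what the call pushes."""
    length = len(bytes.fromhex(instruction_hex))
    return (call_site + length) & 0xFFFFFFFF


# (call-site address, instruction bytes, return address seen in the frame)
CALLS = [
    (0x7C36358D, "ff1580a0397c", 0x7C363593),  # msvcr71!free calling HeapFree
    (0x04249D39, "ff152c852904", 0x04249D3F),  # Filter calling strncmp
    (0x01199BDA, "ff15d8401a01", 0x01199BE0),  # ~TScopedTimer calling ~basic_string
]

for site, raw, seen in CALLS:
    expected = return_address(site, raw)
    status = "match" if expected == seen else "MISMATCH"
    print(f"{site:08x}: expected {expected:08x}, frame has {seen:08x} ({status})")
```

All three indirect calls are six bytes long (opcode ff 15 followed by a four-byte address), and in each case the computed value agrees with the frame.<br />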
Looks plausible, doesn’t it? But to prove or disprove this conjecture I needed a dump in which the released part of the stack (the area below the stack pointer) was still untouched.<br />
<br />
And here it is:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> knL</span><br />
<span style="font-family: "Courier New",Courier,monospace;"># ChildEBP RetAddr </span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">WARNING: Frame IP not in any known module. Following frames may be wrong.</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">00 0012fe78 004a6310 0xfc1ac1c9</span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;">01 0012fe94 7c1b15e8 0x4a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">02 0012fe9c 7c1b14f7 mfc71!CWinThread::PreTranslateMessage+0x9</span><br />
<span style="font-family: "Courier New",Courier,monospace;">03 0012fea4 7c1b1632 mfc71!AfxPreTranslateMessage+0x15</span><br />
<span style="font-family: "Courier New",Courier,monospace;">04 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x2b</span><br />
<span style="font-family: "Courier New",Courier,monospace;">05 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54</span><br />
<span style="font-family: "Courier New",Courier,monospace;">06 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30</span><br />
<span style="font-family: "Courier New",Courier,monospace;">07 0012ff18 0041b7fd mfc71!AfxWinMain+0x68</span><br />
<span style="font-family: "Courier New",Courier,monospace;">08 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185</span><br />
<span style="font-family: "Courier New",Courier,monospace;">09 0012fff0 00000000 kernel32!BaseProcessStart+0x23</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> dds @esp-50 @esp-4</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe2c 00000004</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe30 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe34 00010b52</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe38 004a7664</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe3c 0012fe5c</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe40 7c20080e mfc71!_AfxForceVectorDelete+0x17fc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe44 ffffffff</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe48 0012fe68</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe4c 7c1d7572 mfc71!AfxGetModuleThreadState+0x16 </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe50 7c1d73e9 mfc71!CThreadLocal&lt;AFX_MODULE_THREAD_STATE&gt;::CreateObject </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe54 7c1abc18 mfc71!afxMapHWND+0x10 </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe58 00010b52</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe5c 0012fefc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe60 7c20088d mfc71!_AfxForceVectorDelete+0x187b</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe64 ffffffff</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe68 00000000</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe6c 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe70 00010b52</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe74 7c1ac1b6 mfc71!CWnd::WalkPreTranslateTree+0x10 </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe78 00010b52</span></td></tr>
</tbody></table><br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> dds @esp @esp+50</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe7c 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe80 1f2874f0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe84 7c1b14b1 mfc71!AfxInternalPreTranslateMessage+0x3b </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe88 00010b58</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe8c 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe90 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe94 004a62e0</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe98 7c1b15e8 mfc71!CWinThread::PreTranslateMessage+0x9 </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fe9c 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fea0 7c1b14f7 mfc71!AfxPreTranslateMessage+0x15 </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fea4 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fea8 7c1b1632 mfc71!AfxInternalPumpMessage+0x2b </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012feac 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012feb0 004a6310</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012feb4 004338f8 radar!theApp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012feb8 7c1b1357 mfc71!CWinThread::Run+0x54 </span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012febc 004338f8 radar!theApp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fec0 004338f8 radar!theApp</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fec4 0012ff08</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fec8 ffffffff</span><br />
<span style="font-family: "Courier New",Courier,monospace;">0012fecc 00000001</span></td></tr>
</tbody></table><br />
Let’s check our assumption first. The last known call in this case is:<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">7c1b14ac e8f5acffff call mfc71!CWnd::WalkPreTranslateTree (7c1ac1a6)</span></td></tr>
</tbody></table><br />
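The target in parentheses can be verified by hand from the instruction bytes: e8 is a near relative call whose 32-bit little-endian signed displacement is added to the address of the next instruction. Below is a minimal Python sketch of that decoding, using only the address and bytes from the line above; decode_near_call is an illustrative name, not a debugger function.<br />

```python
import struct
from typing import Tuple


def decode_near_call(address: int, instruction_hex: str) -> Tuple[int, int]:
    """Decode a 5-byte E8 rel32 call; return (call target, return address)."""
    raw = bytes.fromhex(instruction_hex)
    if len(raw) != 5 or raw[0] != 0xE8:
        raise ValueError("expected a 5-byte E8 rel32 call")
    # The displacement is a signed 32-bit little-endian integer.
    (displacement,) = struct.unpack("<i", raw[1:])
    next_ip = (address + len(raw)) & 0xFFFFFFFF
    target = (next_ip + displacement) & 0xFFFFFFFF
    return target, next_ip


target, ret = decode_near_call(0x7C1B14AC, "e8f5acffff")
print(f"target={target:08x} return={ret:08x}")
```

The decoded target is 7c1ac1a6, the start of WalkPreTranslateTree, and the return address 7c1b14b1 is the AfxInternalPreTranslateMessage+0x3b entry visible in the stack dump at 0012fe84.<br />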
Thus, if I was right, the last executed instruction should be somewhere inside the <span style="font-family: "Courier New",Courier,monospace;">mfc71!CWnd::WalkPreTranslateTree</span> function.<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">0:000> ln 0x7c1ac1c9</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> (7c1ac1a6) mfc71!CWnd::WalkPreTranslateTree+0x23 | (7c1ac1e8) mfc71!CWnd::SendChildNotifyLastMsg</span></td></tr>
</tbody></table><br />
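What ln does here can be mimicked with a sorted symbol table and a binary search. The Python sketch below is an illustration only: it knows just the two symbol addresses printed above (a real debugger reads them from the module’s symbols), and nearest_symbol is a hypothetical helper.<br />

```python
import bisect

# Symbol start addresses taken from the ln output above, sorted by address.
SYMBOLS = [
    (0x7C1AC1A6, "mfc71!CWnd::WalkPreTranslateTree"),
    (0x7C1AC1E8, "mfc71!CWnd::SendChildNotifyLastMsg"),
]


def nearest_symbol(address):
    """Return (name, offset) of the closest symbol at or below address.

    Returns None if the address lies below the first known symbol. The table
    has no function sizes, so anything above the last symbol matches it.
    """
    starts = [start for start, _ in SYMBOLS]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = SYMBOLS[index]
    return name, address - start


ip = 0xFC1AC1C9 & 0x7FFFFFFF  # faulting IP with the high bit cleared
name, offset = nearest_symbol(ip)
print(f"{ip:08x} = {name}+{offset:#x}")
# prints "7c1ac1c9 = mfc71!CWnd::WalkPreTranslateTree+0x23"
```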
Oh, yes! That is it!<br />
<br />
But what is the instruction?<br />
<br />
<table bgcolor="#eeeeee"><tbody>
<tr><td><span style="font-family: "Courier New",Courier,monospace;">mfc71!CWnd::WalkPreTranslateTree+0x23 </span><br />
<span style="color: red; font-family: "Courier New",Courier,monospace;"> 3134 7c1ac1c9 3b74240c cmp esi,dword ptr [esp+0Ch]</span><br />
<span style="font-family: "Courier New",Courier,monospace;"> 3134 7c1ac1cd 740d je mfc71!CWnd::WalkPreTranslateTree+0x36 (7c1ac1dc)</span></td></tr>
</tbody></table><br />
More than strange, isn’t it? How can executing a <span style="font-family: "Courier New",Courier,monospace;">cmp </span>instruction corrupt the instruction pointer? How the hell is this possible?!<br />
<br />
The only thing I knew for sure was that this was a hardware problem. That explained a lot. The only thing left to do was to find the culprit.<br />
<br />
I ranked the suspects as follows (in order of increasing probability):<br />
<br />
1. RAM corruption<br />
I considered this unlikely, because a faulty memory cell would cause problems in arbitrary pieces of data and code, not only in instruction pointer values.<br />
<br />
2. Hard drive corruption<br />
I considered this unlikely for the same reasons as above. We would see failures either always in the same place (if an executable module were corrupted) or for varying reasons (if the page file were corrupted). Besides, I have never heard of a hard drive surface being corrupted in exactly one bit.<br />
<br />
3. Motherboard corruption<br />
Well, I can’t say anything about this one. Perhaps there is a mechanism that could lead to such problems, but I know nothing about it.<br />
<br />
4. CPU<br />
The most likely candidate, in my opinion. If the fault is somewhere in the mechanism that advances the instruction pointer, or is somehow related to restoring the thread context, the CPU may well be to blame.<br />
<br />
Yes, I know. If you hear “somewhere” and “somehow” too often there is “something” wrong :)<br />
<br />
Thus, I needed evidence. I decided to replace the hardware components one by one (fortunately we have absolutely identical computers on the test bench). Replacing the RAM changed nothing, and neither did replacing the hard drive. Unfortunately we couldn’t replace the motherboard and the CPU separately (because of the architecture of our industrial PCs), so we replaced them together. <br />
<br />
And from that moment on we haven’t had a single crash!<br />
<br />
It turns out we spent a lot of time and effort hunting a bug that was not in our (or anyone else’s) code but had been “standing on the table” all along!