Friday, July 30, 2010

Why it's not crashing?

Several months ago, my colleagues encountered a very strange behavior of one of our processes after an exception was thrown. They told me that they know exactly that the process throws an exception which is not expected to be handled by their code, that is an exception must be unhandled and the process must be terminated immediately. But nothing like that happens. The process just continues its execution. At first, I thought that there must be some code down the stack which registers an exception handler that handles any exception indiscriminately. Therefore, the first thing I did was search for the try{}catch(…){} pattern down the stack. When I found nothing, I looked at the exception handlers list at the point where an exception was thrown.

It looked as follows:

0012fd14: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, func:   USER32!UserCallWinProc+10a (7e44aa1c)
0012fd6c: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, filter: USER32!DispatchMessageWorker+113 (7e440712)
                func:   USER32!DispatchMessageWorker+126 (7e44072a)
0012ffb0: VanishExcept!ILT+375(__except_handler3)+0 (0041117c)
0012ffe0: kernel32!_except_handler3+0 (7c839af0)
  CRT scope  0, filter: kernel32!BaseProcessStart+29 (7c84377a)
                func:   kernel32!BaseProcessStart+3a (7c843790)
Invalid exception stack at ffffffff

There were no user exception handlers at all, but instead was an exception handler I have not seen before:

0012fd14: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, func:   USER32!UserCallWinProc+10a (7e44aa1c)
0012fd6c: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, filter: USER32!DispatchMessageWorker+113 (7e440712)
                func:   USER32!DispatchMessageWorker+126 (7e44072a)
0012ffb0: VanishExcept!ILT+375(__except_handler3)+0 (0041117c)
0012ffe0: kernel32!_except_handler3+0 (7c839af0)
  CRT scope  0, filter: kernel32!BaseProcessStart+29 (7c84377a)
                func:   kernel32!BaseProcessStart+3a (7c843790)
Invalid exception stack at ffffffff


So I decided to check if this is a handler which handles our exception. Surprisingly, it was the handler I was looking for. For me, it was absolutely not clear why the OS decided to handle the exception having no meaning for it instead of reporting about a serious problem and terminating a program as a result? This behavior, with no doubt, is strange and dangerous! The only way to go in this situation is just to leave an exception unhandled and let the system to execute Dr. Watson or whatever tool which was adjusted instead. In case of Dr. Watson, the user will have an opportunity to send a report to WER. This is why this service was created indeed.

The next thing I did was setting the breakpoint at the USER32!DispatchMessageWorker+113 instruction in order to figure out why DispatchMessageWorker decides to handle the exception. I discovered that the decision to handle an exception or not DispatchMessageWorker’s exception filter takes depending on the 17th bit of the _teb.Win32ClientInfo.dwCompatFlags2 flag. If the bit is not set filter returns EXCEPTION_EXECUTE_HANDLER. In our case no one bit in dwCompatFlags2 was set, thus filter reported that DispatchMessageWorker’s handler can handle exception no matter what the exception is.

USER32!GetAppCompatFlags2:
7e418ed6 8bff            mov     edi,edi
7e418ed8 55              push    ebp
7e418ed9 8bec            mov     ebp,esp
7e418edb 64a118000000    mov     eax,dword ptr fs:[00000018h]
7e418ee1 83784000        cmp     dword ptr [eax+40h],0
7e418ee5 0f842b550100    je      USER32!GetAppCompatFlags2+0x11 (7e42e416)
7e418eeb 64a118000000    mov     eax,dword ptr fs:[00000018h]
7e418ef1 0fb74d08        movzx   ecx,word ptr [ebp+8]
7e418ef5 3b88d4060000    cmp     ecx,dword ptr [eax+6D4h]
7e418efb 0f8238ae0000    jb      USER32!GetAppCompatFlags2+0x2e (7e423d39)

// Get the pointer to TEB (ntdll!_TEB)
7e418f01 64a118000000    mov     eax,dword ptr fs:[00000018h]

// Get the value of _CLIENTINFO.dwCompatFlags2 flag
7e418f07 8b80dc060000    mov     eax,dword ptr [eax+6DCh]
7e418f0d 5d              pop     ebp
7e418f0e c20400          ret     4

USER32!DispatchMessageWorker+0x113:
7e440712 6800040000      push    400h
7e440717 e8ba87fdff      call    USER32!GetAppCompatFlags2 (7e418ed6)

// Check for the 17th bit
7e44071c c1e810          shr     eax,10h
7e44071f f7d0            not     eax
7e440721 83e001          and     eax,1

// Return EXCEPTION_EXECUTE_HANDLER == 1
7e440724 c3              ret

Sounds great, isn’t it?!

Now let’s look what the handler does:

USER32!DispatchMessageWorker+0x126:
7e44072a 8b65e8          mov     esp,dword ptr [ebp-18h]
7e44072d 33c0            xor     eax,eax
7e44072f e96090fdff      jmp     USER32!DispatchMessageWorker+0x12b (7e419794)

USER32!DispatchMessageWorker+0x12b:
7e419794 834dfcff        or      dword ptr [ebp-4],0FFFFFFFFh
7e419798 e948f2ffff      jmp     USER32!DispatchMessageWorker+0x378 (7e4189e5)

USER32!DispatchMessageWorker+0x378:
7e4189e5 e816fcffff      call    USER32!_SEH_epilog (7e418600)
7e4189ea c20800          ret     8

Nothing, if fact. It’s just returns from DispatchMessageWorker.

It’s absolutely obvious that if DispatchMessageWorker registered this exception handler any time it dispatches some message, this problem would be well known and most likely fixed already (because I consider this behavior as improper). So there should be some special circumstances to register the handler.

The exact point of where DispatchMessageWorker registers the handler is:

USER32!DispatchMessageWorker+0xbf:
7e419742 393d8c00477e    cmp     dword ptr [USER32!gfServerProcess (7e47008c)],edi
7e419748 0f8527f2ffff    jne     USER32!DispatchMessageWorker+0x134 (7e418975)
7e41974e 50              push    eax
7e41974f ff7608          push    dword ptr [esi+8]
7e419752 ff36            push    dword ptr [esi]
7e419754 e849000000      call    USER32!NtUserValidateTimerCallback (7e4197a2)
7e419759 85c0            test    eax,eax
7e41975b 0f84e3420200    je      USER32!DispatchMessageWorker+0x17c (7e43da44)

// Set the block counter to 0
7e419761 897dfc          mov     dword ptr [ebp-4],edi  // edi == 0
7e419764 3bdf            cmp     ebx,edi
7e419766 0f85ed590100    jne     USER32!DispatchMessageWorker+0xe1 (7e42f159)

And the conditions necessary to instruction execution are the following:

1.    WM_TIMER message

USER32!DispatchMessageWorker:
7e4188f1 6a1c            push    1Ch
7e4188f3 68f089417e      push    offset USER32!MessageTable+0x440 (7e4189f0)
7e4188f8 e8c3fcffff      call    USER32!_SEH_prolog (7e4185c0)
7e4188fd 33ff            xor     edi,edi
7e4188ff 897de4          mov     dword ptr [ebp-1Ch],edi

// Set MSG* param to esi
7e418902 8b7508          mov     esi,dword ptr [ebp+8]

USER32!DispatchMessageWorker+0x6e:
7e418954 0f85947d0200    jne     USER32!DispatchMessageWorker+0x70 (7e4406ee)

// Get MSG.message value
7e41895a 8b5604          mov     edx,dword ptr [esi+4]

// and check if it’s equal to WM_TIMER
7e41895d 81fa13010000    cmp     edx,113h
7e418963 0f84f0000000    je      USER32!DispatchMessageWorker+0xa1 (7e418a59)

2.    Valid timer callback function

USER32!DispatchMessageWorker+0xbf:
7e419742 393d8c00477e    cmp     dword ptr [USER32!gfServerProcess (7e47008c)],edi
7e419748 0f8527f2ffff    jne     USER32!DispatchMessageWorker+0x134 (7e418975)
7e41974e 50              push    eax
7e41974f ff7608          push    dword ptr [esi+8]
7e419752 ff36            push    dword ptr [esi]
7e419754 e849000000      call    USER32!NtUserValidateTimerCallback (7e4197a2)
7e419759 85c0            test    eax,eax
7e41975b 0f84e3420200    je      USER32!DispatchMessageWorker+0x17c (7e43da44)
7e419761 897dfc          mov     dword ptr [ebp-4],edi
7e419764 3bdf            cmp     ebx,edi
7e419766 0f85ed590100    jne     USER32!DispatchMessageWorker+0xe1 (7e42f159)

I think those conditions are not uncommon for the SetTimer function.

I was so discouraged, that I thought that there should be something which I don’t know and I posted the issue to MSDN forum (http://social.msdn.microsoft.com/Forums/en-US/windowsgeneraldevelopmentissues/thread/c1e34bd0-8ea4-4869-bb20-4e4d0ed586fe). I described the issue and asked the community to the following questions:


1. Why does the DispatchMessageWorker behaves this way (looks like a patch)?

2. What compatibility flag should I set to bypass this behavior?

And the silence was an answer :)

I sent the issue to the Microsoft technical support finally. After several months of correspondence they acknowledged that the behavior of the DispatchMessageWorker function is abnormal and suggested that I catch all exceptions in the timer function and call MiniDumpWriteDump function inside my exception handler. When I explained to the escalation engineer why the idea is not very good and sometimes even inapplicable, he promised me to try to find another solution.

Look forward to some results. And to be continued…

2 comments:

  1. I would be really happy to see news about your problem? we have the same "anti-crash prevention" problem, which blocks us in certain conditions from sending our own crash reports.

    (sorry if it is there twice but first time i submitted nothing happened on the page)

    ReplyDelete
  2. Thanks for the great article! I was chasing the very same problem for months!
    What made things even more difficult is the fact that in at least one routine triggered via SetTimer() I _did_ receive a crash (Why?), whereas in others the crash was eaten as you described - and the application continued working with really bad delayed problems. The call to SetTimer() is identical in all my cases ...
    So, not knowing when and why some exceptions are handled correctly and some not I have now enclosed all WM_TIMER called routines in __try() blocks, which dumps and exits the program.

    ReplyDelete