Sunday, August 1, 2010

In search of Dr. Watson

A week ago I returned from a business trip to Italy. It was a trip to a yacht that had real problems with navigation system stability. Once or twice a day all navigation data from the sensors froze, and the navigation equipment was inoperative until the full bridge (5 stations) was rebooted.

Just before the trip I received some information concerning the problem from our service engineer.
There were several observations:

1.    All navigation data froze simultaneously, but the navigation system (radar, cartographic system and other subsystems) continued to work without any other noticeable problems.

2.    Sometimes a message box with a “Pure virtual function call” notification was observed after the data froze.

3.    The problem was first observed just after the last update of the product.

4.    In the end I even received hard drive (HD) images from all stations of the bridge (they were created after the last update and after the problem appeared for the first time).


The description of the problem clearly pointed to the fact that the cause was a crash (or freeze) of the navigation server processes at all stations simultaneously (each computer runs its own navigation server, and the described problem is only possible if no server is left alive). The only thing that confused me was the previously mentioned “Pure virtual function call” error. This error is not very common and occurs rarely enough in its pure form. Its most common cause is an improper sequence of deinitialization and destruction of a chain of objects, together with an ill-conceived object ownership strategy (of course, I do not count errors such as calling a pure virtual method of a child class inside a parent class destructor, or deleting an object twice; I believe our code is better than that :). One of the classic situations is an assertion message box triggered during object destruction (usually in debug mode): the message box's message loop can provoke a call (through a pure virtual interface) to a partially destroyed object. As a result we get a clean pure virtual function call, like in the call stack below (a minimal sketch that reproduces the pattern follows the stack).



0012df14 004137f4 MSVCR71D!_purecall+0x19
0012dff4 7c22f7b6 PureCall!CMyDlg::OnTimer+0x34
0012e0f8 7c22efde MFC71D!CWnd::OnWndMsg+0x7a6
0012e118 7c22c820 MFC71D!CWnd::WindowProc+0x2e
0012e190 7c22ccfe MFC71D!AfxCallWndProc+0xe0
0012e1b0 7c29d8ea MFC71D!AfxWndProc+0x9e
0012e1e0 7e418734 MFC71D!AfxWndProcBase+0x4a
0012e20c 7e418816 USER32!InternalCallWinProc+0x28
0012e274 7e4189cd USER32!UserCallWinProcCheckWow+0x150
0012e2d4 7e418a10 USER32!DispatchMessageWorker+0x306
0012e2e4 7e42dbbf USER32!DispatchMessageW+0xf
0012e31c 7e42593f USER32!DialogBox2+0x15a
0012e344 7e43a91e USER32!InternalDialogBox+0xd0
0012e604 7e43a284 USER32!SoftModalMessageBox+0x938
0012e754 7e4661d3 USER32!MessageBoxWorker+0x2ba
0012e7ac 7e466278 USER32!MessageBoxTimeoutW+0x7a
0012e7e0 7e450617 USER32!MessageBoxTimeoutA+0x9c
0012e800 7e4505cf USER32!MessageBoxExA+0x1b
0012e81c 1020c2aa USER32!MessageBoxA+0x45
0012e854 102093a1 MSVCR71D!__crtMessageBoxA+0x16a
0012ebb8 00413a84 MSVCR71D!_assert+0x5b1
0012eca4 00413beb PureCall!TObjectBase::~TObjectBase+0x44
0012ed84 004136cb PureCall!TObject::~TObject+0x2b
0012ee64 00418598 PureCall!TObject::`scalar deleting destructor'+0x2b
0012ef64 7c22f7a4 PureCall!CMyDlg::OnClose+0x58
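
For illustration, here is a minimal sketch that reproduces this pattern (hypothetical code, not from the product): a virtual call is dispatched on a base sub-object whose derived part is already destroyed, so the vtable slot still points to the __purecall stub.

struct IWorker {
    virtual void Work() = 0;
    virtual ~IWorker() {}
};

// The helper keeps the call a true virtual dispatch, so the compiler
// cannot resolve it statically inside the destructor.
static void Notify(IWorker* pWorker) { pWorker->Work(); }

struct TBase : public IWorker {
    ~TBase()
    {
        // ~TDerived has already run: the vtable pointer refers to TBase's
        // vtable, where Work is still the __purecall stub, so MSVC aborts
        // with "R6025 - pure virtual function call".
        Notify(this);
    }
};

struct TDerived : public TBase {
    virtual void Work() {}
};

int main()
{
    TDerived d;   // the crash happens while d is being destroyed
    return 0;
}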

The situation aboard was not quite the same, because the navigation servers crashed during their normal execution, not during exit. Or something had to force them to exit.

I hoped that the HD images would clarify the cause of the pure virtual function call. First of all, I checked the images for the presence of Dr. Watson dumps. Nothing was found. Then we installed the images on our test bench. Nothing was found within a few days.

There was no stable internet connection aboard and no possibility to ask someone from the crew to run some tools before or after the next occurrence of the problem. Furthermore, there were only a few days to solve the problem, so we decided that investigating it aboard would be the most efficient way.

Of course, the first thing I did on board was to check for Dr. Watson dumps again. Imagine my surprise when I found a few dumps on two different stations with two different timestamps. It was really very strange for several reasons:

1.    There were no dumps inside the HD images (remember, the images were created after the problem appeared for the first time).

2.    According to the problem description there should have been 5 dumps (if any) with nearly identical timestamps.

3.    Again, according to the problem description, nobody saw the Dr. Watson UI. So it was not clear where the dumps came from.

There was no time for experiments irrelevant to the problem, so I decided to postpone a deep investigation of that issue and proceed to analyzing the crash dumps. The culprit was found very easily: it was a classic case of buffer overflow.

Here is the call stack:

WARNING: Frame IP not in any known module. Following frames may be wrong.
09e9be68 09c00cac 0x54205344
09e9be6c 09c00ff7 0x9c00cac
09e9be70 45564944 0x9c00ff7
09e9be74 41205352 0x45564944
09e9be78 53495353 0x41205352
09e9be7c 20444554 0x53495353
09e9be80 42205942 0x20444554
09e9be84 45475241 0x42205942
09e9be88 4e412053 0x45475241
09e9be8c 55542044 0x4e412053
09e9be90 4f422047 0x55542044
09e9be94 20535441 0x4f422047
<…>

And here is the raw stack memory:

09e9be6c  09c00cac 09c00ff7 45564944 41205352  ........DIVERS A
09e9be7c  53495353 20444554 42205942 45475241  SSISTED BY BARGE
09e9be8c  4e412053 55542044 4f422047 20535441  S AND TUG BOATS
09e9be9c  4c4c4957 20454220 4b2a4f2a 20474e49  WILL BE *O*KING
09e9beac  54204e49 202a4948 41455241 4548542e  IN THI* AREA.THE
09e9bebc  52414220 20534547 20444e41 20475554   BARGES AND TUG
09e9becc  54414f42 414d2053 45422059 46454c20  BOATS MAY BE LEF
09e9bedc  4e492054 45485420 4f424120 4d204556  T IN THE ABOVE M
09e9beec  49544e45 44454e4f 45524120 55442041  ENTIONED AREA DU
09e9befc  474e4952 45485420 472a4e20 4d2e5448  RING THE N*GHT.M
09e9bf0c  4e495241 20535245 20455241 54534e49  ARINERS ARE INST
09e9bf1c  54435552 54204445 20544148 49544e55  RUCTED THAT UNTI
09e9bf2c  5546204c 45485452 4f4e2052 45434954  L FURTHER NOTICE
09e9bf3c  562a4e2c 54414749 204e4f49 54204e49  ,N*VIGATION IN T
09e9bf4c  41204548 45564f42 45522a20 53492041  HE ABOVE *REA IS
09e9bf5c  544f4e20 52455020 5454494d 542e4445   NOT PERMITTED.T
09e9bf6c  20594548 205a5241 2a204f42 47495641  HEY ARZ BO *AVIG
09e9bf7c  20455441 482a4957 55414320 4e4f4954  ATE WI*H CAUTION
09e9bf8c  444e4120 4f4c5320 50532057 20444545   AND SLOW SPEED
09e9bf9c  4e454857 204e4920 20454854 49434956  WHEN IN THE VICI
09e9bfac  5954494e 444e4120 204f5420 45564947  NITY AND TO GIVE
09e9bfbc  45485420 444e4920 54414349 41204445   THE INDICATED A
09e9bfcc  20414552 49572041 42204544 48545245  REA A WIDE BERTH
09e9bfdc  52414d2e 52454e49 52412053 4c412045  .MARINERS ARE AL
09e9bfec  41204f53 532a5644 54204445 4f46204f  SO ADV*SED TO FO
09e9bffc  574f4c4c 594e2a20 534e4920 43555254  LLOW *NY INSTRUC
09e9c00c  4e4f4954 2a472053 2a4e2a56 5320592a  TIONS G*V*N**Y S
09e9c01c  4f505055 43205452 54464152 204e4920  UPPORT CRAFT IN
09e9c02c  20454854 412a5241 444e4120 4c2a5620  THE AR*A AND V*L
09e9c03c  5454454c 4f502041 43205452 52544e4f  LETTA PORT CONTR
09e9c04c  57204c4f 4d204f48 49205941 45555353  OL WHO MAY ISSUE
09e9c05c  594e4120 57454e20 52415720 474e494e   ANY NEW WARNING
09e9c06c  49572053 52204854 52414745 54205344  S WITH REGARDS T
09e9c07c  4854204f 4241202a 2e45564f 00000000  O TH* ABOVE.....

And here is the culprit:

::memcpy(sentence.Sentence, pMessData, nMessSize);

The yacht received this overly long message from Malta (presumably) once or twice a day, and it crashed the system. The size of the sentence.Sentence buffer was 512 bytes, and the length of the message was clearly greater.
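
The obvious defensive fix is to clamp the copy to the destination size. A minimal sketch, assuming Sentence is a plain 512-byte character buffer (everything beyond the fragment above, including the structure layout and names, is my assumption):

#include <cstring>
#include <algorithm>

// Hypothetical layout: only the 512-byte buffer is known from the dump.
struct TSentence {
    char Sentence[512];
};

void CopySentence(TSentence& sentence, const char* pMessData, size_t nMessSize)
{
    // Never let a sender-supplied length drive the copy: clamp it to the
    // destination size and keep the buffer terminated.
    const size_t nToCopy = std::min(nMessSize, sizeof(sentence.Sentence) - 1);
    ::memcpy(sentence.Sentence, pMessData, nToCopy);
    sentence.Sentence[nToCopy] = '\0';
}

Whether such an over-long message should be truncated or rejected is a protocol decision; the point is that nMessSize alone must never drive the copy.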

By the way, we did not receive this message again within the three days I was aboard. So the dumps I had found were the only chance to catch the problem.

When I came home I finally had a chance to find the answers to the following questions:

1.    It is absolutely clear that the cause of the navigation servers' failure was an “Access violation” exception. But why was the result of the failure a “Pure virtual function call” message box instead of the Dr. Watson UI (and logs along with dumps)?

2.    Where did the dumps which I found on the bridge come from?

3.    And why did they have different timestamps?

First of all I decided to analyze the behavior of the UnhandledExceptionFilter function.

NOTE:

It is worth keeping in mind that UnhandledExceptionFilter checks whether the process is being debugged in order to send a debug event to the debugger instead of handling the event the way it is handled for non-debugged processes.

7c862bf9 89bddcfeffff    mov     dword ptr [ebp-124h],edi
7c862bff 57              push    edi
7c862c00 6a04            push    4

// Pointer to a DWORD_PTR value that receives the debug port of the process (information class 7 below is ProcessDebugPort). A nonzero value indicates that the process is being run under the control of a ring 3 debugger.
7c862c02 8d85dcfeffff    lea     eax,[ebp-124h]
7c862c08 50              push    eax

7c862c09 6a07            push    7
7c862c0b e8fdb3faff      call    kernel32!GetCurrentProcess (7c80e00d)
7c862c10 50              push    eax
7c862c11 ff15ac10807c    call    dword ptr [kernel32!_imp__NtQueryInformationProcess (7c8010ac)]
7c862c17 85c0            test    eax,eax
7c862c19 0f8ca2000000    jl      kernel32!UnhandledExceptionFilter+0x137 (7c862cc1)

// Check if process is being run under debugger
7c862c1f 39bddcfeffff    cmp     dword ptr [ebp-124h],edi
7c862c25 0f8496000000    je      kernel32!UnhandledExceptionFilter+0x137 (7c862cc1)
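
In C terms the fragment above is roughly equivalent to the following check (a sketch built on the winternl.h declarations; it illustrates the logic, it is not the actual kernel32 source):

#include <windows.h>
#include <winternl.h>

// Link against ntdll.lib, or resolve NtQueryInformationProcess with
// GetProcAddress on SDKs that do not ship the import library.
bool IsRing3DebuggerAttached()
{
    DWORD_PTR debugPort = 0;
    // Information class 7 is ProcessDebugPort: the call stores a nonzero
    // value when a user-mode debugger is attached to the process.
    NTSTATUS status = NtQueryInformationProcess(
        GetCurrentProcess(), ProcessDebugPort,
        &debugPort, sizeof(debugPort), NULL);
    return status >= 0 && debugPort != 0;   // i.e. NT_SUCCESS(status)
}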

That is, to make UnhandledExceptionFilter behave the same way as it does for a process that is not being debugged, we have either to overwrite the debugger port value with 0 just before UnhandledExceptionFilter checks it, or to use a kernel-mode debugger.

Not to bother with the first option, I chose the second way.

I found that after checking various conditions (exception flags, exception code, number of parameters and so on), UnhandledExceptionFilter checks the error mode value.

7c862ce5 e8eb7dfaff      call    kernel32!GetErrorMode (7c80aad5)

// Check the SEM_NOGPFAULTERRORBOX flag
7c862cea a802            test    al,2

7c862cec 0f8563070000    jne     kernel32!UnhandledExceptionFilter+0x8ad (7c863455)

And if the SEM_NOGPFAULTERRORBOX flag is set, UnhandledExceptionFilter just returns EXCEPTION_EXECUTE_HANDLER to the exception handler.

kernel32!UnhandledExceptionFilter+0x8ad:
7c863455 33c0            xor     eax,eax
7c863457 40              inc     eax
7c863458 8da59cfdffff    lea     esp,[ebp-264h]
7c86345e 8b4de4          mov     ecx,dword ptr [ebp-1Ch]
7c863461 e8ab62faff      call    kernel32!__security_check_cookie (7c809711)
7c863466 e8a0f0f9ff      call    kernel32!_SEH_epilog (7c80250b)
7c86346b c20400          ret     4

In my case the exception handler was MSVCR71!_except_handler3.

090cba48 7c36e289 kernel32!UnhandledExceptionFilter
090cba64 7c36b398 MSVCR71!_XcptFilter+0x15f
090cba70 7c363943 MSVCR71!_threadstartex+0x86
090cba98 7c9037bf MSVCR71!_except_handler3+0x61
090cbabc 7c90378b ntdll!ExecuteHandler2+0x26
090cbb6c 7c90eafa ntdll!ExecuteHandler+0x24
090cbb6c 27534e41 ntdll!KiUserExceptionDispatcher+0xe
WARNING: Frame IP not in any known module. Following frames may be wrong.
090cbe68 4c2d4f4b 0x27534e41

And the exception handler just unwinds the stack and returns control to MSVCR71!_threadstartex, which in its turn calls the MSVCR71!_exit function.

MSVCR71!_except_handler3+0x61:
7c363943 5d              pop     ebp
7c363944 5e              pop     esi
7c363945 8b5d0c          mov     ebx,dword ptr [ebp+0Ch]
7c363948 0bc0            or      eax,eax
7c36394a 743f            je      MSVCR71!_except_handler3+0xa9 (7c36398b)
7c36394c 7848            js      MSVCR71!_except_handler3+0xb4 (7c363996)
7c36394e 8b7b08          mov     edi,dword ptr [ebx+8]
7c363951 53              push    ebx
7c363952 e88c000000      call    MSVCR71!__global_unwind2 (7c3639e3)
7c363957 83c404          add     esp,4
7c36395a 8d6b10          lea     ebp,[ebx+10h]
7c36395d 56              push    esi
7c36395e 53              push    ebx
7c36395f e8c1000000      call    MSVCR71!__local_unwind2 (7c363a25)
7c363964 83c408          add     esp,8
7c363967 8d0c76          lea     ecx,[esi+esi*2]
7c36396a 6a01            push    1
7c36396c 8b448f08        mov     eax,dword ptr [edi+ecx*4+8]
7c363970 e83a010000      call    MSVCR71!_NLG_Notify (7c363aaf)
7c363975 8b048f          mov     eax,dword ptr [edi+ecx*4]
7c363978 89430c          mov     dword ptr [ebx+0Ch],eax
7c36397b 8b448f08        mov     eax,dword ptr [edi+ecx*4+8]
7c36397f 33db            xor     ebx,ebx
7c363981 33c9            xor     ecx,ecx
7c363983 33d2            xor     edx,edx
7c363985 33f6            xor     esi,esi
7c363987 33ff            xor     edi,edi
7c363989 ffd0            call    eax // {MSVCR71!_threadstartex+0x89 (7c36b39b)}

MSVCR71!_threadstartex+0x89
001b:7c36b39b 8b65e8          mov     esp,dword ptr [ebp-18h]
001b:7c36b39e ff75e4          push    dword ptr [ebp-1Ch]
001b:7c36b3a1 e8cef9ffff      call    MSVCR71!_exit (7c36ad74)

Thus, the answer to the first question was almost obtained. The source of the “Pure virtual function call” message box was an improper object uninitialization and destruction sequence: the owner of a pointer to a pure virtual interface tried to call a method of an already destroyed object (just as I expected). The call stack below shows it happening during DLL unload at process exit (a small sketch of the pattern follows the stack).

090cfbc8 0b0135e1 MSVCR71!_purecall+0x12
090cfcf8 0b056232 VdsProtocols!NVdsProt::IVdsProtMngrImpl::CloseSession+0x31
090cfd24 0b06430d VdsProtocols!NVdsProt::CVdsTargetsWProt::Deinit+0xc2
090cfd4c 0b07ce76 VdsProtocols!NVdsProt::CVdsTargetsWProt::~CVdsTargetsWProt+0x3d
090cfd50 0b07cf39 VdsProtocols!_CRT_INIT+0x95
090cfd8c 7c9011a7 VdsProtocols!_DllMainCRTStartup+0x9a
090cfdac 7c923f31 ntdll!LdrpCallInitRoutine+0x14
090cfe30 7c81ca3e ntdll!LdrShutdownProcess+0x14f
090cff24 7c81cab6 kernel32!_ExitProcess+0x42
090cff38 7c3638e1 kernel32!ExitProcess+0x14
090cff40 7c3638c4 MSVCR71!__crtExitProcess+0x2e
090cff70 7c36ad81 MSVCR71!doexit+0xab
090cff80 7c36b3a5 MSVCR71!_exit+0xd
090cffb4 7c80b50b MSVCR71!_threadstartex+0x93
090cffec 00000000 kernel32!BaseThreadStart+0x37
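
A minimal sketch of this failure pattern (the names are illustrative, not the real VdsProtocols code): a global object inside a DLL is destroyed by the CRT during process shutdown, and its destructor calls through an interface pointer into an object whose derived part is already gone.

struct ISessionManager {
    virtual void CloseSession() = 0;
    virtual ~ISessionManager() {}
};

struct TTargetsWatcher {
    ISessionManager* m_pManager;

    TTargetsWatcher() : m_pManager(0) {}
    ~TTargetsWatcher()
    {
        // Runs from _CRT_INIT during DLL_PROCESS_DETACH (see the stack
        // above). If the object behind m_pManager has already executed
        // its derived destructor, this dispatch lands in __purecall.
        if (m_pManager)
            m_pManager->CloseSession();
    }
};

// A global inside the DLL: its destruction order relative to the session
// manager is out of our control once ExitProcess starts unloading modules.
TTargetsWatcher g_Watcher;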

The only thing left unclear was where this flag came from. I set a breakpoint on the kernel32!SetErrorMode function to see who sets the error mode flags and when. The result was a bit unexpected. A lot of different components, including system modules, set the flags one after another (especially during module loading). Fortunately, when I investigated the code more closely, I realized that all of them used the SetErrorMode function in the typical way: they set the flags to a desired value and then restore the old value once their code has executed, as kernel32!GetLongPathNameW does below:

kernel32!GetLongPathNameW+0x3d:
7c813396 6801800000      push    8001h
7c81339b e87f78ffff      call    kernel32!SetErrorMode (7c80ac1f)
7c8133a0 898568fdffff    mov     dword ptr [ebp-298h],eax

kernel32!GetLongPathNameW+0x3a0:
7c813656 ffb568fdffff    push    dword ptr [ebp-298h]
7c81365c e8be75ffff      call    kernel32!SetErrorMode (7c80ac1f)
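
In C++ this save/restore discipline is usually wrapped in a small scoped helper; here is a sketch (the class and function names are mine, not from any product code):

#include <windows.h>

class TScopedErrorMode {
public:
    explicit TScopedErrorMode(UINT uFlags)
        : m_uPrevious(::SetErrorMode(uFlags)) {}
    ~TScopedErrorMode() { ::SetErrorMode(m_uPrevious); }
private:
    UINT m_uPrevious;
};

void ProbeDriveQuietly()
{
    // 8001h above is exactly SEM_FAILCRITICALERRORS | SEM_NOOPENFILEERRORBOX:
    // suppress the system error dialogs only for the duration of the call.
    TScopedErrorMode mode(SEM_FAILCRITICALERRORS | SEM_NOOPENFILEERRORBOX);
    ::GetFileAttributesA("A:\\autorun.inf");
}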

So, all that was necessary to find the culprit was to find the code that set the SEM_NOGPFAULTERRORBOX flag for the first time.

It wasn't hard. The first call to the SetErrorMode function set the error mode value to 80007h. The culprit was our own module, which was the host of the server I had been debugging.

Finally the picture was absolutely clear, and the answers to the second and third questions were obvious. If the crash happened at a moment when some thread had temporarily changed the error mode value (clearing SEM_NOGPFAULTERRORBOX inside one of those set/restore windows), the OS called Dr. Watson for the process; otherwise we saw the familiar message box with the “Pure virtual function call” notification. That race also explains why there were only a few dumps and why their timestamps differed.

The conclusion from the story is very simple: try to avoid the SEM_NOGPFAULTERRORBOX flag where it is not necessary (e.g. in service processes).
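
If you need to make sure the flag is not set in such a process, the pre-Vista way is to read the current mode through SetErrorMode itself and put back a masked value; a sketch:

#include <windows.h>

void EnsurePostmortemDebuggerEnabled()
{
    // SetErrorMode returns the previous mode; there is no public
    // GetErrorMode before Windows Vista, so read it this way.
    UINT uPrevious = ::SetErrorMode(0);
    ::SetErrorMode(uPrevious & ~SEM_NOGPFAULTERRORBOX);
}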
