Just before the trip I received some information concerning the problem from our service engineer.
There were several observations:
1. All navigation data freeze simultaneously but the navigation system (radar, cartographic system and over subsystems) continued to work without any other noticeable problems.
2. Sometimes the message box with “Pure virtual function call” notification was observed after data freezing.
3. First time the problem was observed just after the last update of the product.
4. In the end I even received HD images from all stations of the bridge (they was created after the last update and after the problem appeared for the first time).
Description of the problem was clearly pointed to the fact that the cause of the problem is crashing (or freezing) of the navigation server processes at all stations simultaneously (each computer has its own navigation server executed and described problem is only possible if there is no servers alive). The only thing that confused me was mentioned earlier “Pure virtual function call” error. This error is not very common and occurs rare enough in its pure form. Most common cause of the error is improper sequence of objects chain deinitialization and destruction as well as undigested strategy of objects ownership (of course I’m not take into account errors with calling pure virtual method of child class inside a parent class destructor or deleting an object twice – I believe that our code better :). One of the classic situations caused by assertion message box triggered during object destruction (usually in debug mode). Message box message loop can provoke a call (through a pure virtual interface) to a partially destroyed object. As a result we will have clear pure virtual function call.
0012df14 004137f4 MSVCR71D!_purecall+0x19 0012dff4 7c22f7b6 PureCall!CMyDlg::OnTimer+0x34 0012e0f8 7c22efde MFC71D!CWnd::OnWndMsg+0x7a6 0012e118 7c22c820 MFC71D!CWnd::WindowProc+0x2e 0012e190 7c22ccfe MFC71D!AfxCallWndProc+0xe0 0012e1b0 7c29d8ea MFC71D!AfxWndProc+0x9e 0012e1e0 7e418734 MFC71D!AfxWndProcBase+0x4a 0012e20c 7e418816 USER32!InternalCallWinProc+0x28 0012e274 7e4189cd USER32!UserCallWinProcCheckWow+0x150 0012e2d4 7e418a10 USER32!DispatchMessageWorker+0x306 0012e2e4 7e42dbbf USER32!DispatchMessageW+0xf 0012e31c 7e42593f USER32!DialogBox2+0x15a 0012e344 7e43a91e USER32!InternalDialogBox+0xd0 0012e604 7e43a284 USER32!SoftModalMessageBox+0x938 0012e754 7e4661d3 USER32!MessageBoxWorker+0x2ba 0012e7ac 7e466278 USER32!MessageBoxTimeoutW+0x7a 0012e7e0 7e450617 USER32!MessageBoxTimeoutA+0x9c 0012e800 7e4505cf USER32!MessageBoxExA+0x1b 0012e81c 1020c2aa USER32!MessageBoxA+0x45 0012e854 102093a1 MSVCR71D!__crtMessageBoxA+0x16a 0012ebb8 00413a84 MSVCR71D!_assert+0x5b1 0012eca4 00413beb PureCall!TObjectBase::~TObjectBase+0x44 0012ed84 004136cb PureCall!TObject::~TObject+0x2b 0012ee64 00418598 PureCall!TObject::`scalar deleting destructor'+0x2b 0012ef64 7c22f7a4 PureCall! CMyDlg::OnClose+0x58 |
The situation aboard was not the same because the navigation servers crashed during they normal execution but not during the exit. Or something had to force them to exit.
I hoped that HD images will clarify a cause of the pure virtual function call. First of all I checked the images for the presence of dumps from Dr. Watson. Nothing found. Then we installed the images to our test bench. Nothing found within a few days.
There was no stable internet connection aboard and no possibility to ask someone from crew to execute some tools before or after the problem will appear next time. Furthermore, there were only a few days to solve the problem, so we decided that investigation of the problem aboard will be most efficiently.
Of course the first thing I did on board was checking for the Dr. Watson dumps again. Imagine my surprise when I found few dumps on two different stations with two different timestamps. It was really very strange for several reasons:
1. There were no dumps inside the HD images (remember, the dumps was created after the problem appeared for the first time).
2. According to the problem description there should be 5 dumps (if any) with nearly identical timestamps.
3. Again, according to the problem description nobody saw Dr. Watson UI. So it’s not clear where the dumps came from.
There was no time for experiments irrelevant to the problem, so I decided to postpone deep investigation of the issue and proceed to analyze the crash dumps. A culprit was found very easy. It was a classical case of buffer overflow.
Here is the call stack:
WARNING: Frame IP not in any known module. Following frames may be wrong. 09e9be68 09c00cac 0x54205344 09e9be6c 09c00ff7 0x9c00cac 09e9be70 45564944 0x9c00ff7 09e9be74 41205352 0x45564944 09e9be78 53495353 0x41205352 09e9be7c 20444554 0x53495353 09e9be80 42205942 0x20444554 09e9be84 45475241 0x42205942 09e9be88 4e412053 0x45475241 09e9be8c 55542044 0x4e412053 09e9be90 4f422047 0x55542044 09e9be94 20535441 0x4f422047 <…> |
And here is the stack:
09e9be6c 09c00cac 09c00ff7 45564944 41205352 ........DIVERS A 09e9be7c 53495353 20444554 42205942 45475241 SSISTED BY BARGE 09e9be8c 4e412053 55542044 4f422047 20535441 S AND TUG BOATS 09e9be9c 4c4c4957 20454220 4b2a4f2a 20474e49 WILL BE *O*KING 09e9beac 54204e49 202a4948 41455241 4548542e IN THI* AREA.THE 09e9bebc 52414220 20534547 20444e41 20475554 BARGES AND TUG 09e9becc 54414f42 414d2053 45422059 46454c20 BOATS MAY BE LEF 09e9bedc 4e492054 45485420 4f424120 4d204556 T IN THE ABOVE M 09e9beec 49544e45 44454e4f 45524120 55442041 ENTIONED AREA DU 09e9befc 474e4952 45485420 472a4e20 4d2e5448 RING THE N*GHT.M 09e9bf0c 4e495241 20535245 20455241 54534e49 ARINERS ARE INST 09e9bf1c 54435552 54204445 20544148 49544e55 RUCTED THAT UNTI 09e9bf2c 5546204c 45485452 4f4e2052 45434954 L FURTHER NOTICE 09e9bf3c 562a4e2c 54414749 204e4f49 54204e49 ,N*VIGATION IN T 09e9bf4c 41204548 45564f42 45522a20 53492041 HE ABOVE *REA IS 09e9bf5c 544f4e20 52455020 5454494d 542e4445 NOT PERMITTED.T 09e9bf6c 20594548 205a5241 2a204f42 47495641 HEY ARZ BO *AVIG 09e9bf7c 20455441 482a4957 55414320 4e4f4954 ATE WI*H CAUTION 09e9bf8c 444e4120 4f4c5320 50532057 20444545 AND SLOW SPEED 09e9bf9c 4e454857 204e4920 20454854 49434956 WHEN IN THE VICI 09e9bfac 5954494e 444e4120 204f5420 45564947 NITY AND TO GIVE 09e9bfbc 45485420 444e4920 54414349 41204445 THE INDICATED A 09e9bfcc 20414552 49572041 42204544 48545245 REA A WIDE BERTH 09e9bfdc 52414d2e 52454e49 52412053 4c412045 .MARINERS ARE AL 09e9bfec 41204f53 532a5644 54204445 4f46204f SO ADV*SED TO FO 09e9bffc 574f4c4c 594e2a20 534e4920 43555254 LLOW *NY INSTRUC 09e9c00c 4e4f4954 2a472053 2a4e2a56 5320592a TIONS G*V*N**Y S 09e9c01c 4f505055 43205452 54464152 204e4920 UPPORT CRAFT IN 09e9c02c 20454854 412a5241 444e4120 4c2a5620 THE AR*A AND V*L 09e9c03c 5454454c 4f502041 43205452 52544e4f LETTA PORT CONTR 09e9c04c 57204c4f 4d204f48 49205941 45555353 OL WHO MAY ISSUE 09e9c05c 594e4120 57454e20 52415720 474e494e ANY NEW WARNING 09e9c06c 49572053 52204854 52414745 54205344 S WITH REGARDS T 09e9c07c 4854204f 4241202a 2e45564f 00000000 O TH* ABOVE..... |
And here is the culprit:
::memcpy(sentence.Sentence, pMessData, nMessSize); |
There was too long message that the yacht received from Malta (presumably) one or two times a day and it crashed the system. The size of sentence.Sentence buffer was 512 bytes and the length of the message was clearly more.
By the way we didn’t receive this message again within three days I was aboard. So the dumps I found were the only chance to catch the problem.
When I came home I finally had a chance to find out the answers on the following questions:
1. It’s absolutely clear that the cause of the navigation servers failure was an “Access violation” exception. But why instead of Dr. Watson UI (and logs along with dumps) the result of failure was a “Pure virtual function” message box?
2. Where the dumps which I found on the bridge came from?
3. And why did they have different timestamps?
First of all I decided to analyze the behavior of the UnhandledExceptionFilter function.
NOTE: It is worth to keep in mind that UnhandledExceptionFilter checks whether a process is been debugged in order to send debug event to the debugger rather then handle an event by the way intrinsic for non-debugged processes.
That is to have UnhandledExceptionFilter’s behavior similar to those we have for the process that is not been debugged we have to either to overwrite the debugger port value to 0 just before UnhandledExceptionFilter checks it or use a kernel-mode debugger. |
To not bother I chose the second way.
I found that after checking the various conditions (exception flags, exception id, number of parameters and so on) UnhandledExceptionFilter checks the error mode value.
7c862ce5 e8eb7dfaff call kernel32!GetErrorMode (7c80aad5) // Check the SEM_NOGPFAULTERRORBOX flag 7c862cea a802 test al,2 7c862cec 0f8563070000 jne kernel32!UnhandledExceptionFilter+0x8ad (7c863455) |
And if the flag SEM_NOGPFAULTERRORBOX set, UnhandledExceptionFilter just return EXCEPTION_EXECUTE_HANDLER to the exception handle.
kernel32!UnhandledExceptionFilter+0x8ad: 7c863455 33c0 xor eax,eax 7c863457 40 inc eax 7c863458 8da59cfdffff lea esp,[ebp-264h] 7c86345e 8b4de4 mov ecx,dword ptr [ebp-1Ch] 7c863461 e8ab62faff call kernel32!__security_check_cookie (7c809711) 7c863466 e8a0f0f9ff call kernel32!_SEH_epilog (7c80250b) 7c86346b c20400 ret 4 |
In my case the exception handler was MSVCR71!_except_handler3.
090cba48 7c36e289 kernel32!UnhandledExceptionFilter 090cba64 7c36b398 MSVCR71!_XcptFilter+0x15f 090cba70 7c363943 MSVCR71!_threadstartex+0x86 090cba98 7c9037bf MSVCR71!_except_handler3+0x61 090cbabc 7c90378b ntdll!ExecuteHandler2+0x26 090cbb6c 7c90eafa ntdll!ExecuteHandler+0x24 090cbb6c 27534e41 ntdll!KiUserExceptionDispatcher+0xe WARNING: Frame IP not in any known module. Following frames may be wrong. 090cbe68 4c2d4f4b 0x27534e41 |
And the exception handler just unwind the stack and return control to MSVCR71!_threadstartex which in its turn call the MSVCR71!_exit function.
MSVCR71!_except_handler3+0x61: 7c363943 5d pop ebp 7c363944 5e pop esi 7c363945 8b5d0c mov ebx,dword ptr [ebp+0Ch] 7c363948 0bc0 or eax,eax 7c36394a 743f je MSVCR71!_except_handler3+0xa9 (7c36398b) 7c36394c 7848 js MSVCR71!_except_handler3+0xb4 (7c363996) 7c36394e 8b7b08 mov edi,dword ptr [ebx+8] 7c363951 53 push ebx 7c363952 e88c000000 call MSVCR71!__global_unwind2 (7c3639e3) 7c363957 83c404 add esp,4 7c36395a 8d6b10 lea ebp,[ebx+10h] 7c36395d 56 push esi 7c36395e 53 push ebx 7c36395f e8c1000000 call MSVCR71!__local_unwind2 (7c363a25) 7c363964 83c408 add esp,8 7c363967 8d0c76 lea ecx,[esi+esi*2] 7c36396a 6a01 push 1 7c36396c 8b448f08 mov eax,dword ptr [edi+ecx*4+8] 7c363970 e83a010000 call MSVCR71!_NLG_Notify (7c363aaf) 7c363975 8b048f mov eax,dword ptr [edi+ecx*4] 7c363978 89430c mov dword ptr [ebx+0Ch],eax 7c36397b 8b448f08 mov eax,dword ptr [edi+ecx*4+8] 7c36397f 33db xor ebx,ebx 7c363981 33c9 xor ecx,ecx 7c363983 33d2 xor edx,edx 7c363985 33f6 xor esi,esi 7c363987 33ff xor edi,edi 7c363989 ffd0 call eax // {MSVCR71!_threadstartex+0x89 (7c36b39b)} |
MSVCR71!_threadstartex+0x89 001b:7c36b39b 8b65e8 mov esp,dword ptr [ebp-18h] 001b:7c36b39e ff75e4 push dword ptr [ebp-1Ch] 001b:7c36b3a1 e8cef9ffff call MSVCR71!_exit (7c36ad74) |
Thus, the answer to the first question was almost obtained. The source of the “Pure virtual function call” message box was in improper objects uninitialization and destruction sequence. The owner of the pointer to the pure virtual interface tried to call a method of already destructed object (just like I expected).
090cfbc8 0b0135e1 MSVCR71!_purecall+0x12 090cfcf8 0b056232 VdsProtocols!NVdsProt::IVdsProtMngrImpl::CloseSession+0x31 090cfd24 0b06430d VdsProtocols!NVdsProt::CVdsTargetsWProt::Deinit+0xc2 090cfd4c 0b07ce76 VdsProtocols!NVdsProt::CVdsTargetsWProt::~CVdsTargetsWProt+0x3d 090cfd50 0b07cf39 VdsProtocols!_CRT_INIT+0x95 090cfd8c 7c9011a7 VdsProtocols!_DllMainCRTStartup+0x9a 090cfdac 7c923f31 ntdll!LdrpCallInitRoutine+0x14 090cfe30 7c81ca3e ntdll!LdrShutdownProcess+0x14f 090cff24 7c81cab6 kernel32!_ExitProcess+0x42 090cff38 7c3638e1 kernel32!ExitProcess+0x14 090cff40 7c3638c4 MSVCR71!__crtExitProcess+0x2e 090cff70 7c36ad81 MSVCR71!doexit+0xab 090cff80 7c36b3a5 MSVCR71!_exit+0xd 090cffb4 7c80b50b MSVCR71!_threadstartex+0x93 090cffec 00000000 kernel32!BaseThreadStart+0x37 |
The only thing left unclear is where this flag came from? I set a breakpoint to the kernel32!SetErrorMode function to see who and then set the error mode flags. The result was a bit unexpected. A lot of different functionalities including system modules set the flags one after another (especially during the modules loading process). Fortunately then I investigated the code more precisely I realized that all of them use the SetErrorMode function in a typical way. They just set the flag to a desired value and then return the old value back after code execution.
kernel32!GetLongPathNameW+0x3d: 7c813396 6801800000 push 8001h 7c81339b e87f78ffff call kernel32!SetErrorMode (7c80ac1f) 7c8133a0 898568fdffff mov dword ptr [ebp-298h],eax |
kernel32!GetLongPathNameW+0x3a0: 7c813656 ffb568fdffff push dword ptr [ebp-298h] 7c81365c e8be75ffff call kernel32!SetErrorMode (7c80ac1f) |
So, all that is necessary to find a culprit is to find code that set the SEM_NOGPFAULTERRORBOX flag for the first time.
It wasn’t hard. The first call to the SetErrorMode function set the error mode value to 80007h. The culprit was our own module which was the host for the server that I had debugged.
Finally the picture was absolutely clear and the answer to the second and third questions was obvious. If it’s happens that at the moment of crash some thread changed the error mode value, OS will call Dr. Watson for the process, otherwise we will see the familiar message box with “Pure virtual function call” notification.
The conclusion from the story is very simple: try to avoid the SEM_NOGPFAULTERRORBOX flag where it’s not necessary (e.g. like in service processes).
No comments:
Post a Comment