Thursday, December 23, 2010

Invalid Bus Driver

Several weeks ago we encountered with very interesting crash of one of our product’s processes. After analyzing the dump we found that exceptions were excited inside two threads simultaneously.

The first one roused up the debugger:

0355f0f0 7c3627e4 ntdll!RtlAllocateHeap+0x655
                  cmp edi,dword ptr [eax+4] ds:0023:8851e6cc=????????
0355f130 7c36280c msvcr71!_heap_alloc+0xe0
0355f138 7c362829 msvcr71!_nh_malloc+0x10
0355f144 7c3eb633 msvcr71!malloc+0xf
0355f154 7c3c1f0e msvcp71!operator new+0x21
0355f9cc 7c3c4f9e msvcp71!std::basic_string,std::allocator >::_Copy+0x73
0355f9e0 7c3c55df msvcp71!std::basic_string,std::allocator >::_Grow+0x22
0355f9fc 7c3c6752 msvcp71!std::basic_string,std::allocator >::assign+0x4e
0355fa10 00595239 msvcp71!std::basic_string,std::allocator >::basic_string,std::allocator >+0x20
0355fc4c 00595387 SCServer_dll71!_ConvertBinsToChars+0xb9
0355fdac 007335ea SCServer_dll71!TReadResourceHandler::HandleEvent+0x127
0355ff74 10046453 SCLib71!TReactor::Thread+0xda
0355ff80 7c36b381 ETL!TThread_::ThreadThunkFunction+0x23
0355ffb4 7c80b50b msvcr71!_threadstartex+0x6f
0355ffec 00000000 kernel32!BaseThreadStart+0x37

And the second one was pending:

0012e45c 7c90e9ab ntdll!KiFastSystemCallRet
0012e460 7c8633d5 ntdll!ZwWaitForMultipleObjects+0xc
0012e7a0 7c36e289 kernel32!UnhandledExceptionFilter+0x82d
0012e7bc 0040c860 msvcr71!_XcptFilter+0x15f
0012e7c8 7c363943 SCServer71!WinMainCRTStartup+0x1d7
0012e7f0 7c9037bf msvcr71!_except_handler3+0x61
0012e814 7c90378b ntdll!ExecuteHandler2+0x26
0012e8c4 7c90eafa ntdll!ExecuteHandler+0x24
0012e8c4 00409c1c ntdll!KiUserExceptionDispatcher+0xe
0012ebc4 7c1adc5b SCServer71!CMonitoringDlg::OnTimer+0x1c
                  call dword ptr [edx+8] ds:0023:00000008=????????              
0012ec54 7c1a9f01 mfc71!CWnd::OnWndMsg+0x46b
0012ec74 00422d16 mfc71!CWnd::WindowProc+0x22
0012ef74 77d487eb user32!InternalCallWinProc+0x28
0012efdc 77d489a5 user32!UserCallWinProcCheckWow+0x150
0012f03c 77d4bccc user32!DispatchMessageWorker+0x306
0012f04c 7c1b1645 user32!DispatchMessageA+0xf
0012f05c 7c1ab833 mfc71!AfxInternalPumpMessage+0x3e
0012f080 7c1aeeed mfc71!CWnd::RunModalLoop+0xca
0012f0bc 00424726 mfc71!CDialog::DoModal+0xf3
0012f0f0 0040142b NSGuiCtl10!CNSGDialog::DoModal+0xc6
0012ff08 7c1ae5d0 SCServer71!CScServerApp::InitInstance+0x9b
0012ff18 0040c80e mfc71!AfxWinMain+0x47
0012ffc0 7c816d4f SCServer71!WinMainCRTStartup+0x185
0012fff0 00000000 kernel32!BaseProcessStart+0x23

Wednesday, October 13, 2010

Grabbing the dumps from the customer's machine

Occasionally I encounter the task of assisting the customer to get the bump of the crashed process. My company develops software for marine navigation systems, and as you can guess there some problems with internet connection in open sea. In fact, most of the vessels have internet access aboard, but it’s very slow due to the high price of satellite connections. In addition, in most vessels you can’t take direct connection to bridge environment due to security and technical restrictions. So, the only communication remains is e-mail ?

To save my time and customers peace of mind I wrote the trivial batch script, which reduce the amount of explanations to minimum:

@echo off

rem The processes of interest
set PROCESSES=(IBSSvc TBService scserver71)

rem List all processes in the system
tlist /v > dumps/processes.txt

rem Grab all dumps and debugger output for the processes of interest
for %%i in %PROCESSES% do cdb -pv -pn "%%i.exe" -logo dumps/%%i.txt -c ".dump /ma dumps/%%i.dmp;q"

All tools you need for script execution you can find in “Debugging Tools for Windows” suite.

Here is the list of modules:

* cdb.exe
* dbgeng.dll
* dbghelp.dll            
* ext.dll
* tlist.exe
* uext.dll

After that, I just pack all those stuff to self-extracting archive and send it to customer. After customer unpacks the archive and executes the script, he sends me content of the “dumps” folder back and I have the opportunity to analyze it.

You can modify the script the way you like and use it where appropriate.

Monday, October 4, 2010

Believe not all that you see

Last week there was very interesting story that can be useful for debugger users. During the application debugging (using WinDbg) we encountered with stable repeated crash. The source of the crash was found easily, but it confused us completely. The problem was in one processor instruction. The instruction add eax,10h was ended up with subtraction of 34h from eax register.

We had checked it several times but the result was absolutely identical. After that, we made several experiments with the following results:

1.    Whatever we put into the place of 10h, the result was subtraction of 34h from eax register.

2.    After we moved this instruction three bytes above or bellow the current adress (just noping the gap where instruction was) the effect disappeared immediately. But when we returned it back in its palce, the effect returned as well.

After some time of puzzling over I looked into the breakpoints list and noticed that there was a breakpoint set at faulting instruction. But instruction wasn’t highlighted in the debugger. When I checked the address of the breakpoint, I found out that the breakpoint was set not inside the first instruction byte (like it should be), but inside the third byte – that is in the place of 10h operand. Of course this “breakpoint” didn’t act like an actual breakpoint, processor just executed add eax,CCh instruction instead. And because of debugger hides all its breakpoints before it’s pass the control to user, we always saw add eax,10h instruction instead of real one.

So, the obvious question is: Why the debugger set this breakpoint?

And the answer is simple. The unresolved breakpoint was set during another debugging session of this process, and since the process binaries and symbols set were changed, the debugger figured out wrong address for the breakpoint (in fact, completely wrong :)). And because of debugger still considered this breakpoint as unresolved, we were encountering with this crash over and over again after restart of the process.

Saturday, September 18, 2010

Congratulate me, I’m the winner!

Several articles from this blog have participated in Tell Your Windows Debugging Story Annual Competition. Few days ago Dmitry Vostokov introduced the winners of the competition. I’m glad to tell you that this blog is among them!

The way to find a hotspot

Today I want to describe the way to find a hotspot without profilers. I often use this approach and find it effective and simple enough. The only tools you need are any Windows debugger (I prefer WinDbg, but you can use your favorite one) which should always be in your hands if you decided to investigate some performance issue (as well as any other problem with your software) and tool like Process Explorer which available free from Windows Sysinternals. You can use this method without profiler at all or in conjunction with profiler to clarify the result of profilers work.

Before explaining the approach, I want to define the term “hotspot”. There are different definitions which can be applied in different contexts. In this article we consider a hotspot as a limited set of instructions on which execution spends significant CPU time or more CPU time than you expect from it.

The process is divided into the three steps. If you used a profiler and have a call graph, you can miss the first two steps and jump to step number three to verify the profilers result and find out the hotspot most precisely. Nevertheless, I recommend you to go through all steps because the results returned by profilers sometimes can be very obscure and you will lose time rolling around a call graph or other related information. Even if your profiler was absolutely right, you’ll have one more confirmation of its result. Sometimes it’s better to check twice :)

Thursday, September 9, 2010

The bug in MFC tool tip control

Yesterday, I encountered the bug in MFC’s tool tip control implementation.
The bug dozes in the following code:

LRESULT CToolTipCtrl::OnAddTool(WPARAM wParam, LPARAM lParam)
    TOOLINFO ti = *(LPTOOLINFO)lParam;     <----- HERE
    if ((ti.hinst == NULL) && (ti.lpszText != LPSTR_TEXTCALLBACK)
        && (ti.lpszText != NULL))
        void* pv;
        if (!m_mapString.Lookup(ti.lpszText, pv))
            m_mapString.SetAt(ti.lpszText, NULL);
        // set lpszText to point to the permanent memory associated
        // with the CString
        VERIFY(m_mapString.LookupKey(ti.lpszText, ti.lpszText));
    return DefWindowProc(TTM_ADDTOOL, wParam, (LPARAM)&ti);

Friday, September 3, 2010

To untangle a rope

Last weekend I was launching a kite with my daughter and her friend. It was great fun, but when I was veering the kite down its string entangled. I spent about 30 minutes (maybe even more – time flies fast :)) to untangle the string. I followed the string loops, analyzed the types of knots, imagined what will happen to string if I pass it through one loop or another. I went through a lot of different combinations inside my mind and devised more and more new techniques. When I finished finally, I felt satisfaction. I felt no tiredness or irritation, furthermore I had the feeling like I untangle the strings all my life. Then I thought about it I realized that it’s not far from the truth. The process of untangling string (as well as rope, chain and so on) is very similar with the process of software defects researching (the term was first introduced by Dmitry Vostokov). So if you in doubt to be software researcher or not – try to untangle a rope first :)

Thursday, September 2, 2010

Can lightning strike the same place twice? (Or can a bomb to hit an elephant?)

In the old soviet movie about World War II was mentioned the story about the old Professor of Mathematics. When Leningrad has been bombing he is never descended to bombshelter along with other people. He told that he calculated the probability of bomb hitting his house and the probability was too small to hiding from bombs. But one day his neighbors saw him in a bombshelter and asked why is he hiding with others. He replied that on that day a bomb hit the only elephant in the cities zoo, and probability of hitting the elephant was absolutely equal with probability of been hit himself.

Why did I remember this story? Because during the last week I encountered twice with the failures which was the result of circumstances those probability was very low. Like people say: “If gun is hanging on the wall it will shoot one day or another”.

Monday, August 2, 2010

Why it's not crashing? (conclusion)

Well, today I received the final message from MS where they told me once again that there is nothing can be done on their side and advised me to catch exception in user code and call MiniDumpWriteDump after that. I think the story can be considered finished.

Se la vie, mes amis. Se la vie.

Sunday, August 1, 2010

In search of the Dr. Watson

A week ago I returned from the business trip to Italy. It was a trip to the yacht where was the real problems with the navigation system stability. One or two times a day all navigation data from sensors freeze and the navigation equipment was inoperative until full bridge (5 stations) rebooting.

Just before the trip I received some information concerning the problem from our service engineer.
There were several observations:

1.    All navigation data freeze simultaneously but the navigation system (radar, cartographic system and over subsystems) continued to work without any other noticeable problems.

2.    Sometimes the message box with “Pure virtual function call” notification was observed after data freezing.

3.    First time the problem was observed just after the last update of the product.

4.    In the end I even received HD images from all stations of the bridge (they was created after the last update and after the problem appeared for the first time).

Description of the problem was clearly pointed to the fact that the cause of the problem is crashing (or freezing) of the navigation server processes at all stations simultaneously (each computer has its own navigation server executed and described problem is only possible if there is no servers alive). The only thing that confused me was mentioned earlier “Pure virtual function call” error. This error is not very common and occurs rare enough in its pure form. Most common cause of the error is improper sequence of objects chain deinitialization and destruction as well as undigested strategy of objects ownership (of course I’m not take into account errors with calling pure virtual method of child class inside a parent class destructor or deleting an object twice – I believe that our code better :). One of the classic situations caused by assertion message box triggered during object destruction (usually in debug mode). Message box message loop can provoke a call (through a pure virtual interface) to a partially destroyed object. As a result we will have clear pure virtual function call.

Friday, July 30, 2010

Why it's not crashing?

Several months ago, my colleagues encountered a very strange behavior of one of our processes after an exception was thrown. They told me that they know exactly that the process throws an exception which is not expected to be handled by their code, that is an exception must be unhandled and the process must be terminated immediately. But nothing like that happens. The process just continues its execution. At first, I thought that there must be some code down the stack which registers an exception handler that handles any exception indiscriminately. Therefore, the first thing I did was search for the try{}catch(…){} pattern down the stack. When I found nothing, I looked at the exception handlers list at the point where an exception was thrown.

It looked as follows:

0012fd14: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, func:   USER32!UserCallWinProc+10a (7e44aa1c)
0012fd6c: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, filter: USER32!DispatchMessageWorker+113 (7e440712)
                func:   USER32!DispatchMessageWorker+126 (7e44072a)
0012ffb0: VanishExcept!ILT+375(__except_handler3)+0 (0041117c)
0012ffe0: kernel32!_except_handler3+0 (7c839af0)
  CRT scope  0, filter: kernel32!BaseProcessStart+29 (7c84377a)
                func:   kernel32!BaseProcessStart+3a (7c843790)
Invalid exception stack at ffffffff

There were no user exception handlers at all, but instead was an exception handler I have not seen before:

0012fd14: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, func:   USER32!UserCallWinProc+10a (7e44aa1c)
0012fd6c: USER32!_except_handler3+0 (7e440457)
  CRT scope  0, filter: USER32!DispatchMessageWorker+113 (7e440712)
                func:   USER32!DispatchMessageWorker+126 (7e44072a)
0012ffb0: VanishExcept!ILT+375(__except_handler3)+0 (0041117c)
0012ffe0: kernel32!_except_handler3+0 (7c839af0)
  CRT scope  0, filter: kernel32!BaseProcessStart+29 (7c84377a)
                func:   kernel32!BaseProcessStart+3a (7c843790)
Invalid exception stack at ffffffff

Saturday, June 26, 2010

The bug is on the table

A few months ago there was an interesting story in the company where I work. One guy from a neighboring department asked me to help him. He encountered one crash after another over the past few days. Unusual in these crashes was that they occurred in different processes and under different circumstances, but within one computer on the test bench (consisted of 5 machines) and with identical symptoms. In all cases it was an access violation caused by illegal instruction pointer.

Here is an example of one of the call stacks:

0:000> knL
 # ChildEBP RetAddr 
00 0012d7f8 7c90e9ab ntdll!KiFastSystemCallRet
01 0012d7fc 7c8094f2 ntdll!ZwWaitForMultipleObjects+0xc
02 0012d898 7c809c86 kernel32!WaitForMultipleObjectsEx+0x12c
03 0012d8b4 6945763c kernel32!WaitForMultipleObjects+0x18
04 0012e248 694582b1 faultrep!StartDWException+0x5df
05 0012f2bc 7c863059 faultrep!ReportFault+0x533
06 0012f530 7c36e289 kernel32!UnhandledExceptionFilter+0x4cf
07 0012f54c 0041b84f msvcr71!_XcptFilter+0x15f
08 0012f558 7c363943 radar!WinMainCRTStartup+0x1d7
09 0012f580 7c9037bf msvcr71!_except_handler3+0x61
0a 0012f5a4 7c90378b ntdll!ExecuteHandler2+0x26
0b 0012f654 7c90eafa ntdll!ExecuteHandler+0x24
0c 0012f654 fc910d93 ntdll!KiUserExceptionDispatcher+0xe
WARNING: Frame IP not in any known module. Following frames may be wrong.
0d 0012f950 1f4d5318 0xfc910d93
0e 0012fa10 7c363593 0x1f4d5318
0f 0012fa58 087e6827 msvcr71!free+0xc3
10 0012fa8c 087e6f41 NaviRadarRendererPlain!std::vector,std::allocator > >::_Insert_n+0x147
11 0012fafc 087e46f1 NaviRadarRendererPlain!TScanlines::calculate+0x5d1
12 0012fc2c 087e418e NaviRadarRendererPlain!TRadarToScreen_::RenderBlock+0x521
13 0012fc64 087e2378 NaviRadarRendererPlain!TRadarToScreen_::Render+0x16e
14 0012fc88 087c4cf0 NaviRadarRendererPlain!n_d3d_render::renderer_t::render+0x68
15 0012fce0 03d5c5e7 NaviRadarLayer!TRadarLayerImpl::Draw+0x360
16 0012fd1c 03d55c8e TotUser!TRadarManager_::Update+0x77
17 0012fd4c 087c821a TotUser!TCrtPanel_::InvalidateRadar+0xbe
18 0012fd54 10ac8959 NaviRadarLayer!TRadarLayerImpl::OnVideoBlockReceived+0x7a
19 0012fd7c 10ac8ab6 TkRadar20!IAdviseHostImpl::TypedForEach+0x49
1a 0012fd90 10ac8e88 TkRadar20!RadarConnector::VideoBlockReceived+0x26
1b 0012fda0 77d48709 TkRadar20!detail::WndProc+0x48
1c 0012fdcc 77d487eb user32!InternalCallWinProc+0x28
1d 0012fe34 77d489a5 user32!UserCallWinProcCheckWow+0x150
1e 0012fe94 77d4bccc user32!DispatchMessageWorker+0x306
1f 0012fea4 7c1b1645 user32!DispatchMessageA+0xf
20 0012feb4 7c1b1357 mfc71!AfxInternalPumpMessage+0x3e
21 0012fed0 0040acc0 mfc71!CWinThread::Run+0x54
22 0012ff08 7c1ae5f1 radar!CECSNTApp::Run+0x30
23 0012ff18 0041b7fd mfc71!AfxWinMain+0x68
24 0012ffc0 7c816d4f radar!WinMainCRTStartup+0x185
25 0012fff0 00000000 kernel32!BaseProcessStart+0x23