Diary of war with bugs: Can lightning strike the same place twice? (Or can a bomb to hit an elephant?)

In the old soviet movie about World War II was mentioned the story about the old Professor of Mathematics. When Leningrad has been bombing he is never descended to bombshelter along with other people. He told that he calculated the probability of bomb hitting his house and the probability was too small to hiding from bombs. But one day his neighbors saw him in a bombshelter and asked why is he hiding with others. He replied that on that day a bomb hit the only elephant in the cities zoo, and probability of hitting the elephant was absolutely equal with probability of been hit himself.

Why did I remember this story? Because during the last week I encountered twice with the failures which was the result of circumstances those probability was very low. Like people say: “If gun is hanging on the wall it will shoot one day or another”.

The first hero of the topic is the synchronization error related to smart pointers.

Let’s look to the following code example:

// Thread safe structure
struct Day : public IRefCounted
{
   virtual void AddRef()
   {
   // Thread safe implementation
   }

   virtual void Release()
   {
   // Thread safe implementation
   }
};

struct Month
{
   typedef boost::intrusive_ptr DayPtr;

   // Calls from thread 1
   void OnNewDay()
   {
   m_spCurrentDay = new Day();
   }

   // Calls from thread 2
   DayPtr GetCurrentDay()
   {
   return m_spCurrentDay;
   }

   DayPtr m_spCurrentDay;
};

Suppose that Day structure is thread safe and Month’s methods OnNewDay() and GetCurrentDay() are called from different threads. Is this code thread safe?

Let’s look what happens inside OnNewDay() function:

1.    New Day object created.

2.    Pointer to a Day object sent to DayPtr::operator=().

3.    Inside DayPtr::operator=() temporary DayPtr object created (now it’s holds a pointer to Day). Reference count value of a Day object incremented by 1.

4.    New temporary DayPtr object and m_spCurrentDay swapped.

5.    Temporary DayPtr object went out of scopes and destroyed. Reference count value of an old Day object decreased by 1. If it’s turns to 0, an old Day object will be deleted.

Here is the swapping code:

void swap(intrusive_ptr & rhs)
{
   T * tmp = px;
   px = rhs.px;
   rhs.px = tmp;
}

And its last string in machine code (we will need it later):

//rhs.px = tmp;

mov eax,dword ptr [rhs]
mov ecx,dword ptr [tmp]
mov dword ptr [eax],ecx

Now let’s look what happens inside GetCurrentDay() function:

1. Copy constructor of DayPtr (which is sent as parameter) called.

2. Inside the copy constructor pointer to Day copied to DayPtr and reference count value of Day increased.

Here is the code of intrusive_ptr copy constructor:

intrusive_ptr(intrusive_ptr const & rhs): px( rhs.px )
{
if( px != 0 ) intrusive_ptr_add_ref( px );
}

And here is the machine code:

// px( rhs.px )
mov eax,dword ptr [this]
mov ecx,dword ptr [rhs]
mov edx,dword ptr [ecx]
mov dword ptr [eax],edx

// if( px != 0 ) intrusive_ptr_add_ref( px );
mov eax,dword ptr [this]
cmp dword ptr [eax],0
je boost::intrusive_ptr::intrusive_ptr+43h (412473h)
mov eax,dword ptr [this]
mov ecx,dword ptr [eax]
push ecx
call intrusive_ptr_add_ref (4116CCh)

Now let’s imagine the following situation:

1.    Thread number 2 calls GetCurrentDay() and returns control after a pointer to Day is copied to it’s DayPtr but before a reference count of a Day object incremented (see the area of instructions marked in green).

2.    Now thread number 1 resumes and calls OnNewDay() (or maybe resumes inside OnNewDay() already, but before the instruction marked in red). It copies a pointer to a new Day object to m_spCurrentDay and decreases an old Day object reference count value. If reference count turns to 0, an old Day object deletes. After that thread #1 returns control to the system.

3.    After thread #2 takes over control it increases a reference count by 1, BUT the object is no longer exist.

Thus we have an intrusive_ptr to already deleted object. And this is completely not what we expected using a smart pointer. If you look at the copy constructors of other boost’s smart pointers you’ll encounter with the similar situation.

Please note – boost’s smart pointers are not thread safe!

Another good example of synchronization problems is the Xerces XML parser Platform Utils environment initialization. Although the Xerces is thread safe in the sense that you can create and use an instance of parser in any application thread (but you can’t use an instance of parser in different threads without synchronization!), there is the important exception from this rule – Platform Utils environment.

Before using the Xerces parser you should initialize Platform Utils environment calling the XMLPlatformUtils::Initialize() method (and deinitialize it after use through the XMLPlatformUtils::Terminate()). XMLPlatformUtils class uses global reference count to control the objects lifetime and ensures one-time initialization and deinitialization.

Let’s look at the code:

void XMLPlatformUtils::Initialize(const char* const locale
                                 , const char* const nlsHome
                                 , PanicHandler* const panicHandler
                                 , MemoryManager* const memoryManager
                                 , bool toInitStatics)
{
     if (gInitFlag == LONG_MAX)
     return;

     gInitFlag++;

   if (gInitFlag > 1)
   return;

   <...>
}

void XMLPlatformUtils::Terminate()
{
   if (gInitFlag == 0)
     return;

   gInitFlag--;

     if (gInitFlag > 0)
       return;

     <...>
}

It’s absolutely clear that the only candidates for race condition is gInitFlag++ and gInitFlag— statements. If we consider them atomic all works just fine – the initialization and deinitialization code executes only once no matter which threads calls XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate(). But are they really atomic?

Let’s look at assembler code:

// gInitFlag++;
1 mov eax,dword ptr [gInitFlag (9C2AA54h)]
2 add eax,1
3 mov dword ptr [gInitFlag (9C2AA54h)],eax

// gInitFlag--;
1 mov eax,dword ptr [gInitFlag (9C2AA54h)]
2 sub eax,1
3 mov dword ptr [gInitFlag (9C2AA54h)],eax

What will happen if few threads execute incrementation or decrementation code simultaneously?

Let’s consider the situation with gInitFlag++ statement execution:

1.    The reference count is 0.

2.    Thread #1 executes instruction #1 and returns control. The reference count still equal to 0.

3.    Thread #2 executes all 3 instructions and returns control. The reference count now is 1.

4.    Thread #1 executes instruction #2 and #3. The reference count still equal to 1.

Ooops! We called XMLPlatformUtils::Initialize() twice, but our reference count is still equal to 1! This means that Platform Utils environment will be destroyed after the first call to XMLPlatformUtils::Terminate(), which most likely will result in crash when a second thread will try to use it.

And what will happen with gInitFlag— statement in the same situation? The reference count will be larger than it should be and it will result in resources leak.

So it’s better to synchronize access to XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate() methods if use them in multithreaded environment!

Of course the probability of discussed situations is very low, but it’s happens. Believe me :)