I was involved in a big project migrating a complex stack from Windows XP to Windows 7 x86, and it scared the !@#$ out of me.
The stack communicated with WebSphere MQ on AS/400 from Windows. On the Windows side of things it consisted of a UI developed in Cool:Gen and a C interface to a Delphi DLL, which takes care of the communication with WebSphere MQ.
Digression:
This all was histerically grown. In the beginning (early to mid 90s of the last century) it was a big Borland Pascal/Turbo Pascal application – running on DOS, Windows 3.1x, Windows 95 and OS/2 2.x/3.0 – that talked over a proprietary layer over SNA to AS/400.
The vendor of that layer didn’t respond to a request for Windows NT 4.x compatibility, and meanwhile some client applications were about to be developed in Delphi.
So in 1997/1998 – together with a great AS/400 software developer – I wrote an SNA-based APPC/CPI-C communication layer in Delphi 3 that could be accessed from both Turbo Pascal (using a file-based interface) and Delphi (using an object interface).
The DOS interface was an executable around the Delphi interface, which was a set of classes.
The Delphi part of the DOS interface was centered around FindFirstChangeNotification/CreateProcess combined with MsgWaitForMultipleObjects/WaitForSingleObject to make the waiting as efficient as possible.
The DOS part of the Delphi interface was centered around this piece to make waiting efficient:

    asm
      int $28
      mov ax, $1000
      int $15 { DESQview/TopView give up time slice }
      mov ax, $1680
      int $2F
    end;

Thanks to the RBIL: Ralf Brown’s Interrupt List (there are now multiple HTML versions of it), it makes use of these tricks so DOS applications can wait efficiently:
- INT 28 – DOS 2+ – DOS IDLE INTERRUPT
- INT 15 – AX = 1000h – TopView – “PAUSE” – GIVE UP CPU TIME
- INT 2F – AX = 1680h – MS Windows, DPMI, various – RELEASE CURRENT VIRTUAL MACHINE TIME-SLICE
About 5 years into this century, it was decided that the Turbo Pascal code should be rewritten, and that it should be done by an outsourcing “partner” (this was in the heyday of outsourcing).
Since that outsourcing party did a lot of HOST and AS/400 stuff, and there were already a lot of things written in Composer by IEF (which was originally chosen because of the expected stability of Texas Instruments but – via Sterling Software – has now ended up as a cash cow at CA), the “natural” choice for the PC part was to use the IEF successor COOL:Gen. I think back then it was COOL:Gen version 5, but it is hard to find the COOL:Gen version history on-line; CA is very protective about sharing knowledge, and also renamed COOL:Gen to CA Gen. Anyway, the final version delivered was in CA Gen 6; then the outsourcing contract was terminated and most source code was delivered.
By the time I got involved, the application was a whopping 3 executable files large, as the Windows version of CA Gen could not handle the number of CA Gen models that the outsourcing people had come up with.
The outsourcing partner – having an off the scale CMM level – had some C experts that could do interfacing to “difficult” things on the PC side, and – as I hadn’t been involved, since the communication library had been maintenance free for years – called me in at the last minute to get the AS/400 communication working.
Since the Delphi SNA code was a tad complicated, and the time frame was about zero, I wanted to stick to Delphi and proposed a 3-function (InitSNA/CallSNA/ExitSNA) DLL that the C guys would need to call.
Back then, my warning system should have gone berserk, as I had to teach their C experts how to call a DLL, heck, even make a sample Visual C 6 project that showed one call.
In fact, I made use of the – then undocumented – dcc32 JPH switch so the Delphi compiler would emit .hpp, .lib and .def files, created a Visual C 6 sample project for one specific call on the AS/400 with documentation that they should change that sample code (and the error checking) according to the documented system requirements for each available call.
This year I found out that they never adapted the error checking, so all calls have the same checks. Which explains the number of errors you get when something fails.
The CallSNA function basically transmits a buffer (slightly over 8k) back and forth between the PC and the AS/400. The business logic on both sides knows how to pack/unpack the buffer contents, and the Asian outsourcing company rewrote that part in C based on some Turbo Pascal examples they tried to read.
Since I wasn’t allowed to spend more than 60 hours on this (they almost went berserk when it appeared to be 85 hours), I was not allowed to do any code inspection.
In retrospect (I still was a relatively young nerd), I should have stopped the project right there, questioned the use of CA Gen, and proposed a Windows RAD solution in .NET or Delphi. But that was then.
I translated the Delphi 3 code into Delphi 5 (the only Delphi version available at the client).
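To make the shape of that interface concrete, here is a minimal C sketch of how a caller might see the three functions. The post does not show the real signatures, so the parameter lists, the return codes and the SNA_BUFFER_SIZE value are all assumptions for illustration, and the function bodies are in-process stubs standing in for the Delphi DLL.

```c
#include <stddef.h>

/* Hypothetical size: the post only says "slightly over 8k". */
#define SNA_BUFFER_SIZE 8448

/* Hypothetical status codes, not taken from the real DLL. */
enum { SNA_OK = 0, SNA_NOT_INITIALIZED = 1, SNA_BAD_ARGUMENT = 2 };

static int sna_ready = 0;

/* Stub: the real function would set up the connection to the host. */
int InitSNA(void)
{
    sna_ready = 1;
    return SNA_OK;
}

/* Stub: the real function would send the buffer to the AS/400 and
   overwrite it with the reply. Here the buffer is left untouched. */
int CallSNA(void *buffer, size_t size)
{
    if (!sna_ready)
        return SNA_NOT_INITIALIZED;
    if (buffer == NULL || size != SNA_BUFFER_SIZE)
        return SNA_BAD_ARGUMENT;
    return SNA_OK;
}

/* Stub: the real function would tear the connection down again. */
int ExitSNA(void)
{
    sna_ready = 0;
    return SNA_OK;
}
```

With an interface like this the caller only sees an opaque buffer and a status code, so each call site has to check that status in whatever way its own call requires.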
Actually – welcome to the corporate world – it would have taken them at least 4 weeks to give me a proper account and access to the building, so
- I wrote most of the code in Delphi 7 on my laptop,
- backported it to Delphi 5,
- transferred the changes a couple of times a day by my 28k8 modem, which would go 14k4 over their corporate phone system, mostly borrowing the line of a FAX machine as their phones could not dial out after-hours, but the FAX machines could,
- each time had one of their employees download it and put it on a corporate development machine.
About 2 years ago, I was given the task of migrating the whole layer from SNA to something modern (TCP/IP based, it turned out to be WebSphere MQ).
That started out as a tough job (when switching version control systems from CVS, via MKS Integrity to Serena Dimensions, they managed to lose 75% of the source code). But with some old partial backups I had somewhere on CD-ROM, I managed to restore the full source.
The move to MQ worked out dandy, as the CA Gen stuff didn’t need any changes (I had made the 3 function interface so generic that it could survive a major overhaul), and even more important: the end users were really happy as the MQ based communication path was at least twice as fast as the Windows->SNA-Server->Enterprise-Extender->Host->AS/400 communication path.
Last year, I was given the task of migrating it over from Windows XP to Windows 7 (which meant from Delphi 2006 to Delphi XE2).
That worked out well for the DOS part (yes, some DOS apps were still being used), but not so well for the CA Gen part: it needed to be at least CA Gen 7.6 (which is still old: 2010) to survive Windows 7.
And that’s where the C code reared its ugly head, and I return from the digression:
The C code layer had quite a few pieces of code similar to the construct below, and literally thousands of manual call-site copies that were almost-but-not-entirely the same:
    char *greet()
    {
        char c[] = "Hello";
        return c;
    }
Anyone who even remotely knows C – even me – should know that returning a pointer to a local variable is a bug no-no (big/bug pun intended): the array lives on the stack, so the pointer dangles as soon as the function returns.
Back in 2005, that code must have raised some big warnings in Visual C 6, but somehow worked. I should have insisted on scrutinizing their code.
Now – compiled by the Visual Studio 2005 command-line compiler as required by Cool:Gen 7.6 – it caused all kinds of Heisenbugs crashing in the MSVCR70.DLL with memory overwrites and null reference errors.
Since Cool:Gen 7.6 does not generate Visual Studio 2005 projects – only a mix of build batch and NMAKE files – it was virtually impossible to debug the code.
With lots of luck, we found out that Cool:Gen had a debug flag in their build tool, which generated enough stack information for the Visual Studio 2005 JIT-debugger to roughly point at the C code near to the actual bugs.
When building the DLL in 2005, I had to make it threading aware, as it would otherwise occasionally fail (occasionally, as those were the days of single core CPUs with an almost zero chance of inter-thread race conditions; nowadays thread-oblivious code would have failed at one of the first tries).
Which means that a stop-gap solution like making the char array static (moving it out of the stack into global memory) would get rid of the memory overwrite, but introduce a race condition.
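A minimal sketch of that stop-gap, based on the greet example; the names greet_static and greet_name are hypothetical and not taken from the real code base, and the second function is only there to make the sharing problem visible.

```c
#include <stdio.h>

/* Stop-gap: static storage outlives the call, so the returned pointer
   no longer dangles. */
char *greet_static(void)
{
    static char c[] = "Hello";
    return c;
}

/* Hypothetical variant that builds its result at run time. Every caller,
   on every thread, shares this one buffer, so a second call overwrites
   what the first caller may still be reading. */
char *greet_name(const char *name)
{
    static char buf[64];
    snprintf(buf, sizeof buf, "Hello %s", name);
    return buf;
}
```

Even on a single thread, calling greet_name twice leaves both returned pointers referring to the same buffer, which now holds the second result; with two threads the writes can interleave, which is the race condition mentioned above.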
In the end, we decided to do the stop-gap so initial testing could start, then add char * parameters to the affected functions and all (thousands!) of their call sites.
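The final fix can be sketched like this. The post only says that char * parameters were added; the extra size parameter, the return code and the name greet_into are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copies the greeting into storage owned by the caller.
   Returns 0 on success, -1 if the buffer is missing or too small. */
int greet_into(char *out, size_t out_size)
{
    static const char text[] = "Hello";   /* read-only, safe to share */

    if (out == NULL || out_size < sizeof text)
        return -1;
    memcpy(out, text, sizeof text);       /* includes the trailing '\0' */
    return 0;
}
```

The caller owns the storage, so each thread can pass its own buffer and nothing writable is shared between calls. The price is that every call site has to declare a buffer and pass it in, which is why thousands of call sites needed touching.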
–jeroen