White Paper
C# For Real Time: Examining Execution Speed of JITted Code With CF 2.0
By Chris Tacke

Abstract

Managed code offers many advantages to developers of embedded devices that must operate 24x7, often using the advanced power management features of their CPU. Demands on software are intensified in real time systems. To keep power consumption low, software needs to be aware of processor states, including sleep, and needs to initialize in microseconds. Yet the mandate for 24x7 operation means that no shortcuts can be taken: code must not create any stacks, buffers, or reserved memory areas it does not clean up. Likewise, software for PXA-native interfaces can be complex and demand significant processing. Tests were run to compare both the speed of C# to C++ and the latency associated with transferring data across the managed/unmanaged code boundary, since the actual silicon interface is, by necessity, an unmanaged environment. Based on these tests, it appears that C# is fast enough, and the Compact Framework exhibits low enough latency, to allow the use of C#, and even VB, for many types of real time applications.
What do you see? Here, some test code toggles a GPIO every 250 nanoseconds, as fast as this scope can effectively measure. The pulses repeat with perfect regularity, indicating no interaction with a background process such as garbage collection. Is this C, C++, or C# managed code?
Applied Data Systems 10260 Old Columbia Road Columbia, Maryland 21046 Phone 301-490-4007 Fax 301-490-4582
Examining Execution Speed Of Jitted Code With CF 2.0
Introduction

I like to think of myself as a "systems" guy, not an application developer, but I love the benefits that both managed code and Visual Studio 2005 provide as a development platform. In the early days of the Compact Framework, a deep understanding of Platform Invoke (P/Invoke) and of marshaling data across the managed/unmanaged boundary was almost always required for any robust solution. I tended to split my development time between writing driver and kernel code and writing managed code to access those native resources. As a natural product of doing that kind of work, I've always been intrigued by the idea of writing a driver in managed code, so I'm always looking for ways that it might become realistic or achievable.

The harsh reality is that it's not possible to write a true driver with the Compact Framework as it is today. The Execution Engine (EE) cannot be hosted by a native process, which makes it impossible to expose the entry points required for device.exe to load an assembly. However, if we relax the definition of a "driver" a bit and consider a driver to simply be a piece of software dedicated to managing specific hardware resources, then the story changes. Of course, there are still a couple of large hurdles to overcome even for this looser definition of a driver. First is the lack of deterministic behavior in a managed code environment, and second is the purported lack of performance of managed code.

The first hurdle - the non-deterministic nature of a garbage-collected environment - has been discussed and demonstrated fairly well, and while it's something I'm still trying to work around, I feel it doesn't need much discussion here. The second hurdle - performance - is something different. I've seen it argued on several occasions that since managed code is not truly compiled and runs against the .NET Common Language Runtime, it must inherently perform worse than native code. Surprisingly, I've never actually seen anything that specifically set out to quantify the difference, so I decided that before I just accepted "common knowledge," a little testing was in order.

Lastly, I want to make it clear that I don't recommend writing a driver in managed code for a production environment. To make it work, a lot of care must be taken, right down to each line of code, and the implications of every call have to be heavily considered. The result is that with today's Compact Framework tools, it's just too fragile a system to trust on a large scale. However, as you'll see in this article, there are some extremely promising behaviors that, with just a few additions from the Compact Framework team, could indeed make managed drivers a distinct reality in the next version.
The Baseline - Toggling a GPIO with C

Before I could test managed code and have meaningful results, I needed a set of control data: how fast can some meaningful action occur in typical unmanaged code? I decided that a reasonable test would be to toggle a processor general-purpose input/output (GPIO) line as fast as possible, since it's a common action in a driver and, using an oscilloscope, it would be easy to get a quantifiable measurement of speed. So I put together the following piece of code to toggle a GPIO as fast as possible:
```c
#define GPIO3 (1 << 3)
...
DWORD *p = (DWORD*)MapAddress(0x40E00000);
DWORD *gpdr = p + (0x10 / sizeof(DWORD));
DWORD *gpsr = p + (0x18 / sizeof(DWORD));
DWORD *gpcr = p + (0x24 / sizeof(DWORD));

*gpdr |= GPIO3;

while(true)
{
    *gpsr = GPIO3;
    *gpcr = GPIO3;
}
```

For the curious, the call to MapAddress is a function that wraps VirtualAlloc and VirtualCopy in the same way MmMapIoSpace does, to get a mapped virtual address for a specified physical address. Here I passed in the base physical address of the PXA255's GPIO registers, then allocated pointers to the direction (GPDR), set (GPSR), and clear (GPCR) registers. Basically, you set the state of a bit in GPDR to determine whether that pin is an input or an output, then you set the same bit in GPSR to turn it on or in GPCR to turn it off. The measurement would be nothing more than the two calls in the while loop setting and clearing GPIO3 as fast as possible. Below is the compiler output for the previous code:

```
; 49   : while(true)
; 50   : {
; 51   :     *gpsr = GPIO3;
  0004c  e5930000    ldr   r0, [r3]
  00050  e5804000    str   r4, [r0]
; 52   :     *gpcr = GPIO3;
  00054  e5921000    ldr   r1, [r2]
  00058  e5814000    str   r4, [r1]
; 53   : }
```

You can see that the compiler has turned this into two pairs of load and store operations. While this could have been made faster by writing it in assembly, the purpose of the test wasn't to get the best possible time from unmanaged code, but to compare managed code with typical unmanaged code.
Figure 1 is a captured scope trace of the output produced by the unmanaged code, measured right on the pin of the processor. The important piece of information is that a state change (high to low or low to high) is quite consistent at about 110ns.

Figure 1 - Oscilloscope traces for unmanaged code

Using C#

Now that we've got the control measured, let's look at how we can implement the same feature (toggling GPIO3) in managed code and the speed we see from it. For my testing, I chose C# instead of VB.NET. Initially this choice was simply a matter of personal preference, but as we'll see shortly, some features available in C# but not VB.NET gave faster results. My first tests were done with a simple class that P/Invokes the VirtualAlloc and VirtualCopy APIs to map a physical address to a virtual address, just like the C code used earlier. The full class code is available in the downloadable sample, but the interesting part - the calling code - can be seen here:

```csharp
// toggle GPIO 3
int gpio3 = (1 << 3);

// map all of GPIO space
PhysicalAddressPointer pap;
pap = new PhysicalAddressPointer(0x40E00000, 0x6B);

// make GPIO3 an output
int gpdr = pap.ReadInt32(0x10);
pap.WriteInt32(gpdr | gpio3);

while(true)
{
    // turn it off
    pap.WriteInt32(gpio3, 0x24);
    // turn it on
    pap.WriteInt32(gpio3, 0x18);
}
```

An important difference between the hardware access in this code and the unmanaged code is that here the virtual address is stored as an IntPtr. This means that any reads from or writes to the address go through a call to Marshal.Copy instead of directly to the pointer address as we could do in C. Intuitively, I felt this was going to add some overhead, and the resulting scope trace, seen in Figure 2, shows that it is indeed slower - almost seven times slower.
Figure 2 - First test with C#

Even though the managed code was significantly slower, it was very consistent, and it was still faster than I had expected, considering the managed code had to make a function call to the Marshal class, which then had to marshal the data to the IntPtr address location. The question remained: how much of the difference is the overhead of the extra calls, and how much can be attributed to the Common Language Runtime (CLR) itself? To determine that, I needed a better way to get at the hardware address, something that takes the IntPtr and Marshal calls out of the picture. This is where I had to turn to a C#-specific feature: unsafe code. Unsafe code simply means that if I set a specific compiler option, I'm allowed to allocate and use pointers in my managed code.

To use a pointer I had to make a slight modification to the PhysicalAddressPointer class, giving external classes access to the internal virtual address as a uint* via the IntPtr.ToPointer method. Using the newly exposed function, I modified my test code to look like this:

```csharp
// toggle GPIO 3
int gpio3 = (1 << 3);

// map all of GPIO space
PhysicalAddressPointer pap;
pap = new PhysicalAddressPointer(0x40E00000, 0x6B);

unsafe
{
    int *p = (int*)pap.GetUnsafePointer();
    int *gpdr = p + (0x10 / 4);
    int *gpsr = p + (0x18 / 4);
    int *gpcr = p + (0x24 / 4);

    // make GPIO3 an output
    *gpdr |= gpio3;

    while(true)
    {
        *gpsr = gpio3;
        *gpcr = gpio3;
    }
}
```

When I measured the state changes this time, I was pleasantly surprised. The traces, seen in Figure 3, were identical to the unmanaged traces, meaning the CLR was adding no measurable overhead to the hardware access. Yes, you read that right - the managed code implementation was just as fast as the native implementation. All of the latency measured in the first managed code test lay in the overhead of the call to the Marshal class.
Figure 3 - C# with improved interface to platform

My longer-term goal was to make access to the hardware a little more user friendly by providing a wrapper class for the entire PXA255 processor, while keeping maximum performance as a goal. I wanted VB.NET developers to have the same advantages that C# developers would get, so I did some rethinking on how to get at the virtual address without going through the Marshal class. My first thought was to create a structure that would map its members directly to the registers in the processor, and then pin it in memory. Unfortunately, even with a pinned struct, you're still relegated to using the Marshal class for passing data to the mapped target address. I then decided that if the PXA255 wrapper used unsafe pointers internally, wrapped by CLS-compliant properties, VB developers would be able to directly access hardware and benefit from the speed of unsafe code. I put together a comparable test using the PXA25x class and checked the performance with the scope:

```csharp
PXA25x pxa = new PXA25x();

// set gpio3 as an output
pxa.GPIO.GPDR0 |= PXA25x.GPIO3;

while(true)
{
    // set the pin
    pxa.GPIO.GPSR0 = PXA25x.GPIO3;
    // clear the pin
    pxa.GPIO.GPCR0 = PXA25x.GPIO3;
}
```
Once again I was surprised by the result, but this time the surprise wasn't pleasant. Even though the class was using unsafe pointers internally, the results showed a similarly large latency (see Figure 4). It appeared that it wasn't the internals of the Marshal class that were the performance hit after all; it was simply the fact that a method call was being made.
Figure 4 - Latency of the method call
The last step was to verify that hunch, so I wrote one last bit of test code using the PXA25x class, but this time retrieving its internal pointer and using the pointer locally in the test. Of course this isn't VB-accessible, but it would prove the theory about the location of the performance bottleneck.

```csharp
PXA25x pxa = new PXA25x();

unsafe
{
    uint *gpio = pxa.GetGPIORegistersUnsafePointer();
    uint *gpsr = gpio + (0x18 / 4);
    uint *gpcr = gpio + (0x24 / 4);
    uint *gpdr = gpio + (0x10 / 4);

    // make GPIO an output
    *gpdr |= PXA25x.GPIO3;

    while(true)
    {
        // set the pin
        *gpsr = PXA25x.GPIO3;
        *gpcr = PXA25x.GPIO3;
    }
}
```
In Figure 5 you can see that using the pointer again provided the same level of performance as the unmanaged code, proving that the expense is simply the method call itself, not any inherent bottleneck in the Marshal class. In fact, this shows that the Marshal class is internally quite performant, adding very little overhead beyond the call into it.
Figure 5 - Managed code at full performance
Conclusion

So you see, the performance difference between managed code and unmanaged code can be negligible if, when you are writing the code, you are cognizant of the behavior characteristics of the managed environment. What does that buy us as a community of developers? Potentially, the implications are immense. I know I said in the introduction that I don't recommend writing drivers in managed code, but as it stands right now we easily have the performance required to write device drivers for many devices that are tolerant of potentially large, but typically rare, latencies. Things like I²C or SPI synchronous serial buses or other GPIO devices are well within the realm of possibility. We've also seen that managed code can perform equally to unmanaged code, so if we can find a way to eke out deterministic behavior from the CLR, then we easily have the performance required for a whole host of devices.

With a little ingenuity on our part and a little cooperation from those developing future versions of managed compilers and specifications, writing device drivers in managed code could become a commonplace task. I'm not advocating that we do away with unmanaged code; it certainly has its place today, and will for the foreseeable future, just as assembly still does. But we don't need to fear managed developers playing with hardware any more than the assembly developers of yesteryear needed to worry about C and C++ developers. Change is what has always driven the industry, and what I do advocate is embracing that change, because it looks like it's going to be fun.
This work was originally published by Microsoft in MSDN (Microsoft Developer Network; http://msdn1.microsoft.com/en-us/default.aspx). Copyright © 2006. Reprinted here with permission. All Rights Reserved. Any trademarks used within are the property of their respective owners. This document contains technical descriptions that may not be representative of