Insight
Insight
Understand the Machine to Write Efficient Code How many programmers can actually write assembly programs? With the rising popularity of high-level languages (like Java, VB.Net, etc), there is rarely any need for programmers to learn assembly or low-level programming. However, there are domains where writing efficient code is very important, for example, game programming and scientific computing.
W
hen can we write highly efficient code? It is when we understand how the underlying machine works and make best use of that knowledge. One well-known way to write highly efficient code is to write code in assembly. There are many disadvantages with this; for example, we cannot port the programs easily, it is difficult to maintain the code, etc. So, we need to look
12
February 2008
|
LINUX For You
|
www.openITis.com
cmyk
for alternatives to write efficient code. We can write in low-level programming languages like C to get write code, whose efficiency is often comparable to the equivalent code written in assembly. For this reason, C is often referred as a ‘high-level assembler’. In this article, we’ll look at various programming constructs from the perspective of efficiency. We’ll consider general machine architecture for illustration; and for specific
Insight
examples, x86 architecture will be used. A word of caution before we proceed: the techniques and issues covered are not for general-purpose programming.
Basic types The machine, in general, understands only three types of values: address, integer and floatingpoint values. For representation and manipulation, here is the correspondence between the types that the machine can understand and what C supports. Addresses correspond to the pointer construct; integers—both signed and unsigned—correspond to short, int, long, long long, char (yes, a char is represented as an int internally!) etc; floating-point types correspond to float, double, long double, etc. The most efficient data-type that a processor handles is a ‘word’, which corresponds to ‘int’ type in C. For floating-point types, all computation is typically done in a larger floating-point type. For example, in x86 machines, all floating-point computation is done in ‘extended precision’, which is 80 bits in size and usually corresponds to ‘long double’ in C; if floating point expressions are used in the code, they are internally converted to extended precision by the processor and the results are converted back to float (which occupies 32 bits). The processor does floating-point computations in a separate ‘coprocessor’ unit. Address computation (such as array index access) is done using pointers in C. They directly correspond to memory access operations in the underlying machine (such as index addressing, in this case). The ‘int’ type is the most efficient for computations. Unsigned types and operations on that are as efficient as signed types. Floating point types and operations are slow compared to integral types. For example, in an imaginary processor, if integer division takes four cycles,
floating point division operation might take 50 to 100 cycles. Memory access operations are very slow—if the desired memory location is not in cache, then it might take hundreds of cycles to fetch the data from the main memory.
Operators C supports a rich set of operators. There are few operators that are directly supported by the processor and there are a few that are simulated in the software. For integral types, bitmanipulation operators are faster compared to other operators like arithmetic, logical or relational operators. One of the ways to write efficient code is to write code using bitwise operations instead of other slower operations. Here is a wellknown example: Using ‘<<’ is more operator efficient than dividing an integer value by 2. We’ll look at a different example for illustration here. A typical code segment for toggling a character’s case is to use relational operators, as in: // precondition: the char ch provided is in range [‘a’-‘z’] or [‘A’-‘Z’] char toggle_ascii_char_case(char ch) {
if( (ch >= ‘a’) && (ch <= ‘z’)
)
// lower case
ch = ch - 0x20;
else if( (ch >= ‘A’) && (ch <=
‘Z’) ) // upper case
ch = ch + 0x20;
return ch;
}
The code works on the following assumption: the given char ch is within the range [a-z] or [A-Z]. If the char is [A-Z], it returns the corresponding char in [a-z] and vice versa, that is, it toggles the case of the character. The value 0x20 is added or subtracted based on the fact that the alphabetic characters are separated by the hex value 0x20 in the ASCII table. But this function is slow. If this is a library function used for www.openITis.com
cmyk
toggling the case of characters in a string, then it is not a good implementation. The code can be improved as follows: since the comparison operators are not required, the function precondition says that ch passed to the function is in the given range [‘a’-‘z’] or [‘A’-‘Z’]. Based on C tradition, we need not check to ensure that the given char is in fact in this range. Also, since we are performing either the ‘-’ or ‘+’ arithmetic operation, we can replace it with bit-wise operations for toggling the bit using the exor operator. With this the code becomes efficient and simple: // precondition: the char ch provided is in range [‘a’-‘z’] or [‘A’-‘Z’] char toggle_ascii_char_case(char ch) {
return (ch ^= 0x20);
}
This example is just for illustration purposes. Using bit-wise operators obscures the code, but it usually significantly improves the efficiency of the code.
Control flow C has various conditional and looping constructs. A C compiler transforms such code constructs to branching (also known as ‘jump’) instructions. So, goto is the most straightforward construct for programming. It is possible to take any C program and create an equivalent program by removing all conditions and loops with just goto statements. Though it is ultimately branching instructions, there are subtle differences in constructs when it comes to efficiency. Which one is more efficient— nested if conditions or switch statements? In general, a switch is more efficient than nested if statements. If both are implemented using branching, why is switch more efficient than nested if conditions? Recall that, in a switch statement, all cases are constants. So, a
|
LINUX For You
|
february 2008
13
Insight
compiler can transform a switch statement to a range or look-up table, which is more efficient than a long list of jumps. Note that executing jump instructions is not costly, but unpredictable jumps can result in considerably slower execution. For example, frequent jumps can result in the flushing of the pipeline. Similarly, processors typically look ahead in the instruction stream and pre-fetch necessary memory accesses and put it in cache. So, unpredictable jumps can result in memory faults, which will result in wasting hundreds of cycles since the processor has to wait for the memory value to be available for it to continue execution. In general, a program with less number of branches is faster than those that have a large number of branches.
more efficient? for(i = 0; i< 50; i++) for(j = 0; j< 50; j++) for(k = 0; k< 50; k++)
for(i = 0; i< 50; i++) for(j = 0; j< 50; j++)
|
LINUX For You
printf(“%d ”,
a[i][j][k]);
C has arrays implemented in row-major order, that is, the same way it is organised in the hardware. The second loop is more efficient because it accesses memory locations sequentially, in row-major order. The processor will fetch the memory blocks into cache, and since the memory access is also sequential, this is efficient. However, in the first loop, the memory access is not sequential and hence there might be a lot of memory faults and hence it will be considerably slower than the second loop. From these examples we learn that it is important to keep in mind that memory faults are costly and we need to minimise such memory faults to write efficient code.
As said earlier, memory access is a costly operation. Let us take a specific example to illustrate this. Typically, it takes the same time to access global data or local (stack allocated) data. However, it is preferable to use local data instead of global data because of the well-known ‘principle of locality’. If a memory location is accessed, the processor doesn’t fetch value in just that memory location; it fetches many values adjacent to that memory location since the program is likely to access variables that are located near that memory location. Fetching a block of data and putting it in cache is not time consuming, but if there is a memory fault, it can lead to the processor waiting for hundreds of cycles for that memory access to happen. If there are large numbers of global variables and their accesses are spread throughout the program, then the program execution becomes considerably slower. Let us consider another (wellknown) example to illustrate this important ‘principle of locality’. Of the following two loops, which one is February 2008
for(k = 0; k< 50; k++)
Memory access
14
printf(“%d ”,
a[k][j][i]);
Compound types C supports compound types like structs, unions and bit-fields that are implemented in terms of other primitive or compound types. The hardware does not understand any compound types, and all processing is done on primitive types only. There are many aspects—such as padding and alignment—in using compound types that can affect the performance. Here, we’ll look at an example of bit-fields to understand how it is supported by the hardware. In C, we can manipulate and access bits using bit-fields. It is a syntax error to attempt taking the address of a bit-field member. Why? The granularity of addressing in modern computers is in bytes and
|
www.openITis.com
cmyk
it is not possible to take address of specific bits in a byte. Though it is space-efficient to use bit-fields, it is not time-efficient since it is not possible to access individual bits; so the compiler emits code to access ‘word’s and then does bitmanipulation to access individual bit-field member values. The following bit-field struct is to represent time in a day in HH:MM: SS format. For the hour, the range is 0-23 and for minutes and seconds, the range is 0-59. We can use the following struct: struct time {
unsigned int hour : 5;
unsigned int minute: 6;
unsigned int second: 6;
} tm1, tm2;
To access tm.minute, the compiler has to generate code to access the word (4 bytes) and do bitmanipulation and access only the 5th to 10th bit in that word, which is slow. So, avoid using bit-fields extensively if performance is important for your software. A better option in this case is to use a struct (without bit-fields) with a byte each for the hour, minute and second, respectively. In this article, we explored some fundamental issues to understand how various programming constructs can affect the efficiency of programs. There are many other issues, such as unaligned memory access, cost of I/O operations, etc, that are not covered in this article. This article is just a starting point to understand such problems. If you are interested, you can read books on assembly language, computer architecture and compiler optimisation to get a better understanding of the issues related to writing efficient programs. By: S G Ganesh is a research engineer in Siemens (Corporate Technology), Bangalore. His latest book is, ‘60 Tips on Object Oriented Programming’, published by Tata McGraw-Hill in December 2007. You can reach him at
[email protected]