Embedded Systems
Lab Experience Guide
Speeding up the execution of the Viterbi decoding algorithm by adding custom instructions to a Nios-based system
Matteo Bosio 148451 Patrick Chiapello 152123
Note: this experience was prepared for a 3-hour lab. The duration given at the beginning of each section is the time that step should reasonably take if the whole experience is to be finished in three hours. If you complete the first, fully guided sections quickly, you will have more time for the part in which you actually implement your own work (the sections from “Profiling the code” on).

Introduction (5 mins)
The aim of this experience is to become familiar with adding extra instructions to an FPGA-based microprocessor in order to increase the performance of the CPU on a specific application (in this case, the Viterbi decoding algorithm). The Viterbi decoding algorithm is a fairly complex procedure, which may be hard to understand for people who are not involved in signal processing; however, you do NOT need to understand the steps of the algorithm in depth! All you have to do is identify the most time-consuming part of the procedure and then create some instructions that speed up its execution. If you are interested in the details of the Viterbi algorithm, you can refer to the web page: http://home.netcom.com/%7Echip.f/viterbi/tutorial.html

The code used as a basis for the experience is the one you can find at the above link; however, it was necessary to modify it, since it was too memory-consuming to fit in our system. We had to drop the convolutional encoding part and the addition of Gaussian white noise (the basic Nios processor doesn't have a floating-point unit!), so we simply feed the decoding function with a sequence of -1 and +1 values. The code you will work on is: files\viterbi.c
Setting up the system (40 mins)
In order to execute the code, we need to create a system on the FPGA board with the following components:
• Nios CPU: the CPU that actually executes your program
• On-chip memory: the memory in which our program and our variables are stored
• JTAG UART: allows you to communicate with the PC (vital for debugging!)
• Interval timer: useful for code profiling
To begin the design, open the Quartus II development environment. From the File menu select “New Project Wizard” and click Next. From Explorer, create a working directory (for example c:\viterbi). Pay attention: do not use paths that contain blank spaces! In the Quartus Project Wizard, select the directory you created as the working directory, and choose “nios_system” as both the project name and the top-level entity name. Then press Next, and skip the second tab by pressing Next again.
As FPGA, choose the Cyclone II EP2C35F672C6 and then press the Finish button. Now that the project is set up, we have to add the Nios subsystem. Click on Tools → SOPC Builder and choose “nios_system” as System Name. Be careful to choose VHDL as the implementation language. On the left side of the tool you find all the components you can add to the system. The first operation is adding a Nios II processor: look for the component in the left-side list and add it to the design by double-clicking it. In the wizard that follows, select Nios II/e (RISC, 32-bit), then press Finish. Following the same procedure, add to your system the “jtag uart” component (“Communication” menu), 45 kB of on-chip memory (“Memory” menu) and two Interval timers (“Other” menu). Now your system is almost ready. In the System menu press “Auto-Assign Base Addresses”, then press Generate to create your Nios II system.

Before implementing the design we need to assign pin locations for at least two important signals: clock and reset. To do so, the system must first go through the Synthesis and Implementation process, so press the Start Compilation button in the Quartus environment; you can stop the process as soon as the Synthesis and Implementation phase is completed. Now, from Quartus open Assignments → Pin Planner, select the location PIN_N2 for the clock and reserve it as a tri-state input. The reset signal is the Key[0] push button at location PIN_G26, and it is an input as well.

We are now ready for the implementation. Close the Pin Planner and press the Start Compilation button. If everything is all right, you should obtain a successful compilation, and your design is ready to be tested. However, since our system contains a processor, to test it we need to give the processor something to do. The next section explains how.
Running the software (20 mins)
Now that we have complete (and hopefully functional) hardware, we want to write some software to use it. From the SOPC Builder interface it is possible to launch the software IDE (Tools menu). When the tool prompts you for the project folder location, choose the folder you created earlier from Explorer (for example c:\viterbi). From the File menu choose New Project, select Altera Nios II C/C++ Application and click Next. Give a name to your software, select Hello_world_small in Select Project Template, then click Finish. Now select all the lines in the .c file and replace them with the code in the file files\viterbi.c. From the Project menu select Build All to create the executable for your processor.

Up to now we have built a system from scratch and compiled some software for it. Our next task is to find out whether what we have done works. The first step is to connect the DE2 board (Blaster port) to a USB port of your lab PC. Then connect the power supply and press the red button on the board. If you see a lot of LEDs flashing and the message “Welcome to the Altera DE2 board” on the display, you are ready to test your design.

The next step is programming the FPGA with your design. From Quartus open Tools → Programmer, press Hardware Setup and look for the USB-Blaster interface. Then mark the Program/Configure flag and press Start. On success, a window labeled OpenCore Plus Status appears, telling you that your system is running. Do not close this window, or your system will stop working.

From the Nios IDE you can now try to execute your software. Open the Run menu, select Run As and then Nios Hardware. The program calls the function sdvd() 1000 times, and each time the function is called it prints an increasing number (from 0 to 999). At the end you will see a sequence of zeros and ones, which is the decoded sequence. Compare it with the bit sequence contained in files\bit_sequence.txt and check that they are identical.

Note: if you press the Key[0] button, your program restarts from the beginning, since you assigned the reset to the pin associated with this button.
Profiling the code (15 mins)
It is now time to use the timers we instantiated when creating our system. Right-click on your project folder in the left-side panel, then click System Libraries Properties. Select “Link with Profiling Libraries” and “Reduced device drivers”, then set timer_0 as System clock timer and timer_1 as Timestamp timer. Now build all and run the program. The output of the profiling process is stored in the file <project_name>\Debug\gmon.out; to read it, double-click on the file.
Adding the custom instructions (1 h 10 mins)
Having profiled the code, you should be able to identify its most time-consuming part, and it is now time to speed it up. To do so, we need to add some custom instructions to the Nios processor. The Nios II embedded processor allows its instruction set to be extended by mapping custom instructions onto user-designed hardware. These instructions can then be used directly in C, through wrappers provided by the development tools. Custom hardware instructions are usually connected to the ALU as shown in the figure below:
You can add both combinatorial and sequential instructions (the latter, of course, need more input ports, such as clk, reset...); our suggestion is to use combinatorial ones, since they are the easiest to implement. The custom logic that you add in parallel to the CPU to execute your instructions has to be described in .vhd files, which must have well-defined ports. You can find a trivial example in files\adder.vhd, where a simple adder is implemented; your instructions must follow the same structure.

Once you have written the .vhd blocks that implement your instructions, you have to add them to the system. You have two choices: you can either modify the system you have already built, or build a new system from scratch (which might be better). Whichever you choose, you must open the SOPC Builder, either in the old or in the new project. The components you need to instantiate are the same as in the previous project, but this time, when adding the Nios processor, use the Custom Instructions tab.

Note: the procedure described below has to be repeated for each .vhd file you have written.
1. Click on the Custom Instructions tab.
2. Click Import... The Interface to User Logic wizard appears. This wizard is used to import Nios II custom instruction logic. To import custom instruction logic into the system, you must:
• Add HDL source files to the list.
• Specify the top-level module.
• Read in the port list.
3. Click Add. A Windows Explorer dialog box appears. Click on the src folder.
4. Choose your .vhd file.
5. Click Open to select the .vhd file and return to the Interface to User Logic wizard.
6. Click the Read port-list from files button. This reads the port information from your files.
7. Click Add to System to complete the custom instruction import.
8. Click Finish to add the custom instruction to the system and return to the SOPC Builder window.

Note: whenever you add, in C, an integer to a variable defined as int*, the compiler automatically multiplies that integer by 4 before adding it. This is because an integer is 32 bits wide, that is, 4 bytes. Watch out when you implement this kind of instruction in VHDL!

When you are done, regenerate the system and follow the steps of the section “Setting up the system” once again.
Modifying the software (20 mins)
Now that the new system has been compiled, the software must be modified so that it exploits our new set of instructions. Open the IDE and, with the same code used above, clean the project (Project menu → Clean), then build it again (Project menu → Build All). In the left-side panel, expand the folder named <project_name>_syslib, then the folder Release, then the folder system_description, and double-click on system.h to open it. Scroll down the file until you reach the “custom instruction macros” section. Here you can find the macros associated with the custom instructions you have just added to the processor. To use them in the C code, just call them with the right arguments. You can now rebuild your code and try to run it.
Profiling the code on the customized system (10 mins)
To profile the code on your new system, follow the same procedure used for profiling the first program. Now compare the results of the two profiling runs: have you been able to gain performance? If the answer is yes, you have successfully completed this experience.

Note: to give an idea of the result you can get, look at the following tables. They show an extract of the profiling output, with the most time-consuming functions. By adding five custom instructions, we managed to accelerate the execution of the kernel_sdvd function by about 18%.

Execution with no custom instructions

Flat profile:
Each sample counts as 0.001 seconds.
     % cumulative     self              self    total
  time    seconds  seconds    calls  ms/call  ms/call  name
 58.05      25.67    25.67      500    51.34    53.82  sdvd
 20.68      34.81     9.14   795000     0.01     0.01  kernel_sdvd
  6.11      37.52     2.70                             udivmodsi4
  3.23      38.94     1.43                             __muldi3
  2.73      40.15     1.21                             __mulsi3
  1.50      40.81     0.66   202000     0.00     0.00  soft_quant
  1.10      41.30     0.49                             sbrk
  1.04      41.76     0.46                             __divsi3
  0.93      42.17     0.41                             __floatsidf
  0.70      42.48     0.31      500     0.62     0.62  init_adaptive_quant
  0.48      42.69     0.21                             __modsi3
  0.40      42.87     0.18                             __divdf3
  0.38      43.04     0.17     2320     0.07     0.07  alt_avalon_jtag_uart_write
  0.27      43.16     0.12                             _start
  0.21      43.25     0.09                             __unpack_d
  0.20      43.34     0.09                             __pack_d
  0.19      43.42     0.08        1    82.68   129.51  alt_sim_halt
  0.18      43.50     0.08                             alt_dev_null_write
  0.17      43.57     0.07     7000     0.01     0.01  deci2bin
  0.15      43.64     0.07     6000     0.01     0.01  soft_metric
  0.15      43.71     0.07     8000     0.01     0.01  bin2deci
  0.13      43.76     0.06     4000     0.01     0.03  nxt_stat
  0.11      43.81     0.05                             __fixdfsi
  0.11      43.86     0.05        1    46.83    46.83  alt_busy_sleep
  0.10      43.90     0.04        1    42.59   172.10  exit
  0.09      43.94     0.04                             __muldf3
  0.08      43.98     0.04                             pow
  0.05      44.00     0.02                             __ieee754_pow
  0.05      44.02     0.02                             isnan
  0.05      44.05     0.02                             __eqdf2
  0.05      44.07     0.02     2320     0.01     0.08  write
  0.05      44.09     0.02                             __fpcmp_parts_d
  0.04      44.11     0.02                             __sfvwrite_small_dev
  0.03      44.12     0.02                             ___vfprintf_internal_r
  0.03      44.14     0.01                             __lshrdi3
  0.02      44.15     0.01                             __udivsi3
  0.02      44.15     0.01                             _write_r
  0.01      44.16     0.01        1     5.37     5.37  main
  0.01      44.16     0.01                             __nedf2
  0.01      44.17     0.00                             _malloc_r

Execution with custom instructions

Flat profile:
Each sample counts as 0.001 seconds.
     % cumulative     self              self    total
  time    seconds  seconds    calls  ms/call  ms/call  name
 62.21      25.46    25.46      500    50.93    53.59  sdvd
 18.32      32.96     7.50   795000     0.01     0.01  kernel_sdvd
  6.75      35.73     2.76                             udivmodsi4
  2.28      36.66     0.93                             __mulsi3
  1.62      37.32     0.66   202000     0.00     0.00  soft_quant
  1.20      37.81     0.49                             __divsi3
  1.11      38.27     0.45                             __floatsidf
  1.07      38.71     0.44                             sbrk
  0.79      39.03     0.33      500     0.65     0.65  init_adaptive_quant
  0.55      39.25     0.22                             __muldi3
  0.47      39.45     0.19                             __divdf3
  0.38      39.60     0.15                             __modsi3
  0.31      39.73     0.12        1   124.97   124.97  alt_busy_sleep
  0.30      39.85     0.12        1   123.97   248.94  alt_sim_halt
  0.27      39.96     0.11     1000     0.11     0.11  __malloc_unlock
  0.27      40.07     0.11                             __unpack_d
  0.27      40.18     0.11                             _start
  0.23      40.28     0.09     6000     0.02     0.02  soft_metric
  0.22      40.36     0.09     2320     0.04     0.04  alt_avalon_jtag_uart_write
  0.19      40.44     0.08                             __pack_d
  0.18      40.52     0.08     7000     0.01     0.01  deci2bin
  0.15      40.58     0.06     8000     0.01     0.01  bin2deci
  0.13      40.63     0.05     4000     0.01     0.03  nxt_stat
  0.10      40.67     0.04                             __muldf3
  0.07      40.70     0.03                             __ieee754_pow
  0.06      40.73     0.02                             __fixdfsi
  0.05      40.75     0.02                             __umodsi3
  0.05      40.77     0.02                             __eqdf2
  0.05      40.79     0.02                             __sfvwrite_small_dev
  0.04      40.80     0.02                             pow
  0.04      40.82     0.02                             __fpcmp_parts_d
  0.04      40.84     0.01                             ___vfprintf_internal_r
  0.03      40.85     0.01                             __udivsi3
  0.03      40.86     0.01                             __lshrdi3
  0.03      40.87     0.01                             isnan
  0.02      40.88     0.01     2320     0.00     0.04  write
  0.01      40.89     0.00                             rint