Class 543
Embedded Systems Conference
San Francisco, CA
March 2002
Multiprocessor Real-Time Analysis for Scan-Based Emulation: A Methodology

Debbie Keil, Systems Software Architect, Texas Instruments, Inc.
Prithvi Rao, Systems Software Developer, Texas Instruments, Inc.
Abstract: Scan-based emulation is a pervasive method for debugging and developing DSP applications. It relies on scanning data out from the target to the host computer using proprietary emulation hardware that is designed into the DSP core. Emulation support in our single processor and multiprocessor environments comprises software and hardware on both the target and the host, and therefore provides a transport mechanism upon which to base real-time analysis. In this paper we present an end-to-end methodology for developing a multiprocessor real-time analysis capability built on JTAG scan-based mechanisms. Specifically, we discuss the issues related to extending the capability from single to multiprocessor domains. We examine the issues related to supporting real-time analysis in all of the software and hardware components. Finally, we enumerate the challenges in developing a methodology for deploying both homogeneous and heterogeneous multiprocessor combinations.

1. Introduction

The ability to analyze the proper execution of real-time applications in an embedded system is critical to their development and deployment. This applies to real-time applications ranging from mission critical to multimedia. In embedded systems, the ability to perform Real-Time Analysis (RTA) can involve a dedicated hardware and software capability with an end-to-end methodology that supports transferring data between the host and the target in a lossless and reliable manner. Specifically, the Real-Time Analysis encompassed by this methodology consists of capturing data from a target application using dedicated hardware, transferring it through various layers of software dedicated to creating a real-time path, and making it available to a host application for analysis. Analysis includes determining whether applications meet both timing and logical correctness requirements.
In this paper we present an end-to-end methodology for developing a multiprocessor real-time analysis capability built on JTAG scan-based mechanisms.
Scan-based emulation is a pervasive method that is deployed to debug, develop and analyze real-time applications running on DSPs. The JTAG boundary scan specification permits the connecting of multiple devices in a serial daisy-chained arrangement. Section 2 covers scan-based emulation in detail and sets the stage for RTA. The RTA hardware and software architecture that our methodology relies upon is presented in Section 3. It also includes an application that illustrates the necessity of RTA in a multiprocessor environment. The end-to-end methodology is discussed in Section 4. Specifically we discuss the issues related to extending the capability from single to multiprocessor domains. In Section 5 we enumerate the challenges in developing a methodology for deploying both homogeneous and heterogeneous multiprocessor combinations. Our conclusions are presented in Section 6.

2. Scan-Based Emulation

With a traditional emulator, the CPU to be emulated is usually removed from its socket and replaced with an emulator pod. The emulator pod typically has a replacement CPU, plus various amounts of random logic and memory to monitor what is happening on the CPU pins. With modern CPUs such as DSPs, the traditional approach has several problems.

The first problem is the speed of newer DSP chips. Bus cycle times can be 25 nanoseconds or shorter, and all instructions typically execute in a single cycle. This makes it difficult for a traditional emulator to allow emulation at full speed. The number of pins to monitor can be staggering, with chips having multiple 32-bit address and data buses, making a traditional emulator expensive.

The second, and more serious, problem is that DSPs often have on-chip caches, pipelines, memory and peripherals. Sometimes a whole algorithm can execute without any activity on the CPU pins. The solution to these problems is scan-based emulation.
With scan-based emulation, the CPU is never removed from the socket; in fact it can be soldered directly onto the board. Instead, the CPU has a serial scan interface, allowing the emulator to scan the internals of the device through a standard connector. The pinout for this connector is defined by a standards committee, making it possible to support several devices with a single emulator [1]. The scan-based approach to emulation has many advantages:

• Emulation at full device speed - Since there is no logic needed to monitor what happens on the CPU bus, the emulator can allow the device to execute programs at full speed.
• Non-intrusive emulation - Since no logic is attached to any CPU pin, the CPU bus is not affected at all by the emulation process. The emulator will not affect the operation of the bus, as is so often the case with traditional emulators.
• In-circuit emulation - The CPU can be soldered to the board while emulating. This makes denser packaging possible, and also makes the emulator a manufacturing test tool.
• Full access to internal memory, caches, pipelines and registers - The complete state of the processor is visible to the outside through a scan interface.
• Complete access to the system from the CPU - Any peripheral or memory that the CPU can access in the system can also be accessed through the scan interface. The emulator can look at the system "through the eyes of the CPU". This makes it possible to debug and diagnose a system where nothing is working except the CPU itself.
2.1 The JTAG Interface

The Joint Test Action Group (JTAG) defines an interface, called the JTAG interface, for testing individual devices on printed circuit boards without the need to remove the devices from the board. This is accomplished by a method called boundary scan, whereby the state of each pin of each device (with some special logic on the device) is serially scanned out from the device. Multiple devices can be daisy chained, and an entire PC board can therefore be scanned in a single scan chain. The same method can be used to scan out not only the state of a device's pins but also any internal information from the device, such as register values and memory locations; as a consequence, scan-based emulation was born.

The JTAG specification does not include the pinout for the JTAG connector. The extension to JTAG defines a 14-pin, 2-row, 0.1" spacing JTAG connector header, with pinout and physical dimensions common to all DSPs that support JTAG involved in this methodology [2]. During JTAG emulation, the emulator supplies the clock that scans the device. This means that the target clock speed is completely independent of the emulation clock, and the emulator can support targets running at any clock speed.

2.2 The Boundary Scan Mechanism

2.2.1 Device Architecture

The JTAG device architecture is based on the IEEE 1149.1 architecture. In this specification, there are four mandatory pins and one optional pin, collectively known as the Test Access Port (TAP). They are:

• Test Data In (TDI)
• Test Data Out (TDO)
• Test Mode Select (TMS)
• Test Clock (TCK)
• Test Reset (TRST, optional)

A boundary scan cell is connected to each pin of each device being scanned, and together these cells form the device's boundary scan register. The architecture further specifies a finite state machine TAP controller with inputs TMS and TCK. There is an Instruction Register (IR) holding the current instruction, a bypass register, and an optional 32-bit identification register for permanent identification.

2.2.2 Principles of Boundary Scan

Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel load operations cause signals from the core logic to be loaded into the output cells. Parallel unload operations cause the signals to be loaded from the input cells to the core logic. Data is shifted in serial mode by daisy chaining devices. The figure below shows the TDO of each device connected to the TDI of the next device in the scan chain.

Loosely Coupled DSP Arrangement (Multiple Boards)
[Figure: Target Board 1 and Target Board 2, each carrying devices chained TDI to TDO, attached through a 14-pin JTAG header to a JTAG splitter card whose TDO returns to the host computer.]
Figure 1 Boundary Scan
In a homogeneous multiprocessor environment all devices have the same emulation hardware with the same scan chain length. In a heterogeneous environment, devices have different emulation hardware, resulting in varying scan chain lengths. It is possible to avoid scanning any device by placing it in bypass mode. Typically, the system architect is responsible for determining the type of arrangement of devices (homogeneous or heterogeneous), their order in the scan chain, and whether they will be placed in bypass.
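The effect of bypass on a daisy chain can be illustrated with a small simulation. This is a minimal sketch, not the emulator's actual scan logic: the `Device` class, the device names, and the scan lengths are hypothetical, and the only external fact relied upon is that the IEEE 1149.1 bypass register is one bit long. The chain is modeled as a single shift register whose length is the sum of the lengths each device contributes.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    scan_bits: int          # hypothetical length of the device's selected scan register
    bypassed: bool = False  # a bypassed device contributes a single-bit register

    def length(self) -> int:
        return 1 if self.bypassed else self.scan_bits


def chain_length(chain):
    """Total number of bits between the chain's TDI and TDO."""
    return sum(device.length() for device in chain)


def shift_through(chain, bits_in):
    """Clock bits in at the chain's TDI and return the bits seen at its TDO."""
    register = [0] * chain_length(chain)
    bits_out = []
    for bit in bits_in:
        bits_out.append(register[-1])      # bit currently presented at TDO
        register = [bit] + register[:-1]   # shift one position toward TDO
    return bits_out


# Heterogeneous chain: the two devices have different (made-up) scan lengths.
chain = [Device("dsp_a", scan_bits=8), Device("dsp_b", scan_bits=4)]
full = chain_length(chain)

# Placing dsp_b in bypass shortens the path the data must travel.
chain[1].bypassed = True
shortened = chain_length(chain)
```

In this model a bit entering at TDI appears at TDO after as many clocks as the chain is long, so shortening the chain by bypassing devices reduces the number of clocks needed to reach the device of interest.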
3. Real-Time Analysis (RTA)

The following application in the domain of high energy physics illustrates the necessity for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron Collider generates 15 million particle collisions per second. These particle collisions result in the creation of subatomic particles that travel through a spectrometer. The data output from the spectrometer is on the order of terabytes per second and must be analyzed in real time. The analysis engine comprises a massively parallel arrangement of heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of applying algorithms that reconstruct and filter the collision data. The result is a select set of interesting collisions from which physicists can study some of the remaining mysteries of matter and antimatter in the universe [3].

Analysis of real-time embedded applications is necessary at several points during the software life cycle: during development, as a means to debug; towards the end of development, for tuning performance; and after the application is deployed, for failure analysis.

Historically, several different methods have been employed to debug and analyze real-time embedded applications. Traditional debuggers were used to set breakpoints that stop the target application so that the memory state could be examined. This method has proven to be inadequate for most real-time applications because setting breakpoints halts the application and therefore interferes with the timing constraints of the system. The memory state is not guaranteed to contain reliable results.

Logic analyzers have been used for many years to clamp onto the data busses of the target and monitor the data flow of the application in order to analyze application behavior. Aside from the fact that logic analyzers are expensive ($15K to $60K for a DSP), the increase in system-level integration over the years has resulted in fewer exposed data paths for the logic analyzer to monitor.
In some cases, pre-production versions of the chips containing in-circuit emulation (ICE) structures were manufactured. These could be used to debug real-time applications. However, since the debugging environment is not equivalent to the final production environment, the application’s performance cannot be guaranteed to remain the same from the ICE version to the final chip.

Most modern microprocessors are architected with specialized hardware counters that can be programmed for the purpose of tracing applications. Traditionally these registers have been used to guide the design of microarchitectural features such as caches and TLBs. Whereas these registers can be used to trace the behavior of applications at a very fine level of granularity, they cannot easily be used as an RTA mechanism. An ancillary yet significant issue is that analysis requires the user to have advanced knowledge of the target microarchitecture in order to interpret the data. Finally, tracing supports data transfer only from target to host and not from host to target.
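[Figure: on the host, Host App1 … Host AppM sit above the host API, the RTA host software library, and the emulation software; the emulator and JTAG link connect the host to the target, where the emulation hardware, the RTA target software library, and the target API serve the real-time target application.]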
Figure 2 Single processor RTA based upon JTAG emulation
An alternative real-time analysis solution based upon JTAG emulation is presented here. This hardware and software architecture for a single processor is shown in Figure 2. The JTAG interface that connects the on-chip emulation logic to the host-based emulator provides the physical mechanism on which to transport data from the target to the host and vice versa.

The target application is the subject matter to be analyzed; it is the source of data to be sent to the host and the sink for data received from the host. Therefore, an RTA target software library exists to bridge the gap from the target application to the on-chip emulation hardware. Good software engineering practices dictate that an API exist for this software library.

On the host, the data is to be analyzed by a host application. This host application may also input data to the target application. We must therefore bridge the gap from the emulator to the host application. An emulation software driver controls the scanning of data to and from the target via the emulator. It is the first piece of host software to receive data from the target and the last piece of host software to handle data heading to the target. An RTA host software library funnels the data between the emulation software driver and the host application. Again, an API exists for the RTA host software library. It should be noted that multiple host applications may be run concurrently.

Data flow in this architecture is bi-directional: data flows from the target application to the host application for analysis, and data may flow from the host application to the target application to supply input parameters. Such input parameters may be used for fine tuning performance, for supplying test data, etc. Refer to Figure 3.

For target-to-host data transfer, there are two distinct parts of the data flow path. The first part extends from the target application to the RTA host software library. This is the real-time transportation leg. Since our target application has real-time constraints, data must be off-loaded from the target to the host at a certain rate. The RTA host software library is the first piece of software on the host that realizes it has received real-time data for analysis. (The emulation software is agnostic to what type of data it is scanning.) The RTA host software can record the data to disk and be done with it, or buffer it internally.

The second part of the target-to-host data flow path extends from the RTA host software library to the host application. The data is analyzed by the host application. If the data has been recorded in persistent storage, then the data can be played back at any time. If the data is not in persistent storage, then it must be analyzed by the host application as it is produced; that is, the data must be drained from the RTA host software buffers as they fill.
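[Figure: the real-time transportation leg runs from the real-time target application through the RTA target software library, the emulation hardware, JTAG, the emulator, and the emulation software to the RTA host software library; from there data flows to Host App1 … Host AppM, and input flows back from the host applications toward the target.]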
Figure 3 Data flow in single processor RTA based upon JTAG emulation
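The two options open to the RTA host software library, recording to persistent storage or buffering for a live consumer, can be sketched as follows. This is an illustrative model only: the class name `RtaHostBuffer`, the `on_scan` entry point, the length-prefixed record format, and the fixed capacity are all assumptions and do not describe the actual library. In live mode the sketch simply refuses data when its buffer is full, which makes visible why the host application must drain the buffers as they fill.

```python
import io
from collections import deque


class RtaHostBuffer:
    """Hypothetical stand-in for the buffering done by an RTA host library."""

    def __init__(self, capacity, log=None):
        self.capacity = capacity  # maximum number of undrained payloads (live mode)
        self.pending = deque()
        self.log = log            # binary file-like object, or None for live mode

    def on_scan(self, payload: bytes) -> bool:
        """Accept data handed up by the emulation driver; False means 'buffer full'."""
        if self.log is not None:
            # Recording mode: store a 4-byte length prefix followed by the payload.
            self.log.write(len(payload).to_bytes(4, "big") + payload)
            return True
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(payload)
        return True

    def drain(self):
        """Hand all buffered payloads to the host application and free the space."""
        items = list(self.pending)
        self.pending.clear()
        return items


def playback(log):
    """Replay recorded payloads, in order, at any later time."""
    log.seek(0)
    while True:
        header = log.read(4)
        if len(header) < 4:
            break
        yield log.read(int.from_bytes(header, "big"))
```

A recording buffer can be created with `RtaHostBuffer(capacity=0, log=io.BytesIO())` (or a real file opened in binary mode), while a live buffer omits `log` and relies on the host application calling `drain()` often enough.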
The above RTA architecture for a single embedded processor is easily extended to a multiprocessor environment. This is shown in Figure 4. An RTA target software library must exist on each embedded target to connect the target application to the emulation logic on that target. For multi-core architectures, an RTA target software library will exist for each core. The data from each processor is scanned up to the host via the JTAG interface ring as described in Section 2 (Scan-Based Emulation). On the host, there exists an emulation software driver corresponding to each target in the system. Each emulation software driver receives the data from its corresponding target and delivers the data to the single RTA host software library.
[Figure: target processors P1, P2, … Pn, each with a real-time application, RTA target software, and emulation hardware, are chained TDI to TDO; on the host processor, emulation software drivers P1, P2, … Pn feed the RTA host software, which serves Host App1, Host App2, … Host AppM.]
Figure 4 Multiprocessor RTA via Scan-Based Emulation
This architecture for multiprocessor real-time analysis via scan-based emulation provides the basis for the methodology.

4. Methodology

In this section we present an end-to-end methodology that is predicated on support in hardware and software across several families of DSPs. There is special emulation hardware architected into the DSP core, together with emulation drivers and RTA target-side and host-side software, that permits the user to perform RTA.

Fundamentally, this methodology involves using a development environment to develop and download a target application to a DSP. The application running on the DSP interfaces with the RTA software to send and receive data. The data is scanned out using JTAG boundary scan. The data is received on the host by the emulation driver that interfaces to the host-side RTA software. The data is then presented to the host client application for analysis.

The figures of merit used to determine the success of this methodology are performance, scalability, ease of use and reliability. We discuss these criteria within the scope of both hardware capabilities and the RTA software architecture in a multiprocessor environment.

4.1 Performance

An important consideration in providing a methodology for multiprocessor RTA is performance.

4.1.1 Dedicated Emulation Hardware

The performance problem has been partially addressed in hardware: data is transferred between target and host using hardware dedicated to scan-based emulation. In a heterogeneous multiprocessor arrangement of DSPs, one complication that arises is that of varying scan lengths. Each design of DSP has its own emulation hardware. This results in scan lengths that vary within a family of DSPs and between Instruction Set Architectures (ISAs). Longer scan chains require greater disassembly time for the scanned data, which lowers throughput and therefore performance.

Data can be streamed between target and host by using peripherals such as DMA and by performing real-time memory write operations. In a multiprocessor JTAG scan, a special JTAG boundary scan bypass instruction obviates the need to scan any device set to bypass mode. This results in less time to disassemble data being transferred between host and target.
[Figure: Virtual Path 1, Virtual Path 2, … Virtual Path N connect the host application on the host to the target application on the target.]
Figure 5 Virtual Data Paths
4.1.2 Data Identification

A RTA solution for a multiprocessor environment must be able to identify the processor from which data originates. This introduces the need to mark the data with a processor identifier. The decision then becomes where in the system to do this. If we examine the host, we see that there is a one-to-one correspondence between the emulation software drivers and the processors in the system. Since there is an emulation software driver for each target in the system, these drivers can stamp the data with a processor identifier.
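The host-side stamping described above can be sketched as follows. The names `EmulationDriver`, `TaggedData`, and `on_scan` are hypothetical and are used only for illustration; the point of the sketch is that each driver already knows which processor it serves, so the identifier is attached on the host and the payload scanned from the target carries no extra bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedData:
    processor_id: int  # added on the host by the driver
    payload: bytes     # exactly what was scanned from the target


class EmulationDriver:
    """Hypothetical per-target driver; one instance exists per processor."""

    def __init__(self, processor_id, sink):
        self.processor_id = processor_id
        self.sink = sink  # shared queue standing in for the RTA host software library

    def on_scan(self, payload: bytes) -> None:
        # Stamp the data with this driver's processor identifier before delivery.
        self.sink.append(TaggedData(self.processor_id, payload))


rta_host_queue = []
drivers = {pid: EmulationDriver(pid, rta_host_queue) for pid in (1, 2, 3)}

# Data arriving from processors 2 and 1, in that order.
drivers[2].on_scan(b"\x10\x20")
drivers[1].on_scan(b"\x01")
```

After these two calls the shared queue holds both payloads, each labeled with its originating processor, while the payloads themselves are unchanged.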
Note that from a performance perspective, it is better to mark the data on the host side rather than on the target side. If a unique processor identifier were sent down to the target and the data were tagged there, then more data would be sent from the target to the host and would consume precious bandwidth.

At the processor level, it is possible to allow finer-grain identification of data. Virtual data paths that extend from the target application to the host application are used to segregate data. See Figure 5. For target-to-host data transfer, the segregation policy is determined by the target application writer, whereas for host-to-target data transfer, the segregation policy is determined by the host application writer. In either case, the corresponding application (host or target) must be aware of how the data is segregated according to virtual paths. Therefore, both the target API and the host API must contain methods to identify the virtual path on which the data is flowing. The introduction of virtual data path identification has ramifications for performance because this identifier must be carried with the data.

4.2 Scalability

A key aspect of this methodology is scalability. This issue is addressed in both hardware and software.

4.2.1 Hardware Scalability

The JTAG specification permits the daisy chaining of hardware. The limit on the number of devices that can be daisy chained is set by signal strength rather than by the JTAG specification.

4.2.2 Software Scalability

In software, data from each target is tagged with a unique identifier (as described in Section 4.1.2) so that data being transferred between host and target can be attributed to the processor to which it belongs. Further, the RTA architecture is software scalable: writing the target application does not depend upon the number of processors, and the application does not need to be altered if processors are added to or removed from the system configuration.
There is no requirement that the target application have any knowledge of the type or number of processors in a scan chain at the time of development. The emulation drivers and the RTA host software are architected to manage the data from the different processors.

4.2.2.1 Data Selection

The host application should be able to select the processor to which it sends data or from which it receives data. This is accomplished by incorporating this functionality into the host API. This proves to be very favorable with respect to scalability. By allowing the host application to select the processor, the same target application can be replicated without change on multiple processors to exploit parallel computing power.

For example, let’s assume that we have a target application that performs a series of transformations on a given vector, and then transfers the resulting vector to the host for analysis. Let’s further assume that there are many vectors that must be transformed and that we choose to deploy the same target application on as many processors as there are vectors to achieve maximum parallelism. We can design a host application that sends a different vector to each processor and then collects the resulting vector from each processor for analysis. See Figure 6.
Host Application
• Select Processor1
• Send vector1
• Select Processor2
• Send vector2
• Select Processor3
• Send vector3
• Select Processor1
• Get resultant vector
• Select Processor2
• Get resultant vector
• Select Processor3
• Get resultant vector

Processor1: Transform
Processor2: Transform
Processor3: Transform
Figure 6 Processor Selection
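The select, send, and collect sequence of Figure 6 can be sketched as a small simulation. Everything here is an assumption made for illustration: `SimulatedRtaHost`, its `select_processor`, `send`, and `receive` methods, the virtual path number, and the `transform` function are hypothetical and do not represent the actual host API. The target side is modeled by calling the same unchanged transform for every processor.

```python
class SimulatedRtaHost:
    """Hypothetical host API: select a processor, then use a virtual path."""

    def __init__(self, processor_ids, target_app):
        self.target_app = target_app  # the same application replicated on every processor
        self.results = {pid: {} for pid in processor_ids}
        self.selected = None

    def select_processor(self, pid):
        if pid not in self.results:
            raise KeyError(f"unknown processor {pid}")
        self.selected = pid

    def send(self, path, vector):
        # Stands in for the host-to-target transfer plus the target's processing.
        self.results[self.selected][path] = self.target_app(vector)

    def receive(self, path):
        # Data is identified by (selected processor, virtual path).
        return self.results[self.selected].pop(path)


def transform(vector):
    """Placeholder for the target application's series of transformations."""
    return [2 * x + 1 for x in vector]


host = SimulatedRtaHost([1, 2, 3], transform)
vectors = {1: [0, 1], 2: [2, 3], 3: [4, 5]}

for pid, vec in vectors.items():   # Select ProcessorN, send vectorN
    host.select_processor(pid)
    host.send(path=0, vector=vec)

collected = {}
for pid in vectors:                # Select ProcessorN, get resultant vector
    host.select_processor(pid)
    collected[pid] = host.receive(path=0)
```

Every processor uses the same virtual path number, yet the results stay distinct because the host keys them by processor as well as by path.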
Note that this example still holds if the target application sends its data via a virtual data path. Since the target application is replicated unchanged, each processor would be sending data on the same virtual path. However, since the host application selects data on the basis of both processor identifier and virtual data path identifier, all data is uniquely identifiable. This example illustrates that host-application control over processor selection results in a scalable multiprocessor methodology.

4.3 Ease of Use

Ease of use is an important but often difficult figure of merit to sustain. A software debugging environment is provided that permits the user to easily configure the hardware in the system.

4.3.1 Hardware Support

A trend in DSP emulation hardware is to support device registers that are mapped at fixed addresses. This permits applications to be ported at the source-code level. Further, a trend in more contemporary DSP emulation logic is to replicate the logic on all DSPs. This further simplifies the deployment of RTA tools.

4.3.2 Software Support

At setup, the user selects the type of target and loads the system with an emulation driver for that target. The user also specifies the number of targets of each type and their position in the scan chain. Without this capability users would have to add code to their host applications to perform the same function, resulting in messy and unnecessarily complex code.

The debugging support software permits devices on a scan chain to be set to bypass. In the absence of this support, the application might have to disassemble unwanted scans.

Host-side support is provided by way of object-oriented interfaces based on the Component Object Model (COM) [4], which is a de facto industry standard. This permits the host application developer to write client programs that are not tightly coupled to a specific DSP.

4.4 Reliability

Reliability is critical to the deployment of the RTA capability.

4.4.1 Hardware Reliability

The JTAG specification has long been established as a reliable standard. It has been adopted and extended. An extensive set of target libraries has been developed for various flavors of DSPs based on boundary scan. Reliability is achieved through reuse of the same register set in different versions of emulation hardware across ISAs and within ISAs.
4.4.2 Software Reliability

The use of unidirectional virtual paths for both target-to-host and host-to-target data transfers assists in ensuring that there is no data corruption. Further, host applications synchronize on data buffers connected to virtual paths and so there is no data loss.

Buffer management is precise and is architected to ensure no data loss on both target and host sides. Another feature of the RTA architecture is congestion control. With this capability buffers are guaranteed not to overflow.

During host-to-target data transfer, the RTA architecture signals the end of data transfer through a virtual path using callbacks. Callbacks are used to notify target applications that data sent by the host has to be read. The virtual paths through which data is passed cannot be reused unless previously written data has been consumed.

Data is copied from the target application into buffers in the RTA target software library. This supports reliability by ensuring that the target application does not accidentally overwrite data.

5. Challenges

There are several challenges in supporting a uniform multiprocessor RTA capability on various families of DSPs. Each family of DSP has its particular variant of emulation hardware. This has an impact on the RTA protocol that is used to transfer data between host and target. For instance, the emulation hardware on some DSPs uses interrupts to signal the flow of data between host and target. In the absence of emulation interrupt support, the application must poll the emulation hardware for the presence of data. Another issue is the support for DSPs with varying word sizes (16 bit and 32 bit).

There is a need to support RTA in the presence of various memory hierarchies. Specifically, RTA must run whether the application is loaded into on-chip or off-chip memory.

These issues have been addressed on the target side by developing the RTA target software libraries that get linked in with the application. These target libraries comprise the software that is responsible for programming the emulation and peripheral device registers and effecting data transfer.
On the host side, the RTA host software is a target-independent layer that can filter data in a multiprocessor environment to send and receive the data from a particular target unambiguously.

6. Conclusion

The RTA methodology presented in this paper is extensively used and widely accepted. The high energy physics application presented in Section 3 is not the only setting in which this problem arises. Our experiences to date have shown that other domains such as wireless and mobile computing require the processing of RTA data where both DSPs and microcontrollers are on the same scan chain.

The development of this RTA capability has been predicated on the JTAG specification and the adherence to this standard in the emulation hardware that has been designed into the DSP core. The software that has been developed is able to differentiate between the various DSPs. The virtual paths in the RTA architecture guarantee data integrity. The COM interfaces permit the analysis of data via Commercial Off-The-Shelf (COTS) tools such as MATLAB® and LabVIEW™.

We consider this methodology to be successful based on the criteria cited in Section 4. The methodology has been demonstrated to scale well. It incorporates many ease-of-use features and is reliable from both hardware and software perspectives. We have maximized performance by utilizing emulation hardware and streamlining the software layers.

Biographies

Debbie Keil is a member of Texas Instruments’ technical ladder, a distinction held by the top twenty percent of TI’s technical staff worldwide. She is the co-architect of TI’s Real-Time Analysis technology and the technical lead of the software engineering team that develops real-time analysis solutions for TI DSPs. Debbie has extensive industry experience in compiler development and embedded systems programming. She has published in the EE Times and has patents pending. She is a member of the International WHO’S WHO of Information Technology. Debbie holds a Master’s degree in Computer Science and a Bachelor’s degree in Computer Science and Math from the University of Pittsburgh.

Prithvi Rao is a member of technical staff with Texas Instruments. He is currently working on developing Real-Time Analysis capability on multiprocessor arrangements of DSPs. He has worked in the development of the Real-Time Mach operating system at CMU and was awarded two patents for his work in distributed control architectures for mobile robots.
He has a Bachelor's degree in Electrical Engineering from the University of Canterbury, New Zealand and a Master's degree in Electrical and Computer Engineering from CMU. He has been invited to present several tutorials on Java for the Usenix organization and has published numerous articles on Java and CORBA in their “;login:” technical journal. He has an adjunct faculty appointment at Carnegie Mellon where he teaches graduate courses in Electronic Commerce, Java, Distributed Object Technologies and Telecommunications Management.

References

[1] JTAG IEEE 1149.1 Specification. http://www.ieee.org
[2] Texas Instruments JTAG Extensions. http://www.ti.com
[3] Gottschalk, E.E., et al., "The BTeV DAQ and Trigger System – Some Throughput, Usability, and Fault Tolerance Aspects," Proceedings of the Computing in High Energy and Nuclear Physics Conference (CHEP 2001), p. 628, Beijing, China, September 2001.
[4] Microsoft Component Object Model (COM). http://www.microsoft.com