Multiprocessor Systems-on-Chips
Introduction
SoC
An integrated circuit that implements most or all of the functions of a complete electronic system
A memory chip is not a system but a component
Contains memory, an instruction-set processor (CPU), specialized logic, buses, and other digital functions
Generally tailored to the application rather than being a general-purpose chip
Cost-effective
Provides the necessary performance
A new crisis in SoC design
Pressure on chip designers
Increases in functionality, reliability, and bandwidth
Decreases in cost and power consumption
High-integration silicon built with register-transfer-level (RTL) hardware design techniques
Productivity gap, growing cost, time-to-market pressure
Challenges
Silicon density, design and verification tools and complexity, cost of bugs, software, complex standards
New design methodology
Using pre-designed, pre-verified processor cores: but meeting the requirements with general-purpose processors alone is impossible
Designing custom RTL logic: but it takes too long and is too rigid to change easily
Solution
Configurable, extensible processors
Firmware rather than RTL-defined hardware
The Limitations of Traditional ASIC Design
Conventional SoC design
Combines a standard microprocessor, memory, and RTL-built logic into an ASIC
Philosophical descendant of earlier board-level designs
One or two 32-bit buses (narrow, to save pins)
Rigid partitioning between the microprocessor and the logic blocks, because communication is assumed to be the bottleneck
The impact of SoC integration
Wide buses (128- or 256-bit)
1 GB per second is attainable on an SoC using wider buses
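A quick way to see why wider buses matter is to compute peak bandwidth as width times clock rate. The sketch below assumes one transfer per clock cycle and a 100 MHz bus clock; both are illustrative assumptions, not figures from the slides.

```python
def peak_bus_bandwidth_bytes_per_s(width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth of a bus that moves one word of `width_bits` per clock cycle."""
    return (width_bits / 8) * clock_hz


CLOCK_HZ = 100e6  # assumed bus clock, for illustration only

for width in (32, 128, 256):
    gb_per_s = peak_bus_bandwidth_bytes_per_s(width, CLOCK_HZ) / 1e9
    print(f"{width:3d}-bit bus at 100 MHz: {gb_per_s:.1f} GB/s peak")
```

Under these assumptions a 32-bit bus stays well below 1 GB per second, while the 128-bit and 256-bit buses exceed it.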
The limitations of general-purpose processors
Support only the most generic data types, for complete generality
Silicon-intensive, deeply pipelined, superscalar…
Limited instructions per cycle (IPC)
Embedded systems
Critical functions need application-specific data types
Cannot take full advantage of all the capabilities of a general-purpose processor
Hard-wired circuits are used for data-intensive functions
Extensible Processors as an Alternative to RTL
Two important criteria
Must accelerate and simplify the creation of configurations
Must deliver hardware descriptions, software development tools, and verification aids
Configurable and extensible processors
Non-architectural processor configuration
Fixed menu of processor architecture configurations
User-modifiable processor RTL
Processor extension using an instruction-set description language
Fully automated processor synthesis
Design migration from hardwired state machine to firmware program control
Flexibility
Software-based development
Faster, more complete system modeling
Unification of control and data
Time-to-market
Not the right choice for
Small, fixed state machines
Simple data buffering
Extensibility and Energy Efficiency
Hard-wired logic
Small silicon area (low switched capacitance)
Low cycle count (more useful work per cycle)
Configurable processor
Small: architectural features can be omitted and configured in only on demand
Application-specific instructions and interfaces can be added
These give the same effects as hard-wired logic
Toward Multiple-Processor SoCs
Complexity of SoC designs
Two trends
Combining of functions
Migration of functions traditionally implemented with RTL into application-specific processors
Faster initial design, greater post-fabrication flexibility
Issues to consider
Interconnection of multiple processors
Simulation of a system composed of application-specific processors
What Are MPSoCs?
What is an MPSoC?
A system-on-chip that contains multiple instruction-set processors
In practice, most SoCs are MPSoCs
Why do we care about performance?
Precise performance requirements
At least some real-time deadlines
Why do we care about energy?
Battery-operated devices
In devices that are not battery-operated, energy consumption is related to cost
In an MPSoC, SW design is an inherent part of the overall chip design
For chip designers
Either HW or SW can be used to solve a problem
The choice depends on performance, power, and design time
For SW designers
SW is shipped as part of a chip, so it must be extremely reliable
It must meet many design constraints normally reserved for hardware (hard timing constraints, energy consumption...)
Heterogeneous vs. symmetric multiprocessors
Heterogeneous multiprocessors are:
Harder to program
Cheaper
More energy-efficient
Challenges in MPSoC software design
The combination of high reliability, real-time performance, small memory footprint, and low-energy software
Why MPSoCs?
Typically, an MPSoC is a heterogeneous multiprocessor
Several different types of PEs (processing elements)
Heterogeneously distributed memory system
Heterogeneous interconnection network between the PEs and the memory systems
A shared-memory multiprocessor model is preferred because it makes life simpler for the programmer
Multiprocessor vs. uniprocessor
A multiprocessor is needed for:
Enough performance for some applications
The computational concurrency required to handle concurrent real-world events in real time (task-level parallelism)
Heterogeneous vs. symmetric
A heterogeneous architecture is needed to:
Perform real-time computations
Be area-efficient
Be energy-efficient
Provide the proper I/O connections
Perform real-time computations
Real-time computing is much more than high-performance computing
Requires predictable behavior of the hardware
For predictable and high performance, use mechanisms that are specialized to the needs of the application
Examples: specialized memory systems, application-specific instructions
Be area-efficient
A special-purpose PE may be much faster and smaller than a programmable processor
If the system architect can predict some aspects of the memory behavior of the application, it is often possible to reflect those characteristics in the architecture
Memory specialization / Cache configuration
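To make the cache-configuration point concrete, here is a minimal sketch of a direct-mapped cache simulator that reports the miss rate of one access trace under several cache sizes. The trace (a loop sweeping a 1 KiB buffer four times), the 16-byte line size, and the function name are all illustrative assumptions.

```python
def direct_mapped_miss_rate(addresses, num_lines, line_size=16):
    """Simulate a direct-mapped cache and return the miss rate for an address trace."""
    stored_block = [None] * num_lines  # block number currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_size
        index = block % num_lines
        if stored_block[index] != block:
            misses += 1
            stored_block[index] = block
    return misses / len(addresses)


# Hypothetical trace: sweep a 1 KiB buffer four times, one 4-byte word at a time.
trace = [addr for _ in range(4) for addr in range(0, 1024, 4)]

for lines in (16, 32, 64, 128):
    rate = direct_mapped_miss_rate(trace, lines)
    print(f"{lines * 16:5d}-byte cache: miss rate {rate:.4f}")
```

For this trace the miss rate drops once the cache holds the whole buffer (64 lines) and does not improve beyond that, so a larger cache would only cost area and energy.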
Be energy-efficient
Many systems are power-sensitive, whether due to environmental considerations (heat dissipation) or to system requirements (battery power)
Specialization saves power
Stripping away features that are unnecessary for the application
Provide the proper I/O connections
The point of an SoC is to provide a complete system
Specialized I/O
Because of the variety of physical interfaces, it is difficult to create customizable I/O devices effectively
Challenges
Software development
High performance, real time, and low power
Each MPSoC requires its own software development environment
Task-level behavior
Task-level parallelism is both easy to identify in SoC applications and important to exploit
RTOSs provide scheduling mechanisms, but abstract the process
Networks-on-chips
Use packet networks to interconnect the processors in the SoC
FPGAs
FPGA logic can be used for custom logic that could not be designed before manufacturing
A good complement to software-based customization
Security
As SoCs connect to the Internet, security becomes increasingly important
Networks of chips
Sensor networks
Designers do not have total control over the system
Design Methodologies
Fast design time is very important
Tight time-to-market and time window constraints
Higher level abstractions are needed on the HW and SW side
A key issue is the definition of a good system-level model that is capable of representing all those heterogeneous components along with local and global design constraints and metrics
Design steps
Design space exploration
Hardware/software partitioning, selection of architectural platform and components
Architecture design
Design of components, hardware/software interface design
Consider strict requirements regarding time-to-market, system performance, power consumption, and production cost
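Hardware/software partitioning can be illustrated with a tiny exhaustive design space exploration. The sketch below assumes tasks run one after another, so times add up, and picks the partition with the smallest hardware area that still meets a deadline. The task names, times, areas, and the `explore` function are hypothetical.

```python
from itertools import product

# Hypothetical tasks: (name, software time in ms, hardware time in ms, hardware area in mm^2)
TASKS = [
    ("filter", 8, 2, 3),
    ("fft", 12, 3, 5),
    ("control", 2, 1, 1),
]


def explore(tasks, deadline_ms):
    """Return (area, time, mapping) for the smallest-area partition meeting the deadline, or None."""
    best = None
    for choice in product(("SW", "HW"), repeat=len(tasks)):
        # Assumption: tasks execute sequentially, so total time is a plain sum.
        time = sum(t[1] if c == "SW" else t[2] for t, c in zip(tasks, choice))
        area = sum(t[3] for t, c in zip(tasks, choice) if c == "HW")
        if time <= deadline_ms and (best is None or area < best[0]):
            best = (area, time, dict(zip((t[0] for t in tasks), choice)))
    return best


print(explore(TASKS, deadline_ms=15))
```

Exhaustive search only works for a handful of tasks; real flows use heuristics, but the cost/constraint structure is the same.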
Hardware Architecture
Which CPU do you use? What instruction set and cache should be used based on the application characteristics?
What set of processors do you use? How many processors do you need?
What interconnect and topology should be used? How much bandwidth is required? What quality-of-service characteristics are required of the network?
How should the memory system be organized? Where should memory be placed and how much memory should be provided for different tasks?
MPSoC architecture platforms for high-performance applications
Philips Nexperia™ DVP
Texas Instruments OMAP platform
Xilinx Virtex-II Pro™
…
We can see that these platforms:
Limit the number and types of integrated processor cores
Provide a fixed or not well-defined memory architecture
Limit the choice of interconnect networks and available IPs
Do not support design from a high abstraction level
Software
Programmer’s Viewpoint
Because the architecture is parallel, parallel programming is required
Two types of parallel programming model
Shared-memory programming: OpenMP
Message-passing programming: Message Passing Interface (MPI)
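The difference between the two models can be sketched with the same reduction written both ways. Python threads stand in for processing elements here; this is neither OpenMP nor MPI, only an illustration of the two styles, and both function names are hypothetical.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Shared-memory style: workers update one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # synchronization is explicit, the data is implicit
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Message-passing style: workers share nothing and send results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # communication is explicit

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


data = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(data), message_passing_sum(data))
```

The shared-memory version is shorter to write but relies on a coherent shared variable; the message-passing version maps more directly onto distributed memories.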
MPSoC vs. conventional parallel programming
Application
Application-specific
Does not need full-featured parallel programming models
Architecture
Heterogeneity
Massive parallelism
Software Architecture and Design Reuse Viewpoint
Middleware, Operating system, Hardware abstraction layer
APIs provide an abstraction of the underlying hardware architecture to upper layers of software
The software architecture may enable several levels of software design reuse
Key challenges
Determining which abstraction of the MPSoC architecture is most suitable at each of the design steps
Determining how to obtain application-specific optimization of the software architecture
Optimization Viewpoint
Cost and performance requirements
Two factors
Processor architecture: parallelism, application-specific
Memory hierarchy: shared memory, distributed memory
Consider problems in a different context with more design freedom in hardware architecture and with a new focus on energy consumption
Techniques for Designing Energy-Aware MPSoCs
Introduction
Power and energy consumption have become significant constraints
Reducing active power: voltage scaling
Reducing standby power
Reducing Active Energy
Multiple supply voltages
Decreasing the supply voltage decreases performance, since gate delay increases
Effective in MPSoCs, since different types of processors require different performance
DVS (dynamic voltage scaling) combined with DFS (dynamic frequency scaling)
The most popular of these techniques
Most embedded and mobile processors contain this feature
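Why voltage and frequency are scaled together can be seen from the standard CMOS dynamic power model, P = C·V²·f. The sketch below uses that model with assumed values for the effective capacitance, the operating points, and the cycle count; none of these numbers come from the slides.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Dynamic power of switching logic: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


def task_energy(c_eff, vdd, freq_hz, cycles):
    """Energy for a task of `cycles` clock cycles: power times run time (cycles / f)."""
    return dynamic_power(c_eff, vdd, freq_hz) * cycles / freq_hz


C_EFF = 1e-9       # assumed effective switched capacitance, in farads
CYCLES = 1_000_000  # assumed task length

fast = task_energy(C_EFF, vdd=1.2, freq_hz=200e6, cycles=CYCLES)
slow = task_energy(C_EFF, vdd=0.9, freq_hz=100e6, cycles=CYCLES)
print(f"energy ratio slow/fast: {slow / fast:.4f}")
```

In this model the frequency cancels out of the energy per task, so lowering the clock alone saves power but not energy; the saving comes from the lower voltage that the slower clock permits.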
DVS+DFS
As long as the supply voltage is increased before the clock rate is increased, the system stalls only while the PLL is relocking on the new clock rate
In future MPSoCs, each core would require its own converter and PLL
Requirement: cores are tolerant of periodic dropouts
Complication: the PLL is an analog device, and noise is induced by digital switching
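The ordering rule can be sketched as a small controller. The class below is a toy model of one core's voltage converter and PLL; its name and methods are hypothetical. Raising the voltage first follows the rule on the slide, while slowing the clock before lowering the voltage is the usual mirror-image rule and is an assumption here.

```python
class CoreRegulator:
    """Toy model of one core's voltage converter and PLL (hypothetical interface)."""

    def __init__(self, vdd, freq_mhz):
        self.vdd = vdd
        self.freq_mhz = freq_mhz
        self.log = []

    def _set_voltage(self, vdd):
        self.vdd = vdd
        self.log.append(f"voltage -> {vdd} V")

    def _set_frequency(self, freq_mhz):
        self.log.append("stall: PLL relocking")  # the only stall in the sequence
        self.freq_mhz = freq_mhz
        self.log.append(f"clock -> {freq_mhz} MHz")

    def set_operating_point(self, vdd, freq_mhz):
        if freq_mhz > self.freq_mhz:
            self._set_voltage(vdd)         # speeding up: raise the voltage first
            self._set_frequency(freq_mhz)
        else:
            self._set_frequency(freq_mhz)  # slowing down: lower the clock first (assumed rule)
            self._set_voltage(vdd)


core = CoreRegulator(vdd=0.9, freq_mhz=100)
core.set_operating_point(vdd=1.2, freq_mhz=200)
core.set_operating_point(vdd=0.9, freq_mhz=100)
print("\n".join(core.log))
```

Either way the core never runs at a clock rate that its current supply voltage cannot sustain.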
Reducing Standby Energy
Increasing the threshold voltage (VT) decreases subthreshold leakage current (pro) but increases gate delay (con)
Combining DVS, DFS, and variable VT is an effective approach
Sleep transistors
Gate the supply rail
Switch off the supply to idle components
System SW can determine the optimal scheduling
It can direct idle cores to switch off
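How system software might decide when to switch off an idle core can be sketched with a break-even test. The model assumes a gated core leaks nothing and that switching off and back on costs a fixed energy overhead; the function names and all numbers are hypothetical.

```python
def break_even_idle_time(leak_power_w, transition_energy_j):
    """Idle time above which switching the core off saves energy."""
    return transition_energy_j / leak_power_w


def plan_idle(idle_s, leak_power_w, transition_energy_j):
    """Return 'off' if gating the supply saves energy for this idle period, else 'idle'."""
    if idle_s > break_even_idle_time(leak_power_w, transition_energy_j):
        return "off"
    return "idle"


LEAK_W = 10e-3        # assumed leakage power of an idle core
TRANSITION_J = 50e-6  # assumed energy to power the core down and back up

for idle_ms in (1, 20):
    decision = plan_idle(idle_ms / 1000, LEAK_W, TRANSITION_J)
    print(f"idle for {idle_ms} ms: {decision}")
```

Short idle periods are not worth gating, which is why the scheduler's knowledge of upcoming idle time matters.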
Energy-Aware Memory System Design
Memory constitutes a significant portion of the overall chip resources
Energy is expended on data accesses, on coherence activity, and as leakage while storing the data
Reducing Active Energy
1. Partition large caches into smaller structures
2. Use a memory hierarchy that attempts to capture most accesses in the smallest memory
Cache way-prediction
Selective-way caches
Filter cache
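The benefit of capturing most accesses in a small memory can be estimated with an average-energy calculation. The sketch below assumes each level is probed only when all smaller levels miss; the energies are normalized to one L1 access, and the filter cache hit rate is a made-up value.

```python
def avg_access_energy(levels):
    """Average energy per access for a hierarchy.

    `levels` is a list of (energy_per_access, hit_rate), smallest memory first.
    A level is accessed only when every smaller level missed.
    """
    energy = 0.0
    reach = 1.0  # fraction of accesses that reach the current level
    for level_energy, hit_rate in levels:
        energy += reach * level_energy
        reach *= 1.0 - hit_rate
    return energy


l1_only = avg_access_energy([(1.0, 1.0)])
with_filter = avg_access_energy([(0.2, 0.8), (1.0, 1.0)])  # assumed filter cache: 0.2 energy, 80% hits
print(f"L1 only: {l1_only:.2f}, with filter cache: {with_filter:.2f}")
```

The sketch ignores the extra latency of filter cache misses, which is the usual price of this technique.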
Reducing Standby Energy
Most of the above techniques do little to alleviate leakage energy
Reduce leakage during idle cycles by turning off the supply voltage
Gated-Vdd: shuts down portions of the cache dynamically
State-preserving leakage optimization
Requirement
Ability to identify unused resources
Cache size is reduced dynamically
Cache blocks are supply-gated
The tag line is kept active when a cache line is deactivated
Dynamic voltage scaling
Drowsy cache
Leakage-biased bitlines
Influence of Cache Architecture on Energy Consumption
Two popular alternatives for building a cache
A single multi-ported cache (shared by the processors)
Pros
Constructive interference can reduce overall miss rates
Inter-process communication is easy to implement
Cons
Consumes significant energy
Not scalable
Each processor has its own private cache
Pros
Low power per access, low latency, and good scalability
Cons
Duplication of data and instructions
Complex cache coherence protocol
Combine the advantages of the two options: CCC (crossbar-connected cache)
The shared cache is divided into multiple banks connected through an N x M crossbar
Pros
The duplication problem is eliminated (logically a single cache)
No consistency mechanism is needed
Scalable
Useful in reducing energy consumption
CCC (cont'd)
Cons
Concurrent accesses to the same bank cause processor stalls
Ways to alleviate this
Use more cache banks than processors
Deal with references to the same block
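Bank conflicts and the two remedies can be sketched by counting, for one cycle, how many accesses have to wait. The sketch assumes low-order block interleaving across banks, 32-byte blocks, and that references to the same block are merged into a single bank access; the function name and the addresses are hypothetical.

```python
def conflicting_accesses(addresses, num_banks, block_size=32):
    """Count accesses that must wait in one cycle of a banked, crossbar-connected cache.

    References to the same block are merged into one bank access. Distinct
    blocks that map to the same bank are serialized, so all but one must wait.
    """
    blocks = {addr // block_size for addr in addresses}  # merge same-block references
    banks = {block % num_banks for block in blocks}      # assumed low-order interleaving
    return len(blocks) - len(banks)


# Four processors issue one request each in the same cycle.
requests = [0x000, 0x008, 0x080, 0x100]

for banks in (4, 8, 16):
    print(f"{banks:2d} banks: {conflicting_accesses(requests, banks)} accesses must wait")
```

With as many banks as processors, two accesses wait; doubling and quadrupling the bank count reduces this to one and then none for these addresses.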
The energy benefits of CCC
Reducing Snoop Energy
In bus-based symmetric multiprocessors, all bus-side cache controllers snoop the bus
Snoops occur when writes are issued to already cached blocks, and on cache misses
Unlike normal cache accesses, tag and data array accesses are separated
Energy optimizations include
Use of dedicated tag arrays for snoops
Serialization of tag and data array accesses
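The effect of serializing tag and data array accesses can be estimated with a simple model. It assumes every snoop reads the tag array, and that with serialization the data array is touched only when the tag matches. The energies are in arbitrary units and the 10% snoop hit fraction is a made-up value.

```python
def snoop_energy(num_snoops, hit_fraction, e_tag, e_data, serialized):
    """Total snoop energy with or without tag/data serialization (simplified model)."""
    if serialized:
        # Tag lookup first; the data array is accessed only on a snoop hit.
        return num_snoops * (e_tag + hit_fraction * e_data)
    # Tag and data arrays are accessed together on every snoop.
    return num_snoops * (e_tag + e_data)


parallel = snoop_energy(1000, 0.1, e_tag=1.0, e_data=4.0, serialized=False)
serial = snoop_energy(1000, 0.1, e_tag=1.0, e_data=4.0, serialized=True)
print(f"parallel: {parallel:.0f}, serialized: {serial:.0f}")
```

Since most snoops miss in a given cache, the saving grows as the snoop hit fraction falls.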