Efficient Synthesis of Application Specific MP architectures for Process Networks for Applications with Soft Real Time Constraints
Harsh Dhand
Efficient Synthesis of Application Specific MP architectures for Process Networks for Applications with Soft Real Time Constraints
Harsh Dhand
Outline • • • • •
Bin Packing Problem Redefined Problem at hand Motivation Real Life Application Probabilistic Application Modeling
It’s Bin Packing • Let the bins be of Variable sizes, and from a collection we are to chose for fitting balls. • The balls are flexible sized, having an average size, but can be stretched or compressed depending on external pressure. • The expansion/ contraction of a ball might have positive or negative effect on others. • Without bursting any ball, select from the bins so as to minimize the bin size.
Objective • Application Modeling : System Level power/performance analysis • Help Designer : Design platform or select right platform from set of target applications. • Focus: Portable Embedded Multimedia Systems: “soft” real time constraints: average behavior far more important than worst case.
Tasks Involved • Automatic Synthesis of System on Chip Multiprocessor Architectures for Process Networks Tasks
Application
Synthesis
Verification
Tasks Involved • Automatic Synthesis of System on Chip Multiprocessor Architectures for Process Networks Tasks
Application
Synthesis
Verification
Synthesis Problem • Given: Application description and component library • Find: Allocation of processors for processes, local and shared memories for communication • Objective Fn: Minimize Hardware cost, Meeting real time constraint
Solution: 1 • Traditional approach: Mapping as a scheduling problem. • Work in scheduling Task Graphs with begin-end type of communication and constant processing time requirements. • Data Dependent Behavior rules out static scheduling.
Solution: 2 • Application Modeled as KPN. • MILP based approach for mapping KPN with data dependent behavior and estimating data communication conflicts. • Dynamic scheduler present in compute units to schedule the threads ( block and context switch mechanism)
Solution 2: Constraint Specification Specify Various Parameters Mapping constraints: • Process mapped onto one processor • Queue mapped to local memory if reader and writer processes mapped to same unit. • Queue mapped to local/shared memory • Processor communicates with shared memory when some reader/writer present.
Synthesis Performance Constraints: Throughput < Bandwidth Throughput governs Computation Time: Capacity constraints at Processor Solved by Optimizer or Heuristic based.
Shortcomings Time taken by process on processor actually data dependent but because of input to MILP, given worst case value. When two/more processes mapped to same processor, there relative behavior can be positive or negatively co-related. No consideration of same.
Problem Cause • Due to Data Dependencies • Designing for worst case (order of magnitude difference) • QoS requirements depending on media.
Study 2 Observations: Depending on input traces, processes expose inherent clustering of probability distribution in steady state regime. Processes react in different ways, scope for optimization and trade-offs.
Real Example
Test Case •
3 runs: – Producer produces at rate = Consumer. – Producer slower than consumer – Producer faster than consumer Two Scenarios: Clip with low correlation (<10%) between adjacent MBs and frames. Clip with High correlation (>90%)
Discussion on Results
Low Correlation Slowing down the Producer. Third Run: Producer producing slightly faster, blocked by consumer, also buffer is full Blocked for > 50 % time
Results & Discussion
High Correlation Steady State Probabilities: Different nature of input => Different set of plots Though Computation is low: Buffer requirements are higher: Abrupt changes require.
Low Correlation
High Correlation
Implications • Info about global probabilities: put processes with similar running probabilities in same cluster • eg: Processes > 50 % time onto low power processor • Processes around 25% on ASICs • Also partition on the available resources based on joint probabilities.
Computation Time
Number of Instances
Quantify
Compression Ratio
Computation Time
Tail Inequalities • Chebyschev’s inequality: Pr[|X-c| ≥ ε ] ≤ 1/ε2 E(X-c) 2 Ie Pr[|X-µ| ≥ ε ] ≤ Var X / ε2
Modeling Soft RT constraint • Let ‘α’ denote the real Time constraint, that is no more than α cases can you exceed the performance cost. • Then: Pr[|X-µ| ≥ ε ] ≤ Var X / ε2 ≤ α ie Worst case: ε = (Var X/ α) ½ Thus: X = µ + (Var X/ α) ½ Verify that: As α decreases, allowed worst case of X increases (should)
Number of Instances
µ
ε
Computation Time
X
Pr[|X-µ| ≥ ε ] ≤ Var X / ε2 ≤ α X = µ + (Var X/ α) ½
Chernoff Bounds Additive form: Pr [X > λ] < ….. Or Pr [X > λ + µ] < …. Multiplicative form: Pr [X > (1 + δ) µ] < ….
Tighter Bounds • The tighter the bound the more complex solving for δ/ ε • Pr [X ≥ (1 + δ) µ] ≤ U (n, δ, µ ) ≤ F (n, δ, µ ) ≤ G (n, δ, µ ) • In each case put f(δ) ≤ α and solve for δ, and substitute.
Tighter Bounds • U (n, δ, µ ) = ((nCi*)(µ/n)i*)/ µ(i+ δ)Ci* where i*= µ δ/(1- µ/n)
• eg G(δ, µ ) ≤ e – (δ^2)µ/3 ≤ α. Gives: X = (1 + 3 ln(1/ α)/ µ) µ
Changes • Σqmmjl * thrj * szj ≤ bwmml thr required now X (= µ + ε or (1 + δ) µ ) Similarly the ncyik, number of clock cycles spent for process i on processor k is now X (avg case instead of worst) tpjk×ncyjk×iterj + Context Switching OH ≤ freq
Changes • Σqmmjl * thrj * szj ≤ bwmml thr required now X (= µ + ε or (1 + δ) µ ) Similarly the ncyik, number of clock cycles spent for process i on processor k is now X (avg case instead of worst) tpjk×ncyjk×iterj + Context Switching OH ≤ freq
Granularity of Observation
• V(X+C) = V(X) • Other uses of variance
Number of Instances
• Stream Level • Frame Level
µ
ε
Computation Time
X
RT Constraint on Whole Process • The Chernoff-Hoeffding bound: Sum of Variables. But sum of observations of same variable. • Going by Average behavior and pdf: need to also respect the overall RT Constraint: Sum of all the computation time of processes • {Xi} be independent variables, common mean as µ and variance σ 2 • Require Pr {Sn – E[Sn] ≥ na}
RT Constraint on Whole Process
• Hoeffding Ineq: Pr {Sn – E[Sn] ≥ na} ≤ {(α /β)β (1- α /1-β) 1-β}n α= µ and β = µ + a Should also depend on Variance !
• Easy Bennet Bound: Pr {Sn – E[Sn] ≥ na} ≤ exp (-na2/2 σ 2)
Joint behavior of Processes • Positive Correlation • Negative Correlation
Correlation • Can be applied only to variables which are known to have cause/effect relation. • To determine the dependence between 2 variables like processing times, correlation is not a fit measure. • There are variance tests (studying)
Incorporating Joint PDFs • Main Issue: At what level do we consider correlation across different processors. • Since Require constraints at mapping stage: If consider correlation among all processes, huge number. • Choice: Post processing or adding to constraints
References • Probabilistic Application Modeling for System-Level Performance Analysis: Radu Marculescu & Amit Nandi (ECE, CMU) (DATE 01) • Automatic Synthesis of System on Chip Multiprocessor Architecture for PN: Basant Dwivedi, Anshul Kumar, M Balakrishnan • Chernoff-Hoeffding Bounds for Applications with Limited Independence: Jeanette P. Schmidt, Alan Siegel, Aravind Srinivasant
Thanks
Extra • Chi-Square test to determine whether Variables are independent (Null Hypothesis) • Or if Dependent, to what extent.