Domain Specific Languages and Haskell
Don Stewart | LACSS, Santa Fe, NM | Oct 14, 2009
© 2009 Galois, Inc. All rights reserved.
Two Points To Take Home
1. Embedded domain specific languages (EDSLs) are an inexpensive way to improve the portability, maintainability, productivity, and correctness of new scientific code.
2. Haskell is a great programming language for EDSLs, and also for exploring parallel programming models – via STM, aggressive speculation, and nested data parallelism.
Part 1: A Way Forward: Embedded Domain Specific Languages
Change and Growth
• New architectures are appearing
• Increasing complexity (GPUs, the Cell)
  – unusual programming models
  – unusual memory hierarchies
  – unusual hybrid architectures
  – massive compute power
• How do we write code that is going to be portable, maintainable, correct and fast?
• How do we bridge the “programmability gap”? How do we experiment with these machines?
Domain Specific Languages (DSLs)
• DSLs are:
  – Small languages with a restricted programming model, aimed at a particular problem domain
  – Work at the semantic level of the problem domain
  – Not trying to do everything at once
• A relatively new, and very hot, approach to tackling unusual problems and managing complexity
• Emerged from the programming language community
DSLs: the advantages
Focused on a particular problem domain, so:
• Better productivity: work in domain abstractions
• Higher level – easier to maintain
• Restricted semantics: easier to optimize and verify
  – Domain-level knowledge feeds new optimizations
• Usually small and declarative – not tied to a particular hardware model – so easier to port
• Encourages explorative coding!
A nice example: Cryptol
A DSL for cryptography built by Galois
• Emphasis on performance + correctness
• High level: target users are crypto experts
• High level: not tied to any machine model, so:
  – Portable to VHDL/FPGA/C/Haskell/Interpreter
• Restricted programming model enables automatic equivalence checking
• Domain-specific optimizations: so very, very fast
  – “128 bit AES targeting 100Gbps via Async FPGAs”
Domain-specific knowledge made visible

    -- AES "rounds" function in Cryptol
    Rounds (State, (initialKey, rndKeys, finalKey)) = final
      where {
        istate = State ^ initialKey;
        rnds   = [istate] # [| Round (state, key)    -- stream comprehension
                            || state <- rnds
                            || key   <- rndKeys |];
        final  = FinalRound (last rnds, finalKey);
      };
Making DSLs easier to build
• Designing and building compilers is relatively time consuming, especially if we want:
  – Good code generators
  – Good type systems
  – Good optimizers
• Embedded DSLs are the way forward:
  – Embed your DSL in an existing language – save $$
  – Reuse its syntax, compiler, libs, tools, type system
  – But use a library for code generation
  – And write your own optimizations
Good host languages for EDSLs
Good host languages:
– Need to support overloading (numbers, strings)
– Can build ASTs from regular syntax
– Need a rich type system (embed the domain language's types in the host language's types)
– Should have a good toolchain (doc tools, profilers)
– Should have good code generation libraries
  • C, LLVM, Asm, Haskell, ...

Haskell is a good host language! (The sketch below shows the first two points in action.)
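To make the first two points concrete, here is a minimal sketch of a deep embedding in Haskell, assuming nothing beyond the standard Prelude (Expr, Lit, Add, Mul and term are illustrative names invented for this example, not from any library):

    -- A tiny expression language, deeply embedded in Haskell.
    data Expr = Lit Integer
              | Add Expr Expr
              | Mul Expr Expr
      deriving Show

    -- Overloading Num means ordinary arithmetic syntax builds ASTs.
    instance Num Expr where
      fromInteger = Lit
      (+)         = Add
      (*)         = Mul
      negate e    = Lit (-1) `Mul` e
      abs         = error "abs: not needed for this sketch"
      signum      = error "signum: not needed for this sketch"

    -- 'term' is a data structure we can optimize or compile, not a number.
    term :: Expr
    term = 1 + 2 * 3    -- Add (Lit 1) (Mul (Lit 2) (Lit 3))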
1. The “accelerate” multi-dim array EDSL
“accelerate” – a Haskell EDSL for multi-dimensional array processing, targeting data-parallel hardware
– Collective operations on multi-dim arrays
  • Targeting massive data parallelism
– Restricted control flow and types
  • Widely portable, and matches what the GPU supports
– Generative code approach based on templates
  • Matches hand-specialization techniques
• Not tied to any hardware – guaranteed portability
• http://www.cse.unsw.edu.au/~chak/project/accelerate/
    import Data.Array.Accelerate

    -- EDSL code for dot product
    dotp :: Vector Float -> Vector Float -> Acc (Scalar Float)
    dotp xs ys = let xs' = use xs    -- marshal data to GPU
                     ys' = use ys
                 in  fold (+) 0 (zipWith (*) xs' ys')    -- GPU computation

• See “Haskell Arrays, Accelerated (Using GPUs)”, Chakravarty et al., Haskell Implementors Workshop, 2009.
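To run the computation, the Acc term is handed to a backend. A usage sketch, assuming the CUDA backend’s Data.Array.Accelerate.CUDA.run and accelerate’s fromList constructor (details may differ between package versions):

    import Data.Array.Accelerate (Vector, Z(..), (:.)(..), fromList)
    import qualified Data.Array.Accelerate.CUDA as CUDA

    main :: IO ()
    main = do
      let xs, ys :: Vector Float
          xs = fromList (Z :. 10) [0 ..]      -- build inputs on the host
          ys = fromList (Z :. 10) [1, 1 ..]
      print (CUDA.run (dotp xs ys))           -- compile and execute on the GPU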
2. An EDSL for SIMD-parallel algorithms
• Programming model: SIMD-parallel algorithms
  – Target users: mathematicians (!)
  – Target backends: CPU, the Cell
  – Emphasizes domain-specific optimization
  – Emphasizes exploratory programming
  – “Generates unusual call patterns”
  – Generates C, Haskell, or code fed into a state-of-the-art instruction scheduler
• Anand and Kahl, “A Domain-Specific Language for the Generation of Optimized SIMD-Parallel Assembly Code”, 2007.
Bit-shift division: example

    divShiftMA :: SPUType val
               => Integer -> Integer -> Integer -> Integer -> val -> val
    divShiftMA p q s n v
      | s /= 0                       = mpya  m v b
      | m' < 2 ^ 10 && m' > 0        = mpyui v m'
      | m' < 2 ^ 9  && m' > (-2 ^ 9) = mpyi  v m'
      | otherwise                    = mpy   v m
      where
        m' = (p * 2 ^ n + (q - 1)) `div` q    -- integer exponent and division
        m  = unwrds4 m'
        b  = unwrds4 ((s * 2 ^ n) `div` q)
3. BASIC (targets LLVM)

http://hackage.haskell.org/package/BASIC

    import BASIC

    main = runBASIC $ do
      10 GOSUB 1000
      100 LET I := INT(100 * RND(0))
      200 PRINT "Guess my number:"
      210 INPUT X
      220 LET S := SGN(I-X)
      230 IF S <> 0 THEN 300
      240 FOR X := 1 TO 5
      250 PRINT X*X;" You won!"
      260 NEXT X
EDSLs: Summary
• DSLs increase abstraction, enabling new levels of
  – Portability
  – Verifiability
  – Maintainability
  – without sacrificing performance
• Embedded DSLs are cheaper to construct
  – Reuse significant resources of a compiler toolchain
• Haskell is a rich playground for EDSLs, with many examples in a wide range of domains
• More examples in the paper
Part 2: Parallel Programming in Haskell
The Haskell Approach to Multicore
Two broad approaches to multicore programming provided by Haskell:
• Deterministic parallelism
  1. Hand-annotated speculation + work-stealing queues
  2. Nested data parallelism
• Concurrency for multicore
  3. Very lightweight threads
  4. Communication via MVars and transactional memory
Haskell and Parallelism: Why?
• Language reasons:
  – Purity, laziness and types mean you can find more parallelism in your code
  – No specified execution order!
  – Speculation and parallelism are safe
• Purity provides inherently more parallelism
• A very high level language, but with lots of static type information for the optimizer
Haskell and Parallelism
• Custom multicore runtime: high performance threads are a primary concern – thanks, Simon Marlow!
• Mature: 20-year code base, long-term industrial use, massive library system
• Ready to go: http://haskell.org/platform
The GHC Runtime Model
• Multiple virtual CPUs
  – Each virtual CPU has a pool of OS threads
  – CPU-local spark pools for additional work
• Lightweight Haskell threads map onto OS threads: many to one
• Even lighter ‘sparks’ are used for speculative work
• Automatic thread migration and load balancing
• Parallel, generational GC
• Transactional memory and MVars
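In practice, all of this is enabled with standard GHC flags: link with -threaded, then pick a core count at startup with the RTS option -N. A minimal sketch of the usual invocation:

    $ ghc -O2 -threaded --make Main.hs
    $ ./Main +RTS -N4    # run on 4 cores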
Approach 1: Parallel Strategies
Useful speculation built up from the `par` combinator:

    a `par` b

• Creates a spark for 'a' – very cheap! A speculation “hint”
• The runtime sees a chance to convert the spark into a thread
• Which in turn may get run in parallel, on another core
• 'b' is returned
• No restrictions on what you can annotate – a very flexible approach to post-hoc parallelization
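As a concrete sketch, assuming the standard Control.Parallel module from the parallel package (fib is just a deliberately expensive toy function):

    import Control.Parallel (par, pseq)

    -- A deliberately expensive function, to give the spark real work.
    fib :: Int -> Int
    fib n | n < 2     = n
          | otherwise = fib (n - 1) + fib (n - 2)

    -- Speculatively evaluate x on another core while this core computes y.
    main :: IO ()
    main = print (x `par` (y `pseq` x + y))
      where
        x = fib 33
        y = fib 32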
Parallel Strategies: Programming Model
• Deterministic:
  – Same results with parallel and sequential programs
  – No races, no errors
  – Good for reasoning: erase the `par` and get the original program
• Cheap: sprinkle `par` as you like, then measure and refine
• Measurement is much easier with ThreadScope
• Strategies: combinators for common patterns (sketched below)
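For instance, a sketch using the parList and rseq combinators as exported by Control.Parallel.Strategies in the parallel package (older releases spell rseq as rwhnf):

    import Control.Parallel.Strategies (using, parList, rseq)

    -- Evaluate every element of the list in parallel, while the
    -- program text stays an ordinary map.
    parSquares :: [Integer] -> [Integer]
    parSquares xs = map (^ 2) xs `using` parList rseq

    main :: IO ()
    main = print (sum (parSquares [1 .. 100000]))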
New Tools: ThreadScope
• New thread profiling tool: ThreadScope
Approach 2: Nested Data Parallelism
We can write a lot of parallel programs with strategies or explicit threads; however:
• par/seq are very light, but granularity is hard to get right
• forkIO/MVar/STM are more precise, but more complex
• There are trade-offs between abstraction and precision
Another way to write parallel Haskell programs: nested data parallelism.
Nested Data Parallelism
• If your program is expressible as a nested data parallel program:
  – The compiler will flatten it to a flat data parallel one
  – No worrying about explicit threads or synchronization
  – Clear cost model (unlike `par` speculation)
  – Good locality of data / easier partitioning of work
• Looks like a good model for array and GPU programming (see Chakravarty's ‘accelerate’ @ UNSW) – sketched below
• Good speedups with many hardware threads (T2)
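For flavour, the canonical NDP dot product, sketched in Data Parallel Haskell notation following the dph package's documented example ([:Double:] is a parallel array, sumP comes from DPH's own Prelude, and the pragma names have varied across GHC releases):

    {-# LANGUAGE ParallelArrays #-}
    {-# OPTIONS_GHC -fvectorise #-}
    module DotP (dotp) where

    import Data.Array.Parallel
    import qualified Data.Array.Parallel.Prelude.Double as D

    -- GHC's vectoriser flattens this nested-parallel definition
    -- into flat, chunked loops over unboxed arrays.
    dotp :: [:Double:] -> [:Double:] -> Double
    dotp xs ys = D.sumP [: x * y | x <- xs | y <- ys :]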
Approach 3: Explicit lightweight threads
• Lightweight threads are preemptively scheduled (10 … 10M+ Haskell threads possible)
• Non-deterministic scheduling: random interleaving
• When the main thread terminates, all threads terminate (“daemonic threads”)
• Threads may be preempted when they allocate memory
• Communicate via messages or shared memory
• See http://ghcsparc.blogspot.com/ for benchmarks
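A minimal sketch with forkIO from Control.Concurrent (the MVar only keeps main alive until the child finishes, since all threads die with main):

    import Control.Concurrent

    main :: IO ()
    main = do
      done <- newEmptyMVar
      _ <- forkIO $ do                 -- spawn a lightweight thread
             putStrLn "hello from a Haskell thread"
             putMVar done ()
      takeMVar done                    -- block until the child signals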
Communicating between threads
• We need to communicate between threads
• We need threads to wait on results
• Use shared, mutable synchronizing variables to communicate
• Synchronization achieved via async messages, MVars or STM
• By far the most popular concurrency technique, and maps onto multicore well
• See Simon Marlow's publications for lots of benchmarks
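A sketch of the wait-on-result pattern with an MVar from Control.Concurrent (takeMVar blocks until the worker has written its answer):

    import Control.Concurrent

    main :: IO ()
    main = do
      result <- newEmptyMVar
      _ <- forkIO $                    -- worker thread computes the sum
             putMVar result (sum [1 .. 1000000 :: Integer])
      r <- takeMVar result             -- main blocks here for the answer
      print r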
Approach 4: Transactional Memory
• Optimistic: each atomic block appears to run in complete isolation
• The runtime publishes modifications to shared variables to all threads, or
• Restarts the transaction that suffered contention
• You have the illusion you're the only thread
• Composable, deadlock-free communication
• Used in concurrency-heavy systems at Galois
• Slower than MVars, but useful and can be tuned
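The classic composability sketch, using TVars from Control.Concurrent.STM (check blocks the transaction until the condition holds, rerunning it when the variables change):

    import Control.Concurrent.STM

    -- Move funds atomically; no thread ever observes a partial transfer.
    transfer :: TVar Int -> TVar Int -> Int -> STM ()
    transfer from to n = do
      b <- readTVar from
      check (b >= n)                   -- retry until funds are sufficient
      writeTVar from (b - n)
      t <- readTVar to
      writeTVar to (t + n)

    main :: IO ()
    main = do
      a <- newTVarIO 100
      b <- newTVarIO 0
      atomically (transfer a b 30)
      readTVarIO a >>= print           -- 70
      readTVarIO b >>= print           -- 30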
Parallelism and Haskell: Summary
• More information: Google for “Parallel Programming in Haskell: A Reading List”
• Sophisticated, fast runtime:
  1. Sparks and parallel strategies
  2. Nested data parallel arrays
  3. Explicit threads + MVars and shared memory
  4. Transactional memory
• Available in a widely used open source language
About Galois
• Research and tech transition company
• Just over a decade old
• Specialists in:
  – Compiler and language engineering
  – Domain-specific languages
  – Formal methods
  – High assurance systems
  – High performance cryptography
• Clients include DOE, DARPA, DHS, DOD and IC
• Looking to collaborate!