Open Source SOA in the Cloud: Data Analytics in the Cloud
Tom Plunkett, Michael Sick
[email protected] [email protected]
SOA World 2009
Overview
Introductions
• Who are we?
• Baselines & definitions
Opportunity
• Targeted use cases
• Technical convergence & opportunities
• Commercial opportunities & drivers
Technology & Standards
• State of current technology
• Commercial & FOSS solutions
• Hadoop focus
Challenges
• Challenges to meet target use cases
• Economic challenges & the role of “free”
• Wide scale challenges in Cloud and data analytics
Questions
• Questions
• Contacts
This work is licensed under a Creative Commons Attribution 3.0 United States License
Data Analytics in the Cloud: Introductions
Tom Plunkett
• Extensive Federal Government experience
• IBM Certified SOA Solution Designer
• Patents
• Teaches OOP and Java for Virginia Tech
Michael Sick
• Commercial & Federal Enterprise Architect
• Owner: Serene Software Inc., an EA services firm
• Clients include: BAE, USAF, Raytheon, BearingPoint, McGraw-Hill, Sun Microsystems, Badcock Furniture
• Fascinated by technology, 15 years running
Serene Software
• Serene is a boutique consulting company focusing on delivery of Enterprise Architecture services and solutions
• Service Areas
  – IT Governance
  – IT Strategy
  – IT Cost Containment
  – Service Oriented Architectures (SOA)
  – IT Solution Selection
  – IT Audit & Analysis
• Experience includes: BAE, USAF, Raytheon, BearingPoint, McGraw-Hill, Sun Microsystems, Badcock Furniture, …
• Founded in 2003 (privately held, no debt) and headquartered in Jacksonville, FL
Draft NIST Definition of Cloud Computing
A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Essential Characteristics
• On-demand self-service
• Ubiquitous network access
• Location independent resource pooling
• Rapid elasticity
• Measured Service
Delivery Models
• Cloud Software as a Service (SaaS)
• Cloud Platform as a Service (PaaS)
• Cloud Infrastructure as a Service (IaaS)
Deployment Models
• Private cloud
• Community cloud
• Public cloud
• Hybrid cloud
Source: Draft NIST Definition of Cloud Computing, 06/2009
OSI Open Source Definition
• Free Redistribution
• Source Code
• Derived Works
• Integrity of The Author's Source Code
• No Discrimination Against Persons or Groups
• No Discrimination Against Fields of Endeavor
• Distribution of License
• License Must Not Be Specific to a Product
• License Must Not Restrict Other Software
• License Must Be Technology-Neutral
Source: http://www.opensource.org/docs/osd
The Open Group SOA Definition
• Service-Oriented Architecture (SOA) is an architectural style that supports service orientation
• Service orientation is a way of thinking in terms of services and service-based development and the outcomes of services
Source: http://www.opengroup.org/projects/soa/doc.tpl?gdid=10632
Data Clouds & Data Grids – What's the difference?
The terms "Data Cloud" and "Data Grid" are often used interchangeably; we make the following distinctions.
Data Grids
• Grid computing system optimized to share large amounts of distributed data
• Focus on technical capabilities
• Often combined with computational grid computing systems
• Data often moved to compute grid for use
• Often oriented towards highly structured scientific data computing applications
Data Clouds
• Focus on perception of infinite storage, computing capacity
• Focus on cost, virtualization & flexible capacity
• Enable scale-up/scale-down economics
• Data moved rarely; locality is a key feature
• Clouds thus far focus on column oriented, massively scalable data stores
Sources: Wikipedia & [Grossman 1]
Definition: Mashups
• Web-available resource that combines data/functions from two or more external resources
• The idea of mashup efforts is to reduce the cost of producing and consuming resources
• Integration should be fast and easy
• Often focuses on widely available formats/protocols like RSS or Atom over HTTP
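As a rough illustration of the mashup idea, the sketch below merges items from one RSS feed and one Atom feed into a single list. The two feed documents, the example.com links, and the function names are hypothetical stand-ins invented for this example; a real mashup would fetch live feeds over HTTP rather than use inline strings.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feeds; a real mashup would download these over HTTP.
RSS_SAMPLE = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Storage prices fall</title><link>http://example.com/a1</link></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Feed B</title>
<entry><title>New cluster online</title><link href="http://example.com/b1"/></entry>
</feed>"""


def parse_rss(xml_text):
    """Extract source, title, and link from each RSS 2.0 <item>."""
    root = ET.fromstring(xml_text)
    source = root.findtext("channel/title")
    return [
        {"source": source,
         "title": item.findtext("title"),
         "link": item.findtext("link")}
        for item in root.iter("item")
    ]


def parse_atom(xml_text):
    """Extract source, title, and link from each Atom <entry>."""
    root = ET.fromstring(xml_text)
    source = root.findtext("atom:title", namespaces=ATOM_NS)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "source": source,
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            "link": link.get("href") if link is not None else None,
        })
    return entries


def mashup(rss_text, atom_text):
    """Combine both external resources into one list, sorted by title."""
    return sorted(parse_rss(rss_text) + parse_atom(atom_text),
                  key=lambda e: e["title"])


if __name__ == "__main__":
    for entry in mashup(RSS_SAMPLE, ATOM_SAMPLE):
        print(entry["source"], "|", entry["title"], "|", entry["link"])
```

Normalizing both formats into the same small dictionary shape is what keeps the combination step cheap, which is the point the slide makes about fast, easy integration.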
Data Analytics in the Cloud: Opportunities
Use Case: Cloud Data Analytical Tools for Intelligence Community Field Analyst
Problem Statement: Analytical tools are obsolete on deployment; field analysts need timely, configurable data analytics (DA). How does cloud-based DA meet the needs of Intelligence Community (IC) analysts?
Customer Problem
• Traditional business intelligence tools require years to develop
• Field analysts confront situations which are rapidly changing
• Petabytes of data require analysis
Cloud Analytical Tools Solution
• Recomposable Cloud Computing Data Analytical Tools
  – Apache Hadoop
  – Mashups
  – Service-Oriented Architecture
Customer Value
• Enabling field analysts to quickly build the analytical tool they need to analyze petabytes of data
Why the “Buzzword” Soup? Convergence of Capabilities
Converging capabilities
• Cloud Computing
• Data Analytics
• SaaS
• Mashups
• Free Open Source Software (FOSS)
• Virtualization
The convergence of these capabilities creates new opportunities in the breadth and depth of DA services
• Big Data: Cloud disk and data storage engines make petabyte environments available to new clients
• Value Based Billing: Heavy use of FOSS in the cloud reduces costs directly & indirectly
• Capacity Scaling: Scaling capacity up/down in pay-as-you-go fashion makes DA available to a wider audience
• Composable UIs: Capability to assemble DA results into various interfaces
Early Data Analytic Cloud Consumers/Providers
Entries are grouped by profile; each lists the type, example companies, and the cloud DA opportunity.
Internet Scale Service Providers
• Big Internet Companies: Yahoo, Amazon – can build DA on inf.
• SaaS Companies: Force.com – DA & Warehousing to SBA’s
• Social Platforms: Facebook – sell DA access to anon. user info
Large Data-centric Traditional Co’s
• Insurers: BCBS – private clouds across consortium
• Healthcare & Biotech: Kaiser Permanente – common DA services
• Rating Agencies: S & P – open DA cloud to customers
Government Organizations
• Intelligence Community: CIA – private org-wide Cloud
• Defense Managed Services: DISA – offer DA to .mil clients
• Healthcare: SSA – offer DA to fraud prevention analysts
DAaaS Providers
• DAaaS Infrastructure: Cloudera – managed Hadoop instances
• SMB DAaaS Provider: ?? – managed DAaaS, simplified, low cost
Data Analytics in the Cloud: Technology & Standards
Google MapReduce
• Algorithm for computing distributed problems using a divide and conquer approach with a cluster of nodes
• Master node Maps input into smaller sub-problems and distributes the work to the cluster. A worker node may further map the work for a further cluster of nodes.
• The worker nodes then process the smaller problems and return the answers to the master node
• Master node then Reduces the set of answers into the answer to the original problem
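The map and reduce steps described above can be sketched on a single machine with the classic word-count example. This is only an illustrative sketch, not Google's or Hadoop's implementation: the function names are invented for this example, each input string stands in for a split that a worker would process, and the `shuffle` step models the grouping by key that a real framework performs between the two phases (where reduce work is typically also spread across workers rather than done on the master alone).

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for an input split handled by one worker.
    intermediate = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["the cloud stores data", "the cluster processes the data"]
    print(word_count(splits))
```

Because each call to `map_phase` depends only on its own split, and each call to `reduce_phase` only on one key's values, both phases can be run in parallel across a cluster without changing the result.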
Apache Hadoop
• Open Source implementation of the MapReduce algorithms
• Hadoop can store and process petabytes of data
• Subprojects include HBase, Chukwa, Hive, Pig, and ZooKeeper
• Yahoo (more than 100,000 CPUs in >25,000 computers running Hadoop) and other companies make extensive use of Hadoop
As-Is Hadoop Simplified Reference Architecture
[Diagram: components shown are Apache Hadoop, HBase, Hive, Pig, Chukwa, Zookeeper, ETL, Business Intelligence, Structured Data, and Unstructured Data]
Apache Hadoop Sub-projects
Hadoop subprojects, their capabilities, and example companies:
Chukwa (example company: Yahoo)
• Data collection system for monitoring and analyzing large distributed systems
HBase (example company: Yahoo)
• Similar to Google’s BigTable
• Distributed database for structured data
• Multi-dimensional sorted map
Hive (example company: Facebook)
• Data warehouse infrastructure for large datasets
• Hive QL query language
Pig (example company: Yahoo)
• High-level language for data analysis
• Compiler for Map-Reduce programs
Zookeeper (example company: Yahoo)
• Configuration, naming, distributed synchronization, and group services
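To make the "multi-dimensional sorted map" description of HBase more concrete, here is a toy in-memory model in which each value is addressed by row key, column, and timestamp, and keys are kept in sorted order so that row ranges can be scanned. It is a sketch of the data model only: the class and method names are invented for this example and are not the HBase API, and nothing here is distributed or persistent.

```python
import bisect


class SortedCellMap:
    """Toy model of a (row, column, timestamp) -> value sorted map."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> cell value

    def put(self, row, column, timestamp, value):
        # Negate the timestamp so newer versions of a cell sort first.
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for a cell, or None if absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


if __name__ == "__main__":
    table = SortedCellMap()
    table.put("row2", "info:name", 1, "beta")
    table.put("row1", "info:name", 1, "alpha")
    table.put("row1", "info:name", 2, "alpha-v2")
    print(table.get("row1", "info:name"))
    print(list(table.scan("row1", "row2")))
```

Keeping keys sorted by row is what makes range scans cheap, and it is also why related rows end up stored near each other, which ties back to the data locality point made earlier for data clouds.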
Data Analytics in the Cloud: Challenges
To-Be Simplified Hadoop Architecture
[Diagram: components shown are Apache Hadoop, HBase, Hive, Pig, Chukwa, Zookeeper, ETL, Business Intelligence, Structured Data, Unstructured Data, REST API, SOAP API, Query Language, and Algorithm Library]
Key Challenges
Emerging challenges, grouped by area:
Infrastructure
• Hardware: Speed of rack interconnects, multi-core
• Parallelization: Core platform, data analytic components
• Node Affinity: Make use of super nodes, XML I/O, encryption/decryption
Adoption
• Cost: “Brutally efficient” pricing, FOSS advantages
• Cost Models: Accurate, open models of CapEx, OpEx costs
• Migration Pain: Full warehouse migration, ETL
Administration
• Ease of Admin.: Parallel current RDBMS, warehouse admin
• Debugging: Distributed debugging, integration w/ provider
• Flexible Provisioning: Multi-level provisioning (co., dept, individual)
• System Reporting: Reporting, audit trails, view to DA system
Input & Analysis
• ETL Integration: Interface, metadata optimized for ETL loading
• Intuitive APIs: Declarative & programmatic, cross language
• Product Integration: BI, applications (SAP, Oracle Financial, Lawson)
Output
• Data Visualization: Viewing & drill down of very large data sets
• Intuitive APIs: Declarative & programmatic, cross language
• Mashups/Dynamics: Easy discovery of data & functions & workflows
Solutions: Projected & In-Progress
Solutions to the emerging challenges, grouped by area:
Infrastructure
• Hardware: Interconnect costs dropping, hardware maturing
• Parallelization: Platforms advance, market for components
• Node Affinity: Discovery of capability, affinity into Hadoop, …
Adoption
• Cost: FOSS’s game to lose, small diff * a lot = a lot
• Cost Models: Industry standard ROI/IRR models for cloud computing (CC)
• Migration Pain: Migration toolkits for traditional DW products
Administration
• Ease of Admin.: Integrated & extended admin packages
• Debugging: Commercial distributed debugging
• Flexible Provisioning: Multi-level provisioning (co., dept, individual)
• System Reporting: Reporting, audit trails, view to DA system
Input & Analysis
• ETL Integration: ETL interface, support of popular packages
• Intuitive APIs: SQL-like interface in core, language bindings
• Product Integration: 3rd party adaptors, IWay et al.
Output
• Data Visualization: Modeling, meta-data, traceability, and new UIs
• Intuitive APIs: SQL-like interface in core, language bindings
• Mashups/Dynamics: Generic datatypes, discovery services
Data Analytics in the Cloud: Questions
Questions & Contact Information
Michael A. Sick, Principal Architect / Partner
• Phone: 888.777.1847
• Email: [email protected]
Tom Plunkett, Cloud Computing Architect
• Phone: 888.777.1847
• Email: [email protected]
Address: Serene Software, 116 19th Ave. North, Suite 503, Jacksonville Beach, FL
URL: www.serenesoftware.com