Hbase Goes Realtime

Uploaded by: Oleksiy Kovyrin
0
0

May 2020
PDF

Download

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA

Overview

Download & View Hbase Goes Realtime as PDF for free.

More details

Words: 1,185
Pages: 21

Preview
Full text

HBase Goes Realtime

Wednesday, June 10, 2009 Santa Clara Marriott

Quick Overview » » » » » » »

Who are we? What is HBase? HBase 0.20 Primary Goal HBase 0.20 Architecture and Specifics HBase 0.20 By The Numbers Zookeeper Integration What’s Next?

Who are we? » Jonathan Gray › › › › ›

CoFounder and CTO, Streamy.com Background in CE, databases, distributed systems User of technology as a competitive advantage Contributing to HBase for ~1 year HBase in production for ~9 months

» JeanDaniel Cryans › › › ›

HBase Committer Graduate student at the Université du Québec HBase consultant currently working for OpenPlaces.com Contributing to HBase for ~18 months

What is HBase? » HBase is a… › › › › › › ›

Sorted, Distributed, ColumnOriented, MultiDimensional, HighlyAvailable, HighPerformance, Persisted Storage System

HBase adds random access reads and writes atop HDFS Billions of Rows * Millions of Columns * Thousands of Versions

HBase 0.20 Primary Goal » First ever Performance Release 1. Random Access Time 2. Scan Time 3. Insert Time

» As a randomaccess store, we are well suited for the storing and serving of Web applications › But high latency and variability (100s of ms to seconds) has reduced the usefulness of HBase and required the use of external caching in the past

HBase 0.20 Architecture » The Guiding Philosophy – Unjavafy Everything! › › › › › ›

Zerocopy reads Blockbased storage, reading, and indexing Drastically reduce Object instantiation Eliminate widespread usage of Trees Sorted merges using Heap structures Fast and intelligent caching with memoryawareness

» Effort Lead By… › Jonathan Gray and Erik Holstad, Streamy.com › Michael Stack, Powerset/Microsoft › Ryan Rawson, StumbleUpon

HBase 0.20 Architecture – Storage » New Key Format – KeyValue › Contains only (byte [] buf, int offset, int length) › Compact binary format with binary comparators › Our “pointer” to keys inside blocks

» New File Format – HFile › › › ›

Originally based on TFile (HADOOP3315) and BigTable Block based binary format with a block index Contains any number of Meta blocks Persisted storage of List

HBase 0.20 Architecture – API » New Query API › › › ›

Put, Get, Scan, Delete operations Extended support for versioning Drastically reduces API size and complexity An API that more closely mirrors implementation

» New Result API and optimized Serialization › › › ›

Result is just a wrapper for KeyValue[] Userfriendly Trees are built ondemand, clientside Deserialization allocates a single byte[] for all KVs Zerocopy building, single allocation receiving

HBase 0.20 Architecture – Algorithms » New Scanners – KeyValueScanner / KeyValueHeap › Replace linear sort logic with an encapsulated Heap › Abstract the handling of versions, deletes, query params › Now capable of processing individual rows with millions of columns and versions › Linear (or worse) to Logarithmic, Logarithmic to Constant

» New Block Cache Concurrent LRU › › › ›

Backed by ConcurrentHashMap LRU eviction with scanresistance and block priorities Memorybound using HeapSize interface Nonblocking and unsynchronized LRU map

HBase 0.20 By The Numbers (Uncached) » Tall Table: 1 Million Rows with a single Column › Sequential insert – 24 seconds (.024 ms/row) › Random reads – 1.42 ms/row (average) › Full scan – 11 seconds (117 ms/10,000 rows, .011ms/row)

» Wide Table: 1000 Rows with 20,000 Columns each › Sequential insert – 312 seconds (312 ms/row) › Random reads – 121 ms/row (average) › Full scan – 146 seconds (14.6 seconds/100 rows, 146ms/row)

» Fat Table: 1000 Rows with 10 Columns,1MB values › Sequential insert – 68 seconds (68 ms/row) › Random reads – 56.92 ms/row (average) › Full scan – 35 seconds (3.53 seconds/100 rows, 35ms/row) Each test yielded >1 region, additional rows have no impact on performance

HBase 0.20 Performance Conclusion » We surprised even ourselves › Random read times similar to that of an RDBMS • 20100 times faster with far less variability

› Scan times reduced • 30 times faster than previous versions

› Insert times reduced • 210 times faster with less than half the memory usage

» We improved our performance by more than an order of magnitude in most cases › While drastically improving our memory usage and code readability

Zookeeper Integration

Why? » » » »

Takes 2 mins to figure a RegionServer’s death Clients have to ask Master for ROOT address Managing shared state in HBase is a zoo ;) And...

»Master is a SPOF!

Zookeeper? • Project under Hadoop started by Y! • Centralized service for maintaining configuration information, naming, providing distributed synchronization, and group services. • Highly available when used on an ensemble of machines, typically 5 or more. • ZK’s data model is a simple namespace with permanent and ephemeral nodes.

Tough Decisions » Should we impose the usage of a ZK quorum on every setup? » How much should we rely on ZK for HA, are there better alternate solutions for some of our problems? » Should we have an HBase implementation of ZK ? » Should we package our own version of ZK?

Major Integration Points » » » » » »

Master address is stored in ZK Master election is a race for that lock ROOT address is also stored in ZK Region Servers are all registered in ZK The RSs watch the Master’s node Backup Masters are watching both Master’s node and a “cluster state” node

What it Changes for You » Standalone and pseudodistributed setups: › a ZK server that listens on localhost is started for you. It starts/stops with the rest of the cluster.

» Fullydistributed setup: › poss. to keep the managed ZK server but have to make it point on a nonlocal IP/hostname. › better is to get a quorum, can also use it for other purposes, for higher availability.

Fullydistributed setup » What you have to do with ZK: › hbasesite.xml: set hbase.cluster.distributed to true, also notice that hbase.master is deprecated. › hbaseenv.sh: export HBASE_MANAGES_ZK=false › zoo.cfg: server.0=… server.1=… etc. You also have to configure those servers per ZK doc.

» You want backup masters? › ${HBASE_HOME}/bin/hbasedaemon.sh start master › It’s also a good idea to set hbase.master.dns.nameserver and hbase.master.dns.interface to have them binding at the right place.

New Features from ZK integration in 0.20 » No more SPOF › Automatic Master failover

» Rolling upgrades of point releases » Modify some cluster configuration without full cluster restart

What’s next? » More performance and reliability › 0.20 was mostly a RegionServer rewrite › 0.21 will rewrite Master with better ZK integration

» HBase 0.21 Roadmap › Decentralized Master responsibilities + More ZK • • • •

› › › ›

Further capability to modify configurations at run time State sharing via ZK nodes Ephemeral nodes for region ownership Distributed queue for region assignment

Languageagnostic, binary RPC Native C/C++ client library MultiDC Replication Further optimizations on algorithms and data structures

More Information about HBase » HBase Website and Wiki › http://www.hbase.org › http://wiki.apache.org/hadoop/Hbase

» Mailing List › http://hadoop.apache.org/hbase/mailing_lists.html

» IRC Channel › #hbase on Freenode › All committers and core contributors are here

» Follow us on Twitter! › @hbase

Hbase Goes Realtime

Overview

More details

Related Documents

Hbase Goes Realtime

Wince Realtime

Realtime Systems

Realtime Testcases

Aar Realtime Indicators Report

19 Realtime Synchronization

More Documents from ""

Mysql Magazine. Issue 6 - Fall 2008

Standford Cs 193p: 12-text Input Present Modal

Jobacle Sick Day Calendar 2009

Mastering Postgresql Administration

The Little Manual Of Api Design