MySQL Scaling and High Availability Architectures Jeremy Cole
[email protected]
Eric Bergen
[email protected]
Who are we? • Proven Scaling is a consulting company founded in 2006 by Eric and Jeremy specializing in MySQL • We primarily deal with architecture and design for large scalable systems • We also do training, DBA work, custom MySQL features, etc. • Jeremy: optimization, architecture, performance • Eric: operations, administration, monitoring
Overview
• What’s the problem?
• Basic Tenets of Scaling and High Availability
• Lifetime of a Scalable System
• Approaches to Scaling
• Approaches to High Availability
• Tools and Components
What’s the problem? • Internet-age systems can grow (or be forced to choose between growth and death) very quickly • No matter what you plan for or predict, users will always surprise you • Mobs, err, valued users can be very annoying sometimes (e.g. “biggest group ever” logic) • Users may have vastly different usage patterns • Web 2.0™ (blechhhh!) sites have changed the world of scaling; it’s much harder now • Everyone (your VCs included) expects you to be Web 2.0® compliant™
Basic Tenets • Don’t design scalable or highly available systems: Using components you do not control or that have loose tolerances (e.g. DNS) Using processes with potentially ugly side effects (e.g. code changes to add a new server) [Yes, configuration files are very often “code”]
• If a user doesn’t think/notice something is down, it’s not really “down” • Eliminate (or limit) single points of failure -- if you have only one of any component, examine why • Cache everything
Lifetime of a Scalable System
Newborn • Shared hosting • Might start worrying (a little bit) about query optimization at this point • Don’t have much control over configuration • Overall performance may be poor • Traffic picks up, and performance is bad... What do we do about it?
Toddler
• A single (dedicated) server for everything
• MySQL and Apache etc., competing for resources
• MySQL needs memory for caching data
• Apache (and especially PHP etc.) needs lots of memory for handling requests
• Memory contention will be the first major bottleneck
Child • Separate web servers and database server • Usually go ahead and get multiple web servers now, since it’s easy • Get a single database server, since it’s hard -- maybe better hardware? • Now we need to do session management across web servers… hmm, we have this nice database… • Other load same as before, but now with added network overhead • Single database server becomes your biggest bottleneck
Teenager • “Simple” division of load by moving tables or processes • Use replication to move reporting off production • Move individual tables or databases to lighten load • Use replication to move reads to slaves • Modify code to know where everything is • Still too many writes in some parts of system • Replication synchronization problems mean either annoying users or writing lots of code to work around the problem
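The “modify code to know where everything is” step usually ends up as a routing table inside the application. A minimal sketch of that idea, with hypothetical host names, credentials, and table list (MySQLdb driver assumed):

```python
# Hypothetical "who owns what" routing map for the teenager stage.
import MySQLdb  # assumes the MySQLdb driver is installed

ROUTES = {
    "users":    {"host": "db-master",    "role": "write"},  # core tables stay on the master
    "sessions": {"host": "db-master",    "role": "write"},
    "reports":  {"host": "db-reporting", "role": "read"},   # reporting moved off production
    "search":   {"host": "db-slave-1",   "role": "read"},   # plain read slave
}

def connect_for(table):
    """Open a connection to whichever server currently owns `table`."""
    route = ROUTES[table]
    return MySQLdb.connect(host=route["host"], user="app", passwd="secret", db="app")
```

Every time a table or process moves, this map has to change, which is exactly the kind of code-change-per-server process the Basic Tenets warn against.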
Late teens to 20s • The “awkward” stage • This is where many applications (and sometimes entire companies) die by making bad decisions • Death can be slow (becoming irrelevant due to speed or lack of scalability) or quick (massive meltdown losing user confidence) • Managing the move from teenager into adulthood is often the first real project requiring specs and real processes to do it right • Downtime at this point is hard to swallow due to size of userbase
Adult • Scalable system that can grow for a long time, generally based on data partitioning • Most improvements now are incremental • System is built to allow incremental improvements without downtime • A lot has been learned from the successful transition to adulthood
Data Partitioning: The only game in town
What is partitioning? • Distributing data on a record-by-record basis • Usually a single basis for distributing records in each data set is chosen: a “partition key” • An application may have multiple partition keys • Each node has all related tables, but only a portion of the data
Partitioning Models • Fixed “hash key” partitioning • Dynamic “directory” partitioning • Partition by “group” • Partition by “user”
Partitioning Difficulties • Inter-partition interactions are a lot more difficult • Example: Partitioning by user, where do we store a message sent from one user to another? How about a friend list? • Overall reporting becomes more difficult • Example: Find the average number of friends a user has by state…
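For the reporting example, the usual workaround is scatter-gather: each partition returns partial sums and counts, and an aggregator combines them (averaging the per-partition averages would be wrong, since partitions hold different numbers of users). A sketch with hypothetical partition hosts and a hypothetical users(state, friend_count) table:

```python
# Scatter-gather aggregation across partitions: average friend count by state.
import MySQLdb
from collections import defaultdict

PARTITIONS = ["part-db-1", "part-db-2", "part-db-3", "part-db-4"]  # hypothetical hosts

def avg_friends_by_state():
    sums, counts = defaultdict(int), defaultdict(int)
    for host in PARTITIONS:
        conn = MySQLdb.connect(host=host, user="report", passwd="secret", db="app")
        cur = conn.cursor()
        # Each partition only sees its own users, so it can only return partials.
        cur.execute("SELECT state, SUM(friend_count), COUNT(*) FROM users GROUP BY state")
        for state, friend_sum, user_count in cur.fetchall():
            sums[state] += int(friend_sum)
            counts[state] += int(user_count)
        conn.close()
    # Combine partials into true per-state averages.
    return {state: sums[state] / counts[state] for state in counts}
```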
Partition by … • Partitioning by user (or equivalent) allows for the most flexibility in most applications • In many cases it may make sense to partition by groups, if most (or all) interactions between users are within that group • You could also get most of the same benefits of partitioning by group by partitioning by user with an affinity based on group
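One way to get that group affinity (sketch; the helper and its inputs are hypothetical): keep the user as the partition key, but place new users on whichever partition already holds most of their group, so intra-group traffic usually stays on one node.

```python
# Partition by user, with placement affinity by group (all names hypothetical).
from collections import Counter

def choose_partition_for_new_user(group_member_partitions, all_partitions):
    """group_member_partitions: partition ids of the new user's group-mates."""
    if group_member_partitions:
        # Put the new user where most of their group already lives.
        return Counter(group_member_partitions).most_common(1)[0][0]
    # No group-mates yet: fall back to some default placement policy.
    return all_partitions[0]
```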
Fixed Hash Key
• Divide the data into B buckets
• Divide the B buckets over M machines
• Example: Define 1024 user buckets 0..1023 based on (user_id % 1024) for 4 physical servers, so each server gets 256 of the buckets by range: 0-255, 256-511, 512-767, 768-1023
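The example’s arithmetic, written out as a sketch (server names are made up):

```python
# Fixed hash-key partitioning: 1024 buckets spread over 4 servers by range.
BUCKETS = 1024
SERVERS = ["db1", "db2", "db3", "db4"]        # hypothetical host names
BUCKETS_PER_SERVER = BUCKETS // len(SERVERS)  # 256

def bucket_for(user_id):
    return user_id % BUCKETS

def server_for(user_id):
    return SERVERS[bucket_for(user_id) // BUCKETS_PER_SERVER]

# e.g. user_id 1000 -> bucket 1000 -> db4 (bucket range 768-1023)
```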
• Problem: Moving entire buckets means affecting 1/B of your users at a time in the best case… in simple implementations you may have to affect 1/M or 2/M of your users
• Problem: The bucket-to-machine mapping must be stored somewhere (usually in code) and updated atomically
• Problem: You have no control over which bucket (and thus machine) a given user is assigned to
Dynamic Directory • A “directory” server maintains a database of mappings between users and partitions • A user is assigned (often randomly) to one partition and that mapping is stored • Any user may be moved later by locking the user, moving their data, and updating their mapping in the directory • Solution: Only single users are affected by any repartitioning that must be done • Solution: Partitions may be rebalanced user-by-user at any time
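A sketch of that flow, with a hypothetical user_partition directory table: look up the user’s partition on normal requests; to rebalance, lock the user, copy their rows, then flip the mapping.

```python
# Dynamic "directory" partitioning: a small database maps users to partitions.
import MySQLdb

def partition_for(directory_conn, user_id):
    cur = directory_conn.cursor()
    cur.execute("SELECT partition_id FROM user_partition WHERE user_id = %s", (user_id,))
    return cur.fetchone()[0]

def move_user(directory_conn, user_id, new_partition):
    cur = directory_conn.cursor()
    # 1. Lock the user so no writes land mid-move (locking scheme is app-specific).
    cur.execute("UPDATE user_partition SET locked = 1 WHERE user_id = %s", (user_id,))
    # 2. Copy the user's rows from the old partition to the new one (not shown).
    # 3. Flip the mapping and unlock -- only this one user was ever affected.
    cur.execute("UPDATE user_partition SET partition_id = %s, locked = 0 WHERE user_id = %s",
                (new_partition, user_id))
    directory_conn.commit()
```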
Custom Solutions
Custom Solutions • It’s very easy to implement simple hash key partitioning to get data distributed • It’s much more difficult to be able to re-partition • It’s difficult to grow
Hibernate Shards
Hibernate Shards • Sort of a merge between fixed key partitioning and directory-based partitioning • “Virtual Shards” abstract the mapping of objects to shards, but simplistically • It’s still painful to repartition • It doesn’t handle rebalancing at all currently • It doesn’t handle aggregation at all
HiveDB
HiveDB Project • HiveDB is an Open Source project to design and implement the entire “standard” partition-by-key MySQL system in Java • Originally envisioned by Jeremy while working with several customers • Implemented by Fortress Consulting and CafePress along with help and guidance from Proven Scaling • Many companies have built somewhat similar systems but nobody has really open sourced it
Why HiveDB? • Many solutions that exist only solve the easy part: storing and retrieving data across many machines • Nobody really touches on the hard part: being able to rebalance and move users on the fly
Server Architecture
• Hive Metadata - Partition definitions
• Directory - Partition Key -> Partition mapping; Secondary Key -> Partition Key mapping
• Hive Queen - makes management and rebalancing decisions • Job Server (Quartz) - actually executes tasks • Aggregation Layer (future)
Client Architecture • Client uses Hive API to request a connection for a certain partition key • Client uses those direct connections to do work • Hive API should be written in each development language as necessary
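A hypothetical sketch of that client flow -- this is not the real HiveDB API, just its general shape: ask the hive for a connection by partition key, then do the actual reads and writes directly against that partition.

```python
# Hypothetical client flow for a directory-backed hive (not the actual HiveDB API).
import MySQLdb

class Hive:
    def __init__(self, directory_conn, partitions):
        self.directory = directory_conn   # connection to the directory database
        self.partitions = partitions      # partition_id -> host name

    def connect_for_key(self, partition_key):
        """Return a direct connection to the partition owning `partition_key`."""
        cur = self.directory.cursor()
        cur.execute("SELECT partition_id FROM user_partition WHERE user_id = %s",
                    (partition_key,))
        host = self.partitions[cur.fetchone()[0]]
        return MySQLdb.connect(host=host, user="app", passwd="secret", db="app")

# Usage: the client then does its work on the returned connection.
# conn = hive.connect_for_key(user_id)
```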
High Availability
Goals
• Avoid downtime due to failures
• No single point of failure
• Extremely fast failover
• No dependency on DNS changes
• No dependency on code changes
• Allow for painless, worry-free “casual failovers” to upgrade, change hardware, etc. • Fail-back must be just as painless
MySQL Replication
Basics • MySQL replication is master-slave one-way asynchronous replication • “Master” keeps logs of all changes – called “binary logs” or “binlogs” • “Slave” connects to the master through the normal MySQL protocol on TCP port 3306 • Slave requests binary logs from last position • Master sends binary logs up to current time • Master keeps sending binary logs in real-time
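For reference, pointing a slave at its master uses standard statements like these; the host, replication credentials, and binlog coordinates below are placeholders (sketch via the MySQLdb driver):

```python
# Point a slave at its master and start replication (placeholder coordinates).
import MySQLdb

slave = MySQLdb.connect(host="db-slave-1", user="root", passwd="secret")
cur = slave.cursor()
cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST = 'db-master',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'repl-password',
        MASTER_LOG_FILE = 'mysql-bin.000042',
        MASTER_LOG_POS  = 4
""")
cur.execute("START SLAVE")
```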
More Basics • Replication works with all table types and (mostly) all features • Any “critical” reads (ones that cannot be allowed to return stale data) must be done on the master -- replication is asynchronous, so there may be a delay at any time
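One common way to act on this (sketch; the hosts and the lag threshold are made up): send critical reads to the master, and only use a slave when SHOW SLAVE STATUS says it is reasonably caught up.

```python
# Lag-aware read routing (hypothetical threshold; connections created elsewhere).
import MySQLdb
import MySQLdb.cursors

MAX_LAG_SECONDS = 5

def slave_lag(slave_conn):
    cur = slave_conn.cursor(MySQLdb.cursors.DictCursor)
    cur.execute("SHOW SLAVE STATUS")
    row = cur.fetchone()
    lag = row["Seconds_Behind_Master"] if row else None
    return lag if lag is not None else float("inf")  # NULL/no row: treat as far behind

def connection_for_read(master_conn, slave_conn, critical=False):
    """Critical reads always hit the master; others may use a caught-up slave."""
    if critical or slave_lag(slave_conn) > MAX_LAG_SECONDS:
        return master_conn
    return slave_conn
```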
Typical Setup • One Master (single source of truth) • Any number of slaves • Slaves are used for reads only • All writes go to the master • There are many other possibilities…
Replication Topologies
• Master with One Slave: Master -> Slave
• Master with Many Slaves: Master -> five Slaves
• Master with Relay Slave: Master -> Relay Slave -> Slave
• Master with Relay and Many Slaves: Master -> Relay Slave -> five Slaves
• Master with Many Relays: Master -> five Relay Slaves -> their own Slaves
• Dual Masters: Master <-> Master
• Dual Masters with Slaves: Master <-> Master, each with its own Slave(s), one side via a Relay Slave
• Ring (Don’t Use): Master -> Master -> Master -> back to the first
High Availability Options
Dual Master • Two machines with independent storage configured as master and slave of each other • Optionally: Any number of slaves for reads only • Manual (scripted) or automatic (heartbeat-based) failover is possible
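A sketch of the scripted (“casual”) failover the dual-master setup makes possible; hosts are hypothetical, and redirecting the application (VIP change, configuration flip, etc.) is not shown:

```python
# Simplified dual-master failover script (hypothetical hosts; app redirection not shown).
import time
import MySQLdb
import MySQLdb.cursors

def wait_until_caught_up(new_master):
    cur = new_master.cursor(MySQLdb.cursors.DictCursor)
    while True:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        if row and row.get("Seconds_Behind_Master") == 0:
            return
        time.sleep(1)

def failover(old_master, new_master):
    # 1. Stop accepting writes on the old master.
    old_master.cursor().execute("SET GLOBAL read_only = 1")
    # 2. Let the other master apply everything still in flight.
    wait_until_caught_up(new_master)
    # 3. Open the new master for writes; the application is then repointed at it.
    new_master.cursor().execute("SET GLOBAL read_only = 0")
```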
Dual Master Pros • Very simple configuration • Simple to understand = simple to maintain • Very similar to basic master-slave configuration that many are familiar with • Allows easy failover in either direction without reconfiguration or rebuilding • Allows for easy and reliable failover for non-emergency situations: upgrades, schema changes, etc. • Allows for quick failover in emergency • Can work between distant sites fairly easily
Dual Master Cons • Does not help scale writes (no, not at all) • Limited to two sites; replication does not allow multiple masters, so three or more is not possible • Replication is asynchronous, and may get behind -there is always a chance of data loss (albeit small)
SAN • Shared storage of a single set of disks by two MySQL servers, with a single copy of the data on a FibreChannel or IP/iSCSI SAN • Automatic (heartbeat) failover by fencing and mounting the SAN on the other machine
SAN Pros • Single copy of the data means lower storage cost for extremely large databases • No worries about replication getting behind • SAN systems can achieve very high performance for the same or lower cost as two very large RAIDs
SAN Cons • Single copy of the data means corruption is possible, and could be very damaging • For medium or small databases, cost can be prohibitive • FibreChannel requires additional infrastructure often not present in typical MySQL systems; iSCSI can be very helpful in this regard • Single copy of the data -- no schema change tricks are possible
DRBD • Block device-level replication between two machines with their own independent storage (mirrors of the same data) • Automatic (heartbeat-based) failover by fencing and mounting local copy of filesystem is typical
DRBD Pros • Simple hardware and infrastructure using locally-attached RAID • No expensive hardware or network
DRBD Cons • Complex configuration and maintenance • May cause performance problems, especially if poorly configured • Failure of or problems with mirror can cause problems in production • From the software perspective, there is still a single copy of the data, which may get corrupted • Single copy of the data -- no schema change tricks are possible
Putting It All Together
Partitioning + HA • No partitioning solutions really address HA; they treat the “shards” or “partitions” as single MySQL servers • In reality you would implement an HA solution for each partition • There are many possibilities
HiveDB + Dual Master • We recommend HiveDB plus Dual Master for most installations • While the setup is not technically perfect and carries a small chance of data loss, administrative tasks are very simple • Additionally, LVM for volume management gives the ability to take snapshot backups easily
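The LVM snapshot backup usually looks something like this sketch (volume group and LV names are placeholders; for a fully consistent copy you would also record the binlog position while the lock is held):

```python
# LVM snapshot backup sketch (volume names are placeholders).
import subprocess
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="root", passwd="secret")
cur = conn.cursor()

# Quiesce MySQL just long enough to take a consistent snapshot.
cur.execute("FLUSH TABLES WITH READ LOCK")
try:
    subprocess.check_call(
        ["lvcreate", "--snapshot", "--size", "10G", "--name", "mysql-snap", "/dev/vg0/mysql"]
    )
finally:
    cur.execute("UNLOCK TABLES")

# The snapshot can now be mounted and copied off at leisure, then removed
# with `lvremove /dev/vg0/mysql-snap`.
```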
Any questions? Discussion!