PubSubHubbub for Developers
Brett Slatkin Software Engineer Google Inc. September 28, 2009
Agenda
• Background
• Intro
• Motivation
• Scale
• Progress
Background
Why do real-time messaging?
• Syndication
  o Creating a "flow"
  o Simultaneous delivery of an event spurs immediate conversation
  o More participation enables more developed conversations, better exchange of ideas
  o Cross-site allows promotion, linking, swarming around sources, mash-ups, growth opportunity
• Business, politics
  o 1 minute of delay could cost a company millions, cause a political scandal, be harmful to investors, etc.
  o Concrete example: SEC earnings requirements
• Future applications (out of scope, but ...)
  o Financial data
  o Public scientific measurements (e.g., stream of weather data, traffic status, polling, votes)
  o Sensor networks
  o Emergency information distribution
  o Anything you can think of that's a stream of information!
Why do decentralized messaging?
• Web was built on decentralized protocols
• No single point of failure
• Interoperability is key to network effects and growth
• One API for application developers
Intro
What is PubSubHubbub?
• A simple publish/subscribe protocol
• Turns Atom and RSS feeds into real-time streams
• Web-scale, low-latency messaging
• Three participants: Publisher, Subscriber, Hubs
[Diagram: Publisher, Hub, Subscriber]
Design goals of PubSubHubbub
• Decentralized: No one company in control
• Scale to the size of the whole web
• Publishing and subscribing as easy as possible
• Complexity in the Hub
• Pragmatic (i.e., not theoretically perfect, but solve huge, known use cases with minimal effort)
How-to for Publishers
1. Add a declaration in your feed with your Hub of choice
2. Add something to your feed!
3. Send a ping to the Hub with the feed URL
     POST / HTTP/1.1
     Content-Type: application/x-www-form-urlencoded
     ...
     hub.mode=publish&hub.url=
4. 204 response = Success, 4xx = Bad request, 5xx = Try again
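
The publisher steps above can be sketched in Python. This is a minimal illustration, not a reference client: the hub and feed URLs are hypothetical, and the helper names `build_publish_ping` and `interpret_ping_status` are invented for this example. The request is built but not sent, so the snippet runs without a network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_publish_ping(hub_url: str, feed_url: str) -> Request:
    """Build (but do not send) the form-encoded publish ping for one feed."""
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def interpret_ping_status(status: int) -> str:
    """Map the hub's HTTP status to the outcomes listed in step 4."""
    if status == 204:
        return "success"
    if 400 <= status < 500:
        return "bad request"
    if 500 <= status < 600:
        return "try again"
    return "unexpected"


# Hypothetical hub and feed URLs, for illustration only.
ping = build_publish_ping("https://hub.example.com/", "https://blog.example.com/feed.atom")
print(ping.get_method(), ping.full_url)
print(ping.data.decode("ascii"))
```

To actually send the ping, pass the request to `urllib.request.urlopen`. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the status would be read from the exception's `code` attribute in those cases.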
How-to for Subscribers
1. Detect the Hub declaration in a feed
2. Send a subscribe request to the feed's Hub
     POST / HTTP/1.1
     Content-Type: application/x-www-form-urlencoded
     ...
     hub.mode=subscribe&hub.verify=sync&hub.topic=&hub.callback=
3. Hub will send a request to verify the subscription
     GET /callback?hub.challenge= HTTP/1.1

     HTTP/1.1 200 ...
     <echo random>
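
A rough Python sketch of steps 2 and 3 follows. It only builds the subscribe body and computes the reply to the verification request; wiring it into a web framework is left out. The URLs and the function names are hypothetical, and answering a missing challenge with a 400 is an assumption of this sketch, not something the slides specify.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_subscribe_body(topic_url: str, callback_url: str) -> str:
    """Form-encoded body for the subscribe request shown in step 2."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.verify": "sync",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    })


def answer_verification(request_target: str):
    """Return (status, body) for the hub's verification GET in step 3.

    The subscriber confirms the subscription by echoing hub.challenge.
    """
    query = parse_qs(urlsplit(request_target).query)
    challenge = query.get("hub.challenge")
    if not challenge:
        return 400, ""  # assumption: reject requests that carry no challenge
    return 200, challenge[0]


# Hypothetical topic and callback URLs.
print(build_subscribe_body("https://blog.example.com/feed.atom",
                           "https://reader.example.org/callback"))
print(answer_verification("/callback?hub.challenge=abc123"))
```

A production subscriber would probably also check that the verification request refers to a topic it actually asked for before echoing the challenge; the slide shows only the challenge parameter, so that check is omitted here.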
4. Process new content from the Hub
     POST /callback HTTP/1.1
     Content-Type: application/atom+xml
     ...
     Awesome feed
     ...
     <entry>
     ...
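
Processing the delivered content amounts to parsing an Atom document and walking its entries. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample payload, its entry ID, and the function name `entries_from_notification` are made up for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation


def entries_from_notification(body: bytes):
    """Extract id, title, and updated from each Atom entry in a POST body."""
    root = ET.fromstring(body)
    results = []
    for entry in root.iter(ATOM + "entry"):
        results.append({
            "id": entry.findtext(ATOM + "id", default=""),
            "title": entry.findtext(ATOM + "title", default=""),
            "updated": entry.findtext(ATOM + "updated", default=""),
        })
    return results


# Invented sample payload; a real notification comes from the hub.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Awesome feed</title>
  <entry>
    <id>tag:example.com,2009:1</id>
    <title>First post</title>
    <updated>2009-09-28T00:00:00Z</updated>
  </entry>
</feed>"""

for item in entries_from_notification(SAMPLE):
    print(item["id"], item["title"])
```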
The role of the Hub
• Logical component
  o Publishers may be their own Hub
  o Combined Hub/Publisher has p2p speed-up
• Distinct functions
  o Accept and verify subscriptions to new topics
  o Receive pings from publishers, retrieve content
  o Extract new/updated items from feed
  o Send all subscribers the new content
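
One plausible way to "extract new/updated items" is to remember a content hash per entry ID and forward only entries whose hash changed. The sketch below illustrates that idea; it is not the reference hub's code. The use of SHA-1, the in-memory `seen` dictionary, and the function names are all assumptions of this example.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Hex digest used as a compact fingerprint (hash choice is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def extract_changed(entries, seen):
    """Return entries that are new or whose content changed since last fetch.

    entries: iterable of (entry_id, content) pairs from the fetched feed.
    seen:    dict mapping entry-ID hash -> content hash; updated in place.
    """
    changed = []
    for entry_id, content in entries:
        id_hash = sha1_hex(entry_id)
        content_hash = sha1_hex(content)
        if seen.get(id_hash) != content_hash:
            changed.append((entry_id, content))
            seen[id_hash] = content_hash
    return changed


seen = {}
first = extract_changed([("a", "v1"), ("b", "v1")], seen)
second = extract_changed([("a", "v1"), ("b", "v2"), ("c", "v1")], seen)
print([entry_id for entry_id, _ in first])
print([entry_id for entry_id, _ in second])
```

On the second fetch only the updated entry and the new entry are forwarded, which is what keeps per-subscriber delivery small.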
The role of the Hub
• Scalability
  o # of subscribers & feeds, update frequency
  o Delegation of content distribution (= bandwidth)
• Reliability
  o Retry fetch, delivery, idempotence
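
The reliability bullet can be illustrated with a small retry loop. The slides do not say how a hub schedules retries, so the exponential backoff, the attempt limit, and the `send` callable below are assumptions made for this sketch.

```python
import time


def deliver_with_retry(send, payload, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send(payload) until it returns a 2xx status or attempts run out.

    send:  callable returning an HTTP status code (hypothetical transport).
    sleep: injectable so the loop can be exercised without real waiting.
    Returns True if a delivery succeeded, False otherwise.
    """
    for attempt in range(max_attempts):
        try:
            status = send(payload)
        except OSError:
            status = None  # treat network errors like a failed attempt
        if status is not None and 200 <= status < 300:
            return True
        if attempt + 1 < max_attempts:
            sleep(base_delay * 2 ** attempt)  # assumed backoff: 1s, 2s, 4s, ...
    return False


# Simulated subscriber that fails twice, then accepts the delivery.
calls, delays = [], []

def flaky(payload):
    calls.append(payload)
    return 503 if len(calls) < 3 else 204

print(deliver_with_retry(flaky, "entry-1", sleep=delays.append), delays)
```

Because a retry can resend content the subscriber already received, delivery only stays safe if processing is idempotent, for example by de-duplicating on the entry ID.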
How the hub works
See my talk on building a hub using App Engine http://tinyurl.com/building-a-hub
Security model
• Subscriber verification prevents DoS attacks
• Declaration of the Hub is a delegation of trust
  o Subscribers may trust the Hub to deliver content on publisher's behalf
  o v0.2 supports shared-secret HMACs for subscribers to verify that notifications came from the hub
• Privacy through HTTPS for hubs, feeds, and callbacks
  o URLs and payloads can be sent via encrypted channel
  o Subscribed topics are not discoverable
  o Unguessable, capability URLs (e.g., from OAuth)
• Publishers can run their own hub!
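
The shared-secret HMAC check can be sketched with Python's standard `hmac` module. The slides do not give the wire format, so the `sha1=<hex digest>` signature string and the choice of SHA-1 below are assumptions that should be checked against the spec; the function names are invented for this example.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Signature a hub might attach to a notification (format is assumed)."""
    return "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()


def signature_is_valid(secret: bytes, body: bytes, received: str) -> bool:
    """Recompute the HMAC over the raw body and compare in constant time."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


secret = b"shared-secret"   # agreed between subscriber and hub at subscribe time
body = b"<feed/>"           # raw bytes of the notification POST
signature = sign_body(secret, body)
print(signature_is_valid(secret, body, signature))         # matching secret and body
print(signature_is_valid(secret, body + b"x", signature))  # tampered body
```

The comparison must run over the exact bytes received, before any XML parsing or re-serialization, or a valid signature will fail to verify.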
Motivation
Push it to the limit
Why push content? Learn from our forefathers: TCP (est. 1974)
What is magical about TCP? The Window.
Without the window, the tube can't be full.
TCP maximizes the throughput of a link
• Dump data in, it will be received
• The window means no waiting for acks!
• When acks are missed, the sender will retransmit
• Receivers reassemble the message in-order, de-dupe
• Good citizenship with congestion control
Where is such efficiency for application-level protocols?
• Exists, but often proprietary or an interoperability nightmare (cough SOAP cough)
Why another protocol?
• We want interoperable, web-scale messaging
• Almost every company already has an internal system
  o TIBCO, WebSphere MQ, ActiveMQ, RabbitMQ, ...
  o Proprietary message payloads, topics, networks
• Existing attempts at a standard haven't caught on
  o XMPP weirds people out; started in 1999, still isn't widely used for interop beyond IM
  o These standards are too complex or not pragmatic (XEP-0060, WS-*, AMQP, RestMS, new REST-*)
• Build the simplest interoperable messaging protocol that can scale to the size of the web
• Make the base specification bare-bones, easy-to-use
• Target Atom/RSS initially as a payload format; everyone uses them for time-based, idempotent streams
• In the future, add extensions for cool stuff
• Proof of simplicity is in the code
  o Bret Taylor added PubSubHubbub subscription to FriendFeed in a single evening
Scale
Goal
• World-wide RSS publishing currently
  o ~X,000 updates per second
• Legitimate email currently
  o ~X,000,000 per second
• Need to scale by at least 1000x; hopefully more
• Trying to enable new use-cases
Light pinging
• Protocols exist for faster Atom/RSS
  o Ping-o-Matic, changes.xml, SUP, rssCloud
• All only indicate the feed URL that has changed
  o Still need to go and fetch the content
  o These protocols are just optimized polling
  o Equivalent to killing the TCP window!
• Optimized polling is still worse
  o Latency is high: 3 round trips
  o Thundering herd as subscribers fetch published feeds: unpredictable, bursty load pattern
  o More bandwidth, CPU, connection star-pattern
Light pinging at scale
What if you had to use light pinging at scale?
• Send out pings slowly to reduce the herd
• Herd causes all feeds to be fully regenerated
  o Invalidates existing caches
• Bandwidth increases extremely fast
  o (average updates per feed) * (# feeds) * (# subscribers) * (average feed size)
  o Often 99.5%+ more than you needed
• CPU costs increase for subscribers with update frequency
Consider a single-master replication scheme
• After each update, wait for copying to all replicas
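
To see why bandwidth "increases extremely fast", the light-ping formula above can be evaluated directly. The input numbers below are invented for illustration and are not measurements; the point is only that the cost is a product, so doubling three factors multiplies it by eight.

```python
def light_ping_bytes(updates_per_feed, feeds, subscribers_per_feed, feed_size_bytes):
    """Bytes transferred when every subscriber re-fetches the full feed per update."""
    return updates_per_feed * feeds * subscribers_per_feed * feed_size_bytes


# Hypothetical workload: 10 updates per feed, 1,000 feeds,
# 100 subscribers per feed, 20 KB average feed size.
base = light_ping_bytes(10, 1_000, 100, 20_000)

# Double the update rate, the number of feeds, and the subscribers per feed.
grown = light_ping_bytes(20, 2_000, 200, 20_000)

print(base)          # bytes for the base workload
print(grown / base)  # growth factor
```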
Fat pinging
Compared to light pings
• Latency: 1/3 as much
• Based on reasonable averages
  o Bandwidth: ~20x less
  o CPU: ~20x less
• Never wait for replication delays
Fat pinging at scale
What if you had to scale fat pinging?
• Run your own hub
• Compute feed deltas at update time; no need to regenerate a whole feed (or churn your caches)
• Send out new content at sustained network rate
• Bandwidth is the minimum possible per subscriber
  o (update size) * (# feeds) * (# subscribers)
• CPU cost is the minimum possible per subscriber
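
A short worked example of the fat-ping bandwidth formula, set against re-fetching whole feeds. All inputs are hypothetical: a 20 KB feed and 1 KB entries are chosen so that the ratio lands near the "~20x less" figure quoted in the slides, and "update size" is read here as the bytes of new content per feed over the period, which is an interpretation rather than something the slides state.

```python
def fat_ping_bytes(update_bytes_per_feed, feeds, subscribers_per_feed):
    """Bytes transferred when only new content is pushed to each subscriber."""
    return update_bytes_per_feed * feeds * subscribers_per_feed


# Hypothetical workload, not measured data.
feeds, subscribers_per_feed = 1_000, 100
updates_per_feed, feed_size, entry_size = 10, 20_000, 1_000

fat = fat_ping_bytes(updates_per_feed * entry_size, feeds, subscribers_per_feed)

# Same workload if every subscriber re-fetched the whole feed on each update.
refetch = updates_per_feed * feeds * subscribers_per_feed * feed_size

ratio = refetch / fat
print(fat, refetch, ratio)
```

With these inputs the ratio reduces to feed size divided by entry size, so the saving grows as feeds carry more old entries relative to each new one.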
Fat pinging at scale
Advanced protocol pieces
• Connection reuse from HTTP/1.1
• Pipeline HTTP requests for feed fetching
• Use aggregated content delivery
  o Many Atom feeds in a single XML doc
  o Fewer connections
Progress
PubSubHubbub status
• Over 100 million feeds are PubSubHubbub-enabled
• Companies: Google, FriendFeed (FB), livedoor, Six Apart, LiveJournal, LazyFeed, Superfeedr, ...
• Google products: FeedBurner, Blogger, Reader shared items, Google Alerts, ...
• Cool apps: Socnode, Reader2Twitter, chat gateways, ...
• More publishers, subscribers, hubs, apps on the way
• Publisher clients: Perl, PHP, Python, Ruby, Java, Haskell, C#, MovableType, WordPress, Django, Zend
• Active mailing list with 240+ members
Getting involved
• Review the spec; recommend improvements
  o Open process, will be licensed by Open Web Foundation
• Write some sample code for your favorite language or CMS
• Contribute to one of the open source Hub implementations
• Write on your blog about why we need push for the future
  o Do it for the children
What Facebook can do right now
• Subscribe to feeds that are PubSubHubbub-enabled
  o Put that great UI to work
  o Maybe reuse the FriendFeed index pipeline?
  o Call Bret and Ben
• Enable PubSubHubbub for activity streams
  o Provide Facebook app developers with real-time updates to users' home streams
  o Speeds up surfacing Facebook in other apps
  o Detecting new events could trigger the app to take action in real-time (send an email, classify a photo, initiate an action in a game, etc.)
What Facebook can do next
• Figure out if private feeds will work with this model
  o Run your own hub
  o Use capability URLs (OAuth token in the query string)
• Give your developers more feeds to consume and syndicate
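
A capability URL is simply a URL that is hard to guess, so possession of the URL acts as the permission. The sketch below shows the idea with a random token from Python's `secrets` module standing in for the OAuth token the slide mentions; the parameter name `token`, the feed URL, and the function name are hypothetical.

```python
import secrets
from urllib.parse import urlencode


def make_capability_url(base_feed_url: str, nbytes: int = 32) -> str:
    """Append an unguessable token to a feed URL (random stand-in for an OAuth token)."""
    token = secrets.token_urlsafe(nbytes)  # URL-safe text from nbytes of randomness
    return base_feed_url + "?" + urlencode({"token": token})


# Hypothetical private feed URL.
print(make_capability_url("https://example.com/private/feed.atom"))
```

Such URLs only stay private if they travel over HTTPS, which is why the security model pairs them with encrypted hubs, feeds, and callbacks.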
Rehash
• Push for the future! Scale to new use-cases
• Decentralized, open spec: no company owns it
• One API for all stream-based content
• Project page: http://pubsubhubbub.googlecode.com
  o Full Hub source code with tests
  o Example publisher and subscriber apps
  o Demo hub at http://pubsubhubbub.appspot.com
?
Hub storage space
• How much storage space does a Hub need?
  o Manageable costs
  o ~10 million feeds
  o ~1 million subscribers
  o Assume 1 billion events per day (~11,000/second)
Thar be dragons!
FeedEntryRecord
• Key name
  o "FeedEntryRecord" + entry_id_hash + parent key
  o 400 bytes, could be smaller
• Indexed properties
  o Entry ID hash (again -- doh!): 160 bytes
  o Entry content hash: 160 bytes
  o Update time: 8 bytes
• Unindexed properties
  o Entry ID: 2048 bytes maximum, 200 on average
Result
• ~1KB per entry
• 27TB per month at ~11,000 req/sec -- no sweat!
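
The storage estimate can be reproduced with a few lines of arithmetic. The per-field sizes are taken from the list above; rounding each entry up to 1 KiB and using a 30-day month are assumptions of this sketch, chosen because they land close to the quoted 27TB figure.

```python
# Per-entry sizes in bytes, as listed for FeedEntryRecord.
KEY_NAME = 400
ENTRY_ID_HASH = 160
CONTENT_HASH = 160
UPDATE_TIME = 8
ENTRY_ID_AVERAGE = 200

bytes_per_entry = KEY_NAME + ENTRY_ID_HASH + CONTENT_HASH + UPDATE_TIME + ENTRY_ID_AVERAGE

# 1 billion events per day expressed as a per-second rate.
events_per_second = 1_000_000_000 / (24 * 3600)

# Monthly volume at the rounded rate of 11,000 entries per second,
# assuming a 30-day month and 1 KiB stored per entry.
entries_per_month = 11_000 * 30 * 24 * 3600
tib_per_month = entries_per_month * 1024 / 2**40

print(bytes_per_entry)            # just under 1 KB
print(round(events_per_second))   # close to the quoted ~11,000/second
print(round(tib_per_month, 1))    # close to the quoted 27TB per month
```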
WebFinger
Unified discovery for email addresses
• Transform an email address into XRD
• XRD defines all the services that address has
• Helps provide social networking as a protocol
• E.g., simple way to discover if an account has a Portable Contacts interface