Introduction to P2P systems
Davide Carboni © 2005-2006
1
License
Attribution-ShareAlike 2.5 You are free: to copy, distribute, display, and perform the work to make derivative works to make commercial use of the work Under the following conditions: Attribution. You must give the original author credit. Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one. For any reuse or distribution, you must make clear to others the licence terms of this work. Any of these conditions can be waived if you get permission from the copyright holder. Your fair use and other rights are in no way affected by the above. This is a human-readable summary of the Legal Code (the full licence). Disclaimer
2
P2P is about sharing resources
Your CPU time Your bandwidth Your disk space
3
What is P2P From Wikipedia A peer-to-peer computer network is a network that relies on the computing power and bandwidth of the participants in the network rather than concentrating it in a relatively low number of servers
4
P2P and GRID From Wikipedia
Grid computing […] performs higher throughput computing by taking advantage of many networked computers to model a virtual computer architecture.
5
Topology Comparison client=server server client
Client/server
GRID
P2P
6
Overlay Mobile phones in cell xyz
Crs4.it
Australian ISP
7
Overlay Mobile phones in cell xyz
Crs4.it
Australian ISP
8
Three main issues in P2P systems
Bootstrapping Index/Lookup (query) Delivery of large objects (in case of file sharing)
9
A la Napster Query / Query Hits
GET 10
Copyright issues with Napster
Napster claimed that the law allows people to share music with friends. The court considered this position illegal and Napster was closed.
11
Gnutella Overlay
Requestor
Responder
12
Gnutella Messages Byte
Description
0 - 15
GUID
16 17
ping, pong, push, query, queryhit TTL
18
hops
19-22
Payload length
23 – 23+payload length 13
Gnutella messages
ping: discover hosts on network pong: reply to ping query: search for a file query hit: reply to query push: download request for firewalled servents Ref. http://rfc-gnutella.sourceforge.net/developer/stable/index.html 14
Gnutella: PING
Requestor
PING
15
Gnutella: PONG
Requestor
PONG
16
Gnutella: QUERY
Requestor
QUERY
17
Gnutella: QUERY-HITS C D
Requestor
Responder 1
B A
QUERY-HITS Responder 2
18
Gnutella: GET the file Responder 1 Requestor
GET file HTTP/1.1 file
19
Gnutella, behind firewalls
Requestor
Responder GET file
20
Gnutella, behind firewalls (2) C Responder D
Requestor B A
PUSH
21
Gnutella, behind firewalls (3) Responder Requestor FILE
22
Bootstrapping in Gnutella
X-Try Ping/Pong Storing from QueryHit messages GWebCache
23
Open issues in Gnutella
Latency Scalability Vulnerability Privacy Security
24
Is Gnutella obsolete?
Alive and Kicking The version 0.6 of the protocol prevents pure flooding and uses smart routing based on Ultrapeers More than 2 millions users with 500,000 nodes always up
25
Popularity of P2P Networks (measured by Slick.com)
Latest Statistics taken 2006-02-26 22:14:12: eDonkey2K Users: 3,474,261 FastTrack Users: 2,609,688 Gnutella Users: 2,219,539 Overnet Users: 578,521 MP2P Users: 252,893 Filetopia Users: 4,806 26
Hub (Gnutella2 et al.)
Hub Web
27
Hub Requirements
> 100 sockets CPU and RAM for servicing the network Uptime (>2 hours) Broadband (also for upload) Able to receive inbound TCP and/or UDP (IP in the global address space, no NAT)
28
Hub Tasks
Keep up-to-date information about other hubs Manage routing tables to route messages efficiently Manage filters for query messages Monitor they own resources.
29
Query Hash Table QHT
QHTs provide information to know that a particular node (and possibly its descendants) will not be able to provide any matching objects for a given query. queries can be discarded confidently. Neighbours know what their neighbours do not have, but cannot say for sure what they do have. 30
What is Hashing
From Wikipedia, the free encyclopedia A hash function or hash algorithm is a function for examining the input data and producing an output hash value. The process of computing such a value is known as hashing. The process of hashing has the property that two different inputs are unlikely to hash to the same hash value. 31
What is Hashing (2)
Collisions occur with 2^(-N) 32
Query Hash Table 1
1
1
0
1
2
1
1
1
1
1
1
1 2^N
0<= Hash(word) <= 2^N
33
Query Filtering
If any of the lookups based on URNs found a hit, send the query packet If at least two thirds of lookups based on words found a hit, send Otherwise, drop the packet Consider all text content in the query, including generic search text and metadata search text if it is present. Tokenize quoted phrases into words, ignoring the phrase at this level 34
Distributed hashtables
35
Distributed Hashtables
Main features: a key is mapped onto a node of the network. Several proposals: Chord, Pastry and Kademlia. Lookup(key) reaches the right node with O(log(N) ) hops.
36
How DHT works
In DHT each node has a node ID which belogs to a set S (for instance the set of bitstrings with length 160) Also keys must hashed in the same set S (hash(key) belongs to S)
37
Kademlia
S = [00 ....0 - 11 ...1] the set of 160bit strings Each node has a node ID in S For each 'key' hash(key) is in S
38
Kademlia distance
Given x,y in S Define the distance d(x,y) = xor(x,y) d has the following properties: d(x,y) = d(y,x) d(x,x) = 0 d(x,y) + d(y,z) >= d(x,z)
39
k-Buckets in kademlia
Each node stores an array of lists: list[i] i = 0,1, ... , 159 list[i] stores up to k tuples: (IP,port,ID) list[i] stores tuples whose ID is:
2^i <= D(this,ID)< 2^(i+1)
list[i] is ordered as LRS (last recent seen)
40
Tree for nodes in kademlia 0
1
0
1 1
0
1
0 0101
41
k-Buckets in kademlia
For small values of i, list[i] has few elements For larger values of i, list[i] is likely to contain more elements.
42
Operations in kademlia
PING (IP, port) STORE (key, value) FIND_VALUE (key) FIND_NODE (ID)
43
Lookup in Kademlia
FIND_NODE(hash(k)) Compute D=xor(this,hash(key)) Find a tuples in list[i] (i.e. a=3) Send FIND_NODE(hash(key)) to the 3 nodes I receive other node addresses. Reiterate FIND_NODE(hash(key)) on them. Stop when no new addresses are received
44
Nodes Joining and Leaving
Whenever one node asks another for its contacts, the called node stores the contact information of the caller. When a node joins the network it takes some of the contacts of an arbitrary node and uses them as its own. It then does a search for itself. This results in other nodes being called, which makes them aware of the new node's existence 45
Node Joining and Leaving (2)
A new node may have become the closest node to certain keys The previous closest nodes will replicate the appropriate key/value pairs to the new node Ignoring replication the cost of a node joining is only O(log n) messages.
46
Range Query in DHT (1)
DHT maps a key onto a node It is easy to lookup a value given a key It is uneasy lookup values in a range of keys Example 1:
Lookup all tuples in ‘aaaa’ < key < ‘bbbb’
Example 2:
Lookup all tuples in ’39,88’ < lat < ’39,94’
47
Possible applications of DHT
DHT DNS Content lookup Web search engine
48
DNS over DHT (1)
Problem: how to register a name onto a IP address Assign a name to your machine, example ‘mymachine’ Check if this name is available or not using the DHT operation get(‘mymachine’). If the result is null then register the name and the IP with the DHT operation put(‘mymachine’, 212.22..) 49
DNS over DHT (2)
Problem: how to resolve a name onto a IP address Use the DHT operation get(hostname). The result if not null is the IP address you’re searching
50
Content indexing/lookup on DHT
A content has a set of metadata (i.e. author, editor, genre, …) Build a different index based on DHT for each metadata i.e. the index for author
put(‘john’, http://host/dir/content.avi)
51
Web crawlers and DHT
Assume a network of nodes in a DHT Assume each node runs also a crawler. For each word in a Web page it performs
Put(word,URL)
So a distributed index of the Web is built[1]
52
Web search and DHT
When the user type a keyword ‘foo’ lookup the DHT
Get(‘foo’)
The DHT will give the list of URL indexed with ‘foo’
53
References (1)
Napster Timeline
The Gnutella Developer Forum
http://www.the-gdf.org/wiki/index.php?title=Main_Page
History of Gnutella in ‘Gnutella’
http://www.cnn.tv/SPECIALS/2001/napster/timeline.html
http://ntrg.cs.tcd.ie/undergrad/4ba2.02-03/p5.html
Slyck.com DHT Links
http://www.etse.urv.es/~cpairot/dhts.html 54
References (2)
YACY (DHT Web search/index)
Kademlia: A Peer-to-peer Information System Based on the XOR Metric. (paper) Khashmir – Kademlia in Python
http://www.yacy.net/yacy/
http://khashmir.sourceforge.net/
A Case Study in Building Layered DHT Applications (paper on range query/DHT)
http://www.placelab.org/publications/pubs/IRS-TR-05-001.pdf 55
License
Attribution-ShareAlike 2.5 You are free: to copy, distribute, display, and perform the work to make derivative works to make commercial use of the work Under the following conditions: Attribution. You must give the original author credit. Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one. For any reuse or distribution, you must make clear to others the licence terms of this work. Any of these conditions can be waived if you get permission from the copyright holder. Your fair use and other rights are in no way affected by the above. This is a human-readable summary of the Legal Code (the full licence). Disclaimer
56