Hands-On Hadoop Tutorial Chris Sosa Wolfgang Richter May 23, 2008
General Information
Hadoop uses HDFS, a distributed file system based on GFS, as its shared filesystem
HDFS architecture divides files into large chunks (~64 MB) distributed across data servers
HDFS has a global namespace
General Information (cont’d)
Provided a script for your convenience
– Run source /localtmp/hadoop/setupVars from centurion064
– Changes all uses of {somePath}/command to just command
Go to http://www.cs.virginia.edu/~cbs6n/hadoop for web access. These slides and more information are also available there.
Once you use the DFS (put something in it), relative paths are resolved from /usr/{your user id}. E.g., if your id is tb28, your “home dir” is /usr/tb28
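A minimal sketch of how that resolution works, reusing the tb28 id from above (the file name notes.txt is just illustrative):

  hadoop dfs -put notes.txt notes.txt   # relative destination: stored in HDFS as /usr/tb28/notes.txt
  hadoop dfs -ls /usr/tb28              # the same directory, addressed with an absolute path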
Master Node
Hadoop is currently configured with centurion064 as the master node
Master node
– Keeps track of namespace and metadata about items
– Keeps track of MapReduce jobs in the system
Slave Nodes
Centurion064 also acts as a slave node
Slave nodes
– Manage blocks of data sent from master node
– In terms of GFS, these are the chunkservers
Currently centurion060 is also a slave node
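As a quick way to see which role a machine is playing, you can list the running Hadoop Java daemons with the JDK's jps tool (an extra suggestion, not part of the original slides; daemon names assume a stock Hadoop 0.15.x setup):

  jps
  # on centurion064 (master + slave): expect NameNode and JobTracker alongside DataNode and TaskTracker
  # on a slave-only machine such as centurion060: expect just DataNode and TaskTracker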
Hadoop Paths
Hadoop is locally “installed” on each machine
– Installed location is /localtmp/hadoop/hadoop-0.15.3
– Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is automatically created by the DFS)
– /localtmp/hadoop is owned by group gbg (someone in this group or a CS admin must administer this)
Files are divided into 64 MB chunks (this is configurable)
Starting / Stopping Hadoop
For the purposes of this tutorial, we assume you have run setupVars from earlier
start-all.sh
– starts all slave nodes and master node
stop-all.sh
– stops all slave nodes and master node
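A minimal sketch of a start/stop session, run on the master node centurion064 (the jps check is an optional extra, not from the original slides):

  source /localtmp/hadoop/setupVars   # so start-all.sh / stop-all.sh are on the path
  start-all.sh                        # starts the master daemons here and the slave daemons on every listed slave
  jps                                 # (optional) confirm the Hadoop Java daemons are running
  stop-all.sh                         # shuts the whole cluster back down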
Using HDFS (1/2)
hadoop dfs
– [-ls <path>]
– [-du <path>]
– [-cp <src> <dst>]
– [-rm <path>]
– [-put <localsrc> <dst>]
– [-copyFromLocal <localsrc> <dst>]
– [-moveFromLocal <localsrc> <dst>]
– [-get [-crc] <src> <localdst>]
– [-cat <src>]
– [-copyToLocal [-crc] <src> <localdst>]
– [-moveToLocal [-crc] <src> <localdst>]
– [-mkdir <path>]
– [-touchz <path>]
– [-test -[ezd] <path>]
– [-stat [format] <path>]
– [-help [cmd]]
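A few of these in action, as a sketch (the directory and file names, like input and mydata.txt, are just illustrative):

  hadoop dfs -mkdir input                      # creates /usr/{your id}/input
  hadoop dfs -put mydata.txt input/mydata.txt  # copy a local file into HDFS
  hadoop dfs -ls input                         # list the new directory
  hadoop dfs -cat input/mydata.txt             # print the file's contents
  hadoop dfs -get input/mydata.txt copy.txt    # copy it back out to the local filesystem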
Using HDFS (2/2)
Want to reformat?
Easy
– hadoop namenode -format
Basically we see most commands look similar
– hadoop “some command” options
– If you just type hadoop you get all possible commands (including undocumented ones – hooray)
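A minimal sketch of a full reformat cycle using the command above (assumes you are on centurion064 and can afford to lose everything currently stored in HDFS):

  stop-all.sh               # stop the cluster first
  hadoop namenode -format   # wipe and re-initialize the HDFS namespace
  start-all.sh              # bring the cluster back up with an empty DFS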
To Add Another Slave
This adds another data node / job execution site to the pool
– Hadoop dynamically uses the filesystem underneath it
– If more space is available on the HDD, HDFS will try to use it when it needs to
Modify the slaves file
– In centurion064:/localtmp/hadoop/hadoop-0.15.3/conf
– Copy the code installation dir to newMachine:/localtmp/hadoop/hadoop-0.15.3 (very small)
– Restart Hadoop
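These steps as a concrete sketch, assuming a hypothetical new slave named centurion065 with passwordless SSH set up from the master:

  # on centurion064
  echo centurion065 >> /localtmp/hadoop/hadoop-0.15.3/conf/slaves        # add the new host to the slaves file
  scp -r /localtmp/hadoop/hadoop-0.15.3 centurion065:/localtmp/hadoop/   # copy the (small) installation over
  stop-all.sh; start-all.sh                                              # restart Hadoop so the new slave joins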
Configure Hadoop
Can configure in {installation dir}/conf
– hadoop-default.xml for global defaults
– hadoop-site.xml for site-specific settings (overrides global)
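As an example, a sketch of a site-specific override placed inside hadoop-site.xml's <configuration> element (the property names dfs.block.size and dfs.replication are from this 0.15.x generation of Hadoop; the values shown are just illustrative assumptions):

  <property><name>dfs.block.size</name><value>67108864</value></property>   <!-- 64 MB chunks -->
  <property><name>dfs.replication</name><value>2</value></property>         <!-- keep two copies of each block -->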
That’s it for Configuration!
Real-time Access