Hadoop - Data Mining Research Group

Hadoop: Introduction, Installation and Configuration
Data Mining Group, Xiamen University

Hadoop: A Distributed, Data-Intensive Programming Framework

- HDFS: distributed storage
- MapReduce: parallel computing

Introduction to HDFS

The Hadoop Distributed File System (HDFS) is an open-source implementation of the Google File System (GFS). It has many similarities with other distributed file systems, but it also differs from them in several ways. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.

How does it work?

Features

An important feature of the design: data is never moved through the NameNode. Instead, all data transfer occurs directly between clients and DataNodes.

MapReduce?

Let's talk about it next time.

"Running Hadoop"? What does it mean?

"Running Hadoop" means running a set of daemons:

- NameNode
- DataNode
- Secondary NameNode
- JobTracker
- TaskTracker

Who works for whom?

[Diagram: the Hadoop daemons grouped by layer. HDFS: NameNode, Secondary NameNode, DataNode. MapReduce: JobTracker, TaskTracker.]

Hadoop employs a master/slave architecture for both distributed storage and distributed computation.

NameNode and DataNode

NameNode is the master of HDFS: it directs the slave DataNode daemons to perform the low-level I/O tasks. NameNode is the bookkeeper of HDFS:

- keeps track of how your files are broken down into file blocks
- keeps track of the overall health of the distributed filesystem

DataNode:

- reads and writes HDFS blocks for clients
- communicates with other DataNodes to replicate its data blocks for redundancy

Secondary NameNode

The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS.

- differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS
- communicates with the NameNode to take snapshots of the HDFS metadata

Recovery: what if the NameNode fails? We reconfigure the cluster to use the SNN as the primary NameNode.

JobTracker and TaskTracker

JobTracker:

- the liaison between your application and Hadoop
- once you submit your code to your cluster, the JobTracker determines the execution plan:
  - determines which files to process
  - assigns nodes to different tasks
  - monitors all tasks as they're running
- What if a task fails? The JobTracker will relaunch the task on a different node.

TaskTracker:

- Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.

Installation and Configuration

- Pseudo-distributed mode: all daemons run on one machine
- Fully distributed mode

What is different?

Installation for pseudo-distributed mode

Prerequisites:

- Ubuntu Linux
- Hadoop 0.20.2
- Sun Java 6

    $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
    $ sudo apt-get update
    $ sudo apt-get install sun-java6-jdk

Configuring SSH

Hadoop requires SSH access to manage its nodes, remote machines plus your local machine if yo