MapReduce Inverted Index Algorithm
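
As orientation for the report below, here is a minimal in-memory sketch of what an inverted index computes: for every word, the files that contain it and how often. It is not the MapReduce program from the report. The class name InvertedIndexSketch and the method build are hypothetical, the sample strings are made up, and stop-word filtering is left out. It is written without lambdas so that it should also compile on a JDK 7 like the one listed in the report.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: a single-process inverted index.
public class InvertedIndexSketch {

    /** Maps each word to (file name -> number of occurrences in that file). */
    public static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<String, Map<String, Integer>>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Replace every run of non-alphanumeric characters with one space.
            String cleaned = doc.getValue().replaceAll("[^a-zA-Z0-9]+", " ").trim();
            if (cleaned.isEmpty()) {
                continue;
            }
            for (String token : cleaned.split(" ")) {
                String word = token.toLowerCase();
                Map<String, Integer> postings = index.get(word);
                if (postings == null) {
                    postings = new TreeMap<String, Integer>();
                    index.put(word, postings);
                }
                Integer count = postings.get(doc.getKey());
                postings.put(doc.getKey(), count == null ? 1 : count + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("hamlet.txt", "To be, or not to be");
        docs.put("macbeth.txt", "What's done cannot be undone");
        for (Map.Entry<String, Map<String, Integer>> e : build(docs).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With these two sample strings, the word "be" maps to hamlet.txt with a count of 2 and to macbeth.txt with a count of 1. The MapReduce version spreads the same counting over map, combine, partition, and reduce steps.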

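MapReduce Programming Report

Name:
Student ID:
Title: Inverted Index Algorithm for the Collected Works of Shakespeare

1. Experiment Environment

Lenovo PC
Virtual machine: VM 10.0
Operating system: CentOS 6.4
Hadoop version: hadoop 1.2.1
JDK version: jdk-7u25
Eclipse version: eclipse-SDK-4.2.2-linux-gtk-x86_64

2. Experiment Design and Source Code

2.1 Experiment Description

Build an inverted index over the document data of the collected works of Shakespeare and write the result to a specified file.

2.2 Experiment Design

(1) InvertedIndexMapper class

This class overrides the map method of Mapper. The input parameter value is one line of a text file. A regular expression replaces every symbol that is not a letter or a digit with a space, and StringTokenizer then splits the line into words. Words found in the stop-word list ls are skipped. For every remaining word the mapper emits an output key made of the word plus the name of the file containing it, and an output value of 1.

    public static class InvertedIndexMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get the file name and preprocess the line
            FileSplit filesplit = (FileSplit) context.getInputSplit();
            String filename = filesplit.getPath().getName();
            String line = value.toString();
            String s;
            // Use a regular expression to replace symbols that are
            // neither digits nor letters with a space
            Pattern p = Pattern.compile("[^a-zA-Z0-9]+");
            Matcher m = p.matcher(line);
            String line2 = m.replaceAll(" ");
            // Split the string on whitespace
            StringTokenizer itr = new StringTokenizer(line2);
            while (itr.hasMoreTokens()) {
                s = itr.nextToken().toLowerCase();
                // ls is the stop-word list filled in the main program
                if (!ls.contains(s)) {
                    // Combine the word and the name of the file containing it
                    Text filename_num = new Text(s + "," + filename);
                    context.write(filename_num, one);
                }
            }
        }
    }

(2) InvertedIndexPartitioner class

This class is a custom Partitioner. It overrides getPartition() to choose the part of the key used for partitioning: the key is split on the separator and only the first part, the word, is passed on to the hash partitioner. All keys with the same word therefore go to the same reduce task. The method has to return the partition number computed by the parent class; returning a constant 0 would send every key to the first reduce task.

    public static class InvertedIndexPartitioner
            extends HashPartitioner<Text, IntWritable> {
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Partition on the word only, not on word + file name
            Text key1 = new Text(key.toString().split(",")[0]);
            return super.getPartition(key1, value, numReduceTasks);
        }
    }

(3) CombineReducer class

This class runs after the map output is produced and before it is passed to reduce. It is a small reduce operation that sums the counts for each word and file name pair locally, which lowers the workload of the reduce phase and improves performance.

    public static class CombineReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

(4) InvertedIndexReducer class

This class overrides the reduce method of Reducer. After the combine step, the reduce input key is the word plus the name of the file containing it, and the value is the frequency of the word in that file. Since the index is inverted, the output key can only be the word. The reducer takes the first part of the key as the word and merges the second part, the file name, with the summed value to form the new value. Because keys reach each reduce task in sorted order, all keys for the same word are adjacent, so a change of word signals that the entries of the previous word are complete and can be written. The output key is the word and the output value is the list of file names with their word frequencies.

    public static class InvertedIndexReducer
            extends Reducer<Text, IntWritable, Text, Text> {
        private Text filename_num = new Text();

        // p is a global variable of type List that stores the file name and
        // frequency of each occurrence of the current word; newkey is a
        // global Text variable holding the current word (assumed here to
        // start out as null).
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            Text key1 = new Text(key.toString().split(",")[0]); // the word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            if (newkey == null || !newkey.equals(key1)) {
                if (newkey != null) {
                    StringBuffer all = new StringBuffer();
                    for (Text t : p) {
                        all.append(t.toString());
                        all.append(";");
                    }
                    context.write(newkey, new Text(all.toString()));
                    p.clear(); // reset p once the result of a word is written
                }
                newkey = new Text(key1); // switch to the next word
            }
            filename_num = new Text(key.toString().split(",")[1] + sum);
            p.add(filename_num);
        }

        // Cleanup for the reduce phase: write the result of the last word
        public void cleanup(Context context)
                throws IOException, InterruptedException {
            if (newkey == null) {
                return; // no input reached this reduce task
            }
            StringBuffer all = new StringBuffer();
            for (Text t : p) {
                all.append(t);
                all.append(";");
            }
            context.write(newkey, new Text(all.toString()));
        }
    }

(5) Main program

The main program reads the stop words from HDFS, defines a job, and applies the necessary settings.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // Read the stop words from HDFS
        String uri = "hdfs://localhost:8000/user/tzj/stop_words";
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = fs.open(new Path(uri));
        InputStreamReader lsr = new InputStreamReader(in);
        BufferedReader buf = new BufferedReader(lsr);
        String input;
        while ((input = buf.readLine()) != null) {
            ls.add(input);
        }
        System.out.println("The stop_words are:");
        Iterator<String> it = ls.iterator();
        while (it.hasNext()) {
            System.out.print(it.next() + " ");
        }
        System.out.println();
        Job job = new Job(conf, "word count");
        FileInputFormat.addInputPath(job, new Path(otherAr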