Large-Scale Data Processing / Cloud Computing
Lecture 1: Introduction to MapReduce

What is this course about?
  Data-intensive information processing
  Large-data (“web-scale”) problems
  Focus on MapReduce programming
  An entry-level course

What is MapReduce?
  Programming model for expressing distributed computations at a massive scale
  Execution framework for organizing and performing such computations
  Open-source implementation called Hadoop

Why Large Data?

How much data?
  Google processes 20 PB a day (2008)
  Wayback Machine has 3 PB + 100 TB/month (3/2009)
  Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
  eBay has 6.5 PB of user data + 50 TB/day (5/2009)
  CERN's LHC will generate 15 PB a year (?)
  “640K ought to be enough for anybody.”

Happening everywhere!
  [Diagram labels: Molecular biology (cancer), microarray chips; Particle events (LHC), particle colliders; Simulations (Millennium), microprocessors; Network traffic (spam), fiber optics; 300M/day, 1B, 1M/sec]
  [Photos: Maximilien Brice, CERN]

No data like more data!
  (Banko and Brill, ACL 2001)
  (Brants et al., EMNLP 2007)
  s/knowledge/data/g;
  How do we get here if we're not Google?

Example: information extraction
  Answering factoid questions
    Pattern matching on the Web
    Works amazingly well
    Who shot Abraham Lincoln? X shot Abraham Lincoln
  Learning relations
    Start with seed instances
    Search for patterns on the Web
    Using patterns to find more instances
    Seed instances: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
    Matching text: “Wolfgang Amadeus Mozart (1756 - 1791)”, “Einstein was born in 1879”
    Patterns: “PERSON (DATE”, “PERSON was born in DATE”
  (Brill et al., TREC 2001; Lin, ACM TOIS 2007)
  (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; )

Example: Scene Completion
  Image Database Grouped by Semantic Content
    30 different Flickr.com groups
    2.3 M images total (396 GB).
  Select Candidate Images Most Suitable for Filling Hole
    Classify images with gist scene detector (Torralba)
    Color similarity
    Local context matching
  Computation
    Index images offline
    50 min. scene matching, 20 min. local matching, 4 min. compositing
    Reduces to 5 minutes total by using 5 machines
  Extension
    Flickr.com has over 500 million images
  Hays, Efros (CMU), “Scene Completion Using Millions of Photographs”, SIGGRAPH, 2007

More Data More Gains?

CNNIC Statistics on Internet Development in China
  As of the end of June 2010, China had 420 million Internet users, and the Internet penetration rate continued to rise, reaching 31.8%. Mobile Internet users were the main driver of overall growth: 43.34 million were added in half a year, bringing the total to 277 million, an increase of 18.6%. Notably, the commercialization of the Internet advanced rapidly: online shoppers nationwide reached 140 million, and the half-year user growth rates of online payment, online shopping, and online banking were all around 30%, far exceeding those of other Internet applications.

Basic Statistics of the National Press and Publishing Industry, 2009
  In 2009, 238,868 book titles were published (145,475 first editions; 93,393 revised editions and reprints), with a total print run of 3.788 billion copies (sheets) and 31.246 billion printed sheets in total, equivalent to 734,000 tonnes of paper (including 141 million printed sheets for supplements, equivalent to 3,300 tonnes of paper). The total list price was 56.727 billion yuan (including 473 million yuan for supplements). Compared with the previous year, the number of titles grew by 8.86% (first editions by 11.24%; revised editions and reprints by 5.36%), the total print run by 4.53%, total printed sheets by 4.61%, and the total list price by 8.94%.

Did you know?
  “We are currently preparing our students for jobs that don't yet exist ”
  “It is estimated that a week's worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century”
  “The amount of new technical information is doubling every 2 years”
  “So what does IT ALL MEAN?”
  “We are living in exponential times”

Two Different Views
  a “thrower-awayer”
  MyLifeBits
  “Throwing things away and retrieving them again when necessary costs far less than maintaining them.”
  “trying to live an efficient life so that one has time to work and be with one's family.”
  Jennifer Widom
  Gordon Bell

Information Overloading
  One reason we fail to apply what we learn: information overload
  Of the information we encounter only once, we usually remember just a small fraction.
  We should learn fewer things in depth rather than many things superficially.
  The key to mastering something is spaced repetition.
  Once people have truly mastered their work, they become more creative and can even work wonders.

What is Cloud Computing?

The best thing since sliced bread?
  Before clouds
    Grids
    Vector supercomputers
  Cloud computing means many different things:
    Large-data processing
    Rebranding of web 2.0
    Utility computing
    Everything as a service

Rebranding of web 2.0
  Rich, interactive web applications
    Clouds refer to the servers that run them
    AJAX as the de facto standard (for better or worse)
    Examples: Facebook, YouTube, Gmail,
  “The network is the computer”: take two
    User data is stored “in the clouds”
    Rise of the netbook, smartphones, etc.
    Browser is the OS

Utility Computing
  [Image source: Wikipedia (Electricity meter)]
  What?
    Computing resources as a metered service (“pay as you go”)
    Ability to dynamically provision virtual machines
  Why?
    Cost: capital vs. operating expenses
    Scalability: “infinite” capacity
    Elasticity: scale up or down on demand
  Does it make sense?
    Benefits to cloud users
    Business case for cloud providers
  “I think there is a world market for about five computers.”

Everything as a Service
  Utility computing = Infrastructure as a Service (IaaS)
    Why buy machines when you can rent cycles?
    Examples: Amazon's EC2, Rackspace
  Platform as a Service (PaaS)
    Give me nice API and take care of the maintenance, upgrades,
    Example: Google App Engine
  Software as a Service (SaaS)
    Just run it for me!
    Example: Gmail, Salesforce

Utility Computing
  “Pay-as-you-go” is like plugging into a wall socket: you get the same voltage that Microsoft gets; you simply use less and pay less. The goal of utility computing is to give computing resources the same kind of service capability: users can draw on the computing resources that a Fortune 500 company has, but use less and pay less. This is an important aspect of cloud computing,
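
The “use less, pay less” idea behind utility computing can be made concrete with a small cost comparison. The sketch below is illustrative only: the two prices, the function names, and both workloads are hypothetical assumptions, not figures from the lecture or from any real provider. It contrasts a metered bill (pay for machine-hours actually used) with an owned cluster that must be sized for peak demand and paid for every hour.

```python
# Hypothetical prices in integer cents per machine-hour (not real provider rates).
METERED_CENTS_PER_MACHINE_HOUR = 10   # assumed "pay as you go" price
OWNED_CENTS_PER_MACHINE_HOUR = 6      # assumed amortized cost of an owned machine


def metered_cost(demand, rate=METERED_CENTS_PER_MACHINE_HOUR):
    """Utility model: pay only for the machine-hours actually used."""
    return sum(demand) * rate


def owned_cost(demand, rate=OWNED_CENTS_PER_MACHINE_HOUR):
    """Ownership model: size the cluster for peak demand, pay for it every hour."""
    if not demand:
        return 0
    return max(demand) * len(demand) * rate


# Machines needed in each of six consecutive hours (made-up workloads).
bursty = [2, 2, 3, 50, 4, 2]
steady = [50, 50, 50, 50, 50, 50]

if __name__ == "__main__":
    for name, demand in (("bursty", bursty), ("steady", steady)):
        print(name, "metered:", metered_cost(demand), "owned:", owned_cost(demand))
```

Under these assumed numbers the bursty workload is cheaper on the metered service, while the steady workload is cheaper on owned machines, which is one way to read the slide's question “Does it make sense?”: the answer depends on how uneven the demand is.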