Introduction to Big Data (English Presentation)

BIG DATA

Every minute:
    Didi rides hailed: 1,388 cabs and 2,777 private cars
    WeChat: 395,833 people log in; 194,444 people are video or audio chatting
    Youku Tudou: 625,000 videos being watched
    Weibo: 64,814 posts and reposts
    Search: 4,166,667 search queries
    Alibaba: 774 people buy something on Alibaba's marketplaces; US$1,133,942 spent

CONTENTS
1 Definition
2 Characteristics
3 NoSQL
4 RDBMS
5 MapReduce
6 Applications

1 Definition

Big data is the large volume of data that accumulates on a day-to-day basis. What matters is the important data in it, which can be used for better decisions.

2 Characteristics

Volume: the quantity of generated and stored data.

Variety: the type and nature of the data.

Velocity: in this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Variability: inconsistency of the data set can hamper the processes that handle and manage it.

Veracity: the quality of captured data can vary greatly, affecting accurate analysis.

3 NoSQL

NoSQL here refers to document-oriented databases. (More broadly, the term also covers key-value, column-family, and graph stores.)

SQL databases do not scale well horizontally.

A document database is schemaless, but not formless: documents are stored in JSON format. JSON is a data interchange format.

Examples: MongoDB, CouchDB.

3 NoSQL: BASE Model

Basic Availability: spread data across many storage systems with a high degree of replication.

Soft State: data consistency is the developer's problem and should not be handled by the database.

Eventual Consistency: at some point in the future, data will converge to a consistent state. No guarantees are made about when.
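The BASE model above can be made concrete with a toy example. This is only a minimal sketch, assuming a "newest version wins" rule for resolving conflicts; the `Replica` class, the `sync` function, and the version numbers are hypothetical names introduced for illustration and do not correspond to the API of MongoDB, CouchDB, or any other database.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One storage node holding key -> (version, value). Hypothetical."""
    name: str
    store: dict = field(default_factory=dict)

    def write(self, key, value, version):
        # Keep the write only if it is newer than what this node already holds.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Assumed reconciliation step: exchange entries, newest version wins."""
    for key, (version, value) in list(a.store.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.store.items()):
        a.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("views", 1250000, version=1)  # this write reaches r1 only
r2.write("views", 1250001, version=2)  # a later write reaches r2 only

before = (r1.read("views"), r2.read("views"))  # replicas disagree (soft state)
sync(r1, r2)
after = (r1.read("views"), r2.read("views"))   # replicas have converged
```

Until `sync` runs, a reader may see either value depending on which replica answers. That window is what "no guarantees are made about when" refers to.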
3 NoSQL: JSON Structure

A document is a set of field/value pairs:

    {
        field1: value1,
        field2: value2,
        ...
        fieldN: valueN
    }

Example:

    var mydoc = {
        _id: ObjectId("5099803df3f4948bd2f98391"),
        name: { first: "Alan", last: "Turing" },
        birth: new Date("Jun 23, 1912"),
        death: new Date("Jun 07, 1954"),
        contribs: [ "Turing machine", "Turing test" ],
        views: NumberLong(1250000)
    }

3 NoSQL: RDBMS vs NoSQL

Row DB (each record is stored together, keyed by its row ID):

    001:10,Smith,Joe,40000;
    002:12,Jones,Mary,50000;
    003:11,Johnson,Cathy,44000;
    004:22,Jones,Bob,55000;

Index:

    001:40000;002:50000;003:44000;004:55000;

Column DB (each column is stored together, with every value tagged by its row ID):

    10:001,12:002,11:003,22:004;
    Smith:001,Jones:002,Johnson:003,Jones:004;
    Joe:001,Mary:002,Cathy:003,Bob:004;
    40000:001,50000:002,44000:003,55000:004;

A repeated value (here, Jones) can be stored once with both row IDs:

    ;Smith:001,Jones:002,004,Johnson:003;

3 NoSQL: Benefits

Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns, because reading that smaller subset of data can be faster than reading all the data.

Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and can replace the old column data without touching any other columns of the rows.

Row-oriented organizations are more efficient when many columns of a single row are required at the same time and the row size is relatively small, as the entire row can be retrieved with a single disk seek.

Row-oriented organizations are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.

3 NoSQL: SQL vs Non SQL

A good compromise is to design your system with 3 logical DBs:

1. A normal SQL DB used by your admin application to create content.
2. A NoSQL DB for the front-end, high-volume application used by the public internet.
3. A DB for the analytical reporting system, using cubes and similar tools.
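As a rough illustration of the row-oriented versus column-oriented trade-off described earlier in this section, the sketch below holds the same four records in both layouts using plain Python structures. The field names in `FIELDS` are assumptions (the slides do not name the columns), and in-memory lists only mimic the access pattern, not real disk seeks.

```python
# The four records from the Row DB example. Field names are assumed.
FIELDS = ("emp_id", "last_name", "first_name", "salary")

# Row-oriented layout: each record is stored together.
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

# Column-oriented layout: each column is stored together.
columns = {name: [row[i] for row in rows] for i, name in enumerate(FIELDS)}

# An aggregate over one column only touches that column's list.
total_salary = sum(columns["salary"])

# Fetching a whole record is a single access in the row layout...
third_row = rows[2]

# ...but needs one access per column in the column layout.
third_row_from_columns = tuple(columns[name][2] for name in FIELDS)
```

Here `total_salary` reads 4 values in the column layout, whereas the row layout would have to walk all 16 stored fields to reach the same salaries.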
Data then flows from the Admin DB to the client NoSQL DB when someone "publishes" a piece of content. The client (NoSQL) DB provides very fast read access and records user interactions with the content. A scheduled job then pulls the data from the client DB into the reporting system. Since the admin, client, and reporting systems are often separate apps, each application team can work with data in the format that best serves its application, and the transition from one system to the other is handled in the service layers.

4 RDBMS

Fixed-schema, row-oriented databases with ACID properties and a sophisticated SQL query engine.

The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through the SQL language.

You can easily create secondary indexes, perform complex inner and outer joins, and count, sum, sort, group, and page your data across a number of tables, rows, and columns.

5 MapReduce

    Divide and conquer
    Highly fault tolerant
    Every data block is replicated on 3 nodes
    Difficult to implement

5 Comparison

                 RDBMS                        MapReduce
    Data size    GB                           PB
    Access       Interactive and batch        Batch
    Updates      Read/write many times        Write once, read many times
    Structure    Static schema                Dynamic schema
    Integrity    High (ACID)                  Low
    Scaling      Nonlinear                    Linear
    DBA ratio    1:40                         1:3000

5 How does MapReduce work

MapReduce uses key/value pairs, whereas a traditional database uses rows and columns.

Map: the map step emits intermediate key/value pairs. All the intermediate values for a given output key are then combined together into a list.
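The key/value flow described above can be sketched in a few lines of plain Python. This is a single-process illustration, assuming a word-count job; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and are not the Hadoop API. A real framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Combine all intermediate values for a given key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)


documents = ["big data big decisions", "data every minute"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(pairs)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
```

After the shuffle, `grouped["big"]` is the list `[1, 1]`, which is the "list of intermediate values for a given output key" that the reduce step receives.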