Overview of the Development of High-Performance Computing

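High-Performance Computing and Applications

Instructors
- Wang Yunlan (王云岚) EMAIL:
- Zhao Tianhai (赵天海) EMAIL:
- High-Performance Computing Research and Development Center
- Office: 3rd floor, Yongzi Building (勇字楼); Phone: 88493434 (O)

Course Objectives
- Master high-performance computing (HPC) programming tools and use them to solve related problems.
- Main course content: HPC system architecture, design methods for high-performance parallel programs, and the latest directions in HPC technology. Topics include:
  - High-performance processors and multiprocessor systems
  - Cluster computing systems, Linux cluster configuration, cluster resource management and job scheduling, multithreaded programming and performance optimization
  - Parallel programming tools: OpenMP, MPI, CUDA, MapReduce, etc.
- Discussion platform: 2013 HPC course QQ group: 158463721

Assignments
- Technical report on a current HPC research topic: cloud computing, CPU/GPU technology, virtualization
- Lab reports: building a cluster environment; parallel application programming with MPI, OpenMP, CUDA

Lecture 1: Overview of the Development of High-Performance Computing

Lecture Outline
- Application demands
- The development of computer architecture
- The core technology of HPC: parallel computing
- The importance of parallel programming

Application Demands

High Performance Computing: HPC in Research, and Industrial Demand and Significance
- Computing needs in basic research: physics, chemistry, biology, materials
- Needs in industry: banking, computer-aided design, pharmaceuticals, petroleum, meteorology, online services, information security

Traditional Scientific Research
- Difficult, e.g., building large wind tunnels
- Expensive, e.g., building prototypes
- Slow, e.g., waiting for climate change or the evolution of celestial bodies
- Dangerous, e.g., weapons development, drugs, atmospheric tests, power system analysis

Scientific Research Based on Computational Science
- Physical principles and numerical methods
- Theoretical analysis
- Experiment design

Challenging Computational Problems Span All Areas of Science and Engineering
- Science
  - Global climate modeling
  - Astrophysical modeling
  - Biology: genomics; protein folding; drug design
  - Computational chemistry
  - Computational material sciences and nanosciences
- Engineering
  - Crash simulation
  - Semiconductor design
  - Earthquake and structural modeling
  - Computational fluid dynamics (airplane design)
  - Combustion (engine design)
  - Oil field applications
- Business
  - Financial and economic modeling
  - Transaction processing, web services and search engines
- Defense
  - Nuclear weapons (test by simulations)
  - Cryptography

Units of High Performance Computing
- Computing capability
- Storage capacity

Global Climate Simulation
- Problem to compute: f(longitude, latitude, elevation, time) -> temperature, pressure, humidity, wind velocity
- Approach:
  - Discretize the domain, e.g., a measurement point every 10 km
  - Devise an algorithm to predict weather at time t+dt given t
- Uses:
  - Predict major events, e.g., El Niño
  - Use in setting air emissions standards
- Simulating atmospheric circulation requires solving the Navier-Stokes equations, at roughly 100 floating-point operations per grid point with a 1-minute time step.
- Computational requirements:
  - To keep pace with real time: 5 x 10^11 flops per minute = 8 Gflop/s
  - 7-day weather forecast computed in one day: 56 Gflop/s
  - 50-year climate prediction computed in one month: 4.8 Tflop/s
  - 50-year prediction computed in 12 hours: 288 Tflop/s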
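The throughput requirements listed for global climate simulation follow from simple scaling: the required rate is the real-time rate multiplied by how much faster than real time the model must run. The sketch below reproduces that arithmetic. The function and constant names are illustrative, and the 365-day year and 30-day month are assumptions not stated in the slides.

```python
# Back-of-envelope check of the climate-simulation throughput figures.
# All names here are illustrative; they do not come from the course material.

FLOPS_PER_SIMULATED_MINUTE = 5e11  # stated cost of advancing the model by one minute
FLOPS_PER_GRID_POINT = 100         # stated cost per grid point per 1-minute time step
BASELINE_GFLOPS = 8.0              # the slides' rounded real-time rate


def implied_grid_points() -> float:
    """Number of grid points implied by the two stated flop counts."""
    return FLOPS_PER_SIMULATED_MINUTE / FLOPS_PER_GRID_POINT


def real_time_gflops() -> float:
    """Unrounded rate (Gflop/s) needed to simulate one minute per minute."""
    return FLOPS_PER_SIMULATED_MINUTE / 60 / 1e9


def required_gflops(simulated_days: float, wall_clock_days: float) -> float:
    """Rate (Gflop/s) needed to simulate `simulated_days` in `wall_clock_days`."""
    speedup = simulated_days / wall_clock_days
    return BASELINE_GFLOPS * speedup


if __name__ == "__main__":
    days_in_50_years = 50 * 365  # assumption: 365-day years
    print(f"real time:            {real_time_gflops():.2f} Gflop/s")
    print(f"7 days in 1 day:      {required_gflops(7, 1):.0f} Gflop/s")
    print(f"50 years in 30 days:  {required_gflops(days_in_50_years, 30) / 1e3:.2f} Tflop/s")
    print(f"50 years in 12 hours: {required_gflops(days_in_50_years, 0.5) / 1e3:.0f} Tflop/s")
```

Under these assumptions the last two cases come out near 4.87 Tflop/s and 292 Tflop/s. The slides' 4.8 and 288 Tflop/s are consistent with rounding the monthly figure down to 4.8 and then multiplying by 60 (30 days / 12 hours).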
- Refining the grid resolution increases the computational cost by 8x to 16x.
- More accurate predictive models must jointly account for the atmosphere, oceans, glaciers, and land, plus geochemistry.
- Millennium-scale climate models cannot currently be computed effectively.

High-performance computing has become an essential tool for complex systems engineering.

HPC in Aviation
- High-end demand is concentrated in CAE:
  - Aerodynamic computation
  - Structural computation
  - Aeroelastic analysis
  - Multidisciplinary design optimization
  - Flight loads computation
  - Stealth design computation
  - Stability and control computation
  - Flight simulation
- Other HPC needs:
  - Digital assembly
  - Digital prototyping
- Main characteristics:
  - Computing capability vs. problem scale
  - Exploratory research vs. engineering application

The Ultimate Goal of CFD: Virtual Flight Testing
[Figure: virtual wind tunnel (CFD), design experience, wind tunnel testing, virtual flight testing]

Computing Devices / Users / Content
[Figure: Today vs. 2015. Source: IDF2012]

The Big Data Phenomenon
- “Data are becoming the new raw material of business: an economic input almost on a par with capital and labor” (The Economist, 2010)
- “Information will be the oil of the 21st century” (Gartner, 2010)
(Source: IDF2012)

2015 Cloud Vision
Coexistence of Opportunities and Challenges
(Source: IDF2012)

Trends to Exascale Performance
Roughly 10x performance every 4 years, which predicts that we'll hit exascale performance in 2018-19.
(Source: IDF2012)

The Development of Computer Architecture

Trends in Computer Architecture
- Architectural improvements turn technological innovation into processing performance.
- History of computer architecture: vacuum tubes, transistors, integrated circuits, large-scale integration.
- The Very Large Scale Integration (VLSI) era can be viewed as an exploration of parallel processing.
- Parallel processing is the core technique for improving computer processing performance.

Architecture Development: Exploring Approaches to Parallelism
The greatest trend in the VLSI generation is the increase in parallelism.
- 1970 to 1985: bit-level parallelism
  - 4-bit -> 8-bit -> 16-bit
  - Slows after 32-bit
  - Adoption of 64-bit now under way; 128-bit is far off (not a performance issue)
- Mid-1980s to mid-1990s: instruction-level parallelism
  - Pipelining and simple instruction sets, plus compiler advances (RISC)
  - On-chip caches and functional units => superscalar execution
  - Greater sophistication: out-of-order execution, speculation, and prediction, to deal with control transfer and latency problems
- Now: thread-level parallelism

Three Phases of VLSI
- Bit-level
- Instruction-level
- Thread-level

VLSI Technology Trends
- Intel announced that it has reached 1.7 billion transistors with the Itanium processor.
- Gigascale Integration (GSI) = 1 billion transistors per chip
http:/users.ece.gatech.edu/jeff/ece4420/technology.pdf

Uniprocessor Performance Growth
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ?%/year, 2002 to present

Processor Power
- The trend is no longer to raise clock frequency but to increase the number of CPUs per chip.
- The bottleneck is the maximum power an air-cooled chip can dissipate.

Recent Intel Processors
“We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry.” (Intel President Paul Otellini, IDF 2005)

Intel's Many-Core and Multi-Core
- Intel 80-core TeraScale Processor (Vangal et al. 2008)
- A single-precision solver developed for this chip ran at 1 TFLOP with only 97 watts.
(Source: Tim Mattson, Intel Labs)

Trends Are Putting Everything onto One Chip
- The future belongs to the heterogeneous, many-core SOC as the standard building block of computing.
- SOC = system on a chip
(Source: Tim Mattson, Intel Labs)

Trends in Cluster Systems

Large-Scale Computing Systems
- Franklin (NERSC-5): Cray XT4
  - 9,532 compute nodes; 38,128 cores
  - Each node has an AMD quad-core processor and 8 GB of memory
  - 25 Tflop/s on applications; 352 Tflop/s peak
- Clusters (105 Tflops total)
  - Carver: IBM iDataplex cluster
  - PDSF (HEP/NP): Linux cluster (1K cores)
  - Magellan cloud testbed: IBM iDataplex cluster
- Analytics
  - Euclid (512 GB shared memory)
  - Dirac GPU testbed (48 nodes)
- Hopper (NERSC-6): Cray XE6
  - Phase 1: Cray XT5, 668 nodes, 5,344 cores
  - Phase 2: 1 Pflop/s peak (2 sockets/node, 12 cores/socket)
- Tianhe-I(A)
  - 6,144 compute nodes; 24,576 cores
  - 2,560 AMD Radeon HD 4870*2 GPUs
  - 98 TB memory in total
  - Rpeak: 4.700 Pflop/s; Rmax: 2.566 Pflop/s
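
The peak and sustained figures quoted for these systems can be compared with a little arithmetic. The sketch below computes the fraction of peak performance actually achieved and the cores per node, using only numbers listed above. The helper names are hypothetical and chosen for illustration.

```python
# Illustrative comparison of sustained vs. peak performance for the listed systems.
# Helper names are hypothetical; all input figures are taken from the list above.


def efficiency(sustained: float, peak: float) -> float:
    """Fraction of peak performance achieved (both values in the same unit)."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return sustained / peak


def cores_per_node(cores: int, nodes: int) -> float:
    """Average number of cores per compute node."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return cores / nodes


if __name__ == "__main__":
    # Franklin: 25 Tflop/s on applications vs. 352 Tflop/s peak.
    print(f"Franklin application efficiency: {efficiency(25.0, 352.0):.1%}")
    # Tianhe-I(A): Rmax 2.566 Pflop/s vs. Rpeak 4.700 Pflop/s.
    print(f"Tianhe-I(A) Rmax/Rpeak:          {efficiency(2.566, 4.700):.1%}")
    print(f"Franklin cores per node:         {cores_per_node(38128, 9532):.0f}")
    print(f"Hopper phase 1 cores per node:   {cores_per_node(5344, 668):.0f}")
```

The two efficiency ratios are not directly comparable: Franklin's sustained figure is quoted for applications, while Rmax is a benchmark result.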