# Intel<sup>®</sup>: Accelerating the Path to Exascale

**Kirk Skaugen**, Vice President, Intel Architecture Group; General Manager, Data Center Group



# An Insatiable Need For Computing

### Weather Prediction



Exascale Problems Cannot Be Solved Using the Computing Power Available Today



# Exascale Answers Mankind's Challenges In...

### Weather / Climate



### Healthcare



### New Forms of Energy





# Intel Commitment To Exascale

Efficient Performance

Programming Parallelism

Extreme Scalability



Intel Exascale Commitment: >100X Performance Of Today At Only 2X The Power Of Today's #1 System, Scaling <u>Today's</u> Software Model
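Read as arithmetic, the commitment implies a target for system-level energy efficiency. The following is a rough sketch, assuming performance and power are both measured against the same baseline system; the symbols $P$ (performance) and $W$ (power) are introduced here for illustration and do not appear on the slide.

$$
\frac{P_{\text{exa}} / W_{\text{exa}}}{P_{\text{today}} / W_{\text{today}}} > \frac{100}{2} = 50
$$

Under that assumption, performance per watt would need to improve by more than 50X.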



# Exascale Requirements

### Petascale Machine of 2010: TFLOP of Compute



Visceral Focus on System Power Efficiency Improvement



# Scaling Programmability



One Programming Model Democratizes Usage ...Avoid Costly Detours



# Process Technology Leadership



Source: Intel. \*Compared to Intel 32nm Technology

# Intel Labs & HPC

### Strong Research Partnerships



### Government





Memory Stacking & Technologies



#### Programmability



### World Class Research in HPC

Silicon Photonics



Interconnect Technologies



Security



**Power Reduction** 



### Delivering Breakthrough Technologies to Fuel Innovation



Continuing The Journey: Next Intel<sup>®</sup> Xeon<sup>®</sup> Processor Codenamed Sandy Bridge-EP



## **Growing Performance**

- Up to 8 cores per socket
- 2X FLOPS with Intel<sup>®</sup> Advanced Vector Extensions

## Powerful. Intelligent.

## **Efficient I/O**

Integrated PCIe reduces latency and power

The Foundation of the Innovation in Science and Technology



# Highly Parallel Performance Intel® Many Integrated Core (Intel® MIC) Architecture



A Step Forward In Dealing With Efficient Performance & Programmability









## Evaluating the Intel MIC Architecture

Arndt Bode, Leibniz Supercomputing Centre, Germany, with input from Iris Christadler, Alexander Heinecke and Volker Weinberg

June 2011, ISC, Hamburg

Technische Universität München





- Programming models are the key to harness the computational power of massively parallel devices.
- Obviously, Intel has realized this trend and substantially supports open standards and invests in innovative programming models.
- LRZ and TUM have been using Intel hardware and software for many years and know the tool chain by heart.
- We expect: A hardware product that delivers good performance (and energy efficiency) without losing programmability.

# Advantages of the MIC Architecture



- Is a standard x86 architecture!
- Allows many different parallel programming models like OpenMP, MPI and Intel Cilk!
- Offers standard math-libraries like Intel MKL!
- Supports the whole Intel tool chain, e.g. Compiler & Debugger!

### Writing MIC-accelerated code with minimal effort and great performance



Workloads under Investigation



- Euroben Kernels (7 dwarfs of HPC)
- Data Mining
- TifaMMy Matrix Operations (Demo here at ISC'11!)
- Further Linear Algebra and Simulation Codes

# **Euroben Kernels**

 Selected micro-benchmarks used in PRACE for the evaluation of accelerator hardware & new languages: <u>http://www.prace-project.eu/documents/public-deliverables/d6-6.pdf</u>



### mod2am: Matrix Multiplication

Performance evaluation of mod2am on KNF with 30 cores @1050 MHz using Intel's Offload Compiler, single precision, data transfer times excluded
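Throughput figures for a matrix-multiplication kernel are usually derived from the operation count and the measured kernel time. The sketch below assumes the conventional count of 2·m·n·k floating-point operations for a dense product and, as on the slide, a time that excludes data transfers. The function names are hypothetical, and the `flops_per_cycle_per_core` value used in the example (32) is an assumption for illustration, not a figure from the slides.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Operation count for C = A * B with A (m x k) and B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def achieved_gflops(m: int, n: int, k: int, kernel_seconds: float) -> float:
    """Achieved GFLOPS given the kernel time only (transfers excluded)."""
    if kernel_seconds <= 0:
        raise ValueError("kernel_seconds must be positive")
    return gemm_flops(m, n, k) / kernel_seconds / 1e9


def peak_gflops(cores: int, mhz: float, flops_per_cycle_per_core: int) -> float:
    """Theoretical peak: cores * clock * flops per cycle.
    MHz times flops per cycle gives MFLOPS, so divide by 1000 for GFLOPS."""
    return cores * mhz * flops_per_cycle_per_core / 1000


if __name__ == "__main__":
    # Core count and clock are taken from the slide; 32 flops/cycle/core is assumed.
    peak = peak_gflops(cores=30, mhz=1050, flops_per_cycle_per_core=32)
    measured = achieved_gflops(1000, 1000, 1000, kernel_seconds=2.0)
    print(f"peak={peak} GFLOPS, measured={measured} GFLOPS")
```

Dividing the achieved value by the peak gives the fraction-of-peak figure that accelerator evaluations typically report.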

# Data Mining with Adaptive Sparse Grids

- Machine learning algorithm
- Learning function from a training dataset
- Important workload for classification and regression of huge datasets
- MIC-Execution: Straightforward
  - First version within a few hours
  - Optimized version took 2 days

Evaluating the Intel MIC Architecture, Prof. A. Bode, LRZ, June 2011



# TifaMMy – Idea and Application

- TifaMMy: self-adaptive and cache-oblivious framework for matrix operations, optimized for fat x86 cores
- This is done by nested recursions and vectorized kernels
  - On MIC only the kernels were changed; MIC's x86 cores are able to tackle nested recursions!
  - The parallelization scheme employing OpenMP can be reused
- With SSE kernels already in place, bringing the code to MIC is nearly free
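The idea of a cache-oblivious, recursive matrix multiplication can be illustrated with a generic divide-and-conquer sketch. This is not TifaMMy's actual scheme or code; it is the textbook approach of halving the largest dimension until a block is small enough for a simple kernel. The names `matmul` and `_multiply_add` and the `cutoff` parameter are hypothetical.

```python
def _multiply_add(A, B, C, i0, i1, j0, j1, k0, k1, cutoff):
    """Accumulate C[i0:i1, j0:j1] += A[i0:i1, k0:k1] * B[k0:k1, j0:j1]."""
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= cutoff:
        # Base case: small block handled by a plain kernel. In an optimized
        # framework this is where a vectorized kernel would be called.
        for i in range(i0, i1):
            for p in range(k0, k1):
                a = A[i][p]
                for j in range(j0, j1):
                    C[i][j] += a * B[p][j]
        return
    if m >= n and m >= k:
        # Split rows of A and C: the two halves are independent.
        mid = i0 + m // 2
        _multiply_add(A, B, C, i0, mid, j0, j1, k0, k1, cutoff)
        _multiply_add(A, B, C, mid, i1, j0, j1, k0, k1, cutoff)
    elif n >= k:
        # Split columns of B and C: the two halves are independent.
        mid = j0 + n // 2
        _multiply_add(A, B, C, i0, i1, j0, mid, k0, k1, cutoff)
        _multiply_add(A, B, C, i0, i1, mid, j1, k0, k1, cutoff)
    else:
        # Split the inner dimension: both halves update the same block of C.
        mid = k0 + k // 2
        _multiply_add(A, B, C, i0, i1, j0, j1, k0, mid, cutoff)
        _multiply_add(A, B, C, i0, i1, j0, j1, mid, k1, cutoff)


def matmul(A, B, cutoff=16):
    """Return A * B for row-major lists of lists, using recursive blocking."""
    if cutoff < 1:
        raise ValueError("cutoff must be at least 1")
    m, k = len(A), len(B)
    n = len(B[0]) if B else 0
    C = [[0.0] * n for _ in range(m)]
    _multiply_add(A, B, C, 0, m, 0, n, 0, k, cutoff)
    return C
```

Because the recursion never refers to a cache size, the same code produces blocks that fit every level of the memory hierarchy. The row and column splits yield independent sub-problems, which is where a task-parallel scheme could plausibly be attached.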




### TifaMMy – Performance Matrix Multiplication

Figure: GFLOPS versus matrix size (series: GFLOPS, Max). Test workload: TifaMMy, executed on KNF with 36 cores @1200 MHz.

# Advantages of the MIC Architecture



- Is a standard x86 architecture!
- Allows many different parallel programming models like OpenMP, MPI and Intel Cilk!
- Offers standard math-libraries like Intel MKL!
- Supports the whole Intel tool chain, e.g. Compiler & Debugger!

Pre-release MIC-accelerated code for a typical scientific workload (e.g. Data Mining, TifaMMy) can reach up to 50% of peak performance! Visit demo here at ISC'11!





"SGI understands the significance of inter-processor communications, power, density and usability when architecting for exascale. Intel has made the leap towards exaflop computing with the introduction of Intel<sup>®</sup> Many Integrated Core (MIC) architecture. Future Intel<sup>®</sup> MIC products will satisfy all four of these priorities, especially with their expected ten times increase in compute density coupled with their familiar X86 programming environment."

## Dr. Eng Lim Goh, SGI CTO



## Intel<sup>®</sup> MIC Architecture : Needed for Exascale



- 125x compute power
- 25x : Moore's Law
- 5x : remains
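The three figures on this slide appear to be related multiplicatively. As a sketch, assuming the 25x is the share attributed to Moore's Law and the 5x is what must come from elsewhere:

$$
125 = 25 \times 5
$$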



## Intel<sup>®</sup> MIC Architecture : Familiar x86 Programming





# MIC On Track: ISC Demonstrations<sup>1</sup>



### Hybrid LU Factorization

- Leverages compute power of both Intel<sup>®</sup> Xeon<sup>®</sup> CPUs and Intel<sup>®</sup> MIC
- Delivers optimal performance by dynamically balancing large and small matrix computations between Intel<sup>®</sup> Xeon<sup>®</sup> and Intel<sup>®</sup> MIC



### Hybrid Computing – SGEMM with Intel® MKL

- High performing SGEMM with just 18 lines of code – common between Intel<sup>®</sup> Xeon<sup>®</sup> CPUs and Knights Ferry
- Uses Intel<sup>®</sup> MKL in current version of Alpha stack/tools on Knights Ferry
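For readers unfamiliar with the routine, SGEMM is the single-precision BLAS operation C ← αAB + βC. The sketch below is a plain reference implementation of that definition, meant only to show what the operation computes. It is not the 18-line demo code and does not call Intel MKL; the function name `gemm_reference` is hypothetical, and Python floats are double precision rather than single.

```python
def gemm_reference(alpha, A, B, beta, C):
    """Return alpha * (A * B) + beta * C for row-major lists of lists.

    A is m x k, B is k x n, C is m x n. Transposition options of the real
    BLAS routine are left out to keep the sketch short.
    """
    m, k = len(A), len(B)
    n = len(B[0]) if B else 0
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions of A and B do not match")
    if len(C) != m or any(len(row) != n for row in C):
        raise ValueError("C must be m x n")
    result = []
    for i in range(m):
        out_row = []
        for j in range(n):
            # Inner product of row i of A with column j of B.
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            out_row.append(alpha * acc + beta * C[i][j])
        result.append(out_row)
    return result
```

A tuned library performs the same computation with blocking, vectorization and threading, which is why a short calling program can still reach high throughput.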



### 7.4 TFLOP SGEMM in a node

Simultaneous execution of SGEMM on 8 Knights Ferry cards to deliver 7.4 TFLOPS in 1 4U server
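Dividing the node-level figure by the number of cards gives the average contribution per card, assuming the load is spread evenly across all eight:

$$
\frac{7.4\ \text{TFLOPS}}{8\ \text{cards}} = 0.925\ \text{TFLOPS per card}
$$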

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel measured results as of March 2011. See backup for details. For more information go to http://www.intel.com/performance

<sup>1</sup> Refer to backup material for system configurations

Up to 772

1+ TFLOP

7.4 TFLOPS

# **Optimized MIC Software Development Platform Performance**

# MIC On Track: ISC Demonstrations #2<sup>1</sup>

| Partner | Demonstration | Description |
|---|---|---|
| Forschungszentrum Juelich | SMMP Protein Folding | Simulates the folding process of proteins to reach their final shape after they are produced by a cell |
| KISTI | Molecular Dynamics | Empirical-potential molecular dynamics, widely used for simulating nanomaterials including carbon nanotube, graphene, fullerene, and silicon surfaces |
| LRZ | TifaMMy Matrix Multiplication | Cache-oblivious implementation of matrix-matrix multiply which uses a recursive scheme to partition input data for computation and parallelization |
| CERN | Core Scaling of Intel® MIC Architecture | Benchmark kernel extracted from the CBM/ALICE HLT software development for collider experiments. It estimates real trajectories from imprecise measurements |

<sup>1</sup> Refer to backup material for system configurations

Programmability For HPC Applications



# MIC Partners at International Supercomputing 2011



\*Other names and brands may be claimed as the property of others.



# How Intel<sup>®</sup> Delivers its Commitments:

Intel Exascale Commitment: >100X Performance Of Today At Only 2X The Power Of Today's #1 System, Scaling Today's Software Model

### Committed roadmap now and in the future

Flexible, open and scalable programming models

Collaborating with others to ensure the exascale future







## System Configuration 7 TFLOPS SGEMM in a node

- Colfax Model: CXT8000 Server w/Intel<sup>®</sup> 5520 chipset and 4 PLX PEX8647 Gen 2 PCIe switches
- Intel Alpha level software (Intel<sup>®</sup> Compilers, drivers etc.)

| HW specifications |  |
|---|---|
| 8 x KNF | D0 Si @1.2GHz, 2GB GDDR5@3.6GT/s |
| Host | Colfax CXT8000: 2 socket platform with 2 Intel<sup>®</sup> Xeon<sup>®</sup> processor X5690 (3.46GHz, 6 cores, 12MB L3 cache) with 24GB DDR3 @1333MHz, Dual Intel<sup>®</sup> 5520 IOH, OS RHEL 6.0 |

### **KNF SW Stack**

Larrabee kernel driver ver. 1.6.197

Flash Image/uOS: 1.0.0.1137/1.0.0.1137-EXT-HPC

Offload compiler (w/data xfer): Composer XE for MIC 0.043



## System Configuration Hybrid Computing with Intel<sup>®</sup> MKL

- Knights Ferry Software Development Platform (Shady Cove)

- Intel Alpha level software (Intel<sup>®</sup> Compilers, Intel<sup>®</sup> MKL, drivers etc.)

| HW specifications |  | SW specifications |  |
|---|---|---|---|
| 1 x KNF | D0 Si @1.2GHz, 2GB GDDR5@3.6GT/s | MKL4KNF | MKL KNF.b2 build 20110518 |
| Host | Shady Cove 2 socket platform with 2 Intel® Xeon® processor X5680 (3.33GHz, 6 cores, 12MB L3 cache) with 24GB DDR3@1333MHz, single Intel® 5520 IOH, OS: RHEL 6.0 |  | 10.3.3 |

### **KNF SW Stack**

Larrabee kernel driver ver. 1.6.197

Flash Image/uOS: 1.0.0.1137/1.0.0.1137-EXT-HPC

Offload compiler (w/data xfer): Intel® Composer XE for MIC 0.043

Native compiler (w/o data xfer): Version Alpha Build 20110518



## System Configuration Hybrid Computing LU Factorization

- Knights Ferry Software Development Platform (Shady Cove)

- Intel Alpha level software (Intel<sup>®</sup> Compilers, drivers etc.)

| HW specifications |  |
|---|---|
| 1 x KNF | D0 Si @1.2GHz, 2GB GDDR5@3.6GT/s |
| Host | Shady Cove 2 socket platform with 2 Intel® Xeon® processor X5680 (3.33GHz, 6 cores, 12MB L3 Cache) with 24GB DDR3@1333MHz, single Intel® 5520 IOH |

### **KNF SW Stack**

Larrabee kernel driver ver. 1.6.197

Flash Image/uOS: 1.0.0.1137/1.0.0.1137-EXT-HPC

Offload compiler (w/data xfer): Intel® Composer XE for MIC 0.043

Native compiler (w/o data xfer): Version Alpha Build 20110518



## System Configuration KISTI Molecular Dynamics

- Dell Precision Workstation
- Intel Alpha level software (Intel<sup>®</sup> Compilers, drivers etc.)

| HW specifications |  |
|---|---|
| 1 x KNF | C0 Si @1.2GHz, 2GB GDDR5@3.0GT/s |
| Host | Dell Precision Workstation 1 socket platform with 1 Intel® Xeon® processor X5620 (4 cores, 2.4GHz, 12MB L3 cache) with 24GB DDR3@1333MHz, single Intel® 5520 IOH, OS: RHEL 6.0 |

### **KNF SW Stack**

Larrabee kernel driver ver. 1.6.197

Flash Image/uOS: …



## System Configuration CERN openlab: Core Scaling of Intel<sup>®</sup> MIC Architecture

- SGI H4002 System
- Intel Alpha level software (Intel<sup>®</sup> Compilers, drivers etc.)

| HW specifications |  |
|---|---|
| 1 x KNF | C0 Si @1.2GHz, 2GB GDDR5@3.0GT/s |
| Host | SGI H4002 2 socket platform with 2 Intel® Xeon® processor X5690 (6 cores, 3.46GHz, 12MB L3 cache) with 24GB DDR3@1333MHz, single Intel® 5520 IOH, OS: RHEL 6.0 |

### **KNF SW Stack**

Larrabee kernel driver ver. 1.6.197

Flash Image/uOS: 1.0.0.1137/1.0.0.1137-EXT-HPC

Offload compiler (w/data xfer): Intel® Composer XE for MIC 0.043



## System Configuration LRZ: TifaMMy Matrix Multiplication

- Knights Ferry Software Development Platform (Shady Cove)

- Intel Alpha level software (Intel<sup>®</sup> Compilers, drivers etc.)

| HW specifications |  |
|---|---|
| 1 x KNF | C0 Si @1.2GHz, 2GB GDDR5@3.0GT/s |
| Host | Shady Cove 2 socket platform with 2 Intel® Xeon® processor X5680 (3.33GHz, 6 cores, 12MB L3 Cache) with 24GB DDR3@1333MHz, single Intel® 5520 IOH, OS: RHEL 6.0 |

### **KNF SW Stack**

Larrabee kernel driver ver. 1.6.197

Flash Image/uOS: 1.0.0.1137/1.0.0.1137-EXT-HPC

Offload compiler (w/data xfer): Intel® Composer XE for MIC 0.043



## System Configuration FZ Jülich: SMMP Protein Folding

- Knights Ferry Software Development Platform (Shady Cove)

- Intel Alpha level software (Intel<sup>®</sup> Compilers, drivers etc.)

| HW specifications |  |
|---|---|
| 1 x KNF | C0 Si @1.2GHz, 2GB GDDR5@3.0GT/s |
| Host | Shady Cove 2 socket platform with 2 Intel® Xeon® processor X5680 (3.33GHz, 6 cores, 12MB L3 Cache) with 24GB DDR3@1333MHz, single Intel® 5520 IOH, OS: RHEL 6.0 |

### **KNF SW Stack**

Larrabee kernel driver ver. 1.6.197

Flash Image/uOS: 1.0.0.1137/1.0.0.1137-EXT-HPC

Offload compiler (w/data xfer): Intel® Composer XE for MIC 0.043

