# A closer look at Intel Xeon and Xeon Phi (KNL) for HPC developers

Janko Strassburg

HPC Knowledge Meeting'16 - HPCKP, April 20<sup>th</sup> 2016, Barcelona

### Today's Intel solutions for HPC

The multi- and many-core era

| Multi-core (Xeon) | Many integrated core (MIC) |
|---|---|
| C/C++/Fortran, OMP/MPI/Cilk+/TBB | C/C++/Fortran, OMP/MPI/Cilk+/TBB |
| Bootable, native execution model | PCIe coprocessor, native and offload execution models |
| Up to 18 cores, 3 GHz, 36 threads | Up to 61 cores, 1.2 GHz, 244 threads |
| Up to 768 GB, 68 GB/s, 432 GFLOP/s DP | Up to 16 GB, 352 GB/s, 1.2 TFLOP/s DP |
| 256-bit SIMD, FMA, gather (AVX2) | <b>512-bit SIMD</b>, FMA, gather/scatter, EMU (IMCI) |
| Targeted at general purpose applications<br>Single thread performance (ILP)<br>Memory capacity | Targeted at highly parallel applications<br>High parallelism (DLP, TLP)<br>High memory bandwidth |

## Intel® Xeon® processor architecture

### Intel® Xeon® processors and platforms

- E5-2600 v3 (Haswell EP)
- E5-2600 v4 (Broadwell EP)

#### Haswell execution unit overview

### Fused Multiply and Add (FMA) instruction

Example: polynomial evaluation

- Separate multiply and add: 16-cycle latency, 2-cycle throughput
- FMA: 10-cycle latency, 1-cycle throughput

| Microarchitecture | Instruction set | SP FLOPs per cycle | DP FLOPs per cycle |
|---|---|---|---|
| Nehalem | SSE (128-bits) | 8 | 4 |
| Sandy Bridge | AVX (256-bits) | 16 | 8 |
| Haswell | AVX2 (FMA) (256-bits) | 32 | 16 |

**2x** peak FLOPs/cycle (throughput)

| Latency (clocks) | Xeon E5 v2 | Xeon E5 v3 | Ratio (lower is better) |
|---|---|---|---|
| MulPS, PD | 5 | 5 | |
| AddPS, PD | 3 | 3 | |
| Mul+Add/FMA | 8 | 5 | 0.625 |

More than 37% reduced latency (5-cycle FMA latency, same as an FP multiply)

Improves accuracy and performance for a commonly used class of algorithms

### Broadwell: 5<sup>th</sup> generation Intel® Core™ architecture

Microarchitecture changes

#### FP instructions performance improvements

- Decreased latency and increased throughput for most divider (radix-1024) uops
- Pseudo-double bandwidth for scalar divider uops
- Vector multiply latency decrease (from 5 to 3 cycles)

#### STLB improvements

- Native, 16-entry 1G STLB array
- Increased size of STLB (from 1k to 1.5k entries)
- Enabled two simultaneous page miss walks

#### Other ISA performance improvements

- ADC, CMOV: 1 uop flow
- PCLMULQDQ: 2 uop/7 cycles to 1 uop/5 cycles
- VCVTPS2PH (mem form): 4 uops to 3 uops

(Figures: divide latency in cycles; divide throughput in cycles to start next.)

### Skylake: 6<sup>th</sup> generation Intel® Core™ architecture

Dedicated server and client IP configurations

#### Improved microarchitecture

- Higher capacity front-end (up to 6 instr/cycle)
- Improved branch predictor
- Deeper Out-of-Order buffers
- More execution units, shorter latencies
- Deeper store, fill, and write-back buffers
- Smarter prefetchers
- Improved page miss handling
- Better L2 cache miss bandwidth
- Improved Hyper-Threading
- Performance/watt enhancements

#### New instructions supported

- Software Guard Extensions (SGX)
- Memory Protection Extensions (MPX)
- AVX-512 (Xeon versions only)

### Intel® Xeon® Processor E5-2600 v4 Product Family Overview

#### New Features:

- Broadwell microarchitecture
- Built on 14nm process technology
- Socket compatible<sup>♦</sup> replacement/upgrade on Grantley-EP platforms

#### New Performance Technologies:

- Optimized Intel® AVX Turbo mode
- Intel TSX instructions<sup>^</sup>

#### Other Enhancements:

- Virtualization speedup
- Orchestration control
- Security improvements

| Features | Xeon E5-2600 v3 (Haswell-EP) | Xeon E5-2600 v4 (Broadwell-EP) |
|---|---|---|
| Cores Per Socket | Up to 18 | Up to 22 |
| Threads Per Socket | Up to 36 threads | Up to 44 threads |
| Last-level Cache (LLC) | Up to 45 MB | Up to 55 MB |
| QPI Speed (GT/s) | 2x QPI 1.1 channels 6.4, 8.0, 9.6 GT/s | Same as v3 |
| PCIe\* Lanes / Speed (GT/s) | 40 / 10 / PCIe\* 3.0 (2.5, 5, 8 GT/s) | Same as v3 |
| Memory Population | 4 channels of up to 3 RDIMMs or 3 LRDIMMs | + 3DS LRDIMM<sup>†</sup> |
| Memory RAS | ECC, Patrol Scrubbing, Demand Scrubbing, Sparing, Mirroring, Lockstep Mode, x4/x8 SDDC | + DDR4 Write CRC |
| Max Memory Speed (MT/s) | Up to 2133 | Up to 2400 |
| TDP (W) | 160 (Workstation only), 145, 135, 120, 105, 90, 85, 65, 55 | Same as v3 |

<sup>♦</sup> Requires BIOS and firmware update; <sup>^</sup> not available broadly on E5-2600 v3; <sup>†</sup> depends on market availability

### Up to 1.27x Average Generational Gains on Servers using Intel® Xeon® Processor E5-2600 v4 Product Family

(Figure: normalized generational performance summary, based on published industry benchmark results.)

### High Performance Computing Performance

Intel® Xeon® E5-2699 v4 (22-core 2.2GHz) vs. Intel® Xeon® E5-2699 v3 (18-core 2.3GHz)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Results based on Intel® internal measurements as of February 29, 2016.
Configurations: see slide 10

### Home Snoop w/DIR+OSB Provides up to 15% more Bandwidth vs Early Snoop on E5-26xx v3

Source as of 21 July 2015: Intel internal measurements on platform with two E5-26xx v4 (22C, CLR:2.8GHz), Turbo enabled, 4x32GB 1DPC DDR4-2400, RHEL 7.0. Platform with two E5-2699 v3, Turbo enabled, 4x32GB DDR4-2133, RHEL 7.0.

\*Other names and brands may be claimed as the property of others.

### Intel® Turbo Boost Technology 2.0 and Intel® AVX\*

- Amount of turbo frequency achieved depends on:
  - Type of workload, number of active cores, estimated current & power consumption, and processor temperature
- Due to workload dependency, separate AVX base & turbo frequencies will be defined for Intel® Xeon® processors starting with E5 v3 product family

#### Additional Resources:

- Whitepaper: Optimize Performance with Intel AVX
- Intel® Xeon® Turbo Boost Opportunistic Frequency Upside
- Using Intel AVX to Achieve Maximum Performance on Intel Xeon Processors

\*Intel® AVX refers to Intel® AVX, Intel® AVX2 or Intel® AVX-512

### Per-Core AVX Max Turbo Optimization on Intel® Xeon® processor E5-2600 v4 Product Family

(Figure panels: non-AVX workloads, AVX workloads, mixed workloads.)

Non-AVX workloads running on all cores have a max turbo frequency of P0n.
AVX workloads have a lower max turbo frequency of "P0n AVX".

On E5-2600 v3 family, workloads with a mix of cores running AVX and non-AVX experienced lower max turbo frequency on all cores.

On E5-2600 v4 family, cores running AVX do not automatically decrease the max turbo frequency of other cores in the socket.

New manufacturing and algorithmic techniques provide higher potential turbo frequencies for improved performance in systems with heterogeneous workloads.

# Intel® Xeon Phi™ (co)processor architecture

Intel® Many Integrated Core architecture (Intel® MIC)

### Intel® Xeon Phi™ architecture family

| Intel® Xeon Phi™ coprocessor product family <b>"Knights Corner"</b> | Intel® Xeon Phi™ coprocessor product family <b>"Knights Landing"</b> | Upcoming generation of the Intel® MIC architecture <b>"Knights Hill"</b> |
|---|---|---|
| 2013 | 2H'2015 | 2017? |
| 22 nm process | 14 nm process | 10 nm process |
| 1 TeraFLOP DP peak | 3+ TeraFLOP DP peak | ? |
| 57-61 cores<br>In-order core architecture<br>1 Vector Unit per core | 72 cores (36 tiles)<br>Out-of-order architecture based on Intel® Atom™ core<br>2 Vector Units per core<br>Up to 3x single thread performance w.r.t. Knights Corner | ? |
| 6-16 GB GDDR5 memory | On package, 8-16 GB high bandwidth memory (HBM) with flexible modes: cache, flat, hybrid<br>Up to 768 GB DDR4 main memory | ? |
| Intel® Initial Many Core Instructions (IMCI) | Intel® Advanced Vector Extensions (AVX-512)<br>Binary compatible with AVX2 | ? |
| PCIe coprocessor | Stand alone processor and PCIe coprocessor versions | ? |
| Intel® True Scale fabric | Intel® Omni-Path™ fabric (integrated in some models) | 2nd generation Intel® Omni-Path™ fabric |

### Intel® Xeon Phi™ coprocessor product lineup

| Family | Specifications | Product name |
|---|---|---|
| <b>7 Family</b><br>Highest performance, more memory<br>Performance leadership | 61 cores<br>16GB GDDR5<br>352 GB/s<br>> 1.2TF DP<br>270-300W TDP | 7120P (Q2'13)+<br>7120X (Q2'13)+<br>7120D (Q1'14)+<br>7120A (Q2'14)+ |
| <b>5 Family</b><br>Optimized for high density environments<br>Performance/watt leadership | 60 cores<br>8GB GDDR5<br>320-352 GB/s<br>> 1TF DP<br>225-245W TDP | 5110P (Q4'12)<br>5120D (Q2'13) |
| <b>3 Family</b><br>Outstanding parallel computing solution<br>Performance/$ leadership | 57 cores<br>6-8GB GDDR5<br>240-320 GB/s<br>> 1TF DP<br>270-300W TDP | 3120A (Q2'13)<br>3120P (Q2'13)<br>31S1P (Q2'13)* |

(+) Special offer with a free 12-month trial of Intel® Parallel Studio XE Cluster Edition, until September 30, 2016

### Intel® Xeon Phi™ platform architecture

(c) 2013 Jim Jeffers and James Reinders, used with permission.

(Figure: 1 to 8 Intel Xeon Phi coprocessors per host.)

Each coprocessor is connected to one host through the PCIe bus

- PCIe Gen 2 (client) x16
- Between 6-14 GB/s (relatively slow)
- Up to 8 coprocessors per host
- Inter-node coprocessor communication through Ethernet or InfiniBand
  - InfiniBand allows PCIe peer-to-peer interconnect without host intervention

Each coprocessor can be accessed as a network node

- It has its own IP address
- Runs a special uLinux OS (BusyBox)
- Intel® Manycore Platform Software Stack (MPSS)

### Intel® Xeon Phi™ uncore architecture

(c) 2013 Jim Jeffers and James Reinders, used with permission.
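The 352 GB/s peak bandwidth quoted in the lineup above can be reproduced from the GDDR5 figures given on this slide (8 memory controllers, 2 32-bit channels each, up to 5.5 GT/s per channel). The following is a minimal sketch of that arithmetic; the function and parameter names are illustrative assumptions and not part of any Intel tool.

```c
#include <assert.h>

/* Theoretical peak bandwidth in GB/s for a memory subsystem.
 * All names here are hypothetical, chosen only for this sketch. */
double peak_bandwidth_gbs(int controllers, int channels_per_controller,
                          int channel_width_bits, double gigatransfers_per_s)
{
    assert(controllers > 0 && channels_per_controller > 0);
    assert(channel_width_bits > 0 && gigatransfers_per_s > 0.0);

    /* Bytes moved across all channels in a single transfer. */
    double bytes_per_transfer =
        controllers * channels_per_controller * (channel_width_bits / 8.0);

    /* bytes/transfer * 10^9 transfers/s = GB/s */
    return bytes_per_transfer * gigatransfers_per_s;
}

/* Fraction of the theoretical peak that a measured bandwidth reaches. */
double sustained_fraction(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

With the Knights Corner values, `peak_bandwidth_gbs(8, 2, 32, 5.5)` evaluates to 8 x 2 x 4 B x 5.5 GT/s = 352 GB/s. The practical peak of about 150-180 GB/s quoted on this slide is therefore roughly 43-51% of the theoretical figure.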
#### High bandwidth interconnect

- Bidirectional ring topology

#### Fully cache-coherent SMP on-a-chip

- Distributed global tag directory (TD)
- About 31 MB of "L2 cloud"
- Remote L2 access latency: >100 cycles

#### 8-16 GB GDDR5 main memory (ECC)

- 8 memory controllers (MC)
- Access latency: >300 cycles
- 2 GDDR5 32-bit channels per MC
- Up to 5.5 GT/s per channel
- 352 GB/s max. theoretical bandwidth
- Practical peak about 150-180 GB/s

ECC on GDDR5/L2 for reliability

### Knights Landing: 2<sup>nd</sup> generation Intel® Xeon Phi™

#### Performance

- 3+ TeraFLOPS of double-precision peak theoretical performance per single socket node
- 3x single-thread performance compared to Knights Corner
- Most of today's parallel optimizations carry forward to KNL simply by recompiling

#### Integration

- Intel® Omni-Path™ fabric integration
- High-performance on-package memory (MCDRAM)
  - Over 5x STREAM vs. DDR4 (over ~400 GB/s vs ~90 GB/s)
  - Up to 16GB at launch
  - NUMA support
  - Over 5x energy efficiency vs. GDDR5
  - Over 3x density vs. GDDR5
  - In partnership with Micron Technology
  - Flexible memory modes including cache and flat

#### Microarchitecture

- Over 8 billion transistors per die based on Intel's 14 nanometer manufacturing technology
- Binary compatible with Intel® Xeon® Processors with support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
- 72 cores in a 2D mesh architecture
  - 2 cores per tile with 2 VPUs per core
  - 1MB L2 cache shared between 2 cores in a tile (cache-coherent)
- Cores based on Intel® Atom™ (Silvermont) microarchitecture with many HPC enhancements
  - 4 threads/core
  - 2x out-of-order buffer depth
  - Gather/scatter in hardware
  - Advanced branch prediction
  - High cache bandwidth
  - 32KB Icache, Dcache
  - 2 x 64B load ports in Dcache
  - 46/48 physical/virtual address bits
- Multiple NUMA domain support per socket

#### Server processor

- Standalone bootable processor (running host OS) and PCIe coprocessor
- Platform memory: up to 384GB DDR4 using 6 channels
- Reliability ("Intel server-class reliability")
- Power efficiency (over 25% better than discrete coprocessor) → over 10 GF/W
- Density (3+ KNL with fabric in 1U)
- Up to 36 lanes PCIe\* Gen 3.0

#### **Availability**

- First commercial HPC systems in 2H'15
- Knights Corner to Knights Landing upgrade program available today
- Intel Adams Pass board (1U half-width) is custom designed for Knights Landing (KNL) and will be available to system integrators for KNL launch; the board is OCP Open Rack 1.0 compliant, features 6 ch native DDR4 (1866/2133/2400MHz) and 36 lanes of integrated PCIe\* Gen 3 I/O

### Knights Landing platform overview

#### Single socket node

- 36 tiles connected by coherent 2D mesh
- Every tile is 2 OoO cores + 2 512-bit VPUs per core + 1 MB L2

#### Memory

- MCDRAM, 16 GB on-package; high BW
- DDR4, 6 channels @ 2400, up to 384GB

#### I/O & Fabric

- 36 lanes PCIe Gen3
- 4 lanes of DMI for chipset
- On-package Omni-Path fabric

### Intel® Knights Landing die

### Knights Landing core architecture

#### OoO core w/ 4 SMT threads

- 2-wide decode/rename/retire
- Up to 6-wide at execution
- Int and FP RS OoO
- 2 AVX-512 VPUs

#### Caches/TLBs

- 64-byte Dcache ports (2-load & 1-store)
- 1st level uTLB w/ 64 entries
- 2nd level dTLB w/ 256 entries for 4K, 128 for 2M, and 16 for 1G pages

#### Others

- L1 (IPP) and L2 prefetcher
- Fast unaligned support
- Fast gather/scatter support

### Knights Landing's on-package HBM memory

Maximizes performance through higher memory bandwidth and flexibility

- Explicit allocation allowed with open-sourced API memkind, Fortran attributes, and C++ allocator
- Mode chosen at boot time

### DDR and MCDRAM Bandwidth vs. Latency

MCDRAM latency is higher than DDR at low loads but much lower at high loads

### Memory Placement

KNL specifics

- Does the entire application fit in MCDRAM?
- Flat, cache or hybrid mode
- Keep heavily used data in MCDRAM
- Consider affinity

### **Knights Landing products**

### Knights Landing performance

<sup>1</sup>Projected KNL performance (1 socket, 200W CPU TDP) vs. 2-socket Intel® Xeon® processor E5-2697v3 (2x145W CPU TDP)

Significant performance improvement for compute and bandwidth sensitive workloads, while still providing good general purpose throughput performance

### Best practices for SIMD vectorization

### Exploiting the parallel universe

Three levels of parallelism supported by Intel hardware, from distant parallelism (TLP) to near parallelism (ILP)

#### Task Level Parallelism (TLP)

- Multi thread/task (MT) performance
- Exposed by programming models
- Execute tens/hundreds/thousands of tasks concurrently

#### Data Level Parallelism (DLP)

- Single thread (ST) performance
- Exposed by tools and programming models
- Operate on 4/8/16 elements at a time

#### Instruction Level Parallelism (ILP)

- Single thread (ST) performance
- Automatically exposed by HW/tools
- Effectively limited to a few instructions

It is the programmer's responsibility to expose DLP/TLP

### Single Instruction Multiple Data (SIMD)

#### Technique for exploiting DLP on a single thread

- Operate on more than one element at a time
- Might decrease instruction counts significantly

#### Elements are stored on SIMD registers or vectors

#### Code needs to be vectorized

- Vectorization usually on *inner* loops
- Main and remainder loops are generated

### Past, present, and future of Intel SIMD types

### Intel® AVX2/IMCI/AVX-512 differences

| | Intel® Initial Many Core Instructions <b>IMCI</b> | Intel® Advanced Vector Extensions 2 <b>AVX2</b> | Intel® Advanced Vector Extensions 512 <b>AVX-512</b> |
|---|---|---|---|
| Introduction | 2012 | 2013 | 2015-2016 |
| Products | Knights Corner | Haswell, Broadwell | Knights Landing, future Intel® Xeon® and Xeon® Phi™ products |
| Register file | SP/DP/int32/int64 data types<br>32 x 512-bit SIMD registers<br>8 x 16-bit mask registers | SP/DP/int32/int64 data types<br>16 x 256-bit SIMD registers<br>No mask registers (instr. blending) | SP/DP/int32/int64 data types<br>32 x 512-bit SIMD registers<br>8 x (up to) 64-bit mask registers |
| ISA features | Not compatible with AVX\*/SSE\*<br>No unaligned data support<br>Embedded broadcast/cvt/swizzle<br>MVEX encoding | Fully compatible with AVX/SSE\*<br>Unaligned data support (penalty)<br>VEX encoding | Fully compatible with AVX\*/SSE\*<br>Fast unaligned data support<br>Embedded broadcast/rounding<br>EVEX encoding |
| Instruction features | Fused multiply-and-add (FMA)<br>Partial gather/scatter<br>Transcendental support | Fused multiply-and-add (FMA)<br>Full gather | Fused multiply-and-add (FMA)<br>Full gather/scatter<br>Transcendental support (ERI only)<br>Conflict detection instructions<br>PFI/BWI/DQI/VLE (if applicable) |

Intel® AVX-512 is a major step in unifying the instruction set of Intel® MIC and Intel® Xeon® architecture

### Side effects of SIMD vectorization

```
float a[1024], b[1024], c[1024];
...
for (int i = 0; i < 1024; i++)
    c[i] = a[i] + b[i];
```

#### Assumptions

- 64-byte cache lines, 4-byte SP elements (float)
- 32-byte (AVX2) and 64-byte (IMCI/AVX-512) SIMD registers
- No hardware prefetcher, no ld+op instructions, arrays not cached

#### Observations

- Significant instruction count reduction (up to the vector length)
- IPC decreases, but so does execution time
- Usually translates into speedup
- Compute-bound codes turn into memory-bound codes
- If the code was already memory bound, no benefits at all (other than energy reduction)

| #Instructions | Scalar | <b>AVX2</b> (256-bit) | <b>IMCI / AVX-512</b> (512-bit) |
|---|---|---|---|
| Loads (hit) to a[], b[] | 960 + 960 | 64 + 64 | 0 |
| Loads (miss) to a[], b[] | 64 + 64 | 64 + 64 | 64 + 64 |
| SP adds | 1024 | 128 | 64 |
| Stores to c[] | 1024 | 128 | 64 |
| Total (Reduction) | 4096 (x1) | 512 (x8) | 256 (x16) |

### Vectorization on Intel compilers

- Auto vectorization: libraries, compiler knobs
- Guided vectorization: compiler hints/pragmas, array notation
- Low-level vectorization: C/C++ vector classes, intrinsics/assembly

(Figure axes: ease of use vs. fine control.)

### Rely on Intel® performance libraries

Highly efficient SIMD implementation of common functions for multiple Intel® processors

#### INTEL® INTEGRATED PERFORMANCE PRIMITIVES (INTEL® IPP)

A library of optimized building blocks for media and data applications. Take advantage of the unique capabilities of Intel processor families using optimized low-level APIs with significant emphasis on signal processing and certain media-focused applications, with cross-OS support and an internal dispatcher capable of selecting the prime optimization path.

#### INTEL® MATH KERNEL LIBRARY (INTEL® MKL)

The fastest and most used math library for Intel® and compatible processors. Harness the power of today's processors, with increasing core counts, wider vector units, and more varied architectures. Includes highly vectorized and threaded linear algebra, fast Fourier transforms, vector math, and statistics functions. Through a single API call, these functions automatically scale for future processor architectures by selecting the best code path for each.

#### INTEL® DATA ANALYTICS ACCELERATION LIBRARY (INTEL® DAAL)

Crunch more big data on the same node with Intel® DAAL for C++ and Java. The library provides highly optimized algorithmic building blocks to speed big data analytics performance on platforms from edge devices to servers. It encompasses data analysis stages (preprocessing, transformation, analysis, modeling, and decision making) for offline, streaming, and distributed analytics usages. Tight integration with popular data platforms (including Hadoop\* and Spark\*) enables highly efficient data access.
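The "internal dispatcher" and "best code path" wording above refers to choosing an implementation at run time according to what the processor supports. The sketch below illustrates only that general idea and is not how Intel® IPP or Intel® MKL are implemented. All function names are hypothetical, both paths contain the same plain C loop, and `__builtin_cpu_supports` is a GCC/Clang builtin for x86 (on other compilers or targets the sketch falls back to the generic path).

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every implementation of the kernel. */
typedef void (*vadd_fn)(float *c, const float *a, const float *b, size_t n);

/* Baseline path: valid on any processor. */
static void vadd_generic(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Stand-in for a path that a real library would compile for AVX2.
 * In this sketch it is the same loop, so results are identical. */
static void vadd_avx2(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Returns 1 when the running processor reports AVX2 support. */
static int cpu_has_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0; /* unknown platform: stay on the generic path */
#endif
}

/* Select the implementation for this processor. */
static vadd_fn select_vadd(void)
{
    return cpu_has_avx2() ? vadd_avx2 : vadd_generic;
}

/* Public entry point: selects once, then reuses the choice. */
void vadd(float *c, const float *a, const float *b, size_t n)
{
    static vadd_fn impl = NULL;
    if (impl == NULL)
        impl = select_vadd();
    assert(impl != NULL);
    impl(c, a, b, n);
}
```

The caller only ever sees `vadd`, which is the point of the design: new processor-specific paths can be added behind the same entry point without changing application code.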
All libraries available at no cost with **Community Licensing** (Intel® support not included)

### Intel® Distribution for Python\*

- Performance-optimized Python distribution for technical computing & data analysis
- Performance accelerations powered by Intel® MKL
  - NumPy/SciPy packages accelerated with Intel® MKL
  - NumPy: fundamental package for scientific computation in Python. Support for large multidimensional arrays & matrices. High level mathematical functions
  - SciPy: science & engineering modules
  - Pandas, sympy, matplotlib, scikit-learn
- Easy, intuitive product experience: easy installation, package management, access to performance
- Python 2.7 & 3.5
- Windows & Linux. Mac OS in 2016

## Access multiple options with our Python Distribution

#### **Accelerate with native libraries**

- NumPy, SciPy, Scikit-Learn, Theano, Pandas, pyDAAL
- Intel MKL, Intel IPP, Intel DAAL

#### **Exploit vectorization and threading**

- Cython + Intel C++ compiler
- Numba + Intel LLVM

#### **Better/Composable threading**

- Cython, Numba
- Threading composability for MKL, CPython, Blaze/Dask, Numba

#### **Multi-node parallelism**

- Mpi4Py, Distarray
- Intel native libraries: Intel MPI

#### **Integration with Big Data, ML platforms and frameworks**

- Spark, Hadoop, Trusted Analytics Platform

#### **Better performance profiling**

- Extensions for profiling mixed Python & native/JIT codes

Join the Intel® Distribution for Python\* 2017 Beta

## Auto vectorization

#### For C/C++ and Fortran

Relies on the compiler for vectorization

- No source code changes
- Enabled with -vec compiler knob (default in -O2 and -O3 optimization levels)

| Opt. level | Description |
|---|---|
| -O0 | Disables all optimizations. |
| -O1 | Enables optimizations for speed which are known not to cause code size increase. |
| -O2/-O (default) | Enables intra-file interprocedural optimizations for speed, including:<ul><li>Vectorization</li><li>Loop unrolling</li></ul> |
| -O3 | Performs -O2 optimizations and enables more aggressive loop transformations such as:<ul><li>Loop fusion</li><li>Block unroll-and-jam</li><li>Collapsing IF statements</li></ul>This option is recommended for applications that have loops that heavily use floating-point calculations and process large data sets. However, it might result in slower code, numerical stability issues, and longer compilation times. |

## Auto vectorization: not all loops will vectorize

#### Data dependencies between iterations

- Proven Read-after-Write (RaW), i.e., loop-carried, data dependencies
- Assumed data dependencies
  - Aggressive optimizations (e.g., IPO) might help

RaW dependency

```
for (int i = 1; i < N; i++)
    a[i] = a[i-1] + b[i];
```

#### Vectorization won't be efficient

- Compiler estimates how much better the vectorized version will be
- Affected by data alignment, data layout, etc.

Inefficient vectorization

```
for (int i = 0; i < N; i++)
    a[c[i]] = b[d[i]];
```

#### Unsupported loop structure

- While-loop, for-loop with unknown number of iterations
- Complex loops, unsupported data types, etc.
- (Some) function calls within loop bodies
  - Not the case for SVML functions

Function call within loop body

```
for (int i = 0; i < N; i++)
    a[i] = foo(b[i]);
```

## Auto vectorization on Intel compilers

(Figure: Polyhedron benchmark suite. Intel® Xeon Phi™ 7120A, 61 cores x 4 threads. Intel® Fortran Compiler 15.0.1.14 [-O3 -fp-model fast=2 -align array64byte -ipo -mmic].)

## Validating vectorization success

#### Generate compiler report about optimizations

```
-qopt-report[=n]             Generate report (level [1..5], default 2)
-qopt-report-file=<fname>    Optimization report file (stderr, stdout also valid)
-qopt-report-phase=<phase>   Info about opt. phase:
```

```
loop      Loop nest optimizations
par       Auto-parallelization
vec       Vectorization
openmp    OpenMP
offload   Offload
ipo       Interprocedural optimizations
pgo       Profile Guided optimizations
cg        Code generation optimizations
tcollect  Trace analyzer (MPI) collection
all       All optimizations (default)
```

Vectorized loop

```
LOOP BEGIN at gas_dyn2.f90(193,11) inlined into gas_dyn2.f90(4326,31)
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 53
   remark #15477: vector loop cost: 14.870
   remark #15478: estimated potential speedup: 2.520
   remark #15479: lightweight vector operations: 19
   remark #15481: heavy-overhead vector operations: 1
   remark #15488: --- end vector loop cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
   remark #25015: Estimate of max trip count of loop=4
LOOP END
```

Non-vectorized loop

```
LOOP BEGIN at gas_dyn2.f90(2346,15)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between IOLD line 376 and IOLD line 354
   remark #25015: Estimate of max trip count of loop=3000001
LOOP END
```

## Guiding vectorization: disambiguation hints

#### Get rid of assumed vector dependencies

- Assume function arguments won't be aliased
  - C/C++: compile with -fargument-noalias
- C99 "restrict" keyword for pointers
  - Or compile with -restrict knob
- Ignore assumed vector dependencies with Intel-specific compiler directive
  - C/C++: #pragma ivdep
  - Fortran: !dir$ ivdep

```
void v_add(float *c, float *a, float *b)
{
#pragma ivdep
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```

## Target architecture compiler options

On which architecture do we want to run our program?

| Option | Description |
|---|---|
| -mmic | Builds an application that runs natively on Intel® MIC Architecture. |
| -xfeature<br>-xHost | Tells the compiler which processor features it may target, referring to which instruction sets and optimizations it may generate (not available for Intel® Xeon Phi™ architecture). Values for feature are:<br>• COMMON-AVX512 (includes AVX512 FI and CDI instructions)<br>• MIC-AVX512 (includes AVX512 FI, CDI, PFI, and ERI instructions)<br>• CORE-AVX512 (includes AVX512 FI, CDI, BWI, DQI, and VLE instructions)<br>• CORE-AVX2<br>• CORE-AVX-I (including RDRND instruction)<br>• AVX<br>• SSE4.2, SSE4.1<br>• ATOM_SSE4.2, ATOM_SSSE3 (including MOVBE instruction)<br>• SSSE3, SSE3, SSE2<br>When using -xHost, the compiler will generate instructions for the highest instruction set available on the compilation host processor. |
| -axfeature | Tells the compiler to generate multiple, feature-specific auto-dispatch code paths for Intel® processors if there is a performance benefit. Values for feature are the same as described for the -xfeature option. Multiple features/paths are possible, e.g.: -axSSE2,AVX. It also generates a baseline code path for the default case. |

Vectorized code will be different depending on the chosen target architecture

## Some Intel-specific compiler directives

#### For C/C++ and Fortran

| Directive | Description |
|---|---|
| [no]block_loop | Enables or disables loop blocking for the immediately following nested loops. |
| distribute, distribute_point | Instructs the compiler to prefer loop distribution at the location indicated. |
| inline | Instructs the compiler to inline the calls in question. |
| ivdep | Instructs the compiler to ignore assumed vector dependencies. |
| loop_count | Indicates the loop count is likely to be an integer. |
| optimization_level | Enables control of optimization for a specific function. |
| parallel/noparallel | Facilitates auto-parallelization of an immediately following loop; using keyword always forces the compiler to auto-parallelize; noparallel pragma prevents auto-parallelization. |
| | [no]unroll | Instructs the compiler the number of times to unroll/not to unroll a loop | | [no]unroll_and_jam | Prevents or instructs the compiler to partially unroll higher loops and jam the resulting loops back together. | | unused | Describes variables that are unused (warnings not generated). | | [no]vector | Specifies whether the loop should be vectorised. In case of forcing vectorization that should be according to the given <u>clauses</u> . | ## Enforcing vectorization with SIMD directives #### Intel-specific idioms C/C++ (also part of Cilk™ Plus) - Enforcing loop vectorization ignoring all dependencies - #pragma simd in front of vectorizable loop - Simd <u>keyword</u> right after for/cilk for loop keyword - Declaring <u>vectorized functions</u> - attribute ((vector))/ declspec(vector) on Linux/Windows ``` void vadd(float *c, float *a, float *b) { #pragma simd for (int i = 0; i < N; i++) c[i] = a[i] + b[i]; } SIMD loop</pre> ``` ``` __declspec(vector) void vadd(float c, float a, float b) { c = a + b; } ... for (int i = 0; i < N; i++) vadd(C[i], A[i], B[i]); SIMD function</pre> ``` #### Fortran !dir\$ simd,!dir\$ attributes vector All directive idioms accept additional clauses (e.g., define reductions, etc.) ## Improving vectorization: data layout Vectorization more efficient with unit strides - Non-unit strides will generate gather/scatter - Unit strides also better for data locality - Compiler might refuse to vectorize Layout your data as Structure of Arrays (SoA) As opposite to Array of Structures (SoA) Traverse matrices in the right direction - C/C++: a[i][:], Fortran: a(:,i) - Loop interchange might help - Usually the compiler is smart enough to apply it - Check compiler optimization report ``` Array of Structures (AoS) struct coordinate { float x, y, z; } crd[N]; ... for (int i = 0; i < N; i++) ... = ... f(crd[i].x, crd[i],y, crd[i].z);</pre> ``` ``` Consecutive elements in memory x0 y0 z0 x1 y1 z1 ... 
x(n-1) y(n-1) z(n-1)
```

```
// Structure of Arrays (SoA)
struct coordinate {
    float x[N], y[N], z[N];
} crd;
...
for (int i = 0; i < N; i++)
    ... = ... f(crd.x[i], crd.y[i], crd.z[i]);
```

```
Consecutive elements in memory
x0 x1 ... x(n-1) y0 y1 ... y(n-1) z0 z1 ... z(n-1)
```

## Improving vectorization: data alignment (cont'd)

| How to | Language | Syntax | Semantics |
|-------------------|---------------------------------|--------|-----------|
| align data | C/C++ | <pre>void* _mm_malloc(int size, int n)</pre> | Allocate memory on heap aligned to an *n*-byte boundary. |
| align data | C/C++ | <pre>int posix_memalign (void **p, size_t n, size_t size)</pre> | Allocate memory on heap aligned to an *n*-byte boundary. |
| align data | C/C++ | <pre>__declspec(align(n)) array (Windows)<br>__attribute__((aligned(n))) array (Linux)</pre> | Alignment for variable declarations. |
| align data | C++11 | <pre>alignas(expression) or alignas(type)</pre> | Alignment for variable declarations. |
| align data | Fortran (not in common section) | <pre>!dir$ attributes align:n::array</pre> | Alignment for variable declarations. |
| align data | Fortran (compiler option) | <pre>-align nbyte</pre> | Alignment for variable declarations. |
| tell the compiler about it | C/C++ | <pre>#pragma vector aligned</pre> | Vectorize assuming all array data accessed are aligned (may cause a fault otherwise). |
| tell the compiler about it | Fortran | <pre>!dir$ vector aligned</pre> | Vectorize assuming all array data accessed are aligned (may cause a fault otherwise). |
| tell the compiler about it | C/C++ | <pre>__assume_aligned(array, n)</pre> | Compiler may assume array is aligned to an *n*-byte boundary. |
| tell the compiler about it | Fortran | <pre>!dir$ assume_aligned array:n</pre> | Compiler may assume array is aligned to an *n*-byte boundary. |
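
The two steps in the table (align the data, then tell the compiler about it) could be combined roughly as in the following C sketch. It is only an illustration under stated assumptions: the names `ALIGNMENT`, `alloc_aligned_floats`, `is_aligned` and `vadd_aligned` are made up for this example, and the GCC/Clang builtin `__builtin_assume_aligned` is used as a stand-in for the Intel-specific `__assume_aligned` listed above.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign under strict -std modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical constant: n = 64 bytes, the width of one 512-bit vector register. */
#define ALIGNMENT 64

/* Hypothetical helper: allocate `count` floats on an ALIGNMENT-byte boundary.
 * Returns NULL on failure; release the memory with free(). */
float *alloc_aligned_floats(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, ALIGNMENT, count * sizeof(float)) != 0)
        return NULL;
    return p;
}

/* Hypothetical helper: 1 if `p` lies on an ALIGNMENT-byte boundary, else 0. */
int is_aligned(const void *p)
{
    return ((uintptr_t)p % ALIGNMENT) == 0;
}

/* c[i] = a[i] + b[i]. The caller promises that all three arrays are aligned. */
void vadd_aligned(float *c, const float *a, const float *b, size_t n)
{
    /* Debug-build check of the promise; compiled out with -DNDEBUG. */
    assert(is_aligned(c) && is_aligned(a) && is_aligned(b));
#if defined(__GNUC__) || defined(__clang__)
    /* Assumption: GCC/Clang builtin, used here instead of Intel's __assume_aligned. */
    c = __builtin_assume_aligned(c, ALIGNMENT);
    a = __builtin_assume_aligned(a, ALIGNMENT);
    b = __builtin_assume_aligned(b, ALIGNMENT);
#endif
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Whether the compiler then emits aligned vector loads and drops the peel loop depends on the compiler and the target flags, so the optimization report is still worth checking.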

*n*=64 for Intel® Xeon Phi™ coprocessors, *n*=32 for AVX, *n*=16 for SSE

Padding might be necessary to guarantee aligned access to matrices

## Vectorization with multi-version loops

#### Peel loop

- Alignment purposes
- Might be vectorized

#### Main loop

- Vectorized
- Unrolled by x2 or x4

#### Remainder loop

- Remainder iterations
- Might be vectorized

```
LOOP BEGIN at gas_dyn2.f90(2330,26)
<Peeled>
   remark #15389: vectorization support: reference AMAC1U has unaligned access
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15301: PEEL LOOP WAS VECTORIZED

LOOP BEGIN at gas_dyn2.f90(2330,26)
   remark #25084: Preprocess Loopnests: Moving Out Store
   remark #15388: vectorization support: reference AMAC1U has aligned access
   remark #15399: vectorization support: unroll factor set to 2
   remark #15300: LOOP WAS VECTORIZED
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 8
   remark #15477: vector loop cost: 0.620
   remark #15478: estimated potential speedup: 15.890
   remark #15479: lightweight vector operations: 5
   remark #15488: --- end vector loop cost summary ---
   remark #25018: Total number of lines prefetched=4
   remark #25019: Number of spatial prefetches=4, dist=8
   remark #25021: Number of initial-value prefetches=6

LOOP BEGIN at gas_dyn2.f90(2330,26)
<Remainder>
   remark #15388: vectorization support: reference AMAC1U has aligned access
   remark #15388: vectorization support: reference AMAC1U has aligned access
   remark #15301: REMAINDER LOOP WAS VECTORIZED
```

## Improving vectorization: trip count hints

Vectorization can be seen as aggressive unrolling

- Main loop usually unrolled by x2 or x4
- Peel and remainder loops are vectorized with masks
- If the trip count is low, vectorization might not be efficient
  - Remainder loop becomes the hotspot

#### Take a look at remainder loops

- Specify loop trip counts for efficient vectorization
  - #pragma loop\_count (n1[, n2...]) / !dir\$ loop count (n1[, n2...])
  - #pragma loop\_count min(n1), max(n2), avg(n3) / !dir\$ loop count min(n1), max(n2), avg(n3)
- Consider the <u>safe padding</u> option (Intel® Xeon Phi™ only)
  - Otherwise, remainder loops use gather/scatter instructions
  - Use -qopt-assume-safe-padding to avoid it

## Low level (explicit) vectorization

A.k.a. "ninja programming" (C/C++ only)

- Vectorization relies on the programmer, with some help from the compiler
- Might be convenient for low-level performance tuning of critical hotspots
- Not portable among different SIMD architectures

| SIMD C++ classes | <u>Intrinsics</u> | Inline assembly |
|------------------|-------------------|-----------------|
| <pre>#include &lt;fvec.h&gt;<br>F32vec4 a,b,c;<br>a = b + c;</pre> | <pre>#include &lt;xmmintrin.h&gt;<br>__m128 a,b,c;<br>a = _mm_add_ps(b,c);</pre> | <pre>__m128 a,b,c;<br>_asm {<br>  movaps xmm0,b<br>  movaps xmm1,c<br>  addps xmm0,xmm1<br>  movaps a, xmm0<br>}</pre> |

# Design, analysis and verification tools

Intel® Parallel Studio XE 2016 (Professional/Cluster Editions)

## Intel® Advisor XE 2016

## A design/analysis tool for threading your code

"What-if" analysis tool for thread design and prototyping

- Analyze, design, tune, and check your threading design before implementation
- Explore and test threading options without disrupting normal development
- Predict performance scaling on Intel® Xeon® and Xeon Phi™ architectures

#### What's new in the 2016 version?
Tool completely redesigned to add vectorization capabilities as well

## Intel® Advisor XE 2016

## A design/analysis tool for vectorizing your code

#### Survey analysis

- See what prevents vectorization
- Detect vectorization issues
- Source/assembly integration
- Optimization reports
- Automatic recommendations

#### Trip-count analysis

- How many iterations in a loop
- Quantify peel/main/remainder

#### Deeper analyses

- Correctness analysis to see if a loop can be safely vectorized
- Memory access pattern (MAP) analysis to figure out the actual vectorization stride

Complete tutorial in Intel's magazine "The Parallel Universe" (Issue 22)

## Survey report: the right data at your fingertips

Get all the data you need for high impact vectorization

## Source code and assembly integration

## Recommendations

Get specific advice for improving vectorization

## Intel® VTune™ Amplifier XE 2016

#### Performance profiler for serial/parallel programs

- GUI and command-line interfaces

#### Event collection/instrumentation

- No special recompiles
- Local/remote event collection
- Low overhead

#### Analysis features

- Quickly locate hotspots
- Identify issues in source code
- Threading analysis
- Visualize thread behaviour
- Find uarch bottlenecks

Check <u>release notes</u> and "What's new in 2016?"
for product updates

## Intel® VTune™ Amplifier XE analysis types

## Software user mode sampling and tracing collector

Any processor, any virtual environment, no driver (about 5% overhead)

| Analysis | Description |
|--------------------|-------------|
| Basic hotspots | This analysis helps understand application flow and identify sections of code that get a lot of execution time (hotspots). It also captures the call stacks for each of these functions so you can see how the hot functions are called. |
| <u>Concurrency</u> | This analysis helps identify hotspot functions where processor utilization is poor, by providing information on how many threads were running (not waiting at a defined waiting or blocking API) at each moment during application execution. When cores are idle at a hotspot, you have an opportunity to improve performance by getting those cores working for you. |
| Locks and waits | This analysis helps <b>identify the cause of ineffective processor utilization</b>. One of the most common problems is threads waiting too long on synchronization objects (locks). With this analysis you can <b>estimate the impact each synchronization object</b> has on the application and understand how long the application had to wait on each synchronization object, or in blocking APIs, such as sleep and blocking I/O. |

## Intel® VTune™ Amplifier XE analysis types

### Hardware Event-Based Sampling (EBS)

Higher resolution, system wide, lower overhead (about 2% overhead)

| Analysis | Description |
|--------------------------|-------------|
| Advanced hotspots | This analysis is a fast and easy way to identify performance-critical code sections (hotspots). By default, it does not capture the function call stacks as the hotspots are collected, but it can be used to sample all processes on the system. |
| General exploration | This analysis is a good starting point to triage hardware issues in your application by understanding how efficiently your code is passing through the core pipeline. It calculates a set of predefined ratios used for the metrics and facilitates identifying hardware-level performance problems. The list of events and metrics collected during the General Exploration analysis depends on your microarchitecture. |
| Memory access (new!) | Use this analysis to <b>identify memory-related issues</b>, like NUMA problems and bandwidth-limited accesses, and attribute performance events to memory objects (data structures). This analysis replaces the "Bandwidth analysis" present in previous versions of the tool. |

## Intel® Parallel Studio XE 2016 components

Full licensing (including Intel® Premier Support): Composer, Professional and Cluster Editions. Free licensing: Student/Educator, Open Source Contributor, Academic Researcher and Community (everyone!).

| Component | Composer<br>Edition | Professional<br>Edition | Cluster<br>Edition | Student/<br>Educator | Open Source<br>Contributor | Academic<br>Researcher | Community<br>(Everyone!) |
|---|---|---|---|---|---|---|---|
| Intel® C/C++ Compiler<br>(including Intel® Cilk™ Plus) | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Intel® Fortran Compiler | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| OpenMP 4.0 | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Intel® Threading Building Blocks (C++ only) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Intel® IPP Library (C/C++ only) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Intel® Math Kernel Library | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Intel® Data Analytics Acceleration Library | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Intel® MPI Library | | | ✓ | ✓ | | ✓ | |
| Rogue Wave IMSL Library<br>(Fortran only) | Bundled<br>and Add-on | Add-on | Add-on | | | | |
| Intel® Advisor XE | | ✓ | ✓ | ✓ | ✓ | | |
| Intel® Inspector XE | | ✓ | ✓ | ✓ | ✓ | | |
| Intel® VTune™ Amplifier XE | | ✓ | ✓ | ✓ | ✓ | | |
| Intel® ITAC + MPI Performance Snapshot | | | ✓ | ✓ | | | |

## Beta Program Intel® Parallel Studio XE 2017

- Compiler 17.0 is part of Intel® Parallel Studio XE 2017
- Beta program
of IPSXE-2017 started at the end of March 2016
  - To join, please register at: <a href="http://bit.ly/psxe2017beta">http://bit.ly/psxe2017beta</a>
  - More information:
    - Beta program overview: <a href="https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta">https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta</a>
    - Release notes page: <u>https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta-release-notes</u>

## Intel® Parallel Studio XE: Improvements

- Annotated source listings
  - Modified copy of the source with line numbers and compiler diagnostics inserted
- Code alignment for functions and loops
- Optimization reports
  - More precise non-vectorization reasons
  - Significant improvement in variable names and memory references
- Additional diagnostic messages

## How to get ready for Intel® AVX-512?

Start optimizing your application today for the current generation of Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors

*and/or*

Compile with the latest compiler toolchains

- Intel compiler (v15.0+): -xCOMMON-AVX512, -xMIC-AVX512, -xCORE-AVX512
- GNU compiler (v4.9+): -mavx512f, -mavx512cd, -mavx512er, -mavx512pf

Tune your AVX-512 kernels on silicon that is not yet available

- Run your kernels on top of the <u>Intel® Software Development Emulator</u> (SDE)
- Emulate (future) Intel® Architecture Instruction Set Extensions (e.g. Intel® MPX, ...)
- Tools available for detailed analysis
  - Instruction type histogram
  - Pointer/misalignment checker
- Also possible to debug the application being emulated

## Key new features for software adaptation to KNL

#### Large impact: Intel® AVX-512 instruction set

- 32 512-bit FP/Int vector registers, 8 mask registers, HW gather/scatter
- Slightly different from the AVX-512 extensions of the future Intel® Xeon® architecture
- Backward compatible with SSE, AVX, AVX2
  - Apps built for HSW and earlier can run on KNL (few exceptions, like TSX)
  - Incompatible with 1st generation Intel® Xeon Phi™ (KNC)

#### Medium impact: new, on-chip high bandwidth memory (HBM)

- Creates heterogeneous (NUMA) memory access
- It can also be used transparently, however

#### Minor impact: differences in floating point execution/rounding

- New HW-accelerated transcendental functions like exp()

## Pre-Order Developer Platform for Intel® Xeon Phi™ Processor *Today!*

Unleash your code's potential

## For Code/Application Developers

- Academic
- Scientific applications
- Physics
- Big data analytics
- Life sciences: genomics
- Life sciences: molecular dynamics
- Finance
- Oil & gas
- Manufacturing
- Modeling, simulation, visualization

Leading edge platform capabilities and performance to deliver multithreaded, vectorized software for today's HPC workloads!
http://xeonphideveloper.com

Colfax Developer Access Program (DAP) for Intel® Xeon Phi™ processor codenamed Knights Landing (KNL)

#### Support & Training

- Online community access
- Support from Colfax/local OEMs
- Training: hands-on webinars, optimization guidance, whitepapers, videos, how-to guides

#### Highly-Parallel Performance

- Intel® Xeon Phi™ processor: 512-bit SIMD vectors with 2 VPUs/core, 16 GB MCDRAM integrated memory
- Binary-compatible with Intel® Xeon® processors

#### Software Tools & Libraries

- CentOS 7.2
- Includes Intel Parallel Studio XE 2016 1-year license
  - Featuring the new Intel Vector Advisor for parallelization
- Access to Intel libraries

## Online resources

#### Intel® Software Development Products, performance tuning, etc.

- <u>Documentation library</u>: all available documentation about Intel software
- <u>HPC webinars</u>: free technical webinars about HPC on Intel platforms
- Modern code: Intel resources about code modernization
- Forums: public discussions about Intel SIMD, threading, ISAs, etc.

#### Intel® Xeon Phi™ resources

- <u>Developer portal</u>: programming guides, tools, trainings, case studies, etc.
- <u>Solutions catalog</u>: existing Intel® Xeon Phi™ solutions for known codes

#### Other resources (white papers, benchmarks, case studies, etc.)
- <u>Go parallel</u>: BKMs for Intel multi- and many-core architectures
- <u>Colfax research</u>: publications and material on parallel programming
- <u>Bayncore labs</u>: research and development activities (WIP)

## Online resources

- Detailed information about available Intel products: http://ark.intel.com/
- Encoding scheme of Intel processor numbers: www.intel.com/products/processor_number

## Recommended books

- High performance parallelism pearls: multi-core and many-core approaches (Vol. 2), by James Reinders and Jim Jeffers, Morgan Kaufmann, 2015
- High performance parallelism pearls: multi-core and many-core approaches, by James Reinders and Jim Jeffers, Morgan Kaufmann, 2014
- Intel® Xeon Phi™ Coprocessor High Performance Programming, by Jim Jeffers and James Reinders, Morgan Kaufmann, 2013

#### Coming up!

- Intel® Xeon Phi™ High Performance Programming: Knights Landing Edition, 2nd Edition, 2016