

## A High Performance Platform for Real-Time X-ray Imaging

# **Agenda**

Synchrotron Tomography at KIT Hardware & Software Platform Optimizing Tomography Speed

## **Architectures**

NVIDIA GT200 NVIDIA Fermi NVIDIA Kepler

AMD VLIW5 AMD GCN





Example for 3D X-Ray imaging.
The functional groups of a flightless
weevil are colored

## **Authors**

Suren A. Chilingaryan, KIT Michele Caselle, KIT Thomas van de Kamp, KIT Andreas Kopmann, KIT Alessandro Mirone, ESRF Uros Stevanovic, KIT Tomy dos Santos Rolo, KIT Matthias Vogelgesang, KIT



# In collaboration with ESRF: European Synchrotron Radiation Facility Polygone Scientifique Louis Néel, 6 rue Jules Horowitz, 38000 GRENOBLE

# **ANKA Synchrotron at KIT**







ANKA synchrotron (left) and schematics of TOPO-TOMO beamline (right).

The rotating sample in front of a pixel detector is penetrated by X-rays produced in the synchrotron. Absorption at different angles is registered by camera and 3D map of sample denisity is reconstructed.

# **Examples**







Heads of a newt larva showing bone formation and muscle insertions (top) and a stick insect (bottom), acquisition time 2s.

4D tomogram of wheat weevil

# **UFO Project**



## <u>U</u>Itra <u>Fast X-ray Imaging of Scientific Processes with <u>O</u>n-Line Assessment and Data-Driven Process Control</u>

## **Goals**

- >Increase user throughput
- High speed tomography
- >Tomography of Temporal Processes
- Allow Interactive Quality Assessment
- >Enable Data Driven Control
  - >Auto-tunning Optical System
  - >Tracking Dynamic Processes
  - Finding Area of Interest



# **Current Hardware Configuration**



PCO.edge PCO.dimax PCO.4000



CameraLink

850MB/s



Ethernet (10 Gb/s)

LSDF
Large Scale Data Facility

External PCIe x16 (8 GB/s)

SFF8088 (2.4 GB/s)



**External GPU Box** 



**SuperMicro 7046GT-TRF** (Dual Intel 5520 Chipset)

CPU: 2 x Xeon X5650 (total 12 cores at 2.66 Ghz)

GPUs: 4 x GTX590 External

Memory: 96 GB / 12 DDR3 slots (192GB max)

Network: Intel 82598EB (10 Gb/s)

Camera Link Frame Grabber (850 MB/s)

Storage: Areca ARC-1880-ix-12 SAS Raid

16 x Hitachi A7K200 (Raid6)

8 x Intel SSD 510 (Raid0)



JBOD JBOD

## SAS Attached Storage



# Next Generation: High-speed Programmable Camera

"made in house"





Readout mother card

FPGA Virtex 6

Peltier cells

Flex electrical cable

( $max \ f = 183MHz$ )

Daughter card

CMOS pixel sensor

Vacuum or  $N_2$ Optical body

and lens

First Camera Prototype

- High speed CMOS sensor
- → 1Mpix, 5000 fps, 10 bits
- → Self-trigger & Data compression
- On-line elaborations and control
- Full Programmability
- Direct connection to Infiniband-cluster

# Next Generation: Processing Cluster









## **UFO Framework**





# **Synchrotron Tomography**





The sample is evenly rotated and the pixel detector registers series of parallel 2D projections of the sample density at different angles.

# Tomographic Reconstruction



Filtered back-projection is used to produce 3D images from a manifold of two dimensional projections. Vertical slices are processed independently. For each slice all projections are smeared back onto the cross section along the direction of incidence yielding an integrated image.





# **Pipelined Architectures**





- 1. Reading data from fast SSD Raid-0 (random reads are effective)
- 2. Scheduling and preprocessing using SIMD instructions of x86 CPUs
- 3. Reconstructing on GPUs
- 4. Storing to Raid on magnetic disks (sequential writes are effective)



# Performance of Tomographic Reconstruction





# **Basic Implementation**





## **Extension Box**





1 x PCIe x16 2.0 4 x GTX590 8 GPU cores

External GPU Enclosure by One Stop Systems

# Initialization (spuose) 20 15 10 5 1 2 3 4 5 6 7 8





Non-NUMA

8 GPUs, Speed-up

# **Back Projection Explained**





- 1. For each position we compute:  $x \bullet cos(\alpha) y \bullet sin(\alpha)$
- 2. Compute linear interpolation between 2 neighboring bins
- 3. Sum over all projection
- 4. The sum is the value of (x,y)

 $x \bullet \cos(\alpha) - y \bullet \sin(\alpha)$ For each texel of output volume and for each projection we perform a single linear interpolation



# Fermi: Balancing texture fetches with computations



## **Computational Units**

| GT280            | GTX580            | Speedup |
|------------------|-------------------|---------|
| 240 x 1.3<br>GHz | 512 x 1.55<br>GHz | 2.5 x   |
| 48 GT/s          | 49.4 GT/s         | 1.0 x   |

**Texture Engine** 



$$v = x \bullet cos(\alpha) - y \bullet sin(\alpha)$$

$$max_{x,y

$$N\sqrt{2} < 1.5 \text{ N}$$$$

Each block of threads accesses actually only 3 • N / 2 bins per projection



# Standard Version Texture engine is heavily loaded



Fermi-optimized Version

Both texture & computations engines are used

# Pixel to thread mapping





| Px. | Fetches/px. | Regs | ShMem | Occup. | Ind. Instr |
|-----|-------------|------|-------|--------|------------|
| 1   | 0.09375     | 26   | 1536  | 66%    | 1          |
| 4   | 0.046875    | 32   | 3072  | 66%    | 4          |

Processing 4 pixels per thread reducing amount of texture fetches and hides operation latencies with multiple independent operations (see Better Performance at Lower Occupancy presented by V. Volkov at GTC2010)



# Processed by a single thread block (16x16) 48 bins of a projection required for current block 16 of the projections processed in a single pass thr (1,1) thr (2,1) thr (3,1) thr (1,2) thr (2,2) thr (3,3)

Legend

Processing in multiple passes,

16 projections each

# **Optimizing shared memory reads**



Wrap 2, first half-wrap Wrap 1, second half-wrap Wrap 1, first half-wrap

Block of 16x16 threads

1 2 3 4 5 6 7 8 9 10111213141516

1 2 3 4 5 6 7 8 9 10111213141516

1 2 3 4 5 6 7 8 9 10111213141516

1 2 3 4 5 6 7 8 9 10111213141516

Less than 6 shared memory

Up to 16 shared Memory positions per half-wrap

Wrap 1, first half-wrap

We have better shared memory performance using this layout



# Reducing computation costs with oversampling



Linear interpolation is slow, and nearest neighbor is not precise enough

| Method     | Fetches/px. | Regs | ShMem | Occup. | Reads. | Flops. |
|------------|-------------|------|-------|--------|--------|--------|
| Linear     | 0.046875    | 32   | 3072  | 66%    | 2      | 7      |
| Oversample | 0.1875      | 42   | 12288 | 50%    | 1      | 4      |



With oversampling the texture engine is used to interpolate 4 positions for each projection bin and near-neighbor interpolation is used then.



# **Kepler: Fast Texture Engine is Back**



## **Simple Texture Method**

| Texture Cache Hit Rate | 89 %       |
|------------------------|------------|
| Texture Throughput     | 79.3 GT/s  |
| Theoretical Throughput | 128.8 GT/s |



| Block of 16x16 thread | Block | of | 16x16 | threads |
|-----------------------|-------|----|-------|---------|
|-----------------------|-------|----|-------|---------|

|                                   | GT580                      | GTX680                     | Increase |
|-----------------------------------|----------------------------|----------------------------|----------|
| Texture<br>Engine                 | 49.4 GT/s                  | 128.8 GT/s                 | 2.6 x    |
| Computational Units               | 16 x 32 x<br>1.55 GHz      | 8 x 192 x<br>1.006 GHz     | 1.94 x   |
| Int multipl., bit ops., type conv | 16 x 16 x<br>1.55 GHz      | 8 x 32 x<br>1.006 GHz      | 0.65 x   |
| Shared<br>Memory                  | 48 KB                      | 48 KB                      | 1        |
| Blocks per SM                     | 8                          | 16                         | 2        |
| Registers                         | 32K per SM,<br>63 per thr. | 64K per SM,<br>63 per thr. |          |

- 1. Up to 16 bins are accessed per half-wrap,
- 2. All threads are accessing a single texture row

# **Optimizing cache efficiency**





# **Faster rounding**





IEEE 754 single-precision floating point number

$$f = -1^{s} \cdot 2^{e-127} \cdot (1 + \sum_{i} b_{i} \cdot 2^{i-23})$$

Only 23 significant positions, for small positive numbers:

$$F + 2^{23} = 2^{23} \cdot (1 + \sum_{i} b_{i} \cdot 2^{i-23})$$

i.e. no fractional part

round(f) = 
$$f + 2^{23} - 2^{23}$$
  
(int)f =  $f + 2^{23} - 0x4B000000$ 

#### Kepler Oversampling Algorithm

- 1. New stage pre-computing per-block offsets
- 2. Offsets are exchanged using shuffle instruction
- 3. Faster rounding is not used due overlap of rounding and floating point operations



## **AMD Architectures**



Radeon HD 5970

### VLIW5

Requires quintuples of independent operations in command flow:

**Block size**: 16x16 => 8x8

Points per thread: 2 => 8



### Radeon HD 7970

## **GCN**

No special tunning required



- Only a single chip running in dual chip configurations
- Memory/Computations overlapping in beta and have not worked in my setup.
- Many functions are not optimal, for instance CopyRect family functions are slow
- Compiler Doesn't support local arrays, manual unrolling is required

# **Back Projection Kernels**





## **Kepler**

Uses texture engine, but processes 16 projections at once and 16 points per thread to enhance cache hit rate



## **Cypress**

Computes 16 points per thread in order to provide sufficient flow of independent instructions to VLIW engine



## Cypress Fast

Oversampling algorithm with 16 points per thread.



## GT200

Uses texture engine



Caches textures in shared memory and performs interpolations using computation nodes



## Fermi/GCN Fast

Oversampling uses texture engine to interpolate values at 4 predefined points and then uses near-neighbor interpolation to avoid costly computations



## **Kepler Fast**

Rounding optimization to get over performance limits of Kepler and usage of new shuffle instructions to exchange data between threads in wrap

# **Summary**



- GPU computing fits extremely well the needs of Synchrotron Imaging
- However, special care required to get to really high speeds
  - Pipelined architecture is efficient way to hide I/O time
  - The architecture-specific optimizations are often required
- We develop a platform for high speed time resoluted X-ray Imaging with possibility of real-time control
- Open-source image processing framework is designed
  - GPU/CPU processing with OpenCL
  - Integration with scripting languages using Gobject-introspection
  - Available from http://ufo.kit.edu/framework
- A programmable camera is currently under design to enable real-time control
  - Up to 1 Mpix at 5000 frames per second
  - Direct connection to Infiniband cluster
  - Programmable integrated logic for real-time control
- A chain of filters for parallel-beam tomography has been developed
  - Throughputs of up to 500 MB/s can be handled with a single PC
  - A clustered solution is under development