

### SFO17-403: Optimizing the Design and Implementation of KVM/ARM

Christoffer Dall



#### Linaro Connect San Francisco 2017



#### "Efficient, isolated duplicate of the real machine"

– Popek and Goldberg [Formal requirements for virtualizable third generation architectures '74]





## Virtualization



Figure: native execution (OS kernel on hardware) compared with virtual machines (hypervisor on hardware).





# Hypervisor Design

#### Type 1 (Standalone)










#### Type 2 (Hosted)
















### **ARM Virtualization Extensions**

- EL0: User
- EL1: Kernel
- EL2: Hypervisor



### **ARM VE and Hypervisors**











## KVM/ARM












## ARMv8.1 VHE

- Virtualization Host Extensions
- Supports running unmodified OSes in EL2 without using EL1






### VHE: Backwards Compatible

- HCR\_EL2.E2H completely enables or disables VHE
- When disabled, completely backwards compatible with ARMv8.0
- Example: Xen disables VHE





### VHE: Expands Functionality of EL2

- Expanded EL2 functionality
- New registers: TTBR1\_EL2, CONTEXTIDR\_EL2
- New virtual EL2 timer





### VHE: Support Userspace in EL0

- TGE: Trap General Exceptions
- Routes all exceptions to EL2
- With VHE, setting TGE no longer disables the EL0 stage 1 MMU







### VHE: EL2&0 Translation Regime

- Same page table format as EL1
- Used in EL0 with TGE bit set
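
As a rough illustration of how the two control bits interact, the following C sketch models which stage 1 translation regime applies to code running in EL0. The bit positions (TGE is bit 27 and E2H is bit 34 of HCR\_EL2) follow the ARMv8-A architecture; the enum and function names are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* HCR_EL2 control bits (positions per the ARMv8-A architecture). */
#define HCR_TGE (UINT64_C(1) << 27) /* Trap General Exceptions */
#define HCR_E2H (UINT64_C(1) << 34) /* EL2 Host: enables VHE */

/* Hypothetical names, used only for this illustration. */
enum el0_regime {
    REGIME_EL1_0, /* EL1&0: EL0 runs under an EL1 kernel (e.g. a guest) */
    REGIME_EL2_0, /* EL2&0: EL0 runs under a VHE host kernel in EL2 */
};

/* Which stage 1 translation regime applies to code running in EL0? */
static enum el0_regime el0_translation_regime(uint64_t hcr_el2)
{
    bool e2h = (hcr_el2 & HCR_E2H) != 0;
    bool tge = (hcr_el2 & HCR_TGE) != 0;

    /* The EL2&0 regime is used in EL0 only when both bits are set. */
    return (e2h && tge) ? REGIME_EL2_0 : REGIME_EL1_0;
}
```

In this model, host userspace under a VHE kernel corresponds to both bits being set, while guest userspace corresponds to TGE being clear.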





### VHE: System Register Redirection

#### $HCR\_EL2.E2H == 0$












#### $HCR\_EL2.E2H == 1$

#### mrs x0, TCR\_EL1









## **VHE Register Redirection**

#### mrs x0, TCR\_EL12
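
The redirection can be sketched as a small software model. This is not kernel code: the struct, enum, and function below are hypothetical, and only TCR is modeled. It assumes the architectural behavior that, at EL2 with E2H set, the EL1 register name reaches the EL2 register and the \_EL12 name reaches the real EL1 register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of one CPU, for illustration only. */
struct cpu_model {
    bool     e2h;     /* models HCR_EL2.E2H */
    uint64_t tcr_el1; /* physical TCR_EL1 */
    uint64_t tcr_el2; /* physical TCR_EL2 */
};

/* Register names as written in an mrs instruction executed at EL2. */
enum sysreg_name { NAME_TCR_EL1, NAME_TCR_EL2, NAME_TCR_EL12 };

/*
 * Model of "mrs x0, <name>" executed at EL2.
 * Returns false if the access is undefined in this configuration.
 */
static bool mrs_at_el2(const struct cpu_model *cpu, enum sysreg_name name,
                       uint64_t *value)
{
    switch (name) {
    case NAME_TCR_EL1:
        /* With E2H set, the EL1 name is redirected to the EL2 register. */
        *value = cpu->e2h ? cpu->tcr_el2 : cpu->tcr_el1;
        return true;
    case NAME_TCR_EL2:
        *value = cpu->tcr_el2;
        return true;
    case NAME_TCR_EL12:
        /* The _EL12 alias exists only with E2H set; it reaches real EL1. */
        if (!cpu->e2h)
            return false;
        *value = cpu->tcr_el1;
        return true;
    }
    return false;
}
```

The point of the model is that an unmodified kernel, written against EL1 register names, transparently operates on EL2 state when it runs in EL2, while the hypervisor part uses the \_EL12 names to reach the guest's EL1 state.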







### More VHE Register Redirection

- Some registers change bit positions to be similar between EL1 and EL2
- Example: CNTHCTL\_EL2 changes layout to match CNTKCTL\_EL1 with extra bits







### Legacy KVM/ARM without VHE



Figure (labels only): EL2, EL1.





## KVM/ARM with VHE

Figure (labels only): Linux, Run VM, EL2.







# **Experimental Setup**

- AMD Seattle B0
- 64-bit ARMv8-A
- 2.0 GHz AMD A1100 CPU
- 8-way SMP
- 16 GB RAM
- 10 Gb Ethernet (passthrough)

\*Measurements obtained using Linux in EL2. See BKK16 talk.





## VHE Performance at First Glance

| CPU Clock Cycles | non-VHE | VHE\* |
|------------------|---------|-------|
| Hypercall        | 3,181   | 3,045 |

\*Measurements obtained using Linux in EL2. See BKK16 talk.





- Avoid saving/restoring EL1 register state









- Legacy KVM/ARM design enabled and disabled virtualization features on every transition
- Virtual/physical interrupts
- Stage 2 memory translation







- Leave virtualization features enabled
- Host EL2 never uses stage 2 translations and always has full hardware access.







- Don't context switch the timer on every exit from the VM
- Completely reworks the timer code
- 20 patches on list







- Reduce run loop work
- Do work in vcpu\_load and vcpu\_put instead
- Called when entering/exiting run-loop
- Called when preempted/scheduled
- **Requires VHE**
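
The effect of moving work out of the run loop can be sketched with a toy model. Everything below is hypothetical (the struct and the `model_` functions are not kernel APIs); it only counts how often guest EL1 state would be saved or restored if that work is done at load/put time rather than on every exit. The sketch assumes the host runs in EL2 and therefore does not need the EL1 registers between exits, which is why the approach depends on VHE.

```c
#include <stdbool.h>

/* Hypothetical per-VCPU bookkeeping, for illustration only. */
struct vcpu_model {
    bool guest_state_loaded; /* is guest EL1 state currently on the CPU? */
    unsigned int switches;   /* number of EL1 state save/restore operations */
};

/* Called when the VCPU thread is scheduled in or enters the run loop. */
static void model_vcpu_load(struct vcpu_model *vcpu)
{
    if (!vcpu->guest_state_loaded) {
        vcpu->switches++; /* restore guest EL1 state once */
        vcpu->guest_state_loaded = true;
    }
}

/* Called when the VCPU thread is preempted or leaves the run loop. */
static void model_vcpu_put(struct vcpu_model *vcpu)
{
    if (vcpu->guest_state_loaded) {
        vcpu->switches++; /* save guest EL1 state once */
        vcpu->guest_state_loaded = false;
    }
}

/* Run loop: handling an exit no longer touches EL1 state. */
static void model_run(struct vcpu_model *vcpu, unsigned int exits)
{
    model_vcpu_load(vcpu);
    for (unsigned int i = 0; i < exits; i++) {
        /* enter the guest, take an exit, handle it in the host kernel */
    }
    model_vcpu_put(vcpu);
}
```

In this model the number of save/restore operations stays at two per run, regardless of how many exits occur in between.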







- Rewrite the world switch code

      kvm_arch_vcpu_ioctl_run:
              ...
              while (1) {
                      ...
                      if (has_vhe())  /* static key */
                              ret = kvm_vcpu_vhe_run(vcpu);
                      else
                              ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
                      ...
              }
              ...





### Microbenchmark Results

| CPU Clock Cycles | non-VHE | VHE OPT\* | x86   |
|------------------|---------|-----------|-------|
| Hypercall        | 3,181   | 752       | 1,437 |
| I/O Kernel       | 3,992   | 1,604     | 2,565 |
| I/O User         | 6,665   | 7,630     | 6,732 |
| Virtual IPI      | 14,155  | 2,526     | 3,102 |

\*Measurements obtained using Linux in EL2. See BKK16 talk.
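
For reference, the relative change from non-VHE to optimized VHE can be worked out directly from the table, reading the cycle counts as whole numbers (for example, 3181 cycles for the non-VHE hypercall). The percentages below are derived here from the table and are rounded to one decimal place:

$$
\begin{aligned}
\text{Hypercall:} &\quad \frac{3181 - 752}{3181} \approx 76.4\% \text{ fewer cycles} \\
\text{I/O Kernel:} &\quad \frac{3992 - 1604}{3992} \approx 59.8\% \text{ fewer cycles} \\
\text{Virtual IPI:} &\quad \frac{14155 - 2526}{14155} \approx 82.2\% \text{ fewer cycles} \\
\text{I/O User:} &\quad \frac{7630 - 6665}{6665} \approx 14.5\% \text{ more cycles}
\end{aligned}
$$

I/O User is the one operation in the table that takes more cycles with the optimized VHE design than without VHE.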





# Application Workloads

| Application | Description         |
|-------------|---------------------|
| Kernbench   | Kernel compile      |
| Hackbench   | Scheduler stress    |
| Netperf     | Network performance |
| Apache      | Web server stress   |
| Memcached   | Key-Value store     |






## Conclusions

- Optimize and redesign KVM/ARM for VHE
- Reduce hypercall overhead by more than 75%
- Better cycle counts than x86 for key hypervisor operations
- Network benchmark overhead reduced by 50%
- Key-value store workload overhead reduced by more than 80%





### Upstream Status

- Timer patches on list
- Core optimization patches coming soon



