
High-Performance Virtual Machine Based Fault Tolerance COLO

Overview

This solution applies to OpenStack Nova. The open source code developed in the ORBIT project integrates a fault tolerance feature based on the COLO (COarse Grain LOck Stepping) technique, together with a hybrid approach (COLO + checkpointing), at several layers of the cloud stack: QEMU, libvirt and OpenStack.

This page provides all the information needed to install and use this project outcome.

 

Page Structure:

  • Background and Goals
  • Demo
  • Manuals and Source Code

Background and Goals

Virtual machine (VM) replication is a well-known technique for providing application-agnostic, software-implemented hardware fault tolerance ("non-stop service"). COLO is a high availability solution in which a primary VM (PVM) and a secondary VM (SVM) run in parallel. Both receive the same requests from the client and generate responses in parallel. If the response packets from the PVM and SVM are identical, they are released immediately. Otherwise, an on-demand VM checkpoint is performed.

Kvm-colo

DEMO: Fault Tolerance using COLO

  • Project Leaders: Red Hat

Non-distributed applications are vulnerable to the failure of their host. While it is possible to buy expensive, higher-reliability hardware, this is still no guarantee against environmental issues. The software developed as part of the ORBIT project allows an individual virtual machine to continue running uninterrupted, even if the host it is running on fails.

Traditional Checkpointing

Checkpointing is a technique for adding fault tolerance to computing systems. It consists of saving a snapshot of the application's state so that the application can restart from that point after a failure. This is particularly important for long-running applications executed on failure-prone computing systems. In traditional checkpointing, however, the primary machine has to pause periodically to take snapshots of the current system state. This slows the system down because of packet delays and increased CPU usage.

COLO Approach

In the COLO approach both VMs run and both generate packets. The packets from the two VMs are compared on the primary side and released immediately when they match; a checkpoint is taken only when they differ. This potentially results in fewer checkpoints.
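The comparison step can be illustrated with a small toy model. This is only a sketch of the idea and not the QEMU implementation: the class name `ColoComparator` and its fields are hypothetical, and it assumes that the primary's packet is released once the on-demand checkpoint has resynchronised the secondary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColoComparator:
    """Toy model of COLO output comparison (hypothetical, for illustration)."""

    released: List[bytes] = field(default_factory=list)
    checkpoints: int = 0

    def handle(self, primary_pkt: bytes, secondary_pkt: bytes) -> None:
        """Compare one pair of response packets from the PVM and the SVM."""
        if primary_pkt != secondary_pkt:
            # Outputs diverged: count an on-demand checkpoint that would
            # bring the secondary VM back in sync with the primary.
            self.checkpoints += 1
        # Assumption: the primary's packet is what the client receives,
        # immediately on a match or after the checkpoint on a mismatch.
        self.released.append(primary_pkt)


comparator = ColoComparator()
for pvm_out, svm_out in [(b"a", b"a"), (b"b", b"x"), (b"c", b"c")]:
    comparator.handle(pvm_out, svm_out)
```

In this model only the mismatching pair triggers a checkpoint, which is where the potential saving over fixed-interval checkpointing comes from.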

COLO and Checkpoint Hybrid Approach

This approach merges traditional checkpointing and COLO. The VM switches to traditional checkpoint mode after a run of short COLO checkpoints, and occasionally switches back to COLO to see whether the guest behavior has changed.
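A minimal sketch of such a mode switch is shown below. The class `HybridModeController`, its parameters and its default values are hypothetical; the page does not say how long a "short" checkpoint interval is, how many of them form a run, or how often COLO is retried, so all three are labelled assumptions here.

```python
class HybridModeController:
    """Toy mode switch between COLO and periodic checkpointing (hypothetical)."""

    def __init__(self, short_interval=0.1, max_short_run=3, probe_every=5):
        self.short_interval = short_interval  # assumed: seconds that count as "short"
        self.max_short_run = max_short_run    # assumed: run length that triggers the switch
        self.probe_every = probe_every        # assumed: periodic checkpoints before retrying COLO
        self.mode = "colo"
        self._short_run = 0
        self._periodic_count = 0

    def on_checkpoint(self, seconds_since_last: float) -> str:
        """Record one checkpoint and return the mode to use afterwards."""
        if self.mode == "colo":
            if seconds_since_last < self.short_interval:
                self._short_run += 1
            else:
                self._short_run = 0
            if self._short_run >= self.max_short_run:
                # Too many back-to-back COLO checkpoints: fall back.
                self.mode = "periodic"
                self._short_run = 0
                self._periodic_count = 0
        else:
            self._periodic_count += 1
            if self._periodic_count >= self.probe_every:
                # Occasionally go back to COLO to see if the guest changed.
                self.mode = "colo"
                self._periodic_count = 0
        return self.mode
```

For example, with `max_short_run=3` three consecutive checkpoints that each arrive less than `short_interval` seconds after the previous one move the controller to `"periodic"`, and after `probe_every` periodic checkpoints it returns to `"colo"`.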

DEMO: Open Stack Colo Integration

DEMO: Fault Tolerant OpenStack Integration

Manuals and Source Code

The code is open source and available in a GitHub repository, linked at the end of this page.

Prerequisites

COLO is currently not supported upstream; however, you can get patches for both QEMU and libvirt. These should be set up and installed before you try to use the OpenStack patches. QEMU should be installed using QEMU's normal installation process, adding --enable-colo --enable-quorum during the configure stage.

These patches are based on the Juno version of Nova. Make sure that your OpenStack setup runs, or is compatible with, the Juno release of Nova.

Installation

To be able to use COLO through OpenStack you need to install the COLO patches. If you are feeling brave, you could try to apply the patches yourself; otherwise, replace your current Nova source code with a clone of this repository.

There are also a few minor patches for python-novaclient and Horizon that integrate COLO/FT. See the links for their specific installation instructions.

Setting up

A few configuration changes are required to set up the COLO implementation properly. Edit your nova.conf with the following changes:

[DEFAULT]
# Change to the FT scheduler which schedules a list of pairs or sets of
# hosts rather than a list of single hosts
scheduler_driver = nova.scheduler.fault_tolerance_scheduler.FaultToleranceScheduler

# Disable VNC (more specifically -nographics in QEMU). This currently creates
# some mismatching of the PCI addresses in the primary and secondary guest
vnc_enabled = false

# Disable force config drive since it's not working with live migrations
# Disable it by removing: force_config_drive = always

[libvirt]
# *Optional* Change path of the block replication drives
# For example if you want to use a RAM-based filesystem
block_replication_path=/path
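Before restarting Nova it can be useful to check that the settings above are in place. The following is a minimal sketch using Python's standard `configparser`; the helper name `check_colo_config` and its messages are hypothetical, and it assumes that an unset `vnc_enabled` means VNC is enabled.

```python
import configparser

FT_SCHEDULER = "nova.scheduler.fault_tolerance_scheduler.FaultToleranceScheduler"


def check_colo_config(text: str) -> list:
    """Return a list of problems found in the given nova.conf contents."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    defaults = parser.defaults()  # options from the [DEFAULT] section
    problems = []
    if defaults.get("scheduler_driver") != FT_SCHEDULER:
        problems.append("scheduler_driver is not the FT scheduler")
    # Assumption: VNC counts as enabled when the option is missing.
    if defaults.get("vnc_enabled", "true").strip().lower() != "false":
        problems.append("vnc_enabled should be false")
    if "force_config_drive" in defaults:
        problems.append("force_config_drive should be removed")
    return problems


sample = """
[DEFAULT]
scheduler_driver = nova.scheduler.fault_tolerance_scheduler.FaultToleranceScheduler
vnc_enabled = false

[libvirt]
block_replication_path = /path
"""
print(check_colo_config(sample))
```

To check a real installation, read the file first, for example `check_colo_config(open("/etc/nova/nova.conf").read())`; that path is the usual location but is also an assumption about your setup.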

If you are modifying a running OpenStack installation, restart all of the Nova components once the FT patches and configuration have been added.

Usage

COLO is enabled for new instances by modifying the extra specs of the flavor they are created from.

Example using python-novaclient:

nova flavor-key 1 set ft:enabled=1

To recover after a failure has happened, you can use either the API or the nova client with the COLO patches.

API

POST http://[nova-api host]:8774/v2/[project ID]/servers/[instance ID]/action

#Header
Content-Type: application/json
X-Auth-Token: [token]

#Payload
{ "failover": null }
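The same request can be assembled with Python's standard library. This is a sketch only: the helper `build_failover_request` is hypothetical, the `failover` action exists only in a Nova patched with the FT code, and the host, project ID, instance ID and token are placeholders you must supply.

```python
import json
import urllib.request


def build_failover_request(nova_api_host, project_id, instance_id, token):
    """Build (but do not send) the failover POST request described above."""
    url = (
        f"http://{nova_api_host}:8774/v2/{project_id}"
        f"/servers/{instance_id}/action"
    )
    # json.dumps turns Python None into JSON null.
    body = json.dumps({"failover": None}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request to the Nova API.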

Nova client

nova failover [instance name/ID]

Disclaimers

At the time of writing, quorum is not supported in libvirt. Since COLO depends on the quorum functionality in QEMU, it is handled a bit differently from the rest of the modifications: quorum is added to the disk command line through libvirt's QEMU command-line XML rather than through the usual disk XML.

These patches are not reviewed and accepted by the OpenStack community.

GitHub

https://github.com/orbitfp7/nova/tree/fault-tolerance