Posted: February 21st, 2023
Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen†, Eric Jul†, Christian Limpach, Ian Pratt, Andrew Warfield
University of Cambridge Computer Laboratory, 15 JJ Thomson Avenue, Cambridge, UK (firstname.lastname@cl.cam.ac.uk)
† Department of Computer Science, University of Copenhagen, Denmark ({jacobg,eric}@diku.dk)
Abstract
Migrating operating system instances across distinct physical hosts is a useful tool for administrators of data centers and clusters: It allows a clean separation between hardware and software, and facilitates fault management, load balancing, and low-level system maintenance.
By carrying out the majority of migration while OSes continue to run, we achieve impressive performance with minimal service downtimes; we demonstrate the migration of entire OS instances on a commodity cluster, recording service downtimes as low as 60ms. We show that our performance is sufficient to make live migration a practical tool even for servers running interactive loads.
In this paper we consider the design options for migrating OSes running services with liveness constraints, focusing on data center and cluster environments. We introduce and analyze the concept of writable working set, and present the design, implementation and evaluation of high-performance OS migration built on top of the Xen VMM.
1 Introduction
Operating system virtualization has attracted considerable interest in recent years, particularly from the data center and cluster computing communities. It has previously been shown [1] that paravirtualization allows many OS instances to run concurrently on a single physical machine with high performance, providing better use of physical resources and isolating individual OS instances.
In this paper we explore a further benefit allowed by virtualization: that of live OS migration. Migrating an entire OS and all of its applications as one unit allows us to avoid many of the difficulties faced by process-level migration approaches. In particular the narrow interface between a virtualized OS and the virtual machine monitor (VMM) makes it easy to avoid the problem of ‘residual dependencies’ [2] in which the original host machine must remain available and network-accessible in order to service certain system calls or even memory accesses on behalf of migrated processes. With virtual machine migration, on the other hand, the original host may be decommissioned once migration has completed. This is particularly valuable when migration is occurring in order to allow maintenance of the original host.
Secondly, migrating at the level of an entire virtual machine means that in-memory state can be transferred in a consistent and (as will be shown) efficient fashion. This applies to kernel-internal state (e.g. the TCP control block for a currently active connection) as well as application-level state, even when this is shared between multiple cooperating processes. In practical terms, for example, this means that we can migrate an on-line game server or streaming media server without requiring clients to reconnect: something not possible with approaches which use application-level restart and layer 7 redirection.
Thirdly, live migration of virtual machines allows a separation of concerns between the users and operator of a data center or cluster. Users have ‘carte blanche’ regarding the software and services they run within their virtual machine, and need not provide the operator with any OS-level access at all (e.g. a root login to quiesce processes or I/O prior to migration). Similarly the operator need not be concerned with the details of what is occurring within the virtual machine; instead they can simply migrate the entire operating system and its attendant processes as a single unit.
Overall, live OS migration is an extremely powerful tool for cluster administrators, allowing separation of hardware and software considerations, and consolidating clustered hardware into a single coherent management domain. If a physical machine needs to be removed from service, an administrator may migrate OS instances, including the applications that they are running, to alternative machine(s), freeing the original machine for maintenance. Similarly, OS instances may be rearranged across machines in a cluster to relieve load on congested hosts. In these situations the combination of virtualization and migration significantly improves manageability.
NSDI ’05: 2nd Symposium on Networked Systems Design & ImplementationUSENIX Association 273
We have implemented high-performance migration support for Xen [1], a freely available open source VMM for commodity hardware. Our design and implementation address the issues and tradeoffs involved in live local-area migration. Firstly, as we are targeting the migration of active OSes hosting live services, it is critically important to minimize the downtime during which services are entirely unavailable. Secondly, we must consider the total migration time, during which state on both machines is synchronized and which hence may affect reliability. Furthermore we must ensure that migration does not unnecessarily disrupt active services through resource contention (e.g., CPU, network bandwidth) with the migrating OS.
Our implementation addresses all of these concerns, allowing for example an OS running the SPECweb benchmark to migrate across two physical hosts with only 210ms unavailability, or an OS running a Quake 3 server to migrate with just 60ms downtime. Unlike application-level restart, we can maintain network connections and application state during this process, hence providing effectively seamless migration from a user’s point of view.
We achieve this by using a pre-copy approach in which pages of memory are iteratively copied from the source machine to the destination host, all without ever stopping the execution of the virtual machine being migrated. Page-level protection hardware is used to ensure a consistent snapshot is transferred, and a rate-adaptive algorithm is used to control the impact of migration traffic on running services. The final phase pauses the virtual machine, copies any remaining pages to the destination, and resumes execution there. We eschew a ‘pull’ approach which faults in missing pages across the network since this adds a residual dependency of arbitrarily long duration, as well as providing in general rather poor performance.
Our current implementation does not address migration across the wide area, nor does it include support for migrating local block devices, since neither of these is required for our target problem space. However we discuss ways in which such support can be provided in Section 7.
2 Related Work
The Collective project [3] has previously explored VM migration as a tool to provide mobility to users who work on different physical hosts at different times, citing as an example the transfer of an OS instance to a home computer while a user drives home from work. Their work aims to optimize for slow (e.g., ADSL) links and longer time spans, and so stops OS execution for the duration of the transfer, with a set of enhancements to reduce the transmitted image size. In contrast, our efforts are concerned with the migration of live, in-service OS instances on fast networks with only tens of milliseconds of downtime. Other projects that have explored migration over longer time spans by stopping and then transferring include Internet Suspend/Resume [4] and µDenali [5].
Zap [6] uses partial OS virtualization to allow the migration of process domains (pods), essentially process groups, using a modified Linux kernel. Their approach is to isolate all process-to-kernel interfaces, such as file handles and sockets, into a contained namespace that can be migrated. Their approach is considerably faster than results in the Collective work, largely due to the smaller units of migration. However, migration in their system is still on the order of seconds at best, and does not allow live migration; pods are entirely suspended, copied, and then resumed. Furthermore, they do not address the problem of maintaining open connections for existing services.
The live migration system presented here has considerable shared heritage with the previous work on NomadBIOS [7], a virtualization and migration system built on top of the L4 microkernel [8]. NomadBIOS uses pre-copy migration to achieve very short best-case migration downtimes, but makes no attempt at adapting to the writable working set behavior of the migrating OS.
VMware has recently added OS migration support, dubbed VMotion, to their VirtualCenter management software. As this is commercial software and strictly disallows the publication of third-party benchmarks, we are only able to infer its behavior through VMware’s own publications. These limitations make a thorough technical comparison impossible. However, based on the VirtualCenter User’s Manual [9], we believe their approach is generally similar to ours and would expect it to perform to a similar standard.
Process migration, a hot topic in systems research during the 1980s [10, 11, 12, 13, 14], has seen very little use for real-world applications. Milojicic et al. [2] give a thorough survey of possible reasons for this, including the problem of the residual dependencies that a migrated process retains on the machine from which it migrated. Examples of residual dependencies include open file descriptors, shared memory segments, and other local resources. These are undesirable because the original machine must remain available, and because they usually negatively impact the performance of migrated processes.
For example, Sprite [15] processes executing on foreign nodes require some system calls to be forwarded to the home node for execution, leading to at best reduced performance and at worst widespread failure if the home node is unavailable. Although various efforts were made to ameliorate performance issues, the underlying reliance on the availability of the home node could not be avoided. A similar fragility occurs with MOSIX [14] where a deputy process on the home node must remain available to support remote execution.
We believe the residual dependency problem cannot easily be solved in any process migration scheme – even modern mobile run-times such as Java and .NET suffer from problems when network partition or machine crash causes class loaders to fail. The migration of entire operating systems inherently involves fewer or zero such dependencies, making it more resilient and robust.
3 Design
At a high level we can consider a virtual machine to encapsulate access to a set of physical resources. Providing live migration of these VMs in a clustered server environment leads us to focus on the physical resources used in such environments: specifically on memory, network and disk.
This section summarizes the design decisions that we have made in our approach to live VM migration. We start by describing how memory and then device access is moved across a set of physical hosts and then go on to a high-level description of how a migration progresses.
3.1 Migrating Memory
Moving the contents of a VM’s memory from one physical host to another can be approached in any number of ways. However, when a VM is running a live service it is important that this transfer occurs in a manner that balances the requirements of minimizing both downtime and total migration time. The former is the period during which the service is unavailable due to there being no currently executing instance of the VM; this period will be directly visible to clients of the VM as service interruption. The latter is the duration between when migration is initiated and when the original VM may be finally discarded and, hence, the source host may potentially be taken down for maintenance, upgrade or repair.
It is easiest to consider the trade-offs between these requirements by generalizing memory transfer into three phases:
Push phase: The source VM continues running while certain pages are pushed across the network to the new destination. To ensure consistency, pages modified during this process must be re-sent.
Stop-and-copy phase: The source VM is stopped, pages are copied across to the destination VM, then the new VM is started.
Pull phase: The new VM executes and, if it accesses a page that has not yet been copied, this page is faulted in (“pulled”) across the network from the source VM.
Although one can imagine a scheme incorporating all three phases, most practical solutions select one or two of the three. For example, pure stop-and-copy [3, 4, 5] involves halting the original VM, copying all pages to the destination, and then starting the new VM. This has advantages in terms of simplicity but means that both downtime and total migration time are proportional to the amount of physical memory allocated to the VM. This can lead to an unacceptable outage if the VM is running a live service.
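To make this proportionality concrete, the sketch below estimates pure stop-and-copy downtime as memory size divided by link bandwidth. The function name and the example figures (a 512 MiB VM, a 1 Gbit/s link) are illustrative assumptions, not measurements from this paper, and the model ignores protocol overhead and VM start-up time.

```python
def stop_and_copy_downtime(memory_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Seconds the VM is unavailable if all memory is copied while it is stopped.

    Simplified model (an assumption): downtime = memory size / link bandwidth.
    """
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return memory_bytes * 8 / bandwidth_bits_per_s


# Hypothetical example: 512 MiB of VM memory over a 1 Gbit/s link.
downtime = stop_and_copy_downtime(512 * 1024**2, 1e9)
print(f"{downtime:.2f} s")  # prints "4.29 s"
```

Under this model, doubling the memory allocated to the VM doubles the outage, which is why a multi-second interruption is plausible for even modestly sized VMs.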
Another option is pure demand-migration [16] in which a short stop-and-copy phase transfers essential kernel data structures to the destination. The destination VM is then started, and other pages are transferred across the network on first use. This results in a much shorter downtime, but produces a much longer total migration time; and in practice, performance after migration is likely to be unacceptably degraded until a considerable set of pages have been faulted across. Until this time the VM will fault on a high proportion of its memory accesses, each of which initiates a synchronous transfer across the network.
The approach taken in this paper, pre-copy [11] migration, balances these concerns by combining a bounded iterative push phase with a typically very short stop-and-copy phase. By ‘iterative’ we mean that pre-copying occurs in rounds, in which the pages to be transferred during round n are those that are modified during round n − 1 (all pages are transferred in the first round). Every VM will have some (hopefully small) set of pages that it updates very frequently and which are therefore poor candidates for pre-copy migration. Hence we bound the number of rounds of pre-copying, based on our analysis of the writable working set (WWS) behavior of typical server workloads, which we present in Section 4.
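The round structure can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the implementation described in this paper: it assumes a constant page-dirtying rate and a fixed link speed, and the names `simulate_pre_copy`, `max_rounds` and `stop_threshold_pages`, along with their default values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreCopyResult:
    rounds: int          # number of pre-copy (push) rounds performed
    total_time_s: float  # pre-copy time plus the final stop-and-copy
    downtime_s: float    # duration of the final stop-and-copy phase


def simulate_pre_copy(total_pages: int, dirty_pages_per_s: float,
                      link_pages_per_s: float, max_rounds: int = 30,
                      stop_threshold_pages: int = 50) -> PreCopyResult:
    """Toy model of bounded iterative pre-copy (constant dirty rate assumed)."""
    to_send = total_pages  # round 1 transfers every page
    elapsed = 0.0
    rounds = 0
    while rounds < max_rounds and to_send > stop_threshold_pages:
        round_time = to_send / link_pages_per_s
        elapsed += round_time
        rounds += 1
        # Pages dirtied while this round was in flight must be re-sent
        # in the next round; at most every page can be dirty.
        to_send = min(total_pages, int(dirty_pages_per_s * round_time))
    downtime = to_send / link_pages_per_s  # VM is paused for this copy
    return PreCopyResult(rounds, elapsed + downtime, downtime)


# Slowly dirtying VM: each round shrinks, so the final pause is tiny.
calm = simulate_pre_copy(100_000, dirty_pages_per_s=1_000, link_pages_per_s=25_000)
# VM dirtying faster than the link can send: rounds never shrink.
busy = simulate_pre_copy(100_000, dirty_pages_per_s=30_000,
                         link_pages_per_s=25_000, max_rounds=5)
print(calm.rounds, busy.rounds, busy.downtime_s)
```

In the second case the set of pages to resend never shrinks, so without a bound on the number of rounds the push phase would never finish. This is the situation that the bound on pre-copy rounds guards against.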
Finally, a crucial additional concern for live migration is the impact on active services. For instance, iteratively scanning and sending a VM’s memory image between two hosts in a cluster could easily consume the entire bandwidth available between them and hence starve the active services of resources. This service degradation will occur to some extent during any live migration scheme. We address this issue by carefully controlling the network and CPU resources used by the migration process, thereby ensuring that it does not interfere excessively with active traffic or processing.
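The control mechanism is not spelled out at this point in the text, so the following is only a sketch of one way migration traffic might be throttled. Both functions, their parameters and the policy of raising the limit by a fixed increment above the observed dirtying rate are assumptions made for illustration.

```python
def pacing_delay(bytes_sent: int, elapsed_s: float, limit_bytes_per_s: float) -> float:
    """Seconds to wait so that average throughput stays at or below the limit."""
    earliest_finish = bytes_sent / limit_bytes_per_s
    return max(0.0, earliest_finish - elapsed_s)


def next_round_limit(dirtied_bytes: float, round_duration_s: float,
                     increment: float, lo: float, hi: float) -> float:
    """Hypothetical adaptive policy: aim slightly above the rate at which the
    VM dirtied memory in the previous round, clamped to [lo, hi] bytes/s."""
    dirty_rate = dirtied_bytes / round_duration_s
    return min(hi, max(lo, dirty_rate + increment))


# A sender that pushed 1 MB in 0.5 s under a 1 MB/s cap should pause 0.5 s.
print(pacing_delay(1_000_000, 0.5, 1_000_000))
# 10 MB dirtied over 2 s gives 5 MB/s; adding 1 MB/s headroom yields 6 MB/s.
print(next_round_limit(10e6, 2.0, increment=1e6, lo=2e6, hi=50e6))
```

Keeping the limit only slightly above the dirtying rate lets each round make progress while leaving most of the link to the active services. The upper clamp caps how much bandwidth migration can ever take.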
3.2 Local Resources
A key challenge in managing the migration of OS instances is what to do about resources that are associated with the physical machine that they are migrating away from. While memory can be copied directly to the new host, connections to local devices such as disks and network interfaces demand additional consideration. The two key problems that we have encountered in this space concern what to do with network resources and local storage.
For network resources, we want a migrated OS to maintain all open network connections without relying on forwarding mechanisms on the original host (which may be shut down following migration), or on support from mobility or redirection mechanisms that are not already present (as in [6]). A migrating VM will include all protocol state (e.g. TCP PCBs), and will carry its IP address with it.
To address these requirements we observed that in a cluster environment, the network interfaces of the source and destination machines typically exist on a single switched LAN. Our solution for managing migration with respect to network in this environment is to generate an unsolicited ARP reply from the migrated host, advertising that the IP has moved to a new location. This will reconfigure peers to send packets to the new physical address, and while a very small number of in-flight packets may be lost, the migrated domain will be able to continue using open connections with almost no observable interference.
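As an illustration of what such an unsolicited ARP reply looks like on the wire, the sketch below assembles the 42-byte Ethernet frame with Python's standard library. It is not taken from the Xen implementation: the function name and the example MAC and IP addresses are made up, and setting the target hardware address to broadcast is one common convention among several. Actually transmitting the frame would need a raw socket and administrative privileges, which are not shown.

```python
import socket
import struct


def build_unsolicited_arp_reply(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying an ARP reply that announces
    `ip` is reachable at `mac` (sender and target IP are both `ip`)."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    proto = socket.inet_aton(ip)  # 4-byte IPv4 address
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + hw + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += hw + proto         # sender hardware and protocol address
    arp += broadcast + proto  # target hardware and protocol address
    return ethernet + arp


# Hypothetical addresses for the migrated VM's new network attachment.
frame = build_unsolicited_arp_reply("00:16:3e:12:34:56", "192.168.1.50")
print(len(frame))  # prints 42
```

Peers that already hold an ARP cache entry for the IP address and process the frame will update that entry to the new MAC address, which is the redirection effect described above.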
Some routers are configured not to accept broadcast ARP replies (in order to prevent IP spoofing), so an unsolicited ARP may not work in all scenarios. If the operating system is aware of the migration, it can opt to send directed replies only to interfaces listed in its own ARP cache, to remove the need for a broadcast. Alternatively, on a switched network, the migrating OS can keep its original Ethernet MAC address, relying on the network switch to detect its move to a new port.
In the cluster, the migration of storage may be similarly addressed: Most modern data centers consolidate their storage requirements using a network-attached storage (NAS) device, in preference to using local disks in individual servers. NAS has many advantages in this environment, including simple centralised administration, widespread vendor s
SOLUTION
The problem addressed by the authors is the live migration of operating system instances across distinct physical hosts in data center and cluster environments. The authors argue that live migration of virtual machines is an important tool for administrators because it allows a clean separation between hardware and software, facilitates fault management, load balancing, and low-level system maintenance.
The approach designed by the authors is to carry out the majority of migration while OS instances continue to run, achieving impressive performance with minimal service downtimes. They demonstrate the migration of entire OS instances on a commodity cluster and record service downtimes as low as 60ms. They also introduce and analyze the concept of writable working set and present the design, implementation, and evaluation of high-performance OS migration built on top of the Xen VMM.
The strengths of this paper include its clear and concise writing style, thorough analysis and evaluation of the proposed scheme, and its practical implications for data center and cluster computing communities. The weaknesses include a relatively narrow focus on the Xen VMM and the lack of comparison with other virtualization platforms.
The authors evaluate the scheme on live workloads, including an OS running the SPECweb benchmark and one hosting a Quake 3 server, for which they report 210ms and 60ms of unavailability respectively. Their design goals cover service downtime, total migration time, and the impact of migration traffic on running services. They do not benchmark against VMware’s VMotion, because its license disallows the publication of third-party benchmarks; based on VMware’s documentation, they expect it to perform to a similar standard.
In conclusion, the authors present a well-designed and thoroughly evaluated scheme for live migration of virtual machines in data center and cluster environments. Their approach achieves impressive performance with minimal service downtimes, making live migration a practical tool even for servers running interactive loads. The paper has important practical implications for the management and maintenance of large-scale computing environments.