ALGORITHMS FOR DATA MIGRATION

Diploma

ABSTRACT

This thesis is concerned with the problem related to data storage and management. A large storage server consists of several hundreds of disks. To balance the load across disks, the system computes data layouts that are typically adjusted according to the workload. As workloads change over time, the system recomputes the data layout, and rearranges the data items according to the new layout. We identify the problem of computing an efficient data migration plan that converts an initial layout to a target layout.

We define the data migration problem as follows: for each item, there are a set of disks that have the item (sources) and a set of disks that want to receive the item (destinations). We want to migrate the data items from the sources to destinations. The crucial constraint is that each disk can participate in only one transfer at a time. The most common objective has been to minimize the makespan, which is the time when we finish all the migrations. The problem is NP-hard, and we develop polynomial time algorithms with constant factor approximation guarantees and several other heuristic algorithms. We present the performance evaluation of the different methods through an experimental study.

We also consider the data migration problem to minimize the sum of completion times over all migration jobs or storage devices. Minimizing the sum of completion times of jobs is one of the most common objectives in scheduling literature. On the other hand, since a storage device may run inefficiently while the device is involved in migrations, another interesting objective is to minimize the sum of completion times over all storage devices. We present hardness results and constant factor approximation algorithms for these objectives.

In addition, we consider the case when we have a heterogeneous collection of machines. We assume that heterogeneity is modeled by a non-uniform speed of the sending machine. For the basic problem of multicasting and broadcasting in the model, we show that Fastest Node First scheme gives a approximation ratio of 1.5 for minimizing the makespan. We also prove that there is a polynomial time approximation scheme.


Introduction

Data Migration Problem

A large storage server consists of several hundreds of disks, connected using a dedicated network, called a Storage Area Network. Those disks may be used to store multimedia files in video-on-demand servers, or search indexes in search engine clusters, for example. Disks typically have constraints on storage as well as the number of clients that can access data from a single disk simultaneously. For example, suppose that the system consists of disks which have 36 GB of storage capacity and can support a transfer rate of 20 MB/s (load capacity). When we consider MPEG-2 movies of 100 mins each with bandwidth requirement of 4 Mbps, the disk can store 12 movies and support 40 streams.

The ability to match resources with demand through layout affects the performance of a storage system. When we deal with bandwidth intensive applications, e.g., video-on-demand, the demand for popular items can be very high and therefore, the system needs to replicate the data items on several different disks.