A Data Locality and Skew Aware Task Scheduler for MapReduce in Cloud Computing
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Forum Proceedings, Qatar Foundation Annual Research Forum Volume 2011 Issue 1, November 2011, Volume 2011, CSP21
Abstract
Inspired by the success and increasing prevalence of MapReduce, this work proposes a novel MapReduce task scheduler. MapReduce is one of the most successful realizations of large-scale, data-intensive cloud computing platforms. Compared to traditional programming models, MapReduce automatically and efficiently parallelizes computation by running multiple Map and/or Reduce tasks over distributed data across multiple machines. Hadoop, an open source implementation of MapReduce, schedules Map tasks in the vicinity of their input splits to reduce network traffic. However, when Hadoop schedules Reduce tasks, it neither exploits data locality nor addresses the data partitioning skew inherent in many MapReduce applications. Consequently, MapReduce suffers a performance penalty and network congestion, as observed in our experimental results.
Some recent work has addressed data locality in Reduce task scheduling. For instance, one study suggests a locality-aware mechanism that inspects Map inputs and predicts the reducers that will consume them. The input splits are subsequently assigned to Map tasks near the future reducers. While such a scheme addresses the problem, it mainly targets public-resource grids and does not fully substantiate the accuracy of the suggested prediction process. In this study, we propose the Locality-Aware Skew-Aware Reduce Task Scheduler (LASAR), a practical strategy for improving MapReduce performance in clouds. LASAR attempts to schedule each reducer at its center-of-gravity node. It controllably avoids scheduling skew, a situation where some nodes receive more reducers than others, and promotes effective pseudo-asynchronous Map and Reduce phases, resulting in earlier completion of submitted jobs, diminished network traffic, and better cluster utilization.
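The abstract does not spell out how a center-of-gravity node is computed, so the following is only a minimal illustrative sketch, not LASAR's actual implementation. It assumes that a reducer's center of gravity is the node minimizing the total network cost (bytes of its partition on each node times the network distance to that node), and that scheduling skew is controlled by a simple per-node cap. The names `transfer_cost`, `schedule_reducers`, and `max_per_node` are hypothetical.

```python
from typing import Dict


def transfer_cost(candidate: str,
                  partition_bytes: Dict[str, int],
                  distance: Dict[str, Dict[str, int]]) -> int:
    """Total cost of pulling one reducer's partition to `candidate`.

    Assumption: cost = sum over source nodes of (bytes held there) x
    (network distance from that node to the candidate).
    """
    return sum(size * distance[src][candidate]
               for src, size in partition_bytes.items())


def schedule_reducers(partitions: Dict[str, Dict[str, int]],
                      distance: Dict[str, Dict[str, int]],
                      max_per_node: int) -> Dict[str, str]:
    """Place each reducer on its cheapest node that still has capacity.

    `partitions[r][n]` is the number of bytes of reducer r's input stored
    on node n. `max_per_node` is a hypothetical knob that bounds
    scheduling skew (too many reducers landing on one node).
    """
    nodes = sorted(distance)
    load = {n: 0 for n in nodes}
    placement: Dict[str, str] = {}

    # Place the largest partitions first, since misplacing them costs most.
    order = sorted(partitions, key=lambda r: (-sum(partitions[r].values()), r))
    for reducer in order:
        free = [n for n in nodes if load[n] < max_per_node]
        if not free:
            raise ValueError("not enough reducer capacity in the cluster")
        # Center of gravity among nodes with capacity; ties break by name.
        best = min(free, key=lambda n: (transfer_cost(n, partitions[reducer], distance), n))
        placement[reducer] = best
        load[best] += 1
    return placement


# Toy cluster: A and B share a rack (distance 2); C is in another rack (distance 4).
DISTANCE = {
    "A": {"A": 0, "B": 2, "C": 4},
    "B": {"A": 2, "B": 0, "C": 4},
    "C": {"A": 4, "B": 4, "C": 0},
}
PARTITIONS = {
    "r0": {"A": 100, "B": 10, "C": 10},
    "r1": {"A": 80, "B": 20},
    "r2": {"C": 50},
}

if __name__ == "__main__":
    print(schedule_reducers(PARTITIONS, DISTANCE, max_per_node=1))
    print(schedule_reducers(PARTITIONS, DISTANCE, max_per_node=2))
```

In this toy setup, both `r0` and `r1` have node A as their cheapest location. With a cap of one reducer per node, `r1` falls back to its next-cheapest node, B, which shows the trade-off between locality and scheduling skew that the paragraph above describes.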
We implemented LASAR in Hadoop-0.20.2 and conducted extensive experiments to evaluate its potential. We found that it outperforms current Hadoop by 11%, and by up to 26% for the utilized benchmarks. We believe LASAR is applicable to several cloud computing environments and multiple essential applications, including but not limited to shared environments and scientific applications. In fact, a large body of work has observed partitioning skew in many critical scientific applications. LASAR paves the way for these applications, and others, to be effectively ported to various clouds.