Zhiguang Wang, Chul Gwon, Tim Oates, Adam Iezzi
Abstract. As the use of cloud computing rises, controlling cost becomes increasingly important. Yet there is evidence that 30% to 45% of cloud spend is wasted (Weins 2017). Existing tools for cloud provisioning typically rely on highly trained human experts to specify what to monitor, the thresholds for triggering action, and the actions to take. In this paper, we explore the use of reinforcement learning (RL) to acquire policies to balance performance and spend, allowing humans to specify what they want as opposed to how to do it, minimizing the need for cloud expertise. Empirical results with tabular, deep, and dueling double deep Q-learning with the CloudSim (Calheiros et al. 2011) simulator show the utility of RL and the relative merits of the approaches. We also demonstrate effective policy transfer learning from an extremely simple simulator to CloudSim, with the next step being transfer from CloudSim to an Amazon Web Services physical environment.
Cloud computing has become an integral part of how businesses and other entities are run today, permeating our daily lives in ways that we take for granted. Streaming music and videos, e-commerce, and social networks primarily utilize resources based on the cloud. The ability to provision compute nodes, storage, and other IT resources using a pay-as-you-go model, rather than requiring significant up-front investment for infrastructure, has transformed the way that organizations operate. The potential drawback is that groups leveraging the cloud must also effectively provision their resources to optimize the tradeoffs between their costs and the performance required by their service level agreements (SLAs).
Organizations typically hire cloud experts to determine the optimal strategies for provisioning their cloud resources, but performing this task effectively requires in-depth understanding of not only cloud management but also the business domain. Sadly, the tools that users have to manage cloud spend are relatively meager. Consider Amazon's Auto Scaling service. Users create an Auto Scaling Group (ASG), which is a collection of instances described in a Launch Configuration, and then choose from a rather limited set of options for managing that infrastructure. For example, one simple option is to specify the desired capacity for the ASG, which results in machines being replaced in the group when they fail a periodic health check….
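The desired-capacity option described above can be illustrated with a minimal sketch. It is a toy model, not the Amazon Auto Scaling API: the names `Instance`, `ScalingGroup`, and `reconcile` are hypothetical, and the health check is reduced to a boolean flag that the caller sets by hand.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud instance."""
    instance_id: str
    healthy: bool = True  # assumed result of a periodic health check


@dataclass
class ScalingGroup:
    """Toy model of a group that is held at a fixed desired capacity."""
    desired_capacity: int
    instances: List[Instance] = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def reconcile(self) -> int:
        """Drop unhealthy instances, then launch replacements.

        Returns the number of instances launched in this pass.
        """
        self.instances = [i for i in self.instances if i.healthy]
        launched = 0
        while len(self.instances) < self.desired_capacity:
            self.instances.append(Instance(f"i-{next(self._ids):04d}"))
            launched += 1
        return launched


group = ScalingGroup(desired_capacity=3)
initial = group.reconcile()          # fills the empty group
group.instances[1].healthy = False   # simulate a failed health check
replaced = group.reconcile()         # replaces the failed instance
```

Note that a rule of this kind only holds the group at a fixed size. It does not weigh cost against performance, which is the gap the RL policies explored in this paper are meant to address.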
Complete technical paper available as a PDF.