Published: 14 Apr 2010
This article is taken from the book Azure in Action. The authors provide the history and current state of data centers, followed by a discussion of how Azure is architected and how Microsoft runs the cloud operating system.
About the book
This is a sample chapter of the book Azure in Action. It has been published with permission of Manning.
Chris Hay and Brian H. Prince
Betting the house on the cloud
When Azure was first announced at the PDC in 2008, Microsoft wasn't a recognized player in the cloud industry. They were the underdog to the giants of Google and Amazon, who had been offering cloud services for years by that time. Building and deploying Azure was a big bet for Microsoft. It was a major change in culture, away from where Microsoft had been and toward where it needed to go in the future. Up until then, Microsoft had been a product company. They would design and build a product, burn it to CD, and sell it to customers. Over time the product would be enhanced, but it was always installed and operated in the client's environment. The trick was to build the right product at the right time, for the right market.
With the addition of Ray Ozzie to the Microsoft culture, there was a giant shift toward services. They weren't abandoning the selling of products, but they were expanding their expertise and portfolio to offer their products as services. Every product team at Microsoft was asked if what they were doing could be enhanced and extended with services. They wanted to do much more than just put Exchange in a data center and rent it to customers. This became a fundamental shift in how Microsoft developed code, how they shipped it, and how they marketed and sold to their customers.
This wasn't just some executive whim, thought up during some exclusive executive retreat at some resort we will never be able to afford to even drive by. It was based on the trends and patterns the leaders saw in the market, in the needs of their customers, and the continuing impact of the Internet on our world.
They saw that people needed more flexible use of their resources, more flexible than even the advances in virtualization were providing. Companies needed to respond easily to their product's sudden popularity as social networking spread the word. Modern businesses were screaming that they couldn't wait six months for an upgrade to their infrastructure. They needed it now.
Customers were also becoming more sensitive to the massive power consumption and heat generated by their data centers. Power and cooling bills were often the largest component of their total data center cost. Coupling this with a concern over global warming, customers were starting to talk about the greening of IT. They wanted to reduce the carbon footprint that these beasts produced, cutting not only the power and cooling but also the waste of lead, packing materials, and the massive piles of soda cans from the server administrators.
The data centers of yore
Microsoft is continually improving all of the important aspects of their data centers. They closely manage all of the costs of a data center, including power, cooling, staff, local laws, risks of disaster, availability of natural resources, and many other factors. While doing this, they have now designed their fourth generation of data centers. They didn't just show up at this party; they built on their deep expertise in building and running global data centers over the past few decades.
The first generation of data centers is still the most common in the world. Think of the special room with servers in it. It has racks, cable ladders, raised floors, cooling, UPSs, maybe a backup generator, and it is cooled to a temperature that could safely house raw beef. The focus is placed on making sure the servers are running, with absolutely no thought given to the operating costs of the data center. These data centers were optimized for the capital cost of building them, rarely thinking beyond the day the center opens.
Just as a side note, the collection of servers under your desk does not qualify as a generation 1 data center. Please be careful not to kick a cord loose while you do your work.
Generation 2 data centers take all of the knowledge learned through running generation 1 data centers, and apply a healthy dose of thinking about what happens on the second day of operation. Tactics are used to reduce the ongoing operational costs of the center through optimizing for sustainability and energy efficiency.
To meet these goals, Microsoft powers their Quincy, Washington data center with clean hydroelectric power. Their data center in San Antonio, Texas uses recycled civic gray water to cool the data center, reducing the stress on the water sources and infrastructure in the area.
The latest data centers of Azure
Even with the advances found in generation 2 data centers, a company cannot find the efficiencies and scale needed to combat rising facility costs, let alone meet the demands that the cloud would generate. The density of the data center would need to go up dramatically, and the costs of operations would have to plummet. The first generation 3 data center, located in Chicago, Illinois, went online on June 20th, 2009. It is considered a mega data center by Microsoft, which is a class designation that defines how large the data center is. The Chicago data center looks like a large parking structure, with parking spaces and ramps for tractor trailers. Servers are placed into containers, called Cblox, and parked in the structure. There is an additional section of the facility that looks more like a traditional data center. This area is for high maintenance workloads that can't run in Azure.
Cblox are made out of the shipping containers that you see on oceangoing vessels and on eighteen-wheelers on the highways. They are built very sturdily and follow a standard size and shape that is easy to move around. A single Cblox can hold anywhere from 1,800 to 2,500 servers. This is a massive increase in data center density, 10 times more than a traditional data center. The Chicago mega data center holds about 360,000 servers and is the sole primary consumer of a dedicated nuclear power plant core from Chicago Power & Light. How many of your data centers are nuclear powered?
Each parking spot in the data center is anchored by a refrigerator-sized device that acts as the primary interconnect to the rest of the data center. Microsoft defined a standard coupler that provides power, cooling, and network access to the container. Using this interconnect and the super dense containers, they can add massive amounts of capacity in a matter of hours. Just compare how long it would take your company to plan, order, deploy, and configure 2,500 servers. It would take at least a year, and a lot of people, not to mention how long it would take just to recycle all of the cardboard and extra parts you always seem to have after racking a server. Their goal with this strategy is to make it as cheap and easy as possible to expand capacity as demand increases.
The containers are built to Microsoft's specifications by a vendor and delivered on site, ready for burn-in tests and allocation into the Fabric. Each container includes networking gear, cooling infrastructure, servers, and racks, and is sealed against the weather.
Not only are the servers now packaged and deployed in containers, but the needed generators and cooling machinery are designed to be modular as well. In order to set up an edge data center, one that is located close to a large demand population, all they need is the power and network connections, and a level paved surface. The trucks with the power and cooling equipment first show up, and are deployed. Then the trucks with the computing containers back in and drop their trailers, leaving the containers on the wheels that were used to deliver them. The facility is protected by a secure wall and doorway with monitoring equipment. The use of laser fences is pure speculation and just a rumor, as far as we know. The perimeter security is important, because the edge data center doesn't have a roof! Yes, no roof! This reduces the construction time and the cooling costs. A roof isn't needed because the containers are completely sealed.
Microsoft opened a second mega data center, the first outside the United States, in Dublin, Ireland on July 1st, 2009. When Azure became commercially available in January 2010, the known Azure data center locations included Washington, Texas, Chicago, Ireland, Belgium, Singapore, and Hong Kong. While Microsoft will not list where all of their data centers are for security reasons, they say they have more than 10 and fewer than 100 data centers. Microsoft already has data centers all over the world to support their existing services, such as Virtual Earth, Bing Search, Xbox Live, and others. If we assume there are only 10, and each one is as big as Chicago, then Microsoft will need to manage roughly 3.5 million servers as part of Azure. That is a lot of work.
How many server huggers do you need?
Data centers are staffed with IT Pros to care for and feed the servers. Data centers need a lot of attention, ranging from hardware maintenance to backup, disaster recovery, and monitoring. Think of your company. How many people are allocated to manage your servers? Depending on how optimized your IT center is, the ratio of people to servers can be anywhere from 1:10 to 1:100 in the typical company. Even at the better end of that range, Microsoft would need 35,000 server managers. Hiring that many server administrators would be hard, considering Microsoft employs roughly 95,000 people already.
To address this demand, they designed Azure for as much automation as possible with a strategy called lights-out operations. This strategy seeks to centralize and automate as much of the work as possible by reducing complexity and variability. This results in a person-to-server ratio closer to 1:30,000 or higher.
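The staffing arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the server and ratio figures come from the text, while the function name and labels are made up for illustration. Note that 10 data centers of 360,000 servers each is 3.6 million, which the text rounds to about 3.5 million.

```python
# Back-of-the-envelope staffing estimate using the figures quoted in the text.
SERVERS_PER_MEGA_DC = 360_000   # Chicago figure from the article
ASSUMED_DATA_CENTERS = 10       # the article's deliberately low assumption


def admins_needed(servers: int, servers_per_admin: int) -> int:
    """Round up: a partial administrator is still a whole hire."""
    return -(-servers // servers_per_admin)


total_servers = SERVERS_PER_MEGA_DC * ASSUMED_DATA_CENTERS

# Labels are illustrative; the ratios are the ones mentioned in the text.
for label, ratio in [("unoptimized", 10), ("well optimized", 100), ("lights-out", 30_000)]:
    print(f"{label:>14} (1:{ratio:,}): {admins_needed(total_servers, ratio):,} admins")
```

At a 1:100 ratio this works out to tens of thousands of administrators, while at 1:30,000 the same fleet needs on the order of a hundred people.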
Microsoft is achieving this level of automation with mostly off the shelf software, their own. They are literally eating their own dog food. They are using System Center Operations Manager and all of the related products to oversee and automate the management of the underlying machines. They have built out custom automation scripts and profiles, much like any customer could do.
One key strategy in effectively managing a massive number of servers is to provision them with identical hardware. In traditional data centers where we have worked, each year brought the latest and greatest of server technology, resulting in a wide variety of technology and hardware diversity. We even gave each server a distinct name, such as Protoss, Patty, and Zelda. With this many servers, you can't name your servers; you have to number them. Not just by server, but by rack, room, and facility. Diversity is usually a great thing, but not when you are managing millions of boxes.
The hardware in each Azure server is optimized for power, cost, density, and management. The optimization process drives out which exact motherboard, chipset, and every other component needs to be in the server. This is truly bang for your buck in action. Then they stick to that server recipe for a specific lifecycle, only moving to a new bill of materials when there are significant advantages to doing so.
Data center: the next generation
Microsoft isn't done. They have already spent years planning the fourth generation of data centers. Much like the edge data center mentioned above, the whole data center is outside. While the containers make it easy to scale out the computing resources as demand increases, prior generations of data centers still had to have the complete data center shell built out and provisioned. This meant provisioning the cooling and power systems as if the data center were at maximum capacity from day 1. The older systems were too expensive to expand dynamically. The fourth generation data centers use an extendable spine that carries the infrastructure the computing containers need, so that both the infrastructure and the computing resources are easily scaled out (see figure 1). All of this is outside, in a field of grass, without a roof. They will be the only data centers in the world that need a grounds crew.
Figure 1: Generation 4 data centers are built on extensible spines. This makes it easy to not only add computational capacity, but the required infrastructure as well, including power and cooling.
Ok, we're impressed. They have a lot of servers, and some of them are even outside, and they have found a way to manage them all in an effective way. But how does the cloud really work?
Windows Azure, an operating system for the cloud
Think of the computer on your desk today. When you write code for that computer, you don't have to worry about which sound card it uses, which type of printer it is connected to, or which or how many monitors are used for the display. You don't worry, to a degree, about the CPU, about memory, or even how storage is provided (SSD, carrier pigeon, or hard disk drive). The operating system on that computer provides a layer of abstraction away from all of those gritty details, frees you up to focus on the application you need to write, and makes it easy to consume the resources you need. The desktop operating system protects you from the details of the hardware, allocates time on the CPU to the code that is running, makes sure that code is allowed to run, plays traffic cop by controlling shared access to resources, and generally holds everything together.
Now think of that enterprise application you want to deploy. You need DNS, networking, shared storage, load balancers, plenty of servers to handle load, a way to control access and permissions in the system, and plenty of other moving parts. Modern systems can get complicated. Dealing with all of that complexity by hand is like compiling your own video driver; it doesn't actually provide any value to the business.
Windows Azure, the cloud operating system, does the same work as the desktop operating system but on a grander scale and for distributed applications (see figure 2). The Fabric consists of the thousands of servers running and working together as a cohesive unit. In this section, we are going to drill a little deeper into the Fabric and see how it really works.
Figure 2: The Fabric Controller is like the kernel of your desktop operating system. It is responsible for many of the same tasks including resource sharing, code security, and management.
Windows Azure takes care of the whole platform so you can focus on your application. The term Fabric is used because of the similarity to a woven blanket. Each thread on its own is weak and can't do a lot. When they are woven together into a fabric, the whole blanket becomes strong and warm. Understanding the relationships between your code, Azure, and the Fabric Controller will help you get the most out of the platform.
Managing the assets of the cloud
In Azure, you will not need to worry about which hardware, which node, what underlying operating system, or even how the nodes are load balanced or clustered. Those are just gritty details best left to someone else. You just need to worry about your application and if it is operating effectively. How much time do you spend wrangling with these details for your on-premises projects? I bet it is at least 10 to 20% of the total project cost just in meetings alone. There are savings to be gained by abstracting away these issues.
In fact, Azure manages much more than just servers. In addition to servers, Azure manages routers, switches, IP addresses, DNS servers, load balancers, and dynamic VLANs. There is a lot of complexity in managing these assets in a static data center. It is doubly complex when you are managing multiple data centers that need to operate as one cohesive pool of resources, in a dynamic and real-time way.
If the Fabric is the operating system, then the Fabric Controller is the kernel.
The Fabric Controller
Operating systems have at their core a kernel. This kernel is responsible for being the traffic cop in the system. It manages the sharing of resources, schedules the use of precious assets (CPU time), allocates work streams as appropriate, and keeps an eye on security as well. The Fabric has a kernel called the Fabric Controller (FC). It handles all of the jobs a normal operating system's kernel would handle. Figure 3 shows the relationship between Azure, the Fabric, and the Fabric Controller.
Figure 3: The relationship between Azure, the Fabric, and the Fabric Controller. The Fabric is an abstract model of the massive number of servers in the Azure data center. The Fabric Controller manages everything.
The FC manages the running servers, deploys code, and makes sure that everyone is happy and has a seat at the table.
The Fabric Controller is just an application
The FC is an Azure application in and of itself, running multiple copies of itself for redundancy's sake. It is largely written in managed code. The FC holds the complete state of the Fabric internally, and this state is replicated in real time to all of the nodes that are part of the FC. If one of the primary nodes goes offline, the latest state information is available to the remaining nodes, which then elect a new primary node.
The FC manages a state machine for each service deployed, setting a goal state based on what the service model for the service requires. Everything the FC does is in an effort to reach this state and then to maintain that state once it is reached. We will go into detail of what the service model is in the next few pages, but for now, just think of it as a model that defines the needs and expectations that your service has.
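The goal-state idea can be illustrated with a tiny reconciliation loop. This is a minimal sketch, not how the FC is actually implemented: the class, its fields, and the action strings are hypothetical, and a real service model tracks far more than an instance count.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical, heavily simplified view of one deployed service."""
    goal_instances: int       # what the service model asks for
    running_instances: int    # what is actually healthy right now


def reconcile(state: ServiceState) -> list[str]:
    """Return the actions needed to move the current state toward the goal state."""
    actions = []
    while state.running_instances < state.goal_instances:
        state.running_instances += 1
        actions.append("start instance")
    while state.running_instances > state.goal_instances:
        state.running_instances -= 1
        actions.append("stop instance")
    return actions


service = ServiceState(goal_instances=3, running_instances=0)
print(reconcile(service))          # initial deployment: three starts
service.running_instances -= 1     # simulate a node failure
print(reconcile(service))          # the controller restores the goal state
print(reconcile(service))          # already at the goal: nothing to do
```

The same loop handles first deployment, recovery from a failure, and the steady state, which is the appeal of driving everything from a declared goal rather than from individual commands.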
How the Fabric Controller works: driver model
The FC follows a driver model, just like a normal operating system. Windows has no idea how to work with your specific video card. It does know how to speak to a video driver, which in turn knows how to work with a specific video card. The FC works with a series of drivers, one for each type of asset in the Fabric. This includes the machines, as well as the routers, switches, and load balancers.
While the variability of the environment is very low today, over time new types of each asset are likely to be introduced. While the goal is to reduce unnecessary diversity, there will be business needs to offer breadth in the platform. Perhaps you might get a software load balancer for free, but you will have to pay a little bit more per month to use a hardware load balancer. A customer would do this to meet a specific need, and the FC would have different drivers for each type of load balancer in the Fabric.
Figure 4: How the life cycle of an Azure service progresses to a running state. Each role on your team has a different set of responsibilities.
The FC uses these drivers to send each device the commands needed to reach the desired running state. This might be a command telling a switch to create a new VLAN, or a command to allocate a pool of virtual IP addresses. These commands help the FC move the state of the service towards the goal state. Figure 4 shows the progression of state, from the developer writing the code and defining the service model, to the FC allocating and managing the resources the service requires.
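The driver model described above can be sketched as an abstract interface with one implementation per device type. Everything here is hypothetical (the class names, the method, and the message strings are invented for illustration); the point is only that the controller-side code stays the same whichever driver is plugged in.

```python
from abc import ABC, abstractmethod


class LoadBalancerDriver(ABC):
    """Hypothetical driver interface: the controller only ever talks to this."""

    @abstractmethod
    def add_endpoint(self, address: str) -> str:
        ...


class SoftwareLoadBalancerDriver(LoadBalancerDriver):
    def add_endpoint(self, address: str) -> str:
        return f"software LB: route traffic to {address}"


class HardwareLoadBalancerDriver(LoadBalancerDriver):
    def add_endpoint(self, address: str) -> str:
        return f"hardware LB: program pool member {address}"


def expose_service(driver: LoadBalancerDriver, addresses: list[str]) -> list[str]:
    """Controller-side code: identical regardless of the device behind the driver."""
    return [driver.add_endpoint(address) for address in addresses]


for driver in (SoftwareLoadBalancerDriver(), HardwareLoadBalancerDriver()):
    print(expose_service(driver, ["10.0.0.4", "10.0.0.5"]))
```

Adding a new kind of load balancer would mean writing one more driver class, with no change to `expose_service`.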
One of the key jobs of the FC is to allocate resources to services. It does this by analyzing the service model of the service, including the fault and update domains, and the availability of resources in the Fabric. Using a greedy resource allocation algorithm, it finds which nodes can support the needs of each instance in the model. Once it has reserved the capacity, it updates the FC data structures in one transaction. Once this is done, the goal state of each node has been changed, and the FC starts moving each node towards its goal state by deploying the proper images and bits, starting up services, and issuing other commands through the driver model to all of the resources needed for the change.
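To make the allocation step concrete, here is a minimal sketch of a greedy placement with an all-or-nothing commit. It is an illustration under simplifying assumptions, not the FC's actual algorithm: capacity is reduced to CPU cores, fault and update domains are ignored, and the function, node, and instance names are invented.

```python
def allocate(instances: dict[str, int], free_cores: dict[str, int]) -> dict[str, str]:
    """Greedily place each instance on a node with enough free cores.

    instances:  instance name -> cores required
    free_cores: node name -> cores currently free (updated only on success)
    Returns instance name -> node name, or raises if capacity runs out.
    """
    remaining = dict(free_cores)          # work on a copy; commit only on success
    placement = {}
    # Place the largest instances first, a common greedy heuristic.
    for name, cores in sorted(instances.items(), key=lambda item: -item[1]):
        # Pick the node with the most free capacity.
        node = max(remaining, key=remaining.get, default=None)
        if node is None or remaining[node] < cores:
            raise RuntimeError(f"no node can host {name} ({cores} cores)")
        remaining[node] -= cores
        placement[name] = node
    free_cores.update(remaining)          # single commit step, like one transaction
    return placement


nodes = {"node-1": 8, "node-2": 8, "node-3": 4}
print(allocate({"web-0": 2, "web-1": 2, "worker-0": 8}, nodes))
print(nodes)  # free capacity left after the reservation
```

Because the function works on a copy and updates the shared capacity table only at the end, a request that cannot be fully satisfied leaves the table untouched, mirroring the single-transaction update described above.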
The FC is also responsible for managing the health of all of the nodes in the Fabric as well as the health of the services running. If it detects a fault in a service, it will try to remediate that fault, perhaps by restarting the node or taking it offline and replacing it with a different node in the Fabric.
When a new container is added to the data center, the FC performs a series of burn-in tests to ensure that the hardware delivered is working properly. Part of this process results in the new resources being added into the inventory for the data center, making it available to be allocated by the FC.
If hardware is ever determined to be faulty, either during installation or during a fault, the hardware is flagged as unusable in the inventory and left alone until later. Once a container has enough failures, the remaining workloads are moved to different containers and then the whole container is taken offline for repair. Once the problems have been fixed, the whole container is retested and returned to service.
The service model and you
The driving force behind what the FC does is the service model that you define for your service (see figure 5). You define the service model in an indirect manner. When you are developing a service, you define the following:
- Configuration describing what the pieces of your service are
- How the pieces communicate
- Expectations you have about the availability of the service
Figure 5: The Service Model consists of several different pieces of information. This model helps Azure run your application correctly.
The service model is broken into two pieces of configuration and deployed with your service. Each piece focuses on a different aspect of the model.
Your solution in Visual Studio will contain these two pieces of configuration in different files, both found in the Azure Service project in your solution:
- Service definition file (.csdef)
- Service configuration file (.cscfg)
The service definition file defines what the roles in your service are and what their communication endpoints are. This would include public HTTP traffic for a website, or the endpoint details for a web service. You can also configure your service to use local storage (which is different from Azure storage) and declare any custom configuration elements for the service configuration file. The service definition cannot be changed at runtime. Any change would require a new deployment of your service. Your service is restricted to using only the network endpoints and resources that are defined in this model. You can think of this piece of the configuration as defining what the infrastructure of your service is, and how the parts fit together.
The service configuration file includes the entire configuration needed for the role instances in your service. Each role has its own dedicated part of the configuration. The contents of the configuration file can be changed at runtime, which removes the need to redeploy your application when some part of the role configuration changes. You can also access the configuration in code, in much the same way that you might read a web.config file in an ASP.NET application.
Adding a custom configuration element
In many applications we store connection strings, default settings, and secret passwords (please don't!) in the app.config or web.config. You will often do the same with an Azure application. We first need to declare the format of the new configuration setting in the .csdef file. We do this by adding a ConfigurationSettings node inside the role we want the configuration to belong to, which defines the schema of the .cscfg file for that role. This essentially strongly types the configuration file itself. If there is an error in the configuration file during a build, you will receive a compiler warning. This is a great feature because there is nothing worse than deploying code when there is a simple little problem in a configuration file.
Now that we have told Azure the new format of our configuration files, namely that we want a new setting called WelcomeBannerText, we can add that node to the service configuration file. We add the corresponding setting to the appropriate role node in the .cscfg file.
At runtime, we want to read in this configuration data and use it for some purpose. Remember that all configuration settings are stored as strings and must be cast to the appropriate type as needed. In this case, we want a string to assign to our text label control, so we can use it as is.
Having configuration-reading code scattered all over your application can get messy and hard to manage. Many times developers will consolidate their configuration access code into one class. This class's only job is to be a façade into the configuration system.
Centralizing file-reading code
It is a best practice to take all of your configuration-reading code, wherever it is sprinkled, and move it into a ConfigurationManager class of your own design. Many people use the term service instead of manager, but I think the term service is too overloaded and manager is just as clear. This centralizes all of the code that knows how to read the configuration in one place, making it easier to maintain. More importantly, it removes the complexity of reading the configuration from the relying code, which is the principle of separation of concerns. This also makes it easier to mock out the implementation of the ConfigurationManager class for testing purposes (see figure 6). Over time, if the APIs for accessing configuration change or the place where your configuration lives changes, you will have only one place to go to make the changes you need.
Figure 6: A well-designed ConfigurationManager class can centralize the busy work of working with the configuration system.
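The shape of such a façade is language neutral, so here is a small sketch in Python rather than in .NET. It is an illustration only: the injected reader function stands in for whatever API actually reads the settings, and the MaxItems and ShowBanner setting names are made up (only WelcomeBannerText comes from the text).

```python
from typing import Callable


class ConfigurationManager:
    """Facade over the configuration system; the only class that knows how
    settings are read. The reader function is injected so tests can fake it."""

    def __init__(self, read_setting: Callable[[str], str]):
        self._read_setting = read_setting

    def get_string(self, name: str) -> str:
        return self._read_setting(name)

    def get_int(self, name: str) -> int:
        # Settings arrive as strings, so convert at the edge.
        return int(self._read_setting(name))

    def get_bool(self, name: str) -> bool:
        return self._read_setting(name).strip().lower() in ("true", "1", "yes")


# In tests, a plain dictionary stands in for the real configuration store.
fake_settings = {"WelcomeBannerText": "Hello, cloud!", "MaxItems": "25", "ShowBanner": "true"}
config = ConfigurationManager(fake_settings.__getitem__)

print(config.get_string("WelcomeBannerText"))
print(config.get_int("MaxItems") + 1)
print(config.get_bool("ShowBanner"))
```

The calling code never sees where a value comes from or that it started life as a string, so swapping the configuration source means changing only the reader passed to the constructor.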
Reading configuration data in this manner may look very familiar to you. You have probably done this for your current applications, reading in the settings stored in a web.config or an app.config file. When migrating an existing application to Azure, you might be tempted to keep the configuration settings where they are. While keeping them in place reduces the amount of change to your code as you migrate it to Azure, it does come at a cost. Unfortunately, the configuration files that are part of your roles are frozen and are read-only at runtime. You cannot make changes to them once your package is deployed. If you want to change settings at runtime, you will need to store those settings in the .cscfg file. Then, when you want to make a change, you only have to upload a new .cscfg file or click Configure on the service management page in the portal.
The FC takes these configuration files and builds a sophisticated service model that it uses to manage your service. At this time, there are about three different core model templates that all other service models inherit from. Over time, Azure will expose more of the service model to developers, so that they can have more fine-grained control over the platform their service is running on.
The many sizes of roles
Each role defined in your service model is basically a template for a server you want deployed in the Fabric. Each role can have a different job, and a different configuration. Part of that configuration includes local storage and the number of instances of that role that should be deployed. How these roles connect and work together is part of why the service model exists.
Since each role might have different needs, there are a variety of virtual machine sizes that you can request in your model. Table 1 lists each VM size. Each step up in size basically doubles the resources of the step below it.
Table 1: The available sizes of the Azure virtual machines (columns: Dedicated CPU Cores, Local Disk Space)
Each size is basically a slice of how big a physical server is. This makes it easy to allocate resources and keeps the numbers round. Since each physical server has eight CPU cores, allocating an Extra Large VM to a role is like dedicating a whole physical machine to that instance. This gives you all the CPU, RAM, and disk available on that machine. Which size you want is defined in the ServiceDefinition.csdef file on a role by role basis. The default size, if you don't declare one, is small.
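The doubling rule can be worked through in a few lines. This sketch assumes four sizes in total; the text names only Small and Extra Large, so the two middle names and the count are assumptions, and only CPU cores are modeled.

```python
PHYSICAL_CORES = 8  # per the text: each physical server has eight CPU cores

# Assumption: four sizes. The text names only Small and Extra Large.
SIZES = ["Small", "Medium", "Large", "Extra Large"]


def cores_for(size: str) -> int:
    """Each step up doubles the cores of the step below it, starting from one."""
    return 2 ** SIZES.index(size)


for size in SIZES:
    cores = cores_for(size)
    share = cores / PHYSICAL_CORES
    print(f"{size:<12} {cores} core(s) = {share:.3f} of a physical server")
```

Under these assumptions the largest size comes out at eight cores, which matches the statement that an Extra Large VM takes a whole physical machine.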
If you are using Visual Studio 2010, you can define the role configuration in the properties screen of the role in the Azure project, as shown in Figure 7.
Figure 7: Configuring your role doesn’t have to be a gruesome XML affair. You can easily do it in Visual Studio 2010 when you view the properties sheet for the role you want to configure.
The service model is also used to define fault domains and update domains, which we'll look at next.
It's not my fault
Fault domains and update domains determine what portions of your service can be offline at a single time, but for different reasons. They are the way that you define your uptime requirements to the Fabric Controller and how you describe how your service updates will happen when you have new code to deploy.
Fault domains are used to make sure that a set of elements in your service is not tied to a single point of failure. Fault domains are based more on the physical structure of the data center than on your architecture. Your service would typically want to have three or more fault domains. If you had one fault domain, all of the parts of your service could possibly be running on one rack in the same container connected to the same switch. This would lead to a high likelihood of catastrophic failure for your service if there is any failure in that chain. If that rack fails, or the switch in use fails, then your service is completely offline. When your service is broken into several fault domains, the FC makes sure those fault domains do not share any dependent infrastructure, protecting your service against single points of failure. In general, a service is spread across three fault domains, meaning only about a third of it can become unavailable because of a single fault. In a failure scenario, the Fabric Controller will immediately try to deploy your roles to new nodes in the Fabric to make up for the failed nodes.
At this time, the Azure SDK and service model do not let you define your own number of fault domains; the platform is believed to default to three.
The second type of domain defined in the service model is the update domain. The concept of an update domain is similar to a fault domain. An update domain is the unit of update you have declared for your service. An update is rolled out across your service, one update domain at a time. Cloud services tend to be big and tend to always need to be available. The update domain allows a rolling update to be used to upgrade your service, without having to bring the entire service down. These domains are usually defined to be orthogonal to your fault domains. In this manner, if an update is being pushed out while there is a massive fault, you won't lose all of your resources, just a piece of them.
You can define the number of update domains for your service in your ServiceDefinition.csdef file as part of the ServiceDefinition tag at the top of the file.
If you do not define your own update domain setting, the service model will default to five update domains. Your role instances are assigned to update domains as they are started up, and the Fabric Controller tries to keep the domains balanced with regard to how many instances are in each domain.
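One simple way to picture balanced update domains is round-robin assignment followed by a walk through the domains one at a time. This is a sketch under that assumption; the text does not say which balancing scheme the Fabric Controller actually uses, and the instance count of 12 is arbitrary.

```python
DEFAULT_UPDATE_DOMAINS = 5  # the default stated in the text


def assign_update_domains(instance_count: int,
                          domains: int = DEFAULT_UPDATE_DOMAINS) -> list[list[int]]:
    """Round-robin assignment keeps domain sizes within one instance of each other."""
    groups = [[] for _ in range(domains)]
    for instance in range(instance_count):
        groups[instance % domains].append(instance)
    return groups


INSTANCE_COUNT = 12
groups = assign_update_domains(INSTANCE_COUNT)

# A rolling update takes one update domain offline at a time.
for domain, members in enumerate(groups):
    still_serving = INSTANCE_COUNT - len(members)
    print(f"updating domain {domain}: instances {members} offline, {still_serving} still serving")
```

With 12 instances in five domains, no step of the rolling update takes more than three instances out of service.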
A service model example
If we had a service running on Azure, we might need six role instances to handle the demand on our service but request nine instances instead. We do this because we want a high degree of tolerance in our architecture. As shown in figure 8, we would have three fault domains and three update domains defined. This way, if there is a fault, only a third of our nodes are affected. Also, only a third of the nodes will ever be updated at one time, which controls the number of nodes taken out of service for updates and reduces the risk of any update taking down the whole service.
Figure 8: Fault and update domains help increase fault tolerance in your cloud service. In this figure, we have three instances of each of three roles.
In this scenario, the broken switch might take down the first fault domain, but the other two fault domains would not be affected and would keep operating. The FC can manage these fault domains because of the detailed models it has for the Azure data center assets.
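The arithmetic behind this example can be checked with a small sketch. It assumes the nine instances are laid out so that every pairing of fault domain and update domain holds exactly one instance, which is one way to make the two kinds of domain orthogonal; the names and layout are illustrative, not taken from the Azure service model.

```python
from itertools import product

FAULT_DOMAINS = 3
UPDATE_DOMAINS = 3

# One instance per (fault domain, update domain) pair: nine instances in total.
placement = {
    f"instance_{n}": (fault_domain, update_domain)
    for n, (fault_domain, update_domain)
    in enumerate(product(range(FAULT_DOMAINS), range(UPDATE_DOMAINS)))
}


def available(placement, failed_fault_domain=None, updating_domain=None):
    """Return the instances that are neither in a failed fault domain nor being updated."""
    return sorted(
        name for name, (fault_domain, update_domain) in placement.items()
        if fault_domain != failed_fault_domain and update_domain != updating_domain
    )


after_fault = available(placement, failed_fault_domain=0)
fault_during_update = available(placement, failed_fault_domain=0, updating_domain=1)

print(len(after_fault), "instances left after losing one fault domain")
print(len(fault_during_update), "instances left if an update is also in progress")
```

Losing one fault domain leaves six instances, which is the demand we sized for. If the fault lands in the middle of an update, four instances keep serving: the service is degraded but not offline.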
The cloud is not about perfect computing; it is about deploying services and managing systems that are fault tolerant. You need to plan for the faults that will happen. The magic of cloud computing makes it easy to scale big enough that a few node failures don't really impact your service.
All of this talk about service models and an overlord FC is nice, but at the end of the day, the cloud is built from individual pieces of hardware. And there is a lot of hardware, and it all needs to be managed in a hands-off way. There are several approaches to applying updates to a running service. You can perform both manual and automated rolling upgrades, or you can perform a full static upgrade (also called a VIP swap).
Rolling out new code
No matter how great your code is, you will have to perform an upgrade at some point, if for no other reason than to deploy a new feature a user has requested. It is important that you have a plan for updating the application and a full understanding of the moving parts. There are different approaches for different upgrade scenarios.
There are two broad approaches to rolling out an upgrade: a static upgrade or a rolling upgrade. You should carefully plan your application architecture to avoid the need for a static upgrade, because it impacts the uptime of your service and can be more complicated to roll out, whereas a rolling upgrade keeps your service up and running the whole time. You should always consider performing the upgrade in the staging environment first to make sure the deployment goes well. Once a full battery of end-to-end and integration tests has passed, you can proceed with your plans for the production environment.
Another gotcha: if the number of endpoints for a role has changed, you will not be able to do either type of rolling upgrade and will be forced to use the VIP swap upgrade strategy.
A static upgrade is sometimes referred to as a forklift upgrade because you are touching everything all at once. You usually need to do a static upgrade when there is a significant change in the architecture and plumbing of your application. Perhaps there is a whole new architecture of how the services are structured and the database has been completely redesigned. In these cases, it can be hard to upgrade just one piece at a time because of interdependencies in the system. This type of upgrade is required if you are changing the service model in any way.
This approach is also called a VIP swap because the Fabric Controller is just swapping the virtual IP addresses that are assigned to your resources. When a swap is done, the old staging environment becomes your new production environment and your old production environment becomes your new staging environment (see figure 9). This can happen pretty fast, but your service will have downtime while it is happening, so you should plan on that. The one great advantage to this approach is that you can easily swap things back to the way they were if things aren't working out. Your upgrade plan should consider how long the new staging (aka old production) environment should stay around. You might want to keep it around for a few days until you know the upgrade has been successful. At that point, you can completely tear down the environment to save resources and money.
Figure 9: Performing a VIP swap upgrade is as easy as clicking the arrows. If things go horribly awry, you can always swap back to the way things were. It’s like rewind for your environment.
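Conceptually, the swap is just an exchange of which deployment each slot points at. The toy model below shows that exchange, along with the swap back that serves as the rewind. The class and deployment names are invented for the sketch and say nothing about how the Fabric Controller implements the operation internally.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    version: str


class HostedService:
    """Toy model of a service with a production slot and a staging slot."""

    def __init__(self, production, staging):
        self.slots = {"production": production, "staging": staging}

    def vip_swap(self):
        # Exchange what each slot points at; the deployments themselves are untouched.
        self.slots["production"], self.slots["staging"] = (
            self.slots["staging"],
            self.slots["production"],
        )


service = HostedService(
    production=Deployment("shop-2010-03", "1.0"),
    staging=Deployment("shop-2010-04", "2.0"),
)

service.vip_swap()
after_upgrade = service.slots["production"].version

service.vip_swap()  # things went wrong: swap back
after_rollback = service.slots["production"].version

print(after_upgrade, after_rollback)
```

Because nothing is redeployed during the swap, the old production deployment is still intact in the staging slot, which is why keeping it around for a few days gives you a cheap rollback path.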
You can perform a VIP swap by logging into the Azure portal and going to the summary screen for your service. At this point, you need to deploy your new version to the staging environment. Once everything is all set up and you are happy with it, you can click the circular button in the middle. The changeover will only take a few minutes.
You can use the service management API to perform the swap operation as well. This is one case where you really want to make sure you have named your deployments clearly, at least more clearly than I did in this example.
VIP swaps are nice, but some customers need more flexibility in the way they perform their rollouts.
If your roles are carrying state and you don't want to lose that state as you completely change to a new set of servers, then rolling upgrades are for you. Another key scenario is when you only want to upgrade the instances of a specific role instead of all of the roles. For example, you might want to deploy an updated version of the website, without impacting the processing of the shopping carts that is being performed by the backend worker roles. You can, of course, choose to upgrade all of the roles or just a specific role. Remember that when doing a rolling upgrade you cannot change the service model of the service you are upgrading. This means if you have changed the structure of the service configuration, the number of endpoints, or the number of roles, you will have to do a VIP swap instead.
When you perform an automatic rolling upgrade, the Fabric Controller drains traffic from the set of instances that are in the first update domain (they are numbered starting with zero) by removing them from the load balancer's configuration. Once the traffic is drained, the instances are stopped, the new code is deployed, and then they are restarted. Once they are back up and running, they are added back into the load balancer's list of machines to route traffic to. At this point, the Fabric Controller moves on to the next update domain in the list. It will proceed in this fashion until all of the update domains have been serviced. Each domain should only take a few minutes.
If your scenario requires that you control how the progression moves from one domain to the next, you can choose to do a manual rolling upgrade. When you choose this option, the Fabric Controller will stop after a domain and wait for your permission to move on to the next. This gives you a chance to check the status of the machines and the environment before moving forward with the rollout.
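The sequence for an automatic rolling upgrade can be sketched as a loop over update domains. This is a simplified simulation with invented names: the load balancer is modeled as a plain set, and stopping, deploying, and restarting an instance is collapsed into a single version change.

```python
def rolling_upgrade(instances, load_balancer, new_version):
    """Upgrade one update domain at a time.

    instances: dict of name -> {"update_domain": int, "version": str}
    load_balancer: set of instance names currently receiving traffic
    Returns (update_domain, instances_still_serving) for each step.
    """
    steps = []
    domains = sorted({info["update_domain"] for info in instances.values()})
    for domain in domains:
        batch = [name for name, info in instances.items()
                 if info["update_domain"] == domain]
        # Drain: take the whole domain out of the load balancer's rotation.
        for name in batch:
            load_balancer.discard(name)
        steps.append((domain, len(load_balancer)))
        # Stop, deploy the new code, and restart (collapsed into one step here).
        for name in batch:
            instances[name]["version"] = new_version
        # Put the upgraded instances back into rotation before moving on.
        for name in batch:
            load_balancer.add(name)
    return steps


instances = {f"web_{n}": {"update_domain": n % 3, "version": "1.0"}
             for n in range(6)}
load_balancer = set(instances)

steps = rolling_upgrade(instances, load_balancer, "2.0")
print(steps)  # [(0, 4), (1, 4), (2, 4)]
```

With six instances spread over three update domains, four instances keep serving traffic at every step, and the service never goes fully offline.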
To perform a rolling upgrade, you will need to go to the service detail screen on the Azure portal and click the upgrade button for the deployment you want to upgrade. I did this on my production deployment in Figure 10.
Figure 10: Performing a rolling upgrade is easy. Choose Upgrade on the service detail page and choose your options. You can upgrade all of the roles or just one role in an upgrade.
At the service detail screen, you can choose the automatic roll or the manual roll for the upgrade, and whether you are going to upgrade all the roles in the package or just one of them. As in a normal deployment, you will also need to provide a service package, configuration, and a deployment name.
If you choose a single role to upgrade, then only the instances for that role in each domain will be taken offline for upgrading. The other role instances will be left untouched.
You can perform a rolling upgrade by using the service management API as well. When using the management API, you will have to store the package in blob storage before starting the process. As with a VIP swap, you will need to POST a command to a specific URL. You will need to customize the URL to match the settings for the deployment you want to upgrade.
The body of the command will need to contain the following elements. It will need to be changed to supply the parameters that match your situation. This sample is to do a fully automatic upgrade on all the roles.
Performing a manual rolling update with the service management API is a little trickier, and requires several calls to the WalkUpgradeDomain method. The upgrades are performed asynchronously, so the first command will start the process. As it is being performed, you can check on the status by using Get Operation Status with the operation ID that was supplied to you when you started the operation.
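The start-then-poll pattern described here might look like the sketch below. The two callables stand in for the real WalkUpgradeDomain and Get Operation Status requests, which are not shown; their signatures, and the status strings "InProgress" and "Succeeded", are assumptions made for the sketch rather than a transcription of the management API.

```python
import time


def wait_for_operation(get_status, operation_id, poll_seconds=5.0,
                       max_polls=120, sleep=time.sleep):
    """Poll until the operation leaves the in-progress state, then return its status."""
    for _ in range(max_polls):
        status = get_status(operation_id)
        if status != "InProgress":
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"operation {operation_id} is still in progress")


def walk_upgrade_domains(walk_domain, get_status, domain_count,
                         approve=lambda domain: True, sleep=time.sleep):
    """Walk domains 0..domain_count-1, asking for approval before each one.

    Returns the number of domains that were upgraded.
    """
    for domain in range(domain_count):
        if not approve(domain):
            return domain
        operation_id = walk_domain(domain)
        status = wait_for_operation(get_status, operation_id, sleep=sleep)
        if status != "Succeeded":
            raise RuntimeError(f"domain {domain} finished with status {status}")
    return domain_count


# Fakes standing in for the real service calls (hypothetical behavior).
remaining_polls = {}


def fake_walk_domain(domain):
    operation_id = f"op-{domain}"
    remaining_polls[operation_id] = 2  # report InProgress twice, then Succeeded
    return operation_id


def fake_get_status(operation_id):
    if remaining_polls[operation_id] > 0:
        remaining_polls[operation_id] -= 1
        return "InProgress"
    return "Succeeded"


upgraded = walk_upgrade_domains(fake_walk_domain, fake_get_status, 3,
                                sleep=lambda seconds: None)
print(upgraded, "domains upgraded")
```

The approve callback is where a manual rollout would pause for an operator to inspect the environment. Returning False from it stops the walk before the next domain is touched.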
We have covered how to upgrade running instances, and we have talked about what the Fabric is. Now we will go one level deeper and explore the underlying environment.
The bare metal
No one outside of the Azure team truly knows the nature of the underlying servers and other hardware, and that is ok, since it is all abstracted away by the cloud OS. But we can still look at how our instances are provisioned and how automation is used to do this without hiring the entire population of southern Maine to manage it.
Each instance is really a virtual machine running Windows Server 2008 Enterprise Edition x64, on top of Hyper-V. Hyper-V is Microsoft's enterprise virtualization solution, and it is available to anyone. Hyper-V is based on a hypervisor, which manages and controls the virtual servers running on the physical server. One of the virtual servers is chosen to be the host OS. The host OS is a virtual server as well, but it has the additional responsibilities of managing the hypervisor and the underlying hardware.
Hyper-V has two features that help in maximizing the performance of the virtual servers, while reducing the overall cost of running those servers. Both features need to be supported by the physical CPU.
The first is core and socket parking. Hyper-V can monitor the utilization of each core and of each CPU as a whole (which sits in a socket on the motherboard). Hyper-V will move the processes around on the cores to consolidate the work onto as few cores as possible. Any cores not needed at that time will be put into a low energy state. They can come back online quickly if needed but will consume much less power while they wait. Hyper-V can do this at the socket level as well. If it notices that each CPU socket is only being used at 10%, for example, it can condense the workload to one socket and park the unused sockets, placing them in a low energy state. This helps data centers use less power and require less cooling. In Azure, you have exclusive access to your assigned CPU core. Hyper-V will not condense your core with someone else's. It will still, however, turn off cores that are not in use.
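To make the consolidation idea concrete, here is a deliberately simplified sketch that packs per-core loads onto as few cores as possible using a first-fit-decreasing heuristic. Hyper-V's actual scheduler is far more sophisticated and its algorithm is not described in this article; the function name, the 80% headroom cap, and the load figures are all assumptions for illustration.

```python
def park_cores(core_loads, capacity=0.8):
    """Pack workloads onto as few cores as possible (first-fit decreasing).

    core_loads: utilization of each core as a fraction between 0.0 and 1.0
    capacity: how full a core may get after consolidation
    Returns (active_cores, parked_cores).
    """
    packed = []  # utilization of each core that stays active
    for load in sorted((l for l in core_loads if l > 0), reverse=True):
        for index, used in enumerate(packed):
            if used + load <= capacity + 1e-9:
                packed[index] = used + load
                break
        else:
            packed.append(load)
    active = len(packed)
    return active, len(core_loads) - active


# Eight lightly loaded cores collapse onto one; the other seven can be parked.
light = park_cores([0.10, 0.10, 0.10, 0.10, 0.25, 0.05, 0.0, 0.0])

# Busier cores leave less room to consolidate.
busy = park_cores([0.6, 0.6, 0.5, 0.1])

print(light, busy)
```

Note that this models the general Hyper-V behavior described above. In Azure, where a core is dedicated to your instance, only the parking of idle cores applies, not the merging of workloads.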
A special blend of spices
The version of Hyper-V used by Azure is a special version that the team created by removing anything they didn't need. Their goal was to reduce the footprint of Hyper-V as much as possible to make it faster and easier to manage. Since they knew exactly the type of hardware and guest operating systems that would run on it, they could rip out a lot of code. For example, they removed support for 32-bit hosts and guest machines, support needed for other types of operating systems, and support for hardware they weren't using at all.
Not stopping there, they further tuned the hypervisor scheduler for better performance while working with cloud data center workloads. They wanted the scheduler to be more predictable in its use of resources and fairer to the different workloads running, since each would be running at the same priority level. They also enhanced it in certain ways to support a heavier I/O load on the VM Bus.
Creating instances on the fly
When a new server is ready to be used, it is booted. At this point, it is a naked server with a bare hard drive. We can see this process in Figure 11.
Figure 11: The structure of a physical server and virtual instance servers in Azure.
During boot up, the server locates a maintenance OS on the network using standard PXE protocols. PXE stands for Preboot Execution Environment, and it is a process for booting to an operating system image that can be found on the network. The server then downloads the image and boots to it (#1). This maintenance OS is based on Windows PE, a very thin OS that is used by many IT organizations for low-level troubleshooting and maintenance. The tools and protocols for this are available on any Windows server and are used by a lot of companies to easily distribute machine images and automate deployment.
The maintenance OS connects with the FC and acts as an agent for the FC to execute any local commands needed to prepare the disk and machine. The agent prepares the local disk and streams down a Windows Server 2008 Server Core image to the local disk (#2). This image is a Virtual Hard Drive (VHD), a common file format used to store the contents of hard drives for virtual machines. VHDs are basically large files representing the complete or partial hard drive for a virtual machine. The machine is then reconfigured to boot from this Core VHD. This image becomes the host OS that will manage the machine and interact with the hypervisor. The host OS is Windows Server 2008 Core because it has all but the most necessary modules removed from the operating system. This is a version you may be running in your own data center.
The Azure team worked with the Windows Server team to develop the technology needed to boot a machine natively from a VHD that is stored on the local hard drive. The Windows 7 team liked the feature so much they added it to their product as well. Being able to boot from a VHD is a key component of the Azure automation.
Once the machine has rebooted using the host OS image, the maintenance OS is removed and the FC can start allocating resources from the machine to services that need to be deployed. A base OS image is selected from the prepared image library that will meet the needs of the service that is being deployed (#3). This image (a VHD file) is streamed down to the physical disk. The core OS VHDs are marked as read only, allowing multiple service instances to share a single image. A differencing VHD is stacked on top of the read-only base OS VHD to allow for changes specific to that virtual server (#4). It is possible that different services will have different base OS images, based on the service model applied to that service.
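The relationship between a read-only base image and a differencing disk is a copy-on-write overlay, which can be sketched in a few lines. This models disks as dictionaries of numbered blocks and is only meant to show the idea; it is not the VHD file format, and the class and block contents are invented for the example.

```python
from types import MappingProxyType


class DifferencingDisk:
    """Copy-on-write overlay on top of a shared, read-only base image."""

    def __init__(self, base):
        self._base = base   # shared between virtual machines, never modified
        self._delta = {}    # only the blocks this virtual machine has written

    def read(self, block):
        # Prefer this machine's own copy of a block, else fall through to the base.
        return self._delta.get(block, self._base.get(block))

    def write(self, block, data):
        self._delta[block] = data

    def changed_blocks(self):
        return sorted(self._delta)


# MappingProxyType makes the base image read-only, like the base OS VHD.
base_image = MappingProxyType({0: "bootloader", 1: "kernel", 2: "default-config"})

vm_a = DifferencingDisk(base_image)
vm_b = DifferencingDisk(base_image)

vm_a.write(2, "config-for-a")

print(vm_a.read(2))  # config-for-a
print(vm_b.read(2))  # default-config
```

Each virtual server sees what looks like a private, writable disk, while the bulk of the image is stored once per physical machine. That is what lets a second or third instance start without downloading the base image again.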
On top of the base OS image is an application VHD that contains additional requirements for your service (#5). It is attached to the base OS image, the bits for your service are downloaded to the application VHD (#6), and then the stack is booted. As it starts up, it reports its health to the FC. The FC then enrolls it into the service group, configuring the VLAN assigned to your service, updating the load balancer, IP allocation, and DNS configuration. Once this is all completed, the new node is ready to service requests to your application.
Much of the image deployment can be made before the node is needed, cutting down on the time it can take to start a new instance and add it to your service.
Each server can contain several virtual machines. This allows for the optimal use of computing resources and the flexibility to move instances around as needed. As a second or third virtual server is added, it may use the base OS VHD that has already been downloaded (#7) or it can download a different base OS VHD based on its needs. This second machine then follows the same process of downloading the application VHD, booting up, and enrolling into the cloud.
All of these steps are coordinated by the FC and usually accomplished in a few minutes.
Image is everything
If the key to the automation of Azure is Hyper-V, then the base virtual machine images and their management are the cornerstone. Images are centrally created, also in an automated fashion, and stored in a library, ready to be deployed by the FC as needed.
A variety of images are managed, allowing for the smallest footprint each role might need. If a role doesn't need IIS, then there is an image that doesn't have IIS installed. This is an effort to shrink the size and runtime footprint of the image, but also to reduce any possible attack surfaces or patching opportunities.
All images are deployed using an xcopy deployment model. This model keeps deployment really simple. If they relied on complex scripts and tools, then they would never truly know what the state of each server would be and it would take a lot longer to deploy an instance. Again, diversity is the devil in this environment.
This same approach is used when deploying patches. When the operating system needs to be patched, they don't download and execute the patch locally as you might on your workstation at home. This would lead to too much risk in having irregular results on some of the machines.
Instead, the patch is applied to an image and the new image is stored in the library. The FC then plans a rollout of the upgrade, taking into account the layout of the cloud and the update domains defined by the various service models that are being managed.
The updated image is copied in the background to all of the servers used by the service. Once the files have been staged out to the local disk, which can take some time, each update domain group is restarted in turn. In this way, the FC knows exactly what is on the server. The new image is merely wired up to the existing service bits that have already been copied locally. The old image is kept local for a period of time as an escape hatch in case something goes wrong with the new image. If that happens, the server is reconfigured to use the old image and rebooted, according to the update domains in the service model.
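The stage, switch, and escape-hatch sequence can be sketched as follows. The class, the image names, and the health check are invented for the illustration, and rolling back every server that has already switched is one plausible reading of the process rather than a documented FC policy.

```python
class Server:
    def __init__(self, name, update_domain, image):
        self.name = name
        self.update_domain = update_domain
        self.active_image = image
        self.staged_image = None
        self.previous_image = None

    def stage(self, image):
        """Copy the new image to local disk without touching the running one."""
        self.staged_image = image

    def switch(self):
        """Reboot onto the staged image, keeping the old one as an escape hatch."""
        self.previous_image = self.active_image
        self.active_image = self.staged_image
        self.staged_image = None

    def roll_back(self):
        """Reboot onto the image that was active before the switch."""
        self.active_image = self.previous_image
        self.previous_image = None


def roll_out(servers, new_image, is_healthy):
    """Stage everywhere first, then switch one update domain at a time."""
    for server in servers:
        server.stage(new_image)
    switched = []
    for domain in sorted({s.update_domain for s in servers}):
        group = [s for s in servers if s.update_domain == domain]
        for server in group:
            server.switch()
        switched.extend(group)
        if not all(is_healthy(s) for s in group):
            for server in switched:
                server.roll_back()
            return False
    return True


def make_fleet():
    return [Server(f"node_{n}", n % 3, "os-image-v1") for n in range(6)]


good_fleet = make_fleet()
good_result = roll_out(good_fleet, "os-image-v2", lambda server: True)

bad_fleet = make_fleet()
bad_result = roll_out(bad_fleet, "os-image-v2",
                      lambda server: server.update_domain != 1)

print(good_result, bad_result)
```

Because the slow part (copying the image) happens before any server is restarted, the window in which a server is out of service is only as long as a reboot.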
This process dramatically reduces the service window of the servers, increasing uptime and reducing the cost of maintenance on the cloud. We have covered how images are used to manage the environment. Now we are going to look at what we can see when we look inside a running role instance.
The innards of the Windows Azure web role virtual machine
Your first experience with roles in Azure is likely to be with the web role. To help us develop our web applications more effectively, it's worth looking in more detail at the virtual machine that our web role is hosted in. In this subtopic we will look at the following items:
- The details of the VM
- The hosting process of our web role (RDRoleHost)
- How the RoleEnvironment class communicates with Windows Azure
- Other processes running in Windows Azure
Exploring the virtual machine details
We can use the power of native code execution to see some of the juicy details about the virtual machine that our web role runs on. Figure 12 shows an ASP.NET web page that displays some of the internal details, including the machine name, domain name, and the username the code is running under.
Figure 12: Using native code, we can see some of the machine details of a web role in Windows Azure. We can see that Microsoft is using Windows Server 2008 x64. Notice the username the process is running as is a GUID.
If we want, we can easily generate the above web page by creating a simple ASPX page with some labels that represent the text as shown below:
Finally, we can display the internal details of the Virtual Machine using the code-behind in Listing 1.
Listing 1: Code-behind for machine details
You can now, of course, deploy your web page to Windows Azure and see the inner details of your web role.
The machine details in figure 12 provide us with some interesting facts:
- Web roles run on Windows Server 2008 Enterprise Edition x64
- They run on quad-core AMD processors, and you have one core assigned
- The domain name of the web role is CIS
- My VM has been running for an hour
- The Windows directory lives on the D:\ drive
- The web application lives on the E:\ drive
This is just the beginning; feel free to experiment and discover whatever information you need to satisfy your curiosity about the internals of Windows Azure by using calls similar to those in Listing 1.
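The same kind of poking around works on any machine. As a language-neutral illustration (this is not the book's code-behind, which used .NET calls in an ASP.NET page), the Python standard library exposes comparable facts. The dictionary keys below are names chosen for this sketch.

```python
import getpass
import os
import platform


def machine_details():
    """Collect a few facts about the machine this code is running on."""
    try:
        user = getpass.getuser()
    except Exception:  # some sandboxes have no resolvable user account
        user = "unknown"
    return {
        "machine_name": platform.node(),
        "os": f"{platform.system()} {platform.release()}",
        "architecture": platform.machine(),
        "cpu_count": os.cpu_count(),
        "user_name": user,
        "working_directory": os.getcwd(),
    }


details = machine_details()
for key, value in details.items():
    print(f"{key}: {value}")
```

The values printed depend entirely on where the code runs, so none are shown here.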
Now that we're having a good rummage around the virtual machine, it might be worth having a look at what processes are actually running on the virtual machine. To do that, we will build an ASP.NET web page that will return all the processes in a pretty, little grid as shown in figure 13.
Figure 13: The process list of a Windows Azure virtual machine. The two processes that start with RD are related to “Red Dog,” which was the code name for Azure while it was being developed.
To generate the page shown in figure 13, create a new web page in your web role with a GridView component called processGridView:
Next, add the following code to the Page_Load event of your code-behind:
The code above will list all the processes on a server and bind the returned list to a GridView on a web page, as displayed in figure 13. If we look at the process list displayed in figure 13, there are two Windows Azure-specific processes that we are interested in: RDRoleHost and RDAgent.
We will now spend the next couple of subtopics looking at these processes in more detail.
The hosting process of our website (RDRoleHost)
If you were to look at the process list for a live web role as shown in Figure 13, or if you were to fire up your web application in Windows Azure and look at the process tab, you will notice that the typical IIS worker process (w3wp.exe) is not present when your web server is running.
You will also notice that, if you stop your IIS Server (as shown next), then your web server continues to run.
We know from prerequisites of installing the Windows Azure SDK that web roles are run under IIS7. So, why can't we see our roles in IIS, or restart the server using
Hostable web core
Although Windows Azure uses IIS7.0, it makes use of a new feature, called hostable web core, which allows you to host the IIS runtime in-process. In the case of Windows Azure, the RDRoleHost.exe process hosts the IIS7.0 runtime. If you were to look at the process list on the live server or on the development Fabric, you would see that, as you interact with the web server, the utilization of this process changes.
Why is it run in-process rather than using plain old IIS?
The implementation of the web role is quite different from a normal web server. Rather than a system administrator managing the running of the web servers, our data center overlord, the Fabric Controller, performs that task. Therefore, the Fabric Controller needs the ability to interact and report on the web roles in a consistent manner. Rather than attempting to use the WMI routines of IIS, the Windows Azure team opted for a custom WCF approach.
This custom in-process approach also allows your application instances to interact with the RDRoleHost process using a custom API via the RoleEnvironment class.
Other processes running in Windows Azure
Now that we have an idea of how IIS7.0 is used to host your web role, it's worth looking at some of the other processes hosted in Windows Azure.
The RDAgent process is responsible for collecting the health status of the role and for collecting management information on the VM, including:
- Server Time
- Machine Name
- Disk Capacity
- OS Version
- Performance Statistics (CPU Usage, Disk Usage)
The role instance and the RDAgent process again use named pipes to communicate with each other. If the instance needs to notify the Fabric Controller of its current state of health, this is communicated from the web role to the RDAgent process over the named pipe.
All of the information collected by the RDAgent process is ultimately made available to the Fabric Controller to allow it to determine how to best run the data center. The Fabric Controller will use the RDAgent process as a proxy between itself, the VM and the instance. If the Fabric Controller decides to shut down an instance, it will instruct the RDAgent process to perform this task.
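The proxy role described here can be pictured with a small model in which a controller never touches an instance directly and only talks to the agent on each VM. The classes, the report fields, and the rule that unhealthy instances get shut down are assumptions made for the sketch. The real RDAgent protocol is not public, and direct method calls replace the named pipes for simplicity.

```python
class RoleInstance:
    def __init__(self, name):
        self.name = name
        self.running = True
        self.healthy = True

    def health(self):
        return "Healthy" if self.healthy else "Unhealthy"

    def stop(self):
        self.running = False


class Agent:
    """Stand-in for the per-VM agent that sits between controller and instance."""

    def __init__(self, machine_name, instance):
        self.machine_name = machine_name
        self._instance = instance

    def status_report(self):
        # The instance reports its health to the agent; the agent adds VM details.
        return {
            "machine_name": self.machine_name,
            "instance": self._instance.name,
            "health": self._instance.health(),
            "running": self._instance.running,
        }

    def shut_down_instance(self):
        self._instance.stop()


def controller_sweep(agents):
    """Ask every agent for a report and shut down instances that are unhealthy."""
    stopped = []
    for agent in agents:
        report = agent.status_report()
        if report["running"] and report["health"] != "Healthy":
            agent.shut_down_instance()
            stopped.append(report["instance"])
    return stopped


workers = [RoleInstance(f"worker_{n}") for n in range(3)]
agents = [Agent(f"vm_{n}", worker) for n, worker in enumerate(workers)]

workers[1].healthy = False
stopped = controller_sweep(agents)
print(stopped)  # ['worker_1']
```

Keeping every interaction behind the agent gives the controller one consistent interface per VM, regardless of what kind of role is running inside it.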
Hopefully, you have learned a little bit of how Azure is architected and how Microsoft runs the cloud operating system. Microsoft has spent billions of dollars and millions of work hours in building these data centers and the operating system that runs them.
Windows Azure truly is an operating system for the cloud, abstracting away the details of the massive data centers, servers, networks, and other gear so you can simply focus on your application. The Fabric Controller controls what is happening in the cloud and acts as the kernel in the operating system.
With the power of the Fabric Controller and the massive data centers, you can define the structure of your system and dynamically scale it up or down as needed. The infrastructure makes it easy to do rolling upgrades across your infrastructure, leading to minimal downtime of your service.
The service model consists of the service definition and service configuration files and describes, to the Fabric Controller, how your application should be deployed and managed. This model is the magic behind the data center automation. Fault and update domains are used in the model to describe how the group of servers running your application should be separated to protect against failures and outages.
The amazing thing is that Azure was built primarily from off-the-shelf parts and software. The team didn't code this from the ground up but were able to tune the behavior of Hyper-V and of Windows to suit their needs. They leveraged their deep experience in running large data centers and worked with the Microsoft product teams to add the features they needed. These innovations, like the ability to boot natively from a VHD, will trickle into the mainstream as part of future versions of the products.