How microsoft.com Finally Achieved 100% Web Site Availability
Todd Wanke is microsoft.com's research and development manager in
charge of the innovations that make the Web site more stable and
effective. Systems operation manager Todd Weeks, who keeps the Microsoft
site up and running, implements these solutions in the real world test bed
of Microsoft Web servers. Together, the duo is known within Microsoft as
"Todd Squared."
If you've ever struggled with the traditional "round robin" Domain Name
System (DNS), you know the tough decision you face with a troublesome Web
server. The cause isn't always obvious: someone published bad content, or
maybe a router is failing or a disk drive is going down. If you take the
server down to debug it properly and fix the problem, you leave a
percentage of your Web visitors in the dark, with nothing to connect to
except a generic server error.
Here's how it works. Each server on your Web site typically has a
specific Internet Protocol (IP) address. That address is listed around the
world on DNS servers - a sort of distributed global address book for Web
sites.
When someone types, for example, http://www.microsoft.com/ into a
Web browser, the browser asks a DNS server for the site's IP address. If
more than one IP address is available, the DNS server rotates its
responses among them to balance the load.
Unfortunately, when you take a misbehaving server offline, the DNS servers
still hand out its address, unaware of the server's status. This can mean
visitors receive error messages, or are served pages with objects and
graphics missing: the online equivalent of a gap-toothed grin.
That's not good business when global customers depend on your site for
support, important patches and updates, and vital news and information
every hour of the day, each day of the week. Virtually every major Web
site faces this problem: How do you hide server failures from users?
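The round robin weakness described above can be illustrated with a minimal sketch. It is a toy simulation, not real DNS: the addresses come from the reserved documentation range and the four-server pool is an assumption chosen for illustration.

```python
from itertools import cycle

def round_robin_dns(addresses):
    """Hand out addresses in strict rotation, with no knowledge of server health."""
    return cycle(addresses)

# Hypothetical addresses from the documentation range, not Microsoft's real ones.
addresses = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
offline = {"192.0.2.3"}  # one server taken down for debugging

resolver = round_robin_dns(addresses)
answers = [next(resolver) for _ in range(1000)]

# Every lookup that returned the offline address sends a visitor to a dead server.
failed = sum(1 for ip in answers if ip in offline)
failure_rate = failed / len(answers)
print(f"{failure_rate:.0%} of lookups were sent to an offline server")
```

With one of four servers offline, a quarter of the simulated lookups land on the dead address, because the rotation has no way to learn that the server is gone.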
Wanke: We needed to figure out why our servers were crashing,
which requires them to be taken offline for debugging. For every machine,
you have to have an IP address. This problem, for years, has been
unsolvable.
Weeks: Just running CHKDSK on a 36GB server, which is standard
after a crash, can take up to three hours. We couldn't have a server fail,
or do maintenance on a server, or be able to leave a server down and try
to figure out why it went down without it affecting customers.
Killing the Round Robin
The way the microsoft.com Web servers, which run Microsoft Internet
Information Server (IIS), are arranged, there are between four and six
servers mirroring core Web site content on each network segment. There are
four segments, which offer an additional layer of redundancy in case of
network failure (see Figure 1).
Figure 1 The microsoft.com Web site consists of several clusters of
servers, each with a number of segments consisting of 4-6 servers
containing mirrored content. In the past, each server had a dedicated IP
address. Now, with the Single IP load balancing solution, servers use
virtual IP addresses, which hide server failure from end users. The
result: Up to 100 percent availability.
To eliminate the dependency on round robin, we incorporated Valence
Research's Convoy Cluster Software, which has since become the
Windows NT 4.0 Enterprise Edition Load Balancing Service (WLBS).
WLBS works as an NDIS driver, which resides on all of our Web servers.
Here it can help detect failures in Windows NT, TCP/IP, and IIS.
Even more critical to the solution is a controller version of our
monitoring program called HttpMon 3.x, which pings each Web server once
per minute to look for failures in the application layer - something
that's more common than system failures but much harder to detect.
Application-layer problems can range from IIS being overloaded by
requests, to a full-fledged crash requiring a server reboot. For the most
part, HttpMon looks for errors consistent with RFC 1945: 200 is OK, 500 is
a server error.
After determining that a server is having serious problems (after a
test fails x number of times, as set in a config file), HttpMon and WLBS
take the server out of rotation and divert its traffic to other machines.
A minute later, HttpMon checks again to see if the server has recovered.
If it has, it is returned to service. If the server still doesn't respond,
a technician is alerted to attend to the machine, determine the nature of
the problem so it can be corrected, and bring it back online.
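The monitoring cycle just described can be sketched roughly as follows. This is not HttpMon's code: the class, the field names, the threshold of three failures, and the rule that any status other than 200 counts as a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

@dataclass
class HealthMonitor:
    probe: Callable[[str], int]   # hypothetical: returns an HTTP status for a server
    max_failures: int = 3         # the "x" from the config file; the value is assumed
    failures: Dict[str, int] = field(default_factory=dict)
    in_rotation: Set[str] = field(default_factory=set)
    alerts: List[str] = field(default_factory=list)

    def check(self, server: str) -> None:
        """Run one probe (one per minute in the article) and update state."""
        if self.probe(server) == 200:
            # Healthy or recovered: reset the counter and (re)join the rotation.
            self.failures[server] = 0
            self.in_rotation.add(server)
            return
        self.failures[server] = self.failures.get(server, 0) + 1
        if self.failures[server] == self.max_failures:
            # Threshold reached: divert traffic away from this server.
            self.in_rotation.discard(server)
        elif self.failures[server] == self.max_failures + 1:
            # Still failing on the follow-up check: page a technician once.
            self.alerts.append(server)

# Scripted probe results: healthy, four failures in a row, then recovery.
statuses = iter([200, 500, 500, 500, 500, 200])
monitor = HealthMonitor(probe=lambda server: next(statuses), in_rotation={"web1"})
history = []
for _ in range(6):
    monitor.check("web1")
    history.append("web1" in monitor.in_rotation)
```

In this run the server leaves the rotation on its third consecutive failure, triggers an alert on the fourth, and rejoins as soon as a probe returns 200 again.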
To prevent a domino effect where all of the servers on a segment are
overloaded and subsequently removed from service during a period of
unusually high traffic, there is what's called a "water level" of two
servers per segment. This means that if there are three servers running
and one of them is removed from service, HttpMon will not remove the final
two servers under any circumstances. Of course, by the time two or more
servers are down, Windows NT events are occurring, kicking off pagers for
server technicians who can diagnose the problem and restore order.
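The water level rule amounts to a guard that runs before any server is pulled from service. The sketch below assumes hypothetical function and server names; only the minimum of two servers per segment comes from the text.

```python
WATER_LEVEL = 2  # minimum number of servers kept in service per segment

def may_remove(active_servers: int, water_level: int = WATER_LEVEL) -> bool:
    """True if removing one server still leaves at least water_level in service."""
    return active_servers - 1 >= water_level

def remove_failing(segment, failing, water_level=WATER_LEVEL):
    """Remove failing servers in order, stopping once the water level is reached."""
    active = list(segment)
    for server in segment:
        if server in failing and may_remove(len(active), water_level):
            active.remove(server)
    return active

# Worst case: every server on a five-server segment looks overloaded at once.
segment = ["web1", "web2", "web3", "web4", "web5"]
survivors = remove_failing(segment, failing=set(segment))
print(survivors)
```

Even when all five servers fail their checks, the guard leaves two in service, so a traffic spike cannot empty the segment.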
Suspenders and a Belt
We switched on the Single IP solution on Friday, June 26, 1998. This
was right after the launch of Windows 98, and we expected a lot of Web
traffic to follow the launch event.
Systems engineers are believers in both suspenders and a belt, at least
when it comes to their mission-critical servers. (For the record, neither
of us tends to wear either...) For this reason, Microsoft didn't just
switch to one IP address per segment when it went to Single IP. Each
server kept its individual IP address, and HttpMon was programmed to share
a pool of IP addresses between all of the servers on the segment.
That way, if for some reason the solution didn't work as planned, there
was a fallback position.
Weeks: Prior to the Windows 98 launch, we had some severe
content problems on the site. A couple of the content tree owners
completely refreshed their entire sites, and caused some major problems on
the site.
Wanke: We actually have 20 IP addresses. If the solution
totally went bad during the Windows 98 launch, we could remove Single IP
and we could be back up with round robin DNS in an hour.
Weeks: With the Windows 98 launch, we wanted to switch to
Single IP - but with as much fail-safe as possible.
Not only did Single IP work, it yielded a first for microsoft.com - the
100 percent availability day. In fact, seven of the first 14 days were 100
percent days, and on many other days the site achieved the stated goal of
99.8 percent availability or better.
Wanke: Until Single IP, we were just like everyone else: we
never had a 100 percent day. Never.
There was another unexpected side effect.
Weeks: With round robin DNS, when you rebooted a box, for about
two minutes while IIS was starting up, the box would be taking up to 100
hits per second. The performance monitor counters, all of the IIS counters
on the box would just go ballistic. We called it 'Earthquake mode.'
The Single IP solution put an end to Earthquake mode. Now, HttpMon
waits until the server is fully started and returning pages before adding
it back into the pool of active servers.
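One way to picture this warm-up gate is a loop that keeps probing a rebooted server and only adds it to the pool once it returns a page. The function name, the attempt limit, and the scripted status codes below are assumptions for illustration, not details of HttpMon.

```python
import time
from typing import Callable, List

def wait_until_serving(probe: Callable[[], int], attempts: int,
                       delay: float = 0.0, sleep=time.sleep) -> bool:
    """Probe until the server returns 200 OK, giving up after `attempts` tries."""
    for _ in range(attempts):
        if probe() == 200:
            return True
        sleep(delay)  # pause between probes; zero here so the demo runs instantly
    return False

# Scripted boot sequence: two failed probes while IIS starts, then a good page.
boot_statuses = iter([503, 503, 200])
pool: List[str] = ["web1", "web2"]

ready = wait_until_serving(lambda: next(boot_statuses), attempts=5)
if ready:
    pool.append("web3")  # only now does the rebooted server receive traffic
```

Because the server receives no traffic until the probe succeeds, it never faces a flood of requests while it is still starting up.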
Disaster Strikes; Few Notice
Just before midnight on Wednesday, July 8, a router failed, taking
down a significant portion of the infrastructure that connects Microsoft's
Web servers to its Internet customers. The problem persisted for much of
the next day.
Single IP made the day much more bearable for customers visiting the
Microsoft Web site. Less than 8 percent of the traffic that hit the
Microsoft Web servers during this time was affected by this massive
network failure, which under the old system would have affected 12 percent
of hits to the site (see Figure 2). A future planned innovation to the
Single IP solution called data clustering will shield end users from even
these types of network failures.
Figure 2 A look at server availability on www.microsoft.com.
Starting with June 26, you can see how actual server availability compares
with availability seen by site visitors. On July 8-9, a catastrophic
network hardware failure caused a dip in availability, which still stayed
above 92 percent.
Now that Single IP has proved itself in the field of battle, we plan to
pull back on the ratio of Virtual IP addresses (VIPs) to Dedicated IP
addresses (DIPs). Not having a one-to-one ratio of VIPs to DIPs gives us a
mixture of fail-safe and ease of maintenance, since you don't have to add
a new IP to every server when you add a server to a cluster. We're
confident now that the solution works, but a layer of redundancy should
never be totally eliminated.
Weeks: Single IP is a real success story.
Wanke: It's over. Gamepoint. We won.
The Future of Single IP
We aren't keeping the Single IP solution to ourselves. Already two
enterprise customers - in addition to Microsoft's internal customers,
including microsoft.com, MSN, and MSNBC - are beta-testing the solution.
Microsoft plans to make WLBS available as an additional component of
Windows NT Server 4.0 Enterprise Edition; an announcement with information
on pricing and availability is planned for late 1998. In the meantime, the
Convoy Cluster Software is still available from Valence Research.
© 1999 Microsoft Corporation. All rights reserved.
The information contained in this document represents the current view
of Microsoft Corporation on the issues discussed as of the date of
publication. Because Microsoft must respond to changing market conditions,
it should not be interpreted to be a commitment on the part of Microsoft,
and Microsoft cannot guarantee the accuracy of any information presented
after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.
BackOffice, BackOffice logo, Microsoft, and Windows NT are trademarks
or registered trademarks of Microsoft Corporation.
Other product or company names mentioned herein may be the trademarks
of their respective owners.
Microsoft Corporation • One Microsoft Way • Redmond, WA 98052-6399 •
USA