Single IP: A Server Solution

How microsoft.com Finally Achieved 100% Web Site Availability

Todd Wanke is microsoft.com's research and development manager in charge of the innovations that make the Web site more stable and effective. Systems operation manager Todd Weeks, who keeps the Microsoft site up and running, implements these solutions in the real world test bed of Microsoft Web servers. Together, the duo is known within Microsoft as "Todd Squared."

If you've ever struggled with traditional "round robin" Domain Name System (DNS) load balancing, you know the tough decision you face with a troublesome Web server. The cause isn't always obvious: someone may have published bad content, a router may be failing, or a disk drive may be going down. If you take the server down to debug it properly and eliminate the problem, you leave a percentage of your Web visitors in the dark with nothing to connect to except a generic server error.

Here's how it works. Each server on your Web site typically has a specific Internet Protocol (IP) address. That address is listed on DNS servers around the world, a sort of distributed global address book for Web sites.

When someone types, for example, http://www.microsoft.com/ into a Web browser, the browser asks a DNS server for the site's IP address. If more than one IP address is available, the DNS server rotates its responses to balance the load. Unfortunately, when you take a misbehaving server offline, the DNS servers still hand out its address, unaware of the server's status. Visitors may then receive error messages, or be served pages with objects and graphics missing, the online equivalent of a gap-toothed grin.
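The effect of round robin on an offline server can be sketched in a few lines of Python. This is only an illustration, not how any real DNS server is implemented: the addresses are hypothetical (taken from the RFC 5737 documentation range), and the function names are made up for the example.

```python
from itertools import cycle


def round_robin_answers(addresses, lookups):
    """Return the address handed out for each of `lookups` queries, in rotation."""
    rotation = cycle(addresses)
    return [next(rotation) for _ in range(lookups)]


def failed_fraction(addresses, offline, lookups):
    """Fraction of lookups that were pointed at an offline address."""
    answers = round_robin_answers(addresses, lookups)
    return sum(1 for address in answers if address in offline) / lookups


# Hypothetical addresses; one of the four servers has been taken offline.
servers = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
share = failed_fraction(servers, offline={"192.0.2.3"}, lookups=1000)
print(f"{share:.0%} of lookups were sent to the offline server")
```

With four addresses in rotation and one server down, a quarter of the lookups in this simplified model still land on the dead machine, because the rotation has no knowledge of server health.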

That's not good business when global customers depend on your site for support, important patches and updates, and vital news and information every hour of the day, each day of the week. Virtually every major Web site faces this problem: How do you hide server failures from users?

    Wanke: We needed to figure out why our servers were crashing, which requires them to be taken offline for debugging. For every machine, you have to have an IP address. This problem, for years, has been unsolvable.
    Weeks: Just running CHKDSK on a 36GB server, which is standard after a crash, can take up to three hours. We couldn't have a server fail, or do maintenance on a server, or be able to leave a server down and try to figure out why it went down without it affecting customers.

Killing the Round Robin

The microsoft.com Web servers, which run Microsoft Internet Information Server (IIS), are arranged so that between four and six servers mirror core Web site content on each network segment. There are four segments, which offer an additional layer of redundancy in case of network failure (see Figure 1).

Figure 1 The microsoft.com Web site consists of several clusters of servers, each with a number of segments consisting of 4-6 servers containing mirrored content. In the past, each server had a dedicated IP address. Now, with the Single IP load balancing solution, servers use virtual IP addresses, which hide server failure from end users. The result: Up to 100 percent availability.

To eliminate the dependency on round robin, we incorporated Valence Research's Convoy Cluster Software, which has since become the Windows NT 4.0 Enterprise Edition Load Balancing Service (WLBS). WLBS works as an NDIS driver, which resides on all of our Web servers. Here it can help detect failures in Windows NT, TCP/IP, and IIS.

Even more critical to the solution is a controller version of our monitoring program called HttpMon 3.x, which pings each Web server once per minute to look for failures in the application layer, something that's more common than system failures but much harder to detect. Application-layer problems can range from IIS being overloaded by requests to a full-fledged crash requiring a server reboot. For the most part, HttpMon looks at the HTTP status codes defined in RFC 1945: 200 is OK, 500 is a server error.
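An application-layer check of this kind might look roughly like the following Python sketch. It is not HttpMon's code, which is not shown in this article; the function names, the timeout, and the choice to treat any 2xx status as healthy are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(status):
    """Treat any 2xx status as healthy; None means no response at all (assumed rule)."""
    return status is not None and 200 <= status < 300


def probe(url, timeout=5.0):
    """Fetch `url` and return the HTTP status code, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status such as 500
    except (urllib.error.URLError, OSError):
        return None      # connection refused, timeout, name lookup failure, and so on


print(is_healthy(200), is_healthy(500), is_healthy(None))
```

A monitor would call `probe` against a known page on each server and pass the result to `is_healthy`. The point of checking at the HTTP level is that a server can answer network-level probes while IIS itself is returning errors or nothing at all.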

After determining that a server is having serious problems (after a test fails x number of times, as set in a config file), HttpMon and WLBS take the server out of rotation and divert its traffic to other machines. A minute later, HttpMon checks again to see if the server has recovered. If it has, it is returned to service. If the server still doesn't respond, a technician is alerted to attend to the machine, determine the nature of the problem so it can be corrected, and bring it back online.
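The removal and recovery logic described above can be sketched as a small state update that runs after each once-per-minute check. This is a simplified illustration: the class, the field names, and the default threshold of 3 are hypothetical, since the article only says the count is set in a config file.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    name: str
    in_rotation: bool = True
    consecutive_failures: int = 0
    technician_alerted: bool = False


def record_check(server, healthy, failure_threshold=3):
    """Update one server's state after a health check (threshold value is assumed)."""
    if healthy:
        server.consecutive_failures = 0
        server.in_rotation = True          # recovered: return the server to service
        return
    server.consecutive_failures += 1
    if server.in_rotation:
        if server.consecutive_failures >= failure_threshold:
            server.in_rotation = False     # divert its traffic to other machines
    else:
        server.technician_alerted = True   # still failing after removal


web1 = ServerState("web1")
for healthy in [False, False, False]:
    record_check(web1, healthy)
print(web1.in_rotation)  # False: three failed checks took it out of rotation
```

A fourth failed check would set `technician_alerted`, and a later successful check would put the server back in rotation.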

To prevent a domino effect where all of the servers on a segment are overloaded and subsequently removed from service during a period of unusually high traffic, there is what's called a "water level" of two servers per segment. This means that if there are three servers running and one of them is removed from service, HttpMon will not remove the final two servers under any circumstances. Of course, by the time two or more servers are down, Windows NT events are occurring, kicking off pagers for server technicians who can diagnose the problem and restore order.
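The "water level" rule amounts to a guard in front of any removal. A minimal sketch, assuming the active servers on a segment are tracked as a simple list (the function and server names are hypothetical):

```python
WATER_LEVEL = 2  # minimum number of servers kept in service per segment


def may_remove(active_servers, water_level=WATER_LEVEL):
    """A server may be removed only if more than `water_level` are still active."""
    return len(active_servers) > water_level


def remove_if_allowed(active_servers, name, water_level=WATER_LEVEL):
    """Return the new active list; leave it unchanged at or below the water level."""
    if name in active_servers and may_remove(active_servers, water_level):
        return [server for server in active_servers if server != name]
    return list(active_servers)


segment = ["web1", "web2", "web3"]
segment = remove_if_allowed(segment, "web3")  # allowed: three servers were active
segment = remove_if_allowed(segment, "web2")  # refused: only two remain
print(segment)  # ['web1', 'web2']
```

The guard deliberately keeps possibly unhealthy servers in service, on the reasoning that under heavy load a slow answer is better than no servers on the segment at all.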

Suspenders and a Belt

We switched on the Single IP solution on Friday, June 26, 1998. This was right after the launch of Windows 98, and we expected a lot of Web traffic to follow the launch event.

Systems engineers are believers in both suspenders and a belt, at least when it comes to their mission-critical servers. (For the record, neither of us tends to wear either...) For this reason, Microsoft didn't just switch to one IP address per segment when it went to Single IP. Each server kept its individual IP address, and HttpMon was programmed to share a pool of IP addresses among all of the servers on the segment.

That way, if for some reason the solution didn't work as planned, there was a fallback position.

    Weeks: Prior to the Windows 98 launch, we had some severe content problems on the site. A couple of the content tree owners completely refreshed their entire sites, and caused some major problems on the site.
    Wanke: We actually have 20 IP addresses. If the solution totally went bad during the Windows 98 launch, we could remove Single IP and we could be back up with round robin DNS in an hour.
    Weeks: With the Windows 98 launch, we wanted to switch to Single IP - but with as much fail-safe as possible.

Not only did Single IP work, it yielded a first for microsoft.com - the 100 percent availability day. In fact, seven of the first 14 days were 100 percent days, and on many other days the site achieved the stated goal of 99.8 percent availability or better.
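To put the 99.8 percent goal in perspective, the following Python sketch converts an availability percentage into permitted downtime. It assumes availability is measured as a share of clock time over one day; the article does not say how the figure was actually computed (it could equally be a share of requests), so treat the result as illustrative only.

```python
def allowed_downtime_seconds(availability_percent, period_seconds=24 * 60 * 60):
    """Downtime permitted in a period at a given availability (time-based assumption)."""
    return period_seconds * (100 - availability_percent) / 100


seconds = allowed_downtime_seconds(99.8)
print(f"{seconds:.1f} seconds per day")  # 172.8 seconds per day
```

Under that assumption, a 99.8 percent day leaves just under three minutes of unavailability, which is less than the time a single server reboot can take.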

    Wanke: Until Single IP, we were just like everyone else: we never had a 100 percent day. Never.

There was another unexpected side effect.

    Weeks: With round robin DNS when you rebooted a box, for about two minutes while IIS is starting up, the box would be taking up to 100 hits per second. The performance monitor counters, all of the IIS counters on the box would just go ballistic. We called it 'Earthquake mode.'

The Single IP solution put an end to Earthquake mode. Now, HttpMon waits until the server is fully started and returning pages before adding it back into the pool of active servers.

Disaster Strikes; Few Notice

Just before midnight on Wednesday, July 8, a router failed, taking down a significant portion of the infrastructure that connects Microsoft's Web servers to its Internet customers. The problem persisted for much of the next day.

Single IP made the day much more bearable for customers visiting the Microsoft Web site. Less than 8 percent of the traffic that hit the Microsoft Web servers during this time was affected by this massive network failure, which under the old system would have affected 12 percent of hits to the site (see Figure 2). A future planned innovation to the Single IP solution called data clustering will shield end users from even these types of network failures.

Figure 2 A look at server availability on www.microsoft.com. Starting with June 26, you can see how actual server availability compares with availability seen by site visitors. On July 8-9, a catastrophic network hardware failure caused a dip in availability, which still stayed above 92 percent.

Now that Single IP has proved itself in the field of battle, we plan to pull back on the ratio of Virtual IP addresses (VIPs) to Dedicated IP addresses (DIPs). Not having a one-to-one ratio of VIPs to DIPs gives us a mixture of fail-safe and ease of maintenance, since you don't have to add a new IP to every server when you add a server to a cluster. We're confident now that the solution works, but a layer of redundancy should never be totally eliminated.

    Weeks: Single IP is a real success story.
    Wanke: It's over. Gamepoint. We won.

The Future of Single IP

We aren't keeping the Single IP solution to ourselves. Already two enterprise customers - in addition to Microsoft's internal customers, including microsoft.com, MSN, and MSNBC - are beta-testing the solution. Microsoft plans to make WLBS available as an additional component of Windows NT Server 4.0 Enterprise Edition; an announcement with information on pricing and availability is planned for late 1998. In the meantime, the Convoy Cluster Software is still available from Valence Research.

© 1999 Microsoft Corporation. All rights reserved.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

BackOffice, BackOffice logo, Microsoft, and Windows NT are trademarks or registered trademarks of Microsoft Corporation.

Other product or company names mentioned herein may be the trademarks of their respective owners.

Microsoft Corporation • One Microsoft Way • Redmond, WA 98052-6399 • USA




  Last updated January 12, 2000
  © 2001 Microsoft Corporation. All rights reserved. Terms of use.
