How MS SQL Failover Clustering Works
Friday, February 27th, 2009 | Author:

The clustered nodes use a “heartbeat” signal to check whether each node is alive, at both the operating system level and the SQL Server level. At the operating system level, the nodes in the cluster are in constant communication, validating the health of all the nodes.

Once a SQL Server failover cluster is installed, the node hosting the SQL Server resource uses the Service Control Manager to check every 5 seconds whether the SQL Server service appears to be running. This “LooksAlive” check does not impact the performance of the system, but it is not thorough either: it succeeds whenever the service appears to be running, even if the service is not actually operational. For that reason a deeper check must also be done periodically; this “IsAlive” check runs every 60 seconds.
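To make the two polling intervals concrete, here is a minimal Python sketch of the timeline described above. It is only a simulation, not how the cluster service is implemented: the function name `check_schedule` and its parameters are made up for illustration, and it assumes that both checks are counted when their intervals coincide, which the text above does not specify.

```python
def check_schedule(duration_s, looks_alive_s=5, is_alive_s=60):
    """Return (second, check_name) pairs for a simulated polling timeline.

    Hypothetical helper: the defaults mirror the 5-second LooksAlive and
    60-second IsAlive intervals described in the article.
    """
    events = []
    for t in range(1, duration_s + 1):
        if t % looks_alive_s == 0:
            events.append((t, "LooksAlive"))  # cheap service-status check
        if t % is_alive_s == 0:
            events.append((t, "IsAlive"))     # deeper query-based check
    return events


if __name__ == "__main__":
    timeline = check_schedule(120)
    cheap = sum(1 for _, name in timeline if name == "LooksAlive")
    deep = sum(1 for _, name in timeline if name == "IsAlive")
    print(f"In 120 s: {cheap} LooksAlive checks, {deep} IsAlive checks")
```

The point of the sketch is the ratio: the cheap check runs twelve times for every deep check, so a service that is running but not answering queries can go unnoticed for up to a minute.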

The IsAlive check runs a SELECT @@SERVERNAME Transact-SQL query against SQL Server to determine whether the server can respond to requests. Although a reply to the IsAlive query confirms that the SQL Server service is available for requests, it does not guarantee that all user databases are available, or that the user databases are operating within necessary performance/response-time requirements.

If the IsAlive query fails, the IsAlive health check is retried five times and then attempts to reconnect to the instance of SQL Server. If all five retries fail, the SQL Server resource fails. Depending on the failover threshold configuration of the SQL Server resource, the failover cluster then either attempts to restart the resource on the same node or fails over to another available node. In other words, the IsAlive query tolerates a few errors, but the resource ultimately fails once the retry threshold is exceeded.
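The retry behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual resource DLL logic: `is_alive`, `on_resource_failure`, `run_query`, and `restart_threshold` are hypothetical names, the sketch counts one initial attempt plus five retries, and the simple "restart until a threshold is reached, then fail over" rule is a stand-in for the resource's real failover threshold configuration.

```python
def is_alive(run_query, retries=5):
    """Return True if the probe query succeeds on the first try or any retry.

    run_query is a caller-supplied function (hypothetical) that sends a
    query to the instance and raises an exception on failure.
    """
    for _ in range(1 + retries):  # assumed: 1 initial attempt + 5 retries
        try:
            run_query("SELECT @@SERVERNAME")
            return True
        except Exception:
            continue  # tolerate the error and try again
    return False      # all attempts failed: the resource is considered failed


def on_resource_failure(restarts_so_far, restart_threshold):
    """Pick the next action after a failed resource (simplified assumption)."""
    if restarts_so_far < restart_threshold:
        return "restart"   # try again on the same node
    return "failover"      # move the instance to another available node


if __name__ == "__main__":
    def always_down(sql):
        raise ConnectionError("instance not responding")

    if not is_alive(always_down):
        print(on_resource_failure(restarts_so_far=3, restart_threshold=3))
```

Passing the query function in as a parameter keeps the sketch free of any database driver; in a real monitor it would wrap whatever client library is used to reach the instance.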

During failover of the SQL Server instance, SQL Server resources start up on the new node. Windows clustering starts the SQL Server service for that instance on the new node, and SQL Server goes through the recovery process to start the databases. After the service is started and the master database is online, the SQL Server resource is considered to be up.

The user databases then go through the normal recovery process: any completed transactions in the transaction log are rolled forward (the Redo phase), and any incomplete transactions are rolled back (the Undo phase). In SQL Server 2005 Enterprise Edition, each user database becomes available to users once the Redo phase completes; for the other editions, as with all previous versions, each user database is unavailable until the Undo phase completes.

The length of the recovery process depends on how much activity must be rolled forward or rolled back upon startup. The ‘recovery interval’ sp_configure option of the server can be set to a low number to avoid long Redo recovery times and to speed up the failover process. The Undo recovery time can be reduced by using shorter transactions, so that any uncommitted transactions do not have much to roll back.
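As a rough mental model of the Redo and Undo phases, here is a toy Python sketch. It is not SQL Server's recovery algorithm: the log record layout (`txn`, `op`, `key`, `old`, `new`) and the function `recover` are invented for illustration, and the "database" is just a dictionary that stands in for what was on disk when the node failed.

```python
def recover(data, log):
    """Replay a toy transaction log against a toy on-disk state.

    data: dict of key -> value as found on disk after the crash.
    log:  list of records, each either
          {"txn": id, "op": "update", "key": k, "old": v0, "new": v1} or
          {"txn": id, "op": "commit"}.
    """
    committed = {r["txn"] for r in log if r["op"] == "commit"}
    state = dict(data)

    # Redo phase: roll forward the changes of completed transactions.
    for r in log:
        if r["op"] == "update" and r["txn"] in committed:
            state[r["key"]] = r["new"]

    # Undo phase: roll back incomplete transactions, newest change first.
    for r in reversed(log):
        if r["op"] == "update" and r["txn"] not in committed:
            state[r["key"]] = r["old"]

    return state


if __name__ == "__main__":
    # T1 committed but its change never reached disk; T2 did not commit
    # but its change did reach disk before the failure.
    on_disk = {"a": 1, "b": 99}
    log = [
        {"txn": "T1", "op": "update", "key": "a", "old": 1, "new": 2},
        {"txn": "T1", "op": "commit"},
        {"txn": "T2", "op": "update", "key": "b", "old": 20, "new": 99},
    ]
    print(recover(on_disk, log))
```

Even in this toy form the two tuning points from the text are visible: the Redo loop gets longer the more logged work has not yet reached disk, and the Undo loop gets longer the more work an uncommitted transaction has done.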