Routing "convergence" time, the delay to reroute traffic after a topology
change, is often cited as a key factor limiting the ability of currently
deployed IP networks to provide new services and scale to larger sizes.
In this talk, we present an
analysis of ISIS routing protocol behavior in one ISP network to
demonstrate where the problems lie and how to fix them. This analysis is
based on ISIS packet traces collected over multiple week-long periods on
several major ISP backbone networks. The analysis focuses on two
aspects.
First, by observing the number of link-state packets generated and
analyzing how many represent real physical changes, we show that only a
few routers and links are responsible for the majority of the churn.
Some of the links, when they fail, become unstable and cause continuous
churn.
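
As a rough illustration of the first aspect, the sketch below tallies
link-state packets per originator and separates content changes from
plain refreshes. It is not the tooling used for this analysis: the
record layout (an LSP identifier plus a digest of the LSP body) and the
function names are assumptions made for the example, and a digest
change is only a proxy for a real physical change.

```python
from collections import Counter


def churn_by_originator(lsps):
    """Split observed LSPs into content changes and refreshes.

    lsps: iterable of (lsp_id, content_digest) tuples in arrival order.
    Both fields are hypothetical; a real trace would need parsing first.
    The first sighting of an LSP ID is counted as a change.
    """
    last_digest = {}
    changes = Counter()
    refreshes = Counter()
    for lsp_id, digest in lsps:
        if last_digest.get(lsp_id) == digest:
            refreshes[lsp_id] += 1   # same content re-flooded
        else:
            changes[lsp_id] += 1     # content differs from the last copy seen
        last_digest[lsp_id] = digest
    return changes, refreshes


def top_share(counter, k):
    """Fraction of all counted events owned by the k busiest originators."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counter.most_common(k))
    return top / total
```

Calling `top_share(changes, k)` for a small `k` gives a quick measure of
how concentrated the churn is among a few routers.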
Second, we analyze in detail the sequence of events and delays that
result in convergence taking multiple seconds. We correlate the routing
state learned from tracing ISIS routing packets with observations of the
effects of routing upon a continuous stream of UDP test traffic while
the convergence progresses. This allows us to detect link failures very
quickly and to see precisely when the route changes.
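
One way such a correlation could be sketched is shown below. It assumes
the test stream carries sequence numbers, that probes are sent at a
fixed interval, and that the probe receiver and the routing-trace
collector share a synchronized clock. None of these details are stated
here, and all names are hypothetical.

```python
def find_outages(arrivals, send_interval, min_lost=1):
    """Locate gaps in a received probe stream.

    arrivals: list of (seq, recv_time) for received probes, sorted by seq.
    send_interval: assumed constant spacing between probes, in seconds.
    """
    outages = []
    for (prev_seq, prev_t), (seq, t) in zip(arrivals, arrivals[1:]):
        lost = seq - prev_seq - 1
        if lost >= min_lost:
            outages.append({
                "start": prev_t,   # last probe received before the gap
                "end": t,          # first probe received after the gap
                "lost": lost,
                "est_duration": lost * send_interval,
            })
    return outages


def events_in_outage(outage, routing_events, slack=0.0):
    """Return routing events whose timestamps fall within an outage window.

    routing_events: list of (timestamp, description) from the routing trace.
    slack: widens the window to allow for residual clock offset.
    """
    lo = outage["start"] - slack
    hi = outage["end"] + slack
    return [(ts, desc) for ts, desc in routing_events if lo <= ts <= hi]
```

The finer the probe spacing, the more tightly a gap bounds the moment of
failure and the moment forwarding resumes.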
We have observed that during a single convergence event our stream was
first black-holed, then sub-optimally routed, then looped, and finally
converged to the new, optimum route. All of these steps took several
seconds due to link failure detection times, link-state packet
propagation times, and the surprising effect of some of the ISIS timers.
By combining fixes for these problems with the results of our earlier
lab experiments (as reported at NANOG 20), we conclude with a recipe for
achieving subsecond IGP convergence in networks such as this ISP's
core network, which consists entirely of point-to-point links.