Contrary to some, traceroute is very real – I should know, I helped ...

The Register

Systems Approach A few weeks ago I stumbled onto an article titled "Traceroute isn’t real," which was reasonably entertaining while also not quite right in places.

I assume the title is an allusion to “Birds Aren’t Real,” a well-known satirical conspiracy theory, so perhaps the article should also be read as satire. You don’t need me to critique the piece because that task has been taken on by the tireless contributors of Hacker News, who have, on this occasion, done a pretty good job of criticism.

One line that jumped out at me in the traceroute essay was the claim "it is completely impossible for [MPLS] to satisfy the expectations of traceroute."

Not only is this something I know to be incorrect, but I have a vivid memory of how we came to make MPLS support traceroute when my colleagues and I were designing the Tag Switching header at Cisco in 1996.

(MPLS, or Multiprotocol Label Switching, is the IETF standard that followed fairly directly from the design of Tag Switching, and the headers are nearly identical.)

This was a heated debate, which is why I remember it so well today. It was a classic “design by committee” situation and we know how those things generally turn out (48-byte cells, anyone?), although I think this one was better than most in the end. So let’s wind our time machine back to 1996 and I will reconstruct the process that led to the MPLS header being what it is today, complete with its configurable support of traceroute.

Designing labels at a router company

I joined Cisco in 1995 to be part of the team that was tasked with figuring out how the new and exciting (at the time) technology of ATM could be “integrated” into the IP-centric product line of Cisco. There were plenty of ideas already floating around, with IP-over-ATM standards developing at the IETF and the ATM Forum.

By early 1996 there were half a dozen engineers at Cisco sharing ideas on what this “integration” might look like when Yakov Rekhter sent around a two-page document outlining the basic ideas of Tag Switching. When I read it, the idea seemed like a qualitative improvement on everything else I had seen or discussed, and my colleagues agreed.

We fairly quickly lined up executive support to flesh out those two pages into an architecture and proceed to implement it on the Cisco product line of both routers and ATM switches. We started working through the details that would need to be nailed down before any sort of implementation could start. One essential detail was the packet header format for tag-switched packets.

It’s important at this point to acknowledge some of the related ideas that were around at the time. After Yakov’s two-pager had won the support of our design team, but before we had said much about it in public, a startup called Ipsilon came out of stealth mode with a flurry of announcements. They had also figured out a way to combine IP routing with ATM switching, cleverly calling their approach IP Switching.

Their design was quite different from ours, but they made a splash with it, including the then-novel idea of publishing several informational RFCs to describe the protocols that made their system work. It’s fair to say that the executive support for Tag Switching was much easier to obtain thanks to the amount of buzz around Ipsilon.

We later realized that the central idea of Tag Switching, which was to associate fixed-length labels with variable-length IP prefixes from the routing table, had been invented and published by Girish Chandranmenon and George Varghese in SIGCOMM 1995. They called it “threaded indices.” That paper definitely pre-dated Yakov’s two-pager, so I think they can be considered the true inventors of this core aspect of Tag Switching and MPLS.

But neither Yakov’s paper nor the 1995 SIGCOMM paper addressed the issue of how you encode a fixed-length label in an IP packet.

Ipsilon’s approach relied on the ATM cell header to carry fixed-length labels, which was a fine idea if you were happy to send all your traffic around in 48-byte cells, but that was not what most of our customers wanted. Of course, there was nothing like a single customer viewpoint, but we had a big base of ISP customers who bought the fastest routers they could get their hands on in 1996 and they had opinions.

Many of them hated ATM with a passion – this was the height of the nethead vs bellhead wars – and one reason for that was the “cell tax.” ATM imposed a constant overhead (tax) of five header bytes for every 48 bytes of payload (over 10 percent), and this was the best case. A 20-byte IP header, by contrast, could be amortized over 1500-byte or longer packets (less than 2 percent).

Even with average packet sizes around 300 bytes (as they were at that time) IP came out a fair bit more efficient. And the ATM cell tax was in addition to the IP header overhead. ISPs paid a lot for their high-speed links and most were keen to use them efficiently.
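
The arithmetic behind the “cell tax” can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the model ignores any adaptation-layer framing, so real ATM overhead would be somewhat higher than what it reports.

```python
import math

ATM_CELL_PAYLOAD = 48  # payload bytes per ATM cell
ATM_CELL_HEADER = 5    # header bytes per ATM cell
IP_HEADER = 20         # IPv4 header without options


def ip_overhead(packet_len: int) -> float:
    """IP header bytes as a fraction of the whole IP packet."""
    return IP_HEADER / packet_len


def atm_overhead(packet_len: int) -> float:
    """Extra bytes on the wire (cell headers plus padding in the
    last cell) as a fraction of the IP packet being carried.

    Simplifying assumption: the packet is chopped straight into
    48-byte payloads with no adaptation-layer trailer.
    """
    cells = math.ceil(packet_len / ATM_CELL_PAYLOAD)
    wire_bytes = cells * (ATM_CELL_PAYLOAD + ATM_CELL_HEADER)
    return (wire_bytes - packet_len) / packet_len


if __name__ == "__main__":
    for size in (48, 300, 1500):
        print(f"{size:5d} bytes: IP {ip_overhead(size):.1%}, "
              f"ATM adds {atm_overhead(size):.1%}")
```

Under these assumptions a 300-byte packet carries roughly 7 percent of IP header overhead, while cell headers and last-cell padding add roughly 24 percent on top of that, because a partly filled final cell still costs a full 53 bytes.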

So a problem we faced with Tag Switching/MPLS was that we were about to introduce a “label tax” by putting an additional header on top of the IP header to carry our fixed-length labels.

There was an incentive to keep that header as small as possible – for some members of our design committee, that was the most important consideration. But we needed to fit quite a few things aside from a label into the header. Labels were intended to simplify packet forwarding, so you couldn’t (normally) ask a router to look beyond the label header. Hence, any field that influenced forwarding had to be in the label header.

One such field was a “class of service” modeled on the “type of service” (ToS) found in the IP header. ToS usage was not standardized at this point, but it was used for things like marking routing protocol packets for priority handling on arrival at an overloaded router. (These bits would get thoroughly redefined in the later work on Differentiated Services.)

The obvious choice would have been to include a full byte of ToS in the label header. But the pressure to minimize the header along with the lack of widespread usage of ToS led to us compromising on three bits, initially called “Class of Service” and later renamed to “Experimental” in RFC 3032.

This was in recognition of the fact that any attempt to offer different classes of service to IP traffic was decidedly an experiment in 1996. This decision would prove rather painful when the Diff-Serv standards emerged (using six bits of the ToS byte) and we tried to map them onto MPLS. (As an aside, I think my work at the intersection of MPLS and Diff-Serv was probably my most productive contribution to the IETF.)

The other field that we quickly decided was essential for the tag header was time-to-live (TTL). It is the nature of distributed routing algorithms that transient loops can happen, and packets stuck in loops consume forwarding resources – potentially even interfering with the updates that will resolve the loop. Since labelled packets (usually) follow the path established by IP routing, a TTL was non-negotiable. I think we might have briefly considered something less than eight bits for TTL – who really needs to count up to 255 hops? – but that idea was discarded.
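
For readers who want to see how these fields share a single 32-bit word, here is a minimal Python sketch of the label stack entry layout as standardized in RFC 3032: a 20-bit label, the three class-of-service (“Experimental”) bits, a one-bit bottom-of-stack flag, and the eight-bit TTL. The function names are hypothetical and the code is an illustration of the bit layout, not anything taken from a router implementation.

```python
import struct


def pack_label_entry(label: int, exp: int, bottom: bool, ttl: int) -> bytes:
    """Pack one 32-bit label stack entry in network byte order.

    Layout (most significant bit first), per RFC 3032:
      20 bits label | 3 bits Exp | 1 bit bottom-of-stack | 8 bits TTL
    """
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    if not 0 <= exp < 2 ** 3:
        raise ValueError("exp must fit in 3 bits")
    if not 0 <= ttl < 2 ** 8:
        raise ValueError("ttl must fit in 8 bits")
    word = (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)


def unpack_label_entry(data: bytes) -> dict:
    """Decode the first four bytes of `data` as a label stack entry."""
    (word,) = struct.unpack("!I", data[:4])
    return {
        "label": word >> 12,
        "exp": (word >> 9) & 0x7,
        "bottom": bool((word >> 8) & 0x1),
        "ttl": word & 0xFF,
    }


if __name__ == "__main__":
    entry = pack_label_entry(label=16, exp=5, bottom=True, ttl=64)
    print(entry.hex(), unpack_label_entry(entry))
```

With only three bits available for class of service, there are just eight distinct values to work with, which is why squeezing a six-bit field into them later proved awkward.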

Route account

Which brings us to traceroute. Unlike the presumed reader of “Traceroute isn’t real,” we knew how traceroute worked, and we considered it an important tool for debugging. There is a very easy way to make traceroute operate over any sort of tunnel, since traceroute depends on packets with short TTLs getting dropped due to TTL expiry.

You copy the IP TTL into the label header as the packet enters the tunnel (when the label header is added); decrement the TTL in the outer label header at every hop; and then copy the outer TTL back to the inner header (IP TTL) when exiting the tunnel. This means that the TTL does exactly what it would have done if there were no tunnel, and if it was going to expire mid-tunnel, that is what happens.
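
The copy-in, decrement, copy-out procedure just described can be modelled with a toy Python function. It is a sketch under simplifying assumptions: `tunnel_hops` (a name invented here) simply counts the routers that decrement the outer TTL, and nothing else about real forwarding is modelled.

```python
def uniform_tunnel(ip_ttl: int, tunnel_hops: int):
    """Model TTL handling when the tunnel is transparent to TTL.

    Returns ("expired", hop_number) if the TTL runs out inside the
    tunnel, or ("delivered", ip_ttl_on_exit) otherwise.
    """
    if ip_ttl < 1:
        raise ValueError("packet must enter the tunnel with TTL >= 1")

    label_ttl = ip_ttl                # ingress: copy IP TTL into the label header
    for hop in range(1, tunnel_hops + 1):
        label_ttl -= 1                # every hop decrements the outer TTL
        if label_ttl == 0:
            return ("expired", hop)   # dropped mid-tunnel, as without a tunnel
    return ("delivered", label_ttl)   # egress: copy outer TTL back to the IP header


if __name__ == "__main__":
    print(uniform_tunnel(ip_ttl=64, tunnel_hops=4))
    print(uniform_tunnel(ip_ttl=2, tunnel_hops=4))
```

A packet entering with TTL 64 leaves a four-hop tunnel with TTL 60, and a traceroute probe entering with TTL 2 expires at the second hop inside the tunnel, exactly as it would on a plain IP path of the same length.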

There is the small matter of what to do with your “ICMP time exceeded” message in the middle of a tunnel, which RFC 3032 explains in detail. In other words, MPLS doesn’t prevent traceroute from working. Interestingly, the earlier tunneling protocol GRE allows the same treatment as MPLS but doesn’t require it (ie, GRE can break traceroute, or not).

But there is another twist to this story.

ISPs didn’t love the fact that random end users can get a picture of their internal topology by running traceroute. And MPLS (or other tunnelling technologies) gave them a perfect tool for obscuring the topology.

First of all you can make sure that interior routers don’t send ICMP time exceeded messages. But you can also fudge the TTL when a packet exits a tunnel. Rather than copying the outer (MPLS) TTL to the inner (IP) TTL on egress, you can just decrement the IP TTL by one. Hey presto, your tunnel looks (to traceroute) like a single hop, since the IP TTL only decrements by one as packets traverse the tunnel, no matter how many router hops actually exist along the tunnel path. We made this a configurable option in our implementation and allowed for it in RFC 3032.
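
To see why the tunnel collapses to a single hop from traceroute's point of view, here is a small Python simulation comparing the two egress behaviours. It is a sketch, not a description of any particular implementation: the function and parameter names are invented, and in the hidden case it assumes the label TTL is set to 255 at ingress so that probes do not expire inside the tunnel, a detail the text above does not spell out.

```python
HIDDEN_LABEL_TTL = 255  # assumption of this sketch, not stated in the text


def probe_reaches_destination(probe_ttl, before, inside, after, hide_tunnel):
    """True if a probe sent with `probe_ttl` survives every router.

    before / after: plain IP router hops on either side of the tunnel.
    inside: router hops that decrement the label TTL within the tunnel.
    """
    ttl = probe_ttl
    for _ in range(before):
        ttl -= 1
        if ttl == 0:
            return False

    label_ttl = HIDDEN_LABEL_TTL if hide_tunnel else ttl
    for _ in range(inside):
        label_ttl -= 1
        if label_ttl == 0:
            return False

    if hide_tunnel:
        ttl -= 1             # egress just decrements the IP TTL by one
        if ttl == 0:
            return False
    else:
        ttl = label_ttl      # egress copies the outer TTL back

    for _ in range(after):
        ttl -= 1
        if ttl == 0:
            return False
    return True


def apparent_router_hops(before, inside, after, hide_tunnel):
    """Number of probes that expire in transit before one gets through."""
    for probe_ttl in range(1, 256):
        if probe_reaches_destination(probe_ttl, before, inside, after, hide_tunnel):
            return probe_ttl - 1
    raise RuntimeError("destination not reachable within 255 hops")


if __name__ == "__main__":
    print(apparent_router_hops(before=2, inside=4, after=1, hide_tunnel=False))
    print(apparent_router_hops(before=2, inside=4, after=1, hide_tunnel=True))
```

With two routers before a four-hop tunnel and one after it, the transparent setting shows seven router hops while the hidden setting shows four: the whole tunnel is counted once.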

We also had an internal joke about giving ISPs the option to increment the TTL on egress, so that a tunnel would appear to have negative hop count. No-one wanted their network looking inefficient by having too many hops. (This is a terrible idea given the real purpose of TTL in discarding looping packets, but we had a good laugh anyway.)

Anyway, the non-support of traceroute over tunnels is a choice by operators, not a baked-in feature/bug of MPLS (or other tunnel technologies).

There is plenty more to this story, such as how we came to think of labels as a stack, but that can wait for another time. Part of me wishes we hadn’t worked so hard to keep the minimal MPLS label header down to 32 bits. But we didn’t break traceroute except for ISPs who wanted it broken, and we managed to deploy MPLS into the networks of almost every ISP without them complaining about the label tax.

We didn’t get everything right by any means but we made a set of trade-offs that worked for most of our stakeholders. ®