NoQ: back-pressured/blocking IP networks

The context/the issue

Congestion, or more generally the management of network resources, is currently handled in IP networks by the following feedback loop:

  1. excess packets are dropped at congested points
  2. TCP senders detect drops, and react by throttling down
This scheme works great in almost all networks. However, it seems everyone wants to run TCP/IP on every network ever invented, even the weirdest ones, and in some of them this approach has shortcomings. For instance, in fast long-distance networks (so-called "grid" networks), it tends to waste expensive link capacity.

NoQ proposal

NoQ, for "Network of Queues", is an experiment on back-pressured or "blocking" IP networks, as opposed to the "dropping" IP networks you already know. The linux code available below allows disabling of TCP's congestion control on a network interface basis. Every TCP socket routed through interface(s) designated by the user (you) will send packets at wire-rate, ignoring TCP's Congestion WiNDow (cwnd). Protection against overflow relies instead on Ethernet PAUSE frames sent by congested receivers or switches, back-pressuring upstream queues, up to the sender if needed. With the standard off-the-shelf hardware I used so far, this is surprinsingly working very well. Thanks to a few linux patches, I get constant wire-rate and no packet drops at all.
NoQ is admittedly very close to a scheme called "hop by hop flow control". The main points of NoQ are:
Focus on buffers, and not on hops anymore
When each network node ("hop") had only one big shared memory, there was not much difference, but things have changed a bit. For instance, since some link-layer devices started to perform (transparent) routing, networks have become a complex mess from a buffering point of view (among other points of view), so overflow protection can no longer confuse buffers with "hops". The name "Network of Queues" is a reminder of that.
Stay as close to TCP/IP as possible
Congestion management is far from being the only issue in a network. So let's stay modest, not reinvent all these wheels, and tweak only congestion management. As a nice side effect, applications will not even notice the change (most never knew what a network was anyway).
NoQ surely does not scale to networks of millions of nodes. But there are a couple of other, smaller networks out there besides the Internet.

More details - publication

For detailed results, appropriate references, the hardware I used, and more generally to understand what this is all about, please read the 6-page paper I will present at PDPTA'04 in June 2004.

Erratum on the PDPTA'04  paper
At the end of section II-C, the paper says that congestion inside the Linux sender host is not handled like network congestion, and that the congestion window is not decreased. This is partly false. It is still true that congestion inside the host is handled much more efficiently than packet drops in the network: it is detected immediately rather than after a round-trip time, and the pseudo-dropped packet is retransmitted before anything else. But the congestion window is reduced just as for any other external packet drop. So the reason for the lack of fairness is not clear yet (the sentence about this in the middle of section IV-A should also be corrected). Another consequence is that the congestion window reduction events in figures 4, 5 and 7 are still present. They are harmless and invisible thanks to a queue at the lower IP level in the sender, which is big enough to absorb the window reductions.

Since my patches make TCP simply ignore the congestion window, this whole issue becomes irrelevant once they are applied.

Make it run

In this directory you will find the Linux patches needed to reproduce the experiments described in the paper above. These patches are relatively short, and hopefully easy to read. If not, please complain.
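For reference, the build procedure is just the usual 2.4.x patch-and-build cycle. Here is a rough sketch; the patch file name and source path below are only placeholders for whatever you actually downloaded and use:

  # Rough sketch of the usual Linux 2.4.x patch/build cycle.
  # "noq.patch" and the paths are placeholders, not real file names.
  cd /usr/src/linux-2.4.26
  patch -p1 < /path/to/noq.patch   # apply the NoQ patch you downloaded
  make menuconfig                  # enable e1000 as a module (bcm5700 is
                                   # built separately from Broadcom's source)
  make dep && make bzImage && make modules
  make modules_install
  # then install the new bzImage, update your boot loader, and reboot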

Supported hardware & software

Patches are currently available for Linux IPv4, versions 2.4.25 and 2.4.26, and for the e1000 and bcm5700 gigabit Ethernet drivers, supporting the Intel PRO/1000 (8254x) and Broadcom Tigon3 (BCM57xx) chipsets used by many manufacturers including 3Com. Type lspci to find out which chipset you have.
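For instance, a quick way to check (I am only showing the command, not any particular output):

  lspci | grep -i ethernet    # look for an Intel 8254x (e1000) or a
                              # Broadcom BCM57xx (bcm5700) controller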
I am working on the 2.4 branch and not on 2.6 because 2.4 runs fine and that branch moves fast enough for me. I do not know yet for sure whether this work can easily be redone for 2.6. If the changes in the TCP code are not too numerous (which I suspect is the case), porting to 2.6 should take less than a couple of days.
You will also notice that I do not support the tg3 driver from the Linux developers, but instead the alternative bcm5700 driver from Broadcom, which the Linux developers rejected. The main reason is that I did not find it convenient to go into the computer room to manually reboot frozen machines, which is especially painful when the computer room is 100 km away. Things may change in the future... On the other hand, downloading the bcm5700 driver source from Broadcom's flashy web site is a bit painful. You can also get it from Debian.

If you are interested in testing this code in a configuration not too different from the one above, please drop me an email; I may be able to do something quickly.

Instructions

  1. Apply the patches found here, compile and install kernels, reboot, etc.
  2. Enable Ethernet flow control at both ends of every link of your gigabit Ethernet network. Check the relevant documentation (and see the sketch after this list).
  3. When loading the e1000 (resp. bcm5700) module, use the NoQ command-line option DisableCwnd=1 (resp. disable_cwnd=1) to disable TCP's congestion control for all sockets routed to this interface. As a safety measure, before disabling TCP's cwnd the driver checks at "up" time that flow control has successfully been enabled on the link. Detailed logs and warnings are written to /var/log/kern.log. Sometimes the driver cannot really tell the state of 802.3x, so you can force disabling with DisableCwnd=2.
  4. Play with your usual applications, noticing in "netstat -s" that you no longer drop packets, even under congestion.
  5. Please send me anything you observed!
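For the record, here is roughly what steps 2 to 4 could look like on a host (the switch side has to be configured separately). The interface name eth1 is only an example, ethtool is just one common way to enable flow control (your driver may want something else, hence "check the relevant documentation" above), and the exact kernel log wording depends on the patch:

  # Load the patched driver first, requesting cwnd disabling, so that the
  # interface exists before configuring it (bcm5700: disable_cwnd=1).
  modprobe e1000 DisableCwnd=1

  # Enable 802.3x flow control on the link (do the same on the switch port
  # or on the peer NIC), then bring the interface up as usual.
  ethtool -A eth1 rx on tx on
  ifconfig eth1 up
  ethtool -a eth1                # check the pause parameters
  tail -n 20 /var/log/kern.log   # the driver logs its "up"-time decision here

  # Run your usual applications, then check that nothing is dropped or
  # retransmitted anymore.
  netstat -s | grep -i retrans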
Limitation
Only "flat" Ethernet networks are currently supported, you should not send cwnd-disable'ed traffic across routers for the moment (except in specific experimental configurations, see the publication above).
Warning 1
Do not disable TCP's congestion window on an Ethernet production network that you share with other users, unless the network administrator is a really good friend of yours.
Warning 2
I highly recommend the "dual-networking" approach adopted by most clusters: on one hand a gigabit workload network, and on the other hand a cheap and reliable 10-100 Mb/s control network for rsh/ssh, etc.
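To make this concrete, here is a minimal sketch of such a setup; the interface names and addresses are completely made up:

  # Hypothetical dual-network layout:
  #   eth0 = cheap 10/100 Mb/s control network (rsh/ssh, administration)
  #   eth1 = gigabit workload network (flow control on, cwnd disabled)
  ifconfig eth0 192.168.1.10 netmask 255.255.255.0 up
  ifconfig eth1 10.0.0.10 netmask 255.255.255.0 up
  route -n    # check that each subnet goes through the intended interface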

Testing 802.3x

802.3x (Ethernet flow control/PAUSE frames) is a well standardized but seldom used technology, so you will probably discover many quirks when trying to use it, as I did.

The two patched drivers also provide a "receive_busywait" command-line option, which allows "downgrading" a gigabit receiver to any lower throughput. This is a dirty but very convenient hack to create artificial congestion and thus test flow control. This hack is made possible by NAPI, which removed the intermediate "backlog" queue in packet reception.

iperf in UDP mode is a nice tool to check that 802.3x flow control is working correctly, since it reports packet losses in a convenient way. To avoid drops inside the UDP sender itself, just ensure that your socket send buffer (SND_BUF) is smaller than the txqueuelen of your interface (see man ifconfig). Iperf is a nice tool anyway.
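For example, something along these lines; the host name, rate and buffer sizes are only examples, and the receiver can optionally be throttled with the receive_busywait driver option described above to create congestion:

  # On the receiver (possibly "downgraded" with receive_busywait):
  iperf -s -u -i 1

  # On the sender: keep the UDP socket buffer (-w) small enough that the
  # interface queue (txqueuelen, counted in packets) can absorb it, so that
  # back-pressure blocks the sender instead of dropping locally.
  ifconfig eth1 txqueuelen 1000
  iperf -c receiver-host -u -b 950M -w 64K -i 1
  # The server-side report shows lost/total datagrams: it should stay at 0.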

To be continued

This investigation is very far from complete; work on routers and other non-Ethernet issues is going on. Please read the paper above for perspectives.

Who did this?

I am a Ph.D. intern in the INRIA RESO team, and I am co-sponsored by SunLabs Europe. I am currently looking for a job.

Thanks in advance for sending your feedback to marc.herbert@free.fr.


$Revision: 1.3 $ Last modified: Tue Jan 11 23:52:11 CET 2005