NoQ: back-pressured/blocking IP networks
The context/the issue
Congestion, or more generally the issue of network resource
management, is currently solved in IP networks by the following
feedback loop:
- excess packets are dropped at congested points
- TCP senders detect drops, and react by throttling down
This scheme works great in almost all networks. However, it seems
that everyone wants to run TCP/IP on every network ever invented,
even the weirdest ones. In some of them, this approach has
shortcomings. For instance, in
fast long-distance
networks (so-called "grid" networks),
it tends to waste expensive link capacity.
NoQ proposal
NoQ, for "Network of Queues", is an experiment on back-pressured or
"blocking" IP networks, as opposed to the "dropping" IP networks
you already know. The linux code available below allows disabling
of TCP's congestion control on a network interface basis. Every
TCP socket routed through interface(s) designated by the
user (you) will send packets at wire-rate, ignoring TCP's
Congestion WiNDow (cwnd). Protection against overflow relies instead on
Ethernet PAUSE frames sent by congested receivers or switches,
back-pressuring upstream queues, up to the sender if needed. With
the standard off-the-shelf hardware I used so far, this is surprinsingly
working very well. Thanks to a few linux
patches, I get constant wire-rate and no packet drops at all.
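One way to picture the difference: on a standard "dropping" network,
congestion shows up as lost segments and TCP retransmissions, whereas on a
NoQ link it should show up as PAUSE frames only. The commands below are only
an illustration of where to look for each symptom; the interface name eth1
and the exact counter names are assumptions that depend on your driver
version, not guaranteed output.

    # dropping network: congestion surfaces as TCP retransmissions
    netstat -s | grep -i retrans

    # blocking (NoQ) network: congestion surfaces as 802.3x PAUSE frames,
    # counted by the NIC (counter names vary across drivers and versions)
    ethtool -S eth1 | grep -i flow_control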
NoQ is admittedly very close to a scheme called "hop-by-hop
flow control". The main points of NoQ are:
- Focus on buffers, and not on hops anymore
- When each network node ("hop") had only one big shared
memory, there was not much difference, but things have changed a
bit. For instance, since some link-layer devices started to
perform (transparent) routing, networks have become a
complex mess from a buffering point of view
(among other points of view),
so overflow protection can no longer confuse buffers
with "hops". The name "Network of Queues" is a reminder
of that.
- Stay as close to TCP/IP as possible
- Congestion management is very far from being the only issue in a
network. So let's stay modest: let's not reinvent all those wheels,
and tweak only congestion management. As a nice side effect,
applications won't even notice the change (most
never knew what
a network was anyway).
NoQ surely does not scale to networks of millions of nodes. But
there are a couple of other, smaller networks out there besides the
Internet.
More details - publication
For detailed results,
appropriate references, the hardware I used, and,
more generally, to understand what this stuff is all about,
please read the 6-page
paper
I will present at
PDPTA'04
in June 2004.
Erratum on the PDPTA'04 paper
At the end of section II-C I wrote: "congestion inside the Linux sender
host is not handled like network congestion, and the congestion
window is not decreased". This is partly false. It is still
true that the handling of congestion inside the host is much
more efficient than packet drops in the network: it is
detected immediately and not after a round-trip time, and the
pseudo-dropped packet is retransmitted before anything else. But
the congestion window is reduced just like for any other,
external packet drop. So the reason for the lack of fairness is not
clear yet (the sentence about this in the middle of section IV-A should
also be corrected). Another consequence is that the congestion
window reduction events in figures 4, 5 and 7 are still
present. But they are harmless and invisible thanks to
a queue in the sender at the lower IP level, and the fact that
this queue is big enough to compensate for the window reductions.
Since my patches make TCP simply ignore the congestion window, this
whole issue becomes irrelevant once they are applied.
Make it run
You will find in this directory the Linux
patches needed to
reproduce the
experiments described in the paper above. These patches are
relatively short, and hopefully easy to read. If not, please
complain.
Supported hardware & software
Patches are currently available for Linux IPv4, versions 2.4.25
and 2.4.26, and
for the e1000 and bcm5700 gigabit Ethernet
drivers, supporting the
Intel PRO/1000
(8254x) and
Broadcom Tigon3 (BCM57xx) chipsets used by many manufacturers
including 3Com. Run lspci to find out which chipset you
have.
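For instance (the grep is only a convenience filter, adjust it to taste):

    lspci | grep -i ethernet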
I am working on the 2.4 branch and not on 2.6 because 2.4 runs
fine and this branch moves fast enough for me. I don't know
yet for sure whether this work can easily be redone for 2.6.
If the changes in the TCP code are not too numerous
(which I suspect), porting to 2.6 should take less than a
couple of days.
You will also notice that I don't support
the tg3 driver from the Linux developers, but instead the
alternative bcm5700 driver from Broadcom that the
Linux developers rejected. The
main reason is that I did not find it convenient to go into the
computer room to
manually reboot frozen machines, which is especially painful when the computer
room is 100 km away. Things may change in the future...
On the other hand, downloading the bcm5700 driver source from
Broadcom's flashy web site is a bit painful. But you can also
get it from Debian.
If you are interested in testing this code in a
configuration not too far from the one above, please
drop me an email; I may
do something quickly.
Instructions
- Apply the patches found here, compile and
install kernels, reboot, etc.
- Enable Ethernet flow control at both ends of every link of
your gigabit Ethernet network. Check relevant documentation.
- When loading the e1000 (resp. bcm5700)
module, use the NoQ command-line option
DisableCwnd=1 (resp. disable_cwnd=1) to
disable TCP's congestion control for all sockets routed to this
interface (see the example commands after this list). As a safety
measure, before disabling TCP's cwnd the driver checks
at "up" time that flow control has actually been enabled on the
link. Detailed logs and warnings are provided in
/var/log/kern.log. Sometimes the driver cannot really tell
the state of 802.3x, so you can force disabling with
DisableCwnd=2.
- Play with your usual applications, noticing in "netstat -s"
that you no longer drop packets, even under congestion.
- Please send me
anything you observed!
- Limitation
- Only "flat" Ethernet networks are currently
supported, you should not send cwnd-disable'ed
traffic across routers for the moment (except in specific experimental
configurations, see the publication above).
- Warning 1
- Do not disable TCP's congestion window on
an Ethernet production network you share with other users, unless the
network administrator is a really good friend of yours.
- Warning 2
- I highly recommend the "dual-networking" approach
adopted by most
clusters: on one hand a gigabit workload network,
and on the other hand a cheap and reliable 10-100 Mb/s
control network for rsh/ssh, etc.
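For illustration, bringing up a NoQ-enabled e1000 node could look roughly
like the lines below. This is a sketch, not a verified transcript: eth1 and
the address are made up, and depending on your ethtool and driver versions
you may have to enable 802.3x flow control through driver-specific options
rather than ethtool -A.

    # load the patched driver with TCP's congestion window disabled
    # (the bcm5700 driver uses disable_cwnd=1 instead)
    modprobe e1000 DisableCwnd=1

    # make sure 802.3x PAUSE frames are enabled on this end of the link
    # (remember to do the same on the switch or on the peer)
    ethtool -A eth1 autoneg on rx on tx on

    # bring the interface up and check the driver's verdict on flow control
    ifconfig eth1 10.0.0.1 up
    grep -i e1000 /var/log/kern.log | tail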
Testing 802.3x
802.3x (Ethernet flow control/PAUSE frames) is a well-standardized
but seldom-used technology, so you will probably
discover many quirks when trying to use it,
as I did.
The two patched drivers also provide a "receive_busywait" command-line
option, which allows "downgrading" a gigabit receiver to
any lower throughput. This is a dirty but very
convenient hack to create artificial congestion and thus test flow
control. This hack is made possible by
NAPI,
which removed the intermediate "backlog" queue on the packet
reception path.
iperf in
UDP mode is a nice tool to check that 802.3x flow control
is working correctly, since it reports packet losses in a
convenient way. To avoid drops inside the UDP sender,
just make sure your socket send buffer
(SO_SNDBUF) is smaller than the txqueuelen of
your interface (see man ifconfig).
Iperf is a nice tool anyway.
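As a sketch of a test session (the receive_busywait value, host name, rate
and buffer size are all made up; the point is just a UDP stream faster than
the throttled receiver, sent from a socket buffer small enough that the
sender itself cannot drop):

    # receiver: load the patched driver artificially slowed down, then listen
    # (the value 100 is a placeholder; the patch source defines its meaning)
    modprobe e1000 receive_busywait=100
    iperf -s -u

    # sender: keep the transmit queue well above the socket send buffer
    # (64 KB is roughly 43 full-size Ethernet frames, far below 1000 packets)
    ifconfig eth1 txqueuelen 1000
    iperf -c receiver -u -b 900M -t 30 -w 64K

    # without working 802.3x the server-side report shows lost datagrams;
    # with it, losses should stay at zero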
To be continued
This investigation is very far from complete; work on routers and
other non-Ethernet issues is ongoing. Please read the paper above
for perspectives.
Who did this?
I am a Ph.D. intern in the INRIA RESO team,
and I am co-sponsored by SunLabs Europe.
I am currently looking for a job.
Thanks in advance for sending your feedback to marc.herbert@free.fr.
$Revision: 1.3 $
Last modified: Tue Jan 11 23:52:11 CET 2005