Digital Communication II Explanations

During my time supervising the University of Cambridge Computer Laboratory Part II Digital Communications II course, various questions or confusions arose. Here is an in depth explanation on the issues raised. To the best of my knowledge it is correct, but if in doubt, ask the lecturer.

Scalability of Routing Protocols

Link state protocols maintain state about all the connections in the network. This enables them to run Djikstra's algorithm and arrive at the shortest route to a packet's destination. Once the network has settled down ("converged"), the number of updates between link state routers should theoretically be small, as the only time an update is made is when a route goes down or comes back up. Convergence is fast as these updates are broadcast to every router throughout the network. However, two problems arise. The first is that with a large and complex topology, there will be an enormous amount of information to store and process -- this tends to become infeasible when we're talking about the entire Internet. The second is that we cannot guarantee that the update packets from any particular router reach any or all of the rest in the network (e.g. if a link goes down an update packet might get lost). Hence there has to be periodic refresh of the soft state that specifies what links are functioning. Such refreshes are costly in terms of bandwidth.

Distance-Vector protocols cause each router to advertise what destinations they are aware of, and what the metric (be it latency, hop count, etc.) is to that destination from them. The routing table is a list of destinations, each with the next hop the packet is to be forwarded to, and the metric. A route is therefore made up of the composition of the next hops from each of the routers that are on the route, in essence "distributed Djikstra". The storage and processing overhead is less than link state, making it more suited to complex networks. Convergence times for D-V are higher than link state (e.g. several hours for BGP) due to the information being propagated out by normal routing updates, cascading through the routers. D-V routers regularly transmit the entire contents of their routing table to their immediate neighbours (not the entire network). As a concrete example, BGP is a D-V protocol, and is hard state, i.e. only updates are transmitted, and these over a TCP connection, where packets are guaranteed not to get lost. Hence the amount of traffic generated is not actually that great. Another good reason for BGP to be D-V is that policy information cannot be conveyed in link state protocols: the links are either up or down. Providers do not want to disclose their policies, and therefore do not broadcast them (as would be needed by some sort of link state protocol), instead choosing to publish metrics and destinations that they can reach. BGP provides a mechanism for using policies which are highly necessary in today's Internet.

The exam question that I set actually has a model answer that says it could be argued both ways -- see above. My general feeling would be to say that link state is not used in the core, and therefore is not as scalable, but provided you outline clearly your reasons (to show that you know what you're talking about), I think you'd have got the marks anyway.

TCP Vegas: Doesn't Bandwidth Decrease to Zero?

TCP Vegas is a modification to TCP to attempt to smooth the value of the congestion window, cwin. This is done by comparing the bandwidth currently being achieved, with the maximum theoretically achievable (calculated using the best RTT that the sender has seen on this connection). If the difference between the two is large, i.e. achieved is much lower than maximum, Vegas reduces the size of cwin.

This has lead to the belief that Vegas would end up reducing cwin to zero (!). In practice, Vegas calculates the difference by:
diff = cwin/baseRTT - cwin/currRTT
Where baseRTT is the minimum ever achieved, and currRTT is the most recently measured RTT.
In this way if cwin is decreased, the difference will also decrease (as baseRTT is smaller than currRTT, and therefore given that the fractions have common numerators, the first will be decreased by a greater amount than the second). Hence cwin is stabilised.

The Packet-Pair Model

This is used to provide an estimate for the available bandwidth on a network. Simplistically, this involves sending two packets, with zero (or very little) spacing between them in time, and at the destination examining the spacing between them when they arrive. The spacing becomes greater as traffic from other flows on the network is "mixed in". Hence, the magnitude of the gap gives an indication of network load.

Unfortunately, the relationship between gap length and available bandwidth is complicated. Hu and Steenkiste's paper "Estimating Available Bandwidth Using Packet Pair Probing" is worth looking at for more details.

Routing in ATM Networks

IP can be routed over ATM using an ATM Adaptation Layer (generally AAL5), which is then carried as usual on circuits with VCIs. To set up a VCI, the ATM network must ascertain what switches must be traversed in order to reach the final IP destination.

ATM routing itself is generally done using the Private Network-Network Interface (PNNI). This is a hierarchical summarisation protocol, that assigns each ATM switch to a peer group. A field similar to IP's subnet mask enables the peer group ID to be derived from the ATM switch's node ID. The lower the peer group ID, the higher in the hierarchy it is. Routing is similar to that used in the Open Shortest Path First (OSPF) protocol.
Cisco's explanation of PNNI may be useful.

When a VCI is to be established, the IP address must be resolved to an ATM node ID. This takes place in a similar fashion to the ARP on ethernet, where if the destionation node is within the same Local IP Subnet (LIS), then a request is sent to the network to obtain the destination node ID. Because ATM has no notion of broadcast, a dedicated server must exist per LIS that will resolve IP addresses to ATM node IDs. The protocol for these requests is ATMARP (ATM Address Resolution Protocol), along with inATMARP (inverse ATMARP). If the destination is not on the LIS, then there must be a dedicated IP router that connects two or more LISs that can route the call accordingly.
RFC 1577 describes this in more detail.

The 3*MSS in Fast Recovery

Bear in mind that if the cwin is too small or the loss it too large you may not have 3 packets to use to re-synchronise your link's clocking, and therefore you will need to decrease your cwin to 1 MSS and slow start up to ssthresh.

The Need for Duplicate ACKs to "Clock the Link"

When fast recovery takes place, the congestion window (cwin) size decreases to half of what the previous send window was, plus 3*MSS (as mentioned above). The slow start threshold (ssthresh) is also set to half the previous send window, i.e. once the missing packet(s) has/have been transmitted, the connection will already be above the threshold below which slow start takes place, and instead will continue linear increase, probing the link to find at what size of cwin the congestion begins.

By only descending to the previous swin/2, the assumption is being made that the congestion is not significant enough to cause packet loss with this low a cwin. This assumption can be justified if 3 duplicate ACKs are received, as this indicates a transient packet loss, rather than a sudden and long term change in the congestion of the link. If the duplicate ACKs are not received, it is more likely that the link has suddenly become severely congested compared to what it was. Therefore we must assume that the new value of cwin that we must now find could be much lower than the previous swin/2, hence we set cwin = MSS.

Another interesting point stemming from this is that we also set cwin = MSS if a TCP connection has not been used for a significant period of time. Were we to use the old value of cwin we might find that the congestion conditions had substantially changed, hence we "start from scratch" and slow start up.

The usage of the duplicate ACKs to infer whether the congestion on the link has changed substantially is "clocking the link". The origin of this is the fact that the sender is aware of when to put another packet out onto the link by when an ACK is received from the destination. Hence the ACKs "clock" the link by signalling the sender.

Switch Utilisation -- Karol 1987 Paper

The 58% value for the switch utilisation is calculated assuming that the number of switch ports (N) is infinite. An Appendix in the paper (claims to!) show(s) that the steady-state rate of arrival of new packets at input queues which are not blocked on waiting for the switch fabric, becomes Poisson-like. This enables an expression to be derived for the average number of packets that are waiting in each queue to pass through the fabric, in terms of the throughput.

We then note that F, the mean steady-state number of input queues that are free (i.e. not blocked on waiting for the fabric), divided by N, gives the throughput (p).
Using these two equations we can show that:
1 - p = (p^2)/2(1-p)
which solves to give
p = 2 - sqrt(2) = 0.586, i.e. about 58%.

If you're really interested I've got a printed copy of the paper (the DTG has an online subscription the relevant digital library).

Explicit Congestion Notification (ECN)

Explicit Congestion Notification (ECN) is used by routers to indicate to transmitters that they are causing congestion on the network. The mechanism is that a packet or packets on the offending flow have their ECN bit set by the router as they pass through it. When these packets reach the receiver, the ACK that is sent back also has the ECN bit set.

In the DECbit version of ECN (now obsolete), if the transmitter received more than 50% of its ACKs with the ECN bit set, it would perform multiplicative decrease of its congestion window (i.e. react in the same way as it would do to a packet loss when using fast retransmit and fast recovery).

In the newer version of ECN, (RFC 3168), this is changed slightly in that a router need only set the ECN bit in a single packet (and hence, the receiver, in a single ACK), to cause the transmitter to perform multiplicative decrease.

The advantage to ECN over timeouts is that transmitters are notified much more quickly of their causing congestion, whilst not actually requiring a wait for a timeout, or retransmissions.

CSMA/CD vs CSMA/CA

RTS/CTS on Wireless Networks

Crossbar Switches and Clos Networks

In a crossbar switch, each input has a wire which has an interconnection (cross point) with every other output. A diagram essentially looks like n input wires running vertically crossing n output wires that run horizontally. At any one time one cross point on any one input wire (and any one output wire) can be active. This means that the switch is fully connected, but evidently does block when (e.g.) two inputs try to send a packet to the same output simultaneously.

In switches with a large number of inputs and outputs, the number of cross points becomes very large, and hence they take up a significant amount of space/power. A Clos network is an arrangement of crossbar switches (crossbars) that requires fewer cross points but still achieves fully-connectedness. Evidently, the are still subject to blocking. Wikipedia's article on Clos networks explains this further.

Banyan Switches

In contrast to crossbar switches, Banyan switches (so called because they look a little like the paths made by the roots of a Banyan tree) use Beta elements. These consist of two inputs and two outputs, with a cross connect between them. The cross connect can either be set in the cross or bar configurations, i.e. either connecting input 0 to output 0, and input 1 to output 1, or input 0 to output 1, and input 1 to output 0. Some buffer memory must also be present for packets arriving at the inputs that do not fit the current state of the cross connect. The switch is non-blocking, so we do require buffer memory. This can be solved somewhat by using a Batcher stage before the Banyan switch (see below). For information on how the Beta elements are connected together, see the Packet Switching page of the networks course at Tel Aviv University (warning: Java applet present).

Batcher-Banyan Switches

Knockout Switches

Using Multiple Hash Tables for Longest Prefix Matching

Also known as "How do I do 2004/8/3 part (b)(iii)?"
Essentially, you just need to have a hash table per prefix length. This can be further optimised, as explained in this rather nice paper on using hash tables for routing at SIGCOMM 1997.