Monthly Archives: September 2005

Back to the Future

After a week working on customer projects I got back to kernel hacking, initially going back in time some 20 years to merge an old LLC (802.2) series of patches I wrote in the 2.6.0-test1 days, forward ported by Jochen Friedrich. They also fixed some bugs that were preventing him from using an HP printer that only talks LLC2 (connection oriented) over Token Ring; now his CUPS backend works just fine 8)

Then back to the future to implement DCCPv6, which was rather easy as most of the infrastructure was already in place for TCP. I just had to tweak the timewait and request sock hierarchies to work just like the inet_sock one, i.e. having a pointer to where in the per-protocol slab objects the IPv6-specific bits are located. In fact I used an offset variable in struct inet_timewait_sock, not a pointer like struct inet_sock’s pinfo6, which I’ll probably convert to offsets too so as to save precious bytes on 64-bit architectures.

Ah, the LLC2 tests were done using openssh patched to work with PF_LLC sockets: ‘ssh -5 00:04:76:3B:53:C1’, and look ma, no TCP/IP 8)

DCCPv6 was tested using a supercharged, get{name,addr}info enabled ttcp tool, another one that looks like an animal at Jurassic Park 8)

Thinking out loud…

The DCCP specs emphasise the two “half connection” concept where most people see one connection, represented in Linux by a “struct sock”, which has a series of mechanisms (sk_lock, sk_backlog, timers) shared by both the TX and RX paths. I.e., if one is doing a TCP sendmsg, the lock has to be grabbed with lock_sock() (TX path), which in turn will make the RX path potentially put packets in the backlog, to be processed only after the TX path queues or sends the packet (struct sk_buff) down to the next layer (tcp_transmit_skb, for instance).

Yes, there is underlying queueing in the hardware, driver, qdiscs, etc., but following the DCCP recommendations I guess a packet on the RX queue that carries information (data and ack) for both the TX and RX half connections should really not be delayed on its way to the TX half connection just because there is a (potentially) big queue for the RX half connection. If the TX half connection is busy sending a packet (in the dccp_sendmsg -> ccid rate limiter -> dccp_transmit_skb path) it would have _another_ sk_backlog, and when the TX path finishes it would process its backlog _without_ waiting for any RX backlog processing. I.e., we would have one sk_tx_backlog/lock_tx_sock/release_tx_sock triple for the TX half connection and another for the RX half connection, increasing the parallelisation in the full connection (RX + TX half connections).

DATAACK packets, which are of interest to both half connections, would perhaps be shared skbs, sitting in two queues at the same time, using part of skb->cb as a next pointer for the second queue…

This gets more interesting in scenarios where the TX rate from A to B is roughly equal to the one from B to A, i.e. with mostly no quiescent half connection, where as much as possible one half connection wouldn’t step on the toes of the other.

A good chunk of struct sock/inet_sock would be shared (id lookup, etc.); other fields that exist today would be used for, say, the TX half connection, while others would be duplicated in dccp_sock. Hmm… time to look at the code and check if this is all nonsense…


As mentioned in Ian’s blog we did what may well be one of the first DCCP connections on the Internet, from .br to .nz, whee! Now I’m too bored to write more than these few lines, but tomorrow I’ll finish the timeoffset code, with the sk_buff tstamp converted to it and DCCP using it; let’s see if we get better timing calcs out of this.

Real Ramblings

So far I’ve been reporting what I had done, but now I’ll try to do what this blog title says: write ramblings about what I’m thinking about doing. Let’s start…

About the TCP pluggable congestion control and my intention of using it in DCCP for the CCIDs: one of the things I’ve read in the DCCP drafts was that one of the differences between DCCP and TCP is that the congestion decisions and access to the congestion state (cwnd, ssthresh, etc.) are not sprinkled all over the protocol definition, but clearly separated into the CCIDs, leaving to the core the things that are common to all CCIDs.

In the Linux implementation I basically took most of the TCP code not related to congestion control and made it generic, to be used by DCCP (any other INET transport level protocol can take advantage of this infrastructure, SCTP for instance), and got a CCID3 implementation (of RFC 3448) from a different tree, whose lineage goes back to FreeBSD. The way the TCP-like core code interacts with the CCIDs is modelled after this CCID3 implementation; now I’m thinking about how to proceed.

So far, in the pursuit of having the DCCP code look as much like the TCP equivalent code as possible, I got the sendmsg path all the way to the equivalent of tcp_write_xmit, which transmits as many skbs from sk_write_queue as the congestion control algorithm allows.

My current doubt is whether the right thing is to model all the congestion control algorithms to use variables (congestion window, etc.) accessed directly by the core TCP code, or whether the core should always ask the congestion control module in use by means of functions that aggregate the open-coded checks now sprinkled through the TCP code.

In the 2.4.20 DCCP implementation (incomplete, and for an old draft) by Juwen Lai, he implemented the concept of ->sendamap(), i.e. “send as much as possible”, which is roughly the equivalent of TCP’s tcp_write_xmit() but is a function pointer provided by the CCID (congestion control algorithm) being used for the TX half connection.

So in my private tree I’m experimenting with the concept of ccid->ccid_hc_tx_write_xmit(), which uses the concepts outlined in the last paragraph and fits well with the current CCID3 code, which hasn’t any cwnd variable but instead calculates the send rate allowed at any given point in the connection lifetime, implementing almost directly the equations in the CCID3 draft and in RFC 3448.

At some point I’ll try to investigate how this CCID architecture could be used by TCP, in search of the nirvana: a Grand Unified Congestion Control Infrastructure, the GUCCI 😉

Better Test Environment

Revived my A500 PARISC64 machine, upgrading it to the latest Debian unstable and installing a tg3 gigabit ethernet card. Now it is my noisy internal router, with the added bonus of being a big endian, 64-bit machine where we can test DCCP, both in the kernel and in the apps, like tcpdump, which after some tweaking seems to be working. Too bad the openssl library isn’t optimized for parisc64-linux (according to Grant on the #parisc-linux channel); git slows to a crawl because of that.

On the DCCP kernel front I’ve been working on CCID3, experimenting with different CCID infrastructure hooks, reading Juwen Lai’s old stack for 2.4.20.

Also found a bit of time to write ostra-mstats, which collects profiling data out of the data already probed. Now that I have a faster test environment the simplistic collector shows its deficiencies and gets in the way, so I have to move on to relayfs. When I get it working well with relayfs, steal some of the ccache ideas to make ostra become the preprocessor that it really is, get a web page in place for Rusty, etc., I’ll finally announce this toy. 🙂