So, I’m looking through the capture file, and I’m seeing a two-byte packet go out after it receives ACKs for the first 64k. Most of the time it’s including those two bytes with the final 128k ACK packet. In the cases where it’s only sending 64k at a time, those two bytes have not been included in the final ACK. As for why the server isn’t ACKing those two-byte packets in a timely manner, I’m really not sure.
Nice, Michael. That is indeed the pattern that triggers it. Clearly having that 2-byte segment unACKed causes the sender to stop after 64k+2 bytes and wait for a single ACK. Then it sends the second 64k chunk. It’s not clear to me why it pauses when it can send more than 64k bytes in the other times it sends 128k. But something about that 2-byte segment being unACKed causes it to pause. Good eye!
What’s the sending stack?
The PSH bit being set on the end of that 2-byte segment suggests to me that it falls at the end of a socket send() operation.
My suspicion is that there’s a zero-copy mechanism working inside the sending stack: On paper, ACKed data should be purged from the sender’s send buffer segment-by-segment as ACKs roll in, guaranteeing that new data from the sending application is always available for the stack to send.
Zero-copy mechanisms don’t copy data into the socket buffer. Instead, they say “hey, stack! here’s a pointer to a block of data I’d like you to send!” These mechanisms are nice in that they avoid the copy operation (speedy!), but they quickly expose tuning problems because the stack winds up working with very large chunks of data as atomic units. The stack can’t clear individual segments from its “buffer”, because it’s trading pointers to very large chunks of memory. The block of memory and the pointer are both busy until the last ACK for that chunk is received.
If there’s enough memory, and enough pointers, then we’ve effectively got a windowing mechanism (inside the stack) exactly like TCP’s byte-based windowing. You’ll never know that pointers to these large data chunks are cycling around inside the sender.
If these resources inside the sender are scarce, then you run into ugly business like this.
Sometimes the application is complicit in this scheme (google: “io completion ports” for an MS Win example), and sometimes not.
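The chunk-level bookkeeping described above can be sketched as a toy model. This is only an illustration of the hypothesis, not how any real stack is implemented: the class name `ChunkSender`, the `slots` limit, and the choice of two slots are all assumptions made up for the example. Each send() hands the model one chunk, and a slot is released only when every byte of that chunk has been ACKed.

```python
from collections import deque


class ChunkSender:
    """Toy model (hypothetical): one slot per send() call, freed only when fully ACKed."""

    def __init__(self, slots):
        self.slots = slots          # how many chunk pointers the sender can hold
        self.outstanding = deque()  # unACKed bytes left in each chunk, oldest first

    def submit(self, size):
        """Try to hand a chunk of `size` bytes to the stack; False means 'must wait'."""
        if len(self.outstanding) >= self.slots:
            return False
        self.outstanding.append(size)
        return True

    def ack(self, nbytes):
        """Apply a cumulative ACK of `nbytes` to the oldest chunks."""
        while nbytes > 0 and self.outstanding:
            take = min(nbytes, self.outstanding[0])
            self.outstanding[0] -= take
            nbytes -= take
            if self.outstanding[0] == 0:
                self.outstanding.popleft()  # slot is reusable only now

    def in_flight(self):
        return sum(self.outstanding)


sender = ChunkSender(slots=2)
print(sender.submit(65536))  # first 64k chunk takes a slot
print(sender.submit(2))      # the 2-byte tail takes the other slot
print(sender.submit(65536))  # no slot left: sender stalls at 64k+2 bytes
sender.ack(65536)            # ACKs for the first 64k arrive
print(sender.submit(65536))  # a slot is free again, so the next 64k can go
```

In this model the 2-byte chunk pins a whole slot until it is ACKed, so the sender stalls at 64k+2 bytes even though almost no data is outstanding in that slot.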
Very good info, thanks, Chris. The sending stack is Win7. Don’t know the exact version; hafta check. iperf is calling write() with 128k of data to a 64k send buffer. In my research, I’ve seen that the kernel will fudge a bit on the actual buffer size so I assumed this was why it would put more on the wire than the send buffer size. But perhaps the zero-copy behavior is why it will put 128k on the wire (except when the 2-byte segment is unACKed) when there’s only a 64k buffer. I’ll research io completion ports. Thanks for the tip!
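One way to see how much the kernel "fudges" the send buffer is to request a size and read back what was actually granted. The sketch below uses Python's standard `socket` module; the helper name `requested_vs_granted_sndbuf` is made up for illustration, and the result depends on the operating system (Linux, for example, typically reports about double the requested value to account for bookkeeping overhead, while other systems may report the value unchanged).

```python
import socket


def requested_vs_granted_sndbuf(requested):
    """Ask for a send buffer of `requested` bytes and return (requested, granted)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
        granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    finally:
        sock.close()
    return requested, granted


requested, granted = requested_vs_granted_sndbuf(64 * 1024)
print(f"requested {requested} bytes, kernel reports {granted} bytes")
```

This only shows the configured buffer size. It does not by itself explain why 128k reaches the wire with a 64k buffer; that still needs the capture and knowledge of the sending stack.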
Hi Kary,
Great post!!! Very helpful. Could you please share the second packet capture which you refer to in the video?
Best Regards,
Dear Kary,
Thanks. This is very helpful. Could you pls share the 2nd pcap? The Rx one.
Michael/Kary,
For the purpose of learning, could you provide a step by step explanation on how you came to that conclusion?
interesting… can you share what version of iperf & the params it was executed with? I guess what’s concerning to me (in terms of throughput performance) is that although the receive window hovers around ~1MB throughout the session, the client doesn’t come close to reaching that. maybe i’m missing something, but the two-byte behavior doesn’t seem overly “painful”, even for endpoints roughly 98ms apart.
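For a rough sense of scale, the throughput ceiling of a window-limited TCP flow is the amount of data in flight divided by the round-trip time. The sketch below assumes the 98ms figure is the round-trip time and treats 64k, 128k, and ~1MB as 64 KiB, 128 KiB, and 1 MiB; the function name `max_throughput_mbps` is just for illustration.

```python
def max_throughput_mbps(bytes_in_flight, rtt_seconds):
    """Upper bound on throughput (Mbit/s) when at most `bytes_in_flight` are unACKed."""
    return bytes_in_flight * 8 / rtt_seconds / 1e6


RTT = 0.098  # assumed round-trip time in seconds

for label, size in [("64 KiB", 64 * 1024), ("128 KiB", 128 * 1024), ("1 MiB", 1024 * 1024)]:
    print(f"{label:>8} in flight: {max_throughput_mbps(size, RTT):6.2f} Mbit/s")
```

Under those assumptions the ceiling is roughly 5.3 Mbit/s with 64 KiB in flight, 10.7 Mbit/s with 128 KiB, and 85.6 Mbit/s if the sender filled the full 1 MiB receive window, so the sender-side limit costs far more than the two-byte stall alone.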
Hi, Kary. I really like your video on YouTube. How can I contact you more directly? I visited your Facebook site, but it seems that it’s been a long time since you posted anything.