I had an interesting issue yesterday where I had to prove that a firewall was injecting TCP RST packets to drop active connections. The details of the firewall problem aren’t relevant, but in order to tie several packet traces together and prove beyond doubt that the mysterious RST packets were being injected and not coming from the end host I had to turn to a little-remembered field in the IP packet header — the identification field.
The firewall problem I was facing was seemingly random disconnects of stateful TCP applications like RDP, VNC, and other such protocols that use a single long-running TCP connection and are quite sensitive to a disruption of that connection. I suspected that the firewall was injecting a TCP RST packet to tear down the connection, but before I got involved the customer only had partial packet captures from one side or the other.
I set up what I like to call a “bookend” capture — capturing on either side of the device I suspect of being the problem. In this case, I ran a packet capture on the client host using Wireshark and another capture right on the ASA using its capture function. After a little while, I caught the symptom and quickly grabbed copies of the captures to analyze.
The issue with this was that the two devices didn’t have perfectly synchronized clocks and the capture filters and durations were different so I couldn’t match up a packet number. At some point I ended up trying to match contents of the payload but that didn’t help prove that the RST was being spoofed (other than the fact that the server received a RST packet that the client didn’t appear to send). To correlate these, I started looking at the ID field of the IP header.
The What Field?
Even many network geeks aren’t too familiar with the IP identification field. So let’s just check good ol’ RFC791 and see what it says:
The identification field is used to distinguish the fragments of one datagram from those of another. The originating protocol module of an internet datagram sets the identification field to a value that must be unique for that source-destination pair and protocol for the time the datagram will be active in the internet system.
Essentially, the ID field is just a serial number for each datagram, which is unique within that source/dest conversation. As stated above, it’s mainly used for fragmentation reassembly. The number itself is only pertinent to fragments of that packet, so the numbers need not have any relationship to one another. The IP stack could use prime numbers for each one. Or it could increment it by 731 on each packet. But because programmers are lazy and every programming language has a simple operator for “increment this counter by 1,” in practice the IP ID field usually starts at some “random” value and is just increased by 1 for each datagram sent to the same destination host (“random” is in quotes there because that initial value may be based on some pseudo-random seed value related to clock time, system uptime, or another source).
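If you want to see where the field actually lives, here is a minimal Python sketch that pulls the 16-bit identification value out of a raw IPv4 header (it sits at byte offset 4, in network byte order). The function name ip_identification and the hand-built sample header are my own inventions for illustration, and the addresses in it are placeholders, not values from any real capture.

```python
import struct

def ip_identification(header: bytes) -> int:
    """Return the 16-bit Identification field from a raw IPv4 header."""
    if len(header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes")
    if header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    # Bytes 4-5 hold the Identification field, big-endian (network order).
    (ident,) = struct.unpack_from("!H", header, 4)
    return ident

# Hypothetical 20-byte header built by hand for illustration only:
# version/IHL, TOS, total length, ID, flags/fragment offset, TTL,
# protocol (6 = TCP), checksum (left as 0), source, destination.
sample_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 11150, 0x4000, 64, 6, 0,
    bytes([192, 0, 2, 56]), bytes([192, 0, 2, 51]),
)

print(ip_identification(sample_header))  # 11150
```

In practice Wireshark shows this value for you in the IPv4 section of the packet details, so the sketch is only meant to show how little there is to the field.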
An interesting feature of the ID field in practice is that because it has to remain unique within a certain IP conversation and it isn’t useful for too much by itself, most NATs and firewalls don’t bother to diddle with it, and instead just copy it when performing a packet-munging operation (like NAT). This would only be a problem if multiple flows to the same destination host went through a PAT device and the sending hosts happened to have synchronized ID field counters. This is very unlikely because of that pseudo-random seeding value I mentioned above.
This means that the ID field can often be used to track packets through a NAT. That method was even used in this paper which describes ways to estimate the number of hosts behind a NAT using the ID field.
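As a rough sketch of that idea, the Python below pairs up packets from two captures by IP protocol number and ID value instead of by address or timestamp, since a NAT may rewrite the addresses and the clocks may not agree. The dictionary layout, the function name match_by_ip_id, and all of the packet numbers and IDs are made up for illustration; a real script would first export these fields from each capture.

```python
from collections import defaultdict

def match_by_ip_id(capture_a, capture_b):
    """Map packet numbers in capture_a to packet numbers in capture_b
    that carry the same (protocol, IP ID) pair."""
    index_b = defaultdict(list)
    for pkt in capture_b:
        index_b[(pkt["proto"], pkt["ip_id"])].append(pkt["num"])

    matches = {}
    for pkt in capture_a:
        key = (pkt["proto"], pkt["ip_id"])
        if key in index_b:
            matches[pkt["num"]] = list(index_b[key])
    return matches

# Hypothetical data: packet number, IP protocol (6 = TCP), and IP ID.
client_side = [
    {"num": 101, "proto": 6, "ip_id": 5000},
    {"num": 102, "proto": 6, "ip_id": 5001},
    {"num": 103, "proto": 6, "ip_id": 5002},
]
firewall_side = [
    {"num": 17, "proto": 6, "ip_id": 5001},
    {"num": 18, "proto": 6, "ip_id": 5002},
    {"num": 19, "proto": 6, "ip_id": 9999},
]

print(match_by_ip_id(client_side, firewall_side))  # {102: [17], 103: [18]}
```

Packets that appear on only one side (101 and 19 here) simply have no match, which is exactly the kind of thing worth a closer look. On a busy capture you would probably also want to key on direction or port numbers, since two unrelated hosts can land on the same ID by coincidence.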
OK, so I’ve babbled about what the ID field does and how we can use it (in theory). Let’s see how it works in the real world. First up, here was the capture from the client side, showing a TCP reset coming from the server. Please forgive all the blurring; I didn’t have time to recreate this so these are the actual captures from my customer and I wanted to at least try to minimize the likelihood of revealing who it was.
You can see in the above screenshot that the client (10.X.66.51) and server (10.X.1.56) seem to be talking happily, and then BAM! a RST from the server kills the connection. Let’s take a closer look at a couple of these packets.
Note in the above image that the highlighted packet from the server has an ID field value of 11150. Let’s look at a packet from the client. In this next one, we see the client’s ID field is 7274.
Next, we look at our RST packet. The ID field seems to align roughly with the previous packet (11153 vs. 11150) but the astute reader will notice that counting the number of server-to-client packets and comparing to the ID field suggests we’re missing one:
Indeed, packet #12762 had ID field 11150, packet #12764 had ID field 11151, but the RST packet #12766 had ID field 11153. What happened to ID 11152? Let’s look from the other side. This next screenshot is from the capture taken on the ASA, so it’s showing the view of things after the ASA has had a chance to see the client’s packets, and before it’s had a chance to see the server’s packets.
Ahh, here we see a RST packet that supposedly came from our client (10.X.66.51) to the server (10.X.1.51), even though I don’t remember seeing the client send one in the client’s trace. Let’s check the ID fields again.
In the above shot, we see an outgoing packet from server-to-client that has the same ID we saw on the client side, 11150. So that’s a good correlation.
Again, in the above shot we see the packet with ID 7274 coming from our client to the server. So we know we’re in the right spot. Now let’s check that phantom RST packet:
The ID field on the RST is 41871. What?! There’s almost no chance that came from the same host that was sending packets in the 72XX range. Just for good measure, let’s look at the packet that arrived from the client after the RST:
ID 7275, just like we’d expect. So the ID sequence from the client to the server went something like this: 7273, 7274, 41871, 7275. Hmmm….
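The eyeball test I just did can be sketched in a few lines of Python. This assumes the sender increments its ID by roughly 1 per datagram, that the first ID in the list is trustworthy, and that a jump of more than max_step (a threshold I picked arbitrarily) means the packet did not come from the same counter. The function name flag_id_outliers is mine.

```python
def flag_id_outliers(ids, max_step=64):
    """Return IDs that do not follow the last trusted ID within
    max_step, counting modulo 2**16 to allow for wraparound."""
    outliers = []
    last_good = ids[0]  # assumption: the first packet is genuine
    for ident in ids[1:]:
        step = (ident - last_good) % 65536
        if 0 < step <= max_step:
            last_good = ident
        else:
            outliers.append(ident)
    return outliers

# The client-to-server ID sequence seen in the ASA capture.
print(flag_id_outliers([7273, 7274, 41871, 7275]))  # [41871]
```

This only works for stacks that count sequentially per destination; a host that randomizes its IDs would make every packet look like an outlier.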
From this data, I can be highly confident that the RST was injected by another device and was not originated from the client. Further, since this was all happening over a remote access VPN and the capture points were on both sides of the firewall, if a packet was injected it must have been the firewall that did it.
By the way, packet #2763 in the trace above was our missing ID value of 11152. It never made it to the client.
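Spotting a hole like that by counting packets gets tedious on a long trace, so here is a small Python sketch that lists the IDs skipped between consecutive observations. It assumes the IDs are given in the order they were captured, that the sender increments by 1 per datagram, and that the counter wraps at 16 bits. The function name missing_ids is just for illustration.

```python
def missing_ids(observed):
    """List IP IDs skipped between consecutive observations,
    allowing for 16-bit wraparound."""
    missing = []
    for prev, cur in zip(observed, observed[1:]):
        gap = (cur - prev) % 65536
        # Every value strictly between prev and cur was never seen.
        missing.extend((prev + step) % 65536 for step in range(1, gap))
    return missing

# Server-to-client IDs as seen in the client-side capture.
print(missing_ids([11150, 11151, 11153]))  # [11152]
```

Feed it out-of-order packets and it will report a huge bogus gap, so it is worth sorting out reordering before trusting the result.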
Putting It All Together
So what does all this mean? Basically, things in my conversation were humming along when the server sent that packet with the ID of 11152. The firewall decided at that point to drop the connection, and in the process spoofed a RST from the client (who was on the ‘outside’) to the server (on the ‘inside’). That’s why the ID field didn’t match other packets from the client. A final packet from the client was in flight on its way to the server and made it through (that’s ID 7275), which triggered the server to send its own RST as it had already closed that socket and sending a RST is what a good TCP/IP stack does when it receives a packet destined to a closed socket. That is the genuine RST that the client receives with IP ID 11153.
So what happened? Well, the vendor is still working on the issue. The firewall in question seems to be erroneously killing active connections and we don’t have a cause yet but I thought the exercise of proving the source of IP packets using the ID field might be of help to someone out there.
Who is the firewall vendor?
Cisco is sending back RST packets for long-standing TCP connections? That couldn’t be an accident.
There should at least be a knob for that. Is this related in any way to NAT timeout periods?
It’s for VPN traffic, so NAT timeouts probably won’t apply as it just has an identity Twice NAT rule. Also, the sessions aren’t idle, they’re actively passing traffic and in some cases are only a couple minutes old. Default TCP idle connection timeout is 1 hour and default NAT xlate timeout is 3 mins but even an “idle” RDP session ACKs back and forth every 60 seconds as a keep alive.
TAC confirmed that the firewall is dropping the connections, but it shouldn’t be. As of last night they weren’t sure why after looking at it for 8+ hours. It’s not just a regular timeout.
The config was working fine on 8.2.5. After upgrade to 8.4.6 (including reworking the NATs and interface ACLs manually) this showed up. I’ve converted dozens of firewall configs from 8.2 to 8.3/8.4/9.1 and not had this problem so I don’t think I did anything wrong. This is the only ASA I’ve done 8.4.6 on though, so I suspect a software problem.
We’re working that problem with the vendor’s support. I just thought the IP ID field was an interesting thing to write about.
The IP ID write up was nice. Quite the bizarre bug though.
For those curious, after upgrading to ASA 9.1.1, that issue with the random connection drops subsided. Of course that doesn’t explain what the problem was, but 9.1.1 was where I wanted the customer to be anyway so I’ll take it.
This is my favourite kind of detective story. Good show!