[ tech ]

Troubleshooting Common Networking Problems with Wireshark, Pt. 6: Finding the Source of Network Delays

NOTE: You can now take course by the author with video and example traces on Wireshark. Check this post for more details.

Author’s Note: This is the last part in a six-part series about finding and solving many networking anomalies using the Wireshark network protocol analyzer. If you are new to the series, you can find part 1 here, and the whole series here.
It may seem surprising in this age of ubiquitous Gigabit Ethernet connections, but many of the calls in Enterprise networking still revolve around network delay problems. Slow access to websites, slow logons, slow email, all of these issues can potentially be related to network delays. Unfortunately, tracking down the source of network delays can be very frustrating at times. With the right filtering, however, you can eliminate much of this frustration and drill down to the source.
In some cases, the delays will be easy to spot, showing up as a per-packet delay in the traces. In others, the per-packet latency may be very small, but the upper layer latency may be huge. The simplest check, however, is typically per packet latency. This can be done easily in Wireshark by filtering down to communications between the two problematic systems. Once this is done, ensure that your time display format is set to ‘Seconds since previous packet’ in the view menu, and click on the ‘Time’ column heading to sort by time. This will show you your longest delays between packets. If you have many packets with very long delays (0.2 seconds or longer), re-sort by frame number for several of these and examine the traffic. The type of traffic will determine if the delay is a problem. For instance, a long delay on a HTTP keep-alive is not a problem, but a long delay between an HTTP Get and a HTTP 200 response is.

Note: Make sure, when looking for delay issues, that the trace was captured with a minimal amount of capture filtering. The default setting of filtering out RDP is fine (unless you are getting RDP delays, of course), but any more than that is questionable. The reasoning for this is that if you are filtering out a large chunk of the traffic, you are in effect creating gaps in the capture, which invalidates the technique discussed above.

However, once you find the delay, you may still need to dig further to locate the source. For example, if you right click on an HTTP packet in Wireshark and choose the option ‘Follow TCP stream’, Wireshark will filter based on the source and destination sockets, giving you a good summary of the TCP conversation. An example of a delayed conversation is shown below:

No.

Time

Source

Destination

Protocol

Info

111

0.003251

160.207.226.48

160.207.12.41

HTTP

GET /default.aspx HTTP/1.1

112

0.000771

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 401 Unauthorized (text/html)

113

0.000533

160.207.12.41

160.207.226.48

HTTP

Continuation or non-HTTP traffic

114

0.000986

160.207.226.48

160.207.12.41

TCP

2456 > http [ACK] Seq=591 Ack=1677 Win=65520 Len=0

115

0.007079

160.207.226.48

160.207.12.41

HTTP

GET /default.aspx HTTP/1.1, NTLMSSP_NEGOTIATE

116

0.001008

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)

117

0.000562

160.207.12.41

160.207.226.48

HTTP

Continuation or non-HTTP traffic

118

0.000993

160.207.226.48

160.207.12.41

TCP

2456 > http [ACK] Seq=1283 Ack=3651 Win=65520 Len=0

119

0.005675

160.207.226.48

160.207.12.41

HTTP

GET /default.aspx HTTP/1.1, NTLMSSP_AUTH, User: GDAIS\Michael.Vogt

121

0.112828

160.207.12.41

160.207.226.48

TCP

http > 2456 [ACK] Seq=3651 Ack=2143 Win=64675 Len=0

166

1.634767

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (text/html)

Looking at this conversation, you may correctly surmise that the page in question takes a long time to retrieve (around 1.7 seconds), but we still do not truly know the source. Even worse, since an HTTP page can include multiple references to additional files (such as images), which can extend over multiple TCP sessions, looking at a single stream does not necessarily give us a true appreciation of the delay. To see the full delay, we must filter just on the two IP addresses in the conversation along with the protocol (HTTP), like so:
(ip.addr==160.207.226.48 and ip.addr==160.207.12.41) and http
With this filter, we can look for each HTTP Get request and response, and determine when the last request for the page is sent, and when the response for that request is returned. We find that the last request is for the file crossbullet.gif in frame 3962, which is returned in frame 4285. We know this is the last request because the Get request for default.aspx (and all the Get requests related to this page) repeats (indicating a new page request) after this Get request. Now that we know where the conversation ends, we can filter to just this HTTP conversation by using the following filter:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and http and frame.number <= 4285
With this filter applied, using the Summary command under the statistics menu gives us the following information:

As the summary shows, there is a 28.5 second delay in retrieving the page. This delay is huge, much worse than we previously anticipated, so now we need to dig in further to determine why. Another Wireshark filter can help here:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.number <= 4285
This filter is essentially filtering to all traffic between the two hosts that is also an HTTP request or response packet and has a frame number less than 4285. This filters the HTTP continuation packets out of the trace, dropping the trace from 689 frames to 182 frames. However, we can break it down further and look for just the high delay packets by adding frame.time_delta to the filter, like so:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.number <= 4285 and frame.time_delta >= 0.2
This filter is the same as the previous filter, but now we are adding the condition that all frames displayed must have a time delta from the previously displayed frame equal to or greater than 200ms. This filters the frame down to 27 high-latency frames:

No.

Time

Source

Destination

Protocol

Info

111

————

160.207.226.48

160.207.12.41

HTTP

GET /default.aspx HTTP/1.1

166

1.765202

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (text/html)

244

0.208087

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/1033/owsbrows.js HTTP/1.1

351

1.450349

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (text/css)

550

1.169115

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (text/css)

651

1.157058

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (application/x-javascript)

724

1.081409

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (application/x-javascript)

920

0.264385

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/1033/search.js HTTP/1.1

1136

1.315121

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (application/x-javascript)

1316

1.771438

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (text/css)

1382

0.206753

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/images/icongo02.gif HTTP/1.1

1576

2.916854

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

1784

2.512169

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (application/x-javascript)

1951

0.20108

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/images/GDAISsource.gif HTTP/1.1, NTLMSSP_AUTH, User: GDAIS\Michael.Vogt

2002

1.160457

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (text/css)

2052

0.200265

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)

2158

1.139176

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

2226

1.103149

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 304 Not Modified

2337

1.281311

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (text/x-component)

2469

1.267644

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

2512

1.268341

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

2551

0.217619

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

2615

1.331625

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

2693

1.114701

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

3280

1.147739

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (GIF89a)

3944

0.352483

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/images/partgradp.gif HTTP/1.1, NTLMSSP_NEGOTIATE

4282

0.919761

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 404 Not Found (text/html)

At this point, we simply need to look for a pattern, and luckily, the filter shows us the pattern very quickly. It appears that nearly all of the HTTP responses have delays of 1 second or longer. This is a huge find, and points us very close to the cause.
Initially, we may be tempted to point to the server itself as the cause of the delay, but we can’t prove that without looking at the server side trace. So we will open the server side trace and attempt to find a conversation with similar delay. It is helpful if you can find the same conversation, but not absolutely necessary. If the problem is caused by a server side issue, any conversation that is going to the same URL on the same server at approximately the same time should show a similar delay, and for the same reasons.
To find a similar delay, we will simply modify the previous filter slightly, removing the frame number restriction, like so:
(ip.addr eq 160.207.226.48 and ip.addr eq 160.207.12.41) and (http.request or http.response) and frame.time_delta >= 0.2
This filter will show us some frames with a very striking similarity to the ones seen in the previous trace:

No.

Time

Source

Destination

Protocol

Info

3273

————

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/1033/owsbrows.js HTTP/1.1

3751

1.095343

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (application/x-javascript)

3856

1.081487

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (application/x-javascript)

4088

0.265282

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/1033/search.js HTTP/1.1

4202

1.314339

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 200 OK (application/x-javascript)

This proves that the problem is not due to delay in the link, but is something related to the web server’s delivery of the requested page. To dig deeper, we now need to select an example of the problem, say the Get request for owsbrows.js and the response seen in frames 3273 and 3751, respectively, and filter to all traffic between this request and response. This is to see any additional traffic that our previous filters may have hidden. To do this, we will use the following filter:
frame.number >= 3273 and frame.number <= 3751
This filter filters down to all frames between these two numbers, including the first and last frames (request and response). However, this leaves us with 500 or so frames to look through, which is a bit more than we might wish. Looking at the first 36 packets, however, we can begin to determine what is occurring:

No.

Time

Source

Destination

Protocol

Info

3273

6.466433

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/1033/owsbrows.js HTTP/1.1

3274

0.000061

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=600692 Ack=27063 Win=65047 Len=1460

3275

0.000017

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [PSH, ACK] Seq=602152 Ack=27063 Win=65047 Len=298

3276

0.000018

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [ACK] Seq=27063 Ack=602450 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

3277

0.000348

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 401 Unauthorized (text/html)

3278

0.000009

160.207.12.41

160.207.226.48

HTTP

Continuation or non-HTTP traffic

3279

0.00151

160.207.226.48

160.207.12.41

TCP

2466 > http [ACK] Seq=742 Ack=1677 Win=65520 Len=0

3280

0.000026

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [PSH, ACK] Seq=27063 Ack=602450 Win=65535 [TCP CHECKSUM INCORRECT] Len=122

3281

0.000571

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=602450 Ack=27185 Win=64925 Len=1460

3282

0.000013

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [PSH, ACK] Seq=603910 Ack=27185 Win=64925 Len=298

3283

0.000019

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [ACK] Seq=27185 Ack=604208 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

3284

0.000721

160.207.12.1

Broadcast

ARP

Who has 160.207.12.184?  Tell 160.207.12.1

3285

0.000926

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [PSH, ACK] Seq=27185 Ack=604208 Win=65535 [TCP CHECKSUM INCORRECT] Len=122

3286

0.000439

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=604208 Ack=27307 Win=64803 Len=1460

3287

0.000032

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [PSH, ACK] Seq=605668 Ack=27307 Win=64803 Len=298

3288

0.000015

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [ACK] Seq=27307 Ack=605966 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

3289

0.002

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [PSH, ACK] Seq=27307 Ack=605966 Win=65535 [TCP CHECKSUM INCORRECT] Len=122

3290

0.000152

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/1033/owsbrows.js HTTP/1.1, NTLMSSP_NEGOTIATE

3291

0.000454

160.207.12.41

160.207.226.48

HTTP

HTTP/1.1 401 Unauthorized, NTLMSSP_CHALLENGE (text/html)

3292

0.000011

160.207.12.41

160.207.226.48

HTTP

Continuation or non-HTTP traffic

3293

0.000017

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=605966 Ack=27429 Win=64681 Len=1460

3294

0.000016

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [PSH, ACK] Seq=607426 Ack=27429 Win=64681 Len=298

3295

0.000015

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [ACK] Seq=27429 Ack=607724 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

3296

0.001842

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [PSH, ACK] Seq=27429 Ack=607724 Win=65535 [TCP CHECKSUM INCORRECT] Len=122

3297

0.000052

160.207.226.48

160.207.12.41

TCP

2466 > http [ACK] Seq=1585 Ack=3651 Win=65520 Len=0

3298

0.000805

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=607724 Ack=27551 Win=64559 Len=1460

3299

0.000023

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=609184 Ack=27551 Win=64559 Len=1460

3300

0.000009

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=610644 Ack=27551 Win=64559 Len=1460

3301

0.000016

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [ACK] Seq=27551 Ack=612104 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

3302

0.00003

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [PSH, ACK] Seq=612104 Ack=27551 Win=64559 Len=1192

3303

0.002783

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [PSH, ACK] Seq=27551 Ack=613296 Win=64343 [TCP CHECKSUM INCORRECT] Len=122

3304

0.000144

160.207.226.48

160.207.12.41

HTTP

GET /_layouts/1033/owsbrows.js HTTP/1.1, NTLMSSP_AUTH, User: CORP\John.Doe

3305

0.000484

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [ACK] Seq=613296 Ack=27673 Win=64437 Len=1460

3306

0.000015

160.207.12.37

160.207.12.19

TCP

1292 > 4076 [PSH, ACK] Seq=614756 Ack=27673 Win=64437 Len=298

3307

0.000018

160.207.12.19

160.207.12.37

TCP

4076 > 1292 [ACK] Seq=27673 Ack=615054 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

3308

0.000032

160.207.12.19

160.207.230.36

DCERPC

Request: call_id: 4997 opnum: 39 ctx_id: 0

Looking at the bolded frames, we can see two things of importance. First, the client is requested for authentication from the server for the Get, to which it responds in a normal fashion (Get, 401, Get Negotiate, 401 Challenge, Get with Credentials). The interesting thing to note is the speed with which the server responds to each step of this process. You will notice that the server’s 401 responses occur consistently in less than 0.001 seconds. This is not consistent with our previous view of a web server performance problem. The second thing to notice is what the web server does in the last frame (3308): It makes an RPC call to a domain controller, presumably to verify the authentication provided by the client. This is perhaps the most important find in the trace, as it shows interaction with another server occurs before the client is validated and given the page he has requested.
Now that we have found this new piece of information, let’s filter to determine how long the response to this RPC request takes. This is most easily done by selecting the RPC request and right-clicking to choose ‘Follow TCP Stream’. Once this is done, you can very easily see the RPC request and response, identified by their Call ID:

No.

Time

Source

Destination

Protocol

Info

3308

————-

160.207.12.19

160.207.230.36

DCERPC

Request: call_id: 4997 opnum: 39 ctx_id: 0

3654

0.255042

160.207.230.36

160.207.12.19

TCP

1025 > 4003 [ACK] Seq=7680 Ack=1296 Win=64671 Len=0

3747

0.825159

160.207.230.36

160.207.12.19

DCERPC

Response: call_id: 4997 ctx_id: 0

The result of this final filter is interesting. Our previous analysis showed that there was a total delay between the HTTP Get and the Web Server’s response of 1.095 seconds (the delay between frames 3273 and 3751). The analysis of the RPC traffic shows that the delayed response from the domain controller accounts for 1.08 seconds of that delay! Using this information, we would be able to examine all of the other RPC calls to the DC and verify this finding, proving that the domain controller is the primary cause of the web server delay.
This information is the key to resolving the issue, since no amount of resource allocation towards the network infrastructure, web server, or other services (like SQL) would actually improve the situation at all. But an additional DC to handle the authentication traffic may completely resolve the issue.

[ Comments ]