Sharkfest 2014 – Common Mistakes In Packet Collection (by Chris Greer)

Hello, and thank you for tuning in to this session, Common Mistakes in Packet Collection: things that make traces harder to read. This session was delivered at Sharkfest 2014, the Wireshark developer and user conference held June 16th through 20th at Dominican University, and we wanted to make sure it was recorded and made available to those who wanted to see it.

To introduce myself: my name is Chris Greer. I work for a company called Packet Pioneer LLC, where I am focused on network and application performance analysis, getting in at the packet level, helping people resolve complex application performance problems and root out the cause when things are going wrong. We provide protocol analysis consulting, and when we're not out there actually troubleshooting problems, we like to bring that experience back to the classroom, share examples we've seen in the real world, and help people become better analysts. We also deliver training focused on Wireshark and some other vendor-supported products, like Fluke Networks and others.

First of all, why do we care? What are we going to do in this session today? When we're analyzing complex application performance problems, it's true that every packet counts. In order to truly find the root cause of an application performance problem, or some gritty issue that's been sticking around for far too long, it's important that we get a complete picture of what the problem really is. If we cannot see it, we can't fix it. Here we see a missing puzzle piece; how frustrating when we come to the end of a puzzle and find a piece missing. And sometimes, when we really get down to the packet level, troubleshooting and resolving a problem comes down to one packet. So we want to make sure that before we even begin the analysis, and spend the time it takes to read through a trace file, we get everything we're going to need and don't make things harder on ourselves.
Now, to put together this session for Sharkfest, one thing I did was interview a couple of friends of mine who are consultants in the packet analysis world, and I got a list of things from them that they found they repeatedly did when learning packet analysis: mistakes that made things harder on themselves, things they wish they had done differently or known sooner so that all of this would have been a little easier.

Why does this matter? This screenshot might say it all. Sometimes this is what a packet trace can look like: a bunch of lines of code that can be difficult to see beneath. In some cases it can be hard to turn the screen we're seeing into actionable steps that will resolve the root cause, and unless we really know what we're looking at and are comfortable with those protocols, it can simply be difficult. It's possible that Wireshark could look to us like this cockpit does. A trained pilot knows what he's looking at; he might glance at one of those dials and say, whoa, there's a problem here. Why? Because he's comfortable interpreting what those gauges and dials are telling him. For me, that cockpit is scary. I have an idea of what the yoke does, that you can pull it back and push it forward, but the rest is completely foreign; you definitely wouldn't want to put me in the pilot's seat. To someone new to packet analysis, Wireshark can be just as foreign, and in some ways we can inadvertently make things more difficult on ourselves. So I'm going to go through a list I've compiled of common things that can make trace file analysis more difficult.
The first one we'll discuss today is initially capturing too much traffic. This screenshot was taken from the Wireshark University training, and it illustrates that we can put an analyzer at any point along the path we're trying to get into. Say, for example, we have a client talking to a server, with capture points one, two, three, four, and five as examples of where we could place an analyzer along the packet path. If we place our analyzer close to the client, we're just going to get the traffic that client is sending and receiving on the wire. We'll see which servers it's communicating with and which component of the whole application delivery is taking too long for that client: it talks to a DNS server, perhaps to login servers or some type of Active Directory box, and then it goes on to the server it's actually interested in. So if we capture at the client end, we'll see that detail, but we won't see the same volume of traffic we would if we instead chose to capture at the server end.

If we capture at the server end, that's when we're going to see all traffic coming and going to that server. It is good that we'll see the dependencies: whether that server is talking to any back-end devices, back-end database servers for example, or mid-tier application servers. But when we capture on the server end, those trace files can get large very quickly, which can give us the feeling of digging for a needle in a haystack.

How large would a trace file get if we're capturing on, say, a one gigabit per second link? Let's crunch some numbers. Suppose we capture for five minutes on a one-gig link averaging about 50% utilization. How large would that trace file get, and are we sure the problem we're interested in would even happen within that span of time? That would turn into roughly an 18-gigabyte file, representing approximately 36 million packets (assuming about a 512-byte average packet size) that we would need to comb through, perhaps to find just one. That tells a story, doesn't it? In a five-minute period we're talking about eighteen gigabytes. I don't know about you, but a trace file that size is simply too large for me to wrap my head around: there are way too many conversations, and it can take Wireshark a long time to open capture files like that. In a lot of modern data centers we can even multiply this by a factor of ten, since many data centers are pushing 10 gigabits per second out to their servers.
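The arithmetic behind those figures is easy to sketch. The 512-byte average packet size is just the assumption the talk uses; real traffic mixes vary widely, so treat the packet count especially as a rough estimate:

```python
# Back-of-the-envelope capture sizing: 1 Gb/s link, 50% utilization,
# five minutes, assuming (as in the talk) a ~512-byte average packet.
LINK_BPS = 1_000_000_000    # 1 Gb/s
UTILIZATION = 0.50
SECONDS = 5 * 60
AVG_PACKET_BYTES = 512      # assumption; real averages differ

capture_bytes = LINK_BPS * UTILIZATION * SECONDS / 8
capture_gb = capture_bytes / 1e9
packets = capture_bytes / AVG_PACKET_BYTES

print(f"{capture_gb:.2f} GB, ~{packets / 1e6:.0f} million packets")
# roughly 18.75 GB and ~37 million packets; on a 10 Gb/s link,
# multiply both figures by ten
```

The same formula shows why the 10-gig case balloons to a 180-gigabyte file: every term scales linearly with link speed.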
If we put an analyzer on a 10 gigabit per second link for five minutes at about 50 percent utilization, the file size simply goes up by a factor of ten: roughly a 180-gigabyte file, approximately 360 million packets, and it's possible that you and I are looking for just one. So let's do our best not to experience packet overload, because even with filtering, locating the root cause of a problem can be overwhelming when we have too many packets. If the option is available to you, start at the client end: make filtering easier, and make finding that root cause easier on yourself. That said, we may come into a situation where we're looking at traffic going to and from a server and it appears the server is taking too long to respond. That's when we want to move our analysis to the server end: take our capture equipment, whatever it may be, put it in the packet path on the server side, and then we have more data to work with when looking for that application problem.

Number two, another important one: not thinking before capturing. What do we mean by this? Before we collect a single packet off the wire, there's a lot of information we should gather to make the eventual analysis a lot easier on ourselves. We should think about our goal. What are we looking to achieve by capturing in this particular spot? And what is "it", the thing we're trying to capture? Sometimes we'll go to a customer site, sit down to talk about the issue they're experiencing, and they use that word "it" a lot. They might say, "Yeah, it seems to happen every third Monday. It's a weird issue. It only affects Shirley and not John over there." Okay, so what is "it"? Explain the actual application issue that's happening. Can we get more specific? What application is it affecting? How many users are impacted? Does it change when I go to a different server or connect in from a different location?
Those questions go a long way toward resolving the problem, or at least finding it at the packet level, if we do some homework in advance. Also important: where do the packets flow? We don't want to assume that the packets flow through a certain link, put our visibility tool there, and only later find out that we were wrong: that the traffic actually goes to a different server, or is pathed around us on a different network connection. Where those packets flow is critical data we need to be sure of in advance.

Also, capturing a problem doesn't necessarily mean we can interpret what is actually happening at the packet level. So be sure to think before you capture; don't just jump on a link, grab traffic, and expect to find root cause without first asking the leading questions that should get us there.

A third point: capturing locally on a system. In this example, say we wanted to capture traffic going to and from the client end of the connection, so we actually install Wireshark on that client, hit capture, and then have the client do its best to reproduce whatever the problem was. This can cause some issues that, if we're not prepared for them, can be difficult to interpret: things like false alarms with TCP. Maybe we've seen this before. It happens because when we're capturing locally on a client, the Wireshark capture driver attaches to the TCP/IP stack before some of those checksums are calculated. This is an old way to save time and spare the processor: as a segment was becoming a packet and the packet was being put into an Ethernet frame, some stacks would offload the checksum calculation down to the NIC. Let me show you where that is in a trace file before going any further. Going over to Wireshark, I'll scroll up and select any TCP frame. If I come down into my detail view, go to the Transmission Control Protocol header, and look at the checksum: when I'm capturing locally on a system, that checksum may not be calculated by the TCP stack at all.
Instead, the stack leaves it uncalculated, puts a placeholder value in the field, and sends the segment down to the NIC; the NIC is configured to look for those checksums and fix them before sending the frames out on the wire. So when I capture traffic that my own machine is sending, a lot of the time I'll capture it before the checksum is fixed. If I took the same analyzer and put it outside the NIC, the checksum would already have been calculated by the NIC, and it would go out correctly.

Now, this is something we can choose to disable within Wireshark. If we come up to the TCP header and right-click, we can get to the protocol preferences for TCP, where we can choose to validate the TCP checksum, or not. Personally, I choose not to validate the checksum, because if I validate it while capturing locally on my system, I'll see a screen that looks like this: Wireshark flags all these apparent TCP errors in the packet trace. To the untrained eye this screenshot looks bad, right? We start looking at the sender, examining what's going on in its TCP stack, maybe we replace the NIC, thinking, hey, there's a problem, this thing is sending out bad checksums, when in fact this isn't a problem at all. It's a false alarm, not a real issue on the wire. So remember: we may see "TCP checksum incorrect" for packets that our local system is transmitting, and those packets are completely healthy and normal if they were captured on the station that's transmitting them. Seeing an incorrect TCP checksum from any other transmitting station is a different matter.
If I'm out capturing on the wire and I see an incorrect TCP checksum in a unicast conversation that I'm not a part of, that's something I'd really want to look into and dig into. But this is the kind of thing that can trip us up if we're not aware that capturing locally on a machine may generate these false alarms.
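To make the false alarm concrete, here is a sketch of the ones'-complement checksum (RFC 1071) that IP, TCP, and UDP use, the very calculation a stack hands off to the NIC when checksum offload is enabled. (This is a simplified illustration: the real TCP checksum also covers a pseudo-header.) A locally captured frame shows a placeholder because Wireshark copies the frame before this math runs; the same frame captured outside the NIC would verify cleanly:

```python
def inet_checksum(data: bytes) -> int:
    """Ones'-complement 16-bit checksum used by IP/TCP/UDP (RFC 1071)."""
    if len(data) % 2:              # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # 16-bit big-endian words
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

# The property a receiver relies on: data followed by its own checksum
# sums to zero, which is how a segment is validated on arrival.
payload = b"\x12\x34\x56\x78"
csum = inet_checksum(payload)
assert inet_checksum(payload + csum.to_bytes(2, "big")) == 0
```

A frame whose checksum field still holds the stack's placeholder fails this test, which is exactly what Wireshark's validation flags when capturing on the sender.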

We might also see some weird things in our packet traces, and weird is just the word I use, because we might see things like 16,000-byte packets. If we look at the IP length field in Wireshark, we'll actually see these large packets, and we may think, wait a second, how can my machine be sending 16,000-byte packets? Surely it needs to chop them up into much smaller sizes. The largest frame on Ethernet II is 1518 bytes, assuming I'm not using a VLAN tag or some other extra header, so about 1518 bytes is all I'm able to send. Does that mean I'm sending out illegally large frames? No. If we captured these 16,000-byte packets, so to speak, outside on the wire, external to the local system, we would see several smaller packets actually leaving the NIC. Again, we see the oversized ones because we captured just before they were segmented.

Something else we may see is strange timing. In our delta time column, the amount of time between one packet and the next, it's possible to see all zeros. We might think that's impossible: no two packets can arrive at exactly the same time on one interface, there has to be some delta even on a very fast link. But these packets were captured locally, and the stack is simply not able to timestamp them properly.
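Those two symptoms, frames larger than the wire allows and zero inter-packet deltas, are easy to screen for if you export packet lengths and timestamps. This is a hypothetical sketch, not a Wireshark feature, and the 1518-byte limit assumes plain untagged Ethernet II:

```python
MAX_WIRE_FRAME = 1518  # Ethernet II maximum, no VLAN tag

def local_capture_hints(packets):
    """packets: list of (timestamp_seconds, frame_length_bytes) tuples.
    Returns symptoms suggesting the trace was captured on the sending host."""
    hints = []
    for i, (ts, length) in enumerate(packets):
        if length > MAX_WIRE_FRAME:
            hints.append(f"frame {i}: {length} bytes exceeds wire maximum "
                         "(likely captured before segmentation offload)")
        if i > 0 and ts - packets[i - 1][0] == 0:
            hints.append(f"frame {i}: zero delta time (stack timestamping)")
    return hints

# A tiny made-up trace: one oversized frame that also shares a timestamp.
trace = [(0.000000, 66), (0.000000, 16384), (0.000120, 1514)]
for hint in local_capture_hints(trace):
    print(hint)
```

If a trace shows either hint, suspect a local capture before concluding the network is misbehaving.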
Also with capturing locally, capturing on an end station, it's important to know that non-dedicated capture hardware has a limit. In this screenshot we see a fire hose blasting packets at a laptop, which is effectively what we do when we put a laptop in a situation where it doesn't have the capability of keeping up with the ingress traffic stream. How bad does this get? Think about it for a moment: the recent numbers for Wireshark downloads are on the order of half a million per month, according to the Wireshark Network Analysis study guide. How many of those downloads do you think go straight onto users' laptops, and how many of those users leave the optimization settings at the defaults? Very likely most. So in most cases Wireshark is installed on a laptop. What's wrong with that? Well, laptops have a purpose. My Dell was built so I can browse the web, do my email, work in a couple of applications, play some music and movies: mostly to make me a happy user. Capturing packets off the wire is not the purpose of a laptop. Dell buys a NIC from some manufacturer, attaches that interface to the back of my machine, and I plug in and start capturing packets; but my laptop is not an optimized, hardware-based tool with the express purpose of capturing traffic and keeping up at one gig.

And is one gig really one gig? If I connect my machine to the network, it likely has a one-gig interface these days, but does that mean my laptop can capture traffic at one-gig speed? Most of us would agree that our machines cannot keep up with a high rate of traffic, but the question then becomes: when does it start dropping packets? What percentage of traffic can I be sure my machine keeps up with? Or, alternatively, at what point should my eyebrow go up? At what utilization point on a link do we really need to consider a hardware-based appliance that's purpose-built for packet capture and analysis?
capture and analysis well I’m going to talk a moment about a demonstration I did is this actually a year ago at shark fest 2013 if you’re interested you can find the video from that here on YouTube as well but capture limitation on default

What I did was take a packet generation tool, the Fluke Networks unit you see at the top left, and push traffic through a one-gig aggregation switch. I sent a traffic stream from that aggregation switch to my laptop and measured how much traffic I was actually able to bring in and capture off the wire. The results were pretty interesting. My Dell XPS 15z, an i7 with eight gigs of RAM, can consistently capture about an 80 megabit per second stream. That might sound alarming or surprising; it certainly did to me. Think about it: that is a very small fraction of a one gigabit per second stream, just 8 percent, and we find it's simply not much better on other systems; they simply aren't able to keep up with the ingress traffic stream. At that Sharkfest we tested machines with solid-state drives and 16 gigs of RAM, we tested Apple machines, we tested almost any brand you could think of, and very few were able to capture much over a 100 megabit per second stream. So right there is my mental threshold when I'm capturing on a laptop: if the connection I plug into is running over 100 megabits per second, 10 percent of a gig stream, my eyebrow goes up. I start to get concerned, and I begin to question whether I'm effectively able to capture everything.

Now, how do we know if we're beginning to hit that threshold? If we plug in and capture on a stream, what symptoms in the trace file tell us whether we're keeping up or not? We want to look at the expert information for things like "Previous segment lost." It's possible that indicates a genuinely lost segment, but it's also possible that our analyzer simply did not see the original packet.
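The rule of thumb above reduces to simple arithmetic. The 80 Mb/s figure is the talk's measured number for one specific laptop; treat your own machine's ceiling as something to measure, not assume:

```python
def capture_shortfall(link_mbps, sustained_capture_mbps):
    """Fraction of offered traffic a capture station will miss once
    the link's rate exceeds what the station can sustain to disk."""
    if link_mbps <= sustained_capture_mbps:
        return 0.0
    return 1 - sustained_capture_mbps / link_mbps

# The laptop from the demo: ~80 Mb/s sustained against a saturated 1 Gb/s link.
print(f"{capture_shortfall(1000, 80):.0%} of the packets never make the trace")
```

That is why a drop counter reading zero is not reassuring by itself: the missing 90-plus percent may simply never have been counted.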
Also "ACKed lost packet": an acknowledgment comes across for a packet that we never saw; it's acknowledging something we didn't capture. And sometimes, depending on the situation, this gets classified as out-of-order. Something else we can keep an eye on is the dropped-packets counter at the bottom of Wireshark. In my professional opinion, though, that counter has to be taken with a grain of salt, because Wireshark depends on the lower-level NIC driver counters to tell it whether traffic was dropped. In the tests I was just describing, when I shoot a gigabit of traffic at my machine and it's dropping 90 percent of it, that drop counter usually sits at zero. So just because the drop counter says zero doesn't mean we're capturing everything, but it is a counter we can watch.

Switches, and also virtualization, can make capturing difficult. Remember that when we're using a layer 2 Ethernet switch, a packet that comes into a port will only be forwarded to the destination port, as long as the switch knows where that end station, or the next hop along the path, is. So in order to capture, the analyzer needs to be in line somewhere. In our screenshot, if the PC and the server are having a conversation through the switch, our analyzer, sitting out in space there, cannot simply plug into a standard switch port and see everything we hope to see. Instead, we need to use one of the three common capture methods to get our analyzer in line. These are familiar to a lot of us, right? The hubs, taps, and SPANs conversation. What we're going to do here is talk about the pros and cons of each one; all of them have their pluses and all of them have their minuses.
First, getting in the path with a SPAN or mirror port. What does that mean? The switch will copy traffic on selected ports, hosts, VLANs, or traffic patterns to a monitor port. There are pros and cons. The positives of a SPAN/mirror: a lot of switches already support it, so we can log into the configuration, and all we need to know is the commands to enable it.

There we are: it's free, it's there, built into the infrastructure, and we don't need to break a link to configure it. It also gives full-duplex traffic analysis: we can see traffic in both directions and send it over to our analyzer. On the downside, a SPAN/mirror needs configuration, which means going into a switch, perhaps during a production timeframe with production traffic, and making a change; in some environments that's simply not tolerated. Also, our analyzer is not able to transmit back into the switch. And it's easy to overload: if we're monitoring many ports, we can have more traffic being sent to the SPAN than the SPAN port can effectively handle.

Next, taps. A tap is something that physically goes in line on a connection. Here we actually break the connection between the server and the switch, put the server on a network port on the tap, and run the tap back into the switch. It directly monitors the connection in line, letting us listen in and send those packets over to our analyzer. Just like SPANs, taps have pros and cons. The pros: it's true in-line analysis, we're not missing anything; it's full duplex; there's no configuration necessary; most taps are power-fault tolerant, so if power is pulled from the tap, the link doesn't drop; and it's always there, always available for capturing. In some instances I've found myself in a position where we put a tap in for an analysis job, and at the end the client says, you know what, don't pull that tap out, just leave it; we want the ability to walk up later, click in, and start capturing packets. The cons: taps have a high cost compared to a hub or a SPAN, we need to break the link the first time one is installed, and they're harder to obtain.

Now, over-provisioning doesn't only affect laptops.
Just as it's possible to send more traffic to a laptop than it can keep up with, the same is true of SPANs and mirrors: they too can be over-provisioned, especially when we span a full VLAN or several gig ports at one time. We did a test some years ago comparing a SPAN versus a tap on one connection. If you're interested in the gory details, you can find this written up on lovemytool.com; I'll go over the highlights here. We took two stations and transmitted a one-gigabit stream between them. On the PC on the right, we tapped the connection and sent that traffic to a tap analyzer; then we spanned the same traffic feed to a SPAN port and sent it to yet another analyzer. Both analyzers were Fluke Networks OptiView XG tools, able to capture all traffic coming in from both the tap and the SPAN, so I was absolutely sure that the analysis tools themselves were not responsible for dropping packets. What I wanted to see was: do both of these analyzers see identical traffic? Are the timers the same? Do they see the same amount of data?

Let's look at the basic results, tap versus SPAN. On the tap, the number of packets captured was 133,126, exactly. The delta time at the TCP setup, when the handshake went back and forth between station A and station B, was 243 microseconds. That might not seem like much, but we'll get back to it in a second. On the SPAN, the analyzer attached to the switch, the number of packets captured was about 125,000. Right there we lost eight thousand packets: the SPAN was not able to keep up with the traffic stream.
With eight thousand packets dropped, right there I have faulty information to dig through and analyze. It's going to be difficult to mentally jump over those eight thousand missing packets, and it's going to make the true problem look muddy, even if we really did capture the true issue.
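The drop numbers from that test reduce to a quick comparison. The tap count (133,126) is the talk's exact figure; the SPAN count is only quoted as roughly 125,000, so the loss percentage below is an estimate:

```python
tap_packets = 133_126    # exact count from the tapped analyzer
span_packets = 125_000   # approximate count from the SPAN analyzer

lost = tap_packets - span_packets
loss_pct = lost / tap_packets * 100
print(f"SPAN missed {lost} packets ({loss_pct:.1f}% of the stream)")
```

A six percent hole in a trace is more than enough to hide the one packet an analysis hinges on.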

Even more interesting to me was the delta time of the TCP connection setup on the SPAN: 221 microseconds. Notice that's faster than the tap's number. Going back one slide: when station A sends the SYN, it goes through the tap, through the switch, and over to the station on the left. When the SYN-ACK comes back from that station, it goes to the switch, and the switch forwards the SYN-ACK first to the SPAN, and only after that does it send it on through the tapped interface. What we see is that the switch was prioritizing the mirrored traffic to the SPAN, which means it was slowing down my true production traffic. For me that's a problem: my analysis tool, or my analysis method, gets in the way of production traffic when it slows down the rest of what's going on in the network and causes it to suffer. We never want our analysis tools or methods to come at a cost to real performance on production traffic.

How do we get around this, as far as an analyzer is concerned? There are boxes designed to capture at full line rate, up to ten gigabits per second and beyond, with no gaps or drops. One we can look at is the TurboCap NIC from Riverbed, with hardware built into the NIC to handle that ingress traffic stream. There's also the Cascade appliance from Riverbed: it sits on a connection, brings the packets in, and streams them to disk as fast as they enter the interface. Between those two, we know we're getting what we expect and that our analyzer won't be muddying the water, so to speak, when it comes time to read the traffic.

Okay, now let's talk about the fifth one: forgetting capture filters. To show this, let's go back to Wireshark.
When we're setting up a capture on an interface, one thing we can do is come into the capture options and, if we choose, set a capture filter. Maybe we've done this before: we can set a host or an address here. If I'm interested in a certain host IP, I can put that IP address in, hit start, and my analyzer will only bring in traffic to and from that machine. That's good, but what if I move my station from one connection to another, change my analysis goal, move on to a new problem, and forget that this filter is here? We want to make sure we go back into the capture options and remove it. Thankfully, in recent versions of Wireshark, if we shut Wireshark down and bring it back up, those capture filters are automatically cleared; but it's still something to keep in mind when we move between connections, because it's possible to forget a capture filter.
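For reference, capture filters use BPF syntax (which is different from Wireshark's display-filter syntax, and unlike a display filter it limits what is written to disk at all). A few common patterns, with placeholder addresses, that are easy to set and just as easy to forget:

```python
# Common BPF capture-filter strings (addresses here are placeholders).
capture_filters = {
    "one host":         "host 10.0.0.5",
    "one subnet":       "net 10.0.0.0/24",
    "one TCP service":  "tcp port 443",
    "skip our own SSH": "not (host 10.0.0.9 and tcp port 22)",
}
for purpose, bpf in capture_filters.items():
    print(f"{purpose:>16}: {bpf}")
```

Any of these left in place from a previous job will silently exclude the traffic you came to capture.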
Number six is an interesting one: Wireshark not configured to you. If we've had the opportunity to go to Sharkfest, there are a lot of great sessions on how to tune Wireshark to your visibility, your preferences, the things you're comfortable reading. If we're using a Wireshark system that's been configured for somebody else, maybe they have too many custom columns, maybe too few, maybe they've configured coloring rules we simply don't understand. What do we mean by this? Going back to Wireshark, I'll open up a simple trace file with a profile I like to use when I demonstrate this: it has too many columns, all kinds of stuff in the summary view up top. A column is something very easy to add: if there's any value down in the details pane that I want displayed for every packet, I can select that value, right-click, and choose Apply as Column, and that value will be displayed for all packets that carry that header field. But it's possible we simply have too many columns to make sense of what we're seeing. A lot of times we want to start simple, and then, depending on the application or the type of problem we're looking at, adjust those columns, adding or deleting them based on our need.

So don't have too many columns. A given set of columns may be a good amount of information for one protocol, but say we're analyzing ARP traffic: a lot of those values go away when we're looking at ARPs, and it simply gets too messy. That's something to keep in mind: Wireshark not being configured to you.

Seven: taking the expert information as gospel. Back in Wireshark, we can look at the expert system via the dot at the lower left; if it's colored red or yellow, we have expert infos, warnings about this trace file that we might want to look at. But should we take these as absolute fact? We may see things like TCP checksum errors, which, as we saw, are a false alarm when we're capturing locally on a machine. Just last week I saw an example where a TCP port was being reused. That isn't a problem, as long as the same station doesn't have two instances of the port in use simultaneously. The server port doesn't usually matter, right? We're using the same server port all the time. But if a client station reuses a port, that's okay as long as it shuts down one instance before it starts up the next. In that trace file there was nothing wrong with reusing the port, and it didn't halt the conversation, but Wireshark still raised the expert event: hey, this station just reused a port. Something to consider, but in that case there wasn't necessarily anything wrong. So when we see an expert info, let's make sure we read it and understand it, and that we don't throw up all the flashing red lights on a problem when there really is none.
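The port-reuse rule just described, reuse is fine as long as the previous connection closed first, can be expressed as a small check over connection events. This is a hypothetical sketch, not Wireshark's actual expert logic:

```python
def simultaneous_reuse(events):
    """events: list of (port, 'open'|'close') tuples, in time order,
    for one client. Returns ports opened again while still in use."""
    in_use, offenders = set(), set()
    for port, action in events:
        if action == "open":
            if port in in_use:
                offenders.add(port)  # reused before the old instance closed
            in_use.add(port)
        else:
            in_use.discard(port)
    return offenders

# Benign reuse: the port closes before it is opened again.
ok = [(49152, "open"), (49152, "close"), (49152, "open")]
# A real problem: a second open while the first is still active.
bad = [(49152, "open"), (49152, "open")]
assert simultaneous_reuse(ok) == set()
assert simultaneous_reuse(bad) == {49152}
```

The benign sequence is exactly what triggered the expert warning in that trace: technically flagged, actually harmless.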
And finally, number eight: packet-level distractions. This is when we allow packets that are not related to the root issue to distract our attention. We could be looking at a certain application or conversation between a client and a server, and then we see IPX traffic coming off printers, and we go down that tangent for a while and start disabling the IPX stack on old printers we have hanging around; or maybe we see some other spurious conversation going on, and instead of focusing on what we're there to capture and analyze, we allow these other distractors to take our attention. So let's focus: let's filter and keep our interest where it needs to be to really reach the goal of the analysis job.

I appreciate your attention to this video today. Again, this was Common Mistakes in Packet Collection, delivered at Sharkfest 2014. If you have any comments, please feel free to leave one in the comment window below. If not, we look forward to seeing you, hopefully, at the next Sharkfest. Thanks again.