Sometimes it IS the network: Bandwidth Analysis

Do you know where your links are?

In other words, do you really know how much bandwidth is available for your users, especially those accessing applications across WAN links from remote offices?

Obviously, having an insufficient amount of available bandwidth for remote users affects the performance and reliability of

  • Interactive applications
  • Voice and Video services
  • Data backups and replication

And, of course, provisioning excess bandwidth to quiet application performance complaints (which may actually be caused by application design, app turns, or server processing time issues), or just to make sure the network isn’t the problem, throws away vast amounts of money that could be better spent elsewhere.

What you may not be aware of is that the standard bandwidth usage graphs and data provided by even the best, industry-standard Network Management Systems (NMS) may be giving you or your network and capacity planning teams an erroneous (and artificially low) view of true usage levels and/or skewed statistical values – and your users may be suffering from poor application performance because no one knows that the network (bandwidth) actually IS the problem.

Inaccurate or misleading bandwidth usage report graphs and statistical data are caused by three primary factors:

  • Long sample times
  • Averaging effects of nights and weekends
  • Inability to investigate anomalies

Long sample times
Although most NMS systems typically poll router interfaces every 5 minutes or so (or collect/store NetFlow data every 15 minutes), and store that data in a database for creating graphs and reports, they are also configured, by default, to roll those short sample times up into longer sample periods (30 minutes, 1 hour, 2 hours, etc.) after some relatively short period of time (one day, 7 days, etc.) to conserve database space. The problem is that this results in a severe loss of short-term event resolution and true peak usage levels due to averaging effects.
A good practice for bandwidth analysis is to generate usage reports based on at least 30 days to get a true representation and measurement of typical usage levels across multiple business days, and to include any month-end or other recurring events that might be missed on a several-day or even a 7-day report. But generating these reports from the longer sample time data that NMS systems default to after a few days will give erroneously low usage levels.
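
The averaging effect is easy to reproduce. Here is a minimal sketch, assuming a made-up hour of 5-minute samples with one short burst; the `roll_up` helper and the numbers are illustrative only and do not represent how any particular NMS stores its data.

```python
from statistics import mean


def roll_up(samples, factor):
    """Average consecutive groups of `factor` samples, like a simple rollup."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Hypothetical link usage in Mbps: twelve 5-minute samples (one hour)
# containing a single 5-minute burst.
five_min = [2.0, 2.0, 2.0, 9.5, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

# Roll the twelve 5-minute samples up into one 1-hour sample.
one_hour = roll_up(five_min, 12)

print("Peak seen in 5-minute data:", max(five_min))
print("Peak seen in 1-hour data:  ", max(one_hour))
```

In this made-up hour the 9.5 Mbps burst is clearly visible in the 5-minute data, but the single 1-hour sample comes out at 2.625 Mbps, barely above the baseline.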

The solution is to configure your NMS to save at least 35 days or so of low sample time data before rolling it up, at least on your major LAN and WAN links. Granted, this consumes more database space, but the performance and reliability of network services is a critical factor in any business’ success – and disk drives are relatively cheap.
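
To put a rough number on the extra database space, here is a back-of-the-envelope sketch. The bytes-per-sample figure and the interface count are pure assumptions for illustration; real storage cost depends entirely on your NMS and its database.

```python
SECONDS_PER_DAY = 86_400


def samples_retained(days, interval_seconds):
    """Number of samples kept per interface for a given retention period."""
    return days * SECONDS_PER_DAY // interval_seconds


# Assumptions for illustration only, not taken from any real NMS.
BYTES_PER_SAMPLE = 24   # timestamp plus in/out counters
INTERFACES = 500        # number of links kept at 5-minute resolution

five_min = samples_retained(35, 300)    # 35 days of 5-minute samples
one_hour = samples_retained(35, 3600)   # 35 days of 1-hour samples

extra_bytes = (five_min - one_hour) * BYTES_PER_SAMPLE * INTERFACES
print(f"{five_min} vs {one_hour} samples per interface")
print(f"Extra storage: {extra_bytes / 1_000_000:.1f} MB")
```

Under these assumptions, keeping 35 days of 5-minute data means 10,080 samples per interface instead of 840, and the extra storage across 500 interfaces is on the order of a hundred megabytes.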

The differences in apparent bandwidth usage with 5 minute and 1 hour samples are evident in the graphs below; there is a significant loss in true, peak level resolution and the periodic events seen in the 5-minute graph are completely lost in the 1-hour graph due to averaging effects:

Singapore - Std - 5 Min
Singapore - Std - 1 Hr

The Time of Day Analysis™ view with 5-minute samples shows the periodic peaks occur around 7AM Singapore time – these are lost in the 1-hour view:

Singapore - ToD - 5 Min
Singapore - ToD - 1 Hr

Averaging effects of nights / weekends
If your NMS provides a 95th Percentile line across the usage graph, be aware that this line is based on ALL the sample data, and the majority of that data comes from night and weekend periods. The line is therefore going to be low compared to what it would be if it were based on just Mon-Fri, 6AM-6PM business day periods.

I hope you’re not even looking at any Average values your NMS might be including in its reports; I have no idea why these systems persist in providing this useless data. The effect of nights & weekends on Average values is large enough that they typically show less than 1/2 or even 1/3 of nominal daily usage levels. Max or Peak values are also fairly irrelevant: these typically reflect a one-time event that, especially over longer reporting periods, has absolutely no bearing on the usage values we’re after. Ignore both of these and do some real analysis. Or at the very least, try to get a 95th Percentile line based on data with the nights & weekends filtered out.
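
The effect of filtering out nights and weekends can be sketched in a few lines. This is a minimal example on synthetic data, assuming hourly samples, a nearest-rank percentile, and a Mon-Fri 6AM-6PM business day; the `synthetic_mbps` usage pattern is invented purely to show the effect.

```python
from datetime import datetime, timedelta
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling of pct * n / 100
    return ordered[rank - 1]


def is_business_hours(ts):
    """Mon-Fri, 6AM-6PM in the link's local time."""
    return ts.weekday() < 5 and 6 <= ts.hour < 18


def synthetic_mbps(ts):
    """Made-up usage: quiet off-hours, 4 Mbps workday, one busy hour at 10AM."""
    if not is_business_hours(ts):
        return 1.0
    return 9.0 if ts.hour == 10 else 4.0


# One week of hourly samples starting Monday 2 Jan 2012 (synthetic data).
start = datetime(2012, 1, 2)
timestamps = [start + timedelta(hours=h) for h in range(7 * 24)]
samples = [(ts, synthetic_mbps(ts)) for ts in timestamps]

all_values = [mbps for _, mbps in samples]
business_values = [mbps for ts, mbps in samples if is_business_hours(ts)]

print("95th pct, all samples:  ", percentile(all_values, 95))
print("95th pct, business only:", percentile(business_values, 95))
print("Average, all samples:   ", round(mean(all_values), 2))
print("Average, business only: ", round(mean(business_values), 2))
```

With this invented week, the 95th percentile over all samples lands on 4.0 Mbps, while the business-hours-only figure is 9.0 Mbps. The all-samples average (about 2.2 Mbps) is roughly half of the business-hours average (about 4.4 Mbps).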

Investigate, filter, and reconcile recurring or anomalous events
Identifying recurring or anomalous events within the relatively small graphs provided by most NMS bandwidth reports is difficult at best, as is correlating those events time-wise if the timestamps are not correct – which may be the case if the NMS stores sample data with UTC (or other time zone) time stamps and doesn’t offer some way of converting and displaying that data in the time zone for the location the graph is portraying.
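
If your NMS export only gives you UTC timestamps, the conversion is straightforward to do yourself. The sketch below assumes the export contains naive UTC timestamps and uses a fixed UTC+8 offset for Singapore, which has no daylight saving time. For sites that do observe DST, a fixed offset is not enough and something like the standard library's `zoneinfo` module would be the better choice.

```python
from datetime import datetime, timedelta, timezone

# Singapore is a fixed UTC+8 with no DST, so a fixed offset is sufficient here.
SGT = timezone(timedelta(hours=8), "SGT")


def to_local(naive_utc, tz):
    """Interpret a naive timestamp as UTC and convert it to the site's zone."""
    return naive_utc.replace(tzinfo=timezone.utc).astimezone(tz)


# Hypothetical sample timestamp as stored by the NMS (UTC).
stored = datetime(2012, 1, 9, 23, 0)
local = to_local(stored, SGT)

print("Stored (UTC):  ", stored.isoformat())
print("Site local time:", local.isoformat())
```

A sample stored as 23:00 UTC on 9 January is actually 07:00 on 10 January in Singapore, which is the difference between an apparent overnight event and a start-of-business-day event.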

Small graphs that cover long periods of time are simply not going to allow useful investigation of usage patterns – it will be necessary to pull several shorter-term (1 day or 12 hour) reports and review these as a set. If a recurring / anomalous event is discovered, it may be necessary to generate additional reports with those time periods filtered out, or use a focused filter to measure the events if they are of interest.

A better bandwidth analysis process
To generate accurate, useful bandwidth analysis reports that truly reveal network usage levels across business day and evening periods, you need to be able to pull 30+ days of short-duration sample data, correct that data for the time zone it represents, zoom in and investigate recurring events and anomalies, filter on those events, and obtain useful statistical values, such as 95th or 99th Percentiles, that reflect true usage levels across the period of interest. That period may be the business day (to ensure sufficient bandwidth to support business operations) or the evening (to support DR scheduling that doesn’t run into business day mornings).
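
One plausible way to build a 24-hour view from a multi-day data set is to bucket every sample by its time-of-day slot and take a percentile within each slot. The sketch below is only an illustration of that general idea on synthetic data; it is an assumption about the approach, not a description of how any particular product implements it, and `time_of_day_profile` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling of pct * n / 100
    return ordered[rank - 1]


def time_of_day_profile(samples, pct=95):
    """Map each (hour, minute) slot to the pct-th percentile across all days."""
    buckets = defaultdict(list)
    for ts, mbps in samples:
        buckets[(ts.hour, ts.minute)].append(mbps)
    return {slot: percentile(vals, pct) for slot, vals in sorted(buckets.items())}


# Synthetic data: 30 days of 5-minute samples (already in site-local time),
# 1 Mbps baseline with a 6 Mbps burst at 07:00 every day.
start = datetime(2012, 1, 1)
samples = []
for i in range(30 * 288):
    ts = start + timedelta(minutes=5 * i)
    mbps = 6.0 if (ts.hour, ts.minute) == (7, 0) else 1.0
    samples.append((ts, mbps))

profile = time_of_day_profile(samples)
busiest = max(profile, key=profile.get)
print("Busiest slot:", busiest, "at", profile[busiest], "Mbps")
```

Because each of the 288 slots is summarized separately, the recurring 07:00 burst stands out in the profile instead of being diluted by the other 287 samples of each day.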

Why is this important?
I repeatedly ran into the challenges described above while performing Network Impact and Performance Assessments, a modeling effort that provides per-deployment projections of network impact and response times to support new application roll-outs. Some of these deployments were multimillion dollar, high-visibility projects – there was no way I was going to stake the accuracy of my models, my personal reputation, and the success of these projects on some seat-of-the-pants estimate of bandwidth usage. I needed numbers to put in my models that represented business day usage and availability levels – and they had to be right.

And I’ve often been asked to tell someone when they can schedule a DR transfer, how long it will run, and/or how long it can run without bumping into other scheduled transfers or business day periods. This is almost impossible to do with any confidence without extremely accurate bandwidth usage data and a Time of Day Analysis view.
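
The arithmetic behind a DR scheduling estimate is simple once you trust the usage numbers. Here is a rough sketch with entirely hypothetical values for link speed, evening usage, and transfer size. It assumes the transfer can use all of the spare bandwidth at a steady rate and ignores protocol overhead, so treat the result as an optimistic lower bound.

```python
def transfer_hours(size_gb, available_mbps):
    """Hours to move size_gb (decimal gigabytes) at a steady available_mbps."""
    bits = size_gb * 8 * 1000**3
    seconds = bits / (available_mbps * 1000**2)
    return seconds / 3600


# All values below are hypothetical.
LINK_MBPS = 45.0          # provisioned link speed
EVENING_P95_MBPS = 5.0    # 95th percentile usage during the evening window
WINDOW_HOURS = 12.0       # 6PM to 6AM
DR_SIZE_GB = 90.0         # size of the nightly DR transfer

available = LINK_MBPS - EVENING_P95_MBPS
hours = transfer_hours(DR_SIZE_GB, available)
fits = hours <= WINDOW_HOURS

print(f"Available bandwidth: {available:.0f} Mbps")
print(f"Transfer time: {hours:.1f} hours, fits in window: {fits}")
```

If the evening 95th percentile figure is understated because it came from rolled-up or unfiltered data, the available bandwidth is overstated and the estimated run time will be too short.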

Finally, I’ve often run into situations where insufficient bandwidth was available to support reasonable application response times, yet the bandwidth reports I saw generated (by others) suggested that bandwidth was not the issue when it actually was. Most modern applications need, as a very general rule of thumb, at least 2.5 Mbps of available overhead bandwidth to accommodate short-term, mouse-click demands. If this overhead bandwidth isn’t available, it will take longer to service these demands, and response times will rise accordingly. And the short-term duration of these demands (one to several seconds) is lost even in 5-minute sample time periods, so you have to allow a reasonable amount of spare, overhead bandwidth in your analysis of bandwidth usage and allocations.

For example, the IO Graph below was generated in Wireshark from a capture of a user purchasing an item from the Amazon website. You can clearly see that one of the mouse-click demands generated a short-term peak traffic rate of ~16 Mbps for only a couple of seconds or so. If this request had been made over a WAN link with only a few Mbps of spare overhead bandwidth (and contention from other users’ mouse-click demands), the response time would have been a great deal longer than it was.
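
To get a feel for the numbers in the Amazon example above, here is a crude estimate. It assumes the burst ran at a steady 16 Mbps for exactly 2 seconds and that the same volume of data then has to squeeze through a fixed amount of spare bandwidth; it ignores latency, TCP behavior, and contention, so it is only a rough illustration.

```python
def burst_transfer_seconds(peak_mbps, burst_seconds, spare_mbps):
    """Seconds needed to move one burst's worth of data over spare bandwidth."""
    megabits = peak_mbps * burst_seconds
    return megabits / spare_mbps


# Assumed burst shape: ~16 Mbps for ~2 seconds (about 32 megabits of data).
for spare in (16.0, 5.0, 2.5):
    seconds = burst_transfer_seconds(16.0, 2.0, spare)
    print(f"{spare:>4} Mbps spare -> about {seconds:.1f} s")
```

Under these assumptions, a click that completes in about 2 seconds with 16 Mbps available would take about 12.8 seconds with only 2.5 Mbps spare.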

IO Graph - Purchase R800 from Amazon

To meet the need and challenge of generating accurate bandwidth analysis data, I developed the PacketIQ Bandwidth Statistical Analyzer. If you can export good (short sample time) bandwidth usage sample data from an NMS, you can import it into the Analyzer, adjust link speeds and time zones, filter for day or night periods of interest, Zoom in and look at the times and usage levels of events, and, using the Time of Day Analysis feature, get an accurate graph and stats values over a 24-hour period, statistically derived from the entire multi-day sample data set.
ToD BW Statistical Analysis Report - Large DC WAN Link - 24 Hrs - Jan 2012
You can learn more about, and see examples of, all the challenges I describe above in the video below – as well as an overview of the Bandwidth Statistical Analyzer and how it overcomes all of these issues and allows you to produce accurate bandwidth analysis data.

This is, admittedly, a long video at 35 minutes, but I think you’ll find the information and tips useful for helping you produce more accurate and valuable bandwidth reports, with or without using our Analyzer. If you just want to see how easy it is to use the Bandwidth Statistical Analyzer, skip to ~13.5 minutes.

Introduction to Bandwidth Analysis and the PacketIQ Bandwidth Statistical Analyzer

You can find more info about the PacketIQ Bandwidth Statistical Analyzer on our website:
PacketIQ Bandwidth Statistical Analyzer

Time of Day Analysis is a trademark of PacketIQ Inc.
