Network Monitoring

Creating a Network with Hyper Resiliency and Availability at its Core

A Guide to Network Monitoring and Incident Response

Ever heard of the CIA triad? Well, I guess now is as good a time as ever if you haven't. Among information security professionals, the triad of maintaining strong Confidentiality, Integrity and Availability is a sacred cliche, each element as important as the others when designing a data processing system.

In the case of availability, if you have ever witnessed the disruption that an email system going down, or a core service becoming unresponsive for a significant period of time, has on a modern organisation, you will know that the order of the letters in the CIA triad abbreviation does not reflect a scale of importance.

In fact, in today's world, where almost all organisations are reliant on electronic processing systems, the need for hyper-availability, and for hyper-resiliency in the case of a failure, has become ever more demanding. Take for example the WannaCry ransomware outbreak of May 2017, where entire organisations, including the UK's health system, had their electronic systems disrupted. This event led to production lines being halted, medical operations cancelled and transport networks unable to collect payment from travellers.

Malicious attacks are of course just one of many dynamics when it comes to maintaining high levels of system availability. Other issues could include aged hardware, bottlenecks in network infrastructure or capacity limits being reached. What unites all of these issues is the need for constant and consistent monitoring of network health, to spot problems in their early stages when they are less impactful, and for some form of response plan or action for when things do go wrong.

Solutions do exist which can monitor the health of networks and network devices, covering both current status and historical trends, combined with the capability to alert and react to problems. These are often known as network monitoring solutions, and this article seeks to provide a definitive guide to them and how they might help you achieve hyper-availability and responsiveness.

What is a Network Monitoring Solution?

Without too much imagination, you will probably be able to decipher from the name that the principle is to monitor networks for overall health and for any changes which might indicate a problem.

You would be correct.

Most network monitoring solutions share the same core functionality:

Discovery - searching for devices present on the network or in an owned cloud environment.

Monitoring - ongoing monitoring of critical aspects of the device, for example up-time, disk space and open network ports.

Mapping - the creation of an at-a-glance display which provides an overall view of health across the monitored devices.

Why Use a Network Monitoring Solution?

At this stage you might be wondering why you should use such a solution to maintain high levels of availability over other methods. Industry veteran GFI Software cites ten clear and major benefits of using a network monitoring solution, including the following.

Keep informed - With real-time monitoring, if a failure or irregularity is detected, you can immediately be informed via methods such as SMS, pager, emails or a network message.

Plan for change - Network monitoring solutions allow you to study a constant problem with a closer eye. For example, if a piece of hardware is constantly tripping, it may be the time to replace this hardware.

Diagnose issues - Imagine a scenario where one of your company’s websites goes down. Without network monitoring, you may not be able to tell if the problem is with just the website, the web server or the applications which the site runs on.

Report issues - Network monitoring reports can help you spot trends in system performance, demonstrate the need for upgrades or replacements and prove your value by documenting the otherwise unseen work that keeps the IT systems you manage in top form.

Remediate disasters - If you are immediately notified that there is an issue with one of your systems, the time saved can be used to bring in a backup system to replace the failed one, thereby providing a seamless and efficient service to your users and your customers.

Ensure operation of security systems - Although businesses spend a lot of money, time and resources on security software and hardware, without a network monitoring solution they cannot be sure that the security devices are up and functioning as intended.

Keep track of your web applications - A lot of the services that your company offers to your users and customers are probably web applications. Network monitoring solutions allow you to stay on top of website problems, spot issues before your customers notice and remediate the issues quickly.

Ensure up-time - Network monitoring maximizes network availability by monitoring all systems on your network, including servers, workstations, network devices and applications. Whenever a failure is detected, you will immediately be notified via the alerts.

Key Terms & Protocols

Network monitoring, incident response and IT security generally can be difficult to understand, with endless abbreviations and feature-specific terms. To help you compare and understand the benefits of network monitoring solutions, the key terms and protocols are explained below.

SNMP

SNMP (Simple Network Management Protocol) is split into two mechanisms: regular SNMP on port 161 and SNMP traps on port 162. The latter of the two, SNMP traps, are alerting messages sent from a monitored device to the network monitoring solution, drawing attention to a problem. For example, a hard disk array in a server can be configured to send SNMP traps to a network monitoring solution when one of the disks fails replication.

The former, regular SNMP, is a querying technology whereby the workflow is reversed. Instead of an alert being sent, the network monitoring solution queries the monitored device for health statuses.

OIDs or Object IDs are numerical references used by the SNMP portion of a network monitoring tool to query very specific parts of the monitored device. The response given can be compared to a known good response to indicate health. There are potentially thousands of OIDs per monitored device, all specific to the device being monitored, so device manufacturers create OID libraries which can be imported into network monitoring solutions to make things easier. These are known as MIBs (Management Information Bases).
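To make the OID idea a little more concrete, here is a minimal Python sketch of comparing polled values against known good values. The OIDs shown are standard MIB-II identifiers, but the `KNOWN_GOOD` table, the `evaluate_poll` helper and the polled values are hypothetical, and the SNMP query itself (which would need an SNMP library) is deliberately left out.

```python
# Numeric OIDs mapped to readable names, which is the job a MIB does for a tool.
MIB_NAMES = {
    "1.3.6.1.2.1.1.1.0": "sysDescr",
    "1.3.6.1.2.1.1.3.0": "sysUpTime",
    "1.3.6.1.2.1.1.5.0": "sysName",
    "1.3.6.1.2.1.2.2.1.8.1": "ifOperStatus (interface 1)",
}

# Hypothetical known good values. For ifOperStatus, 1 means "up" and 2 "down".
KNOWN_GOOD = {"1.3.6.1.2.1.2.2.1.8.1": 1}


def evaluate_poll(polled):
    """Return (name, got, expected) for every OID that differs from known good.

    `polled` is a dict of OID -> value, as an SNMP query might return it.
    A missing OID is reported with a value of None.
    """
    problems = []
    for oid, expected in KNOWN_GOOD.items():
        got = polled.get(oid)
        if got != expected:
            problems.append((MIB_NAMES.get(oid, oid), got, expected))
    return problems
```

An empty list from `evaluate_poll` would mean every checked OID matched its known good value.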

SNMP has come under scrutiny since its creation over its lack of inherent security. SNMP v1 and v2c have no authentication capabilities outside of their community strings, despite providing diagnostic information and even allowing remote changes in configuration when write-mode is enabled. SNMPv3 has sought to cure this problem by introducing encryption and authentication, but it is far from universally available and is thought of as complex.

Where SNMPv3 is available, it is recommended that it is used. Where it is not, the recommendation is to change the default community string and never enable write-mode.

WMI

An SNMP-style technology developed for Windows operating systems and included since Windows 2000.

The purpose of WMI (Windows Management Instrumentation) is to define a proprietary set of specifications which allow management information to be shared with a querying network monitoring solution. It provides querying access into all manner of operating system functions such as services, file shares, hardware health and file properties.

The initial request for WMI is made using port 135; subsequent communication regarding that specific request is then made on a randomised port.

As SNMP is not turned on by default in Windows, the de-facto standard for monitoring Windows devices is WMI. However, note that WMI can be used to query SNMP OIDs too.

SNMP and WMI are the two primary protocols used in device monitoring.

Agent and Agentless

Network monitoring solution vendors will pick one of two methods for monitoring devices, agent-based or agentless.

An agent is a small application or piece of software, which is installed onto a device to be monitored. It is then the job of the agent to query the device for health information and pass this back to the network monitoring solution.

The alternative is that nothing is installed on the monitored device and instead the network monitoring solution is provided with credentials for the devices it is to monitor so that it may log onto them remotely and collect diagnostic information.

While there is a mix of both solutions available, agent-based solutions are usually seen within the industry as disadvantageous for a number of reasons:

Agents have a resource overhead which might negatively affect the monitored device.

Agents might be using frameworks such as Java which need updating.

There might be incompatibilities with software installed on monitored devices and the agent.

There is an initial rollout of installing agents which can be labour and time intensive.

If the network monitoring solution is decommissioned, the agents will need to be removed.

Agentless or credential-based network monitoring is generally favoured by users and vendors alike.

Monitors

At an application level, monitors are the individual health queries which the network monitoring solution carries out. For example, ping being used to monitor the uptime of a device's network card is an individual monitor.

Monitoring typically takes one of three forms:

Active monitoring.

Passive monitoring.

Performance monitoring.

This topic will be covered in more detail in a later section of the article.

Alerting

When a device state changes, an alert, whether positive or negative, can be generated to notify one or more people of this change in state.

Email or SMS notifications are common; however, most modern network monitoring tools offer a number of different notification types, some with integration capabilities.

Types of alert include:

Email.

SMS.

Pager alerts.

IFTTT.

Syslog.

Script actions.

Integrations with VMware.

SSH actions.

ServiceNow integrations.

In addition to changes in state, network monitoring tools which have been configured to collect and store performance information can generate notifications based on thresholds. For example, a list of all devices which have used more than 90% of available hard disk capacity in the preceding week.

Alerts like this are useful in predicting future events and being able to correct them before they cause a network or service outage.
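The weekly disk capacity example can be sketched in a few lines of Python. This is only an illustration: the sample format (device name, timestamp, percentage used) and the `devices_over_threshold` helper are assumptions, not the data model of any particular product.

```python
from datetime import datetime, timedelta


def devices_over_threshold(samples, now, threshold=90.0, days=7):
    """List devices whose disk usage exceeded `threshold` in the last `days`.

    `samples` is an iterable of (device, timestamp, disk_used_percent).
    Returns a sorted list of (device, highest_percent_seen_in_window).
    """
    cutoff = now - timedelta(days=days)
    peaks = {}
    for device, taken_at, used in samples:
        # Only count samples inside the reporting window that breach the limit.
        if cutoff <= taken_at <= now and used > threshold:
            peaks[device] = max(used, peaks.get(device, 0.0))
    return sorted(peaks.items())
```

Running a function like this on a schedule and sending the result by email would give the kind of recurring threshold report described above.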

Network monitoring solution Ipswitch WhatsUp Gold includes IFTTT integration which is used to integrate all manner of devices and technology, such as Amazon Alexa and other automation solutions, with a monitored condition. Take a look at this blog post where we were able to create a mobile phone alert from a monitored device.

Discovery, Mapping & Monitoring

With all network monitoring solutions, whether agent-based or agentless, the initial starting point is to discover devices for monitoring.

The discovery phase will cover the entire known network or a specified subnet, depending on the scope of the project and what is to be discovered. In the case of agentless network monitoring, a device is classified as discovered when it responds to one of three tests.

Ping.

SNMP.

WMI.

Depending on whether the discovery scan has been paired with SNMP connectivity parameters (community string) or credentials, the scan will attempt to identify the type of device, displaying the device manufacturer, software version and device type.

Despite being used as a tool for initial discovery before full monitoring, discovery scans should be configured to run on an automated and frequent basis. This will reveal any new devices added to the network, which is especially useful when those device additions are unauthorised. Scan results can often be output into recurring reports which can, in turn, be used as asset registers. These are useful for good IT management and for compliance drivers such as ISO 27001 and the UK's Cyber Essentials programme.
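As a rough sketch of the recurring scan idea, the Python below enumerates the addresses a subnet scan would need to probe and compares a set of discovered addresses with an asset register to flag unknown devices. The function names and the register format are hypothetical, and the probing step itself (ping, SNMP or WMI) is not shown.

```python
import ipaddress


def scan_targets(subnet):
    """Return every usable host address in a subnet such as '192.168.1.0/24'."""
    return [str(host) for host in ipaddress.ip_network(subnet).hosts()]


def unregistered_devices(discovered, asset_register):
    """Return discovered addresses that do not appear in the asset register.

    Both arguments are iterables of IP address strings. The result is sorted
    numerically so that 10.0.0.9 comes before 10.0.0.10.
    """
    unknown = set(discovered) - set(asset_register)
    return sorted(unknown, key=ipaddress.ip_address)
```

Anything returned by `unregistered_devices` is a candidate for investigation as a new or unauthorised addition to the network.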

Network Monitoring Maps

With the network discovered and an initial set of devices recorded, those that are to be monitored are marked as such, moving them into monitored mode. Once marked as monitored, devices will be displayed in some form of status screen, often represented as a map. One such example from Ipswitch WhatsUp Gold is shown below.

Network maps serve multiple functions and are useful for IT teams and operation centres for the following reasons.

Maps provide an overall health screen with colour coded indicators.

Maps will show both the physical and logical links between devices in the network and even indicate the health of those connections.

Maps are interactive and clicking on an individual device can open up additional diagnostic control or displays of supplementary information.

Maps can be paired with geographical maps or building blueprints to show physical location.

In most solutions, the mapping function is dynamic, reacting to changes in conditions across the network. For this reason and those highlighted above, it is very common to see such maps on large displays in IT team or operation centre offices.

Five Things to Look for in a Network Monitoring Solution Map

If you are looking for the right network monitoring solution for your organisation, the suitability of the mapping function is crucial, as this is likely to be the most used function on a day-to-day basis.

Consider our top five mapping features when looking at solutions:

Colour coded indicators for each monitor on a device to indicate health.

Grouped maps, whereby you can view a subset of the network. For example, a map which shows just boundary routers or devices located in the US offices.

Interactivity, whether that be tools such as SSH or drill-down information about the selected device.

Logical and physical links, showing virtualization and WiFi AP to WLC relationships; and those devices connected to each other with a cable.

Link utilisation and bandwidth monitoring. Devices are not the only things which need monitoring; the connecting links need monitoring for bottlenecks too.

We have put together a list of all the network monitoring solution must-haves in this blog post here. Check it out.

Active, Passive and Performance Monitors

Network monitoring solutions typically have three types of monitors, which can be used in combination to get an overall and comprehensive idea of device health.

Active monitors - Used to display immediate health, active monitors poll a device for an open port, service status or ping response and then compare the result to a known good or healthy value. Active monitors can poll as frequently as every 60 seconds to keep the status of that monitor as current as possible.

Active monitors lack any real intelligence and are, at a basic level, "are you alive?" tests.

Passive monitors - In the opposite fashion to an active monitor, a passive monitor is alerted by the monitored device to a condition which it has logged. Depending on the device type, this could be an SNMP trap, a syslog message or a Microsoft event log.

The network monitoring tool will be configured to look out for particular error codes or event IDs which are of interest and flag these using an alerting type when they appear.

Performance monitors - Unlike the active and passive monitors, performance monitors are not indicators of immediate health. Instead they collect long term data regarding hardware such as disk space, RAM utilisation and CPU usage. This can then be plotted on graphs and other analysis tools for trend purposes.

Performance monitors are less about the short term availability of devices and more focused on the longer term.
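As a rough illustration of the simplest of these, the active monitor, the Python sketch below polls a TCP port and compares the result with a known good value. The function names are hypothetical, and a real tool would run the check on a schedule (for example every 60 seconds) rather than once.

```python
import socket


def tcp_port_is_open(host, port, timeout=2.0):
    """Active check: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def monitor_state(host, port, expected_open=True):
    """Compare the polled result with the known good value for this monitor."""
    return "up" if tcp_port_is_open(host, port) == expected_open else "down"
```

A call such as `monitor_state("192.0.2.10", 443)` (a documentation address, used here only as a placeholder) would report "down" if the port could not be reached within the timeout.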

Long-Term Performance Monitoring

The final of the three monitor types, performance monitoring, can provide a rich insight into the health of network devices over long periods of time, through dashboards and reports.

Take the image below as a sample. This top 10 dashboard highlights network interface utilisation and ping availability statistics over a 24-hour period, giving insight into areas of the network where poor connectivity might be experienced.

In theory any active monitor can be turned into a performance monitor, however the most common performance monitor types include:

Ping Latency and Availability.

CPU Utilisation.

Disk Utilisation.

Memory or RAM Utilisation.

UPS performance statistics such as total charge.

Hyper-V and VMware statistics.

Crucially, long term performance statistics can be turned into threshold alerts which allow you to monitor for breached thresholds. For example, you might be interested to receive an alert each month for devices which have consumed more than 90% of their available disk capacity. This type of quick-to-hand information is invaluable when avoiding a major outage due to exhausted disk space on a critical resource.

Cloud Monitoring & Hybrid Networks

The network of today very rarely lacks some element of cloud infrastructure. Whether it be for development, hosting an externally facing service or to reduce your data center footprint, networks are becoming ever more hybrid and cloud aligned.

There are a number of cloud providers offering hosting options today. Some of the more popular include:

Microsoft Azure.

Amazon AWS.

Google Cloud.

Irrespective of the cloud provider you are using, the infrastructure you place in the cloud is likely to have some importance to your organisation and its data processing activities. As a result, these cloud hosted devices will need to include some of the protections you would expect from any other device.

A good example of this is antivirus software. In today's security focussed world, it is almost inconceivable not to have antivirus software installed on a server hosted in an on-site data center. The same should be true for the cloud.

Network monitoring solutions are being used to bridge the management and monitoring gaps between the cloud and the internal networks, as the two begin to merge. No longer is cloud seen as developmental and new; instead, it is expected to afford the same capabilities and more.

For example, cloud hosted infrastructure is priced based on both the size of the hosted infrastructure, in terms of its resource requirement, and the time it is in an online state. This is not typically the case for in-network hosted equipment, and so there is an additional need to monitor cost for cloud.

These capabilities are of course all achievable using the APIs provided by the cloud hosting service provider. In the case of Azure, Microsoft provide an API key which can be fed into a supporting network monitoring tool, so that it can read the properties of the cloud hosted devices.

From this, there are a number of metrics which can be derived, such as:

Bandwidth.

Device health, such as disk space.

Online and offline states.

Total accumulated cost for a period of time.

Connected users.

Running services.

This is not an exhaustive list, as the API capabilities of cloud hosting service providers are being constantly updated. For a full list, refer to Azure's API reference documentation.

Once you are monitoring your cloud infrastructure in a manner similar to the in-network devices, you can also benefit from the same alerting and incident response features, such as alerting by email to state changes, thresholds which indicate impending issues being met and corrective actions being executed.

What is Netflow?

Netflow is a diagnostic and analytics protocol, originally created by industry giant Cisco. It is used to collect and record all IP traffic going to and from a network device which has the netflow function or capability enabled. This collected packet data is then usually forwarded to a netflow analyser or network monitoring solution where it is collated and presented in a readable format.

How Does Netflow Work?

The netflow cache is a temporary holding space in system memory where data flows are held before being handed to the exporter for delivery to your configured netflow analyser or network monitoring tool.

Netflow attempts to identify flows or strings of related network packets, rather than treat each individually. This helps to understand the context of network conversations.

Each time a packet is received on a network device, its source, destination, port numbers, protocol, TOS byte and input interface are analysed to determine the flow it belongs to. Once identified, it is then added to its respective flow and stored in the netflow cache.

Once the netflow cache reaches its maximum size or its time to live value expires, the contents of the netflow cache are exported to a configured destination determined by you. This could be a dedicated netflow analyser tool or a full network management suite which accepts netflow as a complementary feature.
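The cache and export behaviour described above can be sketched in Python as follows. This is a simplified model, not Cisco's implementation: the `FlowKey` and `FlowCache` names are hypothetical, only the maximum-size trigger is modelled (the time to live expiry is omitted), and packets are represented by a key and a byte count rather than parsed from the wire.

```python
from collections import namedtuple

# The fields used to decide which flow a packet belongs to.
FlowKey = namedtuple(
    "FlowKey", "src_ip dst_ip src_port dst_port protocol tos input_if"
)


class FlowCache:
    """Groups packets into flows and exports them when the cache is full."""

    def __init__(self, max_flows, exporter):
        self.max_flows = max_flows
        self.exporter = exporter  # callable receiving a list of (key, stats)
        self.flows = {}

    def add_packet(self, key, size_bytes):
        # Packets with an identical key are counted against the same flow.
        stats = self.flows.setdefault(key, {"packets": 0, "bytes": 0})
        stats["packets"] += 1
        stats["bytes"] += size_bytes
        if len(self.flows) >= self.max_flows:
            self.export()

    def export(self):
        # Hand the cached flows to the exporter, then start with an empty cache.
        if self.flows:
            self.exporter(list(self.flows.items()))
            self.flows = {}
```

Because related packets are summarised into a single record with packet and byte counts, the analyser receives the context of each conversation rather than a stream of unrelated packets.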

Network monitoring solutions such as the widely acclaimed Ipswitch WhatsUp Gold include an extension for netflow analysis. With drill-down reports and real-time dashboards, you have complete visibility of your network traffic.

How to Turn on Netflow

For detailed and precise steps for turning on netflow or any of its rival derivatives, it is recommended that you refer to the manufacturer's guidance.

Network monitoring solutions listen passively for netflow traffic being sent to them, typically on port 9999 or 9995. Some network monitoring solutions will allow you to utilize their existing connection to the network device via SNMP to configure and enable netflow, saving you the need to find the manufacturer's instructions.

Benefits to Using Netflow

While some free analysers do exist, they are limited in functionality and will often restrict the number of sources, so you will be left asking whether paying for a netflow solution or plug-in is a nice-to-have or a worthwhile investment.

A number of our customers use netflow analysing features and have cited different reasons, including:

Understanding why network speeds would slow at particular times in the day.

Discovering how much traffic is related to internet browsing during working hours.

Monitoring large file transfers or cloud destined backups during the night.

Understanding the makeup of traffic in the network.

Discovering bottlenecks which need correcting.

Discovering outbound routes, some of which had been thought to have been disused.

Incident Response & Alerting

For almost any network monitoring solution project, a core business outcome will be the proactive alerting of service outage before it takes place, so that such disruption can either be avoided or contained early enough that it is minimised.

Therefore the alerting and incident response capabilities of any selected or implemented network monitoring solution are of paramount importance.

Alerting functions tend to take two major forms.

Immediate alerting.

Threshold alerting.

In the case of immediate alerting, a message of some kind is sent to alert to a current state or just changed state. For example, if a network switch fails to reply to ICMP or ping packets within a 60 second window, the state is assumed to be down and an immediate alert is sent.

The latter warns of thresholds being met, which could indicate a problem developing in the near future. An example of this might be a hard disk in a server reaching 90% capacity. The server is still operational; however, it has been flagged as a device which may need some remedial action to avoid a future outage. This could also be referred to as predictive trending analysis.
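One simple way to picture predictive trending is a straight-line projection of disk usage. The Python sketch below is an assumption about how such a projection could be made, not a description of how any given product does it; the `days_until_full` name and the sample format are hypothetical.

```python
def days_until_full(samples, capacity_percent=100.0):
    """Project how many days remain until a disk is full.

    `samples` is a list of (day_number, used_percent), oldest first.
    Uses the growth between the first and last sample as a straight line.
    Returns None when usage is flat or shrinking, or there is too little data.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day == first_day:
        return None
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (capacity_percent - last_used) / growth_per_day
```

For a disk that grew from 80% to 90% over ten days, the projection is ten more days until it is full, which is the window in which remedial action could be taken.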

In either case, the mechanism for delivering the alert may be any of the following actions:

Email.

SMS.

Pager alert.

Syslog.

SNMP trap.

Write to log file.

IFTTT interaction.

Integration with a third-party solution, such as ServiceNow.

Post into Slack or other team chat utility.

Push notification on a smartphone app.

Have you come across IFTTT before? In a recent blog post, we used IFTTT to generate alerts from a network monitoring solution which can be sent to almost any internet enabled device. Read more here.

Different devices may be owned and maintained by different teams, meaning alerts must be routed to the correct parties. In addition, it might be wise to think about having escalating alerting, whereby if a device remains in a state for a period after the first alert has been sent, another can be sent via a different means or to a different recipient. For example, should a VMware host become unavailable, in the first instance email the virtualisation team. If the host remains unavailable for a further 30 minutes, send an SMS to the manager of the virtualisation team.
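The escalation example can be expressed as a small policy table. The sketch below is illustrative only: the `ESCALATION_STEPS` structure, the recipient names and the `alerts_due` helper are hypothetical, and the actual sending of the email or SMS is left out.

```python
# Each step: (minutes in the down state before it fires, channel, recipient).
ESCALATION_STEPS = [
    (0, "email", "virtualisation-team"),
    (30, "sms", "virtualisation-team-manager"),
]


def alerts_due(minutes_down, already_sent):
    """Return the escalation steps that should fire now.

    `already_sent` is a collection of steps that were delivered earlier,
    so that the same alert is not repeated on every polling cycle.
    """
    return [
        step
        for step in ESCALATION_STEPS
        if minutes_down >= step[0] and step not in already_sent
    ]
```

Adding a further tuple to the table would be enough to introduce another escalation level, for example a later step aimed at a different team.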

With support teams mobilised at the point of the outage taking place, the road to resumed service should be much shorter. Not to mention the preemptive fixes made by those threshold based alerts.

Where a fix is known, network monitoring solutions can become incident response tools and perform corrective actions. For example, should a known problematic Windows service turn to an off state, a network monitoring solution can detect this and restart it, resulting in minimal impact.

The following are some of the possible corrective actions:

Execute a PowerShell script.

Execute a batch file.

Take a VMware based action.

Interact with an API.

Restart a service.

Run an SSH command or an SNMP write command.

With the use of APIs or scripting, almost anything is achievable as a corrective action.
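A minimal way to think about corrective actions is as a lookup from a monitored condition to a piece of code that attempts the known fix. The Python sketch below assumes a hypothetical `CorrectiveActions` registry; in practice the registered callable would wrap a script, an SSH command or an API call, none of which are shown here.

```python
class CorrectiveActions:
    """Maps a (monitor, state) pair to a callable that attempts a known fix."""

    def __init__(self):
        self._actions = {}

    def register(self, monitor, state, action):
        """Remember which action to run when `monitor` enters `state`."""
        self._actions[(monitor, state)] = action

    def handle(self, monitor, state):
        """Run the registered action, if any, and report whether one ran."""
        action = self._actions.get((monitor, state))
        if action is None:
            return False
        action()
        return True
```

A state change with no registered fix returns False, which is the point at which a tool would fall back to alerting a person instead.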

Take another example of there being two service-providing servers, one of which is accidentally taken offline. A network monitoring solution could detect that this has happened and send an SSH command to a critical router which changes the routing path from the offline server to one which has been sitting in a cold backup site.

In today's world, it is ever increasingly important to maintain high levels of availability for both internal services and those which are public facing. Employees demand remote working capabilities which are leading to an increase in non-standard working hours; and an organisation's presence online means it is expected to provide a service at a 24-hour convenience.

This style of hyper-availability has ultimately led to the need for hyper-resilience in the face of both cyber threats and loss of service.

Licensing Principles

So you are interested in using a network monitoring solution? Good, our article writing skills were not in vain.

The question then becomes whether to go with an off-the-shelf commercial solution or a freeware option. The freeware / open source / DIY option is a question which arises in any new project as a way of saving on cost. After all, good software solutions are not cheap and justifying the need to senior managers can often be an art in itself.

DIY Options

A DIY build of a network monitoring solution is usually not formally planned. It just starts and evolves as your requirements dictate. Over time, it usually becomes the responsibility of a very small number of people or even a sole individual, within the organisation who becomes the owner for the home grown solution.

One thing which all organisations who adopt freeware or create their own solutions agree on is that ongoing maintenance of such solutions can consume a considerable amount of at least one, but usually a few, people’s time. Smaller organisations have reported that, on average, one of their skilled IT operations personnel needs to spend up to 40% of their time maintaining their home grown tool.

The cost of building the initial version of the tool is often not pre-calculated, but let’s assume that a small IT services organisation allocates one experienced IT operations engineer for 50% of his or her time to develop a solution over a period of six months. If that engineer has a £45,000 salary, the initial build cost is £11,250 (£45,000 x 0.5 x 0.5), that is, half of the engineer's time for half a year.

The cost of maintenance.

Given the constant rate of change in the technology sector, it’s reasonable to assume that up to 30% of an engineer’s time will be needed to maintain and update the DIY monitoring solution, which would include adding new functionality to meet ongoing requirements. The annual maintenance cost would therefore conservatively be £13,500 (£45,000 x 0.3) each year.

The opportunity cost.

This is usually a hidden cost that many organisations fail to factor in at all. IT services organisations and many in-house IT departments charge their customers a fixed hourly or daily rate for their qualified engineers. So again, let’s assume that one of the IT engineers is spending 30% of their time annually maintaining the in-house solution instead of providing the service they usually would provide to your customers or departments. We can make the following additional assumptions based on what would be typical industry norms:

The engineer’s Daily Charge Out Rate is £325/day

The number of billable days per annum per engineer is 220

The opportunity cost, or “lost revenue” that your business has missed out on because of your engineer maintaining your in-house tool is £325 x 220 x 0.3 = £21,450 per annum

Therefore the initial cost is £11,250 and the ongoing cost is £34,950 per annum (£13,500 of maintenance plus £21,450 of opportunity cost).
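The arithmetic behind these figures can be reproduced with a short Python calculation, which also makes it easy to substitute your own salary, day rate or time share. The variable names are illustrative; the values are the assumptions stated in this section.

```python
SALARY = 45_000           # engineer's annual salary in pounds
BUILD_SHARE = 0.5         # share of the engineer's time spent on the build
BUILD_YEARS = 0.5         # six months
MAINTENANCE_SHARE = 0.3   # share of time spent on upkeep each year
DAY_RATE = 325            # daily charge out rate in pounds
BILLABLE_DAYS = 220       # billable days per annum

# Rounded to whole pounds to avoid floating point noise in the totals.
build_cost = round(SALARY * BUILD_SHARE * BUILD_YEARS)
maintenance_cost = round(SALARY * MAINTENANCE_SHARE)
opportunity_cost = round(DAY_RATE * BILLABLE_DAYS * MAINTENANCE_SHARE)
ongoing_cost = maintenance_cost + opportunity_cost

print(f"Initial build cost: £{build_cost:,}")
print(f"Ongoing annual cost: £{ongoing_cost:,}")
```

Raising `MAINTENANCE_SHARE` to 0.4, the upper figure reported by smaller organisations, increases both the maintenance cost and the opportunity cost.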

This might be a palatable amount depending on the size of the organisation and the use case.

Freeware & Open Source Options

Free network monitoring tools are popular among smaller organisations who find it harder to justify IT spending.

Organisations who use freeware options tend to return to off-the-shelf commercial offerings later in life, whether it be because they have larger budgets or have had a poor experience. Some of the cited reasons we have noted are:

Significant difficulty in customisation, often having to be achieved with in-depth scripting knowledge.

Lack of support for when things go wrong or a customisation is required.

Vulnerabilities remaining unresolved in the solutions long after disclosure.

Lack of support for new devices to be monitored.

Difficulties in upgrading or migrating.

Freeware and open source software might be a quick win for the finance or procurement department, however the ongoing difficulties mean that network monitoring projects are far too often abandoned.

Off-the-Shelf Commercial Software

Commercial network monitoring software from industry vendors is the best option in our opinion, with dedicated support, routine development of the solution and lower ongoing costs.

Network monitoring solutions are primarily licensed in two manners: monitor based licensing, sometimes called sensor based licensing, and device based licensing.

Monitor based licensing is priced per monitor applied to a device. For example, if you wanted to monitor all the ports on a standard 48-port switch, you would need to factor in a cost of 48 x monitor price. The price of a monitor is typically much less but you will need to purchase more of them.

Device based licensing takes the view that all monitors are free so long as there is a license available for a particular device. 100 devices to be monitored costs 100 x the device price, and each device can be monitored with an unlimited number of monitors. In our example with the 48-port switch, the switch would consume one license and all ports would be monitored by default. Of course this makes the price of a device license higher than a monitor license; however, you will need fewer of them.
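To compare the two licensing models for your own estate, the calculation can be sketched in Python as below. The function names and any prices passed in are hypothetical; real pricing varies by vendor and is usually tiered.

```python
def monitor_based_cost(monitors_per_device, monitor_price):
    """Total cost when every monitor on every device needs a license.

    `monitors_per_device` is a list holding the monitor count for each device.
    """
    return sum(monitors_per_device) * monitor_price


def device_based_cost(device_count, device_price):
    """Total cost when each device needs one license, whatever is monitored."""
    return device_count * device_price


def cheaper_model(monitors_per_device, monitor_price, device_price):
    """Return which model costs less for the given estate, or 'equal'."""
    by_monitor = monitor_based_cost(monitors_per_device, monitor_price)
    by_device = device_based_cost(len(monitors_per_device), device_price)
    if by_monitor == by_device:
        return "equal"
    return "monitor" if by_monitor < by_device else "device"
```

As a rule of thumb from this model, device based licensing wins once the average number of monitors per device exceeds the device price divided by the monitor price.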

Whichever option best suits, it is important to factor in a five year pricing plan with growth expectations to ensure that you are investing in the right tool. With many solutions offering perpetual licensing, the year 1 investment is high, and so a mistake in solution choice is a costly one.