
There is no doubt that information security is one of the main objectives of every organization that relies on an ICT infrastructure. In some organizations, the task of maintaining information security is assigned to dedicated teams that are not concerned with keeping the network up and running or delivering IT services. However, it is crucial that the network, IT, and security teams collaborate to protect the organization’s information assets. One area where the ICT team can support the security team is by implementing proper network management functions.

Network management best practices follow the ITU-T Telecommunications Management Network (TMN) framework, which splits network management functions into five key areas referred to by the acronym FCAPS. It can be argued that efficient information security starts after these five functions are put in place and used properly. To elaborate, here are some of the areas where the FCAPS functions play vital roles in securing the data assets of the organization:

Fault Management Functions

Active/Passive Monitoring

Organizations’ security concerns often focus on protecting their data and ensuring its integrity and confidentiality. Availability of the services provided by the IT infrastructure is also an important aspect of information security, as attackers may target the infrastructure with denial-of-service (DoS) attacks in an attempt to prevent the organization from conducting its normal operations. For instance, a DoS attack may aim at preventing the organization from collecting toll fees and generating revenue.

Fault Alerts

Active and passive monitoring of network devices or network services (such as the organization’s online sales portal) for continued activity provides network administrators with alerts when these devices or services stop functioning properly. Whether the outage is caused by a malfunction or a cyber attack, restoring the services and resuming normal operations is the responsibility of both the network management and information security roles.

Configuration Management Functions

Information security relies on configuration management in many aspects, including configuration monitoring, change control, and auditing. For example, the organization can use the following configuration management functions within the information security context:

Topology Discovery

Topology discovery and device inventory tools can detect devices that are connected to the network without authorization. When the organization’s infrastructure covers a large area, this capability is necessary to enforce change management controls as well as to detect malicious attempts to infiltrate the infrastructure through devices located in remote areas.

Configuration Audit

Regular configuration audits provide the ability to detect any change to the configuration that may weaken network security. As many organizations rely on various contractors for technical services and support, the ability to detect and track configuration changes made to network devices by contractors allows the organization’s staff to assess each change from the security point of view. Also, if an attacker manages to break into the network and change a device’s configuration to pursue an advanced attack, a comparison with an older configuration will detect the change, and the proper configuration can then be restored to recover from the attack.
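As a toy sketch of this kind of configuration comparison (the device configuration lines below are invented for illustration), a stored baseline can be diffed against the running configuration to surface changes that need a security review:

```python
import difflib

# Hypothetical baseline and running configurations (Cisco-like syntax,
# invented for this example).
baseline = [
    "hostname edge-1",
    "no ip http server",
    "line vty 0 4",
    " transport input ssh",
]
running = [
    "hostname edge-1",
    "ip http server",
    "line vty 0 4",
    " transport input telnet",
]

# Keep only added/removed lines, dropping the +++/--- diff headers.
changes = [
    line
    for line in difflib.unified_diff(baseline, running, lineterm="")
    if line[:1] in "+-" and not line.startswith(("+++", "---"))
]
print(changes)
```

Each surviving `-`/`+` pair flags a change to investigate; here, HTTP management was re-enabled and remote access was downgraded from SSH to Telnet, both of which should trigger a security assessment.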

Equipment Hardening

Equipment hardening is a basic best practice that the organization should follow to maintain network security. The practice includes restricting physical and logical access to the network infrastructure to authorized personnel, disabling protocols that are not needed or are considered insecure (such as HTTP and Telnet), and shutting down unused ports to prevent unauthorized access. Port-based authentication protocols such as IEEE 802.1X can be implemented to limit wired and wireless access to authenticated devices and users.

Accounting Management Functions

Although accounting management functions are largely ignored in organizations that do not track usage or charge fees for using ICT resources, tracking these resources can serve security purposes. Restricting and monitoring resource usage through quotas (such as disk space limits) can protect the organization from the abuse of these resources by employees or by outsiders who manage to gain access to them.

Performance Management Functions

Performance monitoring tools can gather information to satisfy security and compliance requirements. Performance analysis tools can generate security reports directly or export the data to dedicated security tools for further analysis and reporting. Network management teams can use performance monitoring to support information security in these areas:

Utilization Monitoring

Monitoring the utilization of certain resources or events (Internet bandwidth, disk space, number of failed login attempts, etc.) and setting thresholds for normal values can assist in identifying security incidents. A sudden surge in Internet traffic may signal that an upload of a large amount of data is in progress, or the onset of a distributed denial-of-service (DDoS) attack.
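A minimal sketch of this idea (the traffic values and the three-sigma threshold are illustrative, not a recommendation): compare each new sample of a monitored metric against a baseline derived from recent history and flag values far above it:

```python
from statistics import mean, stdev

def find_anomalies(samples, window=24, sigmas=3):
    """Flag samples more than `sigmas` standard deviations above the
    rolling baseline formed by the previous `window` samples."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        threshold = mean(baseline) + sigmas * stdev(baseline)
        if samples[i] > threshold:
            alerts.append((i, samples[i]))
    return alerts

# Steady traffic around 100 Mb/s, then one sudden surge to 400 Mb/s:
traffic = [100 + (i % 5) for i in range(48)] + [400]
print(find_anomalies(traffic))  # only the surge at index 48 is flagged
```

Real monitoring systems add seasonality (time-of-day, day-of-week baselines), but the core of threshold-based alerting is this comparison of a current sample against its recent history.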

Event Correlation

Correlating traffic anomalies and other events to detect security incidents is a function that sophisticated security tools provide. Similar results are also possible using performance management tools that rely on trend monitoring and correlation functions to detect and isolate network problems. For instance, a surge in upstream traffic accompanied by a drop in failed login attempts could be a sign that an attacker has succeeded in gaining access to the network.

Traffic Analysis

Awareness of the type of data traffic flowing in and out of the network can be gained by using protocols such as NetFlow (or IPFIX) and their analysis tools. NetFlow provides the security administrator with information such as the main sources and destinations of traffic, as well as the protocols and applications in use. This information provides clues about suspicious activity, such as traffic going to uncommon destinations as a result of malware infections or botnets.
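As a rough sketch of the kind of aggregation a NetFlow/IPFIX analyzer performs (the addresses and flow records below are made up), flows can be summed per destination to surface the main traffic sinks, where an unexpected entry would warrant investigation:

```python
from collections import Counter

# Hypothetical flow records: (src, dst, application, bytes) -- a simplified
# subset of the fields a NetFlow/IPFIX collector exports.
flows = [
    ("10.0.0.5", "203.0.113.9", "https", 120_000),
    ("10.0.0.5", "198.51.100.7", "dns", 2_000),
    ("10.0.0.8", "203.0.113.9", "https", 80_000),
    ("10.0.0.5", "192.0.2.66", "irc", 50_000),
]

def top_destinations(flows, n=3):
    """Aggregate bytes per destination to rank the main traffic sinks."""
    totals = Counter()
    for src, dst, app, nbytes in flows:
        totals[dst] += nbytes
    return totals.most_common(n)

print(top_destinations(flows))
```

Here the third-ranked destination carries IRC traffic, a protocol historically associated with botnet command-and-control, which is exactly the kind of clue flow analysis is meant to surface.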

Security Management Functions

The ‘S’ in the FCAPS model focuses on securing the network infrastructure and controlling access to devices. To achieve the goal of securing access to the ICT assets, the organization needs:

Centralized Authentication

Controlling access to devices in the infrastructure using a centralized authentication server (e.g. RADIUS). Such a service allows the network manager to create access policies based on user profiles and to track usage.

Multitier Access Privileges

Developing different access and authorization levels for the various groups of users who may need access to the infrastructure (network administrators, engineers, operations, security personnel, vendors, contractors, etc.).

Access Logging

Configuring devices to feed their generated logs to a centralized server. In addition to their value in troubleshooting problems, the logs can be used to detect anomalous behavior that may be a symptom of a malicious attack.

Conclusions

There are several ways in which the implementation of network management’s FCAPS functions can support the objectives of information security. For this reason, network management and security should be treated as two complementary functions in an organization. In fact, in many SMB organizations there may be only one ICT team, and information security must begin with proper network management.

Your organization’s firewall is the first line of defense against cyber attacks, and it is where access policies are implemented. In a typical organization, firewall policies change constantly to respond to various threats and to adapt to changes in the network environment. Therefore, a regular audit of the firewall rules is necessary, not only to maintain the security of the network, but also to ensure the correct and optimal functioning of the firewall as policy rules grow more granular and complex.

Such a firewall audit should look for common problems that result from frequent changes to firewall policies and provide recommendations on how to correct them. Among the common problems to watch for are:

Excessively permissive rules: Rules that use “any” or “*” in one or more of their fields permit more packets than the network operations require. These rules increase the risk of exploitation.

Redundant rules: A rule is redundant if there is another (prior or subsequent) rule that matches the same packets and requires the same action such that if the redundant rule is removed, the security policy will not be affected. Redundant rules enlarge the size of the security policy unnecessarily and degrade the firewall’s performance.

Shadowed rules: This situation occurs when a rule matches all the packets that subsequent rules should match but with a different action. Shadowed rules are problematic because they are never activated, resulting in an incorrect implementation of the security policy.

Unused rules: This includes rules that have not matched any packets for a significant period of time. They are often caused by a change in the network or the applications that is not reflected in the firewall policy. These rules clutter the firewall policy and decrease performance. They also slow policy maintenance and hinder troubleshooting problems.

Disabled rules: These are rules that are marked as inactive or disabled but have not yet been removed from the policy. Unless they are kept for a good reason, disabled rules increase clutter and memory usage.
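The redundant and shadowed cases above can be sketched with a deliberately simplified model in which each rule matches an explicit set of flows (real firewall fields are address ranges and port intervals, so production tools use interval arithmetic instead; this sketch also only checks a rule against earlier rules, whereas redundancy can involve subsequent rules too):

```python
def audit(rules):
    """Label rules whose every matching flow is already handled earlier:
    'redundant' if the earlier action is the same, 'shadowed' if not."""
    findings = []
    seen = set()        # flows already claimed by earlier rules
    first_action = {}   # flow -> action of the first rule matching it
    for i, rule in enumerate(rules):
        flows = rule["match"]
        if flows and flows <= seen:
            same = all(first_action[f] == rule["action"] for f in flows)
            findings.append((i, "redundant" if same else "shadowed"))
        for f in flows:
            seen.add(f)
            first_action.setdefault(f, rule["action"])
    return findings

rules = [
    {"match": {("any", "web-srv", 443)}, "action": "allow"},
    {"match": {("any", "web-srv", 443)}, "action": "allow"},  # redundant
    {"match": {("any", "web-srv", 443)}, "action": "deny"},   # shadowed
]
print(audit(rules))  # [(1, 'redundant'), (2, 'shadowed')]
```

The shadowed rule is the dangerous one: the administrator believes port 443 traffic is now denied, but the earlier allow rule silently wins.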

Sound practice is to perform regular audits (e.g. twice a year) to clean up all the redundant, unused, and disabled rules that accumulate from removing services that no longer exist, temporary exceptions, network upgrades, mergers, and so on. It is also extremely important to find and correct shadowed rules and to restrict overly permissive rules in order to improve security and adhere to the organization’s security policy.

A manual audit of firewall policy rules is tedious and error prone, and it adds a significant workload for network administrators. Yet the audit is necessary, and may even be mandated for compliance purposes. Automating parts of the audit process can reduce this complexity and yield significant improvements.

At DynamikNets, we have developed the tools to automate firewall policy audits and recommend improvements. The tools inspect firewall configurations from major vendors and identify rule anomalies and other problems. Combined with manual review of other firewall data, we are able to provide our customers with comprehensive recommendations of the changes that need to be made to the firewall rules to optimize performance.

To learn more about DynamikNets firewall policy auditing capabilities and services, please contact us. Also, please tell us more about your firewall audit practices by answering an anonymous survey.

Any organization that relies on data networks for its core operations needs to ensure the continued availability of its network infrastructure, which includes LAN devices (switches, routers, firewalls, etc.), WAN links, Internet and cloud connections, and the support facilities (power, air conditioning, etc.). The network operator may achieve the desired level of LAN availability using several approaches. Availability of services obtained from service providers or carriers is often defined by an SLA (service level agreement) that states, among other things, the percentage of time during which the service is expected to be up and running (uptime). It is common for the uptime to range from 99.95% to 99.9999%, depending on the type of service. Availability of five nines, 99.999% uptime or a little over 5 minutes of downtime a year, is considered the norm for telecommunication carriers [1], but as networks are becoming vital to business continuity for many organizations, the need for five nines availability will proliferate.

The percentage of uptime alone does not provide sufficient information about the availability of the network. For instance, 99.99% availability translates to about 52 minutes of downtime per year. This outage can occur on one occasion, in periods of four minutes every month, or one minute a week. Therefore, availability is better measured and controlled using the metrics MTBF (mean time between failures) and MTTR (mean time to repair). The two metrics are related to the percentage of availability by the equation shown below, but they provide a better estimate of how long the network is expected to be operational and how fast it can recover from an unexpected failure.

Availability = MTBF / (MTBF + MTTR)        (1)
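A quick numeric check of Equation (1), reproducing the ~52 minutes/year figure quoted above (the MTBF of one failure per year is an assumption chosen for the example):

```python
def availability(mtbf, mttr):
    # Fraction of time the system is up; both inputs in the same units.
    return mtbf / (mtbf + mttr)

def downtime_minutes_per_year(avail):
    return (1 - avail) * 365 * 24 * 60

# A system that fails once a year (MTBF ~ 8760 hours) and takes
# ~52 minutes to repair lands at roughly "four nines":
a = availability(8760, 52 / 60)
print(round(a, 4))                             # 0.9999
print(round(downtime_minutes_per_year(a), 1))  # 52.0
```

The same formula shows the two levers available to the operator: lengthen MTBF (more reliable components) or shorten MTTR (faster detection and repair).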

Reliability

Reliability of a given system or component refers to the probability (likelihood) that it is operational at any given time. A network component that remains operational, on average, for 364 days a year is said to have a reliability of 99.73%. The component can also be described as having a failure frequency of one day per year. Components with high reliability are expected to have a long MTBF and a low frequency of failure.

Networks consist of many interdependent systems and components, such as internetworking devices (routers and switches), facilities (power, air conditioning, rack space, etc.), physical and data security, and management and configuration controls. If the relationship among these systems is viewed as a connected chain of functions, the network can be operational only if all of them are functioning properly. Then the reliability, R, of the network is the product of the reliabilities of all the individual systems:

R = R1 × R2 × … × Rn        (2)
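Equation (2) can be checked numerically; the illustrative numbers below show how quickly overall reliability erodes as stages are chained in series:

```python
from math import prod

def series_reliability(stages):
    # The network is up only if every interdependent stage is up,
    # so the overall reliability is the product of the stage reliabilities.
    return prod(stages)

# Five stages, each individually 99.9% reliable, give only ~99.5% overall:
r = series_reliability([0.999] * 5)
print(round(r, 4))  # 0.995
```

This is why a chain of merely "good" components cannot reach five nines: every added stage multiplies in another factor below 1.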

According to Equation (2), to end up with a network of 99.999% reliability, the individual components and systems must have even higher reliability. As the number of components in the network increases, the individual reliability must increase as well. The reliability of hardware components and of services acquired from providers (e.g. WAN connections) is often beyond the network operator’s control. Instead, the effort is concentrated on eliminating single points of failure (SPOF) in the network to reduce the chance that the failure of a single component takes down the entire network [2].

Redundancy

Redundancies in the network infrastructure eliminate SPOFs by adding components and other resources (e.g. memory, bandwidth, or power) beyond those needed for the normal operation of the network. The goal is to make these resources available in the event of a loss of the main resources due to a failure. Complete duplication of components, known as 2N redundancy, is quite expensive considering the excess resources that remain unused. Alternatively, N+1 or N+M redundancy may be more cost effective, relaxing the requirement to duplicate every component and providing one or a few standby components instead.

Let’s consider a simple example to demonstrate the effect of redundancy. A router serves as the Internet gateway for an organization’s LAN. If the router fails unexpectedly, purchasing a replacement router and waiting for its arrival may extend the time to recover from the failure to days or even weeks. If the LAN availability is measured over a year (to comply with an SLA, for example), then it is estimated at around 95% based on Equation (1). A vendor’s service contract may guarantee replacing the failed router within a predefined time (2 to 24 hours) for an annual fee. In this case, the repair time includes the delivery time plus the time to bring the router online, and the availability approaches 99.95%. To reduce the repair time further to an hour or less and increase availability to 99.99%, a spare router can be kept in storage.

The example is an oversimplification, because a single instance of failure is not sufficient to measure the availability of a network; the MTTR is the statistical average of repair times over multiple failures occurring over a long period of time. The example shows, however, that redundancy is an attractive approach to improving network reliability. As the following equation shows, a highly available system of 99.9999% reliability can be constructed from two redundant components of 99.9% reliability, or from three redundant components of 99% reliability.

R_redundant = 1 - (1 - R)^n        (3)
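Equation (3), checked against the two examples in the preceding paragraph:

```python
def redundant_reliability(r, n):
    # The system fails only if all n redundant components fail at once,
    # which happens with probability (1 - r) ** n.
    return 1 - (1 - r) ** n

# Two components at 99.9%, or three at 99%, both reach six nines:
print(round(redundant_reliability(0.999, 2), 6))  # 0.999999
print(round(redundant_reliability(0.99, 3), 6))   # 0.999999
```

Note the contrast with the series formula: chaining components multiplies reliabilities down, while paralleling them multiplies failure probabilities down.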

Failover switching mechanisms

Redundancy requires a mechanism to detect the failure and initiate the failover process. Successful failover to standby components requires timely detection of the fault and successful transfer of functions to the standby components. It is also required that any activated component and other backup resources are capable of performing the same functions and carrying the same workload as the failed component.

Network management systems (NMS) can detect faults and alert the network operators, who can replace the faulty components and restore configurations manually, as in the previous example. However, as MTTR requirements approach a few minutes, a timely response by a human operator becomes impossible, and automatic detection and switching mechanisms are needed.

For our router example, this means that the backup router must be connected to the network, powered on, and ready to take over as soon as the main router fails. Protocols such as VRRP (Virtual Router Redundancy Protocol) or HSRP (Cisco’s proprietary counterpart) provide the necessary detection and activation mechanisms, but the network operator must ensure that the configuration of both routers stays synchronized, either manually or using automated tools. Considering the various delays associated with operating VRRP [3] and the convergence of other protocols such as BGP after the fault [4], the network may return to full operation within a few minutes, and that is sufficient to push the availability to 99.999%.

It is easy to overlook the fact that failover switching mechanisms can themselves fail. The NMS may fail to detect the fault or to send the alert message to the human operator. VRRP may not function properly because of misconfiguration or because of another fault in the network. These failures may go unnoticed during normal network operation, because they do not affect network performance, but they can have severe consequences when faults occur.

Stateful recovery

Redundancy may reduce the network recovery time, but it does not recover the data in transit. If failover is to be unnoticeable by protocols and applications such as VoIP, then stateful recovery is required. In a stateless failover, all active connections going through our example router can time out and sessions are dropped. The backup router needs to establish all routing adjacencies and rebuild the routing, NAT, and ARP tables. Applications also need to re-establish their connections when the backup router takes over.

If stateful failover is supported, the active router continuously passes information to the backup router, such as device status, TCP connection states, and NAT and ARP tables. When the failover occurs, the same information is immediately available to the backup router, and the applications running on the network do not need to re-establish their communication sessions. In addition to its benefit for certain applications (e.g. no dropped VoIP conversations), stateful recovery reduces the recovery time to seconds or even sub-second intervals and improves the availability to the range of 99.9999%.

Complexity

Equation (3) justifies the use of redundancy as a means to build a highly available network infrastructure from less reliable, inexpensive components. However, Equation (2) suggests that the same result can be achieved by simplifying the architecture (fewer stages) and/or using highly available components in each stage. Estimating the reliability of the network by statistical means is not a trivial task, because of the complexities and inter-dependencies that redundancy introduces.

The multitude of protocols and levels of redundancy may cause multiple protocols to react to a failure and attempt recovery simultaneously. For instance, when a link fails, the network will initiate recovery by re-configuring the spanning tree and activating another link. A backup router may also attempt to take over the routing function when it stops receiving notifications from the main router as a result of the link loss. Such conditions are avoided by introducing artificial delays in reacting to these events, at the expense of a longer repair time.

Redundancy may provide a false sense of reliability and scalability. A network with many redundant components can experience multiple failures before it suffers performance degradation or an all-out outage, and once an outage occurs the network operator is faced with the task of repairing multiple failures. Redundant components may suffer cascading failures if they are subjected to the same external events that caused the original failure, such as a capacity overload or the exploitation of a software bug. Performance degradation may also occur when one component within a group of load-balancing components fails and the remaining components do not have enough capacity to handle the entire workload.

Conclusions

Network availability can be improved considerably by implementing multiple types of redundancy, but achieving the coveted five nines requires paying attention to issues beyond simple redundancy. Decisions about complexity and scalability can be made during the design stage by choosing between a few highly available components and a larger number of less reliable ones. Failover mechanisms can fail, yet redundancy in these mechanisms themselves is not always available. The role of network management systems and practices is significant, not only in detecting faults and critical conditions, such as exceeding safe capacity levels, but also in ensuring fast recovery. Standard network protocols have inherent limitations with respect to fault detection and recovery. These limitations have to be understood, and proprietary solutions can be sought, if available, to achieve the desired availability.

Software Defined Networking (SDN) is a new technology with a lot of potential and a healthy dose of hype. The main premise of SDN is moving the intelligence of the network from distributed network nodes to a centralized location, enabling programmability and flexible configuration through software applications.

Software Defined Networks

Each router in today’s communication networks is capable of making decisions on its own about how to forward data packets toward their final destination. The router gathers information about available paths to other networks and builds a view of the entire network topology independently of all other routers. This view allows the router to decide along which path a packet should be forwarded to reach its destination, according to predetermined criteria. This distributed routing mechanism creates resiliency: if a path fails, the routers will find another path to deliver packets to their destination with minimal interruption. To provide this level of survivability, each router has to process every received data packet, decide where it should go, and forward it, all while communicating with other routers to maintain an up-to-date view of the topology.

SDN proposes to separate the packet forwarding function from the routing decision function in all network devices, not just routers, and to move all control to a central device. This simplifies the design of network devices and reduces their cost. Removing all control from devices also means eliminating the distinction among switches, routers, and firewalls, as they can all be combined into one device that forwards (or drops) packets according to instructions received from a central controller. The result is simplified, inexpensive hardware and a significant reduction in energy consumption, thanks to eliminating the redundant computation needed for topology discovery.
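This separation can be caricatured in a few lines (the class name, actions, and match functions below are invented for illustration): the forwarding element keeps no routing logic of its own, only the match/action entries a central controller installs, and it punts unmatched packets back to the controller:

```python
class ForwardingElement:
    """A 'dumb' data-plane device: no routing intelligence, just a
    flow table of (match function, action) entries installed remotely."""

    def __init__(self):
        self.flow_table = []

    def install_flow(self, match_fn, action):
        # Called by the (hypothetical) central controller, not the device.
        self.flow_table.append((match_fn, action))

    def handle_packet(self, packet):
        for match_fn, action in self.flow_table:
            if match_fn(packet):
                return action
        return "send-to-controller"   # table miss: ask the controller

switch = ForwardingElement()
# The controller decides policy; the device merely applies it:
switch.install_flow(lambda p: p["dst_port"] == 22, "drop")
switch.install_flow(lambda p: p["dst"].startswith("10."), "forward:port1")

print(switch.handle_packet({"dst": "10.1.2.3", "dst_port": 443}))  # forward:port1
print(switch.handle_packet({"dst": "10.1.2.3", "dst_port": 22}))   # drop
```

The "send-to-controller" default on a table miss is the key design choice: the device never computes a route itself, so all policy lives in one programmable place.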

Advances in general-purpose microprocessors make it possible to use an off-the-shelf server as the central controller, eliminating any need for special hardware. The SDN controller offers many functions that are difficult to perform with current network management tools. For instance, routing and other configuration policies can be pushed from a central location and changed dynamically as needed.

In virtualized environments where a virtual machine (VM) may move from one physical host to another, even across data centers, the network must be reconfigured accordingly to maintain VM connectivity without human intervention. Carriers and infrastructure providers may use the central configuration ability of SDN to create virtualized, independent networks to deliver services or rent directly to customers. Organizations such as universities may use the technology to run research experiments on the same hardware as production networks without affecting the latter. Some of these abilities exist today using various technologies and standards. SDN brings dramatic simplification to routing functions by centralizing the control. Also, allowing user applications to control routing means that network users can write their own routing protocols to handle data packets in the networks under their control.

SDN’s potential to turn networking equipment into commodity products, maximize network utilization, and meet the dynamic demands of cloud environments has attracted the support of cloud and network service providers such as Deutsche Telekom, Facebook, Google, Microsoft, Verizon, Yahoo, and NTT. Yet there are many challenges to overcome before the technology is widely adopted. Among these challenges, fault tolerance must be achieved by replicating the controller and maintaining synchronization among the replicas. Performance bottlenecks may arise in large networks when all decisions need to be made by a single controller. Also, vendor support and standardization remain major challenges at this early stage of the technology’s development.

SDN can be disruptive because of the fundamental way it changes network design, operation, configuration, and management. The ability to provide X-as-a-Service (XaaS) over virtualized networks may depend on it. However, its widespread adoption will require resolving outstanding issues in areas of performance, scalability, security, and interoperability.