Junos Telemetry: Detecting Microbursts

Managing networks is actually not that difficult when things are working as designed. Operational headaches happen when things go wrong. And even then, when they go fantastically wrong (like hard failures that are easily identifiable), troubleshooting or remediation can be relatively straightforward.

Rather, the biggest challenge for network operators is diagnosing transient issues. Often the only information available is an observation about some downstream consequence (“the network is slow” or “my application isn’t responding”). To correctly diagnose issues like these, there must be real-time telemetry that is fine-grained enough to provide meaningful input.

Take microbursts as an example. A microburst is a short spike of packets received in a relatively small interval at a rate much higher than the configured guaranteed bandwidth for a given queue.

It’s not hard to imagine scenarios where microbursts might impact the business, such as high-frequency trading platforms. Those platforms depend on real-time market data to formulate trading strategies. Microbursts result in stale data delivery, leaving trading algorithms out of sync with the market, which can be catastrophic to the business.

What network operators need are fine-grained monitoring tools that can detect issues as they are happening. Snapshots of average queue depths do not help identify issues, much less provide real-time remediation. This is why we have introduced a queue monitoring sensor as part of the Junos Telemetry Interface (JTI) in Junos release 17.1.

What can cause Microbursts?

The main factors that can cause microbursts in a network are:

Multiple sources sending packets to a single queue

Significant speed mismatch between ingress and egress interfaces (for example, a 100G/40G ingress interface forwarding packets to 10G/1G egress interfaces or to a queue which is shaped at a lower rate)

Multicast replication done by the egress Packet Forwarding Engine (PFE) to a large number of receivers on the same egress interface
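As a rough illustration of the speed-mismatch case, the sketch below estimates how much data piles up in an egress queue while a burst lasts. The function name and the example numbers (a 100-microsecond burst from a 100G ingress into a 10G egress) are assumptions chosen for illustration, not measurements from a router.

```python
def backlog_bytes(ingress_bps, egress_bps, burst_seconds):
    """Bytes accumulated in the egress queue during a burst.

    The queue grows at (ingress - egress) while the burst lasts; if the
    egress is at least as fast as the ingress, nothing accumulates.
    """
    excess_bps = max(0, ingress_bps - egress_bps)
    return excess_bps * burst_seconds / 8


# Hypothetical burst: 100 microseconds at 100 Gbps into a 10 Gbps egress port.
backlog = backlog_bytes(100e9, 10e9, 100e-6)
print(f"backlog after burst: {backlog / 1e6:.3f} MB")
```

Under these assumed numbers, a burst far shorter than any polling interval already leaves more than a megabyte waiting in the queue.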

A microburst may result in dropped packets if queues are configured with small buffers. If queues are configured with adequate buffers to absorb the microburst, there won’t be any drops, but the increased queue utilization will introduce additional latency in delivering packets. Dropped packets are properly accounted for and easy to troubleshoot. However, determining the source of additional latency in the network can be quite challenging for several reasons:
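The trade-off between drops and latency can be sketched with a toy tail-drop queue. Everything here is hypothetical: the `simulate_burst` helper, the tick-based time model, and the traffic pattern are simplifications for illustration and do not describe how a PFE schedules packets.

```python
def simulate_burst(arrivals, drain_per_tick, buffer_limit):
    """Feed per-tick arrivals (in packets) into a FIFO queue.

    Returns (peak_depth, dropped): the deepest the queue got and how many
    packets were tail-dropped because the buffer was full.
    """
    depth = peak = dropped = 0
    for arriving in arrivals:
        depth += arriving
        if depth > buffer_limit:
            dropped += depth - buffer_limit
            depth = buffer_limit
        peak = max(peak, depth)
        depth = max(0, depth - drain_per_tick)
    return peak, dropped


# Hypothetical traffic: steady 1 packet/tick with a 3-tick burst of 20 packets/tick.
arrivals = [1] * 5 + [20] * 3 + [1] * 5
DRAIN = 2  # packets the egress can send per tick (assumed)

for limit in (10, 100):
    peak, dropped = simulate_burst(arrivals, DRAIN, limit)
    # A packet arriving at the peak waits roughly peak / drain ticks.
    print(f"buffer={limit}: peak={peak}, dropped={dropped}, "
          f"worst-case wait ~{peak / DRAIN:.1f} ticks")
```

With the small buffer the burst shows up as drops; with the large buffer nothing is dropped, but the peak depth (and therefore the wait) is several times higher, which is the case that is hard to see without peak queue depth data.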

Typical network topologies consist of multiple routers, and it is difficult to identify the router that is introducing the latency.

Monitoring tools (SNMP or CLI-based polling) query interface statistics every 30 or 60 seconds by default. That interval gives a good picture of average utilization but is not sufficient to detect microbursts. The polling interval needs to be less than 1 ms to detect microbursts reliably, and it is not practical to poll at that rate from the routing engine or line card CPU.
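To see why a 30-second average hides a microburst, consider the following sketch. The line rate, background load, and 5 ms burst are assumed values picked for illustration.

```python
def average_utilization(samples_bps, line_rate_bps):
    """Mean utilization (0..1) over equally spaced rate samples."""
    return sum(samples_bps) / (len(samples_bps) * line_rate_bps)


LINE_RATE = 10e9          # 10G egress (assumed)
MS_PER_POLL = 30_000      # one 30-second polling interval, in 1 ms slots

# Hypothetical load: 20% background, with one 5 ms burst at full line rate.
per_ms = [0.2 * LINE_RATE] * MS_PER_POLL
for slot in range(12_000, 12_005):
    per_ms[slot] = LINE_RATE

avg = average_utilization(per_ms, LINE_RATE)
peak = max(per_ms) / LINE_RATE
print(f"30 s average: {avg:.4%}, 1 ms peak: {peak:.0%}")
```

The burst moves the 30-second average by only a few hundredths of a percent, while the per-millisecond view shows the link saturated.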

How to detect Microbursts?

For MPCs 7E/8E/9E, a new queue monitoring sensor is available as part of the Junos Telemetry Interface. The sensor periodically exports peak queue depth information to an external collector.

The microcode engine in the Trio ASIC monitors queue depths for all configured queues, then builds and exports JVision telemetry packets encoded in Google Protocol Buffer (GPB) format with all necessary information. Because all required tasks are performed in-line in the Trio ASIC without adding load on the line card or routing engine CPUs, the queue monitoring sensor can monitor a large number of queues (32,000 queues for MPC7 and 64,000 queues for MPC8/9) simultaneously.
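Conceptually, exporting a peak rather than an average amounts to keeping a high-water mark per queue between exports. The sketch below shows that idea only; the `PeakDepthTracker` class, its reset-on-export behavior, and the sample depths are assumptions for illustration and are not the Trio microcode or the JTI message format.

```python
class PeakDepthTracker:
    """Keeps the high-water mark of a queue between exports."""

    def __init__(self):
        self.peak = 0

    def observe(self, depth):
        # Called on every queue-depth change; only the maximum is kept.
        if depth > self.peak:
            self.peak = depth

    def export(self):
        # Report the peak for this interval, then start a new interval.
        peak, self.peak = self.peak, 0
        return peak


tracker = PeakDepthTracker()
reports = []
# Hypothetical depths (bytes) seen during two reporting intervals.
for interval in ([100, 150, 90_000, 200], [120, 80, 110]):
    for depth in interval:
        tracker.observe(depth)
    reports.append(tracker.export())

print(reports)
```

Because only the maximum survives each interval, a spike that lasts for a single observation still appears in the exported value, whereas an average over the same interval would mostly wash it out.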

The following configuration will enable the queue-mon sensor on interface et-5/0/0:

The GPB proto format for the queue-mon sensor:

Conclusion

With the addition of the queue monitoring sensor to the existing library of rich sensors in JTI, network operators will have much better visibility into queue utilization than the average-utilization figures available on most routers provide. This continues Juniper’s commitment to producing the single most automation-friendly network operating system in the industry.

Do you think this data can be collected and used by some kind of AI system to predict the traffic in a router? I did a school project that collected CPU data to predict load and scale VMs based on their usage. I think something similar could be done here to get a traffic prediction and take preventive action.

Also, how difficult is it to translate these effects to the actual services that get affected? I think it's still a huge task for the network admin to interpret this into business-meaningful data.

You are correct to see the connection between telemetry and predictive network operations. Juniper has outlined a vision of the "other SDN", Self-Driving Networks, that behave just as you describe: taking in data from all these sources and making automated corrections to drive the network on demand.
