Embedding Call Monitoring in VoIP Designs

Despite all the bad press, voice-over-IP (VoIP) is continuing to make inroads in communication architectures. From enterprises to core networks, VoIP is being implemented as a means for merging voice and data traffic on the same broadband pipe. And, interest in VoIP is starting to bubble in the wireless sector, with many third-generation (3G) wireless operators and equipment developers eyeing VoIP as a means to improve network efficiency.

But, mass deployment of VoIP technology is still limited by voice quality and manageability concerns. Specifically, design engineers building VoIP-enabled equipment are still struggling to deliver the quality of service (QoS) for VoIP that end users demand.

Embedding a call quality monitoring agent into a system architecture is one technique that can help designers measure voice quality and, in turn, improve the QoS offered by the system. Several months back, we explored the techniques designers can employ to monitor quality in a VoIP system (see Improving VoIP call quality with embedded monitoring). Now let's explore the methods and resource requirements needed to integrate passive call-quality monitoring agents into VoIP system architectures. Let's start with a quick review of call quality monitoring.

Monitoring Techniques: An Overview
Traditionally, user-perceived call quality is measured using subjective testing, with results typically reported as a mean opinion score (MOS). Under this method, human test participants listen to a variety of speech samples and assign each of them a score ranging from unacceptable (1) to excellent (5). The average of these scores is the MOS.

Being a subjective test, the results will vary depending on the test conditions and participants. To offset this variability, MOS data is accumulated over years and hundreds of thousands of tests so that truly meaningful measurements can be derived. Due to the number of people and time involved, subjective testing is impractical for real-time monitoring.
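The MOS itself is just the arithmetic mean of the listener scores. A minimal sketch (the function name is ours, not from any standard library):

```python
def mean_opinion_score(scores):
    """Average a list of subjective listener scores (1 = unacceptable,
    5 = excellent) into a single MOS value."""
    if not scores:
        raise ValueError("at least one listener score is required")
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError("scores must fall in the 1-5 range")
    return sum(scores) / len(scores)
```

Four listeners rating a sample 4, 5, 3, and 4, for instance, would yield a MOS of 4.0.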

In response, designers can turn to objective test methods to measure voice quality. These methods rely on measurements made on active (real or prototype) voice networks; the resulting quality metrics, and the algorithms that consume them, are used to estimate call quality. Objective voice quality measurements can be obtained using either intrusive testing or passive monitoring techniques.

Intrusive call quality testing methods, such as perceptual evaluation of speech quality (PESQ), generate synthetic test calls through the voice network, where the audio signal is transmitted, recorded, and compared to the original signal to determine the distortion. Because of the complexity of the signal comparisons, involving Fast Fourier transforms (FFTs), most intrusive testing algorithms are compute intensive, require significant processing time and are not viable for real-time quality measurement.

There are several other reasons why intrusive testing methods are not viable. First, intrusive tests generate additional network traffic, exacerbating network problems, degrading call quality, and stealing bandwidth from current and pending calls. In addition, intrusive techniques like PESQ can miss the effects of packet loss when packet loss concealment algorithms produce very little distortion. Intrusive tests are also overly sensitive to delay: even when the analysis shows only minute levels of signal misalignment, the PESQ score is lowered, while the end user actually hears excellent speech quality.
Unlike the intrusive techniques, passive monitors obtain quality metrics from live calls and can estimate call quality in real time. Traditional passive monitors, such as remote monitoring (RMON) probes, provide simple, discrete statistics that miss the key degradation factors, such as loss burstiness and jitter buffer discards, and do not correlate the statistics into a composite quality score a la MOS. Instead, end users are left to work out whether X delay, Y packet loss, and Z jitter, in a myriad of levels and combinations, equate to a good call or a bad call, or whether call A is better or worse than call B, making call quality trending impossible.

Newer call quality monitors extend the agent functionality to provide visibility into the key degradation factors, such as jitter buffer discards, burst loss, and burst density and duration. They also incorporate end-user-perception metrics and correlation algorithms to rate call quality with a single-number score that corresponds well with subjective test results.

Implementing Passive Monitors
As can be seen above, passive monitoring techniques provide some benefits to VoIP equipment designers. Now, let's look at some of the key issues designers will face when implementing call quality monitors in a system architecture.

Overall, passive call quality monitors can be implemented in a variety of applications, including gateways, media servers, and IP phones. However, for the purpose of this paper, we'll consider the implementation of these agents in performance management devices, such as probes, analyzers, firewalls, service level agreement (SLA) verifiers, and traffic shapers, to name a few.

We can divide the target environments into two general classes: class A and class B. Class A environments can capture packets and assign a highly accurate, locally generated timestamp to each packet as it is captured from the network. Class B environments do not store packets in a capture buffer, but instead "process" each packet as it is received from the network.

Most class A and B environments cannot accurately determine the true extent of packet loss--information needed for accurate call quality metrics. The reason is that such environments do not consider the effects of jitter or of jitter buffer discards made at a call endpoint; in short, neither environment implements a jitter buffer. By adding an agent that implements a jitter buffer emulator into these environments, the total and discrete packet loss distribution can be known and accounted for in determining call quality.
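To make the distinction concrete, here is a toy jitter buffer emulator. It classifies each packet as played, discarded (arrived after its playout deadline), or lost (a sequence gap that is never filled). The class name, the fixed-depth playout policy, and all parameters are illustrative assumptions, not the agent's actual algorithm:

```python
class JitterBufferEmulator:
    """Toy fixed-depth jitter buffer emulator. A simple loss counter
    would call every sequence gap "loss"; emulating the buffer lets us
    split that into true loss vs. late-arrival discards."""

    def __init__(self, buffer_ms=60, frame_ms=20):
        self.buffer_ms = buffer_ms   # playout delay added to the first packet
        self.frame_ms = frame_ms     # codec frame interval
        self.base_arrival = None     # arrival time of first packet (ms)
        self.base_seq = None
        self.highest_seq = None
        self.lost = 0                # sequence gaps never filled
        self.discarded = 0           # arrived after the playout deadline
        self.played = 0

    def on_packet(self, seq, arrival_ms):
        if self.base_seq is None:
            self.base_seq, self.base_arrival = seq, arrival_ms
            self.highest_seq = seq
            self.played += 1
            return "played"
        if seq > self.highest_seq + 1:
            # provisionally count the gap as loss; a late arrival below
            # converts one "lost" back into a "discarded"
            self.lost += seq - self.highest_seq - 1
        if seq > self.highest_seq:
            self.highest_seq = seq
        # playout deadline for this packet, relative to the first packet
        deadline = (self.base_arrival + self.buffer_ms
                    + (seq - self.base_seq) * self.frame_ms)
        if arrival_ms > deadline:
            if seq < self.highest_seq:   # it had been counted as lost
                self.lost -= 1
            self.discarded += 1
            return "discarded"
        self.played += 1
        return "played"
```

Feeding it a stream where packet 3 arrives 200 ms late shows the point: a naive counter reports one lost packet, while the emulator reports zero lost and one discarded, a distinction that matters to the quality score.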

With that stage set, let's look at where agents are integrated in class A and B environments, as well as their function in a design. We'll start with the class A environment.

Monitoring in a Class A World
Class A environments typically include network analyzers, testers, SLA verifiers, and in some cases firewalls. Their architecture is based around a capture or receive buffer that stores frames or packets as they arrive off the wire. When frames or packets are captured or received, a locally generated timestamp is assigned to each frame or packet. Capture buffers allow class A environments to perform frame or packet decodes asynchronously, as well as generate or "playback" network traffic patterns.

As Figure 1 illustrates, a typical class A environment includes a number of key components that warrant additional definition. In order to retrieve frames or packets from the physical medium, a class A environment must include a media access controller (MAC), or a similar such device, that interfaces to the network medium and can receive all traffic occurring on the physical medium, i.e. support a "promiscuous mode".

Figure 1: Typical class A target environment.

The MAC is generally implemented in silicon and controlled by a software driver. The driver's purpose is to handle interrupts from (or poll) the MAC hardware, and transfer frames or packets from the network medium into the capture buffer.

As the frames or packets are placed in the capture buffer, they are time-stamped with the locally generated timestamp. Over time, the capture buffer fills with a series of tuples of the form:
{timestamp, frame/packet}
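In code, the capture buffer is little more than an ordered list of such tuples, appended by the MAC driver as frames arrive. A minimal sketch (the names are illustrative):

```python
from collections import namedtuple

# One capture-buffer entry: the locally generated arrival timestamp
# plus the raw frame/packet bytes.
CaptureEntry = namedtuple("CaptureEntry", ["timestamp_ms", "frame"])

capture_buffer = []

def on_frame_received(timestamp_ms, frame_bytes):
    """What the MAC driver would do: stamp the frame and append it."""
    capture_buffer.append(CaptureEntry(timestamp_ms, frame_bytes))
```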

When the capture buffer fills, or at periodic intervals, its contents are passed to a filter/dispatcher function for further processing. The filter/dispatcher function may perform some pre-processing filtering, and dispatch the frame or packet into the statistics calculator followed by the decoder.

The statistics calculator computes a number of statistics about the packet. These statistics usually include packet counts, byte counts, protocol distributions, packet delay variation estimations, and transmission delay estimations.
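One representative statistic, packet delay variation, can be estimated with the smoothed interarrival jitter formula that RTP (RFC 3550) specifies for RTCP reports: J += (|D| - J)/16, where D is the difference in relative transit times of consecutive packets. A sketch (class and method names are ours):

```python
class InterarrivalJitter:
    """Running packet-delay-variation estimator using the smoothed
    interarrival jitter formula from RTP (RFC 3550)."""

    def __init__(self):
        self.prev_transit = None
        self.jitter = 0.0

    def update(self, send_ts_ms, recv_ts_ms):
        # "transit" is receive time minus send time; only its
        # variation matters, so clock offset between ends cancels out
        transit = recv_ts_ms - send_ts_ms
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit
        return self.jitter
```

Perfectly paced packets leave the estimate at zero; a packet arriving 10 ms off its expected spacing nudges it up by 10/16 ms.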

The decoder is responsible for the actual packet decodes. The decoder can resolve VoIP call signaling protocols to determine when a call is initiated and terminated, as well as determining what call a particular voice packet might belong to, as several calls may be active simultaneously.

The information generated by the statistics calculator and decoder components is then passed into the agent along with other pertinent data, such as call initiation and termination events.

As voice packets are received, the locally generated receive-timestamp and the contents of specific packet header fields are indicated to the passive call-monitoring agent. The agent first hands the packet to the jitter buffer emulator to determine whether packets have been lost or whether packets would be discarded due to jitter or excessive delay. The agent records both events, since both significantly impact voice quality calculations and perception by the end-user.

The timestamp required in a class A target environment should deliver 1-ms or finer resolution. Generally, a 1-ms timestamp resolution allows the jitter buffer to handle most if not all popular voice codec frame types. More accurate timestamps, however, can be used.

Tracking in Class B Designs
Class B environments, as shown in Figure 2, typically include firewalls, edge routers, or traffic shapers. These environments are designed around the concepts of packet filtering, packet queuing, and packet forwarding, and may not have the ability to generate a local timestamp for packets received off the physical medium.

Figure 2: Typical class B environment.

When locally generated timestamps are not available for each packet received, the target environment should provide the passive call-monitoring agent with a real-time clock interrupt. Using the real-time clock interrupt, the agent can estimate packet loss due to packet delay variation through its fully functional jitter buffer emulator.
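One way this can work, sketched under our own assumptions (the names and the 10-ms period are illustrative, not a vendor mechanism): a periodic clock interrupt maintains a tick counter, and every packet processed between two ticks is stamped with the current tick count times the tick period. These coarse timestamps can then feed the same jitter buffer emulation used in a class A design:

```python
class TickTimestamper:
    """Coarse arrival timestamps for a class B device that has no
    per-packet hardware timestamping."""

    def __init__(self, tick_ms=10):
        self.tick_ms = tick_ms
        self.ticks = 0

    def on_clock_interrupt(self):
        # called by the real-time clock interrupt service routine
        self.ticks += 1

    def stamp(self):
        # coarse arrival timestamp for a packet processed "now"
        return self.ticks * self.tick_ms
```

The tick period bounds the timestamp error, which is why class B quality estimates are somewhat coarser than class A estimates built on hardware timestamps.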

As Figure 2 above illustrates, class B environments typically include at least one media access controller (MAC), or a similar such device, that has the ability to promiscuously receive all traffic traversing the physical medium. The MAC driver should handle interrupts from (or poll) the MAC hardware, transferring frames or packets into a frame or packet buffer. Class B devices may often include multiple physical network medium interfaces as indicated in the figure.

As each frame or packet is received, it is passed up the stack to the filter function. The filter is responsible for applying any pre-processing frame or packet filtering.

From the filter, the frame/packet buffer is passed to the decoder. The decoder performs the necessary frame/packet decoding to direct the dispatcher on how to route the frame or packet.

The dispatcher examines the basic decode information and determines how to continue the packet processing. The dispatcher can distinguish VoIP signaling and voice data packets and then pass call start and termination events on to the agent for processing. The dispatcher is also responsible for identification and association of the incoming packets to a particular VoIP call.

The delay estimator is responsible for estimating the transmission delay for packets received. This information is passed into the monitoring agent to ensure accurate call quality measurement.

Transmission Delay Estimation
In both the class A and B environments, the embedded monitoring agent uses transmission delay estimates to calculate impairment factors incurred as a result of excessive delay and ultimately to calculate the quality level, often called an "R factor". There are several methods suitable for delay estimation.

One method is passive estimation of round-trip delay based on real-time control protocol (RTCP) sender or receiver reports. When an RTCP receiver report is received, the appropriate fields can be passed into the agent, which can then calculate the round-trip delay using the method specified for RTCP in IETF RFC 1889 (since superseded by RFC 3550).
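Per the RTP specification, that calculation is a simple subtraction: RTT = A - LSR - DLSR, where A is the local arrival time of the report, LSR is the "last sender report" timestamp echoed in it, and DLSR is the delay since that report was received at the far end. All three are expressed in units of 1/65536 second (the "middle 32 bits" of an NTP timestamp). A sketch (the function name is ours):

```python
def rtcp_round_trip_ms(arrival, lsr, dlsr):
    """Round-trip delay from an RTCP receiver report (RFC 1889/3550):
    RTT = A - LSR - DLSR, with all inputs in 1/65536-second units.
    Returns milliseconds."""
    rtt_units = arrival - lsr - dlsr
    return rtt_units * 1000.0 / 65536.0
```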

Under another method, round-trip delay is estimated using an Internet control message protocol (ICMP) echo. An ICMP echo request can be sent from the target environment to each call endpoint of interest. When the requests are transmitted, a timer can be started that is used to determine the elapsed time to when the ICMP echo response is received.

Using ICMP echo for active round-trip delay estimation has some drawbacks. For example, if the call passes through a firewall, most firewalls prevent ICMP echo requests from passing through. In these situations, another protocol can be used to request a response from the call endpoints, recording the elapsed time between the request and response.

If it is not possible to implement either passive or active round-trip delay estimations, don't panic. A call-quality monitoring agent can still generate useful call quality metrics.

Generally, the effects of delay do not impact the network "R" factor, that is, call quality measures that do not include perceptual effects. Moreover, the impact of round-trip delay on the agent's user "R" call quality metrics, which do include perceptual effects and thus delay, is relatively small for round-trip delays as high as 300 ms, which is unusually high for IP networks. The small impact of delay is due to the fact that we are measuring call or speech clarity as opposed to conversational difficulty, where delay plays a much bigger role, producing significant impairments such as double-talk.

In cases of extreme delay--350 ms or higher--the accuracy of the user "R" factor will suffer. However, even in these extreme situations, designers can still use the network "R" factor as a measure of voice quality.
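The shape of this delay sensitivity can be seen in the widely cited Cole/Rosenbluth simplification of the ITU-T G.107 E-model delay impairment. This is a published approximation, not the agent's actual algorithm, and the sketch below is ours:

```python
def delay_impairment(one_way_ms):
    """Simplified E-model delay impairment Id (Cole/Rosenbluth
    approximation of ITU-T G.107):
        Id = 0.024*d                      for d <= 177.3 ms
        Id = 0.024*d + 0.11*(d - 177.3)   for d >  177.3 ms
    where d is the one-way (mouth-to-ear) delay in milliseconds."""
    d = one_way_ms
    penalty = 0.11 * (d - 177.3) if d > 177.3 else 0.0
    return 0.024 * d + penalty
```

At 150 ms one way (300 ms round trip) the impairment is only about 3.6 points on the roughly 100-point R scale, consistent with the small impact described above; past the 177.3-ms knee the penalty grows much faster, which is why user "R" accuracy suffers under extreme delay.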

If there is no way to estimate round-trip transmission delay, the agent's quality metrics can still be generated, but they will tend to show somewhat higher quality than actually experienced. While this is not ideal, the data is still useful in trend analysis, i.e. over time, one can still discern whether the call quality provided by the network is improving or getting worse.

Resource Requirements
Now that we've laid out how the agent will work in the class A and B environments, let's look at the system resources required to integrate these agents. In order for a call quality agent to generate accurate voice quality metrics, the target must provide sufficient host processor resources, enough code space to support the agent, and heap and stack memory.

Let's look at these three requirements in greater detail. Note that the host system resource requirements indicated below are general estimates based on the use of an Intel Pentium III style processor and are included for rough resource estimation purposes. Use of a different processor will affect the required resources. Different compilers will also generate differing agent code sizes based on the compiler optimizations. Memory requirements also vary with compilers due to memory alignment optimizations.

1. Host Processor Resources
Determining a valid measure of the processor loading generated by a very lightweight agent implementation is difficult. One logical basis for processor load calculations is the number of Intel Pentium machine instructions required to complete a certain agent task. The actual number of machine cycles will vary based on compiler optimizations and instruction pipelining. The number of instructions indicated is an approximation of the worst-case processing path for the indicated event.
Generally, the task set is as follows:

Create voice call packet stream. As a new call is detected, instruct the agent to construct and initialize its call-tracking data set.

Per packet tracking. As each packet is received, its timestamp and some of the transport protocol packet header fields are indicated to the agent for the jitter buffer emulation functions.

Calculate call quality at desired intervals during the call. If quality metrics are desired more frequently than just at the end of the call, instruct the agent to calculate call quality on demand. This is especially useful for generating alarms.

Calculate call quality at the end of the call. When the call is indicated as terminated, the call quality metrics are automatically calculated and the MIB or data store is updated.
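The four tasks above imply a simple agent lifecycle. The sketch below shows its shape only; every method and field name is illustrative, and the "score" is a placeholder for the real correlation algorithm:

```python
class CallQualityAgent:
    """Skeleton of the four-step agent task set described above."""

    def __init__(self):
        self.calls = {}

    def create_stream(self, call_id):
        # 1. new call detected: initialize the call-tracking data set
        self.calls[call_id] = {"packets": 0, "lost": 0, "discarded": 0}

    def on_packet(self, call_id, seq, timestamp_ms):
        # 2. per-packet tracking feeding the jitter buffer emulation
        self.calls[call_id]["packets"] += 1

    def quality_snapshot(self, call_id):
        # 3. mid-call, on-demand quality (e.g. to raise an alarm)
        return self._score(call_id)

    def end_call(self, call_id):
        # 4. call terminated: compute final metrics for the MIB or
        # data store, then release the tracking state
        report = self._score(call_id)
        del self.calls[call_id]
        return report

    def _score(self, call_id):
        c = self.calls[call_id]
        return {"packets": c["packets"], "lost": c["lost"],
                "discarded": c["discarded"]}
```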

Completing this set of tasks would take on the order of 1,500 to 2,000 instructions for a class A target, while a class B target would require another 50 to 100 instructions.

The additional processing in class B environments is attributed to the fact that packet receive timestamps are not available. Thus a real-time interrupt is used to trigger the jitter buffer "playout" of received packets. In order to handle this type of architecture, the agent must record more information for each packet received, thus the additional instructions per packet.

2. Code Space
Estimated code space requirements for the call quality agent code, compiled for an Intel Pentium processor without optimizations, would be in the range of 50 to 100 KB. With optimizations, the estimated code space requirements could be lowered to 20 to 30 KB.

3. Memory Needs
In order to effectively implement a passive monitoring agent, the target must provide heap memory for each agent instance. This includes the RAM needed to store information about the voice codecs supported by the agent. Generally, 700 to 1,000 bytes would be allocated for a class A or class B environment.

There should also be on the order of 200 bytes of heap memory allocated for each physical MAC interface supported by the agent. Each voice stream constructed on each physical interface should also be allocated heap memory: generally, 150 to 300 bytes would suffice in a class A environment, while a class B target would require approximately twice that amount, as it must record more data to handle the real-time interrupt.

The actual amount of stack memory needed would also depend upon the number of compile-time build options offered within the agent. These build options control the jitter buffer emulator configuration and statistical information about each voice call stream. Stack memory usage would be on the order of 125 to 175 bytes in either environment.

About the Authors
Bob Massad is vice president of product strategy at Telchemy. Prior to Telchemy, Bob was director of advanced technologies at NetScout Systems. He can be reached at rmassad@telchemy.com.

Shane Holthaus is a principal software engineer at Telchemy. Prior to Telchemy, he was a principal engineer at Virata Corp. Shane can be reached at sholthaus@telchemy.com.