Inference Moves To The Network

Machine-learning inference started out as a data-center activity, but tremendous effort is being put into inference at the edge.

At this point, the “edge” is not a well-defined concept, and future inference capabilities will reside not only at the extremes of the data center and a data-gathering device, but at multiple points in between.

“Inference isn’t a function that has to reside anywhere specific,” said Kurt Shuler, vice president of marketing at Arteris IP.

Exactly what happens in those middle locations, however, is still a matter of conjecture.

Data center or edge? A binary decision?
In early ML implementations, inference was done in data centers, where there is ample processing power. The idea was that devices that required inference, like smart speakers, might do a minimal amount of work locally, but ship most of the data to the data center for the serious inference work.

The problem is that this approach adds latency and consumes bandwidth. It takes time for the data to go to the data center and come back, and the original data shipment might be significant. With security cameras, for example, streaming video to the cloud requires significant bandwidth, and delays might hurt the camera’s effectiveness.
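To make the trade-off concrete, the following minimal sketch estimates the uplink bandwidth and round-trip delay for cloud-based inference on camera video. Every number in it (camera count, bitrate, network delay, inference time) is a hypothetical value chosen for illustration, not a figure from any of the companies quoted here.

```python
def uplink_mbps(cameras: int, bitrate_mbps: float) -> float:
    """Total uplink bandwidth needed to stream every camera to the cloud."""
    return cameras * bitrate_mbps


def round_trip_ms(one_way_network_ms: float, inference_ms: float) -> float:
    """Delay from sending a frame until the inference result comes back."""
    return 2 * one_way_network_ms + inference_ms


# Hypothetical site: 8 cameras, each streaming at 4 Mbps.
needed = uplink_mbps(cameras=8, bitrate_mbps=4.0)

# Hypothetical delays: 40 ms each way to the cloud with a fast server,
# versus no network hop but a slower accelerator inside the camera.
cloud_delay = round_trip_ms(one_way_network_ms=40.0, inference_ms=10.0)
local_delay = round_trip_ms(one_way_network_ms=0.0, inference_ms=30.0)

print(f"uplink needed: {needed} Mbps")
print(f"cloud round trip: {cloud_delay} ms, local: {local_delay} ms")
```

Under these assumed numbers, the slower local accelerator still returns a result sooner than the faster cloud server, because the network hop dominates the delay.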

This has sparked the current push to bring inference down into the edge devices themselves, freeing up the connection and giving more timely results. “AI accelerators are trying to get closer to the edge, if not at the edge,” said Ron Lowman, strategic marketing manager for IP at Synopsys.

But many such edge devices have limited power budgets, especially if they operate on batteries. The result is a need to economize on computation, simplifying processing wherever possible and compressing algorithms if necessary. Edge inference might be slightly less accurate than cloud inference, but the accuracy must remain high enough to maintain the performance expected by users.

This has led to a polarized view of the overall system architecture. Either the inference happens in the data center, or it remains within the edge device itself. However, the need for more sophisticated data processing within networks — and the advent of 5G wireless in particular — are creating an opportunity for inference in many places between the data center and the edge.

Fuzzy terminology
The terms “data center,” “edge,” and “cloud” are often used imprecisely. The most general term describing where the high-powered version of inference would occur is “data center.” The “cloud” tends to refer to a data center operated by one of the major cloud vendors like Amazon and Microsoft, where cloud services are made available to a wide range of customers. But data centers also can be private, and they can reside on the same premises as the edge device.

“Most folks are developing their own [data center],” said Ron Renmick, senior director of product marketing at Mellanox.

That changes the latency and bandwidth considerations, but the power constraints are similar to those in the cloud. Flex Logix CEO Geoff Tate and John Kim, director of storage marketing for Mellanox, observed that medical inference in particular is very likely to remain onsite with its own data servers due to HIPAA privacy and security concerns.

The “edge,” meanwhile, most specifically means the very edge of the network, which would be the device generating the data. That device is connected to the network with or without wires. Inference activities begin there with the generated data, and they end there with decisions made through inference. But the word “edge” often is used to refer to anything that’s not the data center. That could include cellular base stations, telephony points of presence, ISP servers, and gateways.

Those middle points typically will have a lower power budget than a data-center server, but because they often have a more generous power supply than many edge devices, they are not as restricted as the edge device itself would be. They also wouldn’t scale in the way that data-center processing can scale. But they’re likely to be able to handle more complex algorithms than would be viable within an edge device.

Three categories of inference
Interviews with different players reveal three distinct categories of inference between the cloud and the edge. Those are:

Inference done for the sake of the network itself;

Offloading inference from an edge device to a smartphone; and

Providing inference as a service for other applications somewhere within the network.

Network operation has become increasingly complex as operators try to maximize the utilization of their bandwidth and live up to their quality-of-service (QoS) agreements and security obligations. “The use of AI in improving infrastructure is huge,” said Anoop Saha, market development manager at Mentor, a Siemens Business. This is particularly true in the case of the new 5G networks being rolled out. Advanced 5G capabilities require significant predictive analytics, like determining where to focus a beam using the new massive MIMO capabilities. Server chips capable of inference are starting to appear in base stations.

Video also requires inference. A report by Sandvine says that video made up 60.6% of global downstream internet traffic. Numerous points in any network may need to provide video transcoding to adapt video streams to the capabilities and bandwidth of their destinations.

In addition, security is creating the need for inference everywhere to detect malicious packets as soon as possible, and to purge them from the network before they travel further. This is probably the most pervasive use of artificial intelligence (AI) in networks so far. “There’s AI for packet analysis everywhere,” said Saha. Mellanox (which has just become a part of Nvidia) has built this capability into its I/O Processing Units (IPUs), as presented at the recent Linley Spring Processor Conference. These applications mean that inference will pervade all of the major communications networks.

Fig. 1: Data center future architecture, which looks similar to today’s architecture with the addition of smart NICs throughout the network. Such security measures will also be needed in the networks outside the data center. Source: Mellanox

These examples of inference in the network, however, serve only the network itself. Inference improves operating efficiency and quality, and it’s necessary to allow for the full realization of the promise of new technologies like 5G.

The next example of inference not at the edge or in the cloud occurs right next to the edge, with smartphones acting as inference engines for devices connected by Bluetooth. This is most obvious for devices with extremely lightweight computing capabilities, like earbuds, which have little space or power for doing work. Yipeng Liu, technical marketing director for Tensilica audio/voice IP at Cadence, sees this happening already. “With smart watches and earbuds, if connected to the phone, then their data can be processed on the phone,” she said.

This makes the phone an obvious proxy for inference that might otherwise take place on the devices themselves. Smartphones already have significant computing capabilities, but they’ll soon have neural processing units (NPUs – not to be confused with the “network processing units” of the early 2000s). “In a year or two, every phone in the world will have [NPUs],” said Saha. In one specific application example, Liu said, “In today’s devices, more language processing is going into the phones.”

Inference-in-the-middle (-as-a-service)
The third category is both the most interesting and the least clear of the three — the ability to perform inference within the network when that inference is unrelated to the network. If a 5G base station has inference capacity for itself, could extra capacity be built in so that it can be sold for inference as a service?

On the face of it, this seems like an obvious possibility. Inference jobs that are too heavy for the edge devices themselves (and where a phone isn’t involved) may not need the full power of a data center. “Some tasks might not even get to the cloud,” said Kim. “Like a reflex, it gets to the spine and turns around,” rather than going all the way to the brain. A medium-weight inference accelerator somewhere short of the cloud could, in theory, perform the inference.

In one example, Lowman said that, “Server chips are going into base stations. This is particularly good for AR/VR.”

While this might be a tantalizing idea, is there any sign of it happening? Thinking through the details reveals some practical issues that would need to be sorted out for it to work. “What is the business model?” asked Flex Logix’s Tate. First, some applications will benefit more than others. A VR application might be able to operate using a beefy local processing system, but AR systems might need to fuse in data from other sources, making a trip to the network necessary. But latency must also be kept as low as possible, making it useful to bring data together as close to the edge as possible.

Decisions on where to perform the inference also may vary by location and device. A phone-oriented application may try to leverage inference capacity in the 5G base station, for example. But what if the phone is operating in WiFi-calling mode, where it’s using the wired Internet connection instead of the cellular system? In that case, the phone traffic would bypass the base station. Would it be able to leverage some other network location, or would that mean it goes all the way to the data center?

Some large providers or applications companies might build in contingencies for flexible inference routing. For example, if a network has spare inference capacity at some node, it might capture a session and process it there, freeing up bandwidth between that point and the data center. On the other hand, if it didn’t have capacity, it might pass the session on to some other upstream engine, or perhaps all the way to the data center.
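A minimal sketch of that kind of flexible routing follows. It is purely illustrative: the `InferenceNode` class, the `route_session` function, and the notion of capacity as a count of concurrent sessions are all hypothetical, and no operator is known to expose such an interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InferenceNode:
    """Hypothetical inference-capable point between the edge and the data center."""
    name: str
    capacity: int      # assumed: maximum concurrent sessions the node can host
    active: int = 0    # sessions currently being processed

    def has_spare_capacity(self) -> bool:
        return self.active < self.capacity


def route_session(path: List[InferenceNode]) -> Optional[InferenceNode]:
    """Walk the path from the edge toward the data center.

    The first node with spare capacity captures the session. Nodes without
    capacity simply pass the session upstream to the next node.
    """
    for node in path:
        if node.has_spare_capacity():
            node.active += 1
            return node
    return None  # nothing on the path, not even the data center, had room


# Hypothetical path: a gateway with no inference hardware, a base station
# with room for one session, and a large data center at the end.
path = [
    InferenceNode("home gateway", capacity=0),
    InferenceNode("5G base station", capacity=1),
    InferenceNode("data center", capacity=1000),
]

first = route_session(path)
second = route_session(path)
print(first.name)   # 5G base station
print(second.name)  # data center
```

The first session is captured at the base station. Once that node is full, the second session travels all the way to the data center.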

In another scenario, if intermediate inference came at a higher cost than data-center inference, an application provider with its own data center might preferentially send tasks to the data center unless traffic was particularly high. In that case, it might use intermediate nodes as peak processing overflow relief.
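That overflow policy can be sketched as a simple decision function. The utilization threshold, the cost units, and the function name `choose_target` are assumptions made for illustration. The branch for an intermediate node that costs no more than the data center is also an assumption, since the scenario above only covers the case where the intermediate node is more expensive.

```python
def choose_target(
    dc_utilization: float,
    dc_cost: float,
    intermediate_cost: float,
    overflow_threshold: float = 0.9,  # assumed cutoff for "particularly high" traffic
) -> str:
    """Pick where to run an inference task: 'data_center' or 'intermediate'."""
    # Assumption beyond the scenario in the text: if the intermediate node
    # is no more expensive, there is no reason to avoid it.
    if intermediate_cost <= dc_cost:
        return "intermediate"
    # Intermediate inference costs more, so prefer the provider's own data
    # center unless it is close to saturation.
    if dc_utilization >= overflow_threshold:
        return "intermediate"  # peak-overflow relief
    return "data_center"


# Hypothetical costs per task, in arbitrary units.
print(choose_target(dc_utilization=0.50, dc_cost=1.0, intermediate_cost=1.5))
print(choose_target(dc_utilization=0.95, dc_cost=1.0, intermediate_cost=1.5))
```

With these assumed inputs, the first call stays in the data center and the second spills over to the intermediate node.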

For smart-home applications, the home gateway is another possible locus of inference. If a single company outfits a home and includes its own hub, then that company has complete control over how the traffic is managed inside the home. If, on the other hand, the home had smart devices from different companies, then the ISP might have inference capacity in the modem or in its own servers just upstream.

“Netflix is doing local caching with the ISP,” Lowman said. “They may be able to do some of the AI stuff there.”

While Cadence’s Liu doesn’t see this happening yet, she did agree it’s a possibility. “With a smart speaker or a light switch, you could do [inference] in the gateway.”

Fig. 2: In this speculative scenario, headphones could send data for inference in the phone, or, from the phone, it could be sent over the cellular or WiFi/wired network. At any point along the way, NPU hardware could execute the task. If any of the points didn’t have inference capability, then the task could move upstream to the next unit. This doesn’t take into account security or the business model. Source: Bryon Moyer/Semiconductor Engineering

The more difficult issue, however, involves the business model. There are two obvious possibilities for build-out of these capabilities. In one case, inference is built for the network itself, but has capacity to handle high-traffic episodes. Between those episodes, it might choose to “rent” the spare capacity for extra revenue. In the other case, a provider might choose to build in excess capacity for the express purpose of monetizing its use for non-networking applications. The second case in particular cries out for a return on the investment in that additional hardware. But who pays for that return?

There are a few different possibilities, depending on whose needs are being met. If a big cloud provider sees this as providing more flexibility and greater capacity at incremental cost, it might make arrangements with a network provider (or several of them) to pay them for tasks performed outside its domain. Or a big application provider might make a similar deal to improve the performance of its application in the eyes of its users. A large enterprise could leverage this to offload tasks from its own internal data centers. In any of these cases, billing and task management facilities would need to be incorporated into parts of the network that, until now, haven’t required that capability.

Security and privacy considerations also change. “Security is both improved and threatened by this,” said Kim. While networks are used to protecting data in motion, that happens through encryption. If the network itself is operating on the data, then it has to be able to decrypt the data before working on it. In that case, data-in-use protections are needed. For end-to-end encryption, the encryption keys typically aren’t available except at the very endpoints, so a way would be needed for the data center to delegate tasks to an intermediary “delegate” node.

Kim provided some scenarios as to how that might work. They all assume that the delegate node will operate not just on one task, but for an entire session that might last seconds or longer. That helps to amortize the time required for the extra key exchanges necessary to make this work.
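The amortization argument can be illustrated with a few lines of arithmetic. The timings below are hypothetical, and the model (a single up-front key exchange followed by equal-length tasks) is a simplifying assumption rather than anything Kim described in detail.

```python
def overhead_fraction(
    key_exchange_ms: float, task_ms: float, tasks_in_session: int
) -> float:
    """Share of total session time spent on the one-time key exchange."""
    if tasks_in_session < 1:
        raise ValueError("a session needs at least one task")
    total_work_ms = task_ms * tasks_in_session
    return key_exchange_ms / (key_exchange_ms + total_work_ms)


# Hypothetical timings: a 50 ms key exchange and 10 ms per inference task.
single_task = overhead_fraction(key_exchange_ms=50.0, task_ms=10.0, tasks_in_session=1)
long_session = overhead_fraction(key_exchange_ms=50.0, task_ms=10.0, tasks_in_session=100)

print(f"one task:  {single_task:.1%} of the time is key exchange")
print(f"100 tasks: {long_session:.1%} of the time is key exchange")
```

Under these assumptions, the key exchange dominates a single-task session but shrinks to a few percent of the total once the session covers a hundred tasks.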

In one scenario, the delegate node is authorized before the session starts up, and the edge device authenticates directly with the delegate instead of the data center.

In another scenario, authentication happens with the data center, but then the data center authorizes a key exchange between the delegate and the key server so that the delegate can retrieve the existing key from the key server.

If the data-center and delegate keys must be different, then the edge device could run a separate authentication with the delegate, so that the data-center session uses one key and the delegate session uses another.

Alternatively, the edge device could authenticate only with the delegate, with the delegate then communicating with the data center to authorize the session. This is like the first scenario, except that the delegate must get permission from the data center in real time.

These details complicate what sounds conceptually simple, but that doesn’t mean it’s not possible. If the economics are attractive enough, it’s certainly doable. But that remains the big question: is this worth anyone’s while to do?

Some think this is unlikely to happen. “In general, when we look at AI and the data coming from vision (or other) sensors, they aren’t doing inference in the middle,” said Pulin Desai, product management group director for Tensilica Vision and AI DSP IP at Cadence. “SoC makers are targeting the actual edge devices.” If it does happen, there’s no expectation that anything like this would materialize quickly. Said Kim, “It’s really early days. Some is out there, but it’s very limited,” with more serious usage expected three to five years out.

Bryon Moyer

Bryon Moyer is a technology editor at Semiconductor Engineering. He has been involved in the electronics industry for more than 35 years. The first 25 were as an engineer and marketer at all levels of management, working for MMI, AMD, Cypress, Altera, Actel, Teja Technologies, and Vector Fabrics. His industry focus was on PLDs/FPGAs, EDA, multicore processing, networking, and software analysis. He has been an editor and freelance ghostwriter for more than 12 years, having previously written for EE Journal. His editorial coverage has included AI, security, MEMS and sensors, IoT, and semiconductor processing. His technical interests are broad, and he finds particular satisfaction in drawing useful parallels between seemingly unrelated fields. He has a BSEE from UC Berkeley and an MSEE from Santa Clara University. Away from work, Bryon enjoys music, photography, travel, cooking, hiking, and languages.