Charting Where Enterprise Video Will Go in the Years Ahead

Imagine that you’re trying to plan for the upcoming fiscal year, and suddenly remember that a key budget element that impacts your department was mentioned in an all-hands meeting a week or so ago.

You search through the agenda document, but don’t see anything that fits what you remember, and the C-suite hasn’t yet released full budget figures for next quarter, so there’s no way to find that particular detail.

One option is to email the comptroller’s office and ask the team to give you the budget information, but you don’t remember enough detail about the particular point the speaker or speakers made in the all-hands meeting. What to do?

A few years ago, searching through hours of all-hands meeting video would’ve been your only option. But thanks to computer vision, machine learning, and robust metadata extraction, today’s enterprise video platforms (EVPs) are able to help you find what you need to know to appear knowledgeable when you ask for additional details.

Technologies that perform facial recognition and speech-to-text indexing have been around for at least 2 decades. In fact, a research project in Europe that I was part of around 2004 posited the very scenario mentioned above, but the tools were complex and didn’t easily tie into EVP solutions. Still, even those tools were capable of generating up to 32,000 pieces of metadata per frame.

Today’s solutions offer more accurate facial recognition and slightly better speech-to-text transcription, the latter in part derivative of what we’d done on both our research project and other projects that focused on virtual beam-forming microphones for use in corporate environments. Yet the true power of these solutions isn’t in the discrete tools but rather the holistic approach to searching indexed content.

Going back to our scenario, let’s consider ways to approach the problem.

First, if the presenter was properly recorded (meaning the recording used a good lavaliere microphone and didn’t combine that microphone with the audience mics in the final recording), then recent advances in speech-to-text processing should yield results adequate to find the keywords you’re looking for.

The same may be true with several presenters, assuming they don’t talk over each other. Some solutions even offer a way to differentiate between speakers. While it’s often a fairly rudimentary differentiation (e.g., Speaker 1, Speaker 2), it still makes it possible for the seeker to filter searches by particular speakers.
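To make the idea of filtering by speaker concrete, here is a minimal sketch in Python. It assumes the platform exposes transcript segments with a start time, an end time, a generic speaker label, and the recognized text. The `Segment` type, the `search_transcript` function, and the sample data are all hypothetical and are not drawn from any particular EVP.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the start of the video
    end: float     # seconds from the start of the video
    speaker: str   # generic label such as "Speaker 1" (hypothetical)
    text: str      # speech-to-text output for this segment


def search_transcript(segments, keyword, speaker=None):
    """Return segments containing keyword, optionally limited to one speaker."""
    needle = keyword.lower()
    return [
        s for s in segments
        if needle in s.text.lower() and (speaker is None or s.speaker == speaker)
    ]


# Made-up sample data for illustration only.
transcript = [
    Segment(0.0, 12.5, "Speaker 1", "Welcome to the all-hands meeting."),
    Segment(12.5, 40.0, "Speaker 2", "The new budget changes travel spending."),
    Segment(40.0, 55.0, "Speaker 1", "Questions about the budget go to finance."),
]

hits = search_transcript(transcript, "budget", speaker="Speaker 2")
for h in hits:
    print(f"{h.start:.1f}s-{h.end:.1f}s {h.speaker}: {h.text}")
```

Dropping the `speaker` argument returns every segment that mentions the keyword, which is the fallback when you can't recall who was talking.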

What if you don’t remember who the speaker was, or if the audio is unintelligible enough to throw off the speech-to-text transcription engine? Several solutions offer the ability to find video based on a speaker’s image.

We’re all familiar with facial recognition, which has become more prevalent thanks to Facebook and facial-matching technologies like those integrated into the Photos app on Apple’s iOS products.

Video facial recognition is a bit more of a black art, though. After all, video is at least 24 still images per second, and sometimes up to 60 images per second. The sheer amount of information available in a single still image, or frame, is staggering, which is why most facial recognition systems for single still images require several seconds to process each frame.
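A bit of arithmetic shows the scale. The sketch below counts the frames in a one-hour meeting at 24 and 60 frames per second, then estimates the time needed if every frame were analyzed on its own. The 2-seconds-per-frame figure is an assumption chosen only to stand in for "several seconds," and the function names are hypothetical.

```python
def frame_count(duration_s, fps):
    """Number of frames in a video of the given length and frame rate."""
    return int(duration_s * fps)


def naive_processing_hours(duration_s, fps, seconds_per_frame):
    """Hours needed if every frame is analyzed independently, one after another."""
    return frame_count(duration_s, fps) * seconds_per_frame / 3600


ONE_HOUR_S = 3600

frames_24 = frame_count(ONE_HOUR_S, 24)   # 86,400 frames
frames_60 = frame_count(ONE_HOUR_S, 60)   # 216,000 frames

# Assumed cost of 2 seconds per frame (illustrative, not a measured value).
hours_24 = naive_processing_hours(ONE_HOUR_S, 24, 2.0)  # 48.0 hours

print(frames_24, frames_60, hours_24)
```

Under those assumptions, a single hour of 24 fps video would take two full days to process frame by frame, which is why practical systems sample frames rather than analyze every one.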

Add to this the complexities of the way that interframe compression works, where codecs like H.264 use full frames, or I-frames, coupled with differential frames like P- or B-frames that don’t store the entire image, and the complexities of decoding and indexing each individual frame rise significantly.

In addition, regardless of whether it’s a still image or a single frame of video, complexity also rises if there are several people in the shot. All in all, the processing required to handle facial recognition for just a few seconds of video is staggering.

On top of that, a professionally edited video will often cut back and forth among several presenters, the audience, and graphics (e.g., websites or PowerPoint slides).

So the facial recognition portion of an EVP solution needs to identify not only when a presenter appears on screen, but also when that person disappears and then reappears within a given threshold of time.

In other words, facial recognition needs to have both a tolerance threshold and an aggregation function, so users can search for a person and receive results that group individual detections into the sections of a video in which a particular presenter appears.
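One way to picture the tolerance threshold and the aggregation function together is the sketch below. It assumes the recognizer emits a list of timestamps, in seconds, at which a given presenter was detected. Detections separated by no more than `max_gap_s` are merged into one section, so a brief cutaway to a slide doesn't split the result. The function name, the 5-second default, and the sample data are hypothetical.

```python
def aggregate_appearances(timestamps, max_gap_s=5.0):
    """Merge detection timestamps into (start, end) sections.

    Two consecutive detections belong to the same section when the gap
    between them is at most max_gap_s (the tolerance threshold).
    """
    sections = []
    for t in sorted(timestamps):
        if sections and t - sections[-1][1] <= max_gap_s:
            sections[-1][1] = t          # extend the current section
        else:
            sections.append([t, t])      # start a new section
    return [(start, end) for start, end in sections]


# Made-up detection times for one presenter, in seconds.
detections = [10.0, 11.0, 12.0, 15.0, 40.0, 41.0, 43.5, 90.0]
sections = aggregate_appearances(detections, max_gap_s=5.0)
print(sections)  # [(10.0, 15.0), (40.0, 43.5), (90.0, 90.0)]
```

A larger threshold yields fewer, longer sections, while a smaller one tracks edits more closely at the cost of fragmented search results.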

Based on the above, the good news is that this multiface, multiframe facial recognition might actually help solve the problem of finding the right video clip to help with our budget problem. If you suddenly remember that it was two co-presenters who were talking about the new budget and how it impacts your department, might it be possible for the EVP to search for more than one person at a time?

The answer is yes, although very few solutions offer this option.
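Conceptually, searching for two people at once amounts to intersecting the sections of video in which each person appears. The following sketch assumes each presenter's appearances are already available as sorted, non-overlapping (start, end) pairs in seconds. It is an illustration of the idea, not a description of how any particular product implements it, and the names and data are hypothetical.

```python
def intersect_sections(a, b):
    """Return the time ranges present in both lists.

    a and b are sorted, non-overlapping lists of (start, end) tuples.
    """
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever section finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Made-up appearance sections for two co-presenters, in seconds.
presenter_a = [(10.0, 60.0), (120.0, 180.0)]
presenter_b = [(45.0, 130.0), (170.0, 200.0)]

together = intersect_sections(presenter_a, presenter_b)
print(together)  # [(45.0, 60.0), (120.0, 130.0), (170.0, 180.0)]
```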

One that does offer this in a somewhat rudimentary form is the new Microsoft Stream service. Designed to replace the legacy Office 365 Video service, Stream makes it possible to choose more than one person to search for, at least for on-demand content.

To do so, Stream offers features like audio transcriptions and face detection as a way to find relevant content.

Beyond that, Stream also offers the ability to search text that appears in a video, “even for specific words or people shown on screen, whether in a single video or across all your company’s videos.”

According to the Stream site, built-in machine learning “intelligence also drives accessibility features, so every person can engage according to their need.”

Microsoft Stream offers audio transcriptions and face detection as ways to find relevant content for an extra $2 per month beyond the $3 per month, per user base fee.

Stream is available for those who have an Office 365 subscription, but it is not limited to subscribers. Pricing for those without Office 365 is on a per-user, per-month basis. The basic service, for $3 per month per user, offers a way to aggregate, organize, and search videos. For an additional $2 per user per month, Stream offers two features key to our scenario: search “using deep search based on in-content signals like speech to text,” and search “using face detection and audio transcripts.”
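For budgeting purposes, the per-user pricing quoted above works out as in the sketch below. The $3 base fee and $2 add-on come from the figures cited here; the 500-user headcount and the function name are hypothetical.

```python
BASE_FEE_USD = 3      # per user, per month (figure cited above)
SEARCH_ADDON_USD = 2  # per user, per month for speech and face search (cited above)


def monthly_cost(users, with_deep_search=True):
    """Total monthly cost in dollars for the given number of users."""
    per_user = BASE_FEE_USD + (SEARCH_ADDON_USD if with_deep_search else 0)
    return users * per_user


# Hypothetical 500-person department.
basic = monthly_cost(500, with_deep_search=False)  # 1500
full = monthly_cost(500, with_deep_search=True)    # 2500
print(basic, full, full * 12)
```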

What About Live Enterprise Video?

It is possible to implement search features in live video, but the processing requirements noted above make it impractical in most on-prem solutions. The growth areas in making live video searchable will most likely come from cloud-based EVP solutions.

One possible approach, if your web-based unified communications tool offers traditional videoconferencing functionality, is to add an end point to a call that’s equipped with these indexing features. This allows the end point to begin recording the conference, in much the same way as a traditional network digital video recorder (NDVR), and then begin processing and indexing video frames at near-real-time speeds.

In practical terms, today this takes two to three times the actual length of the videoconference, but algorithm optimizations and increasingly powerful processors may bring this down closer to real time in the next 2 years. Also expect to see this type of feature added to web-only services like Zoom and Skype for Business in the near term.
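To put the two-to-three-times figure in perspective, the sketch below estimates how long after a meeting ends its video would become fully searchable. It assumes indexing starts when the meeting starts and proceeds at a constant rate. The function name and the 60-minute meeting are hypothetical.

```python
def minutes_until_searchable(meeting_minutes, slowdown):
    """Minutes after the meeting ends before indexing completes.

    slowdown is processing time divided by video length, so 1.0 means
    real time. Assumes indexing begins when the meeting begins.
    """
    total_processing = meeting_minutes * slowdown
    return max(0.0, total_processing - meeting_minutes)


# A 60-minute conference at the 2x and 3x rates cited above.
wait_at_2x = minutes_until_searchable(60, 2.0)  # 60.0
wait_at_3x = minutes_until_searchable(60, 3.0)  # 120.0
print(wait_at_2x, wait_at_3x)
```

At a slowdown of 1.0 the wait drops to zero, which is the real-time goal described above.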

Likewise, portable production and video capture solutions are growing in popularity. Mike Savello, VP of sales at LiveU, says live enterprise video is becoming a bigger part of the company’s target market.

“In fact, we are now targeting Fortune 500 accounts that regularly do internal global events,” says Savello, noting these events might span a range from CEO addresses and quarterly updates to new product or service announcements.

In the past, Savello says, these types of corporate events might “involve a company renting a production truck and a satellite truck, which can get very expensive.”

Although typically associated with “in the field” capture for sports events and news, cellular bonding solutions like those from LiveU are finding a home in enterprises, particularly Fortune 500 companies that regularly do internal global events.

“We can offer ‘at-home’ production,” says Savello, “by backhauling each camera over cellular or other IP connectivity to a central production platform. That means you don’t need a production truck or sat truck on site anymore.”

Monetizing Enterprise Content?

One pitch we’ve heard in the last year is the idea that enterprise content can be monetized. The pitch is often made in reverse, with companies noting that regular social platforms should not be considered as equivalent to EVPs, for several reasons including their inability to monetize content.