Speech analysis concept

FCP X already analyzes audio waveform patterns. Apple owns speech recognition software in the form of Siri. It would be great to see them enhance the FCP X Event Find bar to include speech detection, much like Avid's PhraseFind or BorisFX's Soundbite. Potentially there may be some patent issues, but at least in theory this concept could be built upon technology Apple owns.

Type in a phrase string and FCP X finds the clips that match this string based solely on the audio within the clip.
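A minimal sketch of what that lookup could be, assuming the application already stores per-clip word/timestamp pairs from its audio analysis (all names and data here are hypothetical, purely for illustration):

```python
# Hypothetical sketch: find clips whose recognized dialogue contains a phrase.
# Assumes each clip carries a list of (word, seconds) pairs from speech analysis.

def find_phrase(clips, phrase):
    """Return (clip_name, start_time) for each occurrence of the phrase."""
    target = phrase.lower().split()
    hits = []
    for name, words in clips.items():
        tokens = [w.lower() for w, _ in words]
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                hits.append((name, words[i][1]))  # time of the first word
    return hits

clips = {
    "interview_01": [("we", 1.2), ("love", 1.5), ("this", 1.8), ("city", 2.1)],
    "interview_02": [("i", 0.4), ("love", 0.9), ("this", 1.3), ("idea", 1.7)],
}
print(find_phrase(clips, "love this"))
# → [('interview_01', 1.5), ('interview_02', 0.9)]
```

The hard part, of course, is producing those word/timestamp pairs in the first place; the search itself is trivial once the index exists.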

Siri is actually a human voice; I heard an interview just this past week on NPR with the VO artist herself. However, I guess you're referring to the speech recognition of user voices that generates responses from Siri, right?

With regard to speech recognition built into X, I am under the impression that Apple is still very much into having 3rd parties supply the majority of add-ons, so wouldn't 3rd party software such as that from Boris suffice?

[David Roth Weiss]"However, I guess you're referring to the speech recognition of user voices that generates responses from Siri, right?"

Correct.

[David Roth Weiss]"so wouldn't 3rd party software such as that from Boris suffice"

No, because it doesn't originate from within the application. Boris places markers that have to be imported as XML. That's not nearly as useful when you want to find a specific word to properly button the inflection of a "frankenbite".
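For illustration, here's roughly the round trip being described: a third-party tool serializing its search hits as marker XML for the NLE to import. The schema below is invented for the sketch, not Soundbite's actual export format:

```python
# Hypothetical sketch of a marker-export step: a standalone search tool writes
# its hits out as XML, which the editor must then import into the NLE.
# The schema here is invented for illustration only.
import xml.etree.ElementTree as ET

def hits_to_marker_xml(clip_name, hits):
    """Serialize (word, seconds) hits as a simple marker list."""
    root = ET.Element("markers", clip=clip_name)
    for word, seconds in hits:
        ET.SubElement(root, "marker", value=word, start=f"{seconds:.2f}s")
    return ET.tostring(root, encoding="unicode")

xml_out = hits_to_marker_xml("interview_01", [("budget", 12.4), ("deadline", 47.9)])
print(xml_out)
```

The friction the post complains about lives in exactly this extra export/import hop: the search results are data *about* the clips, held outside the application that owns them.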

Exactly. And unless I'm wrong, loads of Siri APIs are coming in iOS 8 and Yosemite, so perhaps even if Apple doesn't incorporate it themselves, a 3rd party might be able to write a plugin or extension against the Siri API. I dunno, but it would be another big feather in X's metadata cap.

Demo of the relevant feature from Premiere Pro / AME (Speech Analysis) (this demo uses a reference script that aids the indexing process ... it also shows "clicking on words" to jump to that point in the clip):

When Adobe first introduced this feature, it was very hit-or-miss. That's because it was positioned as an instant transcription. Later, Adobe sort of admitted that it was primarily a way to link text to keywords derived from the text. So they added this "dictionary" function as a way to improve the accuracy: give it some sample text to improve its interpretation of the dialogue.

It works even better when you have the actual transcript and load that into the text field. The point is that it links the words to the point in the track when that word is spoken. In that sense it actually works a bit more like Avid ScriptSync than Avid PhraseFind. This way you can click on the word in the text field and find that portion of the clip.
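The word-to-timecode linking described here can be sketched as an alignment between the known transcript and imperfect recognizer output. Python's standard difflib is used below purely for illustration; the real products use far more sophisticated phonetic alignment:

```python
# Hypothetical sketch of "reference script" alignment: match words from a known
# transcript against imperfect speech-recognition output that carries
# timestamps, so clicking a script word can jump to its time in the clip.
import difflib

def align(script_words, asr_words):
    """Return {script_word_index: seconds} for words the recognizer also heard."""
    asr_text = [w for w, _ in asr_words]
    sm = difflib.SequenceMatcher(a=script_words, b=asr_text)
    linked = {}
    for op, a1, a2, b1, _ in sm.get_opcodes():
        if op == "equal":
            for offset in range(a2 - a1):
                linked[a1 + offset] = asr_words[b1 + offset][1]
    return linked

script = ["the", "quick", "brown", "fox", "jumps"]
asr = [("the", 0.0), ("quick", 0.3), ("box", 0.6), ("fox", 0.9), ("jumps", 1.2)]
# "brown" was misheard as "box", so it stays unlinked; the rest map to times.
print(align(script, asr))
# → {0: 0.0, 1: 0.3, 3: 0.9, 4: 1.2}
```

This is why supplying the actual transcript helps so much: even a sloppy recognition pass anchors most script words to a time, and the misrecognized ones can be interpolated from their neighbors.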

[Michael Phillips]"Except that unlike ScriptSync, there is no unified view of text and all related clips. "

That's what struck me too. There are a lot of different implementations, including several built around licensing the same Nexidia technology that Avid has, but if it were easy to build something as feature-complete as ScriptSync, more people would.

Or maybe not enough people have seen it work to know the target to hit, or surpass, or to understand the transformative potential of getting this right.

The fact is that even the less capable or less complete alternatives are pretty amazing. Whether it's a plug-in, or something built in to one's NLE of choice, more people should be using it, and telling developers how to make it even more useful.

People will find a way to make technologies work for them if it fits their price point. But if I were working on a complicated feature or doc, my first choice would be Avid, strictly because of ScriptSync (and PhraseFind, which, although it is now charged for separately, used to be, if memory serves, part and parcel of ScriptSync). Yes, Avid has its set of problems. But the availability of ScriptSync can often be a deal breaker.

ScriptSync is the automated part and is separate from script integration, which can be done manually. Both ScriptSync and PhraseFind have been extra cost options. Neither is currently available with new Avid software due to ongoing and as yet unresolved negotiations between Avid and Nexidia.

True, but it still works with older Avid Software - at least for now. As for them being separate functions - maybe. But charging for them as separate functions - at least as I remember it - was a relatively recent happening.

That sounds right. I know I still have a machine running version 3 that I use all the time that has it, and I know it wasn't extra then. One of the more bizarre and inexplicable moves Avid ever made was to begin charging thousands of dollars for a function that had always been included before. I don't know what the legal ramifications are, but if going forward Avid will not have this functionality, then the only thing that will set them apart is having markers that stay where you put them, no matter what. (And that is not a small thing to me.)

You'll remember that when Avid included these features, the retail price for Media Composer was higher. So Avid was absorbing the licensing fees. As Avid had to drop the price, they decided to option these functions to cover the expense.

One of the challenges facing a lot of these technologies is the quality of the recording and the quality of the speech. Volume, emotion, accents, whispering, etc. all affect the success of a speech-to-text engine, not to mention that the words have to be in a dictionary. Dictionaries do not represent entire languages; names, etc. are usually missing.

You'll find that, on average, speech that is perfectly recorded and pronounced at a steady pace has about 80% accuracy. And you'll be amazed at how much you miss that remaining 20% when trying to make things truly useful. Premiere Pro's speech to text is a good example of that.
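A quick back-of-the-envelope illustration of why that last 20% hurts so much: assuming, simplistically, independent per-word errors, the chance that a multi-word search phrase survives recognition intact falls off fast.

```python
# Why "80% word accuracy" still frustrates: the probability that every word
# of an N-word phrase was recognized correctly, assuming independent errors
# (a simplification -- real errors cluster, but the trend holds).
def phrase_survival(word_accuracy, phrase_len):
    """Probability that an entire phrase of phrase_len words is correct."""
    return word_accuracy ** phrase_len

for n in (1, 3, 5, 8):
    print(n, round(phrase_survival(0.8, n), 3))
# 1 0.8
# 3 0.512
# 5 0.328
# 8 0.168
```

So at 80% per-word accuracy, a five-word search phrase has only about a one-in-three chance of matching a perfect transcription hit, which is roughly the frustration described with Premiere Pro's implementation.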

This is the state of speech to text today; who knows what the next 5-10 years will bring, but many companies have been working on this for the past 50 years.

Speech to text and phonetic-based technologies such as the Nexidia solution are very different technologies, each with its own distinct advantages and disadvantages - just don't expect the same functionality from both of them for all things.

[Michael Phillips]"This is the state of speech to text today; who knows what the next 5-10 years will bring, but many companies have been working on this for the past 50 years."

The nature of allophonic representation is such that computers just can't do it, ever; they cannot compute a meaning. They are not good at inductive reasoning. Humans are great at inductive reasoning.

If speech recognition were solved via computation, then video editing would also be solved via computation. A bit of reflection and we can see this to be true: phonemes represented as allophones occur in sequences, just like images.

Speech to text is really about training the computer to match the allophonic representation of a single user (because we all have different shaped mouths and tongues) to the prescribed phonology that corresponds to the prescribed spelling.

And all of you are 'speaking of' only English as it's spoken in the US. Then there's English as spoken in Scotland, Ireland, Australia, or heavily accented English spoken by non-native speakers.

And just after they finish figuring out English, there's the rest of the world, with its myriad languages. In a documentary in India, for instance, it's not uncommon to have 3-6 Indian languages, with English words sprinkled into sentences of another language. It would take a supercomputer just to identify which language is being spoken. I think we are a way off from this speech-text interactivity in editing. At least on a global scale.

A phonetic-based technology can also derive language identification, and it does not need a supercomputer. Nexidia already provides language ID for some of its solutions now, for any of its currently supported languages. From a dialog search, I can search based on Canadian French versus France French, etc.

The latest release of their QC product includes identifying dialects for certain languages.

Phonetic language modeling can be done for languages that do not exist.

[Oliver Peters]"FCP X already analyzes audio waveform patterns. Apple owns speech recognition software in the form of Siri. It would be great to see them enhance the FCP X Event Find bar to include speech detection, much like Avid's PhraseFind or BorisFX's Soundbite. Potentially there may be some patent issues, but at least in theory this concept could be built upon technology Apple owns."

Also, I was under the impression that Siri uploads audio recordings and processes them in the cloud -- not locally on the handset -- and I know that many here are not fond of cloud-based solutions. Would reliance on the cloud be a showstopper here?

That would depend, at least in my own opinion. If said 'cloud solution' included a 'subscription only' way to pay for the plug-in, it would be a non-starter for me. Just my personal opinion, but I'm not a fan of that way of doing business, and my business decision is not to spend money on subscription-based 'cloud services'. And it's not just Adobe's CC. I recently stopped doing business with Digital Juice when they went to a subscription-based model for their content. I've got no problem with various aspects of 'cloud solutions'. For example, being able to instantly download a program after you purchase it is awesome! And it's 'green' (no discs or big manuals and packaging, just a download and a PDF manual). As always, the devil is in the details.

[Walter Soyka]"If this proposed feature were to work the way Siri works now, all your audio would be uploaded to someone else's computers, out of your control, for analysis. And no Internet? No audio search."

I think there's an important distinction to make between Siri and Dictation. While, I'm sure, there's crossover in that Siri is a learning engine for Dictation, there is a difference.

For example, you can download an "Enhanced Dictation" engine (a 784 MB download) that enables offline Dictation in Mavericks.

Siri, on the other hand, needs to look queries up over a network in order to retrieve an answer; Dictation doesn't necessarily need this same capability.

I'm sure that all the folks who blab at Siri help Apple develop better Dictation skills.

You're right in that in order to get this working the way Apple would want to use it, and in order to constantly improve this technology, the internets will need to be nearby. But there is already an "offline mode" in place (...you know, the offline mode that you have to download from the internets... such a Catch-22 with this technology).

[Gabe Strong]"For example, being able to instantly download a program after you purchase it is awesome! And it's 'green' (no discs or big manuals and packaging, just a download and a PDF manual)."

I agree that having download-now options can be very convenient, but since this is the FCP X or Not forum, I'm going to get all tangential and question the assumption that going the download route is inherently more green. To replace physical things, Apple, Adobe, Netflix, Amazon, etc., have had to build huge facilities brimming with servers and networking equipment, and all of that manufacturing of electronics comes with an environmental cost. For example, trees, if harvested sustainably, are an abundant and renewable resource, whereas the raw materials used in computers (and other electronics) are not, and recycling a plastic disc and paper manual is much easier than recycling electronic waste.

As somebody said before, in Avid's script integration there is a unified view of the screenplay plus all the takes right inside, which allows us to see which takes include a given line of dialogue, compare them quickly, take the best take, etc.

It is true the sync has to be a manual process until the ScriptSync licensing issues are resolved, but this view of takes is still ultra-useful, especially in fiction, when takes have different energy and varying tones, and the editor wants to use all that richness.

Does that "script integration view of takes" exist in FCP X or Premiere Pro? I've searched the net, and it doesn't seem so... any solutions?

Script Based Editing is what the feature was called before the phonetic transcript alignment was added (Nexidia technology), at which time it was rebranded ScriptSync. While waiting for the licensing issues to be sorted out, script based editing still exists and provides a unique view of the coverage available to me as an editor, plus the ability to select line(s) and take(s) for review, etc. I am editing a short right now where I manually synced the takes to the script for just this purpose. I wish Avid would take this view more seriously, as the functionality hasn't changed since it was first introduced in 1996-97 - other than "hold slate on screen" and, of course, the phonetic sync capabilities.

The phonetic sync is what really brought the feature to the masses, as it eliminated the one major pain point of the process - syncing. Transcripts can be synced faster manually, but dealing with multiple takes takes more time by hand.

But once lined, it is a great way to edit. And no, there is no such thing in other applications, although Lightworks did show a script line view back in the '90s, but the Ediflex patent held it up. That patent is one that Avid got in a deal when Ediflex went under, and Script Based Editing was implemented from a new design. But that patent has now expired (3-4 years ago?), so there is no real reason why we won't see similar solutions coming to market. Avid does have additional patents covering auto-lining using speech technology, which covered the use of technologies like Nexidia, but even those are set to expire relatively soon.

There's a bit of a history lesson in what should be a quick answer: no other NLEs have it (at this time).

As an aside, one of the advantages of script-based editing in Avid is the ability to preview a series of different takes of any given dialogue line, back-to-back. A number of editors who don't use this tool have described a different practice to me, including William Goldenberg and Kirk Baxter. It goes like this.

Make a general decision about where you plan to make edits, based on the dialogue lines. Edit a string-out sequence of each line from each take in succession. This means that each line is repeated for as many takes as you feel are good. Then add the next line and repeat. Organize these clips from wide to close-up for each line. Now you have one long timeline with all the options and angles for each line. This gives you a quick way to compare, as well as a sequence to go back to when the director wants to review the coverage for other options.

Baxter then goes through this sequence and on any of the selects that he likes, he'll raise the clip to a higher track. Once done with this copy, he'll delete the non-selected clips and start shaping the scene with the remainder.

Not the same as true script-based editing, of course, but still a very viable way to achieve the same result, especially when you don't have this feature available, like in FCP X.

[Jeremy Garchow]""Auditions" in FCPX allow you to edit with different takes"

I'm not sure if you are responding to my post or just adding another option.

While useful, audition is not really the same as the process I described. With auditions, you have to switch back and forth. With a string-out of all takes, you can directly compare one line reading to another without any delay in switching the audition. Furthermore, with a 2-sequence pancaked timeline in Premiere Pro, just pull down the section you like from one sequence to another to build the scene.

Auditions are nice, but they seem impractical if you have a dozen versions of each dialogue line in a dramatic script.

I should note that Mike Matzdorff addressed this directly in one of his web appearances, expressing the opinion that while the ability to match takes to a script can be nice, it falls apart pretty quickly in the face of real-world filming, where the actors and a director often want to deviate from the "as written" lines.

He described great frustration during the edit of Focus, trying to fit what was ACTUALLY said into a form where the script was running simultaneously. In the end, I think he opted for using the X database with LINE NUMBERS instead of transcriptions - since that bucketed the takes designed to get the sense of the scenes across, regardless of the actual words spoken in doing so.

That doesn't mean script line matching is useless at all. Just that it might be an approach better suited to strictly constructed copy, where a performer is expected to deliver the lines precisely as written. This may make the most sense in areas such as corporate video, where highly technical or procedural scripts are vetted by content experts and even teams of lawyers and have to be delivered word for word.


Well, for me, even when words are changed and small improvs are made, it's still very useful to be able to click on a line and go directly to the time where the actor says that line or responds to the previous line.

That way I can compare, quickly and in context, the different tones and wording styles of how a character could answer a line, etc. - without having to spend time finding the right moment in each take, which breaks the process...

[Bill Davis]"I should note that Mike Matzdorff addressed this directly in one of his web appearances, expressing the opinion that while the ability to match takes to a script can be nice, it falls apart pretty quickly in the face of real-world filming, where the actors and a director often want to deviate from the "as written" lines."

I've interviewed a lot of big-name Avid editors for some of the DV stories that I write, and I generally find them to be split. Some love it and can't live without it; others have never used it. One sentiment I've heard a lot is that they think they'd like to use it on a film and drive their assistants crazy setting it up, only to never use it when they start cutting.

FWIW - here's a bit of a pseudo script-based workflow I set up for FCP X. It's a lot like Mike's approach.

[Bill Davis]"That doesn't mean script line matching is useless at all. Just that it might be an approach better suited to strictly constructed copy, where a performer is expected to deliver the lines precisely as written. This may make the most sense in areas such as corporate video, where highly technical or procedural scripts are vetted by content experts and even teams of lawyers and have to be delivered word for word."

Where it really helps on the Avid side is with documentaries, if you have interview transcripts. Obviously, the holy grail would be speech-to-text that actually works, so you didn't have to create a transcription in the first place. Steve Hullfish created a nice workaround for this using dictation.

[David Steiner]"As somebody said before, in Avid's script integration there is a unified view of the screenplay plus all the takes right inside, which allows us to see which takes include a given line of dialogue, compare them quickly, take the best take, etc.

It is true the sync has to be a manual process until the ScriptSync licensing issues are resolved, but this view of takes is still ultra-useful, especially in fiction, when takes have different energy and varying tones, and the editor wants to use all that richness.

Does that "script integration view of takes" exist in FCP X or Premiere Pro? I've searched the net, and it doesn't seem so... any solutions?"

[Jeremy Garchow]""Auditions" in FCPX allow you to edit with different takes, and the magnetic timeline makes this so easy to switch from take to take and not have to heal the timeline around it.

If the metadata has been prepared in advance, getting to synced Auditions is very easy in FCPX with Shot Notes and Sync-N-Link."

Actually, while Siri speech-to-text does work, it works only in English, and really only English delivered with a sufficiently Western accent. Siri absolutely doesn't work in, for instance, India, or the Far East, or maybe even Africa. OTOH, Google search works just fine with our accents. On my iPhone I cannot make Siri understand me without repeating things with a faux US accent. But Google Maps and Google search can understand me perfectly.

Avid ScriptSync too (AFAIK) works only in English. With dialogue and scripts in other languages, not so. I've seen ScriptSync magic work on an American movie being shot in India, where the assistant editor had set it up with the script neatly, and one could simply double-click on any line in the script and it would jump to the shot with those lines. Except when one of the characters was speaking some other language, or even English with a non-Western accent. There it needed help.

Nexidia supports a couple dozen languages and dialects, and language packs were available to purchase for both ScriptSync and PhraseFind. Whether Avid's implementation stayed updated with the latest versions might be a reason for a more limited set of languages, as new ones get modeled over time.

I think there are a lot of valid alternatives as to how takes are organized. What is unique about the script layout is that there is no need to do a search or other operations, as it can all be done at a glance - I can look at the script and see what coverage is available for any given span of story. This can be further refined with indications of dialog being spoken on screen or off (or from the back), color for preferences, etc. Once edited, there is a script matchback (like matchframe) that goes from the event in the timeline back to the script and highlights the take and span used, for alternate choices. And even without PhraseFind, it can be used as a text search for action, scene, dialog, etc. to make the story work.

There are many more things this interface could provide if Avid were committed to it, with or without phonetic sync (which is huge, but still a supporting player).

Speech to text (à la Siri) and such get you to a transcript, which could then get you aligned, but that technology has been hit-and-miss for decades, and Adobe even removed it from its product. It's amazing how much of a pain 80% accuracy still is to work with.