
No cloud server or messaging system is completely secure: Just ask Hillary Clinton. Even though these systems are protected with layers of security, those layers can be hacked. Brute-force attacks can crack passwords. Man-in-the-middle (MITM) attacks using tools like sslstrip can downgrade secure sessions to insecure HTTP. And plain social engineering can be used to access virtually anything.

This is why homomorphic encryption is on the brink of becoming popular in cloud computing, especially when only 25% of people trust cloud providers with their data.

With homomorphic encryption, a cloud server can’t see the original content of a file. Instead of the original content being stored, a scrambled version of it is stored. And using homomorphic encryption, everything from plaintext to audio snippets can be stored, searched for, and located on the cloud server without the cloud server company seeing it (explained visually below).
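To make the idea concrete, here is a toy demonstration of a homomorphic property using unpadded "textbook" RSA, which happens to be multiplicatively homomorphic: multiplying two ciphertexts gives the ciphertext of the product. This is purely illustrative, with a deliberately tiny key, and is not the scheme used in practice:

```python
# Toy "textbook" RSA with a tiny modulus, for illustration only.
n, e = 3233, 17          # public modulus (61 * 53) and public exponent
d = 413                  # matching private exponent

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 6
# The server can multiply the two ciphertexts without ever seeing 7 or 6...
product_of_ciphertexts = (encrypt(a) * encrypt(b)) % n
# ...yet the result decrypts to the product of the plaintexts.
assert decrypt(product_of_ciphertexts) == a * b   # 42
```

The same principle, in far more sophisticated schemes, is what lets a cloud server compute over data it cannot read.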

For instance, if you are a doctor who has dictated sensitive patient data (as hundreds of thousands of medical professionals do every day), you could send the recording to a homomorphic speech service, then search the audio file for specific keywords. Without understanding the content of the recording, the service could locate parts of the recording with those keywords and send them back to you.

Currently, most practices send audio reports to medical transcriptionists, which is hardly secure, especially if the transcription service is outsourced and not kept in-house. At the end of the day, computers are less emotional and, therefore, more reliable with information than humans.

How files are securely stored and searched for on cloud servers

At Intelligent Voice we take emails, phone calls and other communication and put them through a powerful, AI-driven analytics engine. This helps companies see what kind of conversations their team is having with customers, among other things.

The results from this, including transcripts of video files and phone calls, can now be stored securely using homomorphic encryption on cloud servers.

We can search encrypted audio transcripts without ever decrypting them. The cloud server never sees them in plaintext form and privacy is assured.

Below we’ll go over how this works with an audio file. However, the approach is the same for files that are already in plaintext.

Architecture of homomorphic-encryption-based phonetic string search

Data Flow

We reduce an audio or text file into symbols (which could be phonetically based). These symbols are the “content” that’s indexed on our cloud servers.

The encrypted audio and symbols are uploaded to the cloud and added to an encrypted index.

When a search for a file is initiated, the search term is encrypted using our algorithms to find matching symbols. Relevant files and file portions are returned.
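The flow above can be sketched roughly as follows. The keyed hash here is only a stand-in for our (patent-pending) trapdoor function, and all the names and keys are illustrative:

```python
import hmac, hashlib

SECRET_KEY = b"client-side key"   # hypothetical; never leaves the client

def blind(symbol):
    """Keyed one-way mapping standing in for the trapdoor function:
    the server can compare blinded symbols but cannot invert them."""
    return hmac.new(SECRET_KEY, symbol.encode(), hashlib.sha256).hexdigest()

# Client side: reduce the recording to phonetic symbols, blind them, upload.
recording_symbols = ["HH", "AO", "MX", "AH", "MX", "AO", "RX", "FX", "IH", "KX"]
server_index = [blind(s) for s in recording_symbols]   # all the server ever stores

# Search: the client blinds the query symbols with the same key...
query = [blind(s) for s in ["AO", "RX", "FX"]]

# ...and the server scans for the blinded stream, never seeing plaintext.
def find(index, q):
    return [i for i in range(len(index) - len(q) + 1) if index[i:i + len(q)] == q]

print(find(server_index, query))   # positions where the encrypted stream matches
```

The design point is that matching happens entirely on blinded data: the server returns positions, and only the key holder can relate them back to content.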

Legend

Light blue: Encrypt Audio File

Blue: Cloud Server

Green: Turn Audio into phonetic symbols and encrypt

Yellow: Homomorphic representation of phonemes

Red: Client-side search preparation

Purple: Encrypted results returned

Glossary

AES encryption: A very powerful “symmetric” encryption technique, ie the key used to encrypt is the same as the key used to decrypt

Trapdoor: A mathematical function that is easy to compute in one direction, but very difficult to reverse engineer from just the answer

This symbol approach is important (and patent pending) because it reduces “search space.” Technologists have found that if you search for words using this approach, it’s painfully slow because of the processing power required. You might be trying to find over a million possible combinations.

However, if we take a word or phrase and reduce it to symbols — homomorphic HH AO MX AH MX AO RX FX IH KX, for instance — there are only dozens of available symbols. So we index these instead, across voice or text, and the search space is reduced from millions to dozens of units. Instead of looking for collections of matching words, we’re looking for matching streams.
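A rough illustration of the search-space point, using the symbol stream from the example above and an assumed toy inventory of phone-like units:

```python
# Excerpt of an assumed symbol inventory (real inventories run to a few dozen units).
symbols = {"HH", "AO", "MX", "AH", "RX", "FX", "IH", "KX"}

# "homomorphic" reduced to a symbol stream, as in the article:
stream = ["HH", "AO", "MX", "AH", "MX", "AO", "RX", "FX", "IH", "KX"]

# Every unit in the stream is drawn from the same small alphabet,
# so an index over symbols has dozens of entry types...
assert set(stream) <= symbols

# ...whereas an index over words grows with the vocabulary.
english_vocabulary_estimate = 170_000   # order-of-magnitude figure
print(len(symbols), "symbol types vs ~", english_vocabulary_estimate, "word types")
```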

Take a banking institution for instance. While the customer service representative is asking you questions about your social security number and where you live, voice print recognition software could be working in the background for enhanced security. It would identify characteristics of your voice like pronunciation, emphasis, accent, and talking speed.

Currently, it’s harder for someone to steal someone’s unique voiceprint than it is to steal information like social security and account numbers. But it’s not impossible. A hacker could easily hack a third-party cloud server that has your voiceprint and use voice mimicking software to hack your financial accounts.

The recent CloudPets hack shows just how easy this is. Using homomorphically encrypted and stored audio would significantly increase the security and privacy of this data.

Conclusion

Even though homomorphic encryption was discovered decades ago, there’s only recently been enough computer processing power to make homomorphic storage and search practical. Before, it would take hours or days to do what now takes seconds.

This is good news for cloud service providers, because even though cloud servers can be hacked, it won’t matter as much if they and their customers are using homomorphic encryption to increase the overall security and privacy of their data: If the cloud has never had a “plain” version of the original data, the hacked data remains encrypted and inaccessible.

We all have a nagging sense of worry about letting our personal data roam free in the cloud. Can we trust the people who hold it not to misuse it?

It seems that in fact very few people feel comfortable about it: We asked 1,500 Americans over the Thanksgiving weekend whether they trusted the “Big Cloud” providers, such as Google, Amazon and Apple, not to misuse their personal data.

Shockingly, only 25% said that they did trust them, which raises the question of what the remaining 75% are doing with their data…

We have done a lot of work recently on porting the IV system to run on small embedded devices, specifically those that will be used in self-driving cars.

I have a particular issue with the way the industry has been looking at the implementation of self-driving capabilities, particularly the reliance on the “cloud”.

A lot of the core driving competencies of the self-driving vehicle will be built in: it has to be, as you can’t be sending data to a central server in a timely enough fashion to avoid a crash. But we have seen that at the moment, a lot of the “non-core” elements, from mapping to the HID (Human Interface Device), are being tossed over the fence to cloud providers.

Given my interest on “on-premise” voice recognition, an obvious bug-bear for me is doing the Natural Language Understanding piece in the cloud, ie where you talk to the car in a natural way, and it reacts accordingly. Some elements such as booking a restaurant have to be done with some form of connectivity (known as “fulfillment”), but many can be done in car, and I would argue, have to be.

The computational power to build a spoken dialogue system that can react to many situations and many languages is pretty big – Gigabytes of memory may be needed just to allow for a wide vocabulary to be used by a driver in even one language. Add to that the need to allow the system to converse, and that is a lot of horsepower, so surely, it makes sense to use what the cloud is good at, and provide vast computational power to throw at difficult problems, all for the cost of a data link. Why have a supercomputer in every car, when you can just have one at the end of a pipe?

The answer becomes simple for anyone who has spent time away from a conurbation, or who has gone on a long-distance road trip. Quite often, there is no data link at all, or if it is, it is so puny as to only be able to send SMS or limited data.

I think that most people who are designing and investing in these systems are commuting up and down routes well served by 3G and 4G data: Route 101 in California probably carries more VC’s in a day than most other roads in the world carry in a lifetime.

I can see hybrid models being adopted, where there is a fallback from the cloud to a pretty intelligent system in a car when the main data links are slow or unavailable, even perhaps doing basic pre-processing in the car (turn the audio into text, and then send it by SMS to a server for “meaning” parsing and fulfillment).
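In outline, such a hybrid fallback might look like this; every function and intent name here is hypothetical:

```python
def cloud_nlu(text):
    """Hypothetical cloud NLU call; raises OSError when the link is down.
    Simulated here as permanently offline for the sketch."""
    raise OSError("no data link")

# Minimal in-car fallback: a small table of keyword intents.
LOCAL_INTENTS = {"go home": "NAVIGATE_HOME", "pick up mum": "NAVIGATE_CONTACT"}

def local_nlu(text):
    return LOCAL_INTENTS.get(text.lower().strip(), "UNKNOWN")

def understand(text):
    try:
        return cloud_nlu(text)    # prefer the richer cloud model...
    except OSError:
        return local_nlu(text)    # ...fall back to the in-car system

print(understand("Go Home"))   # resolved locally, with no connectivity at all
```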

But for me, if you want an autonomous vehicle, it needs to be actually autonomous, not autonomous except when you can’t connect to the internet: so all of the “smarts” need to be in the car. Yes, you do need more processing and more memory and more storage in the car, but by making the car less reliant on data links, you also make it less vulnerable to hacking, and much less vulnerable to internet outages.

A few weeks ago, the entire East Coast of the US was affected by a massive Distributed Denial of Service (DDoS) attack. This meant that many popular websites could not be accessed. Imagine how you would feel if you were at work one night, hopped into your car, and it told you that it couldn’t get you home, because the Internet was down? And because there is no manual control, you’re stuck in the car park, getting cold and not a little upset with your less than autonomous car.

And in the Internet of Things (IOT) future that will have self-reliant, meshed devices at its core, the autonomous car will be the star: and the most popular target for hackers. What better way to bring a country to standstill than by having half the cars on the road accelerate, while the other half hits the brakes?

My vision is that one day pretty soon, you will be able to talk to your car, not just to tell it basic things (“Go Home”, “Pick up Mum”), but to actually have a conversation, the logical extension of the famous Turing Test, where a machine is indistinguishable from a human. On long journeys, having a companion who gets to know you, helps to take away boredom, particularly if you no longer have to drive. This type of interactivity will become more and more prevalent as we see the chatbot move from being a toy to a real-world tool: As the elderly care crisis gets worse, home-care “robots” will be a reality, and these need as human an interface as possible to make them acceptable to the frail and home-bound.

I know that the race at the moment is on getting a car that can drive itself: But in our haste, let’s not forget what it is we are trying to achieve: “Autonomous” means a lot more than just “self-driving”.

Also see my recent article on self-driving cars in the Huffington Post here


The Intelligent Voice team participated in the latest annual NVidia GPU Technology Conference (GTC 2016), held in San Jose, California, April 5 to 7. Each year NVidia announce new products and technologies at GTC. This year the focus was on self-driving cars, virtual reality and deep learning. Coinciding with the GTC event, Intelligent Voice released their patent-pending SmartTranscript™. The SmartTranscript™ uses the new HTML5 standard and is essentially a wrapper for audio and video files.

As well as having the standard play, pause and drag-bar navigation tools, the SmartTranscript™, powered by Intelligent Voice’s JumpTo™ technology, contains a searchable, automatically generated full transcription of the speech contents. The SmartTranscript™ also contains an automatically generated list of suggested topics of interest which can be used to navigate the file. This list is also useful for quickly getting a sense of what the file as a whole contains. The SmartTranscript™ is a stand-alone file, and as such it can be emailed, indexed and stored easily on your file system.

For each of the last eight years GTC has featured an Emerging Companies Summit (ECS). ECS is a great way for companies to put their technology in the spotlight in order to find potential partners or attract investment. The event has a strong track-record of helping promising companies win world-wide recognition. Among the previous competitors are Oculus Rift (acquired by Facebook for $2 billion), Gaikai (acquired by Sony for $380 million) and Natural Motion (acquired by Zynga for $527 million). The top prize this year was $100,000. From an initial field of over 90 companies this year, Intelligent Voice were shortlisted to the top 12 and invited to pitch at the event. Intelligent Voice would like to congratulate the winner, Sadako from Barcelona, who are developing a robotic solution for plastic bottle recycling. Intelligent Voice were then given a separate award for innovation for their new SmartTranscript™, winning almost $100,000 in prizes.

IV are pioneers in the use of GPUs not only for training, but also for decoding, allowing for ultra-high-speed and high-volume speech to text, something only made possible by a UK Government SMART grant: Hats off to InnovateUK!

Intelligent Voice also unveiled some exciting new speech research. Traditionally, speech recognition is a complicated procedure combining different algorithms for feature extraction, dimensionality reduction, sequence modelling and optimisation. Intelligent Voice have managed to simplify the process with deep learning. At an invited talk at GTC, CTO Nigel Cannings outlined a ‘crazy idea he had on a Saturday morning’: to get a deep Convolutional Neural Network (CNN) to perform speech recognition. Implemented on the NVidia DIGITS platform, Intelligent Voice used the image-classification capabilities of the CNN to classify spectrogram images into classes of phones. Using the TIMIT benchmark speech corpus, 1.4 million spectrogram images across the 61 phone classes were used to train the network, achieving state-of-the-art performance.

The animation shows, at the top, a randomly selected utterance from the TIMIT speech corpus and its ground-truth transcription. Below that are the phones of the utterance, as transcribed by phoneticians. With the trained network, inference is performed by sliding the spectrogram image through time, with the resulting classification shown at the bottom of the animation.
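The sliding-window inference can be sketched as follows. The classifier here is a stub standing in for the trained CNN, and the signal and window parameters are illustrative:

```python
import numpy as np

sample_rate, n_fft, hop = 16000, 256, 128
# 1 second of a 440 Hz tone stands in for real speech.
signal = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)

# Log-magnitude spectrogram: windowed frames -> FFT -> log scale.
frames = [signal[i:i + n_fft] for i in range(0, len(signal) - n_fft + 1, hop)]
spectrogram = np.log1p(np.abs(np.fft.rfft(np.array(frames) * np.hanning(n_fft), axis=1)))

window_width = 9   # spectrogram columns per "image" fed to the classifier

def classify(window):
    """Stub standing in for the trained CNN; would return one of 61 phone labels."""
    return "AH"   # placeholder

# Slide the image window through time, classifying each position.
labels = [classify(spectrogram[t:t + window_width])
          for t in range(spectrogram.shape[0] - window_width + 1)]
print(len(labels), "windows classified")
```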

I thought I’d give this post a fairly descriptive title, just to get the point across.

There is some significant FUD that has entered my marketplace recently. A quick bit of background:

In certain jurisdictions, call recordings must be maintained for a period of time (6 months in the UK) if they are made under specific conditions (eg FCA COBS 11.8). They are the incorruptible record of a transaction, and so are important.

In other cases, as a matter of good practice, companies keep records of phone calls: If they have personal information in them, then a whole raft of legislation comes raining down about for how long and under what circumstances they can be held.

Someone, somewhere, and I know exactly who, has been suggesting that somehow, if you make an automated transcript of a call, then it magically becomes a “record”, and has to be kept for 6 or 7 years (depends on which version of the urban myth you believe). As you can imagine, this is what someone who had a vested interest in stopping you buying a transcription system would say, as they are having you believe that you have added a huge retention burden by transcribing. That someone would either have slow speech-to-text capability, or be selling a phonetic indexing solution, or possibly both. They would probably be worried about a company that sold a very, very fast and accurate speech-to-text system (with phonetics to boot…)

Now, I’m a lawyer by training, and so this just felt wrong to me. So I did some research, and found no basis for it at all, something I started to mention at meetings. And I was asked whether I had taken my own legal advice on the matter (I tried not to bristle at having my legal acumen and integrity impugned).

So I did: I asked one of the foremost Financial Services lawyers in one of the foremost Financial Services legal firms in the world. I asked this person to give it to me straight.

And the answer was that there is no reason at all why a transcript would transform into a protected species, whether made in a regulated environment or not. Lots of analogies were drawn, and lots of words were written, but the upshot is that you can make a transcript, you can change a transcript and you can delete a transcript: You just don’t have to retain a transcript, unless you delete the original audio file, in which case the usual (say 6 month) period applies, worst case.

Quite a lot, we think. For years now, we have used the Intelligent Voice branding, but have kept the name of the company as Chase Information Technology Services Limited.

With effect from this afternoon, we’ve pushed the button, and changed the name to Intelligent Voice Limited. We are very proud of the strides we have made as a business, and think we should put our money, and our name, where our mouth is.

One of the ways we price our software is by the hour – You send me an hour of audio files, I’ll put them into a “JumpTo” format, and then charge you for an hour. Job done.

We are selling a lot into the eDiscovery market, and that is dominated by one pricing metric: the per Gb charge. I send you a Gb of data, you charge me for a Gb of data.

So, every day, I get asked a variation on this question “I have xxGb of voice data, how much?” or if I am very lucky “I have xxGb of voice data, and yy files, how much?”

Audio files come in all shapes and sizes. You can have the same length audio file, and yet depending on the compression rate, it can vary wildly in size. A 5 minute audio file? Anywhere between 400Mb (honest, see here) down to 0.5Mb for a heavily compressed GSM format file.
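A quick back-of-the-envelope calculation, under two assumed formats, shows how far apart those two sizes really are for the very same five minutes of audio:

```python
# Sizes for the same 5 minutes of audio under two assumed formats.
seconds = 5 * 60

# High-end uncompressed PCM: 192 kHz sample rate, 32-bit samples (4 bytes), stereo.
high_end = 192_000 * 4 * 2 * seconds        # bytes

# GSM 06.10 full-rate speech codec: ~13 kbit/s, mono.
gsm = 13_000 // 8 * seconds                  # bytes

print(f"uncompressed: {high_end / 1e6:.0f} MB")   # ~461 MB
print(f"GSM:          {gsm / 1e6:.2f} MB")        # ~0.49 MB
```

Nearly a thousandfold difference in size, for identical duration: which is exactly why a per-Gb figure tells you almost nothing about how many hours you actually have.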

I often seem to cause offence when I ask people how long their audio job is. So I try to explain it: It is like saying to me “I have 75 blue snakes: how long are they all?” – The fact is, I know what a snake is, and I know that there are big ones and little ones. But telling me they are blue doesn’t really help me in terms of working out the average length. If you told me they were all Boa Constrictors that might help, but even then, there are baby ones, and adult ones etc etc.

Having gone round this loop, I am then asked “Why don’t you charge by the Gb, like other data sources?” – A good question, and an easy one to answer.

The higher the quality of the input file, the better the results. A stereo 16kHz uncompressed wav file is much better than an 8kHz mono wav file encoded using GSM. You will get better quality text, period. But if I told you that on a per-Gb basis one would be 10x more expensive than the other, you might just be tempted to try to scalp a little here and there, sacrificing quality for a smaller file size.

The fact is, we have to stick to our guns. This whole speech-to-text business is very new in the litigation space, and so it will take a while before people understand the charging mechanism, and why we think it is so important. Until then, I may have to repeat my blue snake story a few more times still…

I am a terrible note taker – My handwriting is appalling, and while I can type pretty fast, I find it very difficult to look people in the eye at the same time – And when I am on the phone, trying to tuck the headset under my chin and type is near impossible (and the swearing that accompanies my dropping the phone has been deemed unprofessional by my co-workers…)

I live a lot of my life in e-mail. I get hundreds of the things every day, personal and professional, many of them urgent, and some of them important even.

But a lot of my time is spent talking. I attend meetings and conference calls. I talk on the phone in my office, or while I am running to another meeting on my mobile. And you know what? All of that is lost in the ether. Someone somewhere might have some notes of what was said, but the rest? Pouff. Gone.

Back when I was a lawyer working in private practice (this is a long time ago now), I used to have to write an “Attendance Note” of every phone call I made or received, an encapsulation of advice given and information received. We didn’t have computers on desks back then, and so it was dictated and typed up by a secretary. Imagine it. You have just spent 20 minutes speaking on the phone, and then you spend another 20 minutes talking all over again into a machine trying to remember what was said – Or in many cases, what you wish was said. Never compare the Attendance Notes of two opposing lawyers of the same call if you actually want to know what really took place on the phone.

The practice of Attendance Notes persists (rightly) to this day, but with the increasing bombardment of information from other sources (email in particular), it is often easy to overlook, sometimes with devastating consequences (See my article on Withers here)

And here is the point: the spoken word is important, and it is as much a record of our working and professional lives as email has become. I used to gaily flit between email providers, often linked to whichever dial-up or broadband provider I was using at the time. Now, I rely on Gmail as the archive of my personal life, a searchable chronicle of pretty much everything. It would be very hard to live without that. I would not have believed it if someone had told me that back when I was somenumber@compuserve.com

But we allow our voices to disappear. Or if we do capture them, we do it in such a way that it is difficult to access it later.

I now record all my telephone calls, something that I am allowed to do under the laws of where I live. And I have used those recordings to win arguments with people who would swear black was white. And I have replayed them, and thought “Why did I say that?” – A painful learning experience. I can search them, listen to them, and view them just like my email, but then, that is what I do for a living.

A lot of people are shocked when I tell them that I record like this. And often, the first thing they ask me about is the legality, and that always interests me. If you send me an email, do you wonder about what I do with it next? Do you ask me about whether it is legal for me to store, search and review it? If you send me a letter, in the post, and I open it, do you question my right to scan it and store it (I do, by the way, so be warned…). Then why shouldn’t I record what I say? It is (as I said above), less rude than typing while trying to speak!

Outside of the personal and business side, we do, in some cases, record what is important, although often in a sea of dross. Parliaments, Congresses, Public Committees and Inquiries, News, Radio, YouTube, WhatsApp – It’s all out there, but it is difficult to get to. We do not make this type of search and subsequent review easy: If I Google “FIFA Election” (This morning’s hot topic) – I get a lot of content, all of it text based. But there will be hundreds of videos and radio programmes where it is mentioned. Why can’t I just dive straight in to them? Where is the button that takes me to that point in the latest BBC News or Radio 4 programme where it is mentioned? (I have a badge that someone once made for me to wear on my trips to the West Coast that says “Google. Buy Me”. This is a hint, chaps)

The fact is that what we say, and what others say, is an important part of our history, and the ongoing history of the world. But here in 2015, we’re stuck in the Text Age. We’ve moved on from the Paper Age, but we haven’t really moved into the Information Age, no matter how much we kid ourselves. I think in 5 years, we will record everything as a matter of course. Our wearable devices will deal with day-to-day meetings and other interactions (Baristas beware), and the telcos will finally latch on to the fact that they can charge you to record your calls and then charge you to store them – In a declining telecoms market, they need every dime.

5 years ago, I said audio would be a problem, and trust me, there are a lot of banks and law firms who are finding that I was right. I’m hoping that if I’m right about this next 5 year prediction, it is because we’ve turned the problem into a solution.