Microsoft actually thought it hit this point last year, when it reached 5.9%, the word error rate it had measured for humans. But other researchers then carried out separate studies and pegged the human error level slightly lower, at 5.1%.
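For context, word error rate is typically computed as the word-level edit distance (insertions, deletions, and substitutions) between a reference transcript and the system's output, divided by the number of reference words. A minimal sketch of the metric (not Microsoft's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("that's not fair", "that's not fur"))  # one substitution in three words: 0.333...
```

A 5.1% rate means roughly one word in twenty is transcribed wrongly, so small absolute differences between systems and humans matter.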

But it has now achieved this, reducing its error rate by 12% using AI techniques such as "neural-net based acoustic and language models." Another innovation was to take the context of the speech into account to make better guesses about unclear words, as humans do.

For example: It might not be clear from the audio whether someone is saying "that's not fair" or "that's not fur." Traditionally, this ambiguity might lead to transcription errors. But now the speech recognition tech can look at context for clues. If it's a speech about the risks of gambling, then it's probably "that's not fair"; if it's a conversation about fabrics, "that's not fur" probably fits better.
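A toy illustration of the idea (not Microsoft's actual system, which uses neural language models rather than keyword matching): score each candidate transcription by how well its words fit the words of the surrounding conversation, and pick the best-fitting one. The hint lists below are invented for the example.

```python
# Hypothetical context hints: words that tend to co-occur with each
# ambiguous candidate word. A real system would use a learned language model.
CONTEXT_HINTS = {
    "fair": {"gambling", "odds", "luck", "casino", "bet"},
    "fur": {"fabric", "coat", "wool", "textile", "fibers"},
}

def pick_transcription(candidates, context_words):
    """Return the candidate whose words best match the conversation context."""
    def score(candidate):
        hints = set()
        for word in candidate.split():
            hints |= CONTEXT_HINTS.get(word, set())
        return len(hints & context_words)
    return max(candidates, key=score)

candidates = ["that's not fair", "that's not fur"]
print(pick_transcription(candidates, {"casino", "odds", "bet"}))  # that's not fair
print(pick_transcription(candidates, {"fabric", "coat"}))         # that's not fur
```

The same acoustic ambiguity resolves differently depending on the surrounding topic, which is the behavior the article describes.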

"Reaching human parity with an accuracy on par with humans has been a research goal for the last 25 years," Xuedong Huang wrote. But in practice, Microsoft still faces significant challenges. "such as achieving human levels of recognition in noisy environments with distant microphones, in recognizing accented speech, or speaking styles and languages for which only limited training data is available."

So while Microsoft's tech is impressive, it won't be on par with humans in all real-world situations just yet.

The researcher added: "Moreover, we have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent. Moving from recognizing to understanding speech is the next major frontier for speech technology."