Using AI to Automate Dialogue Animation of 3D Mesh Character Models

I believe I've developed a process to use Artificial Intelligence (AI) to automate the dialogue animation of 3D mesh character models. Let me start with the vision: I want to...

1. Record an audio track of character dialogue.
2. Analyse the audio track using a speech-to-text AI.
3. Receive the speech-to-text results, including time offset information for words and phonemes.
4. Import those results into a Blender animation timeline.
5. Have Blender match each phoneme to its position on the timeline.
6. Select a mouth shape from the character's pose library based on the phoneme and its timing.

While this sounds like a dream (because it would be), I actually think the pieces for this are already out there. With the Google Speech API, I can post my audio file and get reliable speech-to-text conversion with word confidence scores. If, in our Python script, we set:

enable_word_time_offsets=True

we get the text results with time offsets for each word. I'm going to check with Google, but I bet there is a debug flag available to get time offsets for each individual phoneme as well. Why couldn't we use that data to re-associate the words with our timeline in Blender?
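For the curious, here is a minimal sketch of that request using the google-cloud-speech Python client; the file name, encoding, and sample rate are placeholders for whatever your recording actually uses:

from google.cloud import speech

client = speech.SpeechClient()

# "dialogue.wav" is a placeholder; a 16 kHz LINEAR16 mono recording is assumed.
with open("dialogue.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,  # ask for per-word start/end times
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    for info in result.alternatives[0].words:
        # In recent client versions the offsets are datetime.timedelta values.
        print(info.word, info.start_time.total_seconds(), info.end_time.total_seconds())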

Working from the other end of the pipeline, there is Papagayo, a tool that turns text into mouth shapes, and a Blender addon called Automatic Lipsync that brings the Papagayo data into Blender.
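The glue between the two ends could be a small script that writes the speech API's time offsets into the Moho-style .dat file that Papagayo exports and Automatic Lipsync reads: a header line followed by frame/phoneme pairs. A rough sketch follows, with a deliberately crude word-to-phoneme guess standing in for a real pronunciation dictionary, and the frame rate and file names assumed:

# Hypothetical word timings parsed from the speech API: (word, start_seconds).
words = [("hello", 0.12), ("world", 0.58)]

FPS = 24  # must match the Blender scene's frame rate

def first_phoneme(word):
    # Stand-in mapping to Papagayo's Preston Blair mouth set; a real tool
    # would consult a pronunciation dictionary for every phoneme in the word.
    vowels = {"a": "AI", "e": "E", "i": "AI", "o": "O", "u": "U"}
    return vowels.get(word[0].lower(), "etc")

with open("dialogue.dat", "w") as f:
    f.write("MohoSwitch1\n")
    for word, start in words:
        f.write(f"{round(start * FPS)} {first_phoneme(word)}\n")
    # Close the mouth half a second after the last word begins.
    f.write(f"{round((words[-1][1] + 0.5) * FPS)} rest\n")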

Mission: Don't we now have the technology to put ALL of these together into an addon, or, better yet, into core?
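As a taste of what the Blender end of such an addon might do, here is a sketch that keyframes mouth shape keys straight from phoneme timings, assuming an object named "Character" whose mesh has one shape key per Papagayo mouth shape (the names and timings below are made up):

import bpy

# Hypothetical (phoneme, start_seconds) pairs from the earlier steps.
PHONEMES = [("AI", 0.10), ("O", 0.35), ("etc", 0.60), ("MBP", 0.85)]

obj = bpy.data.objects["Character"]  # assumed object name
key_blocks = obj.data.shape_keys.key_blocks
fps = bpy.context.scene.render.fps

for phoneme, start in PHONEMES:
    frame = round(start * fps)
    for kb in key_blocks:
        if kb.name == "Basis":
            continue
        # Drive the matching mouth shape to 1.0, zero out the rest,
        # and keyframe every value at this phoneme's start frame.
        kb.value = 1.0 if kb.name == phoneme else 0.0
        kb.keyframe_insert(data_path="value", frame=frame)

I've used shape keys here instead of a pose library purely to keep the sketch short; a pose-library version would apply mouth poses at the same frames instead of keying shape key values.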

I'm fairly new to Blender, so this is going to be above my skill level, but folks, ambitious as it is, I see no reason why this isn't possible. Where do I begin to make this happen, or what would be the most appropriate forum to further the discussion?