Reading media data in raw form.

I'm working on a complex AI system that pulls information from the internet (and possibly from other live sources in the future) and learns from it in the same way that a human does. I'm not sure whether this is a C# problem or more of a media format problem (I guess that depends on the solution), but I want to be able to store that information in the following ways:

Text: quite simple, stored as-is.

Images should be stored in raw form as pixel data (I've already found out how to do this; relatively simple using the Bitmap class).

Sound should be stored as waveforms (i.e. simple arrays of doubles or something similar); I'm guessing one waveform per channel.
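To make the target format concrete, here is a minimal sketch of what I mean by one waveform per channel. It assumes the decoder (whatever that turns out to be) hands back interleaved 16-bit little-endian PCM, which is only an assumption on my part; `PcmSketch` and `Deinterleave16Bit` are hypothetical names of my own, not part of any library.

```csharp
using System;

// Hypothetical helper: the class and method names are made up for illustration.
public static class PcmSketch
{
    // Splits interleaved 16-bit little-endian PCM into one double[] per channel,
    // scaling each sample into the range [-1.0, 1.0).
    // Assumption: the input is already decoded PCM, not a compressed format.
    public static double[][] Deinterleave16Bit(byte[] pcm, int channels)
    {
        if (pcm == null) throw new ArgumentNullException(nameof(pcm));
        if (channels <= 0) throw new ArgumentOutOfRangeException(nameof(channels));

        const int bytesPerSample = 2;
        // Any trailing bytes that do not form a complete frame are ignored.
        int frameCount = pcm.Length / (bytesPerSample * channels);

        var waveforms = new double[channels][];
        for (int c = 0; c < channels; c++)
        {
            waveforms[c] = new double[frameCount];
        }

        for (int frame = 0; frame < frameCount; frame++)
        {
            for (int c = 0; c < channels; c++)
            {
                int offset = (frame * channels + c) * bytesPerSample;
                // Low byte first, then high byte; the cast restores the sign.
                short sample = unchecked((short)(pcm[offset] | (pcm[offset + 1] << 8)));
                waveforms[c][frame] = sample / 32768.0;
            }
        }

        return waveforms;
    }
}
```

With stereo input, `Deinterleave16Bit(bytes, 2)` would give `result[0]` as the left channel and `result[1]` as the right. What I'm missing is the step before this: getting from an arbitrary audio file to PCM in the first place.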

Video should be stored in a similar form to images, but as an array of images representing the frames (with extra information such as the frame rate, etc.). The data would also contain the audio as described above.
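Roughly, the in-memory shape I'm picturing for video looks like the sketch below. Every type and member name here (`RawFrame`, `RawAudio`, `RawVideo`) is hypothetical and invented purely for illustration, and the packed 32-bit ARGB pixel layout is an assumption rather than a requirement.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container types; none of these come from an existing library.
public sealed class RawFrame
{
    public int Width { get; }
    public int Height { get; }

    // Assumed layout: packed 32-bit ARGB, row-major, index = y * Width + x.
    public int[] Pixels { get; }

    public RawFrame(int width, int height)
    {
        if (width <= 0) throw new ArgumentOutOfRangeException(nameof(width));
        if (height <= 0) throw new ArgumentOutOfRangeException(nameof(height));
        Width = width;
        Height = height;
        Pixels = new int[width * height];
    }

    public int GetPixel(int x, int y)
    {
        return Pixels[y * Width + x];
    }
}

public sealed class RawAudio
{
    public int SampleRate { get; }

    // One waveform per channel, as described for sound above.
    public double[][] Channels { get; }

    public RawAudio(int sampleRate, double[][] channels)
    {
        if (sampleRate <= 0) throw new ArgumentOutOfRangeException(nameof(sampleRate));
        SampleRate = sampleRate;
        Channels = channels ?? throw new ArgumentNullException(nameof(channels));
    }

    public double DurationSeconds
    {
        get { return Channels.Length == 0 ? 0.0 : Channels[0].Length / (double)SampleRate; }
    }
}

public sealed class RawVideo
{
    public double FrameRate { get; }
    public List<RawFrame> Frames { get; } = new List<RawFrame>();

    // The audio track; left as null here for a video without sound.
    public RawAudio Audio { get; }

    public RawVideo(double frameRate, RawAudio audio)
    {
        if (frameRate <= 0) throw new ArgumentOutOfRangeException(nameof(frameRate));
        FrameRate = frameRate;
        Audio = audio;
    }

    public double DurationSeconds
    {
        get { return Frames.Count / FrameRate; }
    }
}
```

Holding every decoded frame in a list like this would obviously get large for long videos, so in practice I'd probably process frames as a stream, but this is the information I want to end up with.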

I'm aware of the sheer number of image, audio and video formats out there, and I wondered if there's any class or API that can decode the information from the files into the formats described above (or into something that I can derive those formats from)?

How do I extract waveform data from any audio file (I'm looking for something that will handle most if not all audio formats)?

How do I extract frame-by-frame pixel data from any video file (again, I'm looking for something that handles different formats)?

The idea is that I'm trying to get at the raw data so that I can process it in various ways (e.g. run a speech recognition algorithm on audio data, or a machine vision algorithm on video data). Text data is the most straightforward (until I start dealing with files other than .txt), as I can use the data as-is. For images, I'm using the Bitmap class to get at the pixel data (from there I can apply convolution filters, etc.). With audio data, I'll want to be able to recognise sounds and speech. With video data, I'll want to look at the individual image for each frame to do things like track moving objects.

Ideally I want to use something that already exists. For example, for decoding video files perhaps I can use the existing codecs to get the data into a common format, and do the same for audio files.