Stitch-ster would be a spider that crawls the web trying to match your
stuff with other stuff that is out there. For instance, if a known
time on someone else's video coincides with a known time on your
video, Stitch-ster would stitch those two together. And it would
have an algorithmic aspect that searches for really long-scale
gestures that coordinate between files -- like if two videos taken
from different angles contain the same sound of a car, the algorithm
could look for similarities between the video and audio tracks and
create a stitched reality out of that.
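The "same sound from two angles" case could be approached with something like audio cross-correlation: slide one clip's waveform along the other's and keep the offset where they agree most. A minimal pure-Python sketch (toy integer signals standing in for real audio samples; `best_offset` is a hypothetical name, not part of any real library):

```python
def best_offset(a, b):
    """Return the offset of b within a that maximizes the
    dot-product similarity (a crude cross-correlation peak)."""
    best, best_score = 0, float("-inf")
    for off in range(len(a) - len(b) + 1):
        score = sum(x * y for x, y in zip(a[off:], b))
        if score > best_score:
            best, best_score = off, score
    return best

# toy "waveforms": b is the car sound buried in a at sample 3
a = [0, 0, 0, 5, 1, 4, 0, 0]
b = [5, 1, 4]
print(best_offset(a, b))  # -> 3
```

A real system would work on spectral fingerprints rather than raw samples, but the alignment idea is the same.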

[+] The latter part of this, at least, would be
extremely difficult, if not impossible, to achieve with
today's technology. But someday, probably before too
long, this idea will come to fruition. I recall seeing
something similar: 3D models of famous landmarks
rebuilt automatically from tourists' photographs
found on the Web.

It could start with events where there are lots of known
recordings from one time/place: take an event that is
unique and characteristic, for which there is one very
clear recording, and then work out an algorithm by
associating videos of progressively less similarity.
Crowdsourcing (something like Amazon's Mechanical Turk) could
handle the associations that can't be made automatically, and
the algorithm could learn from the associations that the people
make. The cutting edge of the algorithm development
would be inferring things about intelligent agents based on
the differences in the ways they interact with different
recordings.
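The auto-link / defer-to-humans / learn-from-answers loop could be sketched roughly like this (all names here are hypothetical, and the "learning" is deliberately naive: a confirmed human answer just relaxes the auto-link threshold):

```python
def associate(pairs, threshold=0.8, band=0.2, ask_human=None):
    """Link confident pairs automatically; defer ambiguous ones to
    a human labeler (stand-in for a Mechanical Turk task).

    pairs: list of (id_a, id_b, similarity in [0, 1]).
    ask_human: callable (id_a, id_b) -> bool for ambiguous pairs.
    Returns (linked pairs, updated threshold).
    """
    linked, queued = [], []
    for a, b, sim in pairs:
        if sim >= threshold:
            linked.append((a, b))          # confident: auto-link
        elif sim >= threshold - band:
            queued.append((a, b, sim))     # ambiguous: ask a person
    for a, b, sim in queued:
        if ask_human and ask_human(a, b):
            linked.append((a, b))
            # human confirmed a lower-similarity match, so future
            # matches at that similarity can auto-link
            threshold = min(threshold, sim)
    return linked, threshold

pairs = [("v1", "v2", 0.9), ("v1", "v3", 0.7), ("v1", "v4", 0.3)]
linked, new_threshold = associate(pairs, ask_human=lambda a, b: True)
print(linked, new_threshold)  # -> [('v1', 'v2'), ('v1', 'v3')] 0.7
```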

In order to compare features of video or audio files,
wouldn't this bot have to 'play' each file in order to analyze
it? Or does it just wait until somebody opens an MPEG and
run on their CPU while it analyzes the content of
whatever they're viewing?

Maybe it could just start with the most recognizable
"event" common to a bunch of videos -- the first "thing" or
pattern that meets some criterion of interestingness that an
AI could recognize in a set of audio or video files, or in
correlated audio and video data.
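One crude stand-in for "interestingness" is simply how many independent recordings share a pattern. A sketch, treating each recording as a sequence of coarse fingerprint codes and counting which length-n subsequence appears in the most recordings (names and data are illustrative, not any real API):

```python
from collections import Counter

def most_shared_pattern(recordings, n=3):
    """Return the length-n subsequence shared by the most
    recordings, plus how many recordings contain it."""
    seen = Counter()
    for rec in recordings:
        # count each pattern once per recording, not per occurrence
        grams = {tuple(rec[i:i + n]) for i in range(len(rec) - n + 1)}
        for g in grams:
            seen[g] += 1
    return seen.most_common(1)[0]

# three "recordings" of the same event, with different lead-ins
recordings = [[1, 2, 3, 4], [9, 2, 3, 4], [2, 3, 4, 7]]
print(most_shared_pattern(recordings))  # -> ((2, 3, 4), 3)
```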

I think you could do text by starting with speech and a set
of files that humans have agreed are "real" (not staged),
and then speech-recognizing and indexing those
files for statistical correlations -- like the % of the time the
word "oops" is followed by a crashing sound, and then statistics
for how often the crashing sound has this or that general
characteristic...
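Once the clips are transcribed and the sounds labeled, the "% of the time word X is followed by sound Y" statistic is just a conditional frequency count. A toy sketch, assuming the ASR/audio-event pipeline has already produced (word, following-sound) pairs (the data and function names are made up for illustration):

```python
from collections import defaultdict

def follow_stats(events):
    """events: list of (word, following_sound_label) pairs.
    Returns p(word, sound) = fraction of that word's occurrences
    followed by that sound."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, sound in events:
        counts[word][sound] += 1

    def p(word, sound):
        total = sum(counts[word].values())
        return counts[word][sound] / total if total else 0.0

    return p

events = [("oops", "crash"), ("oops", "crash"),
          ("oops", "silence"), ("hello", "speech")]
p = follow_stats(events)
print(p("oops", "crash"))  # -> 0.6666...
```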