The ideal image recognition and video compression will similarly
objectify the speaker and other people, and the stationary objects in the room,
and reconstruct the image at the receiver, having sent only
(minimally) continual motion and change information,
dominated by movements and convergences, in combination called
movergence.

A rudimentary object cognizer (as we discuss here) detects
image-local translatory motions, directions and rates, compares them
for local convergences, and sends regular update-reports (signals);
the receiver receives and retains the same information
for reconstitution and re-apparition as objects turn away
and back again into view.
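
As a rough illustration of the detection step just described, the sketch below estimates one block's translation between two frames by exhaustive block matching, then tests whether two tracked blocks are converging. Block matching is a stand-in technique chosen for brevity, not necessarily the method intended here; the function names, the block size and the search reach are all assumptions made for the example.

```python
# Illustrative sketch only: frames are lists of rows of grey levels (ints),
# and every name here is hypothetical rather than part of the design.

def block_cost(prev, curr, y, x, dy, dx, size):
    """Sum of absolute differences between the block at (y, x) in curr
    and the block at (y - dy, x - dx) in prev."""
    return sum(
        abs(curr[y + r][x + c] - prev[y + r - dy][x + c - dx])
        for r in range(size)
        for c in range(size)
    )

def local_motion(prev, curr, y, x, size=4, reach=2):
    """Estimate the (dy, dx) translation of the block now at (y, x),
    searching shifts of up to `reach` pixels in each direction."""
    height, width = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = block_cost(prev, curr, y, x, 0, 0, size)
    for dy in range(-reach, reach + 1):
        for dx in range(-reach, reach + 1):
            sy, sx = y - dy, x - dx  # where the block would have come from
            if sy < 0 or sx < 0 or sy + size > height or sx + size > width:
                continue  # source block would fall outside the previous frame
            cost = block_cost(prev, curr, y, x, dy, dx, size)
            if cost < best_cost:  # ties keep the earlier (or zero) shift
                best, best_cost = (dy, dx), cost
    return best

def closing_rate(pos_a, vel_a, pos_b, vel_b):
    """Positive when two tracked blocks are moving toward each other
    (a local convergence), negative when they are separating."""
    ry, rx = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    vy, vx = vel_b[0] - vel_a[0], vel_b[1] - vel_a[1]
    return -(ry * vy + rx * vx)

def frame_with_patch(height, width, top, left, size=4):
    """Demo helper: a blank frame carrying one textured square patch."""
    frame = [[0] * width for _ in range(height)]
    for r in range(size):
        for c in range(size):
            frame[top + r][left + c] = 10 + r * size + c
    return frame

prev = frame_with_patch(8, 12, top=2, left=2)
curr = frame_with_patch(8, 12, top=2, left=3)   # patch moved one pixel right
motion = local_motion(prev, curr, y=2, x=3)     # (0, 1)
```

Only the motion vectors and closing rates would need to go into the update-reports; the receiver can apply them to what it already holds.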

A typical medium-resolution [500x500=250K] digital color camera for
the tele-phane/vid-link application presenting 4 individuals
plus diagrams in conference on a standard high-resolution display -
about 2K x 1K (=2M) pixels / 4+4 (=8) sub-frames = 500x500 (=250K) pixels each -
captures moving imagery rapidly [typically 30-60 frames/sec] for picture
clarity and motion estimation; from this a full image must be built up
to good resolution in a fraction of a second, and then maintained and sharpened.
The sender and the receiver must also identify overlapping portions
(typically one partially transparent) moving at distinct rates.
Individual pixels are in good color, typically 8-bit [4+2+2].
This is a data bandwidth easily exceeding the 56 kbit/sec typical of modem-telephony,
but pre-loading images is facilitated by the image-caching model in
the tele-phane/vid-link design, which speeds up the initial view;
subsequent image resolution is maintained by the cache on the receiver side.
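
To put a number on "easily exceeding", here is a back-of-envelope calculation using the figures above. It assumes the modem rate is 56 kilobits per second and takes the low end of the frame-rate range; the constant and function names are illustrative only.

```python
# Back-of-envelope check; the constants restate the figures in the text,
# except MODEM_BITS_PER_SEC, which assumes the modem rate is in kilobits.
PIXELS_PER_SUBFRAME = 500 * 500      # 250K pixels
BITS_PER_PIXEL = 8                   # 4+2+2
SUBFRAMES = 4 + 4                    # 8 sub-frames on the display
FRAMES_PER_SEC = 30                  # low end of the 30-60 range
MODEM_BITS_PER_SEC = 56_000          # assumed: 56 kbit/sec

def raw_bits_per_sec(frames_per_sec=FRAMES_PER_SEC):
    """Uncompressed bit rate for all sub-frames sent pixel-by-pixel."""
    return PIXELS_PER_SUBFRAME * BITS_PER_PIXEL * SUBFRAMES * frames_per_sec

def shortfall(frames_per_sec=FRAMES_PER_SEC):
    """How many times the raw rate exceeds the modem rate."""
    return raw_bits_per_sec(frames_per_sec) / MODEM_BITS_PER_SEC

print(raw_bits_per_sec())        # 480000000 bits/sec
print(round(shortfall()))        # 8571
```

Under these assumptions the raw stream is several thousand times what the line can carry, which is why caching and motion-only updates matter.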

THEORY [compressing image information]

Image pixels are sent sparsely (similar to video-scan interleaving),
with subsequent transmissions including interstitial pixels, until
the whole image is sent.
This would take roughly 40 seconds at 56 kbit/sec for one medium-resolution color
image (250K pixels x 8 bits = 2M bits) sent directly pixel-by-pixel, and telephany needs 30 images per second, and 8 pictures: a speed-up factor of about 10K.
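
One way to picture the sparse-then-interstitial ordering is a coarse-to-fine grid, where each pass sends only the pixels the coarser passes skipped. The sketch below is an assumption about how such a schedule might look (power-of-two strides, hypothetical function name), not a description of an existing scheme.

```python
def sparse_passes(height, width, coarsest=4):
    """Yield (stride, coords) passes from coarse to fine.

    Each pass lists the grid pixels at its stride that no coarser pass
    has already sent, so every pixel is sent exactly once overall.
    `coarsest` must be a power of two (an assumption of this sketch).
    """
    if coarsest < 1 or coarsest & (coarsest - 1):
        raise ValueError("coarsest must be a power of two")
    stride, coarser = coarsest, None
    while stride >= 1:
        coords = [
            (y, x)
            for y in range(0, height, stride)
            for x in range(0, width, stride)
            # keep the pixel unless the coarser grid already covered it
            if coarser is None or y % coarser or x % coarser
        ]
        yield stride, coords
        coarser, stride = stride, stride // 2

passes = list(sparse_passes(8, 8))
counts = [len(coords) for _, coords in passes]   # [4, 12, 48]
```

The receiver can display a rough image after the first pass and fill in the interstitial pixels as later passes arrive.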

Instead: by sending each pixel with its spatio-temporal (positional)
offset relative to its placement in the preceding frame
(which requires a high-powered image-processing engine),
the pixels can be retained for subsequent reuse, even when hidden
momentarily, until they expire by refreshment or replacement.
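
A minimal sketch of the receiver side of this idea follows, assuming each retained pixel carries an identifier so that later updates can move, hide or replace it. The per-pixel identifier, the class names and the method names are all hypothetical; the text does not specify a message format.

```python
from dataclasses import dataclass

@dataclass
class CachedPixel:
    y: int
    x: int
    value: int             # packed 8-bit color
    hidden: bool = False   # retained but not drawn

class PixelCache:
    """Receiver-side store keyed by a hypothetical per-pixel id."""

    def __init__(self):
        self.pixels = {}

    def store(self, pixel_id, y, x, value):
        """Add a pixel, or replace (refresh) the one already held."""
        self.pixels[pixel_id] = CachedPixel(y, x, value)

    def move(self, pixel_id, dy, dx):
        """Apply an offset relative to the pixel's previous placement."""
        px = self.pixels[pixel_id]
        px.y += dy
        px.x += dx

    def set_hidden(self, pixel_id, hidden):
        """Mark a pixel as occluded or visible without discarding it."""
        self.pixels[pixel_id].hidden = hidden

    def render(self, height, width, background=0):
        """Rebuild a frame from the visible pixels that fall inside it."""
        frame = [[background] * width for _ in range(height)]
        for px in self.pixels.values():
            if not px.hidden and 0 <= px.y < height and 0 <= px.x < width:
                frame[px.y][px.x] = px.value
        return frame

cache = PixelCache()
cache.store("a", 1, 1, 200)     # full pixel sent once
cache.move("a", 0, 2)           # later frames send only the offset
cache.set_hidden("a", True)     # occluded, but kept
cache.set_hidden("a", False)    # reappears with no color data resent
frame = cache.render(3, 5)      # pixel "a" is drawn at row 1, column 3
```

The point of the sketch is that a reappearing pixel costs only a small update, because its color never left the receiver.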

Thus a typical connection can send a simple full-color, good-resolution
image in the first 1 sec of connection, and then move and update it
continually thereafter, lagging about 1 sec behind live-direct.

VISIBLE RESOLUTION

Stationary objects require the highest resolution.
Very slow linear motion still requires high resolution,
but faster motion requires less, as the object moves against an interesting background, and quickly out of range.
Accelerated objects need much less resolution,
as these are not readily anticipated, nor followed,
though at slow speeds simple linear motion tracks them sufficiently.
Turning objects require fairly high resolution,
as the turn is within the focal span, and expected with visual practice:
Turning motion is non-linear, non-simple acceleration -
it is labelled vergence, as edges disappear and reappear.

The simplest image reduction for telephany would be to capture a
complete image of the speaker's face, estimate the shape of the head,
and affix the face-image to the head, letting a cartoon-script [JAVA]
match the face-motions to the audio channel.