Along with the marriage of motion pictures
and computers has come an increasing interest in making images appear
to have a greater degree of realness or presence, which I call "realspace
imaging." Such topics as high definition television, 3D, fisheye lenses,
surrogate travel, and "cyberspace" reflect such interest. These topics
are usually piled together and are unparsable, with the implicit assumptions
that "the more resolution, the more presence" and "the more presence,
the better." This paper proposes a taxonomy of the elements of realspace
imaging. The taxonomy is organized around six sections: 1) monoscopic
imaging, 2) stereoscopic imaging, 3) multiscopic imaging, 4) panoramics,
5) surrogate travel, and 6) realtime imaging.

INTRODUCTION: REALSPACE IMAGING

Realspace imaging is the process of recording
and displaying sensory information indistinguishable from unmediated reality.
Imagine looking at a framed image as if it were a window. Fooling the
eye into believing the image is real is a difficult task. Fooling two
eyes is even more difficult. Fooling two eyes while allowing freedom of
head motion is yet more difficult. Now imagine removing the frame and
having the freedom to look around. Now imagine having the freedom to move
around. Now add time-based phenomena such as motion and sound. These are
the elements of realspace imaging proposed.

1. MONOSCOPIC IMAGING

Monoscopic images represent one single point
of view.

1.1. Orthoscopy (Scale)

Orthoscopic images, images viewed in proper
scale, require that the viewer see from the same angle of view as that of the
camera lens. Orthoscopically correct images must be viewed from only one
single point of view, hence virtually every image we see is non-orthoscopic.
In addition to scale change, off-axis viewing results in trapezoidal distortion.
We are rarely cognizant that we are looking at a trapezoid when we sit
to the side of a movie theater, though it has been shown that additional
cognitive processing is required to "straighten it out."1

1.2. Spatial Resolution and Color

Spatial resolution is possibly the most-discussed
aspect of realspace imaging. Today, the images we see on television and
in movie theaters are recorded on a wide variety of formats whose principal
difference is spatial resolution (and color, partly a subset of spatial
resolution and partly a subset of recording and display technology). The
current dilemma over "how much is enough?" has been additionally fueled
by standardization arguments for high definition television (HDTV). To
some, the research is disheartening: a display with a 45° by 90°
field of view would ideally require a 3000 by 6000 pixel display, many
times the resolution of even present HDTV standards.2
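
The arithmetic behind an estimate of this kind can be sketched as
follows. This is a minimal illustration and not the cited study's method:
it assumes the display should deliver one pixel per arcminute of visual
angle (a commonly quoted figure for normal acuity), and the function name
is hypothetical.

```python
def pixels_for_field_of_view(fov_degrees, arcmin_per_pixel=1.0):
    """Pixels needed along one axis so each pixel spans arcmin_per_pixel."""
    # 60 arcminutes per degree; arcmin_per_pixel is an assumed acuity limit.
    return round(fov_degrees * 60.0 / arcmin_per_pixel)


rows = pixels_for_field_of_view(45)     # vertical field of view
columns = pixels_for_field_of_view(90)  # horizontal field of view
print(rows, "by", columns, "pixels")    # prints "2700 by 5400 pixels"
```

Under the one-arcminute assumption the result is of the same order as
the 3000 by 6000 figure; a slightly finer acuity limit would close the gap.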

1.3. Dynamic Range and Brightness

Dynamic range is the span from the whitest
whites to the blackest blacks in visual recording and display. The eye
has a broad dynamic range allowing us to see bright outdoor scenes and
shadow detail simultaneously. Film has less dynamic range than the eye.
Scenes shot in film must be carefully lit to "squeeze" the dynamic range
into film's limits, such as "fill lighting" the shaded areas more and
"key lighting" the bright areas less. Video has even less dynamic range
than film, requiring more complicated lighting to achieve the same effect
as film. Film shot and transferred to video results in a greater dynamic
range than video-originated material, which is one reason why many cinematographers
prefer the "film look" (another reason is frame rate, see below).

1.4. Spatial Consistency and Spatialization

Many kinds of lower resolution images appear
acceptable, particularly if the noise or artifacts have realworld analogs,
such as looking at the world through a veil (or a fan blade). But when
the styles of resolution are inconsistent with each other in the same
image, it looks "wrong." Steve Yelick once referred to this as the "Gumby
in Manhattan" problem.

Similarly an overlay can be "spatialized"
by making it appear as a contiguous part of the image, where lighting
and shadow, scale, and synchronous movement must correspond with the background.
The motion picture special effects industry has long been aware of such
importance, and the so-called "virtual reality" field is currently realizing
the need for spatializing data. A videodisc-based example of spatializing
data was recently produced by the Apple Multimedia Lab, where electronic
graphics were inserted along a filmed desert highway to teach scale to
schoolchildren.3

1.5. Monoscopic Depth Cues

1.5.1. Content Cues

There are several monoscopic cues to depth
perception which are based on image content, such as perspective, overlapping
or occlusion, aerial perspective or atmospherics, light and shade, and
textural gradient.4 It is noteworthy that these cues are automatically
captured with a camera but must be addressed explicitly with computer-generated
imagery, where these factors have historically been of great concern.

1.5.2. Accommodation (Focus)

With one eye open and a fixed head position,
a prominent depth cue is accommodation, the focussing of the eye's lens
by the surrounding muscles. It is similar to focussing the lens of a
camera. There are two ways to determine depth from focussing a camera.
One is to focus on an object and read the calibrated focus setting. We
sense our eye muscles in a similar way when we change focus. The other
way to determine depth is by the amount of blur that exists for objects
out-of-focus, which is partly a function of distance and partly a function
of brightness. Obtaining depth data in an image by comparing two samples
of blurriness has been demonstrated.5
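
The dependence of blur on distance can be sketched with the thin-lens
equation. This is an idealized model offered as an illustration, not the
method of the cited work; the function names and lens values are
hypothetical, and the aperture parameter stands in for the brightness
dependence on the assumption that brightness acts through the size of
the pupil or iris.

```python
def image_distance(focal_length, object_distance):
    """Thin-lens equation 1/f = 1/s + 1/v, solved for the image distance v."""
    return focal_length * object_distance / (object_distance - focal_length)


def blur_diameter(focal_length, aperture, focus_distance, object_distance):
    """Diameter of the blur circle for an object away from the focus plane."""
    sensor_plane = image_distance(focal_length, focus_distance)
    sharp_plane = image_distance(focal_length, object_distance)
    # Similar triangles: a light cone of width `aperture` converges at
    # sharp_plane, so its width where it crosses sensor_plane is the blur.
    return aperture * abs(sensor_plane - sharp_plane) / sharp_plane


# Hypothetical 50 mm lens, 10 mm aperture, focused at 2 m (units: metres).
for distance in (1.0, 2.0, 4.0):
    print(distance, blur_diameter(0.05, 0.01, 2.0, distance))
```

The blur is zero at the focus distance and grows on either side of it,
which is why a single blur measurement is ambiguous and two samples at
different settings are compared.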

Accommodation discrepancies are often prominent
while viewing landscape images, where the eyes should be focussed on infinity
(and where parallax diminishes to near zero). One method of achieving
"infinity focus" is simply to project farfield imagery far away on a large
screen, but a large space is required. Another technique, common in flight
simulators, uses a large concave mirror which magnifies and refocusses
the image from a small monitor. But even with this large mirror, the effective
viewing area is small, enough for only one person. Smaller and less expensive
optics can be used instead of a mirror if the viewer is further restricted
to an even smaller viewing area, like a peephole.

Similarly, a concave mirror may be used
for nearfield imaging by projecting a real image in front of the mirror's
surface. The "floating nickel" illusion and video "genies" hovering in
space are popular examples.

2. STEREOSCOPIC IMAGING

Stereoscopic images represent two single
points of view, one for each eye, separated to give a noticeable lateral
displacement, or parallax. Parallax is often erroneously pitched as all
that is necessary for depth in imagery (the stereoscopic movies of the
1950s were simply labelled "3D"). There is no easy way to record and reproduce
parallax. First, two simultaneous views must be recorded, with care taken
for proper convergence and disparity. Then, each view must be seen exclusively
by each eye.

Stereoscopic photography has a lively history
dating back at least to Wheatstone's invention of the stereoscope in 1833.6
The most popular techniques require glasses to be worn (such as anaglyphic,
polarized, or shutter). Methods not requiring glasses usually require
the head to be held in a particular position (using mirrors, peepholes,
or lenticular screens, for example).

3. MULTISCOPIC IMAGING

Multiscopic imaging represents multiple
points of view: lateral head movement while the body is more-or-less still.
The result is local motion parallax (global motion parallax equals travel).
Local motion parallax is a stronger depth cue than stereoscopic parallax
because more than two points of view can be seen in a relatively short
period of time.

That local motion parallax occurs when we
rotate our head is a wonderful evolutionary feature of humans (and most
other animals) because our eyes are displaced from our neck's axis of
rotation. If our eyes were on the neck's axis of rotation (like
a camera mounted on a tripod), there would be no lateral displacement
when we turn our head and therefore no parallax.
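
The geometry can be sketched in a few lines. The model below treats the
viewpoint as a point sitting some distance forward of a vertical rotation
axis; the 0.1 m offset is an assumed, illustrative value and the function
name is hypothetical.

```python
import math


def lateral_eye_shift(offset_from_axis_m, head_turn_degrees):
    """Sideways movement of a viewpoint that sits forward of the rotation axis."""
    return offset_from_axis_m * math.sin(math.radians(head_turn_degrees))


# Assumed 0.1 m between the neck's axis and the eyes (illustrative only).
print(lateral_eye_shift(0.1, 30))  # eyes ahead of the axis: nonzero shift
print(lateral_eye_shift(0.0, 30))  # viewpoint on the axis: no shift
```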

3.1. Mirrors

A mirror is multiscopic by nature. When
we view ourselves in a mirror, each eye is seeing a different point of
view, so we see stereoscopically. But also when we move our head, we see
correspondingly different points of view.

An early technique for achieving multiscopic
images required a giant half-silvered mirror with which to reflect hidden
images and props over an actual stage set. Like normal mirrors, both parallax
and accommodation are preserved, but since it is half-silvered, the reflected
imagery appears transparent, making it great for ghosts but not much else.
Such giant half-silvered mirrors date back to the Phantasmagoria shows
of the 18th century and have been popularized by the ballroom of Disney's
Haunted Mansion. Similar examples, but where the 3D floating images were
of 2D film and video screens (and if done cleverly appear 3D), were popular
attractions at the last three World's Fairs: the GM "Spirit Lodge" (EXPO
'86, Vancouver), the Australian Pavilion (EXPO '88, Brisbane), and the
Ginko, Gas, and Mitsui Toshiba pavilions (EXPO '90, Osaka).

A technique for producing small multiscopic
images employs a flexible vibrating mirror to rapidly change focal lengths.
This varifocal mirror reflects a video display whose image is in sync
with the vibration, resulting in a relatively small volumetric display.7 Since
the video must be from a computer-generated 3D model, direct display of
camera-originated images is not possible. Indeed, no such camera exists.

3.2. Relief Projection

Another multiscopic technique is sometimes
called relief projection, where an image is projected onto a screen whose
shape physically matches the image itself. Historically, the most popular
application of relief projection is "talking heads," where a mask is
made of a person's face to be used as the projection screen. The person
is filmed with their head totally motionless but their mouth and eyes
moving. The film is projected onto the facemask screen, with careful alignment
such that the eyes fall in the eye sockets, the mouth along the mouth
line, etc. The illusion is very powerful, and the fact that the image
of the eyes and mouth move and the screen does not is barely noticeable.
The talking head in Disney's Haunted Mansion uses this technique. A more
advanced version, where the mask screen moved in sync with the image,
was produced at MIT in 1980.8 The author has produced room-sized relief
projection by painting entire stage sets white after filming them and
projecting the original image back on the white-painted surfaces.9 The
limit of relief projection, of course, is that the shape of the screen
cannot easily change.

One method of making a more flexible relief
projection display is to rapidly spin a disc or corkscrew-shaped screen
while projecting on it with synchronized lasers.10 The result is a volumetric
display (usually inside a clear housing for safety) whose size, detail,
and flicker rate are related to the computational horsepower and the mechanics
of the system. And like varifocal mirror displays, all the imagery must
come from a 3D computer model rather than directly from a camera.

3.3. Holography

Holography achieves both parallax and accomodation,
but is a filmic medium (with the extremely significant exception of Benton's
most recent work at the MIT Media Lab). Being film-based has its implications.
It cannot be transmitted live like video. Also, it is near-impossible
for any kind of computer control and interactivity.

Another popular misconception about holography
is its "projection." The holographic image can only be seen while viewing
through the film: it may appear behind the film, in front of the film,
or both, but one must always be looking through the film. The concept
of both the audience and the hologram being on the ground and a holographic
image "projected" in the sky is simply inaccurate.

"Stereograms" (or integrams) are holograms
made from filmed or computer-generated material, where images are recorded
from multiple points of view along a single straight or curved track.
If the material is shot with a single moving camera, any motion counteracts
the stereoscopy. Real holographic movies, though demonstrated, require
massive amounts of holographic film and even still offer very limited
viewing.

3.4. Viewpoint-Dependent Imaging

Local motion parallax is possible with a
conventional display if driven by the user's head position. An example
was produced at MIT, where an outdoor scene was shot laterally and mastered
on videodisc. The videodisc speed and direction were controlled by a single
user wearing a position-tracking device on his head. As the user sways
back and forth, the video image changes correspondingly.11 Because such
a display is interactive, it is limited to one single user. Though trivial
with a virtual camera, recording with a real camera becomes increasingly
difficult when more than one dimension is shot (allowing the user to sway
back and forth, in and out, and up and down simultaneously). And like
other single camera applications for multiple points of view, time artifacts
(motion) counteract the multiscopic effect.
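
The one-dimensional case can be sketched as a lookup from tracked head
position to a frame on the disc. The sway range, frame count, and
function name below are hypothetical; the control scheme of the MIT
system is not documented here, so this is only one plausible mapping.

```python
def frame_for_head_position(head_x_m, sway_range_m, frame_count):
    """Pick the laterally shot frame that matches the tracked head position.

    head_x_m is the head's sideways offset from centre, in metres.
    """
    half_range = sway_range_m / 2.0
    # Clamp so that leaning past the recorded track holds the end frame.
    clamped = max(-half_range, min(half_range, head_x_m))
    fraction = (clamped + half_range) / sway_range_m  # 0.0 left .. 1.0 right
    return round(fraction * (frame_count - 1))


# Assumed: 101 frames shot along a track covering 0.6 m of head sway.
print(frame_for_head_position(0.0, 0.6, 101))   # centred head, middle frame
print(frame_for_head_position(-1.0, 0.6, 101))  # beyond the track, first frame
```

Extending the lookup to two or three dimensions multiplies the number of
frames that must be shot, which is the recording difficulty noted above.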

4. PANORAMICS

Panoramics is the ability to look around.
An image is considered panoramic as it approaches framelessness, when
the image is larger than the viewer's field of vision. When this occurs,
there is a sense of immersion, of being inside rather than outside
looking in. Panoramic imagery allows freedom of angular movement.

4.1. Rectilinear Perspective

Rectilinear perspective is shot on flat
film and displayed on a flat surface. When viewing rectilinear perspective
images off-axis, the frame will appear trapezoidal, but straight lines
will always appear straight. Practically every camera we have ever seen
or used and practically every image we've ever seen or made, has been
of rectilinear perspective.

Because of their flat nature, all rectilinear
perspective images must have less than a 180° angle of view, and
therefore full panoramic construction requires multiple images. For example,
MIT's Aspen Moviemap was shot with four 16mm cameras with slightly less
than 90° lenses, pointing front, back, left, and right. For these
images to be viewed together properly, one must stand in the center of
a four-walled projection space, otherwise trapezoidal distortion will
occur. The four images, when laid flat, will exhibit discontinuities which
can be computer-corrected by "undistorting" them linearly.12

4.2. Cylindrical Perspective

Cylindrical perspective is shot on cylindrically-positioned
film and displayed on a cylindrical surface. Unlike rectilinear perspective,
only one cylindrical perspective image is required for a 360° panorama,
and since it is a single image, there are no discontinuities. Cameras
with rotating slits and lenses (such as the Widelux, Hulcher, Globus,
and Roundshot cameras) can shoot a single image over a relatively short
period of time.

Optimal viewing is from the center of the
cylinder, like being inside a large lampshade. When the viewer is off-axis,
straight horizontal lines appear curved, while straight vertical lines
remain straight. The distortion can be "dewarped" with a computer by non-linear
correction in one dimension and linear correction in the other dimension
for a flat display.
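
A minimal sketch of this dewarping follows, assuming an ideal cylinder
whose radius equals the focal length of the flat view. For each point of
the flat display it returns where to sample the unrolled cylindrical
image; the function name and coordinate conventions are assumptions.

```python
import math


def flat_to_cylinder(x, y, focal_length):
    """Map a flat-display point to coordinates on the unrolled cylinder.

    x and y are measured from the image centre; focal_length is also the
    cylinder radius, in the same units (for example pixels).
    """
    azimuth = math.atan2(x, focal_length)  # non-linear in x
    u = focal_length * azimuth             # arc length around the cylinder
    v = y * math.cos(azimuth)              # linear in y within a given column
    return u, v


print(flat_to_cylinder(0, 0, 100))    # the centre maps to the centre
print(flat_to_cylinder(100, 50, 100)) # 45 degrees off-axis
```

The horizontal coordinate goes through an arctangent, while the vertical
coordinate is only rescaled by a factor that is constant within each
column.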

4.3. Spherical Perspective

Spherical perspective is shot with spherical
optics (such as fisheye lenses) and displayed on a spherical surface.
Optimal viewing is from the center of the sphere, like being inside a
large dome or an Omnimax theater. When the viewer is off-axis, straight
lines in both dimensions will appear curved.

Spherical recording is most often associated
with fisheye lenses, but other such specialty lenses exist. For example,
the Peri-Appolar lens made by Volpi was used for the Aspen Moviemap. It
produces a donut-shaped image representing a 360° azimuth by ±30°
elevation, centered on the horizon rather than on the zenith when pointing
upward. Shooting off convex mirrors also produces spherical perspective
(a Legg's pantyhose package is a favorite), but the camera will be visible
in the middle of the frame.

Spherical perspective images can capture
the most in a single shot, but flat viewing results in distortion. The
distortion can be "dewarped" with a computer by non-linear correction
in both dimensions.13
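
As a sketch of such a correction, the mapping below assumes an
"equidistant" fisheye, in which distance from the image centre grows
linearly with the angle off the optical axis. That lens model is an
assumption made for illustration (real lenses vary), and the function
name and parameters are hypothetical.

```python
import math


def flat_to_fisheye(x, y, flat_focal_length, fisheye_focal_length):
    """Map a flat-display point to a point in an equidistant fisheye image."""
    radius_flat = math.hypot(x, y)
    if radius_flat == 0:
        return 0.0, 0.0
    # Angle between the viewing ray and the optical axis.
    off_axis = math.atan2(radius_flat, flat_focal_length)
    # Equidistant model (assumed): image radius is proportional to angle.
    radius_fisheye = fisheye_focal_length * off_axis
    scale = radius_fisheye / radius_flat
    return x * scale, y * scale


print(flat_to_fisheye(100, 0, 100, 200))  # a ray 45 degrees off-axis
```

Because the scale factor depends on the distance from the centre in both
x and y, the correction is non-linear in both dimensions.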

4.4. Substituting Interactivity for Wholeness

For each type of perspective, it is possible
to store the entire panoramic image in such a way that the user may access
a subset of it. The obvious advantages are that it eliminates the need
for a 360° projection space and requires less display bandwidth.

A reasonable method of viewing panoramic
imagery is through a small flat rectilinear window such as a video display
if the user has control of the point of view, using a joystick, for example.
Intel's DVI technology has such a method for "dewarping" and displaying
imagery shot with a fisheye lens.14

A less reasonable method (but one that kept
this author obsessed for several years) is where a projected image physically
moves around the playback space in order to retain the spatial
correspondence.15 A "moving movie" requires neither the power nor
bandwidth to fill the entire playback space, but it nevertheless
requires a special playback space.

It is possible to combine "interactive small
window" viewing with spatial correspondence by wearing the display on
one's head and tracking head position. These head-mounted displays (HMDs)
can also offer properly accommodated, stereoscopic, wide angle optics16
and have received a great deal of recent attention under such labels as
"virtual realities," "virtual environments," and "cyberspace."

Virtually all imagery shown in HMDs today
is either computer-generated or from live telerobotic cameras. Realworld
recording and storage for HMDs presents novel challenges. For example,
shooting for both stereoscopy and panoramics has no simple solution, since
two panoramic cameras separated for stereoscopy result in variable parallax
as the "interactive small windows" rotate.

5. SURROGATE TRAVEL

Surrogate travel, or "moviemaps," is the
ability to move around, allowing the user to laterally move through a
recorded or created place. Moving around any virtual space presents some
problems not present when looking around a panoramic scene. Looking around
need not be explicitly interactive: the entire view can be displayed.
But moving around under one's own control must be explicitly interactive:
one must tell the system to change lateral position. Thus, while an audience
in a panoramic theater can all look in different directions, an audience
in a surrogate travel theater must somehow come to grips with who is
navigating.

Problems arise when surrogate travel is
shot with real cameras, rather than generated from 3D databases. Though
it is possible to record an entire panorama from a given point in a single
instant, the only way to record surrogate travel in a single instant is
with one camera at each location. The more realistic alternative is to
move a single camera from one location to another, but time artifacts
caused by moving clouds, shadows, cars, and people may result.

Another major difference between panoramic
recording and surrogate travel recording is in continuousness. Once a
panoramic scene is recorded and stored as a 2D single image, the user
may have continuous access to any portion of it. But surrogate travel
requires recording many 2D images at spatial intervals (like one frame
every ten feet) and creating in-between images from these is a state-of-the-art
computing problem. Hence surrogate travel material made from realworld
recording is currently stored as many 2D images, on fast-access lookup
media such as optical videodiscs.

5.1. One-Dimensional Movement

One-dimensional surrogate travel is along
one particular path. The user may go forward and backward, at any speed,
but cannot stray from this path.

5.1.1. Distance-Dependent Recording

In order to give the user a predictable
sense of speed control, realworld images along a route are best shot at
regular spatial intervals. Motion picture cameras, both film and video,
are time-triggered instruments, for example recording one frame every
1/24 or 1/30 of a second. If the camera tracking speed can be held constant,
then time triggering is equivalent to distance triggering. Otherwise, explicit
distance triggering is necessary, such as from an odometer or an external
"fifth wheel."

The triggering distance affects visual continuity
on the one hand and frame storage "real estate" on the other. The more
images, the smoother the apparent movement, but the more storage space
required. Smoothness is also related to angle of view of the camera, height
and distance to the nearest objects, and camera stability.
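
Distance triggering can be sketched as a selection over time-triggered
frames, given an odometer reading for each frame. The function name and
the sample readings are hypothetical, and a real rig would trigger the
shutter directly rather than select frames afterward.

```python
def distance_triggered_frames(odometer_readings, interval):
    """Indices of the first frame at or past each multiple of `interval`."""
    selected = []
    next_trigger = 0.0
    for index, distance in enumerate(odometer_readings):
        if distance >= next_trigger:
            selected.append(index)
            # Skip any trigger points already passed by this frame.
            while next_trigger <= distance:
                next_trigger += interval
    return selected


# Uneven speed: distance travelled (in feet) at each time-triggered frame.
readings = [0, 1, 2, 4, 8, 9, 10, 11, 12]
print(distance_triggered_frames(readings, 3))  # prints [0, 3, 4, 5, 8]
```

A smaller interval keeps more frames and gives smoother apparent
movement, at the cost of more storage.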

5.1.2. Image Stabilization

Camera stability is a realworld problem,
not relevant for virtual cameras or for model cameras on motion control
systems. Instability results from any variance of the lateral path or
the angular position of the camera during shooting. High frequency instabilities
such as vibrations will produce blur or smear and affect individual frames.
They can be minimized by using a short exposure time, a wide angle lens,
and staying away from close or fast-moving objects.

Low frequency instabilities will produce
a "wobble" from frame to frame. Since moviemaps are often shot at frame
rates less than normal motion pictures' (one frame per second may be an
average recording speed, for example), such instabilities are exaggerated.
Consequently, closed-loop gyroscopic stabilizers (such as Wescams, Gyrospheres,
or Tyler "Sea Mounts") perform better than either passive gyroscopic stabilizers
(such as some helicopter mounts) or passive inertial stabilizers (such
as Steadicams). In-camera and in-lens stabilizers (such as the 1962 "Dynalens,"
Arriflex's Image Stabilizer, and Schwem's "Gyrozoom") can only correct
for pan and tilt but not for rotation.

5.2. "1.1" Dimensional Movement

Moving along a path with occasional choice-points
is a far cry from being able to "travel anywhere." One might call this
class of surrogate travel "1.1 dimensional" because only some of the points
along the path have a two dimensional choice and most have only a one
dimensional choice.

5.2.1. Match-Cuts

At nodes (points with a two-dimensional
choice), the better the match-cut between two intersecting routes, the
greater the sense of seamlessness. Several factors contribute to matching
cuts. First, the camera has to be in the same position and pointing in
the same direction for both routes as it passes through the node. One
may use lines on the street or one may use compass coordinates, but there
is no easy way to do this in the real world. Also, temporal artifacts
are inevitable, since the matching shots must be recorded at different
times. Lighting and shadow discrepancies can be minimized by shooting during
a narrow window of time, such as from 10 am to 2 pm, or shooting on cloudy
days. For 3D database recording, as well as for motion control model shooting,
these problems don't exist.

5.2.2. Camera Angle

Since panoramic recording is often impractical,
the camera's angular position becomes an issue, since a less-than-360°
lens must be explicitly pointed. The simplest technique is to fix the
camera angle to the lateral direction of motion, either pointing straight
ahead or pointing sideways. At each node, every possible turn must be
separately recorded in order to match-cut the intersecting routes. MIT's
Aspen Moviemap was shot in this fashion.17

Another, more complicated way to point the
camera is in an absolute direction independent of lateral position. For
example, a camera could always point north regardless of whether it is
facing forward, sideways, or backward. An advantage is that shooting turns
is not required since the camera is pointing in the same direction at
any given point.

An even more complex way is to point the
camera at an absolute location, such as tracking a central object. For
example, the "Golden Gate Videodisc" produced by Advanced Interaction
Inc. and directed by the author for the Exploratorium is an aerial moviemap
over the Bay Area where the camera always pointed at the center of the
Golden Gate Bridge. As with absolute direction, the payoff is that separate
turn sequences are not necessary to record since the camera always points
in the same direction at any given point.
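
Both absolute schemes reduce to computing a compass bearing and then
panning the camera relative to the vehicle's heading. The sketch below
assumes flat local coordinates (east and north, in any consistent unit);
the function names are hypothetical and this is not a description of how
the Golden Gate shoot was actually controlled.

```python
import math


def bearing_to_target(camera_east, camera_north, target_east, target_north):
    """Compass bearing in degrees (0 = north, 90 = east) from camera to target."""
    d_east = target_east - camera_east
    d_north = target_north - camera_north
    return math.degrees(math.atan2(d_east, d_north)) % 360.0


def pan_angle(vehicle_heading, bearing):
    """Pan, relative to the vehicle's nose, that points the camera along bearing."""
    return (bearing - vehicle_heading) % 360.0


# Absolute location: track a target that lies due east of the camera.
bearing = bearing_to_target(0, 0, 10, 0)
print(pan_angle(0.0, bearing))   # vehicle heading north: pan to the right

# Absolute direction: always point north (bearing 0) while heading east.
print(pan_angle(90.0, 0.0))
```

Since the bearing depends only on where the camera is, not on which way
the vehicle is travelling, intersecting routes yield the same view at a
node.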

5.3. 2-D and 3-D Movement

Recording and storing two-dimensional grids
and three-dimensional lattices, where the user has freedom of movement,
is problematic because the numbers grow quickly. Consider that a 15 by
20 foot space with a 10 foot high ceiling requires 3,000 frames if shot
at intervals of one foot and over 5 million frames at intervals of one
inch!
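
The growth is easy to check. The sketch below counts one frame per
lattice cell, which is the counting that reproduces the figures above;
the function name is hypothetical.

```python
def lattice_frame_count(width_ft, depth_ft, height_ft, samples_per_foot):
    """Frames needed to sample a room on a regular 3D lattice."""
    return (width_ft * samples_per_foot
            * depth_ft * samples_per_foot
            * height_ft * samples_per_foot)


print(lattice_frame_count(15, 20, 10, 1))   # one-foot intervals: 3000
print(lattice_frame_count(15, 20, 10, 12))  # one-inch intervals: 5184000
```

Halving the interval multiplies the frame count by eight, since the
sampling density enters once per dimension.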

In the future, the very idea of discrete
frame storage will be obsolete. Computers will store information in spatial
databases based on whatever data has been collected (and will interpolate
what is missing). It has been demonstrated that significant bandwidth
compression occurs when the visual information from separate (highly redundant)
movie frames are stored as a single computer model.18 But data will still
need to be collected, and visual data will be collected with cameras.

6. REALTIME IMAGING

Realtime imaging is the process of recording
and displaying temporal sensory information indistinguishable from
unmediated reality.

6.1. Dynamic Visual Cues

6.1.1. Frame Rate

At least 15 updates per second are necessary
for motion to appear on a screen. The upper level is arguable. Modern
American film runs at 24 frames per second (fps), American video updates
at 60 fps, but 80 or 90 fps may be necessary.19

Apparently, part of our association with
the "film look" is film's lack of a sufficient frame rate. When
video is "defluttered" (every other field removed reducing the effective
update rate from 60 to 30 fps), the result takes on a film look.20 Similarly,
the Showscan film format, which records and projects at 60 fps, has a
video look.
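
Defluttering can be sketched on a list of fields. Holding each kept field
for two field periods, so that running time is preserved, is an
assumption made for this illustration; the description above states only
that every other field is removed.

```python
def deflutter(fields):
    """Drop every other field and hold each kept field for two field periods."""
    held = []
    for field in fields[::2]:
        held.extend([field, field])  # repeat to preserve running time (assumed)
    return held[:len(fields)]


one_second = list(range(60))       # 60 distinct fields, one second of video
print(len(set(deflutter(one_second))))  # distinct images remaining per second
```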

6.1.2. Temporal Continuity

Temporal continuity is the opposite of cuts,
or montage. The real world exhibits temporal continuity always, regardless
of whether it is seen looking out from a train or from a racecar or sitting still.
There are no cuts in the real world. (Believing that you really are instantly
somewhere else, as opposed to imagining it, is the definition of psychosis.)
Temporal continuity is the temporal equivalent of spatial consistency.

Cinema, on the other hand, consists of adjacent
frames which are either continuous (those within a shot) or not (those
between shots, the "cuts"). Cinema is the counterpoint between "respect
for spatial unity"21 and its "first and foremost" characteristic, montage.22
Noteworthy is Alfred Hitchcock's Rope, a feature film shot with
a carefully orchestrated camera, which has virtually no cuts.

6.2. Dynamic Non-Visual Cues

6.2.1. Audio

Audio in synchronization with image is part
of our association with cinema's ability to convey presence, and audio
has its own resolution specifications. Of particular relevance here is
the spatialization of sound. Sound can be spatialized one of two ways:
by using multiple speakers each positioned at the point of origin of the
sound source or by using binaural sound.23

6.2.2. Inertial Motion

In addition to visual and auditory cues,
we receive temporal cues by how we physically feel. This feeling of motion
is based primarily in the vestibular system in the inner ear and is sensitive
to linear and angular acceleration.24 Flight simulators (as well as
Disney's Star Tours and Body Tours) move the viewers on a motion
platform synchronized with the image and sound to enhance their effect.

6.2.3. Force Feedback

Force feedback is the ability
to "touch" a virtual object inside an image. For example, a
force-feedback joystick has been used to simulate textures.25 Similarly,
a hand grip made of a four inch bar with three computer-actuated springs
on each end can simulate angular and lateral force, and has been used
successfully to augment visual display for spatial tasks.26

7. AFTERWORD

Each of these elements of realspace imaging
can either be respected or violated. An image either is orthoscopic, stereoscopic,
or panoramic or it's not. Sometimes violations of these elements are by
default: it's more convenient to carry around non-orthoscopic images of
your family than "actual size," stereoscopic cameras are expensive, and
panoramic movies require special theaters.

But sometimes violations of these elements
are intentional: a cut in a film, slow frame rate in a rock video, a simple
line drawing rather than a high resolution image, silence rather than
sound. The very idea of respecting all elements of realspace imaging is
ultimately a losing battle. Giving the user everything is rarely possible.
There is never enough bandwidth. There will always be artifacts.

The trick is to give the sense of everything
without actually giving everything. The question, then, is how to choose
what is most important. And what is most important is always context-dependent.
This report is an attempt to lay out the choices. Choose wisely: that
is where the art lies.

8. ACKNOWLEDGEMENTS

This paper is a much-condensed version of
a forthcoming Apple Computer Technical Report entitled "Elements of Realspace
Imaging" written for the multimedia community and supported by the Apple
Multimedia Lab in San Francisco. The author wishes to thank Phil Agre,
Doug Crockford, Scott Fisher, Brenda Laurel, Robert Mohl, and Rachel Strickland
as well as the members of the Apple Multimedia Lab, particularly its Director,
Kristina Hooper, all for their lively discussions and criticisms throughout
the course of that report.