MPEG-4 AUTHORING TOOL FOR THE COMPOSITION OF 3D AUDIOVISUAL SCENES


P. Daras, I. Kompatsiaris, T. Raptis, M. G. Strintzis
Informatics and Telematics Institute, 1 Kyvernidou Str., Thessaloniki, Greece

Abstract

Bringing much new functionality, MPEG-4 offers numerous capabilities and is expected to be the future standard for multimedia applications. In this paper a novel authoring tool fully exploiting the 3D functionalities of the MPEG-4 standard is described. It is based upon an open and modular architecture able to progress with MPEG-4 versions, and it is easily adaptable to newly emerging, better and higher-level authoring features.

I. INTRODUCTION

MPEG-4 is the next-generation compression standard after MPEG-1 and MPEG-2. Whereas the previous two MPEG standards dealt with the coding of audio and video, MPEG-4 specifies a standard mechanism for the coding of audio-visual objects. Apart from natural objects, MPEG-4 also allows the coding of two-dimensional and three-dimensional, synthetic and hybrid, audio and visual objects. Coding of objects enables content-based interactivity and scalability. It also improves coding and the reusability of content (Figure 1).

MPEG-4 Systems facilitates the organization of the audio-visual objects decoded from elementary streams into a presentation [1]. The coded stream that describes the spatio-temporal relationships between the coded audio-visual objects is called the Scene Description, or BIFS (Binary Format for Scenes), stream. Scene description in MPEG-4 is an extension of VRML (Virtual Reality Modeling Language) that adds coding and streaming, timing, and the integration of 2D and 3D objects [2].

MPEG-4 authoring is quite a challenge.
Far from the past simplicity of MPEG-2's one-video-plus-two-audio-streams model, MPEG-4 allows the content creator to compose large numbers of objects of many different types, spatially and temporally: rectangular video, arbitrarily shaped video, still images, speech synthesis, voice, music, text, 2D graphics, 3D objects, and more. In [3], the most widely known MPEG-4 authoring tool for the

composition of 2D scenes only is presented. This tool can read/write BIFS text or binary, read/write the MP4 file format, import JPEG, AAC, or MPEG-4 video into an MP4 file, create self-contained MP4 files as well as multi-file scenes, and can use BIFS and OD streams as media. In [4], an MPEG-4 authoring tool compatible with the 2D player is presented. However, the user cannot preview the objects that have been inserted in the scene until the scene is viewed on the MPEG-4 player.

Figure 1: Overview of MPEG-4 Systems.

In this paper, we present a 3D MPEG-4 authoring tool, our solution to help authors create MPEG-4 content with 3D functionalities, from the end-user interface specification phase to the cross-platform MP4 file. We show our choice of an open and modular architecture for an MPEG-4 authoring system able to integrate new modules. In the following section MPEG-4 BIFS is presented. In Section III an overview of the authoring tool architecture and the graphical user interface is given. Implementation issues, and more specifically how OpenGL was used to enable a 3D preview of the scene, are discussed in Section IV. Experimental results demonstrating a 3D scene composed with the authoring tool are given in Section V. Finally, conclusions are drawn in Section VI.

II. BINARY FORMAT FOR SCENES (BIFS)

The BIFS description language [5], which has been designed as an extension of the VRML 2.0 specification [2], is a compact binary format representing a pre-defined set of scene objects and behaviors along with their spatio-temporal relationships. In particular, BIFS contains the following four types of information:
- The attributes of media objects, which define their audio-visual properties.
- The structure of the scene graph, which contains these objects.
- The pre-defined spatio-temporal changes of these objects, independent of user input.

- The spatio-temporal changes triggered by user interaction.

Audiovisual objects have both a spatial and a temporal extent. Temporally, all objects have a single dimension, time. Objects may be located in 2-dimensional or 3-dimensional space. Each object has a local coordinate system: one in which the object has a fixed spatio-temporal location and scale (size and orientation). Objects are positioned in the scene by specifying a coordinate transformation from the object's local coordinate system into another coordinate system defined by a parent node in the tree.

The coordinate transformation that locates an object in a scene is not part of the object, but rather part of the scene. This is why the scene description has to be sent as a separate elementary stream. This is an important feature for bitstream editing, one of the content-based functionalities in MPEG-4.

The scene description follows a hierarchical structure that can be represented as a tree. Each node of the tree is an audiovisual object. Complex objects are constructed by using appropriate scene description nodes. The tree structure is not necessarily static: the relationships can evolve in time, and nodes may be deleted, added or modified. Individual scene description nodes expose a set of parameters through which several aspects of their behavior can be controlled. Examples include the pitch of a sound, the color of a synthetic visual object, or the speed at which a video sequence is to be played.

There is a clear distinction between the audiovisual object itself, the attributes that enable the control of its position and behavior, and any elementary streams that contain coded information representing some attributes of the object. The scene description does not directly refer to elementary streams when specifying a media object, but uses the concept of object descriptors.
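The scene-tree idea described above can be sketched in a few lines of C++. This is a toy illustration only, not the MPEG-4/BIFS or IM1 API: the type and function names are ours, and rotation is omitted for brevity. Each node carries a local transform, and an object's world position is obtained by composing the transforms up the parent chain.

```cpp
#include <cassert>
#include <vector>

// Hypothetical scene-tree node: a local coordinate system defined
// relative to a parent node, as in a BIFS Transform hierarchy.
struct Vec3 { double x, y, z; };

struct Node {
    Vec3 translation{0.0, 0.0, 0.0}; // offset of the local coordinate system
    double scale = 1.0;              // uniform scale, for simplicity
    Node* parent = nullptr;          // parent node in the scene tree
    std::vector<Node*> children;     // objects positioned relative to this node
};

// Map a point from a node's local coordinates to world coordinates by
// applying each ancestor's transform in turn, leaf to root.
Vec3 localToWorld(const Node* n, Vec3 p) {
    for (; n != nullptr; n = n->parent) {
        p = { p.x * n->scale + n->translation.x,
              p.y * n->scale + n->translation.y,
              p.z * n->scale + n->translation.z };
    }
    return p;
}
```

Note that moving an object means editing the transform stored in the scene, not the object itself, which mirrors why the scene description travels in its own elementary stream.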
The purpose of the object descriptor framework is to identify and properly associate elementary streams with the media objects used in the scene description. Media objects that require elementary stream data point to an object descriptor by means of a numeric identifier, an ObjectDescriptorID. Each object descriptor is itself a collection of descriptors that describe the elementary streams comprising a single media object. An ES_Descriptor identifies a single stream with a numeric identifier, ES_ID, and contains the information necessary to initiate and configure the decoding process for that stream. A set of descriptors determines the required decoder resources and the precision of the encoded timing information.

III. MPEG-4 AUTHORING TOOL

III-A. System Architecture

Figure 2: System Architecture.

The process of creating MPEG-4 content can be characterized as a development cycle with four stages: Open, Format, Play and Save (Figure 2). In this somewhat simplified model, the content creators can:

- Edit/format their own scenes, inserting 3D objects such as spheres, cones, cylinders, text, boxes and backgrounds; group objects; modify the attributes (3D position, color, texture, etc.) of the edited objects, or delete objects from the created content; insert sound and video streams; add interactivity to the scene using sensors and interpolators; and control the scene dynamically using an implementation of the BIFS-Command protocol. Generic 3D models can be created or inserted and modified using the IndexedFaceSet node, and the user can insert a synthetic animated face using the implemented Face node. During these procedures the attributes of the objects and the commands, as defined in the MPEG-4 standard and more specifically in BIFS, are stored in an internal program structure which is continuously updated depending on the actions of the user. At the same time, the creator can see a real-time 3D preview of the scene in an integrated window using OpenGL tools.
- Present the created content, by interpreting the commands issued during the editing phase, allowing the author to check the correctness of the current description.
- Open an existing file.
- Save the file either in a custom format or, after encoding/multiplexing and packaging, in an MP4 file [6], which is expected to be the standard MPEG-4 file format. The MP4 file format is designed to contain the media information of an MPEG-4 presentation in a flexible, extensible format which facilitates the interchange, management, editing and presentation of the media.
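The BIFS-Command mechanism mentioned above can be illustrated with a toy sketch. These types are our own invention, not the actual BIFS-Command syntax or the tool's internal structure: the point is only that the protocol lets an author update a running scene by sending commands such as node insertion, deletion and replacement, which the player applies to its scene structure.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical command kinds, loosely modeled on BIFS-Command updates.
enum class CommandType { InsertNode, DeleteNode, ReplaceNode };

struct BifsCommand {
    CommandType type;
    int nodeId;           // target node in the scene
    std::string nodeName; // payload for insert/replace (e.g. "Sphere")
};

// Toy scene structure: node id -> node type name.
using Scene = std::map<int, std::string>;

// Apply one command to the scene, as a player would on receiving an
// update in the scene description stream.
void apply(Scene& scene, const BifsCommand& cmd) {
    switch (cmd.type) {
        case CommandType::InsertNode:
        case CommandType::ReplaceNode:
            scene[cmd.nodeId] = cmd.nodeName;
            break;
        case CommandType::DeleteNode:
            scene.erase(cmd.nodeId);
            break;
    }
}
```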

Figure 3: Main window indicating the different components of the user interface.

III-B. User Interface

To improve the authoring process, powerful graphical tools must be provided to the author [7]. The temporal dependence and variability of multimedia applications hinder the author from obtaining a real perception of what he is editing. To overcome this difficulty, an environment with multiple, synchronized views was implemented using OpenGL. The interface is composed of three main views, as shown in Figure 3.

Edit/Preview: By integrating the presentation and editing phases in the same view, we enable the author to see a partial result of the created object in an OpenGL window. When an object is inserted in the scene, it can immediately be seen in the presentation (OpenGL) window, located exactly at the given 3D position. If a particular behavior, for example a video texture, is assigned to an object, it can be seen only during scene play. If an object already has a video texture (or image texture) and the user tries to map an image texture (or video texture) onto it, a message appears and gives a warning to the user. If a sound is inserted, a saxophone icon is displayed in the upper left corner of the presentation window. The integration of the two views is very useful for the initial scene composition.

Scene Tree: This view provides a structural view of the scene as a tree (a BIFS scene is a graph, but for ease of presentation, the graph is reduced to a tree for display). Since the edit view cannot be used to display the behavior of the objects, the scene tree is used to provide more detailed information concerning them. Drag-and-drop and copy-paste modes can also be used in this view.

Object Details: This window offers object properties that the author can use to assign values other than the defaults to the objects. These properties are: 3D

position, 3D rotation, 3D scale, color (diffuse, specular, emission), shininess, texture, video stream, audio stream, cylinder and cone radius and height, text style (plain, bold, italic, bold-italic) and fonts (serif, sans, typewriter), sky and ground background, texture for the background, interpolators (color, position, orientation) and sensors (sphere, cylinder, plane, touch, time) for adding interactivity and animation to the scene.

Furthermore, the author can insert, create and manipulate generic 3D models using the IndexedFaceSet node. Simple VRML files can be inserted in a straightforward manner. Synthetically animated 3D faces can be inserted through the Face node. The author must provide a FAP file [8] and the corresponding EPF file (Encoder Parameter File, which is designed to give the FAP encoder all the information related to the corresponding FAP file, such as I and P frames, masks, frame rate, quantization scaling factor and so on). A bifa file (binary format for animation) is then automatically created, which is used in the Scene Description and Object Descriptor files.

IV. IMPLEMENTATION SPECIFICS

The 3D MPEG-4 authoring tool was developed in C/C++ for Windows, specifically with C++ Builder and OpenGL, interfaced with the MPEG-4 Implementation Group (IM1) decoders. The IM1 3D player is a software implementation of an MPEG-4 Systems player [9]. The player is built on top of the Core framework, which also includes tools to encode and multiplex test scenes. It aims to be compliant with the Complete 3D profile.

OpenGL [10] is a software interface to graphics hardware. The main purpose of OpenGL is to render two- and three-dimensional objects into a framebuffer. These objects are described as sequences of vertices (which define geometric objects) or pixels (which define images). OpenGL performs several processing steps on this data to convert it to the pixels that form the final desired image in the framebuffer.

V.
EXPERIMENTAL RESULTS

In this section we present a scene that can easily be constructed with the authoring tool. The scene represents a virtual studio (Figure 5). It contains several groups of synthetic objects, including a synthetic face, boxes with textures, text objects and IndexedFaceSets (Figure 4). The logo group, located in the upper left corner of the studio, is composed of a rotating box and a text object giving the name of the channel. The background contains four boxes (left and right sides, floor and back side) with image textures. The desk is created with another two boxes. In the upper right corner of the scene a box with a video texture is presented; videos relevant to the news items are loaded onto this video-box. The body of the newscaster is an IndexedFaceSet imported from a VRML 3D model. The 3D face was inserted using the corresponding button. Finally, a rolling text object is inserted in the scene for the headlines. After the selection of a

FAP (Face Animation Parameters) file and an audio stream (a saxophone icon appears in the upper left corner), the face is configured to animate according to the selected FAP file. The video stream (H.263) and the audio stream (G.723) are transmitted as two separate elementary streams according to the object descriptor mechanism. All the animation (except the face animation) is implemented using interpolator nodes.

Figure 4: The virtual studio scene in the authoring tool.

VI. CONCLUSIONS

In this paper an authoring tool with 3D functionalities for the MPEG-4 multimedia standard was presented. After a short introduction to MPEG-4 BIFS, the proposed editing environment and the underlying architecture were described. The 3D authoring tool was used for the creation of complex 3D scenes and has proven to be user-friendly and fully compatible with the MPEG-4 standard.

ACKNOWLEDGMENTS

This work was supported by the PENED99 project of the Greek Secretariat of Research and Technology.

REFERENCES

[1] MPEG-4 Systems, "Coding of Audio-Visual Objects: Systems, Final Draft International Standard," ISO/IEC JTC1/SC29/WG11 N2501, October 1998.
