Abstract:

An example method is provided and includes identifying an active speaker
of a video session; analyzing a signal from an originating endpoint
associated with the active speaker; identifying a target participant with
whom the active speaker seeks to interact; and providing a notification
to the target participant that alerts the target participant that the
active speaker is seeking to interact with the target participant. In
more particular embodiments, the identifying of the target participant
includes detecting a gaze of the active speaker; and identifying a target
screen to which the gaze is directed.

Claims:

1. A method, comprising: identifying an active speaker of a video
session; analyzing a signal from an originating endpoint associated with
the active speaker; identifying a target participant with whom the active
speaker seeks to interact; and providing a notification to the target
participant that alerts the target participant that the active speaker is
seeking to interact with the target participant.

2. The method of claim 1, wherein the identifying of the target
participant comprises: detecting a gaze of the active speaker; and
identifying a target screen to which the gaze is directed.

3. The method of claim 1, further comprising: determining coordinates of
a location of the gaze on the target screen; and identifying the target
participant, whose image is positioned at the coordinates.

4. The method of claim 1, further comprising: determining a target
participant's identity by face recognition.

5. The method of claim 1, further comprising: detecting a speech pattern
of the active speaker; and using the speech pattern to identify the
target participant.

6. The method of claim 1, further comprising: detecting a head direction
of the active speaker; and using the head direction to identify the
target participant.

7. The method of claim 1, further comprising generating the notification;
and overlaying the notification on a video signal sent to a target
endpoint associated with the target participant.

8. The method of claim 1, wherein the notification comprises a selected
one of a group of notifications, the group consisting of: a) a blinking
icon provided on a screen; b) an audible sound provided for an endpoint;
c) a text message provided on a screen; d) a textual rendering of a
sentence spoken by the active speaker and provided on a screen; e) a
graphic provided on a screen; f) an avatar provided on a screen; and g) a
vibration provided for an endpoint.

9. Logic encoded in non-transitory media that includes code for execution
and when executed by a processor operable to perform operations,
comprising: identifying an active speaker of a video session; analyzing a
signal from an originating endpoint associated with the active speaker;
identifying a target participant with whom the active speaker seeks to
interact; and providing a notification to the target participant that
alerts the target participant that the active speaker is seeking to
interact with the target participant.

10. The logic of claim 9, wherein the identifying of the target
participant comprises: detecting a gaze of the active speaker; and
identifying a target screen to which the gaze is directed.

11. The logic of claim 9, the operations further comprising: determining
coordinates of a location of the gaze on the target screen; and
identifying the target participant, whose image is positioned at the
coordinates.

12. The logic of claim 9, the operations further comprising: determining
a target participant's identity by face recognition.

13. The logic of claim 9, the operations further comprising: detecting a
speech pattern of the active speaker; and using the speech pattern to
identify the target participant.

14. The logic of claim 9, the operations further comprising: detecting a
head direction of the active speaker; and using the head direction to
identify the target participant.

15. The logic of claim 9, the operations further comprising: generating
the notification; and overlaying the notification on a video signal sent
to a target endpoint associated with the target participant.

16. An apparatus, comprising: a memory element; a processor operable to
execute instructions associated with electronic code; and an analyzer
operable to analyze audio and video signals such that the apparatus is
configured for: identifying an active speaker of a video session;
analyzing a signal from an originating endpoint associated with the
active speaker; identifying a target participant with whom the active
speaker seeks to interact; and providing a notification to the target
participant that alerts the target participant that the active speaker is
seeking to interact with the target participant.

17. The apparatus of claim 16, the apparatus being further configured
for: detecting a gaze of the active speaker; and identifying a target
screen to which the gaze is directed.

18. The apparatus of claim 16, the apparatus being further configured
for: determining coordinates of a location of the gaze on the target
screen; and identifying the target participant, whose image is positioned
at the coordinates.

19. The apparatus of claim 16, the apparatus being further configured
for: generating the notification; and overlaying the notification on a
video signal sent to a target endpoint associated with the target
participant.

20. The apparatus of claim 16, further comprising: a database configured
for storing: information associated with an identity of the target
participant; and information associated with a target endpoint
corresponding to the target participant.

Description:

TECHNICAL FIELD

[0001] This disclosure relates in general to the field of communications
and, more particularly, to a system and a method for alerting a
participant in a video conference.

BACKGROUND

[0002] Video services have become increasingly important in today's
society. Enterprises of various sizes and types can collaborate through
video conference tools. A video conference allows people at two or more
locations to interact with each other via two-way video and audio
transmissions. Such video conference technology can allow enterprises to
cut costs, while boosting productivity. Video conference architectures
can simulate face-to-face interactions between people using advanced
visual, audio, and collaboration technologies. While video conferencing
performance has steadily increased, component manufacturers, service
providers, and engineering developers continue to be challenged to offer
a lifelike meeting experience for their end users.

BRIEF DESCRIPTION OF THE DRAWINGS

[0003] To provide a more complete understanding of the present disclosure
and features and advantages thereof, reference is made to the following
description, taken in conjunction with the accompanying figures, wherein
like reference numerals represent like parts, in which:

[0004] FIG. 1 is a simplified schematic diagram of a system for rendering
video data in a communication environment in accordance with one
embodiment;

[0005] FIG. 2 is a simplified block diagram of example details of the
system in accordance with one embodiment;

[0006] FIG. 3A is a simplified block diagram of an embodiment of the
system according to the present disclosure;

[0007] FIG. 3B is a simplified block diagram showing an example view of an
embodiment of the system; and

[0008] FIG. 4 is a simplified flowchart illustrating example operations
associated with an embodiment of the system.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

Overview

[0009] An example method is provided and includes identifying an active
speaker of a video session; analyzing a signal from an originating
endpoint associated with the active speaker; and identifying a target
participant with whom the active speaker seeks to interact (e.g.,
communicate, share information, solicit information from, etc.). The
method also includes providing a notification to the target participant
that alerts the target participant that the active speaker is seeking to
interact with the target participant. In more particular embodiments, the
identifying of the target participant includes detecting a gaze of the
active speaker, and identifying a target screen to which the gaze is
directed.

[0010] In more specific implementations, the method may include
determining coordinates of a location of the gaze on the target screen;
and identifying the target participant, whose image is positioned at the
coordinates. Additionally, the method may include determining a target
participant's identity by face recognition. In detailed instances, the
method may include detecting a speech pattern of the active speaker; and
using the speech pattern to identify the target participant. The method
may also include detecting a head direction of the active speaker; and
using the head direction to identify the target participant. In addition,
the method may include generating the notification, and overlaying the
notification on a video signal sent to a target endpoint associated with
the target participant.

Example Embodiments

[0011] Turning to FIG. 1, FIG. 1 is a simplified schematic diagram
illustrating a system 10 configured for providing an alert to a
participant of a video conference in accordance with one embodiment of
the present disclosure. FIG. 1 includes multiple endpoints, which can be
associated with various participants and end users in the video
conference. In general, endpoints may be geographically separated, where
in this particular example, a set of endpoints 12a-c are located in San
Jose, Calif., while a set of counterparty endpoints are located in
Chicago, Ill. FIG. 1 includes a multipoint manager element 20 associated
with a multipoint control unit (MCU) 16, which can be coupled to
endpoints 12a-c. Note that the numerical and letter designations assigned
to the endpoints do not connote any type of hierarchy; the designations
are arbitrary and have been used for purposes of teaching only. These
designations should not be construed in any way to limit their
capabilities, functionalities, or applications in the potential
environments that may benefit from the features of system 10.

[0012] In this example of FIG. 1, each endpoint is fitted discreetly along
a desk, where each endpoint is provided proximate to its associated
participant. Such endpoints could be provided in any other suitable
location, as FIG. 1 only offers one of a multitude of possible
implementations for the activities discussed herein. In one example
implementation, endpoints 12a-c are video conference endpoints, which can
assist in receiving and communicating video and audio data. Other types
of endpoints are certainly within the broad scope of the outlined
concept, and some of these example endpoints are further described below.
Each endpoint 12a-c can be configured to interface with a respective
multipoint manager element (e.g., multipoint manager element 20), which
can help to coordinate and to process information being transmitted by
the participants.

[0013] As illustrated in FIG. 1, a number of cameras 14a-c, screens 15a-c,
and microphones 18a-b are provided for the conference participants.
Screens 15a-c can render images to be seen by the participants and, in
this particular example, reflect a three-screen design (e.g., a
`triple`). Note that as used herein in this specification, the term
`screen` is meant to connote any element that is capable of rendering an
image during a video conference. This would be inclusive of any panel,
display device, Telepresence display or wall, computer display, plasma
element, television, monitor, or any other suitable surface or element
that is capable of such rendering. Moreover, the screen can encompass
each window in a "picture in picture" display on a single display device,
where multiple videos or images may be displayed simultaneously, for
example, in separate adjacent windows, or in one or more inset windows
inside a larger window.

[0014] In operation, the video conference technology of system 10 can
simulate an in-person meeting experience for its participants. In many
conferencing scenarios, not all participants may be visible to an active
speaker. The number of remote participants that are viewable to the
active speaker at any given time may be limited to a number of local
screens available for display in the active speaker's conference room.
When the number of remote participants exceeds the number of available
screens, any particular remote participant may be unaware that he or she
is being viewed by the active speaker and, thus, is unaware that the
active speaker's conversation is being directed towards him or her.

[0015] In accordance with the teachings of the present disclosure, and to
better replicate a true conference room experience, the architecture of
system 10 is configured to provide a mechanism for intelligently (and
autonomously) rendering images (on video conference displays) of certain
participants. Components of system 10 may overlay notifications (i.e.,
alerts) on appropriate screens to alert participants (e.g., when an
active speaker is attempting to converse with a target participant). This
can better simulate the experience of a conversation that commonly occurs
in an actual conference room.

[0016] Note that system 10 is capable of providing on-screen graphics and
text overlay to provide visual status updates and to improve the
effectiveness and security of the video session. For example, a
conference moderator can see when the meeting is locked or is being
encrypted from the graphics or text overlay. Graphics and text overlay
may have various other uses also, such as menu generation, special
effects, assistance for hearing impaired, etc.

[0017] System 10 is also capable of switching screens to an active
speaker. As used herein, an "active speaker" can refer to a participant
who is speaking relatively louder than other participants in the video
session at a particular moment of interest, or alternatively, the loudest
speaker for a predetermined time interval (e.g., approximately two
seconds). If more than one screen is available, one screen may show the
active speaker, where the other screens may show the other participants.
The active speaker could then readily see the previous active speaker on
one of his/her screen(s).
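
As an illustration only, the following Python sketch shows one way the "loudest speaker over a predetermined time interval" selection described above could be realized; the window length, sampling rate, endpoint identifiers, and class name are assumptions made for this example rather than anything prescribed by the disclosure.

    from collections import defaultdict, deque

    WINDOW_SECONDS = 2.0      # predetermined time interval (assumed value)
    FRAME_SECONDS = 0.02      # duration represented by one audio-level sample (assumed)

    class ActiveSpeakerTracker:
        """Tracks per-endpoint audio energy and names the loudest endpoint
        over a sliding window as the active speaker."""

        def __init__(self):
            frames = int(WINDOW_SECONDS / FRAME_SECONDS)
            self._levels = defaultdict(lambda: deque(maxlen=frames))

        def add_sample(self, endpoint_id, audio_level):
            # audio_level: a non-negative loudness/energy measure for one frame
            self._levels[endpoint_id].append(audio_level)

        def active_speaker(self):
            # Average level over the window; the highest average wins.
            averages = {
                ep: sum(levels) / len(levels)
                for ep, levels in self._levels.items() if levels
            }
            return max(averages, key=averages.get) if averages else None

    # Example: endpoint "san_jose_1" is consistently louder than "chicago_2".
    tracker = ActiveSpeakerTracker()
    for _ in range(100):
        tracker.add_sample("san_jose_1", 0.8)
        tracker.add_sample("chicago_2", 0.2)
    print(tracker.active_speaker())   # -> san_jose_1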

[0018] When a video conference has participants from multiple locations,
not all of the participants may be displayed on the screen(s). In such
a scenario, participants (other than the active speaker) may be displayed
randomly. Participants generally do not control which participants are
shown on the screen(s). For example, the active speaker may see other
participants on his or her screen(s), but the other participants may not
realize that they are being shown on the active speaker's screen(s). Such
display limitations may negatively affect a meeting experience, for
example, when participants do not realize that they are being invited
into a conversation (e.g., through physical cues). This stands in
contrast to a face-to-face scenario in a group setting, which video
conferencing platforms seek to emulate.

[0019] More specifically, in the context of face-to-face scenarios in a
group setting, people frequently rely on physical cues to recognize when
a participant in the group is attempting to converse with a second
participant. In this subtle way, physical cues are being used to attract
the second participant's attention. The physical cues can include any
number of items such as eye gaze, body orientation, hand and arm
gestures, facial movements (e.g., raised eyebrows, nodding), etc. If the
target participant (i.e., the person that is a target of the speaker's
conversation) is within eyesight, the speaker will usually direct his/her
gaze at the target participant without calling out the target
participant's name. On the other hand, if the target participant is not
within eyesight, it is likely that the speaker may address the target
participant by name.

[0020] Participants in a video conference scenario could utilize these
same physical cues in the framework of system 10. For example, when the
target participant's image is displayed on the active speaker's screen of
a single or multi-screen system, the active speaker may address the
target participant without calling out the target participant's name.
However, in a multipoint video conference (i.e., multiple participants
from multiple locations participate in a video conference), the target
participant may not realize that the active speaker is attempting to
converse with him/her. The target participant may see an image (or video)
of the active speaker, similar to all other participants at various
endpoints, but the target participant may not realize that the active
speaker is conversing with him (to the exclusion of the other
participants). The active speaker also may not be aware that the target
participant does not realize that he is even being addressed. For
example, such a situation can happen frequently in meetings where one of
the participants is more active than the other participants.

[0021] System 10 is configured to address these issues (and others) in
offering a system and method to intelligently and systematically alert a
participant in a video session (e.g., a video conference, a video call
involving a group, a video chat, a Telepresence call, etc.) about a
current (or a potential) interaction. In an example implementation,
speech behavior pattern, head direction, and eye gaze of an active
speaker may be detected and monitored to determine whether the active
speaker is attempting to converse with a target participant, whose image
would be displayed on a target screen. For example, when the active
speaker stops speaking with a questioning tone, with his head and eyes
directed at a target location on a target screen for a certain time
interval, then the target participant (whose image is positioned at the
target location) may be notified via an appropriate notification (e.g.,
blinking icon on display, beep, text message, etc.). In addition, a last
sentence spoken by the active speaker can be displayed on a screen
visible to the target participant. Additionally, system 10 can empower an
administrator to control notifications and images (to be rendered on a
given set of screens) based on the active speaker's physical cues (e.g.,
eye gaze, speaker behavior, etc.).

[0022] Hence, components of system 10 may analyze the active speaker's
visual behavior (e.g., actions or reactions of the active speaker in
response to a visual stimulus) to determine the target participant and
subsequently alert the target participant appropriately, so as to more
closely approximate a face-to-face meeting scenario. In certain example
implementations, an active speaker's visual behavior may be analyzed
using an ocular tracking system. The ocular tracking system may leverage
cameras 14a-c, for example, to detect head direction and gaze of the
active speaker. Alternatively, any suitable method for measuring the
active speaker's eye movements may be used in the ocular tracking system.

[0023] In an example embodiment, video images of the active speaker may be
used to extract a position of the active speaker's head and eyes. A
camera (e.g., cameras 14a-c) can focus on one or both eyes of the active
speaker and, therefore, record their movements as the active speaker
looks at the target participant on a target screen (e.g., screen visible
to the active speaker, and to which the active speaker has directed his
or her gaze). Gaze angles can be measured to determine coordinates of a
target location of the active speaker's gaze. The ocular tracking system
can detect the target location of the active speaker's gaze (e.g., where
the active speaker is looking).
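
The gaze-angle measurement described above could, for example, be reduced to a simple ray-plane projection. The Python sketch below assumes the head position, yaw/pitch angles, and the depth of the screen wall are already available from the ocular tracking system; the coordinate conventions and example values are illustrative assumptions.

    import math

    def gaze_target(head_pos, yaw_deg, pitch_deg, screen_z):
        """Project a gaze ray from the speaker's head onto a vertical screen
        plane located at depth `screen_z`, returning (x, y) on that plane.

        head_pos: (x, y, z) of the eyes in room coordinates; yaw/pitch in
        degrees, with 0/0 meaning looking straight toward the screens."""
        yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
        # Direction vector of the gaze ray.
        dx = math.sin(yaw) * math.cos(pitch)
        dy = math.sin(pitch)
        dz = math.cos(yaw) * math.cos(pitch)
        if dz <= 0:
            return None                       # looking away from the screens
        t = (screen_z - head_pos[2]) / dz     # ray-plane intersection
        return (head_pos[0] + t * dx, head_pos[1] + t * dy)

    # Speaker seated 2.5 m from the screen wall, glancing slightly right and up.
    print(gaze_target((0.0, 1.2, 0.0), yaw_deg=15, pitch_deg=5, screen_z=2.5))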

[0024] Multipoint manager element 20 can facilitate the analysis of audio
and video signals from an originating endpoint (i.e., the active
speaker's endpoint where the audio and video signals originated).
Additionally, multipoint manager element 20 is configured to identify the
target screen to which the gaze is directed and the coordinates of a
target location of the gaze on the target screen. Multipoint manager
element 20 could have information about which endpoints are currently
displayed on the active speaker's screens, an identity of the active
speaker, the remote participants who are displayed on the active
speaker's screens, etc. In combination with information from the ocular
tracking system, multipoint manager element 20 can identify the target
participant whose image is positioned at the target location of the
active speaker's gaze on the target screen. Having determined the target
participant with whom the active speaker is conversing, multipoint
manager element 20 may facilitate a display of a notification (i.e., a
light indicator, an icon, a text, a proprietary graphic, etc.) on a
screen visible to the target participant, thereby alerting the target
participant that the active speaker is conversing (or attempting to
converse) with him or her.
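
One hypothetical way multipoint manager element 20 could combine its layout knowledge with the gaze coordinates is a lookup of the on-screen region containing the gaze point, as sketched below in Python; the region structure, normalized coordinates, and identifiers are assumptions for illustration only.

    from dataclasses import dataclass

    @dataclass
    class LayoutRegion:
        screen_id: str
        x0: float
        y0: float
        x1: float
        y1: float
        participant_id: str
        endpoint_id: str

    def find_target(layout, screen_id, x, y):
        """Return the (participant, endpoint) whose image occupies the gaze
        coordinates on the identified target screen, or None if no region
        contains the gaze point."""
        for region in layout:
            if (region.screen_id == screen_id
                    and region.x0 <= x <= region.x1
                    and region.y0 <= y <= region.y1):
                return region.participant_id, region.endpoint_id
        return None

    # Two participants tiled side by side on one screen (normalized coordinates).
    layout = [
        LayoutRegion("screen_15c", 0.0, 0.0, 0.5, 1.0, "mary", "chicago_2"),
        LayoutRegion("screen_15c", 0.5, 0.0, 1.0, 1.0, "raj", "chicago_2"),
    ]
    print(find_target(layout, "screen_15c", 0.3, 0.6))   # -> ('mary', 'chicago_2')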

[0025] In certain implementations, a picture-in-picture (PIP) cue (e.g., active
presence for each participant) can be implemented in the architecture of
the present disclosure. For example, if system 10 detects that an
individual is gazing at user A, then on user A's screen, the PIP can
blink to let user A know that the individual is currently looking at him.
It should also be noted that the architecture of the present disclosure
can also readily handle instances in which a given participant in the
video conference is not currently on any screen. This could involve, for
example, the initiating individual using a soft button configuration, an
instant messaging mechanism, or body movements, facial gestures, eye
gazing, etc. to signal an attempted interaction with the target.

[0026] Note also that the architecture has the ability to not only notify
the remote participant being addressed, but to rearrange local display(s)
to override the last active speaker model with the images of the
individual being addressed. For example, because of screen arrangements,
an individual could be addressing someone on a screen not associated with
that individual's camera. The individual may be looking obliquely (or
sideways, or to the side) to address the participant, causing a lack of
eye contact on both the near and far ends. System 10 is configured to
rearrange participants such that the participant being addressed by the
individual is switched to the individual's screen (and vice versa, in
certain implementations). Such activities would enable direct eye contact
between the participant and the individual. Additional details associated
with these activities are provided below with reference to corresponding
FIGURES.

[0027] Turning to the infrastructure of FIG. 1, the example network
environment of FIG. 1 may be configured as one or more networks.
Additionally, networks of FIG. 1 may be provisioned in any form
including, but not limited to, local area networks (LANs), wireless local
area networks (WLANs), virtual local area networks (VLANs), metropolitan
area networks (MANs), wide area networks (WANs), virtual private networks
(VPNs), Intranet, Extranet, any other appropriate architecture or system,
or any combination thereof that facilitates communications in a network.
In some embodiments, a communication link may represent any electronic
link supporting a LAN environment such as, for example, cable, Ethernet,
wireless technologies (e.g., IEEE 802.11x), ATM, fiber optics, etc. or
any suitable combination thereof. In other embodiments, communication
links may represent a remote connection through any appropriate medium
(e.g., digital subscriber lines (DSL), telephone lines, T1 lines, T3
lines, wireless, satellite, fiber optics, cable, Ethernet, etc. or any
combination thereof) and/or through any additional networks such as a
wide area network (e.g., the Internet).

[0028] Elements of FIG. 1 may be coupled to one another through one or
more interfaces employing any suitable connection (wired or wireless),
which provides a viable pathway for electronic communications.
Additionally, any one or more of these elements may be combined or
removed from the architecture based on particular configuration needs.
System 10 may include a configuration capable of transmission control
protocol/Internet protocol (TCP/IP) communications for the electronic
transmission or reception of packets in a network. System 10 may also
operate in conjunction with a user datagram protocol/IP (UDP/IP) or any
other suitable protocol, where appropriate and based on particular needs.
In addition, gateways, routers, switches, and any other suitable network
elements may be used to facilitate electronic communication between
various elements.

[0029] The components of system 10 may use specialized applications and
hardware to create a system that can leverage a network. System 10 can
use Internet protocol (IP) technology and run on an integrated voice,
video, and data network. System 10 can also support high quality,
real-time voice and video communications using broadband connections.
The architecture of system 10 can further offer capabilities for ensuring
quality of service (QoS), security, reliability, and high availability
for high-bandwidth applications such as video. Power and Ethernet
connections for participants can also be provided. Participants can use
their laptops to access data for the meeting, join a meeting place
protocol or a Web session, or stay connected to other applications
throughout the meeting.

[0030] Endpoints 12a-c may be used by a participant in a video conference
in system 10. The term `endpoint` may be inclusive of devices used to
initiate a communication, such as a switch, a console, a proprietary
endpoint, a telephone, a bridge, a computer, a personal digital assistant
(PDA), a laptop or electronic notebook, an iPhone, an iPad, a Google
Droid, any other type of smartphone, or any other device, component,
element, or object capable of initiating voice, audio, or data exchanges
within system 10. Endpoints 12a-c may also be inclusive of a suitable
interface to a participant, such as a microphone, a display device, or a
keyboard or other terminal equipment. Endpoints 12a-c may also include
any device that seeks to initiate a communication on behalf of another
entity or element, such as a program, a database, or any other component,
device, element, or object capable of initiating a voice or a data
exchange within system 10. Data, as used herein, refers to any type of
video, numeric, voice, or script data, or any type of source or object
code, or any other suitable information in any appropriate format that
may be communicated from one point to another.

[0031] MCU 16 can be configured to establish, or to foster, a video
session between one or more participants, who may be located in various
other sites and locations. MCU 16 and multipoint manager element 20 can
coordinate and process various policies involving endpoints 12a-c. In
general, MCU 16 and multipoint manager element 20 may communicate with
endpoints 12a-c through any standard or proprietary conference control
protocol. Multipoint manager element 20 includes a switching component
that determines which signals are to be routed to individual endpoints
12a-c for rendering on screens. Multipoint manager element 20 can also
determine how individual participants are seen by other participants in
the video conference. Multipoint manager element 20 can add visual
information to video signals sent to target participants. For example,
multipoint manager element 20 can generate notifications and send the
notifications to target participants (e.g., after mixing and overlaying
text messages, audio cues, graphics, etc. on outgoing video signals to
the target endpoints). Furthermore, multipoint manager element 20 can
control the timing and coordination of these activities. Multipoint
manager element 20 can also include a media layer that can copy
information or data, which can be subsequently retransmitted or simply
forwarded along to one or more endpoints 12a-c.

[0032] Turning to FIG. 2, FIG. 2 is a simplified block diagram 30
illustrating example details of system 10 in accordance with one
embodiment. Multipoint manager element 20 may be provisioned in MCU 16
and may include a processor 32 and a memory 34. Multipoint manager
element 20 may communicate with a gaze/speech analyzer 36, which may
access a database 38. Gaze/speech analyzer 36 may receive audio and video
signals from multipoint manager element 20. In an example embodiment,
gaze/speech analyzer 36 may determine a speech pattern of the active
speaker, and use the information as a basis for sending a notification
through the architecture.

[0033] Speech patterns to be detected can include a distinctive manner of
oral expression. For example, the active speaker's tone of voice (e.g.,
vocative tone) may indicate a question is being asked. Gaze/speech
analyzer 36 may analyze the audio signals and determine (from the active
speaker's speech pattern) that the active speaker is asking a question.
Hence, system 10 can be configured to provide enhanced intelligence that
dynamically adjusts its image rendering operations based on vocative
speech inputs from the participants. This would enhance the user
experience by offering an effective placement of participant images on
screens for a multiscreen endpoint. In operation, the architecture of
system 10 can utilize speech vocative tone for smarter segment switching.
For example, after the name of each participant has been identified and
associated with the corresponding camera that captures their video, the
speech pattern analysis can be initiated. With a speech pattern detection
feature, when a user A addresses a remote user B by his/her name, the
speech being emitted is analyzed, and subsequently used to determine the
video segment for user A's video display. The video segment shown for
user A would contain user B (even though user B is not necessarily
speaking).
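
A minimal Python sketch of the segment-switching idea described above, assuming a speech recognizer has already produced a transcript of the active speaker's utterance; the name-to-segment map and the naive word matching are illustrative assumptions rather than the disclosed implementation.

    # Hypothetical mapping from participant names to the video segments
    # captured by their associated cameras.
    NAME_TO_SEGMENT = {
        "mary": "segment_chicago_cam3",
        "raj": "segment_chicago_cam1",
    }

    def segment_for_speaker(transcript, current_segment):
        """Return the video segment user A should see, switching to the
        addressed participant's segment when a known name is spoken."""
        words = transcript.lower().replace(",", " ").replace("?", " ").split()
        for word in words:
            if word in NAME_TO_SEGMENT:
                return NAME_TO_SEGMENT[word]
        return current_segment        # no name detected: keep the current view

    print(segment_for_speaker("What do you think, Mary?", "segment_last_speaker"))
    # -> segment_chicago_cam3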

[0034] Hence, the mechanisms of system 10 can use basic speech, words,
and/or pattern-recognition to identify a specific name. Once that name is
detected, the speech segment containing it can be further analyzed to
capture the change in the frequency (e.g., f0 frequency). For example, if
the f0 frequency increases and then decreases, the speech portion can be
classified as a vocative tone. In a particular implementation, the
architecture can detect an H*L pattern (i.e., a falling intonation). As
used herein in this Specification, the broad term `vocative parameter` is
meant to encompass any suitable vocative characteristic, as detailed
herein. More generally, the vocative detection mechanisms of system 10
can apply to the case of a noun identifying a person (animal, object,
etc.) being addressed and/or (occasionally) the determiners of that noun.
A vocative expression can be an expression of direct address, where the
identity of the party being addressed is set forth expressly within a
sentence. For example, in the sentence "I don't know, John", the term
`John` is a vocative expression indicating the party who is being
addressed. This is in contrast to the sentence "I don't know John", where
John is the direct object of the verb `know.` The phonetic manifestation
of an L* tone on the final vocative is indicative of its contrastive
behavior.
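
The rise-then-fall test described above could be approximated as follows, assuming an f0 track (in Hz) has already been extracted for the speech segment containing the detected name; the minimum-change threshold is an assumed tuning parameter, not a value taken from this disclosure.

    def is_vocative_contour(f0_values, min_change_hz=10.0):
        """Classify a pitch contour as vocative when f0 rises and then falls
        (the rise-fall pattern described above)."""
        if len(f0_values) < 3:
            return False
        peak_index = max(range(len(f0_values)), key=lambda i: f0_values[i])
        if peak_index in (0, len(f0_values) - 1):
            return False                          # monotone contour: no rise-fall
        rise = f0_values[peak_index] - f0_values[0]
        fall = f0_values[peak_index] - f0_values[-1]
        return rise >= min_change_hz and fall >= min_change_hz

    # f0 track (Hz) over the segment containing the addressed name.
    print(is_vocative_contour([180, 195, 230, 210, 175]))   # -> True
    print(is_vocative_contour([180, 178, 176, 174, 172]))   # -> False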

[0035] When the active speaker addresses a target participant by name,
this can be identified by gaze/speech analyzer 36. Note that certain user
information may be populated in gaze/speech analyzer 36 and/or database
38. This user information may include user IDs, names, user profiles,
policies to be applied for particular video conferencing arrangements,
user preferences, organizational titles, speech patterns associated with
individuals, linguistic information, any suitable identifier, etc.
Moreover, gaze/speech analyzer 36 may be configured to detect sounds,
syllables, tone, etc. in the context of detecting and analyzing speech
patterns. Gaze/speech analyzer 36 may include any appropriate combination
of hardware and/or software modules for providing any of the features
discussed herein.

[0036] In an example embodiment, gaze/speech analyzer 36 may detect a gaze
of the active speaker. For example, gaze/speech analyzer 36 may analyze
video signals during the video conference and determine that the active
speaker is staring (somewhat continually) at a target location for a
period of time (e.g., two to three seconds). Gaze/speech analyzer 36 may
be configured to inform multipoint manager element 20 that the active
speaker's gaze is detected. Multipoint manager element 20 may also
analyze the video signals further to determine coordinates of the target
location of the gaze. In another example embodiment, gaze/speech analyzer
36 may detect the gaze and determine coordinates of the target location
of the active speaker's gaze. Gaze/speech analyzer 36 may then return the
coordinates to multipoint manager element 20.
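
For illustration, a dwell-based gaze detector of the kind described in this paragraph might be sketched as follows; the dwell time, radius tolerance, and sampling rate are assumptions, not values specified by this disclosure.

    import math

    class GazeDwellDetector:
        """Reports a gaze event when the estimated gaze point stays within a
        small radius of one location for a sustained interval."""

        def __init__(self, dwell_seconds=2.0, radius=0.15):
            self.dwell_seconds = dwell_seconds
            self.radius = radius
            self._anchor = None
            self._elapsed = 0.0

        def update(self, gaze_point, dt):
            """gaze_point: (x, y) on the target screen; dt: seconds since the
            last sample. Returns the anchor point once the dwell threshold is met."""
            if self._anchor is None or math.dist(gaze_point, self._anchor) > self.radius:
                self._anchor, self._elapsed = gaze_point, 0.0   # gaze moved: restart
                return None
            self._elapsed += dt
            return self._anchor if self._elapsed >= self.dwell_seconds else None

    detector = GazeDwellDetector()
    for _ in range(70):                        # roughly 2.3 s of samples at 30 Hz
        hit = detector.update((0.31, 0.58), 1 / 30)
    print(hit)                                 # -> (0.31, 0.58)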

[0037] In an example embodiment, gaze/speech analyzer 36 may be part of an
ocular tracking system that measures the target location of a gaze of the
active speaker. Logistically, gaze/speech analyzer 36 could be
implemented as a computer application on a non-transitory computer
readable medium. In certain example implementations, gaze/speech analyzer
36 can be implemented in MCU 16. In yet another example embodiment,
gaze/speech analyzer 36 may be part of multipoint manager element 20. In
yet other example embodiments, gaze/speech analyzer 36 may be located
on one or more of the endpoints, or on a device that is accessible by
multipoint manager element 20 (e.g., over a network connection). Various
other potential implementations of gaze/speech analyzer 36 may be
employed without departing from the broad scope of the present
disclosure.

[0038] Database 38 may include information about the identity of
participants 40-48; locations of corresponding endpoints; number of
screens at respective endpoints of participants 40-48; profiles of
participants 40-48; policies associated with participants 40-48;
preferences associated with a particular host, an administrator, or
participants 40-48; and any other information that may be used by
gaze/speech analyzer 36, an administrator, and/or multipoint manager
element 20 to perform the intended functionality of system 10, as
described herein. Database 38 may be provisioned internally within
multipoint manager element 20, outside multipoint manager element 20
(e.g., in a network device coupled to multipoint manager element 20), or
locally at a particular network location, which could foster
communications with multipoint manager element 20 and/or gaze/speech
analyzer 36.
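
Purely as an illustration of the kind of records database 38 might hold, the following Python sketch defines hypothetical participant and endpoint structures; the field names and layout are assumptions, not a schema defined by this disclosure.

    from dataclasses import dataclass, field

    @dataclass
    class EndpointRecord:
        endpoint_id: str
        location: str
        screen_count: int

    @dataclass
    class ParticipantRecord:
        participant_id: str
        name: str
        endpoint_id: str
        profile: dict = field(default_factory=dict)        # titles, preferences
        policies: list = field(default_factory=list)       # per-conference policies
        face_template: bytes = b""                          # data for face recognition
        speech_profile: dict = field(default_factory=dict)  # linguistic/speech info

    database = {
        "endpoints": {"chicago_2": EndpointRecord("chicago_2", "Chicago, Ill.", 3)},
        "participants": {"mary": ParticipantRecord("mary", "Mary", "chicago_2")},
    }
    print(database["participants"]["mary"].endpoint_id)     # -> chicago_2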

[0039] In a particular implementation, multipoint manager element 20 is a
server provisioned to perform the activities discussed herein. More
generally, multipoint manager element 20, MCU 16, and/or gaze/speech
analyzer 36 are network elements, where the term "network element" is
meant to encompass computers, network appliances, servers, routers,
switches, gateways, bridges, load balancers, firewalls, processors,
modules, software applications, or any other suitable device, component,
element, or object operable to exchange information in a network
environment. Moreover, the network elements may include any suitable
hardware, software, components, modules, interfaces, or objects that
facilitate the operations thereof. It is imperative to note that
multipoint manager element 20, MCU 16, and/or gaze/speech analyzer 36 can
be consolidated, rearranged, and/or provisioned within each other in any
suitable arrangement without departing from the scope of the present
disclosure.

[0040] Turning to FIGS. 3A and 3B, FIG. 3A is a simplified block diagram of
an example configuration associated with system 10. In this particular
example, participant 40 is currently the active speaker in a video
conference. A microphone 18 and a video camera 14 at an originating
endpoint (corresponding to the active speaker: participant 40) may record
audio and video signals from participant 40. Participant 40 may see
participant 42 on screen 15a, participant 46 on screen 15b, and
participants 48 on screen 15c. In this particular example, an assumption
is made that participant 40 is conversing with participant 48a ("Mary")
on screen 15c. Participant 48a is a target participant, where screen 15c
is a target screen in this example. Participant 40 may direct his gaze at
target screen 15c and (as he speaks) fix his gaze on target participant
48a: located at a target location (with corresponding coordinates
L.sub.(x, y, z)) on target screen 15c.

[0041] Multipoint manager element 20 may continuously receive audio and
video signals from microphone 18 and video camera 14. In an example
embodiment, gaze/speech analyzer 36 may analyze the audio and video
signals and determine that participant 40 is directing his gaze at
coordinates L.sub.(x, y, z). In another example embodiment, gaze/speech
analyzer 36 may detect a gaze of participant 40 and inform multipoint
manager element 20 that this gaze is being detected. Multipoint manager
element 20 may further analyze the video signals and determine that
participant 40 is directing his gaze at coordinates L.sub.(x, y, z). In
one embodiment, gaze/speech analyzer 36 may also determine from a speech
pattern of participant 40 that a question is being asked of "Mary" (e.g.,
if participant 40 addresses "Mary" in his speech).

[0042] In an example embodiment, multipoint manager element 20 may access
information from database 38 and determine that coordinates L.sub.(x, y,
z) correspond to a target location, where an image of target participant
48a is displayed on target screen 15c. Multipoint manager element 20 may
recognize that a target endpoint corresponding to participants 48 is
being displayed on target screen 15c. Multipoint manager element 20 may
determine from incoming signals (received from the target endpoint) that
an image of target participant 48a is located at the target location on
target screen 15c. Multipoint manager element 20 may also identify that
target participant 48a corresponds to Mary. For example, database 38 may
include identities of participants 48. In another example embodiment,
multipoint manager element 20 may employ face recognition methods (e.g.,
using suitable face recognition modules and/or other elements) to
identify individual participants being displayed on target screen 15c, as
well as their relative locations thereon. In an example embodiment, a
face recognition method may include one or more computer applications for
automatically identifying or verifying a person's identity from a video
frame (e.g., from a video source). For example, selected facial features
from the video frame may be compared with facial features stored in a
database (e.g., database 38).
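
A face recognition comparison of the sort described in this paragraph is sketched below in Python, assuming facial features have already been extracted as fixed-length vectors; the cosine-similarity metric, the example vectors, and the threshold are illustrative assumptions.

    import math

    def identify_face(features, known_faces, threshold=0.8):
        """Compare a facial-feature vector extracted from a video frame with
        vectors stored in a database and return the best match, or None."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        best_id, best_score = None, threshold
        for participant_id, stored in known_faces.items():
            score = cosine(features, stored)
            if score >= best_score:
                best_id, best_score = participant_id, score
        return best_id

    known_faces = {"mary": (0.9, 0.1, 0.4), "raj": (0.1, 0.8, 0.5)}
    print(identify_face((0.88, 0.15, 0.42), known_faces))   # -> mary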

[0043] Multipoint manager element 20 may generate any suitable
notification that alerts participant 48a that participant 40 is speaking
to her. As used herein in this Specification, the term `notification`
includes any suitable visual, audio, or textual information. Such
notifications may include a text message (e.g., an instant message), a
blinking light, a colored light, any illumination feature, a muted sound,
a beep, a proprietary sound, a vibration, an icon, a text, a symbol, an
avatar, an e-mail address, a picture, a proprietary graphic, or any other
suitable notification that is conducive to alerting a given participant
in a video conference. Multipoint manager element 20 is also
configured to mix and overlay the notification on an outgoing video
signal, and subsequently send the outgoing video signal to the target
endpoint. The notification may be displayed on one or more screens, which
are visible to target participant 48a.
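
As one illustration of mixing and overlaying a notification onto an outgoing video signal, the Python sketch below draws a simple banner on a single frame; it assumes OpenCV and NumPy are available, and the banner style, placement, and text are assumptions. In practice, the overlay would be applied only to the video sent to the target endpoint.

    import cv2
    import numpy as np

    def overlay_notification(frame, text="Asking Mary a Question"):
        """Draw a banner-style notification onto one outgoing video frame."""
        h, w = frame.shape[:2]
        banner_h = max(30, h // 12)
        # Filled black banner along the top edge, with the notification text on it.
        cv2.rectangle(frame, (0, 0), (w, banner_h), (0, 0, 0), thickness=-1)
        cv2.putText(frame, text, (10, int(banner_h * 0.7)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
        return frame

    # Example on a blank 720p frame standing in for the mixed conference video.
    frame = np.zeros((720, 1280, 3), dtype=np.uint8)
    overlay_notification(frame)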

[0044] FIG. 3B is a simplified block diagram showing another configuration
for the system of the present disclosure. FIG. 3B illustrates the video
conference from a perspective of participants 48. Participants 48 may see
participant 42 displayed on a screen 15d, participant 44 displayed on a
screen 15e, and participant 40 (who is the active speaker in this example
scenario) displayed on screen 15f. Screens 15d-f are visible to
participants 48, including target participant 48a. Multipoint manager
element 20 may facilitate a display of a notification 50 (e.g., text
message, "Asking Mary a Question") on screen 15f. In this example,
notification 50 includes a beep, a blinking icon on screen 15f, and a
text message that alerts target participant 48a that the active speaker
is speaking to her. In certain embodiments, notification 50 may include a
textual rendering of a last sentence spoken by the active speaker.

[0045] In a particular implementation, notification 50 may be displayed on
screen 15f to the exclusion of screens 15d and 15e. In another example
embodiment, notification 50 may be displayed on all three screens 15d-f
simultaneously. Participant 48a may be alerted to the question, and have
an opportunity to respond. When participant 48a responds, she becomes an
active speaker in this paradigm, and the process may be restarted, for
example, by analyzing audio and video signals from the endpoint
corresponding to participant 48a.

[0046] Turning to FIG. 4, FIG. 4 is a simplified flowchart illustrating
example operational activities 100 associated with embodiments of the
present disclosure. The particular flow of FIG. 4 may begin at 102, when
multipoint manager element 20 is activated. In 104, multipoint manager
element 20 may receive video and audio signals from an originating
endpoint corresponding to the active speaker. In 106, gaze/speech
analyzer 36 may analyze the video and audio signals. In 108, a
determination can be made whether a gaze is detected. For example, if the
active speaker is looking at no one particular participant, a gaze may
not be detected, in which case, the process may revert to 104. However,
if the video signals indicate that the active speaker is directing his
gaze to a target location on a target screen, and the target location
corresponds to a target participant, then a gaze may be detected.

[0047] If a gaze is detected, an endpoint relationship may be determined
in 110. As used herein, an "endpoint relationship" encompasses a
relationship between a target location of the gaze on a target screen and
a target participant positioned at the target location. In an example
embodiment, gaze/speech analyzer 36 can be configured to provide (or at
least assist in) the determination. In another example embodiment,
multipoint manager element 20 may independently make the determination.
Multipoint manager element 20 may identify a target screen to which the
gaze is directed, and determine a target endpoint displayed on the target
screen. Coordinates of the target location of the gaze on the target
screen may also be determined. In an example embodiment, gaze/speech
analyzer 36 may return the coordinates of the target location to
multipoint manager element 20 based on video signals from one or more
cameras in the active speaker's conference room. Multipoint manager
element 20 may identify the target participant whose image is positioned
at the coordinates.

[0048] In 114, multipoint manager element 20 may mix and overlay
notification 50 on an outgoing video signal to the target endpoint. In
116, the outgoing video signal may be sent to the target endpoint.
Notification 50 may be displayed on one or more screens visible to the
target participant in 118. The process may end in 120, where similar
operations can be repeated for subsequent flows (e.g., when the active
speaker changes).
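
The flow of FIG. 4 can be summarized, for illustration only, with the stub-based Python sketch below; every helper stands in for a component described earlier in this disclosure, and all names and return values are assumptions.

    def receive_signals():
        # 104: audio/video from the active speaker's originating endpoint.
        return {"audio": [...], "video": [...], "endpoint": "san_jose_1"}

    def detect_gaze(signals):
        # 106-108: gaze/speech analysis; None when no sustained gaze is found.
        return {"screen": "screen_15c", "coords": (0.3, 0.6)}

    def resolve_endpoint_relationship(gaze):
        # 110: target screen + coordinates -> target participant and endpoint.
        return {"participant": "mary", "endpoint": "chicago_2"}

    def send_with_notification(target, text):
        # 114-118: overlay the notification and send to the target endpoint.
        print(f"to {target['endpoint']}: notify {target['participant']}: {text}")

    def handle_active_speaker_turn():
        signals = receive_signals()
        gaze = detect_gaze(signals)
        if gaze is None:
            return                    # 108 -> back to 104 on the next turn
        target = resolve_endpoint_relationship(gaze)
        send_with_notification(target, "Asking Mary a Question")

    handle_active_speaker_turn()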

[0049] In example implementations, at least some portions of the
activities related to alerting a participant in a video conference
outlined herein may be implemented in software in, for example,
gaze/speech analyzer 36 and/or multipoint manager element 20. In some
embodiments, one or more of these features may be implemented in
hardware, provided external to these elements, or consolidated in any
appropriate manner to achieve the intended functionality. MCU 16,
gaze/speech analyzer 36, and/or multipoint manager element 20 may include
software (or reciprocating software) that can coordinate in order to
achieve the operations, as discussed herein. In still other embodiments,
these elements may include any suitable algorithms, hardware, software,
components, modules, interfaces, or objects that facilitate the
operations thereof. In addition, MCU 16 and/or multipoint manager element
20 described and shown herein (and/or their associated structures) may
also include suitable interfaces for receiving, transmitting, and/or
otherwise communicating data or information in a network environment.

[0050] In some example embodiments, one or more memory elements (e.g.,
memory element 34) can store data used for the operations described
herein. This includes the memory element being able to store software,
logic, code, or processor instructions that are executed to carry out the
activities described in this Specification. A processor can execute any
type of instructions associated with the data to achieve the operations
detailed herein in this Specification. In one example, processor 32 could
transform an element or an article (e.g., data) from one state or thing
to another state or thing. In another example, the activities outlined
herein may be implemented with fixed logic or programmable logic (e.g.,
software/computer instructions executed by a processor) and the elements
identified herein could be some type of a programmable processor,
programmable digital logic (e.g., a field programmable gate array (FPGA),
an erasable programmable read only memory (EPROM), an electrically
erasable programmable read only memory (EEPROM)), an ASIC that includes
digital logic, software, code, electronic instructions, flash memory,
optical disks, CD-ROMs, DVD ROMs, magnetic or optical cards, other types
of machine-readable mediums suitable for storing electronic instructions,
or any suitable combination thereof.

[0051] In operation, components in system 10 can include one or more
memory elements (e.g., memory element 34) for storing information to be
used in achieving the operations as outlined herein. These devices may
further keep information in any suitable type of memory element (e.g.,
random access memory (RAM), read only memory (ROM), field programmable
gate array (FPGA), erasable programmable read only memory (EPROM),
electrically erasable programmable ROM (EEPROM), etc.), software,
hardware, or in any other suitable component, device, element, or object
where appropriate and based on particular needs. The information being
tracked, sent, received, or stored in system 10 could be provided in any
database, register, table, cache, queue, control list, or storage
structure, based on particular needs and implementations, all of which
could be referenced in any suitable timeframe. Any of the memory items
discussed herein should be construed as being encompassed within the
broad term `memory element.` Similarly, any of the potential processing
elements, modules, and machines described in this Specification should be
construed as being encompassed within the broad term `processor.`

[0052] Additionally, some of the processors and memory elements associated
with the various network elements may be removed, or otherwise
consolidated such that a single processor and a single memory location
are responsible for certain activities. In a general sense, the
arrangements depicted in the FIGURES may be more logical in their
representations, whereas a physical architecture may include various
permutations, combinations, and/or hybrids of these elements. It is
imperative to note that countless possible design configurations can be
used to achieve the operational objectives outlined here. Accordingly,
the associated infrastructure has a myriad of substitute arrangements,
design choices, device possibilities, hardware configurations, software
implementations, equipment options, etc.

[0053] Note that with the numerous examples provided herein, interaction
may be described in terms of two, three, four, or more network elements.
However, this has been done for purposes of clarity and example only. It
should be appreciated that the system can be consolidated in any suitable
manner. Along similar design alternatives, any of the illustrated
computers, modules, components, and elements of the FIGURES may be
combined in various possible configurations, all of which are clearly
within the broad scope of this Specification. In certain cases, it may be
easier to describe one or more of the functionalities of a given set of
flows by only referencing a limited number of network elements. It should
be appreciated that system 10 of the FIGURES and its teachings are
readily scalable and can accommodate a large number of components, as
well as more complicated/sophisticated arrangements and configurations.
Accordingly, the examples provided should not limit the scope or inhibit
the broad teachings of system 10 as potentially applied to a myriad of
other architectures.

[0054] Note that in this Specification, references to various features
(e.g., elements, structures, modules, components, steps, operations,
characteristics, etc.) included in "one embodiment", "example
embodiment", "an embodiment", "another embodiment", "some embodiments",
"various embodiments", "other embodiments", "alternative embodiment", and
the like are intended to mean that any such features are included in one
or more embodiments of the present disclosure, but may or may not
necessarily be combined in the same embodiments. Furthermore, the words
"optimize," "optimization," and related terms are terms of art that refer
to improvements in speed and/or efficiency of a specified outcome and do
not purport to indicate that a process for achieving the specified
outcome has achieved, or is capable of achieving, an "optimal" or
perfectly speedy/perfectly efficient state.

[0055] It is also important to note that the operations and steps
described with reference to the preceding FIGURES illustrate only some of
the possible scenarios that may be executed by, or within, the system.
Some of these operations may be deleted or removed where appropriate, or
these steps may be modified or changed considerably without departing
from the scope of the discussed concepts. In addition, the timing of
these operations may be altered considerably and still achieve the
results taught in this disclosure. The preceding operational flows have
been offered for purposes of example and discussion. Substantial
flexibility is provided by the system in that any suitable arrangements,
chronologies, configurations, and timing mechanisms may be provided
without departing from the teachings of the discussed concepts.

[0056] Although the present disclosure has been described in detail with
reference to particular arrangements and configurations, these example
configurations and arrangements may be changed significantly without
departing from the scope of the present disclosure. For example, although
the present disclosure has been described with reference to particular
communication exchanges involving certain network access and protocols,
system 10 may be applicable to other exchanges or routing protocols.
Moreover, although system 10 has been illustrated with reference to
particular elements and operations that facilitate the communication
process, these elements and operations may be replaced by any suitable
architecture or process that achieves the intended functionality of
system 10.

[0057] Numerous other changes, substitutions, variations, alterations, and
modifications may be ascertained to one skilled in the art and it is
intended that the present disclosure encompass all such changes,
substitutions, variations, alterations, and modifications as falling
within the scope of the appended claims. In order to assist the United
States Patent and Trademark Office (USPTO) and, additionally, any readers
of any patent issued on this application in interpreting the claims
appended hereto, Applicant wishes to note that the Applicant: (a) does
not intend any of the appended claims to invoke paragraph six (6) of 35
U.S.C. section 112 as it exists on the date of the filing hereof unless
the words "means for" or "step for" are specifically used in the
particular claims; and (b) does not intend, by any statement in the
specification, to limit this disclosure in any way that is not otherwise
reflected in the appended claims.