Status of this proposal

This charter proposal was approved by the IDPF membership in January 2012.

Need for this proposal

Dictionaries, glossaries, thesauri, and similar works are ubiquitous published resources that users expect to have available in the EPUB3 ecosystem. The primary use of a dictionary or glossary from a user point of view is the ability to search for a term and quickly retrieve its definition or translation. Currently, EPUB has no mechanism for an author to mark up the needed semantic information to enable such reading system search features, making it impossible to publish a dictionary in EPUB that serves its primary purpose. While EPUB-based reading systems often bundle dictionaries with devices and offer a word lookup feature, this is achieved by storing the dictionary in a proprietary format and essentially treating it as part of the reading system software, rather than an independent publication.

The current situation does not allow users to choose the dictionary content that best suits their needs, and instead limits them to using a single bundled dictionary. Publishers of EPUB3 content wish to make a broad range of reference resources available to users and to serve needs that cannot be met by a general monolingual dictionary typically bundled with a reading system: children need dictionaries designed for their reading level, language learners need dictionaries that translate from a foreign language to their native tongue, and users reading material in fields such as medicine and law need dictionaries covering a broad specialized vocabulary. Publishers also wish to offer users the ability to look up words in a publication's glossary while reading, thereby enhancing the user's experience of educational and other types of content.

Reading system developers wish to utilize and innovate around these types of publications.

This proposal describes the scope, required functionality, and timeline to deliver a standard for producing EPUB3 Publications that meet the use cases that are also included in this proposal.

Scope

In-scope (Deliverables)

The scope of this project is to define a declarative mechanism for the representation of dictionaries and glossaries in EPUB Publications sufficient to enable development of reading system features specific to these publication types. As further detailed in Use Cases and Needed Publication Properties below, the delivered mechanism shall have the following top-level functional properties:

Allow publishers to embed a glossary in a publication, and allow users to look up terms in the glossary local to that publication

Allow publishers to make a wide range of monolingual dictionaries, bilingual dictionaries, and thesauri available to consumers

Allow users to select text for lookup while reading, and to obtain results from their preferred resources that are appropriate to the language context of the text selected

Allow publication authors mechanisms to provide all semantic information necessary for reading system support of these features

Allow reading system developers to innovate around methods for users to set their preferred resources for dictionary lookup (eg, by reading level, by language, preference ordering, etc.)

Out of Scope

Low-level, system-oriented functionality for fast lookup and retrieval, typically described in terms of a database-like index.

Development of a full-fledged authoring language for dictionaries.

Integration Constraints

The defined mechanism shall integrate with EPUB 3 as follows:

Graceful fallback: it must allow EPUB 3 Reading Systems to open and reasonably render Publications containing the mechanism, even if the Reading System has not been updated to explicitly support the mechanism.

Native grammars and extension points: it must utilize EPUB 3 Content Document grammars to the maximum extent possible, and it must only use extension points defined within EPUB 3 and XHTML 5.

Shallow implementation: Reading System implementation of the mechanism must not require changes to underlying (browser-based or other) XHTML rendering engines; full implementations must be possible on the Reading System level alone.

Timeline and Participation

Project participation is open to IDPF members and invited experts.
(Note that invited expert status needs to be renewed for each IDPF project.)

The project charter spans one year in total. Once formed, the working group will decide on feature prioritization and possibly also versioning strategies, after which the milestones below can be dated.

Draft Charter Proposal to WG for review: December 2, 2011

Submission to Membership for Approval: January 6, 2012

WG creation, formal project start: January 23, 2012

WG Face-to-face: Feb timeframe, TBD

First WG Draft: TBD

Second WG Draft: TBD

Proposed Specification: TBD

Recommended Specification: TBD

Maintenance/Tutorials: Through Jan 2013

This project is intended to be run concurrently with the project on indexes, and so shares the charter span with that project.

Working Group Leads

Suggested Leads of this working group are:

Jeff Alexander, Intangible Press (Co-Chair)

Daniel Hughes, Liguori Publications (Co-Chair)

Use Cases

Actors: publishers, users

A user opens and interacts with a dictionary publication as they would any other EPUB3 publication.

A user selects text in a publication and performs a lookup on that text, expecting to access entry content in the same language as the selected text.

A user selects text in a publication and performs a lookup on that text, expecting to access bilingual entry content offering translations of the selected text into another language.

A user selects text in a publication and performs a lookup on that text, expecting to access multilingual entry content offering translations of the selected text into two or more other languages.

A user sets the default level of detail returned or displayed from dictionary resource entries.

A user sets a default level of detail returned to a screen reader or other assistive technology (AT) for the purposes of 'skimming'.

A user references a term located in a specific resource for the purposes of citation, annotation, or sharing.

A publisher indicates that an EPUB3 publication should prefer to use embedded entries, e.g. a glossary local to the publication.

A publisher indicates that an EPUB3 publication should prefer to use a specific stand-alone dictionary resource, if available.

System: reading system, content

A publication contains an embedded glossary or dictionary, and declares this resource via package metadata.

A publication is a stand-alone dictionary and declares this in package metadata.

A publication contains headwords in one or more languages, and declares the available languages and relationships through package metadata.

A publication indicates that it has an intended audience, such as users within a certain age range, or students in a specific grade level.

Needed Publication Properties

Package Metadata

A publication contains an embedded glossary and declares this resource via package metadata.

A publication is a monolingual dictionary, bilingual dictionary, multilingual dictionary, or thesaurus, and declares this via package metadata.

A publication is a bilingual or multilingual dictionary and declares the languages it covers and the relationships between them via package metadata.

If a bilingual dictionary publication only consists of a French-to-English translation direction (ie, a collection of entries containing French headwords with corresponding English translations), it would need to declare one relationship: French as the source language and English as the translation language. Such a publication contains one dictionary resource.

If a bilingual dictionary publication contains both French-to-English and English-to-French translation directions (ie, both a collection of entries containing French headwords and their English translations, as well as a collection of entries containing English headwords and their French translations), then two different relationships need to be declared and distinguished in package metadata. A reading system must be able to determine that there are two dictionary resources present and to distinguish between them.

A publication has an intended audience of students at specific grade levels, users in a certain age range, users who are learners (non-native speakers) of a language, or users with a specified level of vocabulary, and declares this audience via package metadata. It should be possible to combine these audience descriptors and to specify multiple combinations; for example, a dictionary for students ages 9 to 12 could also be appropriate for adults in a literacy program at a similar reading level.
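
The language relationships and audience combinations described above can be illustrated with a minimal sketch. It is not a proposed metadata vocabulary: the class names, field names, and the "adult literacy" label are all hypothetical, chosen only to show that a bidirectional dictionary yields two distinguishable resources and that a publication can declare several alternative audience combinations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DictionaryResource:
    """One collection of entries and its declared language relationship."""
    source: str       # language of the headwords, e.g. "fr"
    translation: str  # language of the translations, e.g. "en"


@dataclass(frozen=True)
class Audience:
    """One combination of audience descriptors; None means 'not specified'."""
    age_range: Optional[Tuple[int, int]] = None  # inclusive (min, max)
    program: Optional[str] = None                # hypothetical free-text label

    def matches(self, age: int, program: Optional[str] = None) -> bool:
        # Every descriptor that is specified must be satisfied.
        if self.age_range is not None:
            low, high = self.age_range
            if not low <= age <= high:
                return False
        if self.program is not None and self.program != program:
            return False
        return True


def resources_for(resources: List[DictionaryResource], language: str):
    """Return the resources whose headword language matches the selected text."""
    return [r for r in resources if r.source == language]


def suits(audiences: List[Audience], age: int, program: Optional[str] = None) -> bool:
    """A publication suits a user if any one declared combination matches."""
    return any(a.matches(age, program) for a in audiences)


# A bidirectional French/English dictionary declares two relationships.
resources = [
    DictionaryResource(source="fr", translation="en"),
    DictionaryResource(source="en", translation="fr"),
]
# Students aged 9 to 12, or adults in a literacy program.
audiences = [Audience(age_range=(9, 12)), Audience(program="adult literacy")]

print(resources_for(resources, "fr"))
print(suits(audiences, age=35, program="adult literacy"))
```

Modeling each translation direction as its own record is one way to let a reading system pick the resource whose source language matches the selected text; an actual specification might express the same relationships quite differently.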

Entry Structure

A dictionary or glossary entry is both content in the publication flow that can be accessed as any other EPUB content, and syndication content that can be accessed by a reading system for presentation to a user who is reading a different publication.

The delivered mechanism must allow authors to clearly mark the boundaries of entries to enable syndication-type access to entry content during lookup.

Authors must be able to identify an entry's domain so that all similar entries form a collection, or dictionary resource. If two or more dictionary resources are available in a publication, a reading system must be able to identify the collection of entries comprising each resource, and associate that collection with relevant information in package metadata.

Authors should also be able to distinguish different types of entries that may not necessarily qualify as separate dictionary resources as discussed above. For instance, monolingual dictionaries in some languages (notably French) distinguish between proper nouns and all other headwords, and group them into separate sections. There should be a mechanism to distinguish such entry types within a publication.

Authors should be able to nest an entry within another entry. This capability is not necessarily required for an initial dictionary implementation, but can help enable features that actors may choose to offer. For example, if a reading system developer decides to offer a feature to let users search the examples in a dictionary, a publisher could enable such a search by marking up each entry's examples as nested entries with appropriate semantics.
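
The grouping of entries into dictionary resources described above can be sketched as follows. This is only an illustration under stated assumptions: the `resource_id` key, the sample headwords, and the `package_metadata` dictionary are hypothetical stand-ins for whatever identification and metadata mechanism a specification might define.

```python
from collections import defaultdict

# Each entry carries a hypothetical `resource_id` identifying its domain.
entries = [
    {"resource_id": "fr-en", "headword": "chat"},
    {"resource_id": "en-fr", "headword": "cat"},
    {"resource_id": "fr-en", "headword": "chien"},
]

# Stand-in for information declared in package metadata, keyed by the same id.
package_metadata = {
    "fr-en": {"source": "fr", "translation": "en"},
    "en-fr": {"source": "en", "translation": "fr"},
}


def group_entries(entries):
    """Collect headwords into one list per resource id, in document order."""
    collections = defaultdict(list)
    for entry in entries:
        collections[entry["resource_id"]].append(entry["headword"])
    return dict(collections)


collections = group_entries(entries)
for resource_id, headwords in collections.items():
    # Associate each collection with its package-level declaration.
    print(resource_id, package_metadata[resource_id], headwords)
```

Sharing one identifier between the entries and the package-level declaration is what lets a reading system tie each collection back to its declared languages.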

Headwords and Inflections

The delivered mechanism must allow publication authors rich markup capabilities for headwords (both main entry headwords and other types such as alternate, variant, and run-on headwords) and their associated inflections sufficient to enable both lookup and text entry searches. The following types of headword and inflection markup should be supported:

The normalized spelling of a headword, as opposed to the form in which it may be displayed in the entry. For example, a headword that may be represented in the entry's content as 'in*val*id' or 'invalid1' needs to be marked up as 'invalid' for search purposes.

A form of a headword for display in search results, which may be distinct both from the displayed form in entry content and the normalized spelling matched during a search. For instance, a publication author may wish to specify that a headword displayed in entry content as 'in*val*id1' should display as 'invalid1' or 'invalid1 (adjective)' in a list of matching results presented after a text entry search. This distinction is more critical for some languages than others; for example, in Japanese dictionaries numerous entries may have the same kanji (ideographic) form but distinct spellings in the hiragana (syllabic) writing system. In an electronic Japanese dictionary's search results, these forms are typically combined in ways not reflected in entry content.

Inflected forms of a headword. In print dictionaries, inflections generally do not have their own entries. In electronic dictionaries, however, inflections users may look up while reading should be associated with their root headwords and stored as non-viewable content. For example, when a user performs a lookup of an inflected form of the verb "run" such as "ran", "running", or "runs", a reading system must be able to match the dictionary entry "run"; in other words, the root form "run" and its inflections "ran", "running", and "runs" should be treated as equivalent for lookup purposes.

As there are multiple types of headwords that can occur in dictionary entries, the delivered mechanism should allow for the possibility of multiple headword/inflection combinations located at any point within an entry.
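
The relationship between displayed headwords, normalized spellings, and inflections can be sketched as a small lookup index. This is a sketch under assumptions, not a prescribed algorithm: it treats '*' as a syllable-break marker and trailing digits as homograph numbers, following the 'invalid' example above, and the function names and sample data are hypothetical.

```python
import re


def normalize(displayed: str) -> str:
    """Reduce a displayed headword to a search key.

    Assumes '*' marks syllable breaks and trailing digits are homograph
    numbers; real content may use other markers.
    """
    key = displayed.replace("*", "")
    key = re.sub(r"\d+$", "", key)
    return key.strip().lower()


def build_index(entries):
    """Map every normalized headword and inflection to its root headword(s)."""
    index = {}
    for headword, inflections in entries.items():
        root = normalize(headword)
        for form in [headword, *inflections]:
            index.setdefault(normalize(form), set()).add(root)
    return index


# Displayed headword -> inflections stored as non-viewable content.
entries = {"run": ["ran", "running", "runs"], "in*val*id1": []}
index = build_index(entries)

print(index["ran"])      # {'run'}
print(index["invalid"])  # {'invalid'}
```

Storing a set of roots per key allows one form to match several entries, which matters when homographs or shared inflections occur.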

Other Semantic Markup

The delivered mechanism should provide a vocabulary for markup of both required and optional dictionary semantic concepts. The latter group should cover at least important semantic elements of entry content that developers and publishers may choose either to utilize in certain reading system features (such as a narrowly-focused idiom or example search) or to suppress in certain contexts (for example, an etymology in a pop-up window on a mobile phone).

Structure and Semantics

N.B.: The following terms are representative of the range of lexical and semantic qualities that will be needed to support stated use cases and also allow for innovation. For the purposes of this charter proposal to initiate a working group, these terms are not intended to be interpreted as a strict requirement for inclusion into a specification.

Definitions

affix

A prefix, infix, or suffix that is attached to another form to make a word with a distinct meaning, eg, laugh + ed. (1)

alternate headword

A form related to a primary headword but generally carrying a somewhat different meaning. For example, an entry with the primary headword aestivate might have aestivation as an alternate headword. An alternate headword should be indexed for search purposes along with the primary headword.

antonym

Terms with opposite sense or meaning.

audio pronunciation

An audio file containing a recording of the pronunciation of a particular headword. This feature of many electronic dictionaries can be offered in addition to or in place of the traditional written pronunciation.

case

An inflection of a noun, adjective, or pronoun according to its function in a sentence. German, Russian, and Latin are examples of languages in which words have many different written forms according to case.

date

definition

dictionary resource

A collection of entries that have headwords in a particular source language and that a reading system can access to look up terms a user selects while reading a publication.

displayed inflection

An inflection of a headword that is part of the viewable content of an entry. Irregular inflections are often explicitly printed in entries to provide guidance to the user, eg, the displayed inflection "mice" in "mouse noun, plural mice".

entry

equivalence

A statement that a headword or particular sense of a headword is equivalent in meaning to another dictionary headword, typically supplied in lieu of a definition and acting as a cross-reference to the equivalent entry cited. An example would be a short entry for color in a British English dictionary that informs the user this is a US equivalent of colour: 'color noun (US) = colour'.

etymology

An explanation of the historical origin of a headword, eg, a statement that it is derived from a particular Latin word.

example

A sentence or phrase illustrating the usage of a headword in a particular sense.

gender

A label indicating the gender of a noun, generally subsumed in part-of-speech at the beginning of an entry; in bilingual dictionaries, often a stand-alone label associated with a particular translation.

glossary

A glossary section of a publication that a reading system can access to look up a term a user selects while reading that particular publication.

headword

The word occurring at the start of an entry whose meanings the entry covers; in a broader sense, a word whose meanings are discussed at any point in the entry (see alternate headword, variant headword, run-in headword, and run-on headword). In a monolingual dictionary or glossary, the headword is defined, while in a bilingual dictionary the headword is translated, and in a thesaurus synonyms are provided. In most languages, entries are arranged alphabetically according to the spelling of the headword.

holonym

A relation between a whole and a part, eg, a wiki is a holonym of constituent wiki pages; 'has-parts'.

hypernym

A relation between a class and sub-class; 'has-types'.

hyponym

A relation between a sub-class and a class; 'is-type-of'.

idiom

An idiomatic expression that is defined or translated in an entry. For example, an entry for cold might contain the idiom 'to get cold feet'.

inflection

An affixed form of a headword that conveys a specific grammatical meaning; for example, the past tense of a verb (eg, 'ran' is an inflection of 'run') or plural form of a noun (eg, 'mice' is an inflection of 'mouse'). Related to the concept of stemming in indexes.

lookup

A search for a user-selected term in dictionary or glossary headwords (including alternate, variant, run-on, and run-in headwords) and inflections. When a user initiates a glossary lookup, the reading system should search the local publication's embedded glossary; when a user initiates a dictionary lookup, the reading system should search the user's preferred resources. Matching glossary or dictionary entries are then displayed to the user, typically in a pop-up window.

meronym

A relation between a part and a whole, eg, a wiki page is a meronym of a wiki; 'is-a-part-of'.

quotation

A quotation from a cited source illustrating the usage of a headword in a particular sense.

part-of-speech

phrasal headword

A headword of two or more words typically formed from another headword and listed within that headword's entry. For example, the items 'get out' and 'get up' listed in the entry for 'get' would be phrasal headwords.

pronunciation

register label

regional label

A label indicating geographic range of a headword or sense, eg, Latin America, Western US, Australia.

run-in headword

A headword occurring in the middle of an entry, generally associated with a particular sense.

run-on headword

A headword occurring at the end of an entry and that is derived from that entry's headword. For example, the adverb softly at the end of the entry for the adjective soft would be a run-on headword.

sense

A particular meaning of a headword, and a unit for organizing information pertaining to this meaning. Sense units are typically distinguished from one another by numeric and/or alphabetic labels.

sense label

A short phrase that restricts and clarifies the meaning of a particular sense.

source language

The language of the term(s) which a user wishes to look up; in bilingual dictionaries, the language of the headwords in a section of the publication.

stylistic label

A label identifying stylistic usage of a headword or sense, eg, literary.

subject label

A label indicating subject area of a headword or sense, eg, biology, architecture.

synonym

Terms with identical or similar meanings. Groups of synonyms are often tied to a particular sense of a headword in a thesaurus or dictionary.

temporal label

A label indicating current usage status of a headword or sense, eg, archaic.

tense

An inflected form of a verb that indicates when the action is taking place.

text entry search

A feature by which a user can directly input text into a search field and select entries with matching headwords from a list. Reading system developers could implement such a feature in a variety of ways, depending on their preference: by displaying matching results only after the user has input a full string and launched the search, or displaying partial matches as the user types, or positioning a highlight in a scrollable, complete list of dictionary headwords (to cite just a few possibilities).

translation language

usage section

A note providing usage information on a headword, or a more extensive section covering the difficult and confusing aspects of a particular headword's usage.

variant headword

An alternative spelling of a primary headword that carries the same meaning and that should be treated as of equal rank to it for search purposes. For example, an entry with the primary headword kabbalah could have numerous variant headwords: 'kabbalah also kabbala or kabala or cabala or ...' (2)

voice

A relationship between the subject and object of a verb that is either active or passive.