Fully searchable and indexed database of all materials across all languages

Search field should be able to handle terms & properly parse “phrases”
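
As a rough illustration of the parsing requirement above, here is a minimal Python sketch that splits a search string into bare terms and quoted phrases. The function name `parse_query` and the returned dictionary layout are assumptions made for this example, not part of any existing code. It accepts both straight and curly quotes, and treats an unbalanced quote as ordinary terms.

```python
import re

# Either a quoted phrase (straight or curly quotes) or a run of non-space characters.
_TOKEN = re.compile(r'["“]([^"”]+)["”]|(\S+)')


def parse_query(query):
    """Split a search string into bare terms and quoted phrases (hypothetical helper)."""
    terms, phrases = [], []
    for phrase, term in _TOKEN.findall(query):
        if phrase:
            # Collapse internal whitespace so "a   b" and "a b" are the same phrase.
            phrases.append(" ".join(phrase.split()))
        else:
            # Drop stray quote characters left over from unbalanced quoting.
            cleaned = term.strip('"“”')
            if cleaned:
                terms.append(cleaned)
    return {"terms": terms, "phrases": phrases}
```

For example, `parse_query('Venus “resource based economy” Fresco')` separates the two single terms from the one phrase, which can then be sent to the search backend as different query types.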

Input accepts SRT, plain text, HTML, and perhaps other formats that can be parsed & represented internally
Unicode/UTF-8 conversion upon upload, as needed
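
The two input requirements above could be sketched roughly as follows: decode the uploaded bytes to Unicode, then parse SRT blocks into an internal list of cues. The names `decode_upload`, `parse_srt` and `Cue` are hypothetical, and the Windows-1252 fallback is only an assumption about what legacy uploads might look like; a real implementation would probably need proper encoding detection.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    index: int
    start: str  # e.g. "00:00:01,000"
    end: str
    text: str


_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}[,.]\d{3})"
)


def decode_upload(raw: bytes) -> str:
    """Decode uploaded bytes as UTF-8; fall back to Windows-1252 (assumed fallback)."""
    try:
        return raw.decode("utf-8-sig")  # utf-8-sig also strips a leading BOM
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def parse_srt(text: str) -> List[Cue]:
    """Parse SRT text into cues, skipping blocks that have no timing line."""
    cues = []
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if not match:
                continue
            # Use the counter line above the timing line if present, else count cues.
            if i > 0 and lines[i - 1].strip().isdigit():
                index = int(lines[i - 1])
            else:
                index = len(cues) + 1
            body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
            cues.append(Cue(index, match.group(1), match.group(2), body))
            break
    return cues
```

Keeping the start time on each cue is what later lets a search hit be linked to a specific moment in the video.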

Output provides the title of each material, linked to the specific passage or time-stamped video position, as well as access to the overall (non-located) material.
Output ties into the Glossary described below, so that terms & phrases found in the Glossary are underscored with a dashed line and produce their Glossary definition in a tooltip when hovered over with the cursor or otherwise selected (e.g. touchpad, touchscreen, clicked, etc.). Tooltip should disappear when no longer hovered or if any other area of the screen is selected/activated.
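
One possible way to produce the Glossary markup described above is sketched below in Python. It wraps each Glossary term found in a plain-text result with a `<span>` whose `title` attribute holds the definition, which browsers show as a basic tooltip on hover. The function name `annotate_glossary`, the CSS class `glossary-term` and the dictionary-based glossary are all assumptions for illustration; the dashed underline would come from a stylesheet rule on that class, and touch or click behaviour would need additional front-end code that is not shown here.

```python
import html
import re


def annotate_glossary(text, glossary):
    """Wrap glossary terms in plain text with tooltip-bearing <span> tags (sketch).

    `glossary` maps a term or phrase to its definition. Matching is
    case-insensitive and restricted to whole words.
    """
    if not glossary:
        return html.escape(text)
    lookup = {term.lower(): definition for term, definition in glossary.items()}
    # Longest entries first, so a phrase wins over a single word contained in it.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        definition = lookup[match.group(1).lower()]
        parts.append(
            '<span class="glossary-term" title="{}">{}</span>'.format(
                html.escape(definition, quote=True),
                html.escape(match.group(1)),
            )
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Doing the replacement in a single regex pass avoids accidentally annotating words inside a definition that has already been inserted.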

Note: Output should not include the term's or phrase's actual Glossary listing. The LTI Glossary is a reference guide for translators, proofreaders & supporters developed from RBE materials, rather than a separate RBE material itself.
Could/should be expanded to provide catalog lookup by subject (e.g. Transportation, Education, Bio-engineering, Cybernetics, Artificial Intelligence, Biology, Brain, etc.) to facilitate further learning of a given topic. This could be linked to the member portal’s Weblinks feature or an advanced development of that.
Perhaps even displaying related materials in order of their comprehensiveness and/or the best order of study for learning (i.e. start with this, then move on to that, etc.).

Re: Fully searchable and indexed database of all materials across all languages

As the overall RBE community has grown, it has become increasingly apparent that the needs & integrations for this project have shifted: rather than being a separate app, it should be integrated within our custom PMS (currently in alpha). The PMS is presently designed to track the movement of materials through transcription, proofreading, translation & final review before release (i.e. who is doing what, and how far along each project is across all languages).

The PMS is also undergoing a major expansion of scope, as we are now looking to have it also handle automation integration across as many of the global RBE resources as we can stuff into it. I will soon create a new thread describing this much-needed integration and attempting to pull together all related development projects & support info toward that end goal.

I've been investigating the use of Elasticsearch for this purpose. I took it on a while ago but my time has been dominated by other stuff so not much happened other than proving the concept.

Anyway, a progress update. I've finally managed to write something (in Perl, because that's what I'm used to lately as a result of my day job). The script reads the tables of official videos from http://wiki.linguisticteam.org/w/Video_Repository and imports their metadata and their English subtitles (focusing on English for the time being) into a virtual machine (Ubuntu, 1 GB RAM, 1 CPU) running Elasticsearch (currently on my laptop only). There is still work to do to clean up some of the scraped content, amongst other bigger todos, but I wanted to check the search functionality on a database containing more than 2 videos and their subtitles (which is all the proof of concept consisted of!), so it'll do for now. The subtitles are stored as attachments. A preliminary search on keywords such as RBE, creativity, behaviour, humanity, Venus, Fresco and Zeitgeist (using a JSON-aware front-end interface for Elasticsearch - http://sense.qbox.io/gist) yields promising results from the title, description and file contents (along with the timestamp at which the word is found).

An index (equivalent to a database) containing data for 158 videos is taking up 15.3 MB of disk space. This hasn't been tested for performance or optimised but "does the job" for a prototype!
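
For anyone wanting to try a similar keyword search, the request body for such a query can be sketched in Python as below. This is not the prototype's actual query. The field names `title`, `description` and `subtitles` are placeholders (the real field names depend on how the import script and the attachment mapping are set up), and `keyword_query` is a hypothetical helper. The `multi_match` query and the `highlight` section are standard parts of the Elasticsearch query DSL.

```python
import json


def keyword_query(keywords, fields=("title", "description", "subtitles"), size=10):
    """Build a search request body matching keywords across several fields (sketch)."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keywords,
                "fields": list(fields),  # placeholder field names
            }
        },
        # Ask Elasticsearch to return the matching snippets for each field.
        "highlight": {"fields": {field: {} for field in fields}},
    }


if __name__ == "__main__":
    # The printed JSON can be pasted into a JSON-aware front end and sent to
    # the index's _search endpoint.
    print(json.dumps(keyword_query("Venus Fresco"), indent=2))
```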

Running Work List
2014-11-22: Targets for next time:

*DONE* Tidy up import script and put on GitHub

*DONE* Document how to set up prototype

How to write phrase queries

2014-11-30: Targets for next time:

*DONE* How to write phrase queries (carried over from last time due to other tasks taking longer than expected)
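
As a rough sketch of what a phrase query can look like (not necessarily how the prototype does it), the Python below builds Elasticsearch request bodies using `multi_match` with `"type": "phrase"`, which requires the words to appear together and in order. The helper names and the field names `title`, `description` and `subtitles` are assumptions made for this example.

```python
DEFAULT_FIELDS = ("title", "description", "subtitles")  # placeholder field names


def phrase_clause(phrase, fields=DEFAULT_FIELDS, slop=0):
    """One query clause that matches the words of `phrase` in order.

    `slop` is the number of extra positions allowed between the words;
    0 means an exact phrase.
    """
    return {
        "multi_match": {
            "query": phrase,
            "type": "phrase",
            "slop": slop,
            "fields": list(fields),
        }
    }


def combined_query(terms, phrases, fields=DEFAULT_FIELDS):
    """Build a request body requiring every phrase plus a match on the loose terms."""
    must = [phrase_clause(p, fields) for p in phrases]
    if terms:
        must.append(
            {"multi_match": {"query": " ".join(terms), "fields": list(fields)}}
        )
    if not must:
        # Nothing to search for: return everything.
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"must": must}}}
```

For instance, `combined_query(["Fresco"], ["resource based economy"])` yields a `bool` query with two `must` clauses: one for the exact phrase and one for the single keyword.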

The intention is to provide the entire world with a single place to look up a variety of things, and get the results back in a variety of formats. For example:

In what materials can I find the following keywords? Give them back to me with the timestamps or page numbers included.

How far along is the Arabic translation of [any project]? And tell me how I can join the effort.

How many people are currently part of the Serbian Team? And tell me how to join.

Where can I find the official public distribution of [any project]? And tell me if there is one specific to my language.

Does the [any language] Team have an active Twitter (or Facebook, etc.) account? Take me there so I can subscribe to it.

etc., etc., etc.

Consider that every team will have its own distribution and announcement channels (YouTube, Facebook page, Twitter account, etc.) along with their own set of team resources (team glossary, progress report spreadsheet, etc.), so these kinds of questions will be drawing on our record of their locations, as well as how the team relates to each individual project (tying into a master PMS that tracks the progress of each project across all teams).
Many of these resources already exist to some degree for some of the teams, but the system will obviously require consistency across the teams for such an approach to work smoothly.

Current Status

Search prototype has been upgraded to use Ubuntu 17.04 and Elasticsearch 5.4.

An index (equivalent to a database) containing data for 304 videos takes up 40.3 MB. This hasn't been tested for performance or optimised, but it does the job for a prototype.