Interesting translation/transcription for movie subtitles

Moderators, please move this thread to the appropriate subforum (e.g. Development); I'm not sure which one fits. Thank you.

I want to share a little PHP script that produces text with a translation and transcription above every word. It can be useful for those who have just started studying a new language.

It takes subtitles from a movie or cartoon in both languages as input files. Then you need to create a dictionary file by copying and pasting from online translation services. The only difficulty might be manually synchronizing the sentences in both languages so that they match each other in meaning, but that step is optional. Below are the steps needed to create the example above.

1. PHP server. On Windows I used MicroApache (Apache 1.3.41, PHP 4.4.9, about 1 MB) from http://microapache.kerys.co.uk/. Download the latest version with PHP, unpack the zip archive (e.g. into a directory MA on disk C) and run go.bat there. You should then be able to see a page at http://127.0.0.1:8800/. If that fails, try an earlier version. Put 0001.php from the attachment below into Apache's document directory. In our example that can be either on disk C or, as in my case, in the www directory inside MA. If it is in the right place, you should be able to open http://127.0.0.1:8800/0001.php, which will show one or several warning lines in bold. If you use Linux, installing a PHP server shouldn't be a problem for you. I'm not sure about Macintosh and other operating systems, but you can always use a free PHP hosting service.

2. Dictionary file ointer.txt. Download the first subtitles (shown in bold in the picture above). I used subtitles for the cartoon Tarzan and Jane from http://subscene.com/english/Tarzan-and-J…itle-22760.aspx. Extract the subtitle file, rename it to isubt.txt and put it in the www directory. Then run 0001.php by going to http://127.0.0.1:8800/0001.php. If all goes smoothly, you'll find the file oword.txt in www. Copy and paste its content into any online translation service; I used http://translate.google.com/. Save the translation as idict.txt in www. Then do the same to get the transcription file itrans.txt, and save it in UTF-8 encoding in www. I used the free PhoTransEdit 1.7d from http://www.photransedit.com/Desktop/Default.aspx; you can also use their online service at http://www.photransedit.com/Online/Text2Phonetics.aspx. Then run 0001.php again. If everything is OK, you'll find ointer.txt inside www. If not, read the output and edit 0001.php accordingly; it has lots of comments inside. Just open it with Notepad or a similar editor, edit, save and rerun. Words that may not have been transcribed properly (marked with "<" and ">") or translated properly (empty or with capital letters) are placed at the end of ointer.txt to make manual editing of this dictionary file easier.
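For readers who want to see the idea behind step 2 without reading 0001.php, here is a minimal Python sketch. It is not the script from the attachment: the function names are made up for illustration, the subtitles are assumed to be in SRT format, and the three lists are assumed to be aligned line by line (one word, one translation, one transcription per line). The "suspect" rule is only my reading of the description above.

```python
import re

# A word is a run of ASCII letters, optionally with inner apostrophes ("don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def extract_words(srt_text):
    """Return unique lowercase words in order of first appearance."""
    words = []
    seen = set()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines ("00:00:01,000 --> ...").
        if not line or line.isdigit() or "-->" in line:
            continue
        line = re.sub(r"<[^>]+>", "", line)  # drop markup such as <i>...</i>
        for word in WORD_RE.findall(line):
            word = word.lower()
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words


def merge_dictionary(words, translations, transcriptions):
    """Combine three aligned lists into (word, translation, transcription)
    entries, with suspicious entries moved to the end for manual review."""
    good, suspect = [], []
    for word, translation, transcription in zip(words, translations, transcriptions):
        entry = (word, translation.strip(), transcription.strip())
        # Assumed heuristic: an empty translation or one containing capital
        # letters was probably not translated; "<" or ">" marks a failed
        # transcription.
        untranslated = not entry[1] or any(c.isupper() for c in entry[1])
        untranscribed = "<" in entry[2] or ">" in entry[2]
        (suspect if untranslated or untranscribed else good).append(entry)
    return good + suspect
```

The word list would be what you paste into the translation and transcription services, and the merged entries could then be written out one per line, for example tab-separated.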

3. Second subtitles file synchronized with the first (optional). You can see these subtitles in the right column of the picture above. Run 0001.php. You'll find the file osent.txt in the www directory. Then download the second subtitles. I got them from http://subtitry.ru/subtitles/437324884/?tarzan-jane. Extract the file, rename it to ilocsub.txt and put it into the www directory. Run 0001.php. You should find the file olocsen.txt there. Then I used MS Office to synchronize them; I believe it's possible to use LibreOffice or similar instead. Copy the content of osent.txt into the first column of an Excel spreadsheet, then the content of olocsen.txt into the second column. Copy these two columns into MS Word. Next I recorded two macros there and assigned shortcuts to them: the first inserts a cell into the table (Ctrl+Insert) and the second removes one (Ctrl+Delete). It took me a couple of hours to match every line (sentence) of the right column (second subtitles) with the relevant line (sentence) of the left one (first subtitles). I did it by inserting or removing cells in the right column, or by moving words from one cell of that column to the cell above or below, to better match the translation (second subtitles) without changing the original (first) subtitles. Then I copied the right column, pasted it into Notepad and saved it as isynsen.txt in the www directory. This could be rather difficult if you're not familiar with both languages. In that case you can try translating with online translators, or just leave isynsen.txt empty.

4. Final file opage.html. Run 0001.php. If everything is OK, you'll find opage.html in the www directory. Go to http://127.0.0.1:8800/opage.html or drag the file into your browser. You can then change the dictionary file ointer.txt, or the constants at the beginning of 0001.php, to tune the content or layout of opage.html, and run 0001.php again to apply the changes.
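To give an idea of how such a page can be put together, here is a short Python sketch. It is not the code of 0001.php and does not reproduce its layout. It assumes a dictionary that maps a lowercase word to a (translation, transcription) pair, and it uses the HTML <ruby>/<rt> elements to place the annotation above each word, which is only one possible approach. The function names are hypothetical.

```python
import html
import re

# Capturing group, so re.split keeps the words at odd positions in the result.
WORD_SPLIT = re.compile(r"([A-Za-z]+(?:'[A-Za-z]+)*)")


def annotate_sentence(sentence, dictionary):
    """Return HTML in which every known word carries its gloss above it."""
    out = []
    for i, part in enumerate(WORD_SPLIT.split(sentence)):
        # Odd indices are words; even indices are the text between them.
        entry = dictionary.get(part.lower()) if i % 2 == 1 else None
        if entry is None:
            out.append(html.escape(part))
        else:
            translation, transcription = entry
            gloss = html.escape(transcription + " " + translation)
            out.append("<ruby>" + html.escape(part) + "<rt>" + gloss + "</rt></ruby>")
    return "".join(out)


def render_page(sentences, dictionary, translations=()):
    """Build a two-column table: annotated original on the left, the aligned
    translated sentence (if there is one) on the right."""
    rows = []
    for i, sentence in enumerate(sentences):
        # A missing or shorter translation list simply leaves the cell empty.
        right = translations[i] if i < len(translations) else ""
        rows.append(
            "<tr><td>" + annotate_sentence(sentence, dictionary)
            + "</td><td>" + html.escape(right) + "</td></tr>"
        )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head>\n'
        "<body><table>\n" + "\n".join(rows) + "\n</table></body></html>\n"
    )
```

Passing no translations corresponds to the case where the synchronized file from step 3 is left empty: the right column is still there but blank.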

All these files are in the attachment below. I tested the script with these subtitles only. I now realize that I should have used JavaScript instead of PHP, because then you don't need a server, just any modern browser on any operating system.

This project idea is already in the right place, so no moving is needed.

Very interesting use of available technologies to bring language learning to anyone interested in expanding their global vocabulary! I'm now wondering how much of this can be automated so that visitors to an equipped website can choose to enable it for any page they wish, regardless of source language, and expanding it into any target language they desire to learn.

It's quite possible, but a lot of work. With transcription the situation is a bit easier, since the pronunciation of a word is in most cases the same (it actually depends on the language). A database that provides a single transcription for every word would be enough. A more difficult and more precise solution would be an algorithm (e.g. from an open source project) that picks the transcription depending on the previous words in the sentence.

Translation is more difficult because one word has several translations (noun, adjective, figurative meaning in set phrases, etc.) depending on context. The solutions are the same. The easier and quicker option, though lower in quality, is a database with a single translation per word. A better result comes from an algorithm that chooses the correct translation depending on the context of the sentence.
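To make the difference between the two options concrete, here is a toy Python sketch. The sense table, the cue words and the function name are invented for illustration, and counting overlapping cue words is a deliberately naive heuristic, not how a real translation engine chooses a meaning.

```python
import re

# Hypothetical sense table: each word maps to a list of
# (translation, cue words) pairs, with the default sense listed first.
SENSES = {
    "bank": [
        ("банк", {"money", "account", "loan"}),
        ("берег", {"river", "water", "shore"}),
    ],
}


def pick_translation(word, sentence, senses):
    """Pick the translation whose cue words overlap most with the sentence.

    With no overlap at all this falls back to the first listed sense, which
    is exactly the "single translation per word" option.
    """
    candidates = senses.get(word.lower())
    if not candidates:
        return None
    context = set(re.findall(r"[a-z']+", sentence.lower())) - {word.lower()}
    # max() keeps the first candidate on ties, so the default sense wins
    # when nothing in the sentence points elsewhere.
    best = max(candidates, key=lambda sense: len(context & sense[1]))
    return best[0]
```

For example, "He sat on the river bank" selects "берег" because of the cue word "river", while a sentence with no cue words gets the default "банк".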

I suggest contacting the developers of PhoTransEdit or a similar project, since they have more expertise in that area.