In this guide I am going to show you how to rip (extract) subtitles and their timings from video files (VOB) or DVDs and save them as a plain text file, the .srt file. This way you get selectable subtitle files that you can use alongside a ripped DivX/XviD movie (which you legally own), and your media player can format them to your needs. Moreover, you can merge them into containers like mkv so that you have audio, video and subtitles in a single file.

Until now, doing this in Linux meant using a few different CLI (Command Line Interface, e.g. the Linux console) programs and typing many commands. Here, however, I will use OGMRip, which lets us extract the subtitles easily. Its main purpose is to rip and encode DVDs into avi, ogm, mp4 or matroska files, but with a little "trick" we will rip just the subtitles without having to rip the movie as well.

OGMRip can be used with 3 different OCR (Optical Character Recognition) engines: gocr, ocrad or tesseract. Which one you get depends on how you configure and compile it, or on how it has been compiled and packaged for the Linux distribution you use. I have tested OGMRip in Arch Linux and Ubuntu. In Arch Linux it is packaged with tesseract. Tesseract currently supports English, French, Italian, German, Spanish and Dutch. However, if you are really patient you can train tesseract to support your language too. In Ubuntu OGMRip is packaged with ocrad support; to install it, run Add/Remove Applications and search for it in "All Available Applications".

Another advantage of OGMRip is that you don't need to have the DVD ripped into VOB files on your hard disk drive. Subtitles can be extracted directly from the DVD disc. So enough with the talking. Let's move to the tutorial part.
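If you are not sure which of these OCR programs are present on your system, a quick check like the one below can help. This is only a sketch: the function name installed_ocr_engines is my own, and it only tells you which programs are on your PATH, not which one your OGMRip package was actually built against.

```shell
#!/bin/sh
# Print the names of the OCR programs (out of gocr, ocrad, tesseract)
# that can be found on the PATH.
# Assumption: this says nothing about which engine OGMRip was compiled with.
installed_ocr_engines() {
    for engine in gocr ocrad tesseract; do
        if command -v "$engine" >/dev/null 2>&1; then
            printf '%s\n' "$engine"
        fi
    done
}

installed_ocr_engines
```

If nothing is printed, none of the three is installed, and you will probably need to install one through your distribution's package manager.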

Once you have installed OGMRip fire it up. This is the program's main window.

A few words about the configuration first. Click Edit -> Preferences. In the General tab you can choose your Preferred Language, which will be selected automatically when you load a DVD so that you don't have to do this manually each time.

Next, the Advanced tab. Here select the Temporary Path, i.e. the directory in which the ripped files will be temporarily stored. I have /tmp selected. We are going to need this later.

Now press the Close button and click Edit -> Profiles. Select the first available profile, DivX for Standalone Player, and click the Edit button.

Go to the Subtitles tab and make sure that SRT text is the selected codec. Forced subtitles are the ones shown when only part of a movie is spoken in another language. For example, if you have an English movie and at some point someone speaks Japanese, the English subtitles that appear on the screen are the forced ones. If you want to extract only those subtitles, check this option; usually you won't need it. In Text Options select the UTF-8 character set, or the one for your language. In the End of line option select Carriage return only (Unix) if you plan to use the subtitle file in Linux, or Carriage return + Line feed (DOS) if you plan to use it in Windows. This way, when you open the .srt file with a text editor, the line breaks will be displayed properly. Now press the Close button.
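If you end up with the wrong line endings, you don't have to extract everything again. Here is a small sketch for converting an existing .srt file afterwards. The function names to_unix_eol and to_dos_eol are my own, and I am assuming the usual conventions: Unix text files end each line with a line feed, DOS/Windows files with a carriage return followed by a line feed.

```shell
#!/bin/sh
# Convert a file to Unix line endings by deleting every carriage return.
# Usage: to_unix_eol input.srt output.srt
to_unix_eol() {
    tr -d '\r' < "$1" > "$2"
}

# Convert a file to DOS line endings: strip any carriage return already
# at the end of a line, then terminate every line with CR + LF.
# Usage: to_dos_eol input.srt output.srt
to_dos_eol() {
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }' "$1" > "$2"
}
```

Both functions write to a separate output file, so the original stays untouched until you are happy with the result.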

Now let's load our DVD. Click File and select either Load to load a DVD directly from your DVD drive, or Open to open a local directory containing the ripped VIDEO_TS and AUDIO_TS directories. Select the Chapter you want to extract the subtitles from and make sure that the language you want is listed in Subtitles. Almost every time the main movie is the chapter with the longest duration. Once you are ready press the Extract button.

In the Options window just select the profile we set up before, "DivX for Standalone Player", and press the Extract button again.

Now a progress bar will appear with the title "Extracting subtitle stream 1". Pay a little attention here. When this operation completes, OGMRip will continue with extracting the audio stream. At this point press the Suspend button to pause the whole operation.

Now open a terminal and type:

cp /tmp/subp.* ~/subtitle.srt

With this command we have copied the .srt file from the temporary location /tmp into our user's home directory and named it subtitle.srt. Of course you can open Nautilus, Dolphin, Konqueror or whatever file manager you use, go to /tmp, find the subp.* file, copy it to your home directory and rename it. However the command is much faster, don't you think? Once you have copied the file, press the Cancel button and all temp files will be deleted.
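One thing to keep in mind: the cp command above only works as intended when exactly one file matches subp.*. If you want to be a bit more careful, something like the following sketch checks that first. The function name copy_subtitle and its exit codes are my own choices; the subp.* pattern is the one used above.

```shell
#!/bin/sh
# Copy the single subp.* file from a temporary directory to a destination.
# Usage: copy_subtitle /tmp ~/subtitle.srt
# Returns 1 if no file matches, 2 if more than one file matches.
copy_subtitle() {
    tmp_dir=$1
    dest=$2
    # Replace the function's arguments with the list of matching files.
    set -- "$tmp_dir"/subp.*
    # If nothing matched, the shell leaves the pattern unexpanded,
    # so the first "match" does not exist as a file.
    if [ ! -e "$1" ]; then
        echo "no subp.* file found in $tmp_dir" >&2
        return 1
    fi
    if [ "$#" -gt 1 ]; then
        echo "several candidates found, copy the right one by hand:" >&2
        printf '  %s\n' "$@" >&2
        return 2
    fi
    cp -- "$1" "$dest"
}
```

After defining the function you would call it as copy_subtitle /tmp ~/subtitle.srt, using whatever Temporary Path you selected in the preferences.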

And voila, your .srt file is ready in only a few seconds! Pretty simple! Now you can open it with a text editor such as gedit, kwrite, kate or OpenOffice Writer and correct the mistakes of the optical recognition. I hope that as the OCR programs develop, the recognition will become better. For example, I noticed that there were some problems with the letters a and o, or h and n, but this can be easily edited and fixed in the .srt text file!
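Before you start proofreading, it doesn't hurt to check that the file really contains subtitles. An .srt file is a series of numbered entries, each with a timing line such as 00:00:01,000 --> 00:00:03,500 followed by the text. The sketch below counts those timing lines; the function name count_cues and the two-entry sample are made up for the demonstration.

```shell
#!/bin/sh
# Count the subtitle entries in an .srt file by matching timing lines
# of the form "HH:MM:SS,mmm --> HH:MM:SS,mmm".
count_cues() {
    grep -Ec '^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}' "$1"
}

# Hypothetical two-entry sample, written to a temporary file for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
How are you?
EOF

count_cues "$sample"   # prints 2
rm -f "$sample"
```

A count that is far lower than you would expect for a whole movie suggests that the extraction was interrupted too early or that the wrong subtitle stream was selected.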

Comments (16)

Nick H.

Not to be rude, but did you actually try this using English? I recently experimented with ogmrip (ver. 10 & 11 on an older computer and a recent svn on a laptop) and in many cases the subtitles could best be described as gibberish. This doesn't seem to be an isolated incident, as I found other complaints concerning ogmrip subtitles on the web and in the sourceforge forums of the project itself. As a free project (especially in an alpha stage), ogmrip seems to be fine as a multimedia converter, but the subtitling just seems extremely poor, especially compared to other projects like Handbrake or even AcidRip. Finally, it just appears reckless to recommend this project for its subtitling ability.

Did you try this in English? ,
February 14, 2009

-1

OCR

The OCR issue comes from tesseract, which OGMRip uses. Tesseract is not as good as commercial OCR software, but seems to be one of the better FOSS ones. The issue with tesseract is that one needs to "teach" it first, which is a tedious process (see google for the wiki on training tesseract). A person with scripting abilities can partially automate this. I was able to train it for Czech characters (not even officially supported by tesseract) for DVD vobsub ripping, and the major problem was that occasionally a few words were not separated by spaces. Otherwise, good work.

MK ,
April 21, 2009

+1

...

I haven't tried training tesseract, MK, but it's good to know that you have almost successfully achieved it. Since tesseract is still being developed, I believe it's a matter of time till it becomes a good OCR utility.

axel ,
April 21, 2009

+0

if you're having problems with english subtitles

@All Ubuntu Users If you're trying to rip subtitles in Ubuntu with ogmrip, it's important that you install the English tesseract package. For some reason the default language package for tesseract is German, so anytime you install tesseract-ocr it will also install tesseract-ocr-deu. In order to rip English subtitles with ogmrip, install the following packages.

sudo apt-get install ogmrip tesseract-ocr-eng

buntu ,
June 24, 2009

+1

...

Thanks for mentioning this buntu.

axel ,
June 25, 2009

+1

...

It just got easier! The developer of ogmrip has kindly added support for encoding without video streams. Instead of having to pause the encoding at a certain point and copying the srt file from your temp directory you can simply select the subtitle stream and disable the video and audio streams.

...

Do you know if it's possible to use ocropus with either ogmrip or avidemux to perform the ocr processing?

http://code.google.com/p/ocropus/

buntu ,
July 29, 2009

+0

...

Sorry buntu, I have no idea. If you try Avidemux you will see that at first you train the program and after a while it makes all the ocr automatically. It's good.

axel ,
July 30, 2009

+0

...

@buntu: In my experience the available plugins vary depending on which OCR software is installed. If you have only tesseract, you might try to install gocr or ocrad. At least this made a difference to my SVN compile.

However, now it doesn't seem that the "no video" option is any longer available :/