Download YouTube video files with youtube-dl

Did you know that you don't necessarily need Flash to view YouTube videos? You don't even need to visit YouTube to watch a cool video that a friend gave you a link to.

All you need is youtube-dl, a neat program written and actively maintained by Ricardo Garcia Gonzalez, which lets you download a YouTube video directly as a .flv file that can be played with mplayer or VLC.

Run the youtube-install script in the terminal as a superuser: sudo ./youtube-install.sh

If the script installed successfully you can now start downloading YouTube videos like this: youtube-dl URL, where "URL" is your YouTube video URL.

The video file will be downloaded in the current working directory.

For more information about youtube-dl check the official page on github. For other ways to download YouTube videos see below.

Other ways to download youtube videos

KeepVid.com

A great alternative to the youtube-dl script is KeepVid.com, which lets you drop in the video URL, select the video site it is from (yes, it supports more than just YouTube) and click download. Most files will have an ugly file name, but you can rename them to something.flv. VLC will play them fine!

Chrome Add-On

If you use Chrome (or Chromium), install the Elite YouTube Downloader add-on and you'll get a neat download button (with a drop-down menu for options) below every YouTube video.

Comments

1. Being root isn't necessary to install youtube-dl; it could be installed to somewhere in a normal user's $PATH that they have write access to.
2. That script is so short it's useless. Just tell people to download the youtube-dl script to somewhere in their $PATH.
3. The GPL requires you to include it with the script, which would be huge bloat.
4. I think Xine can play FLVs too, along with anything else that uses FFMPEG.

Once you know how to use basic commands like ls, cd, etc, you should be able to run scripts without any problems since a script works like a command.

Note that once you've downloaded the script, you will also have to make it executable by running the command: chmod 755 youtube-install.sh
This can also be done by right-clicking on the file, then "properties"->"permissions"->check "allow executing file as a program"

[Reply is directed at the bashist 4-line shell script and assumes (based on some testing) that "t:'.....'" does not appear in the "watch?v=....." page so that t="" is essentially what is grep'd out of that page.]

The script libervisco's script uses to do the heavy lifting is written in python ( http://www.arrakis.es/~rggi3/youtube-dl/youtube-dl ). I don't know python that well (at least I haven't used it before), but I think it would be interesting to take it apart and see what it is that makes it tick. We can post the results on this forum.

I had hoped that the 4 line shell script presented by bashist would do that trick, but that one is not working for me. [I did not run it as directed but someone else claims it does not work. My attempt at doing the equivalent one line script (equiv for all intents and purposes) did not pan out either.]

So does anyone else think it would be a good idea to analyze and explain the python script?

I think I will attempt it but despite being relatively easy I don't speak python very well. Thus I may give up on it if I don't get results quickly (this would be where someone else chipping in would be beneficial to this effort).

The url for it is http://www.arrakis.es/~rggi3/youtube-dl/youtube-dl . Reference it because I don't want to post it here all at once (though if someone really wants to they can -- it uses a very "liberal" license). I may quote heavily from the script though.

It is important to know right now that I make guesses throughout whenever I think I can pull it off based on context. At some point we should verify our hunches by looking up the source code for these external libraries/objects being used indirectly (eg, external python imported through the "import" statement).

...

The first section shows that some existing python libraries will be used. We can ignore this section until we get to the actual function calls later on that we don't understand. Guessing from the names of these libraries, it looks to me like they will (among other things) handle the technical details of fetching from the web what we want as well as building up the queries perhaps.

Next we have a bunch of variable/constant definitions. We can ignore these also until we get to them. The point here is probably to allow the rest of the script to be easier to read and to maintain.

Next we get into a bunch of function definitions. We ignore these too for the moment.

At this point the script actually starts to do real work (prior to this, the python interpreter would "just" be initializing itself and storing definitions).

Let's look at the first small section after '# Create the command line options parser and parse command line' consisting of a group of similar looking lines. This part is like administrative overhead but let's cover it quickly.

The first 3 lines set up and start to make use of one of the objects/libraries we loaded up before, "optparse". [There was an "import optparse" line earlier in the program.]

One of the main things done by the lines that follow is to store values entered at the command line (when calling this script) into variables that can be accessed throughout the script as needed. [This is a case of me making a guess so as to avoid looking up the "optparse" source code.] For example, a variable named "username" would store whatever word followed the CLI flag "--username". Other values are saved under "password", "outfile", "quiet", "simulate", "use_title" and some others.
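To make that guess a bit more concrete, here is a minimal sketch of how optparse stores command line values. The long option names come from the flags mentioned above; the short flags, the help text and the sample argument list are my own assumptions and are not copied from youtube-dl.

```python
import optparse

# Hypothetical parser mirroring a few of the flags discussed above.
cmdl_parser = optparse.OptionParser(usage="%prog [options] video_url")
cmdl_parser.add_option("-u", "--username", dest="username", help="account username")
cmdl_parser.add_option("-p", "--password", dest="password", help="account password")
cmdl_parser.add_option("--quiet", action="store_true", dest="quiet", default=False)

# Parse an explicit list instead of sys.argv so the sketch can run anywhere.
sample_argv = ["--username", "myusernameIguess", "--quiet", "fBWoWFF3OYU"]
(cmdl_opts, cmdl_args) = cmdl_parser.parse_args(sample_argv)

print(cmdl_opts.username)  # the word that followed --username
print(cmdl_opts.quiet)     # set because --quiet was given
print(cmdl_args)           # whatever is left over after the flags
```

Each flag ends up as an attribute on cmdl_opts named after its "dest", and anything that is not a flag lands in the cmdl_args list.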

The next line (..socket..) sets the time out period when making a call out to google (or anywhere else). [Note again that I am guessing whenever I refer to something that was imported ("import socket") unless I specifically state that I will look at that source code (in which case, I should provide a link).] The script author probably has an idea of what constitutes a healthy time out period for this script and doesn't want to rely on the defaults set up by whoever wrote the "socket" code. The actual value "const_timeout" is a variable, so we look back to find it and note that it is set to "120" at the top of the script. My guess is that this 120 represents seconds to the socket object. That's a long timeout if you are there waiting interactively but can be useful if you were running the script overnight or automatically as part of some batch processing (eg, downloading while you sleep or in the background a bunch of videos you looked at during the day). In this latter case, you want to give the network or the google servers some time to recover if they are busy. [Corrections are welcomed.. since I am not looking at source code to verify this.]
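If the socket line is what I think it is, it would be a call to socket.setdefaulttimeout, which is the standard library way to set a process-wide timeout and which takes its value in seconds. A minimal sketch under that assumption (only the name const_timeout and the value 120 come from the script):

```python
import socket

const_timeout = 120  # seconds; the value quoted from the top of the script

# Apply the timeout to every socket created from now on in this process.
socket.setdefaulttimeout(const_timeout)

print(socket.getdefaulttimeout())
```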

The next statement ("if" statement across 3 lines) in conjunction with the one just above the socket code, ie:
>> (cmdl_opts, cmdl_args) = cmdl_parser.parse_args()
>> if len(cmdl_args) != 1: ....

just checks that this script is called with any number of flags but only a single other argument. If not the program ends and a message is printed. That argument btw is stored into the "cmdl_args" array (or list or whatever else python uses through the "(..)=.." syntax) and represents the name of the video to download. So you might call this script as follows:

.. or so I am guessing at this point since not too many clues have been provided (but we'll keep reading the source code to see what is up ).

>> video_url_cmdl = cmdl_args[0]
Here we place (eg) "fBWoWFF3OYU" from the storage space cmdl_args[0] into video_url_cmdl. Ie, we simply now have a "nicer" variable name through which to access the value we entered at the command line (I'm assuming at this point that this value is the name of the video, but we'll eventually see how the script uses this value to see what it actually represents -- probably the name of the video .. like fBWoWFF3OYU for one of the freedomware videos).

So we are at the comment line "# Verify video URL format and convert to "standard" format". The last thing done was to put the main argument to this script (eg, fBWoWFF3OYU) into video_url_cmdl

>> video_url_mo = const_video_url_re.match(video_url_cmdl)
>> if video_url_mo is None: ...
Now we use what is called a "regular expression" to verify that this argument (eg, fBWoWFF3OYU) fits the pattern we expect. A regular expression is like a template that can be used to verify any text and to grab subtext from a larger text. I won't explain the rules, but the template is const_video_url_re, which we look up near the top of the script (in the initialization portion) and find to be the following monster:
^((?:http://)?(?:\w+\.)?youtube\.com/(?:v/|(?:watch(?:\.php)?)?\?(?:.+&)?v=))?([0-9A-Za-z_-]+)(?(1)[&/].*)?$

Roughly what this template (ie, the regular expression) says is that the argument can be something like fBWoWFF3OYU, but it can also take any of numerous other forms, eg, http://www.youtube.com/watch?v=fBWoWFF3OYU .

Also, at the same time that this argument is matched/verified, the fBWoWFF3OYU portion is saved. We see part of the power of regular expression handling since whether we had "fBWoWFF3OYU" or "http://www.youtube.com/watch?v=fBWoWFF3OYU" or something else that fits the template, the fBWoWFF3OYU portion would be isolated and stored apart.

Next we come to:
>> video_url_id = video_url_mo.group(2)
which is where we take the fBWoWFF3OYU and save it into the variable video_url_id

Next:
>> video_url = const_video_url_str % video_url_id
would put into the variable video_url the following string of text: "http://www.youtube.com/watch?v=fBWoWFF3OYU". [Look back to the top of the file to verify what const_video_url_str is]

To recap what looks like the most important part of these last two mini sections, we have set two variables. Using our running example of fBWoWFF3OYU, we have:
video_url_id = "fBWoWFF3OYU"
video_url = "http://www.youtube.com/watch?v=fBWoWFF3OYU"

Note that if our original cli command had been (borrowing the example from the parent post and fixing it a bit):
./youtube-dl --title sometitle --username myusernameIguess --password asupersecretyoutubepasswordIguess fBWoWFF3OYU
or
./youtube-dl --title sometitle --username myusernameIguess --password asupersecretyoutubepasswordIguess http://www.youtube.com/watch?v=fBWoWFF3OYU
or even
./youtube-dl --title sometitle --username myusernameIguess --password asupersecretyoutubepasswordIguess http://uk.youtube.com/watch.php?v=fBWoWFF3OYU
... in all three cases (and anything else that fit the template), we would end up with exactly:
video_url_id = "fBWoWFF3OYU"
video_url = "http://www.youtube.com/watch?v=fBWoWFF3OYU"
The magic was that no matter which of the ways we use to enter the info, the regular expression processing allowed us to pluck out the fBWoWFF3OYU and then use that to construct a very specific url http://www.youtube.com/watch?v=fBWoWFF3OYU.
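The last few steps can be tried out in isolation. The sketch below reuses the regular expression quoted earlier and the variable names from the script; the value of const_video_url_str is my assumption, inferred from the resulting url shown above, and the normalize function is just a made-up wrapper for illustration.

```python
import re

# Pattern quoted from the script, split over three lines for readability.
const_video_url_re = re.compile(
    r'^((?:http://)?(?:\w+\.)?youtube\.com/'
    r'(?:v/|(?:watch(?:\.php)?)?\?(?:.+&)?v=))?'
    r'([0-9A-Za-z_-]+)(?(1)[&/].*)?$'
)
# Assumed format string; check the top of the script for the real one.
const_video_url_str = 'http://www.youtube.com/watch?v=%s'


def normalize(video_url_cmdl):
    """Return (video_url_id, video_url), or None if the argument does not fit."""
    video_url_mo = const_video_url_re.match(video_url_cmdl)
    if video_url_mo is None:
        return None
    video_url_id = video_url_mo.group(2)
    return video_url_id, const_video_url_str % video_url_id


for arg in ('fBWoWFF3OYU',
            'http://www.youtube.com/watch?v=fBWoWFF3OYU',
            'http://uk.youtube.com/watch.php?v=fBWoWFF3OYU'):
    print(normalize(arg))
```

All three inputs should come out as the same pair of values, which is the whole point of the "standard" format step.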

OK.

The next section, as the comments in the script indicate [comments start with "#"], is just some tests on the flags that were passed in, if any. From reading the error messages, we get a better idea of what the flags represent. For example, it seems you can't specify a specific name for the output file (ie, the name under which the download will be saved, or so that seems to be where this is leading) and also specify that you want the title of the download to be used as that file name. Ie, this is a conflict, so the script ends and displays an error message. You can only specify one of --output, --title or --literal.
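Checks of this kind might look roughly like the following. This is only a sketch: the function name, its parameters and the error messages are made up for illustration, and the real script may well structure this differently. It also covers the --netrc conflict that the script's error messages mention.

```python
def check_flag_conflicts(output=None, title=False, literal=False,
                         netrc=False, username=None, password=None):
    """Return an error message for conflicting flags, or None if all is well."""
    # Only one way of naming the output file may be chosen.
    naming_flags = [output is not None, title, literal]
    if sum(naming_flags) > 1:
        return 'only one of --output, --title or --literal may be used'
    # Explicit credentials may not be combined with --netrc.
    if netrc and (username is not None or password is not None):
        return '--netrc cannot be combined with --username or --password'
    return None


print(check_flag_conflicts(output='video.flv', title=True))
print(check_flag_conflicts(title=True))
```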

We also note that --netrc seems to be incompatible with either --username or (--username and) --password. I don't yet know what this means though.

The next section does some testing to determine the variables account_username and account_password [the goal of this section seems to be to come up with these two values]. netrc was something we imported at the top. I am guessing netrc is used to access some system or computer wide username/password that is attached to the youtube domain and can be used for authentication. Maybe python can be configured to use system defaults. Maybe ldap.. I don't know. I am guessing. In any case, we can forget about netrc (ie, no need to worry about the --netrc flag).

It's looking like we supply to this script the username and password we have for youtube. Now, is this mandatory? Well, if we keep reading the script, we should find out. We should also find out how it is that we can authenticate with youtube automatically (I've only done so through the browser by clicking on login or whatever as is the usual procedure).

Let me also bring up that it looks like youtube uses cookies to know that you are returning so as to keep you logged in. Thus I expect that a dialogue with youtube through this script, if it requires logging in and if it uses the http protocol, will probably involve passing cookies back and forth to the youtube servers in the http headers. If anyone is confused, don't worry. We should get a better idea of what is going on as we make our way through this script. If cookies are used, I'll try to explain that as simply as I can at that point, for anyone who wants to know how that works.

So, to use the running example, from this section we end up with:
account_username = myusernameIguess
account_password = asupersecretyoutubepasswordIguess
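For what it's worth, Python's netrc module reads logins from a plain text .netrc file (normally kept in your home directory), so the guess about stored credentials is not far off. A minimal sketch, using a temporary file and an assumed machine name "youtube" rather than whatever name the script really looks up:

```python
import netrc
import os
import tempfile

# Write a throwaway .netrc-style file; the machine name "youtube" is assumed.
with tempfile.NamedTemporaryFile('w', suffix='.netrc', delete=False) as fh:
    fh.write('machine youtube login myusernameIguess '
             'password asupersecretyoutubepasswordIguess\n')
    netrc_path = fh.name

try:
    info = netrc.netrc(netrc_path).authenticators('youtube')
finally:
    os.remove(netrc_path)

# authenticators() returns (login, account, password), or None if not found.
account_username, account_password = info[0], info[2]
print(account_username, account_password)
```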

OK, time for me to take a time out. I may not get back until tomorrow since I'll be busy for a while.

Something that I will probably have to end up trying when I get stuck (and you people might want to do too) will be to load up python and run this script, but before doing so, add print statements to list out the values in variables or to poke around as the need/curiosity dictates. Python can be run interactively, so you can run only a fraction of these script lines and even modify them as you go. However, running interactively, while more powerful in some ways, means you have to simulate the environment where the cli command was passed in. So to make things easy, just forget about starting up the interpreter by itself. Instead add print statements (not sure what python's version of print/echo is) so that you can see what happens (the script file we modify is youtube-dl.. ie, the file being dissected). If you are really curious, consider also adding your login info if you have a youtube account, but I don't know if that is needed.

Again, if anyone wants to share insight or continue exploring the code...

[Just looked at libervisco's script (youtube-install.sh) a little closer. It's just an installation script (gee, the name should have given it away). I guess I misunderstood the top post. Anyway, if you don't want the actual script you call, youtube-dl, to be installed into /usr/local/bin, then DON'T run libervisco's script. I'm not saying there is any harm in installing it that way. Just pointing it out in case you like to install it somewhere under your home directory, within the current scratch directory, or just somewhere else or just don't have root access. If you opt out of running youtube-install.sh, then you still have to download youtube-dl somehow. Use for example "wget -c http://www.arrakis.es/~rggi3/youtube-dl/youtube-dl" to download it. Then to run youtube-dl, make sure you have the execute bit set (that is part of what libervisco's installer does).]

This gibberish is a regular expression. Though it's not important to know regular expression notation to basically follow along the youtube-dl python script discussion, regular expressions (re's for short) are very useful so we might as well take a detour to look them over [follow along if you are curious]. BTW, regular expressions don't have to look ugly. For example,

a

is a regular expression. It can be used to match/find an "a" within a sample of text.

Another example is

dog

which matches "dog" within the sample of text being tested.

RE's are potentially huge time savers for doing many types of text processing. Perl got its reputation as the language for text manipulation because of its great support for regular expressions and their integration into the perl language itself. The web protocols are based on English looking text and so perl made a name for itself early on as being the web language. Today, lots of programming languages have access to great RE capabilities through bindings to some of the same libraries that perl uses. One great thing about RE's is that they frequently enable you to do many types of searches with almost no knowledge of any syntax of the underlying language/tool since most of the work is done through re's and not through language "loop" expressions/statements. Of course, what was done in the past through re's (eg, lots of backend web processing) now is hidden behind function calls that do all that work and then some. This is so because the web is a platform that hasn't changed much in years. In many respects, that "problem" has been solved. The wheel was made and honed, and now we just use the results of that work without having to dip into RE's. But for custom or new applications, RE's continue to be as useful as ever. In fact, the better "Find" tools you see on the web or on the desktop allow you to do more advanced searches by accepting regular expression notation.

The idea for taking advantage of re's is to pair up a regular expression (the template) with some text and see if there is a match (and if so where and what). Regular expressions look just like text (so be careful you don't lose track of which is the template and which is the text). What makes it a regular expression is you. You basically say, this will be a template. Voila! Of course, you have to let the computer know your intentions, so you would use it in the proper function calls. Generally, each re symbol matches a character/byte of text that "looks" just like it, but there are some symbols that have special meaning.

Warm-up pre-exercises [informal notation]:
> dog|cat matches "dog" as well as "fsfsfddogdfsdfsdf" as well as "dsdsdsdsdcatfsdfdf". Did you notice the dog and cat hidden in the middle?
> (dog)+(cat)* matches "dog" "dogdog" "dogcat" "dogdogcat" "dogdogcatcatcatcat" "dogcatcatcatcatcatcatcatcatcatcat" "dogdogdogdogdogdogdogcatcatcat" and many more things like "asdfasdfasdfdogdogdogdogdogcatcatcatdfsafsdfasdfsdf" and more.
> dog+cat* matches "dogggggcattttt" "dogca" and many more things.
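These warm-ups can be checked with Python's re module. In this sketch re.search looks for the template anywhere in the text, which is the behaviour assumed throughout this post; the list name warm_ups is just something I made up.

```python
import re

# Each warm-up template paired with sample text that should contain a match.
warm_ups = [
    (r'dog|cat', 'fsfsfddogdfsdfsdf'),
    (r'dog|cat', 'dsdsdsdsdcatfsdfdf'),
    (r'(dog)+(cat)*', 'dogdogcatcatcatcat'),
    (r'dog+cat*', 'dogggggcattttt'),
    (r'dog+cat*', 'dogca'),
]

for template, text in warm_ups:
    found = re.search(template, text)
    # group(0) is the exact part of the text that matched the template.
    print(template, '->', found.group(0))
```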

OK, when dealing with regular expressions, we have to be careful with quotation marks and in general marks of any kind. I'll try to be consistent and use quotations around text and text matches while using no quotations when referring to the re itself. The examples won't be that advanced. At the end though, we'll go over the "monster" re that led to this diversion, ie,
^((?:http://)?(?:\w+\.)?youtube\.com/(?:v/|(?:watch(?:\.php)?)?\?(?:.+&)?v=))?([0-9A-Za-z_-]+)(?(1)[&/].*)?$

Let's start by covering the symbols/characters/bytes/etc that have special meaning when used inside a regular expression.

Note the "^" at the beginning and the "$" at the end of the monster RE above. These have special meaning. These are said to "anchor" the expression to the beginning of the test text (^) and to the end ($). These are optional like any other symbol. You generally avoid these if you want to find a pattern anywhere. You use them if you want to confine the pattern or be extra restrictive.

Let's see what some of the other special characters are. Note that preceding any of these characters with a "\" (that basically isn't itself preceded by a "\") allows them to match themselves. One thing you will notice is that these special symbols carry implicit conditional logic. Through this they allow compact templates to be created, even allowing you to express infinities within the matching, allowing you to express compactly what would otherwise involve potentially lots of programming constructs.

Alright...

\ this is used to escape symbols. Most of the symbols that follow have special meaning unless escaped. When escaped, they represent their ordinary looking selves. Also, some ordinary characters (eg, "n") when escaped become special (eg, "\n" represents end of line, ie, byte x0A).

^ is used to match the beginning of the text being tested. Or rather, for there to be a match, this symbol has to match up with the beginning of the text being tested. If this can't be done, then there is no match, ie, the re template does not match the text being tested.

$ is used to match the end of the stream of text being tested.

() this is for grouping. It allows for tremendous simplification of notation since it has the highest order of precedence. I'll clarify later through contrasting examples.

(?: ) this is identical to () as far as testing for a match. The difference is in whether or not we can find out immediately what matched (and not just that there was a match). Why the distinction? Because it's more efficient if you can forget what matched but simply know that something did. For some heavy duty applications, you want speed and to know yea/nay. Afterwards, there is time to go back and get the actual text that you know exists near a particular area. In other uses, however, going back repeatedly can result in a huge processing time sink and require extra coding (and time coding). It's an engineering call to be made depending on the problem being solved and the characteristics of the stream. Anyway, (xxx) gives you an immediate link to the actual matched text (if any) and (?:xxx) does not. (?:xxx) is used only for grouping (no capture).

? means that what precedes it is optional (in other words, what precedes it appears exactly once or not at all for there to be a match).

* means that what precedes it can appear zero or any number of times.

+ means that what precedes it must appear at least once for there to be a match, and can appear any number of times beyond that.

[] is used to enclose a group of characters, any single one of which can/must be matched at that location.

[^] is used to enclose a group of characters that cannot be matched. This also only represents a single item.

[f-t] this is a specific example and it means that any letter between "f" and "t" inclusive can be matched. It's a short-hand for writing out all the letters in this range.

[-ft] in this case we can match either of "-" "f" or "t" In fact, if you want to include "-" it has to be in the initial or final position; otherwise it will designate a range.

[a^] means either "a" or "^" only is to be matched.

[^a] means any byte at all except "a". Thus at the beginning of a [], ^ means exclusion.

| matches either what precedes it or what follows it. It has a low order of precedence relative to everything else. [I'll give examples of order of precedence scenarios below.] Eg, av|bv means match "av" or match "bv".

. is used to match any single character. Eg, a.b matches "azb" and "a.b" and "a#b".

Almost anything else matches only itself and exactly once. [Remember, the above match themselves only when preceded by a \]

Also, there are a few otherwise ordinary characters that when preceded by a \ acquire special meaning (so they sort of work the opposite way from the prior examples). For example, \w matches any letter or number or "_". Thus \w+ would match a word of such text.

[Keep in mind I am not looking at the python descriptions of these so there may be some subtle differences from what I say.]
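A handful of the symbols above can be spot-checked in Python. The templates and sample texts in this sketch are made up for illustration (only fBWoWFF3OYU comes from the earlier discussion); each row says whether a match is expected, and the loop fails loudly if Python disagrees.

```python
import re

checks = [
    (r'^dog', 'dog house', True),     # ^ anchors to the start
    (r'^dog', 'hot dog', False),
    (r'dog$', 'hot dog', True),       # $ anchors to the end
    (r'colou?r', 'color', True),      # ? makes the u optional
    (r'[f-t]', 'e', False),           # the range f..t excludes e
    (r'[-ft]', '-', True),            # a leading - is a literal
    (r'[^a]', 'a', False),            # anything except a
    (r'a.b', 'a#b', True),            # . matches any single character
    (r'a\.b', 'a#b', False),          # an escaped . matches only a real dot
    (r'^\w+$', 'fBWoWFF3OYU', True),  # \w+ matches a run of letters/digits/_
]

for template, text, expected in checks:
    assert (re.search(template, text) is not None) == expected

print('all checks agree')
```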

OK, now for some simple examples to help clarify a bit more.

I'll list the re on one line and then below it I'll give examples (one per line) of sample text that would match. Then below that, I'll list sample text that would not match. I'll use ***** to separate the two groups of examples from each other and from the RE.

Keep this in mind. The lines in the first group all match but the actual matched text is a subtext of the whole line. If you want to know exactly what part matches and what part doesn't, use the tool provided http://www.regular-expressions.info/javascriptexample.html . The "Replace" button will indirectly show all the parts that match by putting in the word "replaced" in place of what was matched. The portions that are not matched are kept in place. Still, the line itself would be a match if any portion of the line matched. No match means there was no match anywhere within the line.

[You can practice whenever by going to that link http://www.regular-expressions.info/javascriptexample.html . The re syntax there is whatever the javascript running on your browser understands (so javascript must be enabled). This is basically what I am describing in this posting, and it should be basically what python uses.]

As an exercise, make sure you believe that all examples in the first groups match and none of those in the second groups match. I may make a mistake, so ask if in doubt. In fact, I generally won't test these myself with the website tool, which increases the chances of a mistake slipping through. For practice, try testing these examples yourself at the link mentioned in the prior paragraph. Put the re into the top slot, then put the text string to test into the next slot, then hit "Test Match". If the javascript finds a match but you don't believe it, try hitting the bottom button "Replace" as it will show you where the match(es) occurred. If you still don't believe it, post the example here. I discovered a bug in the re matching in mozilla some years ago [it may still be there]. ..and yes, I do make mistakes.

I know the examples can get tedious, but testing some of these with the linked tool as you find time will help out. You'll see better examples and the payoff further down after this section.

Here is a single isolated example to make the other examples clearer:
a
*****
a
*****
b
c

This means that the RE is a. There is one example provided of a line that matches ("a"). And there are two examples provided of lines each that fail to match ("b" is one line and "c" is the other line).
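If you would rather check examples in this format with Python than with the web page, a small helper along these lines should do. The function name check_example is made up, and it assumes that "matches" means the template is found anywhere in the line, which is what re.search does.

```python
import re


def check_example(template, should_match, should_not_match):
    """Return the lines that behave differently from what the example claims."""
    surprises = []
    for line in should_match:
        if re.search(template, line) is None:
            surprises.append(line)
    for line in should_not_match:
        if re.search(template, line) is not None:
            surprises.append(line)
    return surprises


# The isolated example above: RE a, one matching line, two non-matching lines.
print(check_example(r'a', ['a'], ['b', 'c']))
```

An empty list means every line behaved as the example said it would.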

^ab$
*****
ab
*****
a
b
ba
aab
c
54645ygfs54ytfdgsdrytg afsret43t asdfv 4 hello everyone i don't match it
^ab$
*****
*****
^(ab)$
*****
ab
*****
(ab)
(bas)
ab)
^(ab)$
(((sdfsfasdf)f((9(dsf]]]]]]]]][dkfksdlpfkas;ldkf;lsdf these dont match because the text line MUST begin with a and end with b immediately.
*****
*****
(ab)
*****
ab
(ab)
(ab
^(ab)$
ab)sfasdf these are all ok since the ab can match anywhere.
*****
(a b) this has a space
(bas)
(((sdfsfasdf)f((9(dsf]]]]]]]]][dkfksdlpfkas;ldkf;lsdf these dont match because they don't have a followed immediately by b.
*****
*****
^(?:ab)$
*****
ab
*****
^(?:ab)$
previous line and this one don't match. for example, previous line starts with "^" not "a" as it should.
alksdjkflkasjd klfjasklfj klsdjfklsdjfk hope you get the idea. basically only "ab" matches "ab" or "(ab)" or "(?:ab)"
*****
*****
^(a|b)$
*****
a
b
*****
ab
ba
basdfsdfasfdsadf
aa
(a|b) .. in other words, either "a" or "b" but nothing else will match including this long line.. duh.
*****
*****
^\(a|b\)$
*****
(a
b)
*****
only lines that start with "(a" or end with "b)" will match, like the 2 above. note how | is low order of precedence so ^ \( and a are one item as is b \) and $
*****
*****
\$$
*****
$
afsfasdfasdf $sf sadfasdf $$$$ fasdf $$$$
*****
the above match because they end in "$"
*****
*****
$\$
*****
*****
nothing can match $\$ since it would have to end and then be followed by a "$" which is impossible if the text stream ended.
note that the first grouping has no lines in it. that isn't a double ***** used to separate examples.
*****
*****
^\((a|b)\)$
*****
(a)
(b)
*****
nothing else will match not even
^\((a|b)\)$
and certainly not
(ab)
a
b
(a
etc
*****
*****
(a|b)
*****
sdfsdf i match believe it or not (the reason is that if a single a or a single b appears anywhere then we match at the 1st such location).
note that I too match. the difference with the above cases is that we didn't anchor the text to the beginning and end.
*****
I don't mtch, sorry.
*****
*****
^a|^b
*****
a i match because I start with a or with b (with a actually) note that ^a|^b is the same as ^(a|b) as far as matching goes
*****
sorry, I could have a's and b's like crazy but I do not match
*****
*****
^a|b
*****
a i match because I start with a or have a b anywhere(with a actually)
*****
don't start with a or have lc B somewhere
*****
*****
a|b$
*****
zsdfsjdfsdfdb
zsdfsjdfsdfda
zadfsjdfsdfdc
*****
zsdfsjdfsdfdc
these lines don't hAve An A lc Anywhere or end in b.
*****
*****
a$|b
*****
we match just by having a single b anywhere
or alternatively, without any, we can end in an a
*****
anything that doesn't have a lower case B or end in an a won't match.
*****
*****
ab(cd$|ef)gh|ij
*****
wow, here we match if we have an ij anywhere (so we match).
we also match if we have anywhere the following "abefgh". note that we have to have all of these letters exactly
*****
I don't match because i don't have Ij lowercase nor do i have aBefgh lower case.
I don't match abcd
i don't match abcdgh
the reason the above 2 don't match is that you can't meet the requirement of *ending* in a "cd" yet be followed by a "gh". it's impossible.
that was a bit tricky, i know. i even had to test it out because i wasn't sure if the re was really legal, [editor: i added a similar example above when proofing this reply]
*****
*****
a*b*
*****

the prior empty line matches as does this one. this one matches first at the very beginning WHAT? the match is ""
b i match too. i match first at the beginning. the match is "b"
since "" is possible, any line matches.
Also the algorithm/rules that are used will likely match once after every single character except
if there is a run of a's or a's and b's, in which case, there will be a single match for that entire run
go to the testing page link and try out the following
zaazaaaaabbzbbbbbbbzbababaz
try the "Replace" button to see exactly what gets replaced. Go ahead, try it.
OK, i'll tell you. there is a match before the first z (match is ""), again before the next z (this match includes "aa" as a single match), and then again (match is "" right before z), again before the next z (the match is aaaaabb), then right after (match is ""), again before the next z (match is bbbbbbb), then again (""), again before the next a (match is b), and before the next a (match is ab), and before the next and final a (match is ab), and before the ending z (match is a), and again (match is ""), and then once more after the ending z (match is "").
This is a crazy case. Usually, you want to avoid degenerate cases where you match "" because that seems rather useless in *almost* every scenario. So be careful.
Anyway, any line will necessarily match so I am going to have an empty second group and go on to the next example.
*****
*****
*****
a?bc?de|fg
*****
this line matches because it has bde.
this line matches because it has fg.
this line has abde.
this line has bcde
this line has abcde
this line has abcde as well as fg (abcde is the first match).
ditto here abcdefg.
and here fgabcde
*****
this line does not match despite having a de and an ade. the reason is that ? applies only to the single symbol
(or group) right before it, so only the a and the c are optional while the b is mandatory. see the next example.
*****
*****
a?b(c?de)|fg
*****
bde matches in this case too. here the b is not optional at all (unless i have fg).
i can have bde, bcde, abde, abcde, or fg .. and these need not be isolated, eg, zbdet matches at the b. and bdebde matches in two places.
*****
don't match

naturally, the previous blank line doesn't match either.
*****
*****
a?b(c?d)e|fg
*****
as far as matching goes, this regular expression is identical with the prior one (eg, bde and the rest match)
*****
the only difference, the () placement, in this specific case, doesn't affect the possible matches but it does something i'll cover later.
*****
*****
a?b(c?de|)fg
*****
the | inside () has lowest order of precedence so everything to the left "or" everything to the right (within the ()) can match. bfg
*****
we can have "" match up inside (). that's how we ended up with Bfg.
note that all matches will necessarily include a b and end in fg. i don't mean fg is at end of line, but fg will
be the ending part of any match.
bdefG bcdefG w or w/o initial a would all match if these were in lower case.
also matching if in lc would be abfG.
*****
*****
a?b(c?de|f)g
*****
here we have an optional a followed by b followed by either cde, de, or f, followed by g as the possibilities, eg, abcdeg
*****
is this making sense?
*****
*****
a\?b(c?de|f)g
*****
the only difference here is that a is now mandatory as is a "?" that would follow it immediately in all cases, eg, a?bfg matches
*****
lots of things don't match including abfg and ?bfg and a?bc?deg. however, a?bCdeg would match if all lc. as would a?Bdeg
*****
*****
[ab]cd[ef]gh
*****
matches acdfgh acdegh bcdegh bcdfgh
*****
but not abcdefgh nor acdefgh nor cdgh nor acdgh nor many other things.
*****
*****
[ab]*cd[ef]gh
*****
matches cdegh and other stuff
*****
fails to match @
*****
*****
~!@#\$%\^&*()_+|=
*****
~!@#$%^_!!!!!!
1=2
*****
i don't match; above note the escaped $ and ^. that was necessary or there would be no matching possible. Also the & was not necessary because of the * after it. Also, the () are invisible (they group nothing, "") and the _ could have appeared more than once.
The !!! at the end are irrelevant since the important part appears somewhere.
finally, the | near the end means that you can also match with a line having an equals sign.
*****

OK, well I got sort of tired of doing this. Let's move on.

The parentheses are used to group and also to capture the text within them that matched. Capturing works for () but it specifically does not work for (?: ) which is otherwise identical.

Thus (a|b) matches a and it matches b. Ditto for (?:a|b). However, in the first case, I can query to find out if the match was a or b. This is very powerful. It saves many lines of coding. [I won't give python code examples, sorry]
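
For anyone who does want to see it in python, here is a minimal sketch of the difference using the standard re module. The patterns x(a|b)y and x(?:a|b)y and the test string are made up for illustration only.

```python
import re

# Same matching behaviour, but only the first pattern captures the inner text.
capturing = re.compile(r"x(a|b)y")
non_capturing = re.compile(r"x(?:a|b)y")

text = "zzxbyzz"
m1 = capturing.search(text)
m2 = non_capturing.search(text)

print(m1.group(0))   # the whole match: xby
print(m1.group(1))   # the captured part: b
print(m2.group(0))   # the whole match again: xby
print(m2.groups())   # nothing was captured: ()
```

Both patterns match the same text; only the () version lets you ask afterwards whether it was the a or the b.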

Speaking about grabbing matched text, the * and + work as follows. These match the longest match possible. Sometimes you want the shortest match possible, for those cases use *? and +?. [examples will follow to help clarify]

You should also know that there is a shorthand for matching a specific number of occurrences (or any range) without having to list out the pattern that many times. You can follow the portion you want repeated by {m,n} (instead of * or + or ?). This means at least m but no more than n. There is also {m} for exactly m times and {m,} for m or more times. The shortest-match versions {m,n}? {m}? {m,}? also exist, as does ??.

OK, now for examples to make some sense of this.

"aaaa" is text matched by all of the following re's (the comma is used for separation purposes to keep the discussion compact but is not a part of any re): a, aa, aaa, aaaa, a+,a+?, a{1}, a{1}?, a{2,}, a{2,}?, a{2,3}, a{2,3}?, aa?, aa??, aa*, aa*?, and numerous other things. Let's see what these match and where if allowed to repeat (always throwing away what has matched so far). [I'll use ":" as separator to keep discussion compact]

a: matches first "a"; will also match each of the next a's when the match is repeated from where left off. [as many as 4 matches]
aa: matches first "aa" pair and can also match the final pair. [as many as 2 matches]
aaa: will match only once starting with the first "a" (so "aaa" is matched but not the final "a"). After the initial match and consumption, you will get that no more matches are possible.
aaaa: matches once as expected
a+: will match only once just like aaaa.
a+?: will match four times just like a
a{1}: will match just like a
a{1}?: will match just like a
a{2,}: will match 1 time just like aaaa did.
a{2,}?: will match 2 times just like aa did
a{2,3}: will match 1 time just like aaa did
a{2,3}?: will match 2 times just like aa did
aa?: will match just like aa
aa??: will match just like a. Note that we are saying (a)(a??) but NOT ((aa)??). The latter can match "" but the former doesn't.
aa*: will match just like aaaa
aa*?: will match just like a

Note that when I say "will match just like" I don't necessarily mean in all cases. For example consider "aaa". Here aa* will match like aaa but not like aaaa (as stated in this previous example).
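
If you would rather check the list above by machine than by hand, something like the following should do it. It is only a sketch: it uses python's re.findall, which repeats the match from where the last one left off, and I am assuming python's re behaves like the javascript test page for these simple patterns.

```python
import re

text = "aaaa"
patterns = ["a", "aa", "aaa", "aaaa", "a+", "a+?", "a{1}", "a{1}?",
            "a{2,}", "a{2,}?", "a{2,3}", "a{2,3}?", "aa?", "aa??", "aa*", "aa*?"]

# findall returns every non-overlapping match, scanning left to right.
results = {p: re.findall(p, text) for p in patterns}

for p, found in results.items():
    print(f"{p:8} {len(found)} match(es): {found}")
```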

Consider "aabbbaabbaab". It is matched by aab{1,} (the {1,} applies only to the b, so it matches "aabbb" then "aabb" then final "aab"), aab{1,}? (it matches "aab" then "aab" then final "aab"), aab (matches "aab" then "aab" then final "aab"), aab{2,3} (matches "aabbb" then "aabb" and no more), etc. Try these out on the javascript re test page in the link above.
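
The same kind of check works for this string. Again just a sketch with python's re.findall standing in for the test page:

```python
import re

text = "aabbbaabbaab"

greedy = re.findall(r"aab{1,}", text)    # b repeated as many times as possible
lazy = re.findall(r"aab{1,}?", text)     # b repeated as few times as possible
plain = re.findall(r"aab", text)
bounded = re.findall(r"aab{2,3}", text)  # needs 2 or 3 b's, so the final aab is skipped

print(greedy)   # ['aabbb', 'aabb', 'aab']
print(lazy)     # ['aab', 'aab', 'aab']
print(plain)    # ['aab', 'aab', 'aab']
print(bounded)  # ['aabbb', 'aabb']
```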

Also know that the non-ascii symbols can be matched. Yes they can! Some shorthands exist in some cases (eg, \t matches a tab character), but you can always rely on octal (\377) or hexadecimal notation (\xFF). Eg, in perl, javascript, and python, you can use \x46 to match "F". You can match any byte, even the ones that show up as invisible or with funky symbols in a simple text editor or terminal. Thus we can process binary files as well as text files (there are some differences in approach between general binary and text processing should you want to get fancy and/or efficient).
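
A small python sketch of these escapes, in case it helps. The sample strings and the byte sequence are made up; the last search uses a bytes pattern, which is how python's re handles binary data.

```python
import re

# "F" has character code 70, which is 46 in hexadecimal and 106 in octal.
hex_match = re.search(r"\x46", "ABCDEFG")
octal_match = re.search(r"\106", "ABCDEFG")

# \t is the shorthand for a tab character.
tab_match = re.search(r"\t", "a\tb")

# For binary data, use a bytes pattern against bytes.
binary_match = re.search(rb"\xff\x00", b"\x01\xff\x00\x02")

print(hex_match.start(), octal_match.start())  # both find the F at index 5
print(tab_match.start())                       # 1
print(binary_match.start())                    # 1
```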

Here is a brief interpretation of this [ignoring the (?(1)[&/].*)? at the end because I don't think I know that rule -- not to imply that it would be difficult to google and learn]:

The argument we are matching has an optional component at the front. This optional component must start with an optional "http://" followed by an optional word and a dot (eg, "www." or "uk.") followed by "youtube.com/" (not optional) followed by either "v/" or "" or "watch" or "watch.php" followed by "?" followed by either "v=" or "xxxxxx&v=" where each "x" represents some character (each x can be a different character) and there can be any number of such x's but at least 1. OK, that was the optional part at the beginning.

This is followed by a grouping of characters composed of one or more lc, uc, digit, "_", and/or "-". Eg, "fBWoWFF3OYU" fits this pattern. Also note that this part is () and not (?: ) because we want to "grab" it.

Finally the last part is an extension with which I am not familiar. I think it was introduced fairly recently in perl (last few years). Whatever it matches must follow what has been covered here up to this point and must also match through to the end of the argument that is being tested.
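
To make the interpretation above concrete, here is a rough python sketch of a pattern built only from that prose description. It is a hypothetical reconstruction, not the actual pattern from youtube-dl: the name VIDEO_ARG is made up, the trailing (?(1)...) part is left out entirely, and the front part is written with (?: ) so that the id is the only thing grabbed.

```python
import re

# Hypothetical reconstruction from the prose description; NOT youtube-dl's own pattern.
VIDEO_ARG = re.compile(r"""
    ^
    (?:                            # the optional component at the front
        (?:http://)?               # optional scheme
        (?:\w+\.)?                 # optional word and dot, eg www. or uk.
        youtube\.com/              # not optional
        (?:
            v/                     # either v/
          | (?:watch(?:\.php)?)?   # or nothing, watch, or watch.php
            \?                     #   followed by a question mark
            (?:.+&)?               #   optionally some characters and an ampersand
            v=                     #   then v=
        )
    )?
    ([0-9A-Za-z_-]+)               # the part we grab: lc, uc, digit, _ or -
    """, re.VERBOSE)

for arg in ["http://www.youtube.com/watch?v=fBWoWFF3OYU",
            "youtube.com/v/fBWoWFF3OYU",
            "fBWoWFF3OYU"]:
    print(arg, "->", VIDEO_ARG.match(arg).group(1))
```

All three arguments should give back the same id, which is the whole point of making the front part optional.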

There you have it for Regular Expressions 101 and 102-pre. If this did not make that much sense, maybe you skipped the exercises. Do use the link provided to try out examples.

Corrections, complaints, etc are welcomed.

Oooops, I can't leave yet without at least talking a bit about the "pay-off" of using regular expressions.

A simple example:

You want to know if any of the words "apple" "bear" "cat" "zoology" is found in some prose (eg, some website page). You also want to know if certain derivatives of these words are found, eg, apples, bearish, catty, etc. Finally, you want to know which words were found and where.

OK, try coding that without regular expressions. It isn't that complicated but it is tedious and error prone. Meanwhile, I can do a regular expression that would look ugly to the uninitiated but would otherwise capture this requirement in a tiny compact space and be readable once you have mastered the art.

For example, there are many possibilities to structure this, in part depending on whether you have even more requirements than what I just finished mentioning; however, consider this re for finding matches and knowing the word that is matched:

(apples?|bear(?:s|ish|er)?|cat(?:ty|s|ch-up)?|zoology)

Note that we would have access to the actual word and where it was located (because we used () on the outside). We also used (?: ) on the inside to make the searching more efficient, since (in this hypothetical example) we would not care about having a link to the beginning of the particular ending. Thus the searching done behind the scenes (by the RE machine) wouldn't have to slow down and take extra processing instructions to grab links/pointers to the subparts each and every time.
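
Here is roughly what that looks like in python. This is a sketch under a couple of assumptions: the pattern is spelled with "apples?" and with the endings made optional (so the bare words bear and cat are found too), and the sample sentence is made up.

```python
import re

# Outer () captures the word found; inner (?: ) groups the endings without capturing.
WORDS = re.compile(r"(apples?|bear(?:s|ish|er)?|cat(?:ty|s|ch-up)?|zoology)")

prose = "A bearish cat ate two apples."

# finditer yields one match object per word found, in order of position.
hits = [(m.group(1), m.start()) for m in WORDS.finditer(prose)]
print(hits)  # [('bearish', 2), ('cat', 10), ('apples', 22)]
```

Note that this pattern has no word boundaries, so it would also find the cat inside a longer word; whether that matters depends on the requirements.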

Caution: regular expressions cannot all by themselves accurately find any type of match you may want, but usually by adding some code (some loops with re's embedded inside of them) you can do almost anything you may require. An example of where to pass on REs is if you need to do massive xml/html/etc processing and manipulation. RE's won't work without a bunch of other code anyway, so you would be better off cleanly using an xml parsing library (RE's don't generally accurately identify the sort of nesting one needs to be able to find in XML parsing).

You may also have noted that I did not explain how to do a single match or a match/consume cycle. Basically, this can be done by calling different functions or using extra arguments to the functions. Essentially, that is a topic that depends on the language and is not really important for understanding REs. Just know that this capability is included as part of the deal and is part of what makes re's useful, since many times knowing what matched and/or where it matched is really what you want, and not just knowing that something did match somewhere.
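
As one example of how a language exposes this, here is a sketch in python. The pattern and text are made up; search with a start position and finditer are both part of python's standard re module.

```python
import re

pattern = re.compile(r"aab+")
text = "aabbbaabbaab"

# Single match: the first place the pattern matches.
first = pattern.search(text)
print(first.group(), first.span())

# Match/consume cycle by hand: search again from where the last match ended.
# (A pattern that can match "" would need extra care here to avoid looping forever.)
spans = []
pos = 0
while True:
    m = pattern.search(text, pos)
    if m is None:
        break
    spans.append((m.start(), m.end(), m.group()))
    pos = m.end()

# finditer runs the same cycle for you.
same = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
print(spans)
```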

(08:55:55 PM) ***tbuitenh just had a look at libervisco's youtube install script
(08:56:56 PM) tbuitenh: I'm sorry to say it is both useless and problematic
(08:58:00 PM) tbuitenh: you shouldn't download programs directly to the system directory you want to install them in
(08:58:40 PM) tbuitenh: if the download fails, you'll have junk in a place the user doesn't expect
(08:58:58 PM) tbuitenh: also, you're not actually saving much work
(08:59:16 PM) tbuitenh: in other words... WHAT WERE YOU THINKING?
(08:59:55 PM) libervisco:
(09:00:44 PM) libervisco: convenience mostly I suppose
(09:01:14 PM) libervisco: I guess I could've made it download in home directory somewhere and then linked it to /usr/bin..
(09:01:43 PM) tbuitenh: no, not linked, and definitely not to /usr/bin
(09:02:10 PM) tbuitenh: download it to /root, and if the download is successful, then mv it to /usr/local/bin
(09:02:22 PM) tbuitenh: or maybe cp
(09:02:32 PM) tbuitenh: then chmod 755 it
(09:03:07 PM) libervisco: isn't the result the same?
(09:03:10 PM) tbuitenh: and tell the user it was installed there in /usr/local/bin
(09:03:12 PM) libervisco: Who ever looks at /root?
(09:03:21 PM) libervisco: oh
(09:03:27 PM) libervisco: well that'd help ok..
(09:03:36 PM) tbuitenh: /root is the home directory of root
(09:03:44 PM) libervisco: yeah..
(09:03:56 PM) tbuitenh: you will see it there if you're root
(09:04:00 PM) libervisco: but nobody is logged in as root to be there by default and this was obviously for noobs..
(09:04:08 PM) tbuitenh: true
(09:04:32 PM) tbuitenh: then use /tmp, so it will be gone after the next reboot if the download failed
(09:04:47 PM) libervisco: right
(09:04:51 PM) tbuitenh: still, the script is useless
(09:05:42 PM) tbuitenh: download one script to download another script... d'oh
(09:05:58 PM) libervisco: lol
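
For what it's worth, the steps suggested in that chat (download to a temporary place, move into /usr/local/bin only if the download succeeded, chmod 755, tell the user where it went) could be sketched like this in python. The function name install_script and its parameters are made up for illustration, and the URL is left for the caller to supply; this is not the install script discussed above.

```python
import os
import shutil
import tempfile
import urllib.request


def install_script(url, dest_dir="/usr/local/bin", name="youtube-dl"):
    """Download to a temp file first, then move it into place and chmod 755."""
    # mkstemp puts the file in the system temp directory (usually /tmp).
    fd, tmp_path = tempfile.mkstemp(prefix=name + "-")
    os.close(fd)
    try:
        # If the download fails, this raises before anything is installed.
        urllib.request.urlretrieve(url, tmp_path)
        dest = os.path.join(dest_dir, name)
        shutil.move(tmp_path, dest)
        os.chmod(dest, 0o755)
    finally:
        # After a failed download the partial temp file is still there; remove it.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    print("installed to", dest)
    return dest
```

Writing into /usr/local/bin would still need root; passing a dest_dir inside your own home directory avoids that.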

I got carried away. The problem was that I could not understand why a simpler variation of the 4-line bash script was not working for me. That 4-line script itself seems not to have worked for someone else. I thought there might be a simple reason why, but no one has volunteered anything in these last couple of days since I first posted. For that reason, I thought of taking the large python program apart since presumably that one works (I haven't tried it since the main point was for educational purposes as you stated). In the meantime, I'd hope that someone would point out what was wrong or at least say that everything worked for them (the 4 liner, the 1 liner, or the much larger youtube-dl).

Both the analysis of youtube-dl and the regular expression mini tutorial are a topic unto themselves. I agree.

Yes, it is funny, but the logic I saw was that the first script could be a throwaway while the second one would reside in a permanent place.

Some people might really find it useful to use the installer for the real script. The best part for me was just reading through the code (and understanding it) even though nothing too special was done. I don't want to use it to install youtube-dl though since I will be experimenting with youtube-dl in its own directory (and not as root).

As for the installer being GPL, whoever is serious about growing that file may want to check out other installers to gain as much experience as possible. Educational purposes is one thing; otherwise, we should try and reuse others' work whenever possible (even if indirectly by studying how others have solved the problem or to learn what are the difficulties not yet solved).

I am using the Netvideohunter Firefox addon for downloading videos and music, not only from youtube but from almost any site with embedded players. I need only two clicks to download a video: http://netvideohunter.com

There are enough videos out there that can be freely copied and shared, relating to specific fields - like CC-licensed "howto" videos.

The problem is that although I can get a long search result listing of these videos by using good keywords, matching Video_ID's and keywords to make a download bundle is difficult at the moment. I need to browse each video and get its Video_ID and then make a script like this:

Sounds like a handy way to watch youtube vids without the need for flash or even going to youtube if you have a link. I like this idea actually, it's an easy alternative for those who aren't able to go to the site and view for any reason, or just choose not to. As you say we can either keep them or ditch them when we've finished watching. youtube videos are good to have when thinking about search engine optimisation as they do get the search engines' attention, so many more people will be using them to get information out in the future. Having alternative ways to view this information is always good.