Unlike the regex solution, this one doesn't list invalid URLs. But it also misses some valid ones, namely 9 and 10. This looks like a known issue with some CSS syntax, and it can't be fixed without rewriting the whole library from scratch. The ANTLR rewrite seems to be abandoned.

Question: How to extract all URLs from CSS files? (I need to parse any CSS file, not only the one provided as an example above. Please don't check for "noimg" or assume one-line declarations.)

N.B. This is not a "tool recommendation" question, as any solution will be fine, be it a piece of code, a fix to one of the above solutions, a library or anything else; and I've clearly defined the function I need.

It's even harder than you think. There is an additional case that should not match: URLs within quoted strings, e.g. p[example="...url(link)..."] { color: red }. (See: the CSS spec.) Thus, you cannot simply pluck out the URLs; you must parse the CSS file from start to end and correctly handle all quoted strings, comments and CSS tokens. That said, I'm pretty sure a single (non-trivial) regex solution can neatly do the trick, but it will require using a callback function. Stand by...
– ridgerunner, Aug 26 '13 at 16:46
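
For future readers, the idea in the comment above (one regex that consumes comments and strings as whole tokens, plus a callback that only records the url(...) tokens) can be sketched roughly as follows. This is a Python sketch, not ridgerunner's actual solution; the names `TOKEN` and `extract_urls` are made up for illustration, and CSS escapes and unterminated comments are handled only approximately.

```python
import re

# One alternation that swallows comments and quoted strings whole, so a
# url(...) hidden inside either is never seen as a token of its own.
TOKEN = re.compile(
    r"""
      /\*.*?\*/                              # a complete comment
    | "(?:\\.|[^"\\\n])*"                    # a double-quoted string
    | '(?:\\.|[^'\\\n])*'                    # a single-quoted string
    | url\(\s*(?:
          "(?P<dq>(?:\\.|[^"\\\n])*)"        # url with double-quoted body
        | '(?P<sq>(?:\\.|[^'\\\n])*)'        # url with single-quoted body
        | (?P<raw>[^\s"'()\\]*)              # url with unquoted body
      )\s*\)
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

def extract_urls(css):
    """Collect url(...) bodies that are outside comments and strings."""
    urls = []

    def on_token(match):
        # lastgroup is None for comments and plain strings, and one of
        # dq / sq / raw when the url(...) alternative matched.
        if match.lastgroup is not None:
            urls.append(match.group(match.lastgroup))
        return match.group(0)          # leave the text unchanged

    TOKEN.sub(on_token, css)
    return urls

sample = (
    'p[example="...url(link)..."] { color: red }\n'
    "/* url(hidden) */\n"
    "a { background: url( 'img(1)' ) }\n"
    "b { background: URL(img2) }\n"
)
print(extract_urls(sample))   # ['img(1)', 'img2']
```

Because the pattern requires `url(` with no space, a form such as `url (img)` is skipped as well.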

Do you have a choice of language? I would solve the problem in Perl.
– Owen Beresford, Aug 26 '13 at 21:43

9 Answers

RegEx is a very powerful tool. But when a bit more flexibility is needed, I prefer to just write a little code.

So for a non-RegEx solution, I came up with the following. Note that a bit more work would be needed to make this code more generic to handle any CSS file. For that, I would also use my text parsing helper class.

What you appear to be asking seems beyond the scope of a simple how-to question for Stack Overflow. I do not believe you will get satisfactory results using regular expressions. You will need some code to parse your CSS and handle all the special cases that come with it.

Since I've written a lot of parsing code and had a bit of time, I decided to play with this a bit. I wrote a simple CSS parser and wrote an article about it. You can read the article and download the code (for free) at A Simple CSS Parser.

My code parses a block of CSS and stores the information in data structures, separating and storing each property/value pair for each rule. However, a bit more work is still needed to get the URLs: you will need to parse them out of the property values.

The code I originally posted will give you a start on how you might approach this. But if you want a truly robust solution, then some more sophisticated code will be needed. You might want to take a look at my code to parse the CSS. I use techniques in that code, such as parsing a quoted value, that could be used to easily handle values such as url('img(1)').
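
As a rough illustration of that last step (pulling URLs out of an already-separated property value by reading quoted values character by character), here is a minimal sketch. It is written in Python rather than the C# of the article, and `urls_in_value` and `_read_quoted` are hypothetical helpers, not functions from the parser mentioned above. Escape handling is simplified, and an identifier that merely ends in "url" is not ruled out.

```python
def _read_quoted(text, i):
    """Read the quoted string starting at text[i].

    Returns (content, index just past the closing quote).
    """
    quote, i, out = text[i], i + 1, []
    while i < len(text) and text[i] != quote:
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])    # simplified: keep the escaped char
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out), min(i + 1, len(text))

def urls_in_value(value):
    """Return the URLs found in a single CSS property value."""
    urls, i, n = [], 0, len(value)
    while i < n:
        if value[i] in "\"'":
            # An ordinary string: skip it so url(...) inside is ignored.
            _, i = _read_quoted(value, i)
        elif value[i:i + 4].lower() == "url(":
            i += 4
            while i < n and value[i].isspace():
                i += 1
            if i < n and value[i] in "\"'":
                url, i = _read_quoted(value, i)
            else:
                start = i
                while i < n and value[i] not in ") \t\r\n":
                    i += 1
                url = value[start:i]
            while i < n and value[i].isspace():
                i += 1
            if i < n and value[i] == ")":   # accept only a closed url(...)
                urls.append(url)
                i += 1
        else:
            i += 1
    return urls

print(urls_in_value("url('img(1)') no-repeat, url( img2 )"))
```

Because the quoted body is read up to its matching quote, the parenthesis inside 'img(1)' does not end the URL early.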

I think this is a pretty good start. I could write the remaining code for you as well. But what's the fun in that. :)

Again a version optimized for the sample I've provided in the question. I need to handle any CSS, not just the one provided above. The solution you've given is no better than regex. It'll also fail to parse url('img(1)').
– Discord, Aug 21 '13 at 5:24

@Authari: I've done a lot of parsing code and could easily extend this to write code to parse CSS more generally, as I suggested in my answer. But then I'd need to know more about how you wanted it structured, etc., as there could potentially be a lot of information. Your question seemed more focused on how you could specifically get the URL value.
– Jonathan Wood, Aug 21 '13 at 15:46

url (img) syntax is incorrect, because space is not allowed between url and ( in CSS grammar. Therefore, "img6", "img7" and "img8" should not be returned as URLs.

An unclosed quote in the url function (url('img)) is a serious syntax error; web browsers, including Firefox, do not seem to recover from it and simply skip the rest of the CSS file. Therefore, requiring the parser to return "img9" and "img10" is unnecessary (but necessary if the two problematic lines are removed).

You're correct on point 1: the grammar in the CSS spec is "url(" whitespace (string or urlchar* ) whitespace ")". However, it is reasonable that a User Agent wouldn't be as strict and would allow the whitespace.
– Daniel Gimenez, Aug 26 '13 at 5:58
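
To make the difference concrete, the two readings of the grammar can be compared side by side. The following is only a Python sketch with made-up names (`STRICT`, `LENIENT`, `find_urls`) and a made-up sample; it ignores comments and strings entirely and only shows the effect of allowing whitespace between `url` and `(`.

```python
import re

# Shared tail: optional whitespace, then a quoted or unquoted body,
# optional whitespace, and the closing parenthesis.
BODY = r"""\s*(?:"([^"\n]*)"|'([^'\n]*)'|([^\s"'()]*))\s*\)"""

STRICT = re.compile(r"url\(" + BODY, re.IGNORECASE)       # no space before "("
LENIENT = re.compile(r"url\s*\(" + BODY, re.IGNORECASE)   # tolerates "url ("

def find_urls(pattern, css):
    """Return the URL bodies matched by the given pattern."""
    return [
        next(g for g in m.groups() if g is not None)
        for m in pattern.finditer(css)
    ]

sample = (
    "a { background: url(img1) } "
    "b { background: url (img6) } "
    "c { background: url( 'img2' ) }"
)
print(find_urls(STRICT, sample))    # ['img1', 'img2']
print(find_urls(LENIENT, sample))   # ['img1', 'img6', 'img2']
```

Which of the two is "right" depends on whether the goal is to follow the grammar or to mimic a forgiving user agent.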

The choice was simple, as this is the only complete regex solution without "cheating", that is, without assuming that the CSS would look exactly like the provided example. It is also the most maintainable regex solution, as it does not try to fit all the logic into one huge "clever" regex.
– Discord, Aug 27 '13 at 6:08

Some notes on the code quality: 1) Unless you're forced to use an old version of .NET, new List<string>{ new string[] { a, b } } can be rewritten as new List<string>{ a, b }. 2) validProperties can be an array (declared outside the function), as LINQ has a Contains method which works on arrays too. 3) The function can return IEnumerable<string> and use yield return to return items. 4) I haven't checked yet, but the do-while loop seems unnecessary, as Regex.Replace should replace all occurrences. 5) Calls to ToLower should be replaced with string.Equals with ...
– Discord, Aug 27 '13 at 6:18

Although this'll probably work for 99.9% of the cases, to demonstrate why indeed a CSS-parser would be better (as noted by OP), this would fail: content:'/*'; background:url(img1); content:'*/'; Just adding for future readers.
– funkwurm, Aug 28 '13 at 12:46
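
The failure described in the comment above is easy to reproduce. The Python sketch below (all names are illustrative, and the URL pattern is deliberately simplistic) strips comments in two ways: naively, and with a pattern that matches quoted strings first so that a `/*` inside a string is never treated as a comment start.

```python
import re

# Naive: anything between /* and */ is a comment, even inside strings.
NAIVE_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# String-aware: try a quoted string first (group 1), then a comment.
STRING_OR_COMMENT = re.compile(
    r"""("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|/\*.*?\*/""",
    re.DOTALL,
)

# Deliberately simple URL pattern; it is not the point of this example.
URL = re.compile(r"url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)", re.IGNORECASE)

def urls_naive(css):
    return URL.findall(NAIVE_COMMENT.sub("", css))

def urls_string_aware(css):
    # Keep strings as they are (group 1), drop real comments.
    stripped = STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", css)
    return URL.findall(stripped)

tricky = "a { content:'/*'; background:url(img1); content:'*/'; }"
print(urls_naive(tricky))          # []
print(urls_string_aware(tricky))   # ['img1']
```

The string-aware variant still reports a url(...) that appears inside a quoted string, so it fixes only the comment problem shown here.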

You need a negative lookbehind to check that there is no /* without a following */, like this:

(?<!\/\*([^*]|\*[^\/])*)

This looks unreadable, but it means:

(?<! -> what precedes this match may not be:

\/\* -> /* (with escape slashes), followed by

([^*] -> any character that isn't *

|\*[^\/]) -> or a character that is *, but is itself followed by anything that isn't /

*) -> zero or more of these (a character that isn't *, or a * not followed by /), and finally the close of the negative lookbehind
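
For readers who want to experiment with this outside .NET: Python's `re` module only accepts fixed-width lookbehinds, so the sketch below applies the same sub-pattern to the text preceding a position instead. The names `OPEN_COMMENT` and `inside_comment` are made up for illustration.

```python
import re

# Same sub-pattern as inside the lookbehind, anchored at the end of the
# prefix: a "/*" followed only by non-"*" characters or by "*" + non-"/".
OPEN_COMMENT = re.compile(r"/\*(?:[^*]|\*[^/])*\Z")

def inside_comment(css, pos):
    """True if the pattern considers position pos to be inside a comment."""
    return OPEN_COMMENT.search(css[:pos]) is not None

css = "/* url(a) */ b { background: url(c) }"
print(inside_comment(css, css.index("url(a)")))   # True
print(inside_comment(css, css.index("url(c)")))   # False
```

Experimenting with this also shows a limitation of the sub-pattern: for a comment that ends in `**/`, the alternative `\*[^\/]` consumes both asterisks, so text after such a comment is still treated as commented out.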

And you need a positive lookbehind to check whether the property being set is a CSS property that accepts url() values. If you are only interested in background: and background-image:, for instance, this would be the entire regex:

Since this version requires the css property background: or background-image: to precede the url(), it will not detect the 'url(noimg4)'. You could use simple pipes to add more accepted css properties: (?<=(?:border-image|background(?:-image)?):\s*)

I've used \1 rather than \k<Quote> because I'm not familiar with that syntax, which means you need the ?: to not capture unwanted subgroups. As far as I can test this works.

Finally I used [^\n'"] for the actual url because I understand from your comments that url('img(1)') should work and [^\)] from your OP won't parse that.
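
The difference between the two character classes can be checked in isolation. The following Python sketch uses simplified patterns (not the full regex from this answer; `NAIVE` and `QUOTE_AWARE` are illustrative names) to compare `[^)]` with a quote-aware body that uses the `\1` backreference.

```python
import re

# Body is "anything up to the first closing parenthesis".
NAIVE = re.compile(r"url\(([^)]*)\)")

# Group 1 is the optional opening quote, \1 requires the same quote
# (or nothing) before the closing parenthesis.
QUOTE_AWARE = re.compile(r"""url\(\s*(['"]?)([^\n'"]+?)\1\s*\)""")

css = "a { background: url('img(1)') } b { background: url(img2) }"

print(NAIVE.findall(css))
# ["'img(1", 'img2']   (cut off at the first ")")

print([m.group(2) for m in QUOTE_AWARE.finditer(css)])
# ['img(1)', 'img2']
```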

1) CSS allows comments inside declarations, AFAIK, so checking for comments only on the declaration's boundaries is incorrect. 2) If you want to make your regex more readable without explaining every symbol, you can use the (?n) and (?x) options. 3) See backreference constructs to learn about the \k syntax.
– Discord, Aug 25 '13 at 12:53
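
To illustrate points 2 and 3 of the comment above, here is how a property-restricted pattern might look in free-spacing mode. This is a Python sketch, so the equivalents are assumptions on my part: `re.VERBOSE` plays the role of (?x), Python has no counterpart to .NET's (?n) so the non-capturing groups are written out, and `(?P=quote)` stands in for \k<Quote>. The property is matched as an ordinary prefix instead of a lookbehind, comments are not handled, and only the first url() per declaration is found.

```python
import re

URL_IN_BACKGROUND = re.compile(
    r"""
    (?: border-image | background (?: -image )? )   # accepted properties
    \s* :                                           # the colon
    [^;{}]*?                                        # other values before the url
    url\( \s*
    (?P<quote> ['"]? )                              # optional opening quote
    (?P<url> [^\n'"]+? )                            # the URL itself
    (?P=quote) \s* \)                               # same quote, then close
    """,
    re.VERBOSE | re.IGNORECASE,
)

css = (
    "a { background: #fff url('img(1)') no-repeat; }\n"
    "b { background-image: url(img2); }\n"
    "c { list-style: url(noimg4); }\n"
)
print([m.group("url") for m in URL_IN_BACKGROUND.finditer(css)])
# ['img(1)', 'img2']
```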

This solution skips comments and handles background-image. It also handles background, which, unlike background-image, can contain other values such as those of background-color, background-position, or repeat. This is why I have added these cases: noimg5, img11, img12.