The first issue I have is that there appears to be an error in the URL string itself. For some reason, the letter "o" is being represented by "#243;" in "Spanish Primera Division", which is jacking up the preg_match function. I don't know how much that is going to impact what I am looking to do as a whole. As it's a problem with the source data itself, there's nothing I can do about it, but I thought I would mention it anyway.

Here's what I can't figure out how to do... when the url string is fetched using file_get_contents, I want to break out the results further so that only results such as this:

(as well as any others with a "_right" value of "Italian Serie A") are returned. I've used regex before on these types of URL strings, but only to return things as a whole. Would the process for using the regex to do what I am looking to do be the same?

Apples and oranges. file_get_contents returns the actual HTML, in which case I would use DOM to parse it. However, your post seems to imply you want to parse the URL itself, not the HTML to which it points.

There are 10 kinds of people in the world. Those that understand binary and those that don't.

The URL string is more or less broken. The people who wrote this ran some (all?) parameters through an HTML escaping function (using the old ASCII encoding). This turns characters like "ó" (an "o" with an accent) into HTML entities like "&#243;".

When you decode the URL, you bring up those "&" characters from the HTML entities, and they clash with the "&" characters separating the query parameters.
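To make the clash concrete, here is a minimal sketch. The parameter name "s_right1_1" and the sample value are made up for illustration, so the real feed may look different.

```php
<?php
// Hypothetical raw query string: the "&#243;" entity arrives URL-encoded as %26%23243%3B.
$raw = 's_right1_1=Spanish%20Primera%20Divisi%26%23243%3Bn';

// Decoding the whole string first exposes the "&" of the entity,
// which parse_str() then treats as a parameter separator.
parse_str(urldecode($raw), $broken);

// Letting parse_str() split on the real separators first keeps the value in one piece.
parse_str($raw, $intact);

var_dump($broken['s_right1_1']); // value is cut off where the entity starts
var_dump($intact['s_right1_1']); // value still contains the complete entity
```

The first result is what produces the stray "#243;" fragment: everything after the entity's "&" ends up as a separate, bogus parameter.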

Parsing the URL by hand is a bad idea, anyway. But parse_url() makes no sense either. What you want is parse_str(). This parses the query part of a URL.

The procedure is this:

Run the URL through parse_str()

URL-decode the parameters

HTML-decode the parameters
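The steps above could be sketched roughly like this. The parameter names and sample values are hypothetical stand-ins for the real feed. Note that parse_str() already URL-decodes names and values, so the URL-decoding step happens inside that call and only the HTML decoding is left to do by hand.

```php
<?php
// Hypothetical query string; rawurlencode() stands in for whatever the feed does.
$query = 's_left1=' . rawurlencode('Real Madrid 2 Valencia 1 (FT)')
       . '&s_right1_1=' . rawurlencode('Spanish Primera Divisi&#243;n');

// Steps 1 and 2: split the query string into parameters (URL-decoded by parse_str).
parse_str($query, $params);

// Step 3: turn HTML entities such as "&#243;" back into real characters.
$decoded = array_map(
    function ($value) {
        return html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    },
    $params
);

echo $decoded['s_right1_1'], PHP_EOL;
```

This assumes every parameter is a plain string. If the feed ever used array-style names like "key[]", the array_map() call would need to recurse.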

You might consider sending a bug report to that site telling them about the HTML entities problem.

Why can’t I use certain words like "drop" as part of my Security Question answers?
There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".

Now, as for using parse_str to get the desired outcome, I am at a loss, because of the dependencies involved. The game scores are all indicated with an index of "_left(digit) = " . Thus, there are four scores. However, like I said in my original post, the goal here is to take the output above and pull out only those for a certain European football league. The league is indicated with an index of "_right(digit)_1" . Example.. say I only want scores returned for games in the Italian Serie A. With the current data, that would mean this would be the only score returned:

I'm not sure how to set things up with parse_str so that it checks the "_right" index for "Italian Serie A", and returns the values from the corresponding "_left(digit) = " and "_url(digit) = " index values if true. I have a good enough grasp on the parse_str function to get the job done if I wanted to pull everything, but it seems like what I am looking to do requires a bit of working backwards... Am I totally off with that assumption? Any hints, tips or otherwise?

First of all, manually removing that "ó" stuff is a terrible idea. Who says that's the only special character that will ever appear in the data for the lifetime of your application? It most likely won't be the only one.

This is a general problem, so it needs a general solution. The problem is not that some evil Spaniard injected an "ó" into the string to annoy you. The problem is that any character outside of the ASCII range is represented as an HTML entity. So forget about the "ó" and use the steps described above.

Next thing is that you confuse two different things: parsing and filtering the data have nothing to do with each other. Sure, you could do both at the same time using those ugly regexes. But I recommend you don't. First parse the data with parse_str(). And then filter it by checking the array indices with a regex.

This will reduce your code to a few simple lines.
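As a rough sketch of the filtering half: the function name scoresForLeague(), the "s_" key prefix and the sample data below are all made up, and the key layout ("_left(digit)", "_right(digit)_1", "_url(digit)") is taken from the description earlier in the thread. It assumes the array has already been produced by parse_str() and HTML-decoded.

```php
<?php
// Hypothetical helper: returns score/url pairs for every game in the given league.
function scoresForLeague(array $params, string $league): array
{
    $games = [];
    foreach ($params as $key => $value) {
        // Only look at the "..._right(digit)_1" keys that hold the wanted league name.
        if (!preg_match('/^(.*)_right(\d+)_1$/', (string) $key, $m) || $value !== $league) {
            continue;
        }
        [, $prefix, $n] = $m;
        // Look up the entries that share the same prefix and digit.
        $games[] = [
            'score' => $params["{$prefix}_left{$n}"] ?? null,
            'url'   => $params["{$prefix}_url{$n}"] ?? null,
        ];
    }
    return $games;
}

// Made-up sample data in the shape described in the thread.
$params = [
    's_left1'    => 'Juventus 2 Milan 1 (FT)',
    's_right1_1' => 'Italian Serie A',
    's_url1'     => 'http://example.com/game1',
    's_left2'    => 'Barcelona 3 Sevilla 0 (FT)',
    's_right2_1' => 'Spanish Primera División',
    's_url2'     => 'http://example.com/game2',
];

$games = scoresForLeague($params, 'Italian Serie A');
print_r($games);
```

The regex only ever inspects the array keys, so no "working backwards" through the raw string is needed: the digit captured from the "_right" key is enough to find the matching "_left" and "_url" entries.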

The underlying problem, however, is that the data source is extremely poor, because it's an unstructured bunch of values. Is this actually the official API? Don't they have XML or JSON or something?

No, this is not the official API, and no XML or JSON source is available to my knowledge. The URLs are from ESPN's bottom line desktop widget. There's a separate one for each of the different professional sports leagues, except when it comes to soccer. Hence my current parsing endeavor. What I've done with the other URLs (e.g. MLB, NFL, NBA, etc.) is strip them down so that each individual score is reported as an RSS item. I then use these in an RSS reader on my website and voilà... I have a homebrewed sports score ticker. But anyway...

Like I said: The actual values of the URL are encoded as HTML entities. If you view the URL-decoded string with a browser, you'll see the original characters.

Technically, this is bad, because you're supposed to get the raw content and not some preprocessed stuff. But if the only thing you'll ever do with the data is outputting it on the screen, then this bug shouldn't be a problem.

I also ran things using some archived data (since the current data reflected in this post doesn't show games in multiple leagues) to ensure my $pattern regex was truly working as it should. After that, it was simply a matter of adding the appropriate code for RSS 2.0 and modifying the echos to produce each feed item.

So the one remaining question I have is regarding my decision to use regexes in lieu of parse_str and how I subsequently grouped each game's score-league-link combination into its own sub-array using the array_chunk function... Was this the best way to accomplish what I wanted? I just think back to what Jacques said about using parse_str and how it would reduce my code to a few simple lines... Did I make this harder than it really needed to be?