Web Scraping with Html.Table in Power Query

The older I get, the more consistently I encounter things that make me feel old. And it’s not just the usual things, like those damned kids on my lawn. It’s things like technology. And sometimes it’s things like new features in Power BI.

Let’s take, for example, the new “from web” experience that was introduced in Power BI Desktop over the summer, and which is generally available (out of preview) this month. This is a cool feature that at first glance looks pretty straightforward, but it caused me no end of problems.

When it became clear that I wasn’t going to be able to do what I needed with only the UI, I started looking into the Power Query functions being used in the query created by the UI. The code in my PBIX file didn’t make any sense to me, so I look in more places, like:

And still, I couldn’t quite figure out how to solve my problem. I was frustrated and annoyed , not just because I couldn’t do what I needed, but because I was convinced that there was some simple, easy thing that I was missing.

And I was right.

This brings me back to feeling old.

When I was your age[1], we only had one HTML tag, and it was <marquee> and we were glad to have it. We didn’t need any of those fancy stylesheets that kids these days have cascading all over the place, because we had H1 and H2 and that was more than enough for any page we could think of.

All pretense aside, I didn’t know that CSS selectors existed before I started trying to use the Html.Table function in Power Query, and that was a real problem. The inline help mentioned CSS selectors in the function description, but without more prior knowledge and context, this wasn’t helpful to me. At all. So I asked for help.

One of the developers on the Power BI team took the time to listen to what I was trying to accomplish, and gave me the code that I’d been so consistently failing to write on my own[2]. Once I had something simple that worked, I was able – with a little more research and some reformatting – to build the queries that I needed, but I don’t think I could have done it without that help.

So, working under the assumption that I’m not the only old guy out there who knows data pretty well, but hasn’t kept up to date on web technologies like CSS, I wanted to share the story and the code.

Here’s the pattern I was trying to implement. This is a relatively common pattern in my experience, although the details are different each time:

Extract a set of links and related data from a parent web page

For each of the links extracted from the parent, extract a set of data from the child web page, where each of the child pages has a consistent format

Use the data from the parent and the child pages to build a data set for reporting and analysis

I started off by using the UI, stepping through the new “From web” experience in Power BI Desktop. I knew that this wouldn’t give me everything I needed – you can’t get the URLs behind a link, you can only get the text – but I hoped it would give me a starting point.

It gave me this:

I’m sharing this query as an image and not as text because I want to highlight one of the challenges I faced: the formatting. Because the Html.Table function accepts a complex parameter, I needed to “pretty format” it before I could visually parse it and compare the specific syntax to the available help, and to the source of the web page I was trying to use as a data source.

Even then, I was still stuck. There were not enough points of reference between the HTML and the query for me to relate them and to start modifying the query script. After fighting with this for longer than I care to admit, I asked for help. In return, I got this query:

That was much better! With this code, formatted so it was clear how the various parts of the columnNameSelectorPairs parameter related to each other, I could finally make some progress on my own. In short order I had this query:

The second tool is the SelectorGadget Chrome extension, which lets you click on elements on the page, and see the CSS selector that would select them. This may not have been an absolute requirement, but it definitely saved me time and swearing.

Hopefully this has been useful. I may be the only old dog out there trying to learn new tricks, but then again, I may not be. Ether way, get off my lawn!

[1] Please go back and read that in your best “crotchety old man” voice. Thank you.

[2] Thanks again, Ehren!

[3] Or maybe build a demo for the BI Power Session at the PASS Summit next month. It’s hard to tell the difference sometimes.