The gist: the small company I work for advertises its products through Google Merchant. We upload the products in an XML file as per Google's requirements.

The problem: manually formatting thousands of products into XML is an arduous task. What I want is a rapid-fire way to convert the relevant information on each product page into formatted XML. I'm looking for a (semi-)automatic way to go from bigHTMLSourceCode --> formattedXML.

If I'm not being clear, imagine wanting to format an Amazon product page into XML. You want the cost, description, weight, etc., arrayed in a certain way, with the appropriate XML tags, etc., and doing so for thousands of products isn't tenable.

I've Googled extensively, but haven't had any luck finding programs that can help with this.

@OliverSalzburg Much of the product information is manually maintained; each page also contains automatically generated information, but I don't have access to the 'back end' of things, and have been asked to come up with a solution with what's available (and all the needed information is definitely contained in the raw source code).
– MrT, Apr 5 '12 at 15:40

2 Answers

You'll find many success stories with the Python module Beautiful Soup, and it is widely recommended for web scraping, which is how I would categorize this task (if you suggest solutions with regular expressions, you'll quickly get reprimanded by the SU and SO users :-) ). It is what I would use to scrape your Amazon example, and I have used it in other contexts.

If you have some very basic Python experience, you can probably look at examples and quickly have a working solution. If you have general programming experience but no Python, you can probably do the same with a little more time.

(I don't like it when people say "Oh, it is real easy!" when in practice it takes a long time for someone not used to the tool, but I believe Beautiful Soup and Python are a simple and robust solution. If you find a solution that fits you better: great :-) ).
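To give a feel for the size of such a script, here is a minimal sketch. Everything specific in it is an assumption for illustration: the sample HTML, the class names (`product-title`, `price`, `description`), and the output element names (`items`, `item`, and so on) are hypothetical and are not Google's actual feed schema. You would replace the selectors with whatever your product pages really use, and the element names with the ones Google requires.

```python
from bs4 import BeautifulSoup          # third-party: pip install beautifulsoup4
import xml.etree.ElementTree as ET     # standard library

# Hypothetical product page; real pages will use different tags and classes.
SAMPLE_HTML = """
<html><body>
  <h1 class="product-title">Blue Widget</h1>
  <span class="price">19.99 USD</span>
  <div class="description">A sturdy blue widget.</div>
</body></html>
"""


def extract_product(html):
    """Pull the fields of interest out of one product page."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        # Return the stripped text of the first match, or "" if absent.
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else ""

    return {
        "title": text_of("h1.product-title"),
        "price": text_of("span.price"),
        "description": text_of("div.description"),
    }


def products_to_xml(products):
    """Serialize a list of product dicts into one XML string."""
    root = ET.Element("items")  # placeholder element names, not Google's
    for product in products:
        item = ET.SubElement(root, "item")
        for field, value in product.items():
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")


product = extract_product(SAMPLE_HTML)
xml_text = products_to_xml([product])
print(xml_text)
```

For thousands of products you would loop over the saved HTML files (or fetched pages), call `extract_product` on each, and pass the whole list to `products_to_xml`. Building the XML through `ElementTree` rather than by string concatenation means characters such as `&` and `<` in descriptions get escaped for you.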

Addendum: what kind of system do you have where all pages are static HTML? Is the data not stored in a database somewhere? I guess not, judging by your question. Manually maintained pages can pose a problem (for any automatic solution) if the HTML is not consistent across the product pages.
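One way to find out how consistent the pages are before writing the real scraper is to check each page for the markers you intend to rely on. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without installing anything, and the required class names and the two sample pages are made up.

```python
from html.parser import HTMLParser

# Hypothetical class names the scraper would depend on.
REQUIRED = {"product-title", "price", "description"}


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_fields(html):
    """Return the required class names that this page does not contain."""
    collector = ClassCollector()
    collector.feed(html)
    return REQUIRED - collector.classes


# Made-up pages; in practice you would read these from saved HTML files.
pages = {
    "widget.html": '<h1 class="product-title">Widget</h1>'
                   '<span class="price">5.00</span>'
                   '<div class="description">A widget.</div>',
    "gadget.html": '<h1 class="product-title">Gadget</h1>'
                   '<div class="description">A gadget.</div>',
}

report = {name: sorted(missing_fields(html)) for name, html in pages.items()}
for name, missing in report.items():
    print(name, "OK" if not missing else "missing: " + ", ".join(missing))
```

Pages that show up with missing fields are the ones that would need a special case in the scraper, or a manual fix.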

Thanks! I was hoping this problem ('web scraping' -- new term!) was common enough that there'd be programs dedicated to the task, but modules might be good enough. I don't have Python experience, but I have taken courses in C++ and Java. Product information is stored in a database, but I don't have access to it; my boss has asked me to come up with a solution with what I have, since all the information needed is in the source.
– MrT, Apr 5 '12 at 15:41

Thanks! I'll look at these tools. I'm hoping to avoid writing programs and scripts (I'm a baaad programmer), but I'll dive into it if I have to. The HTML->XML converters I've found haven't proved suitable.
– MrT, Apr 5 '12 at 15:45