Shivahn wrote:Ok, so from time to time a task comes up at work that is mind-numbing and repetitive, and I am lazy, so as soon as something like that shows up I try to find a way to make the computer do that for me. Well, one of those is coming up, but fixing it requires knowledge in an arena I have not coded in: internet interfaces. I need to basically take a massive file with information and format it (no big deal), then do stuff to it. I guess I'll describe what I do now. I log into a website (which creates a pop-up window that's the one I actually communicate with), use one of the fields on the site to search and see if the entity I'm working on is in the system, if not, click a link which brings me to a registration page and then fill that page in with data about the entity, click another hyperlink and mess with a couple of drop-down menus, click (I THINK it's a hyperlink) in a calendar-type thing, then click a checkbox next to a specific time, then click a save button.

I have a big file from which I can get all the data I need for the forms, but I don't know how to write a program that communicates with a website like this. I'd probably write the thing in Python, but a language-agnostic tutorial would be excellent. Does anyone have any suggestions? I know very basic network theory, but I wouldn't know how to begin either logging in with a console-based program or navigating menus, boxes, search fields, and so on with one.

There are two aspects to what you need to deal with. The first part is HTTP, which is pretty straightforward. You won't have to know the specifics of the protocol, just the nature of the GET and POST messages. You could write code to handle that (it's not hard), but every language already has an API written for doing so.

The second part is navigating the HTML DOM. This could be a bit finicky. If the web pages are poorly written and don't conform to standards, some HTML parsers could fail miserably. Anyways, you'll have to understand the HTML tree so you can figure out whether your entity is present. The code will look something like this:
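A minimal Python sketch of that presence check, using only the standard library's html.parser. It assumes the search results come back as table cells; the sample markup, the names CellTextCollector and entity_in_results, and the entity names are all made up for illustration, and the real page will need its own inspection.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the text content of every <td> cell in a page."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text nested inside a cell (e.g. within <b> or <a>) is appended too.
        if self._in_cell:
            self.cells[-1] += data


def entity_in_results(html, name):
    """Return True if any table cell equals `name`, ignoring case and padding."""
    collector = CellTextCollector()
    collector.feed(html)
    wanted = name.strip().lower()
    return any(cell.strip().lower() == wanted for cell in collector.cells)


# Hypothetical search-results page; the real markup will differ.
SAMPLE_RESULTS = """
<html><body>
<table id="results">
  <tr><td>Acme Corp</td><td>Active</td></tr>
  <tr><td>Globex</td><td>Inactive</td></tr>
</table>
</body></html>
"""

found = entity_in_results(SAMPLE_RESULTS, "acme corp")
```

If the real site's HTML is badly malformed, a more forgiving parser may be needed in place of html.parser.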

I'm a PHP dev, and haven't had to do a task like this, but if I were to do it, I'd likely use Python as well (the basics of what I'd do are pretty much the same, just using different libraries/syntax) - a fine opportunity to learn a bit of Python, too.

First off, you should check whether your pop-up is loaded via AJAX or as a separate page or whatever. Now, assuming the popup has a <form> element (use Firebug in FF or Chrome developer tools or whatever, to find this out), it's probably POSTing your search query to some page (I have no idea how experienced you are with web-stuff - <form action="page.php" method="POST"> means that the form data is being sent via the POST method to page.php). Basically, you need to find out the exact URL that the form is posting to, and send data there from your script - it might also be that this is done via javascript, so you would have to dig into javascript and check where the search query is being sent to.
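If you'd rather not read the markup by eye, something like the following could list each form's target URL, method, and field names. This is only a sketch under assumptions: the sample form, the field name "query", the popup URL, and the helper names FormFinder and find_forms are invented here, and it won't see anything that JavaScript builds at runtime.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFinder(HTMLParser):
    """Record each <form>'s action, method, and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.forms.append({
                "action": attrs.get("action") or "",
                # Browsers default to GET when no method is given.
                "method": (attrs.get("method") or "get").upper(),
                "fields": [],
            })
        elif tag in ("input", "select", "textarea") and self._in_form:
            if attrs.get("name"):
                self.forms[-1]["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def find_forms(html, page_url):
    """Return the forms on a page with their actions resolved to absolute URLs."""
    finder = FormFinder()
    finder.feed(html)
    for form in finder.forms:
        form["action"] = urljoin(page_url, form["action"])
    return finder.forms


# Hypothetical popup markup; inspect the real page to see what it contains.
SAMPLE_POPUP = """
<form action="page.php" method="POST">
  <input type="text" name="query">
  <input type="submit" value="Search">
</form>
"""

forms = find_forms(SAMPLE_POPUP, "http://www.site.com/popup/index.php")
```

The resolved action is the URL your script would send data to, and the field names are the keys the server expects.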

I'd post my actual search query using cURL (I see that Python has pycurl - I'd either try that, since I'm somewhat familiar with cURL, or play around with urllib2, as demonstrated here). You don't really need to know much about HTTP for this, since the libraries you use will take care of most of that (though if you can spare the time, lightweight projects like this one are the ideal place to learn something new!)

Let's say you find out that this is http://www.site.com/search.php - since this is behind a login, your script should also do that as well. As a quick hack, I'd likely just log in with my browser, and check the cookies - there will be some kind of session id or something there - I'd probably just copy all the cookies there and send them from my script as well, including also the user-agent string my browser sends (since a lot of sites try to match user-agents within a single session, to make session hijacking harder) - another important thing is to check whether (and how often) the site regenerates session IDs (basically, how often does the site send the session cookie with a different value), so I'd know whether my script also has to accept and set cookies - though I can't imagine this would be hard to implement anyway (cookies are basically just an associative array you keep memorized).
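Here is roughly what the "copy the cookies and user-agent" hack could look like with Python's standard library. Treat it as a sketch: the cookie name PHPSESSID, its values, the user-agent string, and the helper names build_request and absorb_set_cookie are placeholders, not anything the real site is known to use. Nothing is actually sent over the network here.

```python
import urllib.request
from http.cookies import SimpleCookie

# Hypothetical values copied by hand from the browser's developer tools.
cookies = {"PHPSESSID": "abc123"}
USER_AGENT = "Mozilla/5.0 (copied from the browser)"


def build_request(url, cookies, user_agent, data=None):
    """Build a request carrying the browser's cookies and user-agent.

    Passing `data` (bytes) makes urllib send a POST instead of a GET.
    """
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {"Cookie": cookie_header, "User-Agent": user_agent}
    return urllib.request.Request(url, data=data, headers=headers)


def absorb_set_cookie(cookies, set_cookie_headers):
    """Update the cookie dict from Set-Cookie header values.

    This covers the case where the site regenerates the session ID.
    """
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            cookies[name] = morsel.value


request = build_request("http://www.site.com/search.php", cookies, USER_AGENT)

# Sending it for real would look like this:
#   response = urllib.request.urlopen(request)
#   absorb_set_cookie(cookies, response.headers.get_all("Set-Cookie") or [])
```

The cookie dict really is just the associative array mentioned above: rebuild the request from it before every call so a regenerated session ID gets picked up.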

(don't forget to urlencode somethingsomething), which is what you need to send to the search script - it's very likely though that your library (pycurl/urllib2/whatever) already deals with this, and all you need to provide is the raw POST array (PHP's cURL library has this so I assume others do as well).
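In Python, the encoding step could look something like this. The field names "query" and "type" and their values are invented for the example; the real ones come from the form you inspected.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form fields; use the names the real form expects.
fields = {"query": "Smith & Sons, Ltd.", "type": "entity"}

# urlencode escapes characters such as '&' and ',' so they cannot be
# mistaken for separators; POST bodies must be bytes for urllib.
body = urlencode(fields).encode("ascii")

# Decoding the body again gives back the original values.
decoded = parse_qs(body.decode("ascii"))
```

The resulting bytes can be passed as the data argument of urllib.request.Request, which also switches the request to POST.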

Okay, so now we can fetch the resources we're interested in, and send data - but what do we do with them?

You probably want to use some DOM parser, because that's the easiest (and the correct) way to do stuff like this. A problematic situation would be if their pages have malformed HTML, but luckily, Python has Beautiful Soup, which copes remarkably well with bad HTML.

At this point, you should manually inspect the HTML structure of all the relevant pages (ie. the link that brings you to the registration page, the fields in the registration page, etc.) and from then on it's just a matter of POSTing the correct data once again to the correct link (all of which you can find out with a quick inspection from your browser). Of course, I assume it's easy enough to retrieve your data from your big file, so that's mostly that...

Helpful hint: when traversing the DOM, you are pretty much only interested in how to get to the element you want in the easiest possible way, while identifying it uniquely. To be more specific, say you have something like this:

Here, one way to reach the link (assuming you want to, say, find out the href the link is pointing to) is "html > body > div#container > a#go-to-drop-down-menus", which would be traversing the tree from the root - but since you know element IDs are unique within a document (assuming, again, that the HTML isn't awful), you can just find it via its id - "a#go-to-drop-down-menus" - in Beautiful Soup, you'd do something like soup.find(id="go-to-drop-down-menus").get('href') and that's it. Of course, I'm mostly just rambling here about stuff that comes to mind - this is really where you need to figure out the document structure on your own and improvise from there. Also, if you're familiar with CSS selectors (or jQuery selectors which mostly just augment those in CSS), you could maybe use pyquery, for a more familiar syntax.
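For a version that runs without installing anything, the same lookup by id can be sketched with the standard library's html.parser as a stand-in for Beautiful Soup. The sample document follows the structure described above, but the href value and the names AttrsById and href_for_id are made up for illustration.

```python
from html.parser import HTMLParser


class AttrsById(HTMLParser):
    """Remember the attributes of the first element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.attrs is None and attrs.get("id") == self.element_id:
            self.attrs = attrs


def href_for_id(html, element_id):
    """Return the href of the element with this id, or None if absent."""
    finder = AttrsById(element_id)
    finder.feed(html)
    return None if finder.attrs is None else finder.attrs.get("href")


# Hypothetical page matching html > body > div#container > a#go-to-drop-down-menus.
SAMPLE_PAGE = """
<html><body>
<div id="container">
  <a id="go-to-drop-down-menus" href="/menus.php?entity=42">Next</a>
</div>
</body></html>
"""

link = href_for_id(SAMPLE_PAGE, "go-to-drop-down-menus")
```

Note that the lookup never walks the html > body > div path; the id alone is enough, which is the point of the hint.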

Mostly, all that this mini-project comes down to is sending appropriate data to appropriate URLs (implementing cookies as well, since you need a login), and doing some light DOM traversal along the way to control some of the flow (ie. whether or not an entity should be registered), and find URLs that your hyperlinks lead to (assuming they even change at all - you should investigate to find this out).

I'm not very familiar with Python (I've played with it a bit for Project Euler purposes, but not much beyond that), but I'd probably rate this at about a few hours of work (including researching all the libraries I need, as well as how to use them). Ideally, if everything works on the first try, I'd do it in, say, an hour (maybe half an hour in PHP, but PHP isn't really well-suited to tasks like this), but since you're dealing with a third-party system you have no control over, unforeseen problems are likely to come up, and that's where most of the time would be lost. Well, ideally, you could ask the website-owners to just grant you direct access to their database, and do everything in a few minutes, but it seems that this is not an option.

Of course, if you have any other questions, feel free to ask - I've tried to be language-agnostic, but since you've already mentioned Python, I've done some quick google searches to find some suitable libraries for this, so I hope I've helped you at least a bit.

Another relevant Python library is mechanize, which is based on the Perl module WWW::Mechanize. I'm sure cURL is available for Perl as well, or any number of languages. I haven't done much work of this sort, but you might find one of them is a better fit for you than the other, or they might be both useful for distinct subtasks, etc.

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. -- Antoine de Saint-Exupery

I'd recommend, instead of using Python or Perl, using Node.JS with JSDOM. Since it's just JavaScript and a DOM, you can use all the knowledge, documentation, and muscle memory of writing web apps to write your scraper, or even execute code from the page itself, if it relies on JS.

Any good tutorial for VB.Net, slightly (but not too much) beyond the basics?

I'm pretty swish at C with embedded systems, understand the basics of the concepts involved in OOP, and have thrown programs together in VB6 some time ago. So the "introductions" or "hello world" style tutorials to VB don't normally go far enough. Other stuff I've come across has missed a giant gap of stuff from that level, though... I have tried working my way through MSDN, but it seems organised for reference rather than for linear learning. Cheers

I have been trying to learn this now for a few weeks, but it has the steepest learning curve in the history of learning curves. Most resources out there seem to assume you already know everything there is to know about C# or .net, both of which I have 0 experience with. And with xaml everything seems to depend on everything else, so I don't know where to begin.

It's one of those irregular verbs, isn't it? I have an independent mind, you are an eccentric, he is round the twist. - Bernard Woolley in Yes, Prime Minister

So, I need to build a windows app in c++ that has a nice interface and such, instead of just taking arguments as I am used to doing for my coursework. I have been told that .NET is the way to go for this. (Qt appears to be out for legal reasons, and I have to stick with my c++ code because of another library I use.) Advice on where to learn the basics of .NET and how to integrate it with a c++ program?

Edit: upon further reading, it appears that although using c++ with winforms or wpf is "supported", it is not the suggested route to go down, the c++ winforms editor has been deprecated since VS2010, and there are few tutorials or useful repositories of knowledge for me to reference. So it looks like the best way forward is to design the interface in c#, and make a wrapper around the underlying CPP functionality. Any advice on how to get started with that?

The c++ side is written as a win32 console application. I wrote a basic wrapper around it last week, but have been away since. I am working on learning c# and putting together a simple UI in c# with WPF, but I am at a loss on how to translate my c++ code into the c++/CLI format that seems to be required to put the pieces together.

Edit (7-16-14): Once I found the right resource, I was able to compile the c++ code into a DLL, after only a little hair pulling. I haven't successfully called it from C#, but that is only because I haven't learned enough c# to hack a test together yet. For future googlers, this tutorial got me on the right path. I had to do a little additional searching and a fair bit of debugging to get my code to compile properly, but it only ended up being a few lines of source code modification for the DLL itself, another dozen or two to test it, and of course setting it all up as a new project in VS 2013 and fiddling with the properties. Thanks for all of the help! It might not seem like much, but your comments fixed a couple of bits of incomplete/incorrect knowledge, and really set me on the right path.

Another option would be to write pure C++ and compile to a DLL; then you can make native calls from C# to that DLL. I don't know how ugly this is and you might have to write a bit of glue code, but it may or may not be nicer than C++/CLI stuff.

By "know Java", I assume you don't consider generics complicated (i.e., you have a reasonable level of expertise). Could you implement your own class-instance OO framework in Java or C? (A "class" is plain old data that describes the layout of data in instances of that class, plus methods to operate on said instances, including creation and destruction. A reasonable framework also handles inheritance, virtual and non-virtual functions, and optional instance-local method overrides.)

Can anyone recommend any good in-depth resources on Spring and/or Hibernate? I could throw together a pet shop app easily enough, but I'm responsible for a few legacy apps, so I'm actually much more interested in how things don't work than how they do, if that makes any sense.

The thing about recursion problems is that they tend to contain other recursion problems.