I started another data scraping script, similar to the one in the post about rental rates in Houston, except this time I picked Sydney, Australia. The site I selected relies heavily on JavaScript, so the rvest package alone won’t be enough in this case. I’m going to have to use RSelenium.

The script for scraping the site is below. I say this every time I post a scraping script, and I’ll say it again: the URL to the actual site has been replaced for obvious reasons. Although I place plenty of comments in my code, this time I’m going to try to break the code down into little chunks and explain each part:

I’ve used the dplyr, rvest, RSelenium, and stringr packages before, so I’m not going to explain what they’re for. The only new one is tcltk, which I loaded solely to get a nice progress bar for the loops I use.

Unlike many other rental sites, this site uses some kind of JavaScript function that loads more and more posts as you scroll down the page, so I had to simulate this scrolling somehow. That’s what the loop at the end of the script is for. The line mybrowser$executeScript("scroll(0, 500000);") scrolls all the way down to the bottom of the page, while the line Sys.sleep(3) waits three seconds for the posts to be generated before scrolling down again. This is done 100 times before the HTML of the page is extracted and stored in source.
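Here is a minimal sketch of that scroll-and-wait loop, wrapped in a function. The function name scroll_and_collect and its arguments are my own for illustration, not part of the original script. It assumes browser is an object that offers executeScript() and getPageSource(), the way an RSelenium remote driver does, and it uses txtProgressBar from utils as a stand-in for the tcltk progress bar so that it runs without a display.

```r
# Sketch only: `scroll_and_collect` is a hypothetical helper name.
# `browser` is assumed to expose executeScript() and getPageSource(),
# as an RSelenium remote driver (mybrowser in the post) does.
scroll_and_collect <- function(browser, n_scrolls = 100, pause = 3) {
  # Text progress bar used here in place of the tcltk one
  pb <- txtProgressBar(min = 0, max = n_scrolls, style = 3)
  on.exit(close(pb))

  for (i in seq_len(n_scrolls)) {
    # Jump far down the page so the site loads the next batch of posts
    browser$executeScript("scroll(0, 500000);")
    # Give the new posts time to be generated before scrolling again
    Sys.sleep(pause)
    setTxtProgressBar(pb, i)
  }

  # getPageSource() returns a list; the HTML is its first element
  browser$getPageSource()[[1]]
}
```

With a live RSelenium session this would be called as source <- scroll_and_collect(mybrowser). Swapping in tkProgressBar() and setTkProgressBar() from tcltk should give the windowed progress bar instead.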

The next step is to retrieve as much information as possible from the posts that have been generated, without actually having to navigate to each post.
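As a rough illustration of that step, the sketch below parses a small piece of made-up HTML with rvest and pulls out two fields per post. The markup, the CSS classes (.listing, .price, .room), and the column names are all assumptions for the example; the real site’s structure is not shown in this post.

```r
library(rvest)
library(stringr)

# Made-up HTML standing in for the page source collected by the scroll loop
source_html <- '
<div class="listing"><span class="price">$250 pw</span><span class="room">Private room</span></div>
<div class="listing"><span class="price">$180 pw</span><span class="room">Shared room</span></div>'

page  <- read_html(source_html)
posts <- html_nodes(page, ".listing")

# html_node() on a node set returns one match per post, so the fields stay aligned
price_text <- html_text(html_node(posts, ".price"))
room_type  <- html_text(html_node(posts, ".room"))

listings <- data.frame(
  room_type   = room_type,
  # Keep only the digits of the price text and convert them to a number
  weekly_rent = as.numeric(str_extract(price_text, "[0-9]+")),
  stringsAsFactors = FALSE
)
```

Selecting one child node per post, rather than selecting all prices and all room types separately, keeps the rows lined up even if a post is missing one of the fields.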

All the information I need to examine the rental rates is now in one data frame. Now it’s a matter of plotting it, starting with a breakdown of the number of posts. The next three plots will show:

1. The total number of posts per room type.
2. The total number of posts per room type and bathroom type.
3. The total number of posts per room type, bathroom type, and gender.
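The three breakdowns above can be sketched with dplyr’s count() and a ggplot2 bar chart. The posts data frame here is a tiny made-up sample, and its column names (room_type, bathroom, gender) are assumptions rather than the names used in the original script.

```r
library(dplyr)
library(ggplot2)

# Made-up sample standing in for the scraped data frame
posts <- data.frame(
  room_type = c("Private room", "Private room", "Shared room", "Private room"),
  bathroom  = c("Shared bathroom", "Own bathroom", "Shared bathroom", "Shared bathroom"),
  gender    = c("Anyone welcome", "Female", "Anyone welcome", "Anyone welcome"),
  stringsAsFactors = FALSE
)

# 1. Posts per room type
by_room <- count(posts, room_type)

# 2. Posts per room type and bathroom type
by_room_bath <- count(posts, room_type, bathroom)

# 3. Posts per room type, bathroom type, and gender preference
by_all <- count(posts, room_type, bathroom, gender)

# One way to draw the third breakdown: stacked bars, one panel per bathroom type
count_plot <- ggplot(by_all, aes(x = room_type, y = n, fill = gender)) +
  geom_col() +
  facet_wrap(~ bathroom) +
  labs(x = "Room type", y = "Number of posts", fill = "Gender preference")
```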

As a side note, the “gender” field represents the gender the poster prefers to live with. The choices are male, female, couple, male/female but not couple, or anyone.

The obvious takeaway is that most people have no real preference when it comes to gender, since “Anyone Welcome” is the category selected in most posts. If you’re looking for a room to share with someone else, you might be out of luck, because a big portion of the posts are for private rooms rather than shared rooms. Most posts, however, list a shared bathroom.

The next two plots show the frequency distribution of the weekly rental rates. The first plot covers all the posts, while the second covers only the shared room posts.
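A sketch of how the two histograms could be produced with ggplot2 follows. The data, the column names, the helper name rent_histogram, and the A$50 bin width are all assumptions for the example.

```r
library(ggplot2)

# Made-up sample standing in for the scraped data frame
posts <- data.frame(
  room_type   = c("Private room", "Private room", "Shared room",
                  "Private room", "Shared room"),
  weekly_rent = c(250, 320, 180, 280, 160),
  stringsAsFactors = FALSE
)

# Hypothetical helper: histogram of weekly rent in A$50 bins
rent_histogram <- function(data, title) {
  ggplot(data, aes(x = weekly_rent)) +
    geom_histogram(binwidth = 50, boundary = 0) +
    labs(title = title, x = "Weekly rent (A$)", y = "Number of posts")
}

# First plot: every post in scope
all_hist <- rent_histogram(posts, "All posts")

# Second plot: shared rooms only
shared_only <- subset(posts, room_type == "Shared room")
shared_hist <- rent_histogram(shared_only, "Shared rooms only")
```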

The weekly rent for most posts falls in the A$200 to A$300 range. If I isolate only the non-private rooms, the range drops to A$150 to A$200. I think a boxplot should give us a clearer picture of the median weekly rent for each kind of post. The following plots show the median rent and its range for each grouping.
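For reference, here is a small sketch of how the medians behind such boxplots can be computed with dplyr and drawn with ggplot2. The numbers are invented purely to make the example run, and the column names are assumptions.

```r
library(dplyr)
library(ggplot2)

# Invented figures, only here to make the example runnable
posts <- data.frame(
  room_type   = c("Private room", "Private room", "Private room",
                  "Shared room", "Shared room", "Shared room"),
  bathroom    = c("Own bathroom", "Shared bathroom", "Shared bathroom",
                  "Own bathroom", "Shared bathroom", "Shared bathroom"),
  weekly_rent = c(350, 250, 270, 190, 170, 180),
  stringsAsFactors = FALSE
)

# Median weekly rent and number of posts for each room/bathroom combination
rent_summary <- posts %>%
  group_by(room_type, bathroom) %>%
  summarise(median_rent = median(weekly_rent), n_posts = n(), .groups = "drop")

# Boxplot of weekly rent by bathroom type, one panel per room type
rent_box <- ggplot(posts, aes(x = bathroom, y = weekly_rent)) +
  geom_boxplot() +
  facet_wrap(~ room_type) +
  labs(x = "Bathroom", y = "Weekly rent (A$)")
```

Adding further grouping columns, such as furnishing or gender preference, to group_by() would give the finer breakdowns.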

You might be able to save a decent amount of money if you go for a private room with a shared bathroom rather than a room with an ensuite/private bathroom. If a private room is a little pricey, you can go for a shared room with an ensuite bathroom, since a shared room with a shared bathroom costs roughly the same amount. Judging from the third plot, the median weekly rent for furnished private rooms with their own bathroom floats around the A$300 to A$400 range. The best deal out there seems to be private room posts where the gender is tagged as “Anyone welcome”, the furnishing is “flexible with furnishing”, and the bathroom is “Own bathroom”.

There is one other point though, and that’s the question of location. I honestly don’t think the ideal room will cost the same near the city center as it does on the outskirts. So I’m guessing there must be some kind of middle ground where you can get what you want for a reasonable rate.

First off, a rough density plot showing where most of the posts are concentrated.

#Density plot, total
SYD_map
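As a rough sketch of the idea, the block below draws a 2D density layer over longitude and latitude with plain ggplot2. The coordinates are randomly generated around an arbitrary point in Sydney and the column names lon and lat are assumptions. I have left out the base map because fetching map tiles with ggmap needs an API key; with a base map in hand, a density layer like this one would typically be added on top of it instead of on a blank canvas.

```r
library(ggplot2)

# Randomly generated coordinates standing in for the scraped post locations
set.seed(1)
posts <- data.frame(
  lon = rnorm(200, mean = 151.20, sd = 0.02),
  lat = rnorm(200, mean = -33.89, sd = 0.02)
)

# Filled contours of the 2D kernel density estimate of post locations
density_plot <- ggplot(posts, aes(x = lon, y = lat)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.3) +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude", fill = "Density level")
```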

It looks like most of the posts are around the University of Sydney, the University of Technology Sydney, and Surry Hills. However, the issue with a density plot made with ggmap is that I can’t break this concentration down any further. But you can do that with the leaflet package.
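A minimal sketch of a leaflet cluster map looks like this. The three points and the column names are made up for the example.

```r
library(leaflet)

# Made-up post locations and rents
posts <- data.frame(
  lon         = c(151.187, 151.200, 151.211),
  lat         = c(-33.888, -33.883, -33.885),
  weekly_rent = c(250, 300, 320)
)

# Markers are grouped into clusters that split apart as you zoom or click
cluster_map <- leaflet(posts) %>%
  addTiles() %>%
  addMarkers(
    lng = ~lon, lat = ~lat,
    popup = ~paste0("A$", weekly_rent, " per week"),
    clusterOptions = markerClusterOptions()
  )
```

Printing cluster_map in an interactive session renders the map in the viewer or browser.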

If you’re not too familiar with a leaflet cluster map, try clicking on any of the clusters. You’ll notice that the clusters are then broken down further. It’s pretty obvious now that most of the posts are concentrated in the polygon-shaped area made up of Darlinghurst, Paddington, Randwick, Kensington, Mascot, Newtown, and Camperdown. To see if there is any real difference between two posts with identical details but different locations, it’s best to just plot all the posts and map the weekly rental amount to a spectrum of colors.
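Mapping rent to color can be sketched with leaflet’s colorNumeric() palette helper. The data, the column names, and the choice of the "YlOrRd" palette are assumptions for the example.

```r
library(leaflet)

# Made-up post locations and rents
posts <- data.frame(
  lon         = c(151.150, 151.180, 151.200, 151.215),
  lat         = c(-33.900, -33.892, -33.885, -33.878),
  weekly_rent = c(180, 240, 300, 380)
)

# Palette function that maps a rent value to a color on a yellow-to-red scale
rent_pal <- colorNumeric(palette = "YlOrRd", domain = posts$weekly_rent)

rent_map <- leaflet(posts) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon, lat = ~lat,
    radius = 5, stroke = FALSE, fillOpacity = 0.8,
    fillColor = ~rent_pal(weekly_rent)
  ) %>%
  addLegend(
    position = "bottomright", pal = rent_pal,
    values = ~weekly_rent, title = "Weekly rent (A$)"
  )
```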

It would appear that the suspicion that certain locations are more expensive than others could very well be true. If we start from the left and move to the right towards the city center, the weekly rental amount gradually rises. This information, along with the average plots I made earlier, is normally enough for me to make an informed decision on what sort of posts I should be focusing on. But I’ve recently become more and more interested in plotting polylines, because it would be interesting to see if driving distance can be added to the list of variables to consider when moving to a place.
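Drawing a polyline on a leaflet map can be sketched as below. The waypoints are hand-written placeholders rather than a real driving route; an actual route would have to come from a routing service, which is outside the scope of this sketch.

```r
library(leaflet)

# Placeholder waypoints, not a real driving route
route <- data.frame(
  lon = c(151.187, 151.195, 151.203, 151.209),
  lat = c(-33.888, -33.884, -33.878, -33.869)
)

# Connect the waypoints in order with a single line
route_map <- leaflet() %>%
  addTiles() %>%
  addPolylines(lng = route$lon, lat = route$lat, weight = 4, color = "#d7301f")
```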

I still think that the map seems a bit too noisy, although it looks a little better if you zoom in closer. But regardless of how you look at this map or the data on the rental averages in this city, you’ll have to come to one conclusion, especially if you’re someone who’s living in Malaysia like myself. Rent in Sydney is just too damn high! Someone needs to contact Jimmy McMillan to open an Australian chapter of his famously hilarious, but very relatable political party.

As always, if you’d like a copy of the data set, do get in touch and I’ll try to send you a copy of the CSV file.