Browsing the broken web: a software developer behind the Great Firewall of China

16 March 2012

I absolutely love coming to China. It’s a country that manages to hold onto a long, rich history whilst also moving into the future at an extraordinarily rapid pace. We also all know that China heavily censors which websites can be accessed from within the country.

I work quite frequently with a number of people in China and I’m always conscious that there is certain material I’d like to share with them which they won’t be able to access. I’m not talking about anything politically or culturally subversive; I’m talking about simple resources which software developers work with on a daily basis as part of their job. Or at least those outside China do.

So what’s the experience like for them? What can those of us outside of China learn about what our peers behind the Great Firewall can do, and more importantly, cannot do? Here’s a little snapshot of what it’s like trying to access typical resources which I’d use on a daily basis from home but which were not accessible to me whilst in Shanghai this week.

Technology blogs

Let me start somewhere close to home (for me anyway): my own blog. Searching for my name takes me out through Google in Hong Kong (remember there was that little issue in China a while back), and sure enough, I still exist:

Problem is, I’m not very, uh, “visual”:

You see, all these images are pulled off the blogspot.com domain, which is a no-no in China, so things like my profile photo can’t load. I had added these images to a Blogger page via the web UI, which appears to have caused them to end up on this domain. Sound odd? Wikipedia seems to back this behaviour (but it’s still odd!).

Actually, my preconception was that any domain bound to Google’s Blogger service was out of bounds so I was surprised to see as much as I did. A quick reload soon confirmed those original suspicions:

I’m not entirely sure why this is the case; certainly the earlier versions weren’t cached (I cleared the cache beforehand) and the connection was fine at times when the page wouldn’t load. I also observed the same erratic behaviour across PC and iPhone.

Once the content actually loads again, things get a bit weird because when I look at a post with images in it, they work just fine:

The image above was loaded from the ggpht.com domain, which isn’t blocked. These images were posted via Live Writer when the blog was published and clearly this domain is allowed. But scroll down a bit in the same post and things soon break again:

Now we’re running into problems because I’ve embedded YouTube content and YouTube is definitely a no-no in China (Youku and the newly acquired Tudou are the big players here). And don’t bother with Vimeo – that’s blocked too (I guess it’s an equivalent service in terms of content). In fact embedded content in general is a real problem, as we’ll see shortly.

But of course it’s not only me. I just went to take a look at Dave Tchepak’s blog and, well, not very much happened after that:

Now I couldn’t quite recall what platform Dave had his blog on (turns out it’s Heroku), but that’s really the point; you normally have no idea, it’s just a matter of clicking through from the email / link / tweet and seeing what happens. And it turns out that Heroku is blocked.
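If you do want a rough idea of what a link is hosted on before sending it, the DNS aliases for the hostname sometimes give it away. The following is a minimal Python sketch rather than a definitive check. The suffix list is an illustrative assumption (the hostnames in it are examples, not an authoritative list), `guess_platform` and `dns_names` are hypothetical helper names, and many sites sit behind plain address records that reveal nothing about the platform.

```python
import socket

# Assumption: an illustrative suffix-to-platform map, not an authoritative list.
PLATFORM_SUFFIXES = {
    "ghs.google.com": "Blogger (custom domain)",
    "blogspot.com": "Blogger",
    "heroku.com": "Heroku",
    "herokuapp.com": "Heroku",
}


def guess_platform(names):
    """Return the first platform whose suffix matches any of the DNS names."""
    for name in names:
        host = name.rstrip(".").lower()
        for suffix, platform in PLATFORM_SUFFIXES.items():
            if host == suffix or host.endswith("." + suffix):
                return platform
    return None


def dns_names(hostname):
    """Canonical name plus aliases as reported by the system resolver."""
    canonical, aliases, _addresses = socket.gethostbyname_ex(hostname)
    return [canonical, *aliases]


if __name__ == "__main__":
    import sys

    # Usage: python guess_platform.py blog.example.com another.example.org
    for host in sys.argv[1:]:
        try:
            names = [host, *dns_names(host)]
            print(host, "->", guess_platform(names) or "unknown")
        except OSError as error:
            print(host, "-> lookup failed:", error)
```

A result of "unknown" only means no alias matched the list; it says nothing about whether the site will load from China.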

Other websites

But it’s not just blogs; I ended up trying to load Nova Software as a result of a Google search unrelated to this post and got absolutely nowhere whilst on a China internet connection. Now this isn’t on Blogger or Heroku – in fact it’s an ASP.NET website running on IIS. Why can’t I access it? It’s not exactly controversial material:

I’d say that about 20% of dedicated technology sites I tried to access simply weren’t accessible. Sometimes. And that’s the thing about this situation – you just never quite know what’s going to work and what isn’t, regardless of the platform it runs on. Searching for content and following links becomes like playing Russian Roulette – but with worse odds.

Embedded content from blocked domains

Just about every site is a mashup these days in one way, shape or form. For example, under normal circumstances, here’s what you see down the bottom when you run an ASafaWeb scan:

But run it by going out through a China internet gateway and you’ll see something quite different:

Clearly the social media connectors for Twitter, Google Plus and Facebook have broken. Actually, I had thought that Google Plus was unexpectedly allowed in China just recently but perhaps this was a very short-lived situation.

But the problem goes beyond embedded content – what happens if your chosen means of signing into Stack Overflow is by using your Facebook account:

Now of course if you’re Chinese and you use Stack Overflow you’re going to use a different auth provider but if you’re travelling here and elected to (quite legitimately) use Facebook to login, well, now you’ve got a problem.

Now I know that Scott puts a lot of effort into making his blog accessible and he’s done an excellent job of it – it even loads quite happily in China. But what happens if you’re a Chinese reader and would like to read a bit of Hanselman in your favourite RSS reader? Not very much, unfortunately:

Like many of us, Scott uses Feedburner and that’s off limits over here too. Well actually, feedburner.com loads just fine, it’s just the feeds themselves on the feeds.feedburner.com domain which won’t load. Of course these would give you the ability to circumvent access controls on blocked sites so rather than risk unauthorised access, all Feedburner feeds are canned.

I also wonder about the performance impact of loading sites with dependencies which can’t load; what does it do to the browser when you’re loading an ASafaWeb scan and it’s just spinning its wheels looking for those social media connectors which never arrive? Or what about all those blogs which embed the author’s latest tweets and require the browser to pull them direct from Twitter? Open connections making requests that ultimately fail are never a good thing performance-wise.
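One way to see how exposed a page is to this problem is to list the third-party hosts it embeds and compare them against domains that are known not to load. Below is a rough Python sketch using only the standard library. The blocklist is an assumption drawn solely from the domains mentioned in this post, and `ExternalHostCollector` and `blocked_dependencies` are hypothetical names; a real audit would need a maintained list and would also have to account for resources injected by scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: illustrative blocklist built from domains named in this post.
BLOCKED_SUFFIXES = (
    "blogspot.com",
    "youtube.com",
    "vimeo.com",
    "twitter.com",
    "facebook.com",
    "feeds.feedburner.com",
    "yfrog.com",
)


class ExternalHostCollector(HTMLParser):
    """Collect hosts referenced by src/href attributes of embedding tags."""

    EMBED_TAGS = {"script", "img", "iframe", "link", "embed"}

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in self.EMBED_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                # Relative URLs have no hostname and are skipped.
                host = urlparse(value).hostname
                if host:
                    self.hosts.add(host.lower())


def blocked_dependencies(html, blocked=BLOCKED_SUFFIXES):
    """Return the sorted embedded hosts that match a blocked suffix."""
    collector = ExternalHostCollector()
    collector.feed(html)
    return sorted(
        host
        for host in collector.hosts
        if any(host == s or host.endswith("." + s) for s in blocked)
    )
```

Ordinary links (`a` tags) are deliberately ignored here, because a link to a blocked site doesn't stall the page; only embedded resources do.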

Search results

By now it’s pretty clear you never really know what you’re going to get. But this paradigm raises its head in some odd places. Remember how Google wasn’t going to censor in China anymore? Well here’s the preview of my Twitter account after a vanity search whilst not going through the China gateway:

Ignoring the “Sorry that page doesn’t exist” message for a moment (I have no idea why this is coming up – but it is coming from Twitter), now let’s load that through the Great Firewall:

All other preview images come up just fine so obviously there’s some filtering going on with the Twitter preview. This, of course, makes the following image all the more odd:

There’s that consistency issue again – you just never quite know what you’re going to get.

Local Chinese searches

So what do you do when you’re browsing from within China? Well first of all, you don’t “Google” (remember that whole China brouhaha thing), you “Baidu”. Baidu is China’s leading search engine and is probably the biggest tech company you’ve never heard of – we’re talking the world’s 5th most visited website serving a couple of billion searches each quarter and annual revenue around the ten billion mark.

Now remember – this is a Chinese search engine and China’s number one most visited site. Yet here we see links to my blog (somewhat off limits) and my Facebook account (completely off limits). And because you don’t know whether you can access a site until you click on it, and even then a blocked site simply gives no response, it makes you wonder how many folks over here simply think a site has gone down when trying to access a search result.

Exceptions

Clearly Twitter is out – we know that. But that doesn’t mean you can’t get push notifications on iOS devices (excuse the babycino covered face of my two year old in the background):

Without being too familiar with the push mechanism behind iOS, it’s clear that it’s not dependent on the same channel as a standard HTTP request to the social media service. But of course this isn’t really circumventing the Great Firewall – that’s where proxy and VPN services come into play (more on that a little later on).

How many of you use Instapaper? It’s an excellent way of taking material offline in a highly readable fashion (i.e. stripping out all the distracting material around the actual content), and best of all in the context of this post, it works in China. In fact it works so well that it totally circumvents the controls the Great Firewall puts in place, which means guys like Dave Tchepak from earlier on can still get their message out there:

While we’re in Instapaper, how about a bit of tweeting:

It seems that a tweet via the Instapaper app goes back through a channel which isn’t blocked so I can tweet just fine from my favourite offline reader. And while we’re talking about Twitter, obviously emails generated by the service will arrive just fine, but any images embedded in them also load without a problem:

It turns out that images such as the avatar above come from si0.twimg.com so no problems on that front. But then you can’t access yfrog.com – another strange paradigm where two seemingly equivalent services have different access control policies.

Remember that issue with Google Plus? The one where you couldn’t load it in the browser, then you could for a while and now you can’t? Seems the story is a bit different on the iPhone:

My assumption – which seems reasonable – is that the API is on a different domain which is allowed from China. Same site, identical content but loaded over a different channel into a mobile device with different Great Firewall access control rules to its browser-based peer. Go figure.

For years now, there’s been a cat-and-mouse game played with Tor, with the service regularly being blocked and the blocks then circumvented. It seems that the blocking may involve some degree of deep packet inspection – no easy task on the volume of data transiting through China’s gateway (actually, it may explain some of the erratic behaviour observed above). Certainly this remains one way that citizens of many nations with internet access controls reach content their governments would prefer they didn’t.

Then of course there are VPN services, where secure tunnels are created between the PC and providers outside the country, who then gateway out to the internet as we know it. There are numerous commercial offerings available to do this, as well as services provided by corporates.

The point is that the web can’t simply be blocked off piece by piece, at least not in any consistent reliable fashion. Regardless of the motives of the end-user, the technical facilities exist to start poking holes in the Great Firewall.

Why should we care about China?

In discussions I’ve had with people in the past – including while researching this post – quite often I’ve heard that China isn’t something they’ve thought about or even care about. Let’s start with the big headline – the top 10 countries by internet users:

| Rank | Country | Internet Users | Population | Percentage of Population |
|------|----------------|----------------|---------------|--------------------------|
| 1 | China | 456,238,464 | 1,330,141,294 | 34.3% |
| 2 | United States | 243,542,822 | 308,282,053 | 79.0% |
| 3 | Japan | 102,063,316 | 127,579,145 | 80.0% |
| 4 | India | 87,983,101 | 1,173,108,013 | 7.5% |
| 5 | Brazil | 81,748,504 | 201,103,331 | 40.7% |
| 6 | Germany | 66,825,986 | 81,644,454 | 81.9% |
| 7 | Russia | 59,937,788 | 139,390,205 | 43.0% |
| 8 | United Kingdom | 52,996,180 | 62,348,447 | 85.0% |
| 9 | France | 51,879,480 | 64,768,390 | 80.1% |
| 10 | Nigeria | 45,944,229 | 161,604,745 | 28.4% |

Take this in for a moment – there are 87% more internet users in China than there are in the US. We often think about China in similar terms to India (an emerging market, populations not too far off each other, rapid growth rates), but look at the numbers above – there are more than 5 times as many internet users in China as in India.

These numbers alone are staggering, but consider also that China only has an internet uptake of 34% – there are 1.3 billion people in the nation now and only a comparatively small number of them are online. Compare that again to the US which has 79% of people online and you can see where the future growth lies.
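For anyone who wants to check the comparisons quoted above, they can be reproduced from the table figures with a few lines of Python. The variable names are purely illustrative.

```python
# Figures copied from the table of internet users above.
china_users, china_population = 456_238_464, 1_330_141_294
us_users = 243_542_822
india_users = 87_983_101

# How many more users China has than the US, as a percentage.
more_than_us = (china_users / us_users - 1) * 100

# China's users as a multiple of India's.
times_india = china_users / india_users

# Share of China's population that is online.
china_uptake = china_users / china_population * 100

print(f"China vs US: {more_than_us:.0f}% more users")
print(f"China vs India: {times_india:.1f} times as many users")
print(f"China uptake: {china_uptake:.1f}%")
```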

The fact that my blog isn’t fully accessible from China remains a bit of a bugbear. Over the last year, Chinese visitors ranked 28th out of all nations, way behind places like Russia (13th) and Finland (18th). The eastern nation with all the internet users only accounted for 0.5% of my traffic – that’s a massive audience I’m missing.

But I’m just running a technology blog, not a business. If you’re running any sort of site where traffic is important to business success, you’d want to take a pretty serious look at accessibility from China. Chances are you might be inadvertently excluding a huge potential audience simply due to the choice of hosting model or site structure.

Summary

Speaking to folks in China, it seems that the filtering is somewhat erratic. Some resources that couldn’t be accessed yesterday, can be today. And vice versa. This is consistent with the observation I made earlier on; sometimes things load just fine, sometimes they don’t. Who knows what you’re going to get each time you fire up the browser; will the deep packet inspection kick in and withhold the request? Who knows.

For me, that was really the most frustrating thing as I never knew whether the connection had just dropped (hotel WiFi has a habit of doing that), or if the site was being censored. Personally, I would have been happy to have seen a government page saying the site was off limits – at least it’s explicit feedback. And it’s exactly the same problem when I send folks in China a link as I have no idea whether they can access it or not. You can’t tell from domain alone whether it might be on Blogger or Heroku and therefore off limits (you’re not going to run a WHOIS against every link you send).
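One rough way to separate "the connection dropped" from "this particular site won't load" is to request a control URL you expect to be reachable alongside the target and compare the outcomes. Here is a minimal Python sketch of that idea. The function names and the failure labels are my own assumptions, and a target-specific failure is not proof of filtering: the server may simply be down.

```python
import socket
import urllib.error
import urllib.request


def describe_failure(error):
    """Map an exception from a failed request to a short label."""
    reason = getattr(error, "reason", error)  # URLError wraps the real cause
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(reason, ConnectionResetError):
        return "reset"
    if isinstance(reason, socket.gaierror):
        return "dns"
    return "other"


def probe(url, timeout=10):
    """Return 'ok' if the server answered, otherwise a failure label."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError:
        return "ok"  # The server answered, even if with an error status.
    except Exception as error:
        return describe_failure(error)


def diagnose(control, target):
    """Compare the control and target probe results."""
    if target == "ok":
        return "target reachable"
    if control != "ok":
        return "connection itself looks down"
    return f"target-specific failure ({target})"


if __name__ == "__main__":
    import sys

    # Usage: python probe.py <control-url> <target-url>
    if len(sys.argv) == 3:
        control_url, target_url = sys.argv[1:]
        print(diagnose(probe(control_url), probe(target_url)))
```

Given how erratic the behaviour is, a single run tells you little; repeating the comparison at intervals would give a better picture.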

Speaking of connections dropping, I frequently – actually, make that very frequently – found that after attempting to access content which should be blocked my connection disappeared for about a minute. Whilst I couldn’t definitively identify a pattern, this happened very frequently across different browsers and on different devices. Interesting…

At the end of the day, folks here just seem to get on with business and not let it worry them too much. The developers I spoke to have a high dependence on MSDN (and increasingly Stack Overflow) so perhaps rather than starting with Google (or Baidu) as many of us would do – but then not know what you’re going to be able to access – they go direct to proven, trusted sources.

There’s also a lesson in all this for those not in China: if you’re building a site and want to tap into the audience in the world’s most populous nation with one of the fastest economic growth rates, think very, very carefully about your design and choice of host. If you don’t, chances are it won’t function correctly over here or even be accessible at all. It would be a shame to lose that audience of half a billion internet users.