Answer: The US, apparently. I was having a discussion with someone recently, and since I have a rather extensive log of comment failures for debugging & analysis purposes (dating back to February 2015!) they suggested that I render a map of where the spam is coming from.

It was such a good idea that I ended up doing just that - and somehow also writing this blog post :P

Unfortunately, I didn't think about parsing it programmatically when I designed the log file format.... Oops! It's too late to change it now, I suppose :P

Anyway, as an output, we want a list of countries in 1 column, and a count of the number of IP addresses in another. First things first - we need to extract those IP addresses. awk is ideal for this. I cooked this up just quickly:

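BEGIN {
    # Split each log line into fields on the solid bar character
    FS="|"
}
{
    # The 2nd field looks like " ip: p.q.r.s" - strip the label
    gsub(" ip: ", "", $2);
    print $2;
}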

This basically tells awk to split lines on the solid bar character (|), extract the IP address bit (ip: p.q.r.s), and then strip out the ip: bit.
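For anyone who doesn't speak awk, here's roughly the same extraction sketched in Python. The sample log line is made up (my real log format isn't shown in this post), and the extra strip() call to tidy trailing whitespace is an addition that the awk version doesn't have.

```python
def extract_ip(log_line: str) -> str:
    """Return the IP address from one pipe-delimited log line."""
    fields = log_line.split("|")
    # awk's $2 is the second field, which is index 1 here
    return fields[1].replace(" ip: ", "").strip()


# Hypothetical log line, purely for illustration
sample = "[2015-02-14 12:00:00] | ip: 203.0.113.7 | reason: bad key"
print(extract_ip(sample))
```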

With this done, we're ready to lookup all these IP addresses to find out which country they're from. Unfortunately, IP addresses can change hands semi-regularly - even across country borders, so my approach here isn't going to be entirely accurate. I don't anticipate the error generated here to be all that big though, so I think it's ok to just do a simple lookup.

If I was worried about it, I could probably investigate cross-referencing the IP addresses with a GeoIP database from the date & time I recorded them. The effort here would be quite considerable - and this is a 'just curious' sort of thing, so I'm not going to do that here. If you have done this, I'd love to hear about it though - post a comment below.

Doing a GeoIP lookup itself is actually fairly easy. While for the odd IP address here and there I usually use ipinfo.io, when there are lots of lookups to be done (10,479 to be exact! Wow.), it's probably best to utilise a local database. A quick bit of research reveals that Ubuntu Server has a package I can install that should do the job called geoip-bin:

It doesn't look like geoiplookup supports multiple IP addresses at once, which is a shame. In that case, the above will take a while to execute for 10K IP addresses.... :P

Next up, we need to remove the annoying label there. That's easy with sed:

(...) | sed -E 's/^[A-Za-z: ]+, //g'

I had some trouble here getting sed to accept a regular expression. At some point I'll have to read the manual pages more closely and write myself a quick reference guide. Come to think about it, I could use such a thing for awk too - their existing reference guide appears to have been written by a bunch of mathematicians who like using single-letter variable names everywhere.

Anyway, now that we've got our list of countries, we need to strip out any errors, and then count them all up. The first point is somewhat awkward, since geoiplookup doesn't send errors to the standard error for some reason, but we can cheese it with grep -v:

(...) | grep -iv 'resolve hostname'

The -v here tells grep to remove any lines that match the specified string, instead of showing us only the matching lines (and the -i makes the match case-insensitive). This appeared to work at first glance - I simply copied a part of the error message I saw and worked with that. If I have issues later, I can always look at writing a more sophisticated regular expression with the -P option.

The counting bit can be achieved in bash with a combination of the sort and uniq commands. sort will, umm, sort the input lines, and uniq will de-duplicate multiple consecutive input lines, whilst optionally counting them. With this in mind, I wound up with the following:

(...) | sort | uniq -c | sort -n

The first sort call sorts the input to ensure that all identical lines are next to each other, ready for uniq.

uniq -c does the de-duplication, but also inserts a count of the number of duplicates for us.

Lastly, the final sort call with the -n argument sorts the completed list numerically, which means (in our case) that it compares the counts by value rather than character by character, so 9 sorts before 10 as you'd expect it to. It's a close cousin of natural sort order - I'd recommend you read the Wikipedia article on the subject, as it explains it quite well. This should give us an output like this:
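If the shell pipeline feels a bit opaque, the same counting step can be sketched in Python. This is an illustration rather than what I actually ran, and the country names in the sample input are made up.

```python
from collections import Counter


def count_countries(countries):
    """Count identical lines and return (count, country) pairs, smallest count first."""
    counts = Counter(countries)
    # Sorting the tuples orders by count first, mirroring the final `sort -n`
    return sorted((n, name) for name, n in counts.items())


# Made-up sample input, one country per looked-up IP address
lines = ["United States", "Russia", "United States", "Brazil", "United States", "Russia"]
for n, name in count_countries(lines):
    # Right-align the count in a 7-character column, similar to `uniq -c`
    print(f"{n:7d} {name}")
```

Counter does the job of both the first sort and uniq -c, since it doesn't need identical lines to be adjacent.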

With our list in hand, I imported it into LibreOffice Calc to parse it into a table with the fixed-width setting (Google Sheets doesn't appear to support this), and then pulled that into a Google Sheet in order to draw a heat map:

At first, the resulting graph showed just a few countries in red, and the rest in white. To rectify this, I pushed the counts through the natural log (log()) function, which yielded a much better map, where the countries that have been spamming just a bit are still shown in a shade of red.
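To see why the log helps, here's a tiny Python sketch with made-up counts (these are not my real figures). The raw counts span a factor of 1000, but after taking the natural log the largest value is only about 5 times the smallest, so the small countries no longer vanish into white.

```python
import math


def log_scale(counts):
    """Map each (positive) count to its natural logarithm."""
    return {country: math.log(n) for country, n in counts.items()}


# Hypothetical counts, chosen only to show the effect
counts = {"A": 5000, "B": 50, "C": 5}
scaled = log_scale(counts)

print(counts["A"] / counts["C"])   # spread of the raw counts
print(scaled["A"] / scaled["C"])   # spread after log scaling
```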

From this graph, we can quite easily conclude that the 'spammiest' countries are:

The US

Russia

Ukraine (I get lots of spam emails from here too)

China (I get lots of SSH intrusion attempts from here)

Brazil (Wat?)

Personally, I was rather surprised to see the US in the top spot. I figured that with tough laws on that sort of thing, spammers wouldn't risk attempting to buy a server and send spam from there.

On further thought though, it occurred to me that it may be because there are simply lots of infected machines in the US that are being abused (without the knowledge of their unwitting users) to send lots of spam.

At any rate, I don't appear to have a spam problem on my blog at the moment - it's just fascinating to investigate where the spam I do block comes from.

Found this interesting? Got an observation of your own? Plotted a graph from your own data? Comment below!

I've blogged about the comment key system I utilise on this blog to prevent spam before (see also). Today, I've given it another upgrade to make it harder for spammers to fake a comment key!

In the last post, I transformed the comment key with a number of reversible operations - including a simple XOR password system. This is, of course, very insecure - especially since an attacker knows (or can at least guess) the content of the encrypted key, making it trivial (I suspect) to guess the password used for 'encryption'.

The solution here, obviously, is to utilise a better encryption system. Since it's the 5th November and I'm not particularly keen on my website going poof like the fireworks tonight, let's do something about it! PHP 7.2+ comes with native libsodium support (those still using older versions of PHP can still follow along! Simply install the PECL module). libsodium bills itself as

A modern, portable, easy to use crypto library.

After my experiences investigating it, I'd certainly say that's true (although the official PHP documentation could do with, erm, existing). I used this documentation instead - it's quite ironic because I have actually base64-encoded the password.......

Anyway, after doing some digging I found the quick reference, which explains how you can go about accomplishing common tasks with libsodium. For my comment key system, I want to encrypt my timestamp with a password - so I wanted the sodium_crypto_secretbox() and its associated sodium_crypto_secretbox_open() functions.

This pair of functions, when used together, implement a secure symmetric key encryption system. In other words, they securely encrypt a lump of data with a password. They do, however, have 2 requirements that must be taken care of. Firstly, the password must be of a specific length. This is easy enough to accomplish, as PHP is kind enough to tell us how long it needs to be:

With that taken care of, we can look at the other requirement: a nonce. Standing for Number used ONCE, it's a sequence of random bytes that's used by the encryption algorithm. We don't need to keep it a secret, but we do need it to decrypt the data again at the other end, in addition to the password - and we do need to ensure that we generate a new one for every comment key. Thankfully, this is also easy to do in a similar manner to generating a password:

$nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);

With everything in place, we can look at upgrading the comment key generator itself. It looks much simpler now:

/**
 * Generates a new comment key.
 * The timestamp is encrypted with libsodium's secretbox (symmetric
 * encryption), and a fresh nonce is bundled alongside it.
 * It's mainly to better support verification in complex setups and
 * serve as another annoying hurdle for spammers.
 * @param string $pass The password to encrypt with. Should be a base64-encoded string from key_generate_password().
 * @return string A new comment key stamped at the current time.
 */
function key_generate($pass) {
    $pass = base64_decode_safe($pass);
    $nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $encrypt = sodium_crypto_secretbox(
        strval(time()), // The thing we want to encrypt
        $nonce, // The nonce
        $pass // The password to encrypt with
    );
    return base64_encode_safe($nonce.$encrypt);
}

I bundle the nonce with the encrypted data here to ensure that the system continues to be stateless (i.e. we don't need to store any state information about a user on the server). I also encode the encrypted string with base64, as the encrypted strings contain lots of nasty characters (it's actually a binary byte array I suspect). This produces keys like this:

They are a bit long, but not unmanageable. In theory, I could make it a bit shorter by avoiding converting the integer output from time() to a string, but in my testing it didn't make much difference. I suspect that there's some sort of minimum length to the output string for some (probably good) reason.

Now that we're encrypting comment keys properly, it isn't much good unless we can decrypt them as well! Let's do that too. The first step is to decode the base64 and re-split the nonce from the encrypted data:
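The bundle-and-split idea can be sketched on its own in Python (not the PHP this blog runs on). The encryption itself is left out: the ciphertext is just a stand-in byte string. The 24-byte nonce length matches libsodium's secretbox, while the function names and the choice of URL-safe base64 are assumptions for illustration.

```python
import base64
import os

NONCE_BYTES = 24  # libsodium's secretbox nonce length


def bundle(nonce: bytes, ciphertext: bytes) -> str:
    """Prefix the ciphertext with its nonce and base64-encode the lot."""
    return base64.urlsafe_b64encode(nonce + ciphertext).decode("ascii")


def split_key(key: str):
    """Undo bundle(): return (nonce, ciphertext), or None if the key is too short."""
    raw = base64.urlsafe_b64decode(key)
    if len(raw) <= NONCE_BYTES:
        return None
    # The nonce has a fixed length, so a simple slice separates the 2 parts
    return raw[:NONCE_BYTES], raw[NONCE_BYTES:]


nonce = os.urandom(NONCE_BYTES)
ciphertext = b"opaque encrypted bytes"  # stand-in, NOT real secretbox output
key = bundle(nonce, ciphertext)
print(split_key(key) == (nonce, ciphertext))
```

Because the nonce is always the same length, no separator character is needed between it and the encrypted data.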

Explicitly returning null in key_decode() requires a small change to key_verify(), in order to prevent it from thinking that a key is really old if the decryption fails (null is treated as 0 in arithmetic operations, apparently). Let's update key_verify() to handle that:
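As a language-neutral illustration, here's the shape of that check in Python. It takes the already-decoded timestamp (or None if decryption failed); the parameter names and the 10 second / 24 hour bounds are assumptions for the sake of the sketch.

```python
import time

MIN_AGE = 10            # seconds (assumed lower bound)
MAX_AGE = 24 * 60 * 60  # seconds (assumed upper bound)


def key_verify(decoded_timestamp, now=None, min_age=MIN_AGE, max_age=MAX_AGE):
    """Return True if the key decoded successfully and its age is within bounds."""
    # A failed decryption must be rejected outright, rather than being
    # treated as a timestamp of 0 (which would look like a very old key)
    if decoded_timestamp is None:
        return False
    if now is None:
        now = time.time()
    age = now - decoded_timestamp
    return min_age <= age <= max_age


print(key_verify(None, now=1000))  # decryption failed
print(key_verify(980, now=1000))   # 20 seconds old
```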

A simple check is all that's needed. With that, the system is all updated! Time to generate a new password with the provided function and put it to use. You can do that directly from the PHP REPL (php -a):

Recently, I took the time to add the official nginx PPA to my server to keep nginx up-to-date. In doing so, I jumped from a security-patch-backported nginx 1.10 to version 1.14..... which adds a bunch of very cool new features. As soon as I learnt that HTTP/2 Server Push was among the new features to be supported, I knew that I had to try it out.

In short, Server Push is a new technology - part of HTTP/2.0 (it's here at last :D) - that allows you to send resources to the client before they even know they need them. This is done by enabling it in the web server, and then having the web application append a specially-formatted link header to outgoing responses - which tells the web server which resources it should bundle along with the response.

First, let's enable it in nginx. This is really quite simple:

http {
    # ....
    http2_push_preload on;
    # ....
}

This enables link header parsing server-wide. If you want to enable it for just a single virtual host, the http2_push_preload directive can be placed inside server blocks too.

With support enabled in nginx, we can add support to our web application (in my case, this website!). If you do a HEAD request against a page on my website, you'll get a response looking like this:

Particularly of note here is the link header. It looks long and complicated, but that's just because I'm pushing multiple resources down. Let's pull it apart. In essence, the link header takes a comma (,) separated list of paths to resources that the web-server should push to the client, along with the type of each. For example, if https://bobsrockets.com/ wanted to push down the CSS stylesheet /theme/boosters.css, they would include a link header like this:

link: </theme/boosters.css>; rel=preload; as=style

It's also important to note here that pushing a resource doesn't mean that we don't have to utilise it somewhere in the page. By this I mean that pushing a stylesheet down as above still means that we need to add the appropriate <link /> element to put it to use:

<link rel="stylesheet" href="/theme/boosters.css" />

Scripts can be sent down too. Doing so is very similar:

link: </js/liftoff.js>; rel=preload; as=script
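Pushing several resources at once just means joining entries like these with commas. Here's a small Python sketch of building the header value (build_link_header is a made-up helper name, not part of nginx or any library):

```python
def build_link_header(resources):
    """Build a link header value from a list of (path, as_type) pairs."""
    return ", ".join(
        f"<{path}>; rel=preload; as={as_type}" for path, as_type in resources
    )


header = build_link_header([
    ("/theme/boosters.css", "style"),
    ("/js/liftoff.js", "script"),
])
print("link: " + header)
```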

There are other as values as well. You can send all kinds of things:

script - Javascript files

style - CSS Stylesheets

image - Images

font - Fonts

document - <iframe /> content

audio - Sound files to be played via the HTML5 <audio /> element

worker - Web workers

video - Videos to be played via the HTML5 <video /> element.

The full list can be found here. If you don't have support in your web server yet (or can't modify HTTP headers) for whatever reason, never fear! There's still something you can do. HTML also supports a similar <link rel="preload" href="...." /> element that you can add to your document's <head>.

While this won't cause your server to bundle extra resources with a response, it'll still tell the client to go off and fetch the specified resources in the background with a high priority. Obviously, this won't help much with external stylesheets and scripts (since simply being present in the document is enough to get the client to request them), but it could still be useful if you're lazily loading images, for example.

In future projects, I'll certainly be looking out for opportunities to take advantage of HTTP/2.0 Server Push (probably starting with investigating options for Pepperminty Wiki). I've found the difference to be pretty extraordinary myself.

Of course, this is hardly the only feature that HTTP/2 brings. If there's the demand, I may blog about other features and how they work too.

Found this interesting? Confused about something? Using this yourself in a cool project? Comment below!

(Above: An example output with debugging turned on from my placeholder generation service)

For quite a considerable amount of time now, I've been running my own placeholder image generation service here at starbeamrainbowlabs.com - complete with etag generation and custom colour selection. Although it's somewhat of an Easter Egg, it's not actually that hard to find if you know what you're looking for (hint: I use it on my home page, but you may need to view source to find it).

I decided to post about it now because I've just finished fixing the angle GET parameter - and it was interesting (and hard) enough to warrant a post to remind myself how I did it for future reference. Comment below if you knew about it before this blog post came out!

The script itself is split into 3 loose parts:

The initial settings / argument parsing

The polyfills and utility functions

The image generation.

My aim here was to keep the script contained within a single file - so that it fits well in a gist (hence why it currently looks a bit messy!). It's probably best if I show the code in full:

Aside from implementing the random value for fg_colour and bg_colour, the angle has been a huge pain. I use GD - a graphics drawing library that's bundled with practically every PHP installation ever - to draw the placeholder image, and when you ask it to draw some rotated text, it decides that it's going to pivot it around the bottom-left corner instead of the centre.

Naturally, this creates some complications. Though some people on the PHP manual page for said method (imagettftext) have attempted to correct for this (exhibits a and b), neither of their solutions worked for me. I don't know if it's because their code is just really old (13 and 5 years respectively as of the time of typing) or what.

Anyway, I finally decided recently that enough was enough - and set about fixing it myself. Basing my code on the latter of the 2 pre-existing solutions, I tried fixing it - but ended up deleting most of it and starting again. It did give me some idea as to how to solve the problem though - all I needed to do was find the centre of where the text would be drawn when it is both not rotated and rotated and correct for it (these are represented by the green and blue crosses respectively on the image at the top of this post).

PHP does provide a method for calculating the bounding box of some prospective text that you're thinking of drawing via imagettfbbox. Finding the centre of the box though sounded like a horrible maths-y problem that would take me ages to work through on a whiteboard first, but thankfully (after some searching around) Wikipedia had a really neat solution for finding the central point of any set of points.

It calls it the centroid, and claims that the geometric centre of a set of points is simply the average of all the points involved. It just so happens that the geometric centre is precisely what I'm after:

$$
C=\frac{a+b+c+d+\cdots}{N}
$$

...Where $C$ is the geometric centre of the shape as an $\begin{pmatrix}x\\y\end{pmatrix}$ vector, $a$, $b$, $c$, $d$, etc. are the input points as vectors, and $N$ is the number of points in question. This was nice and easy to program:

The even and odd indexes of $bbox there refer to the $x$ and $y$ co-ordinates (respectively) of the 4 corners of the bounding boxes - imagettfbbox outputs the co-ordinates in 1 single array for some reason. I also calculate the original width and height of the text - in order to perform an additional corrective translation later.
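Here's the centroid calculation sketched in Python rather than PHP, working on the same flat 8-element layout that imagettfbbox returns (x values at even indexes, y values at odd indexes). The function names and the sample box are made up for illustration.

```python
def centroid(bbox):
    """Average the corner points of a flat [x0, y0, x1, y1, ...] bounding box."""
    xs = bbox[0::2]  # even indexes: x co-ordinates
    ys = bbox[1::2]  # odd indexes: y co-ordinates
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def text_size(bbox):
    """Return the (width, height) of the box the corners span."""
    xs = bbox[0::2]
    ys = bbox[1::2]
    return (max(xs) - min(xs), max(ys) - min(ys))


# Hypothetical unrotated box: lower-left, lower-right, upper-right, upper-left
bbox = [0, 0, 100, 0, 100, -20, 0, -20]
print(centroid(bbox))
print(text_size(bbox))
```

The same two functions work unchanged on the bounding box of the rotated text, since neither assumes that the box is axis-aligned.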

With these in hand, I could calculate the actual position I need to draw it at (indicated by the yellow cross in the image at the top of this post):

In short, I calculate the distance between the 2 centre points calculated above, and then find the pivot point by taking this distance plus half the size of the text from the original central position we wanted to draw the text at. Here's that in PHP:

The first command downloads all the required frames (3 at a time), and the second stitches them together. The -q:v 3 bit of the ffmpeg command is of note - by default webm videos apparently have a really low quality - this corrects that. Lower is better, apparently - it goes up to about 40 I seem to remember reading somewhere. 3 to 5 is supposed to be the right range to get it to look ok without using too much disk space.

That's about everything I wanted to talk about to remind myself about what I did. Let me know if there's anything you're confused about in the comments below, and I'll explain it in more detail.

When this post is released I'll be in an exam, but I wanted to post again about the perfectly fascinating spam situation here on my blog. I've blogged about fending off spam on here before (exhibits a, b, c), but I find the problem of detecting it in a transparent manner that you as the reader don't notice very interesting, so I think I'll write another post on the subject. I could use a service like Google's ReCAPTCHA, but that would be boring :P

Recently I've had a trio of spam comments make it all the way through my (rather extensive) checks and onto my blog here. I removed them, of course, but it still baffled me as to why they made it through.

It didn't take long to find out. When I was first implementing comments on here, I added a logger specifically for purposes such as this that saves everything about current environmental state to a log file for later inspection - for both comments that make it through, and those that don't. It's not publicly available, of course (but if you'd like to take a look, just ask and I'll consider it). Upon isolating the entries for the spam comments, I discovered a few interesting things.

The comment keys were aged 21, 21, and 17 seconds respectively (the lower limit I have set is 10 seconds)

All 3 comments claimed that they were Firefox 57

2 out of 3 comments used HTTP/1.0 (even though they claimed to be Firefox 57, and despite my server offering HTTP/1.1 and HTTP/2.0)

All 3 comments utilised HTTPS

The IP Addresses that the comments came from were in Ukraine, Russia, and Canada (hey?) respectively

All 3 appear to be phishing scams, with a link leading to a likely malicious website

The 2 using HTTP/1.0 also asked my server to close the connection after sending a response

The last comment had some really weird capitalisation. After consulting someone experienced on the subject, I learnt that the writer likely natively spoke an eastern language, such as Chinese

This was most interesting. From this, I can conclude:

The last comment was likely submitted by a Chinese operator - even though the source IP address is located in Ukraine

All three are spoofing their user agent string.

Firefox 57 uses HTTP/2.0 by default if you're really in a browser, and the spam comments utilised HTTP/1.0 and HTTP/1.1

Curiously, all of this took place over HTTPS. I'd be really curious to log which cipher was used for the connection here.

In light of this, if I knew more about HTTP client libraries, I could probably identify what software was really used to submit the spam comments (and possibly even what operating system it was running on). If you know, please comment below!

To combat this development, I thought of a few options. Firstly, raising the minimum comment age, whilst effective, may disrupt the user experience, which I don't want to do. Plus, the bot owners could just increase the delay even more. To that end, I decided not to do this.

Secondly, with the amount of data I've collected, I could probably write an AI that takes the environment in and spits out a 'spaminess' score, much like SpamAssassin and rspamd do for email. Perhaps a multi-weighted system would work, with a series of tests that add or take away from the final score? I might investigate upgrading my spam detection system to do this in the future, as it would not only block spam more effectively, but provide a more distilled overview of the characteristics of each comment submission than I have currently.
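To make the weighted-tests idea a bit more concrete, here's a very rough Python sketch. Everything in it is hypothetical: the field names, the individual tests and the weights are invented for illustration, and aren't tuned against my real data.

```python
def spam_score(env, tests):
    """Sum the weights of every test that matches the given environment."""
    return sum(weight for check, weight in tests if check(env))


# (test, weight) pairs - positive weights look spammy, negative ones look human
TESTS = [
    (lambda e: e["http_version"] == "1.0", 3.0),
    (lambda e: e["key_age"] < 30, 1.5),
    # Claims to be Firefox 57, but isn't speaking HTTP/2.0
    (lambda e: "firefox/57" in e["user_agent"].lower()
               and e["http_version"] != "2.0", 2.0),
    (lambda e: e["key_age"] > 120, -1.0),
]

ua = "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"
bot = {"http_version": "1.0", "key_age": 21, "user_agent": ua}
human = {"http_version": "2.0", "key_age": 300, "user_agent": ua}

print(spam_score(bot, TESTS))
print(spam_score(human, TESTS))
```

A comment would then be rejected once its score passes some threshold, and the list of matching tests doubles as a summary of why.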

Lastly, I could block HTTP/1.0 requests. While not perfect (1 out of 3 requests used HTTP/1.1), it would still catch some more bots out without disrupting user experience - as normal browsers (including text-based ones IIRC) use HTTP/1.1 or above. HTTP/1.1 has been around since 1997 (21 years!), so if you're not using it by now - upgrade! For now, this is the best option I can see.

From today, if you try to submit a comment and get an HTTP 505 HTTP Version Not Supported error and see a message saying something like this:

You sent your request via HTTP/1.0, but this is not supported for submitting comments due to high volume of spam. Please retry with HTTP/1.1 or higher.

...then you'll have to upgrade and / or reconfigure your web browser. Please let me know (my email address is on the homepage) if this causes any issues for anyone, and I'll help you out.
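The check itself is tiny. Here's a Python sketch of the logic (the real thing is PHP, where the protocol string would come from something like $_SERVER['SERVER_PROTOCOL']); the function name and the (status, message) return shape are assumptions made for this example.

```python
HTTP_1_0_MESSAGE = (
    "You sent your request via HTTP/1.0, but this is not supported for "
    "submitting comments due to high volume of spam. "
    "Please retry with HTTP/1.1 or higher."
)


def check_http_version(server_protocol):
    """Return an (HTTP status, message) pair for the given protocol string."""
    if server_protocol.strip().upper() == "HTTP/1.0":
        # 505 is "HTTP Version Not Supported"
        return 505, HTTP_1_0_MESSAGE
    return 200, ""


print(check_http_version("HTTP/1.0")[0])
print(check_http_version("HTTP/1.1")[0])
```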

Found this interesting? Know more about this? Got a better solution? Comment below!

A while ago I implemented ETag support to the dynamic preview generator in Pepperminty Wiki. While I thought it worked during testing, for some reason on a private instance of Pepperminty Wiki I discovered recently that my browser didn't seem to be taking advantage of it. This got me curious, so I decided to do a little bit of digging to find out why.

It didn't take long to find the problem. For some reason, all the responses from the server had a Cache-Control: no-cache, no-store, must-revalidate header attached to them. How strange! Even more so that it was in capital letters - my convention in Pepperminty Wiki is to always make the headers lowercase.

I checked the codebase via the Project Find feature of Atom just to make sure that I hadn't left in a debugging statement or anything (I hadn't), and then I turned my attention to Nginx (engine-x) - the web server that I use on my server. Maybe it had added a caching header?

A quick grep later revealed that it wasn't responsible either - which leaves just one part of the system unchecked - php-fpm, the PHP FastCGI server that sits just behind Nginx that's responsible for running the various PHP scripts I have that power this website and other corners of my server. Another quick grep returned a whole bunch of garbage, so after doing some research I discovered that php-fpm, by default, is configured to send this header - and that it has to be disabled by editing your php.ini (for me it's in /etc/php/7.1/fpm/php.ini), and changing

;session.cache_limiter = nocache

to be explicitly set to an empty string, like so:

session.cache_limiter = ''

This appears to have solved my problem for now - allowing me to regain control over how the resources I send back via PHP are cached. Hopefully this saves someone else the hassle of pulling their entire web server stack apart and putting it back together again in the future :P

Found this helpful? Still having issues? Got a better way of solving the problem? Post a comment below!

I recently found myself reimplementing the comment key system I use on this blog (I posted about it here) for another purpose. Being more experienced now, my new implementation (which I should really put into use on this blog actually) is stand-alone in a separate file - so I'm blogging about it here both to help out anyone who reads this other than myself - and for myself as I know I'll forget otherwise :P

The basic algorithm hasn't changed much since I first invented it: take the current timestamp, apply a bunch of arbitrary transformations to it, put it as a hidden field in the comment form, and then reverse the transformations on the comment key the user submits as part of the form to discover how long they had the page loaded for. Bots will have it loaded for either less than 10-ish seconds, or more than 24 hours. Humans will be somewhere in the middle - at least according to my observations!

Of course, any determined spammer can easily bypass this system if they spend even a little bit of time analysing the system - but I'm banking on the fact that my blog is too low-value for a spammer to bother reverse-engineering my system to figure out how it works.

This time, I chose to use simple XOR encryption, followed by reversing the string, followed by base64 encoding. It should be noted that XOR encryption is certainly not secure - but in this case it doesn't really matter. If my website becomes a high-enough value target for that to matter, I'll investigate proper AES encryption - which will probably be a separate post in and of itself, as a quick look revealed that it's rather involved - and will probably require quite a bit of research and experimentation to get working correctly.

As I explained above, this first XORs the timestamp against a provided 'passcode' of sorts, and then it reverses it, base64 encodes it, and then returns it. I discovered that I needed to repeat the passcode to make sure it's at least as long as the timestamp - because otherwise it cuts the timestamp short! Longer passwords are always desirable for certain, but I wanted to make sure I addressed it here - just in case I need to lift this algorithm from here for a future project.

Next up is the decoding algorithm, that reverses the transformations we apply above:

Very similar. Again, the XOR passphrase has to be repeated to make it long enough to apply to the whole encoded key without inadvertently chopping some off the end. Additionally, we also convert the timestamp back into an integer - since it is the number of seconds since the last UNIX epoch (1st January 1970 as of the time of typing).
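The whole round trip (XOR, reverse, base64, and back again) can be sketched in Python like so. This is an illustration of the algorithm as described in this post, not a port of my PHP code; the function names are borrowed for readability and the passcode must not be empty.

```python
import base64
import time


def xor_bytes(data: bytes, passcode: bytes) -> bytes:
    """XOR data against the passcode, repeating the passcode as needed."""
    # Repeat the passcode so that it is at least as long as the data
    repeated = passcode * (len(data) // len(passcode) + 1)
    return bytes(d ^ p for d, p in zip(data, repeated))


def key_generate(passcode: str, now=None) -> str:
    """XOR the timestamp, reverse it, then base64-encode it."""
    if now is None:
        now = int(time.time())
    xored = xor_bytes(str(now).encode("ascii"), passcode.encode("utf-8"))
    return base64.b64encode(xored[::-1]).decode("ascii")


def key_decode(key: str, passcode: str) -> int:
    """Reverse key_generate(): base64-decode, un-reverse, then XOR again."""
    xored = base64.b64decode(key)[::-1]
    return int(xor_bytes(xored, passcode.encode("utf-8")).decode("ascii"))


key = key_generate("secret", now=1500000000)
print(key_decode(key, "secret"))
```

XOR is its own inverse, which is why the same xor_bytes helper serves for both directions. Note that this sketch doesn't handle a wrong passcode or a mangled key gracefully - decoding such a key will typically raise an exception.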

With the ability to create and decode keys, let's write a helper method to make the verification process a bit easier:

It's fairly self-explanatory, really. It takes an encoded key, decodes it, and verifies that its age lies between the specified bounds. Here's the code in full (it updates every time I update the code in the GitHub Gist):

(Above: The full comment key code. Can't see it? Check it out on GitHub Gist here.)

You have arrived in the third of three parts in my mini-series on how I implemented rich snippets. In the last two parts I tackled open graph and becoming an oEmbed provider. In this part, I'll be talking a bit about twitter cards, and how I implemented them.

Twitter's take on the problem seems to be much simpler than Facebook's, which makes for easy implementing :D Like in the other two protocols too, they decided to have multiple different types of, well, in this case, cards. I decided to implement the summary card type. Like open graph, it adds a bunch of <meta /> tags to the header. Sigh. Anyway, here are the property names I needed to implement:

twitter:card - The type of card. In my case this is set to summary

twitter:site - This one's confusing. Although it's called 'site', it should actually be set to your own twitter handle - mine is @SBRLabs.

twitter:title - The title of the content. Practically identical to open graph's og:title.

twitter:description - The description of the content. The same as og:description.

twitter:image - A url pointing to an image that should be displayed next to the title and description. Unlike Facebook's open graph, twitter appears to support https urls here with no problem at all.

Since after implementing open graph I already had 90% of the infrastructure and calculations in place already, throwing together values for the above wasn't too difficult. Here's an example set of twitter card <meta /> tags generated by the updated code:

<meta property="twitter:card" content="summary" />
<meta property="twitter:site" content="@SBRLabs" />
<meta property="twitter:title" content="Running Prolog on Linux" />
<meta property="twitter:description" content="Hello! I hope you had a nice restful Easter. I've been a bit busy this last 6 months, but I've got a holiday at the moment, and I've just received a .... (click to read more)" />
<meta property="twitter:image" content="https://starbeamrainbowlabs.com/blog/images/20151015-learning-swi-prolog-banner.svg" />

Easy peasy. Next up was testing time. Thankfully, Twitter made this easy too by providing an official testing tool. Interestingly, they whitelist domains based on whether the webmaster has run a url through their tool - so if you want twitter cards to show up, make sure you plug at least one of your website's page urls through their tool.

After a few tweaks, I got this:

With that, my work was complete. This brings us to the end of my mini-series on rich-snippet integrations (unless I've missed a protocol O.o Comment below if I have)! I hope you've found it useful. If you have (or even if you haven't!) please let me know in the comments below :D

Welcome to part 2 of this impromptu miniseries! In this second part of three, I'll be showing you a little about how I set up and tested a simple oEmbed provider for my blog posts - I've seen lots of oEmbed client information out there, but not much in the way of provider (or server) implementations.

If you haven't read part one about the open graph protocol yet, then you might find it interesting.

oEmbed is a bit different to open graph in that instead of throwing a bunch of meta tags into your <head />, you instead use a special <link /> element that points interested parties in the direction of some nice tasty json. Personally, I find this approach to be more sensible and easier to handle - the kind of thing you'd expect from an open standard.

To start with, I took a read of their specification, as I did with open graph. It doesn't have as many examples as I'd have liked, and I had to keep jumping around, but it's certainly not the worst I've seen.

oEmbed is built on the idea of providers (that's me!) and consumers (the programs and website you use). Providers, erm, provide machine-readable information about urls passed to them, and consumers take this information provided to them and display it to the user in a manner they think is appropriate.

To start with, I created a new PHP file to act as my provider over at https://starbeamrainbowlabs.com/blog/oembed.php and took a look at the different oEmbed types available - oEmbed has a type system of sorts, similar to open graph. I decided on link - while a rich would look cool, it would be almost impossible to test with every client out there, and I can't guarantee how the html would be rendered or what space it would have either.

With that decided, I made a list of the properties that I'd need to include in the json response:

version - The version of oEmbed. Currently 1.0 as of the time of typing.

type - The oEmbed type. I chose link here.

title - The title of the page

author_name - The name of the author

author_url - A link to the author's homepage.

provider_name - The provider's name.

provider_url - A link to the provider's homepage. I chose my blog index, since this script will only serve my blog.

cache_age - How long consumers should cache the response for. I put 1 hour (3600 seconds) here, since I usually correct mistakes after posting that I've missed, and I want them to go out fairly quickly.

thumbnail_url - A link to a suitable thumbnail picture.

thumbnail_width - The width of the thumbnail image, in pixels.

thumbnail_height - The height of the thumbnail image, in pixels.

Then I looked at the data I'd be getting from the client. It all comes in the form of GET parameters:

format - Either json or xml. Personally, I only support json.

url - The url to send oEmbed information for.

With all the information close at hand, I spent a happy hour or so writing code, and ended up with a script that outputs something like this:

Though the specification includes requirements for satisfying 2 extra GET parameters, maxwidth and maxheight, I chose to ignore them since writing a dynamic thumbnail rescaling script is both rather complicated and requires a not insignificant amount of processing power every time it is used.
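The property list above maps onto a small JSON document. Here's a Python sketch of assembling one (my provider is written in PHP, so treat this as an illustration only). The provider name, provider url, author details and thumbnail values are placeholders rather than my real output.

```python
import json


def build_oembed_response(title, author_name, author_url,
                          thumbnail_url, thumbnail_width, thumbnail_height):
    """Assemble a 'link' type oEmbed response as a plain dict."""
    return {
        "version": "1.0",
        "type": "link",
        "title": title,
        "author_name": author_name,
        "author_url": author_url,
        "provider_name": "Example Blog",              # placeholder
        "provider_url": "https://example.com/blog/",  # placeholder
        "cache_age": 3600,                            # 1 hour, in seconds
        "thumbnail_url": thumbnail_url,
        "thumbnail_width": thumbnail_width,
        "thumbnail_height": thumbnail_height,
    }


response = build_oembed_response(
    title="Running Prolog on Linux",
    author_name="Example Author",
    author_url="https://example.com/",
    thumbnail_url="https://example.com/blog/images/banner.png",
    thumbnail_width=512,
    thumbnail_height=256,
)
print(json.dumps(response, indent=4))
```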

After finishing the oEmbed script, I turned my attention to one final detail: The special <link /> tag required for auto-discovery. A quick bit of PHP in the article page renderer adds something like this to the header:
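For reference, the oEmbed specification's discovery tag is a link element with rel="alternate" and a type of application/json+oembed. Here's a Python sketch of generating one; the helper name and the example page url and title are made up, and only the provider endpoint comes from this post.

```python
from html import escape
from urllib.parse import urlencode


def discovery_link(provider_endpoint, page_url, title):
    """Build the auto-discovery <link /> tag for a JSON oEmbed provider."""
    # The page url must be percent-encoded inside the query string
    query = urlencode({"format": "json", "url": page_url})
    href = f"{provider_endpoint}?{query}"
    return (
        '<link rel="alternate" type="application/json+oembed" '
        f'href="{escape(href)}" title="{escape(title)}" />'
    )


print(discovery_link(
    "https://starbeamrainbowlabs.com/blog/oembed.php",
    "https://example.com/blog/some-post",  # hypothetical article url
    "Some Post",                           # hypothetical title
))
```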

And with that, my oEmbed provider implementation is complete - but it still needs testing! Unfortunately, testing tools for oEmbed are few and far between, but I did manage to find a few:

oEmbed Tester - A basic testing tool. Appears to work well for the most part - except the preview. Not sure why it says "Preview not available." all the time.

Iframely URL Debugger - Actually a testing tool for some commercial tool or other, but it still appears to accurately test not only oEmbed, but open graph and twitter cards (more on them in the next post!) too!

After testing and fixing a few bugs, my oEmbed provider was complete! Next time, I'll be taking a look at twitter's take on the subject: Twitter cards.