December 31, 2007

Online advertising is the economic engine of many current startups. Nevertheless, according to Dave Morgan, only 1% of the Internet population clicks on those ads. I’m not talking about contextual search ads like the ones displayed by Google, but about the ads and banners displayed on websites and social networks. The problem is that, despite their very low efficiency in terms of clicks per ad versus website users, most sites keep cluttering their space with them. Some time ago I was watching a video clip on Reuters’ site and I had to sit through some seconds of ads before I could watch the real content. That got me asking myself: why is everything full of ads? Why is it increasingly rare to find a site with no ads? The answer I got was that the problem is US. Yes, you heard right. There is so much competition in many Internet market sectors that user loyalty is attached to which services are free AND better (in that order): “Excuse me Mr. CEO, do you want me to pay for your service when I can do the same on this site for free? Hell no!”

Tech-savvy Internet users are so used to surfing the web that they are reluctant to pay for most web applications. This is especially true if the service a website offers can be found elsewhere. Online email readers are one example. If you ask someone whether they would pay to use Yahoo Mail, I’m sure they’ll say they won’t. Why? Because there are other “free” alternatives such as Hotmail or Gmail. Now, if you try to start a website that operates in a market niche where others are giving away a similar service, you won’t have any option but to give yours away too. This behavior produces a funny paradox. Most startups don’t have a viable business model yet, but they still offer free content so they can compete with other companies in their market. The problem is that, as more companies enter the same niche, competition grows and all services become free of charge. Once the market reaches that point, most startups take one of two paths: an ad-driven business model or a premium-account model. The first one clutters the website with ads and needs an increasingly high volume of users to turn that 1% click rate into something valuable. The problem is that most of these sites just alienate their users and risk losing some of them, because most users simply don’t look at the ads. This is what Jakob Nielsen calls banner blindness. Premium accounts, on the other hand, imply a higher risk. If you don’t offer a service that’s unique, most users won’t bother to pay for it, which means the startup won’t get any revenue at all.

Finally, because we are so used to finding several options for the same service, we force startups to take the ad-driven business model as a way to cash in on their product. The problem is that very few of them make real money with it. Why is that? Paraphrasing Dave Morgan: “[…] while clickers may be valuable audiences, they are by no means representative of the Web at large. Focusing campaigns to optimize on clicks means skewing campaigns to optimize on middle-aged women from the Midwest.” So if your product doesn’t target middle-aged women from the Midwest (or enough of them), your chances of survival are slim.

The question then remains: shall we, as Internet consumers, pay for our favorite services even though we can find similar products elsewhere? I’m afraid that most of you won’t, and that’s natural; it’s one of the consequences of free markets. So remember, every time you get bugged by an ad or a banner, it’s ultimately your fault! We shouldn’t pay just to get rid of the ads, but because the service is worth it.

December 18, 2007

With the increasing popularity of blog platforms like WordPress or TypePad, security has become a major issue. This is the first of a two-part series of posts I’m writing about password security schemes. In this post I’m going to introduce some cryptography notions and some general approaches to password security.

First of all, I’m going to explain how most password schemes work. A user creates an account on the application (be it web or desktop) giving, among other details, a user name and a password. The application then takes this user name and password and stores them, usually in a database. At this point there are two approaches.

The first one, and the easiest, is to store the user name and password as is in the database. A password that hasn’t undergone any transformation, like in this case, is called a clear text password. Hence the method is called clear text password storage. It’s an easy method, as you just have to retrieve the password for a given user and compare it, character by character, with the one the user gives you as part of the login. If they match, the user is allowed into the system. The problem with this method is that if anyone gets access to the database where the passwords are stored, whether legitimately or not, it’s game over.

The second method tries to avoid that problem with the use of a hash function. And what is a hash function? Well, it’s a mathematical function used extensively in cryptography. It takes an input, say a string, and produces a fixed length string of gibberish, called a hash, that is in practice unique to that input. The key is that, first, the same string always produces the same output and, second, it’s a one way function, meaning that it’s very easy to compute the hash given an input, but computationally infeasible to retrieve the original input given the hash. As an example:

md5sum(“AlwaysNewMistakes”) = 76dc83e4e19a1fd01bac6fbdfec92a27

You can try this at home, and you’ll see that every single time you input that string, you’ll obtain the same hash. There are several well known hash functions like md5, sha1 or sha2, as well as password hashing schemes like bcrypt (which is built on the blowfish cipher). Each of them is different, and for the same input they’ll produce a different hash, so for example, an md5 hash won’t be “compatible” with an sha1 hash.
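To make the “not compatible” point concrete, here is a minimal sketch using Python’s standard hashlib module. The helper name digest is mine, chosen only for this illustration. It hashes the same string with three algorithms and keeps the results, which differ in both value and length.

```python
import hashlib


def digest(algorithm: str, text: str) -> str:
    """Return the hexadecimal hash of `text` under the named algorithm."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


# Same input, three different hash functions.
md5_hash = digest("md5", "AlwaysNewMistakes")        # 32 hex characters
sha1_hash = digest("sha1", "AlwaysNewMistakes")      # 40 hex characters
sha256_hash = digest("sha256", "AlwaysNewMistakes")  # 64 hex characters (sha2 family)
```

Calling digest twice with the same algorithm and the same string always gives the same result, which is the property the password schemes below rely on.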

Well, back to our subject. As I was saying, the second method uses a hash function to obfuscate the password. So, right before the password is stored in the database, it’s processed through a hash function (usually md5) and, instead of the password, it’s the hash of the password that gets stored. Let’s remember that theoretically, there can only be one hash for that password. So, when a user needs to log in, the system takes the password the user typed, generates a hash for it, retrieves the hash stored in the database and compares both:

IF stored_hash == md5(user_input_password) THEN ACCESS!
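As a rough sketch of that flow, assuming an in-memory dictionary in place of the real database (the names users, register and login are hypothetical, and md5 is used only to mirror the example):

```python
import hashlib
import hmac

# Stand-in for the database table: user name -> stored hash.
users = {}


def hash_password(password: str) -> str:
    """Hash the password; only this value is ever stored."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def register(username: str, password: str) -> None:
    """Store the hash of the password, never the password itself."""
    users[username] = hash_password(password)


def login(username: str, password: str) -> bool:
    """Hash the submitted password and compare it with the stored hash."""
    stored_hash = users.get(username)
    if stored_hash is None:
        return False
    # compare_digest compares in constant time, avoiding timing leaks.
    return hmac.compare_digest(stored_hash, hash_password(password))
```

Note that the clear text password never reaches the users table; only the comparison of two hashes decides whether access is granted.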

So, with this approach, if someone breaks into the database where the passwords are stored, they won’t have the passwords, just the hashes. The intruder will need to find the string that generates each hash and, as I said before, hash functions are one way, so it’s computationally infeasible to retrieve the original string from the hash. Now, this method is way cooler than the first one, much more secure and pretty inexpensive to implement. But, as you’ve already figured, it still has some problems.

The question is, if storing the hash is so secure, then what’s all that fuss about cracking passwords that you see in hacker movies? The truth is that, although you can’t reverse the hash to obtain the original string, you can test different strings to see if they generate the same hash. This process is what is known as password cracking (also known as password cryptanalysis). There are three approaches to eventually crack the password. The first one is called Bruteforce and, as its name indicates, it’s based on generating all possible combinations of letters, numbers and symbols and passing them through a hash function. Each hash obtained is then checked against the ones we are trying to crack.
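A minimal sketch of the Bruteforce idea, assuming md5 hashes and, to keep it fast, a lowercase-only alphabet and a short maximum length (the function name brute_force and its parameters are mine):

```python
import hashlib
import itertools
import string


def brute_force(target_hash, max_length, alphabet=string.ascii_lowercase):
    """Try every combination up to max_length; return the match or None."""
    for length in range(1, max_length + 1):
        for letters in itertools.product(alphabet, repeat=length):
            candidate = "".join(letters)
            candidate_hash = hashlib.md5(candidate.encode("utf-8")).hexdigest()
            if candidate_hash == target_hash:
                return candidate
    return None
```

With 26 letters there are already 26^8 (about 200 billion) candidates of length 8, and adding digits and symbols makes that number grow much faster.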

You get the idea. The problem with this method is that it is computationally very expensive. If you don’t know the original password length (and hash function) this process can take forever. That’s why attackers usually try a second method called Dictionary Attack. It is well known that many users tend to use non-random passwords, most of them easy to guess. Taking this as a premise, we can build a dictionary of common words like “god”, “secret”, “password”, etc. and then run a little program that reads them, calculates the hash of each one and compares it with the hash of the password we are trying to crack. The Dictionary Attack has the advantage of being faster than Bruteforce and having a higher rate of success. The drawback is that, first, you need to build a dictionary and, second, if that dictionary doesn’t contain the password or a derivation of it, you won’t crack it.
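The “little program” could look roughly like this. It is only a sketch: the three derivations tried for each word are my own arbitrary picks, and real cracking tools apply far richer rules.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def variations(word: str):
    """Yield the word plus a few simple derivations of it."""
    yield word
    yield word.capitalize()
    yield word + "123"


def dictionary_attack(target_hash, dictionary):
    """Hash every word (and its derivations); return the match or None."""
    for word in dictionary:
        for candidate in variations(word):
            if md5_hex(candidate) == target_hash:
                return candidate
    return None
```

If neither the password nor one of its derivations is in the dictionary, the function simply returns None, which is the drawback mentioned above.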

The third method is known as Rainbow tables and it’s an evolution of the Dictionary Attack. Computing a hash for thousands of words, like the ones in a dictionary, can be time consuming and require quite powerful hardware. The solution? Why not precompute all the hashes of a dictionary and store them in a table? That way, the next time you look for the hash of the word “secret” you will already have it, speeding up the process of cracking a password. Obviously, building a Rainbow table the first time takes time and resources (several GB of storage). Please note that a Rainbow table depends on the character set, the number of words and the hash algorithm that it employs. So it’s not like one Rainbow table to rule them all. Also, the lookup mechanism used in a Rainbow table is considerably more complex than what I just explained, but the underlying idea is basically the same.
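The precomputation idea can be sketched with a plain lookup table. To be clear, this is not a real Rainbow table, which stores chains of hashes to save space; it only illustrates the “pay once, look up many times” trade-off. The function names are hypothetical.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_lookup_table(words):
    """Precompute hash -> word once; this is the expensive step."""
    return {md5_hex(word): word for word in words}


def crack(table, target_hash):
    """A dictionary lookup replaces rehashing every word on each attempt."""
    return table.get(target_hash)


table = build_lookup_table(["god", "secret", "password"])
```

The table is only valid for the hash function and word list it was built with, which is why there is no single table that works everywhere.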

Now, back to the beginning and our password schemes. Using a hash instead of the plain text password is more secure, but it can still be defeated. So, that’s the part where we introduce a new concept, the salt. A salt is a value that is added to the password string before obtaining the hash value:

md5(Salt + MyPassword) = hash

Just to clarify, the salt is a value that we (our application) generate, and it can be a random or a predefined value. The use of a salt value gives an extra layer of protection. On one hand, it increases the complexity of a Bruteforce or Dictionary Attack against the password, as the intruder has to take the salt into account when calculating the hashes. If the salt is an undisclosed value (read hidden value), the intruder might find it impossible to crack the password, as he will be computing hash(password) instead of hash(salt + password). On the other hand, if the salt value is a random alphanumeric value, it will increase the password’s complexity and reduce the chances of it being discovered using bruteforce or a dictionary, as it won’t match lists of common or previously used passwords. Finally, adding a salt value prevents the use of ready made Rainbow tables against our passwords. Why is that? Well, as I said before, a Rainbow table has to be generated for a specific character set with a specific hash function. The precomputed hashes stored in the table weren’t generated with our salt value, so the intruder won’t be able to use already made tables. Instead he will have to generate his own set for that specific salt value (if it’s constant) which, as you’ve guessed, defeats the point of using precomputed tables.
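A small sketch of that last point, assuming md5 and a made-up constant salt (the SALT value is hypothetical and exists only for this illustration). A table precomputed without the salt no longer contains the hash we store:

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical constant salt, chosen only for this example.
SALT = "x7Qp2z"


def salted_hash(password: str, salt: str = SALT) -> str:
    """md5(salt + password), as in the formula above."""
    return md5_hex(salt + password)


# A table an attacker could have precomputed WITHOUT knowing the salt.
unsalted_table = {md5_hex(word): word for word in ["god", "secret", "password"]}

stored = salted_hash("secret")
found = unsalted_table.get(stored)  # None: the salted hash is not in the table
```

Even though “secret” is in the attacker’s word list, the lookup fails because the stored value is the hash of the salt plus the password.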

Although the previous method is quite secure, we still have a slight problem. If we use a constant salt value, there is always the possibility of someone creating an ad hoc Rainbow table for it. This is especially true if your software is very well known. For example, if WordPress used a default salt value for its passwords, someone would most probably create a Rainbow table for it. You can argue that you can change it and you’ll be protected; nevertheless, the percentage of users that actually do that is very low. Besides, why take the risk when you can make it better?

Now, let’s say that instead of a constant salt we use a different pseudo random salt for each password. That is, for each new password we store, we generate a pseudo random salt value (I stress pseudo random, as software alone can’t produce truly random numbers) and store the salted hash. The question that arises then is, how do I know which value I used when salting each password? The answer is that you store it with the hash. In this scenario we will do the following:

Generate a pseudo random salt value, S

Obtain the hash of the password: H = hash(S + password)

Store S and H in the database
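The three steps above can be sketched as follows. This is an illustration under my own assumptions: the salt comes from Python’s secrets module, md5 is kept only to match the earlier examples, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_password_record(password: str):
    """Generate a fresh salt S, compute H = hash(S + password), return (S, H)."""
    salt = secrets.token_hex(16)  # 32 random hex characters per password
    digest = hashlib.md5((salt + password).encode("utf-8")).hexdigest()
    return salt, digest  # both values are stored in the database


def verify(password: str, salt: str, stored_digest: str) -> bool:
    """Recompute the hash with the stored salt and compare."""
    digest = hashlib.md5((salt + password).encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_digest)
```

Two users who pick the same password end up with different salts and therefore different stored hashes, so one precomputed table can’t cover them both.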

Even though we store the salt in the database, we’ve achieved an extra layer of security. With this method, it’s nearly impossible to precompute a Rainbow table, and bruteforce and dictionary attacks become more expensive because each password has to be attacked separately. The problem is that if someone breaches the database and is able to retrieve the hashes together with the salt values, they could, theoretically, craft a bruteforce or dictionary attack using each different salt value. To avoid this, to a certain extent, we can also use a constant hidden salt value. That is, we can hardcode a constant salt value in the configuration file of our application. That way, even if the database is leaked, they won’t have the constant salt value, rendering any crack attempt useless:

hash = md5(randomSalt + password + constantSalt)

There is a caveat: if an intruder also gains access to the application and can read the constant salt value from the configuration file, we’ll be back to square one. Nevertheless, if an intruder reaches that point, it’s already game over for your application, as it means they have access to the system where it’s installed.
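A hedged sketch of the combined scheme, where CONSTANT_SALT stands in for the value that would live in the application’s configuration file (the name and the value are hypothetical):

```python
import hashlib
import secrets

# Hypothetical value; in a real application it would be read from the
# configuration file and never stored in the database.
CONSTANT_SALT = "kept-outside-the-database"


def combined_hash(password, random_salt, constant_salt=CONSTANT_SALT):
    """hash = md5(randomSalt + password + constantSalt)."""
    data = random_salt + password + constant_salt
    return hashlib.md5(data.encode("utf-8")).hexdigest()


# What a leaked database row would contain: the random salt and the hash.
random_salt = secrets.token_hex(16)
stored = combined_hash("secret", random_salt)

# An attacker holding only the row guesses the right word but lacks the
# constant salt, so the hash he computes does not match.
attacker_guess = hashlib.md5((random_salt + "secret").encode("utf-8")).hexdigest()
```

Here attacker_guess differs from stored even though the guessed password is correct, which is exactly the extra protection the hidden value buys.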

For the record, there are different variations on the methods I’ve described. That’s the case of WordPress, which processes its passwords with a double hash:

hash = md5(md5(password))

I will talk a little more about this method in the next post but, just to clarify, hashing a value twice doesn’t add any real security. It might make a bruteforce or dictionary attack a little harder, but nothing more.

Another important note. In the above examples I’ve been using the md5 hashing algorithm. By now md5 is considered broken. This means that there is a way of creating the same hash value with two different input strings. This is called a collision and it renders a hashing algorithm useless. As I said before, one of the key strengths of a hash is that, in practice, no two input strings share the same output hash; if this doesn’t hold, then it’s useless. So, my recommendation is to use sha2 or blowfish-based bcrypt (as sha1 is also known to have collision problems).

I hope this post has been helpful in giving a little insight into password implementations. I’m writing a second part, this time with real code and some problems that have been flagged in the way WordPress manages its passwords.

December 13, 2007

Mark Evans had a nice question on his blog some time ago and, while commenting on it, I got an idea. One of the biggest problems (among others) with blogs, comments and RSS is that you can’t have both at the same time. That is, you can’t download one RSS feed with the content and the comments (as far as I know, so if I’m wrong, please correct me). I am aware that some feed readers can display both, but it’s not mainstream yet. One of those systems is fav.or.it, which I had the luck to see at FOWA in London this year. They are doing an awesome job and their interface is very, very cool. Anyhow, my idea is based on something I read in the latest OPML draft specification (2.0) by Dave Winer.

Most blogs and content platforms offer two RSS feeds: one for the actual content and another one for the comments. The problem is that you need to subscribe to both to get them AND, most probably, your feed reader will display them as separate feeds. In my opinion this is a burden and it might be one of the reasons why we see such low ratios of comments per subscriber. I always cite the same fact, but it’s astonishing, at least for me, that a blog like TechCrunch with 165k+ subscribers gets an average of 20-30 comments per post. So, looking at the RSS 2.0 specification, I remembered that there is a <comments> tag for each RSS item where, in theory, you have a link to the comments. Now, getting back to the OPML draft, Dave added a new feature which I think could allow better comment handling. The basic tag in OPML is the <outline>. This tag can have an attribute called type, which can take several values, among them the “link” value. Quoting the OPML specification: “An outline element whose type is link must have a url attribute whose value is an http address”. In the new specification, he added what he calls inclusion. That is, if the url of an outline with link type ends in .opml, the file will be downloaded and included when displayed. I think the idea is very cool, especially if we give it a little twist. I was thinking that if, instead of doing the inclusion with .opml files, we do it with an RSS file, and we add that capability to the <comments> tag in the RSS specification, we could have a wonderful tool to include comments in a feed.

So, putting all this together we might end up with something like this:
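As a rough illustration, here is a Python sketch that builds one RSS item with the extended <comments> tag using the standard xml.etree.ElementTree module. The attribute names number and xmlUrl, the function name and the example.com addresses are assumptions made for this sketch; they are not part of the RSS 2.0 specification.

```python
import xml.etree.ElementTree as ET


def build_item(title, link, comments_page, comments_feed, comment_count):
    """Build an RSS <item> whose <comments> tag carries two extra attributes."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # Hypothetical extension: number = comment count, xmlUrl = comments feed.
    comments = ET.SubElement(
        item,
        "comments",
        {"number": str(comment_count), "xmlUrl": comments_feed},
    )
    # The element text stays what RSS 2.0 defines: a link to the comments page.
    comments.text = comments_page
    return item


item = build_item(
    "My post",
    "http://example.com/my-post",
    "http://example.com/my-post#comments",
    "http://example.com/my-post/feed",
    12,
)
xml_text = ET.tostring(item, encoding="unicode")
```

The xml_text string holds the serialized item, with the comment count and the comments feed address sitting next to the regular comments link.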

With that slight modification we now have the number of comments and the xmlURL of the comments’ RSS feed for that post. What happens on the other side? Well, the feed reader must read that tag and include the comments as if they were part of the feed. Adding the number attribute allows us, on one hand, to show readers how many comments the post has and, on the other, lets the feed reader know whether it needs to download the comments’ RSS feed again. Evidently, the number attribute will change over time as new comments arrive, and the feed reader will see the updated value every time it requests the RSS feed.

So, what do you think of my idea? As with everything, I’m sure this issue has been discussed by someone before me and it might even have a different solution I’m unaware of. So please, if you know a cool way to handle comments, drop me a comment and I’ll include it in the post.

UPDATE: This is what happens when you rush a post. You get it wrong. What I just described exists already and is known as wfw:commentRss. The only difference from what I was proposing is that the comment count is specified by a different tag (slash:comments), which by the way has nothing to do with the wfw namespace. Now, the commentRss tag was proposed in 2003. I’m quite puzzled as to why it hasn’t been adopted in feed readers yet. Some readers display the comment count, some others even parse the commentRss tag, but I don’t know yet of a reader that actually pulls that data based on the counter and merges it with the content. Maybe you could even ask the user to activate that feature on a per post basis.
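For the curious, a reader could pick up both tags along these lines. This is a minimal sketch with Python’s xml.etree.ElementTree; the sample feed, the example.com addresses and the function names are made up for illustration, while the two namespace addresses are the ones these extensions use.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "wfw": "http://wellformedweb.org/CommentAPI/",
    "slash": "http://purl.org/rss/1.0/modules/slash/",
}

# Made-up feed with a single item, for illustration only.
SAMPLE_FEED = """<rss version="2.0"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Hello</title>
      <link>http://example.com/hello</link>
      <wfw:commentRss>http://example.com/hello/feed</wfw:commentRss>
      <slash:comments>3</slash:comments>
    </item>
  </channel>
</rss>"""


def comment_info(feed_xml):
    """Return (title, comments feed url, comment count) for every item."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        feed_url = item.findtext("wfw:commentRss", namespaces=NAMESPACES)
        count_text = item.findtext("slash:comments", namespaces=NAMESPACES)
        count = int(count_text) if count_text is not None else None
        results.append((item.findtext("title"), feed_url, count))
    return results


def should_refetch(last_seen_count, current_count):
    """Download the comments feed again only when the counter has changed."""
    return current_count is not None and current_count != last_seen_count
```

A reader could remember the last count it saw for each post and call should_refetch before downloading the comments feed again.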

Some days ago, our good friend Duncan Riley wrote about Twittories. It’s a website where many people collaborate to write a short story. The catch is that they have to do it with a Twitter message. This means each user can only submit one message with a maximum length of 140 characters. While reading the story I was reminded of a game we used to play when we were little; I know it has another name in English but I can’t remember it now. In Spanish it was called the broken phone. Anyway, although it’s a fun idea and a cool experiment, there is no way you can read a story written in that fashion. The central plot is so broken it’s impossible to follow. But I must say it got me thinking about the quality of most articles. Before creating my current startup, I developed a prototype for what was going to be my first try in the entrepreneur world. I designed a semantic algorithm that was able to detect and highlight structural problems in texts. In other words, a style corrector. While I was testing it, I realized how badly written most news articles were. One of the most common problems was the use of overly long sentences. And when I say long, I’m talking about 50 or more words per sentence without a full stop. So, back to the twittories, I realized that, even though the plot might be messed up, the 140 character restriction could be a great rule of thumb for many writers (including myself of course). Maybe we could enforce some rules like that on our blog platforms. Now that TypePad is also open source, there is no excuse 😉

December 7, 2007

Many people are talking about the current technology bubble we are experiencing. It’s something that has been written about in many places. It has even been depicted in the following video (if you haven’t seen it yet, it’s a must):

I don’t want to argue about it, as there has been quite some fuss around it. What is clear to me is that it will eventually burst. When that happens, many startups will have to close and many people will lose their jobs. Will it be worse than in 1999? I can’t know, although I personally think it won’t be as dreadful. Anyway, I was wondering what the future of the blogosphere will be when this bubble crashes. It’s easy to predict that many startups will have to close due to small or nonexistent user bases, but what will happen to all the tech blogs out there? Brian Oberkirch has an interesting post about the value of most of the current tech blogs. Many of them just reproduce the latest news in different words. I think we need that kind of blog, but right now we have way too many of them. We need valuable bloggers who, as Brian points out, uncover hidden relationships between things and make us think and analyze current trends. So I wonder, if the bubble bursts, what will happen to all those new-shiny-startup-review-copycats? Will many of us still be there? If there are no more tech companies, will people stop blogging or will they just blog about the bubble’s crash?

Take the example of TechCrunch: if there are no more startups, will they have to blog about other stories? Will the readers continue to read those new stories or will they move on? I guess my point is, should we expect a huge drop in our blog audiences after the bubble bursts? Or, more to the point, will the overall number of posts per day decrease?

Personally I would expect the same level of blog posts but, instead of talking about new companies, they could offer detailed analyses of why startup X or company Y failed. Or, for example, of how the companies that did manage to stay alive made it through the crash, what strategies they adopted and what lessons they learned. Nevertheless, some tech blogs will inevitably disappear or at least lower their posting rates (a lot).

As a blogger or a blog reader, do you think we might be heading to a blog crisis? If not, do you foresee any situation that would eventually lead to a blog crisis?

PS: Sorry about the delay in posting. I’ve had some really stressful weeks, but I’m back with some new posts.

UPDATE: I had to change the url of the video as it had changed. It’s working again 🙂