Category Archives: Featured

Well, well, well. Hit or flop, only time will tell, but a new Twitter-based startup is on the horizon: http://twit2tel.com

So, you want to make free international calls, don't you? Aha, yes you do. Well, unlike what you might think, there is no catch, and I haven't been bribed to write this article about Twit2Tel (not to mention, why would they pay some blogger who's got a handful of readers? And umm, yes, that makes you special).

OK, so let's cut through to the real deal. To make an international call, you need to do the following:

1. Get an account on Twitter.

2. Open Twit2Tel.com, log in using your Twitter account, and grant Twit2Tel the privileges it asks for. While completing your registration at Twit2Tel, do mention your mobile number (I don't know for sure, but you may not be able to register without it, so that goes without saying.. lol).

3. You are all set. Now get your friends to do the same. Once they are registered too, all you need to do is go to the Twit2Tel homepage and place a call to your friend's Twitter ID (you should find the box for that in the lower right corner of the homepage).

After step 3, Twit2Tel will automatically call you, play you a 10-second ad, and then call the number associated with your friend's Twitter ID. All (s)he needs to do then is pick up and press 1 (which is the option to accept the call).

Well, that's it, folks. Now enjoy a 4-minute call with good voice quality to your friend. It is basically unlimited, because you can make a thousand free 4-minute calls.

Feel free to drop in your comments and suggestions. And hey, there is much more to Twit2Tel than free calls, such as voicemail, etc.! But as for me, I am enjoying the free calls and care the least for the rest! You may want to explore more!

Do you fancy a website (or web infrastructure) built on simple commodity hardware that can really handle millions of users?
(i.e. no sweet and cute supercomputers, and yet so powerful!)

Really? So let's jump in and discuss how you can optimize your site to deliver and render content at blazing speed to huge volumes of users (which, of course, in many ways also reduces the load on the server):

(Note: All the maths below is hypothetical and not exact figures; it may vary based on the use case.)

1. Minimize HTTP Requests: For each of your images, CSS files and JS files, an HTTP request is made to the server, which in turn responds with the required file, and each HTTP request takes anywhere from a few milliseconds to a few seconds to come back! (Heck of a lot of time spent!) So, to reduce that, you can do the following:

b). Use image sprites and image maps, i.e. combine multiple images into one and show the part you need by specifying a background position, instead of making separate calls for n images.

Pros:

a). Cuts page load time by about 80% (yes 80%, you heard it right)

b). Decreases connection pool load on the server (fewer HTTP requests means less exhaustion of sockets and fewer request threads), i.e. added power to serve many more people with the same CPU and bandwidth!

2. Reduce DNS lookups: Spreading your files over multiple hosts and using them in your HTML costs you some 20-120 milliseconds of DNS lookup for each host. You can get away from that by using a single host as your content provider (since browsers cache, i.e. remember, the host name -> DNS -> IP lookup for about 30 minutes or more), so you help yourself by saving that DNS lookup time if you follow my recommendation.
For example, if you are reading some images from www.abc.com, some from www.xyz.com, some from www.ppp.com and some other files from www.kkk.com, then you spend about 4 * 120 = 480 milliseconds just on looking up these hosts before any content is downloaded from them.
Moreover, in the worst case the files coming from different hosts do not download in parallel; the hosts are served one after another. (Real browsers usually do overlap downloads across hosts, so treat the figures below as a deliberately pessimistic illustration.) So finally, if you are downloading 20 images from 5 hosts (each serving 4 images), and each batch of images takes 5 seconds to download, your total download time will be something like:
DNS Lookup for Host 1 : 0.4 seconds
Download 4 images from Host 1 : 5 seconds (all 4 images from host 1 downloaded in parallel)
DNS Lookup for Host 2 : 0.4 seconds
Download 4 images from Host 2 : 5 seconds
DNS Lookup for Host 3 : 0.4 seconds
Download 4 images from Host 3 : 5 seconds
DNS Lookup for Host 4 : 0.4 seconds
Download 4 images from Host 4 : 5 seconds
DNS Lookup for Host 5 : 0.4 seconds
Download 4 images from Host 5 : 5 seconds
So, Total time taken: 27 seconds
... having said that, if you were using only one host, this might have been:
DNS Lookup for Host 1 : 0.4 seconds
Download 20 images from Host 1: 5 seconds
So, Total time taken: 5.4 seconds
Compare the difference yourself.
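
The comparison above is simple enough to check in a few lines. Here is a minimal sketch in Python that only redoes the arithmetic, assuming (as the example does) that each host costs one DNS lookup plus one parallel batch of downloads, and that hosts are visited strictly one after another. The constant and function names are made up for illustration, and the figures are the hypothetical ones from the example, not measurements.

```python
# Hypothetical figures copied from the example above; not measurements.
DNS_LOOKUP_SECONDS = 0.4
DOWNLOAD_BATCH_SECONDS = 5.0  # one batch of images fetched in parallel from a single host


def total_load_time(host_count: int) -> float:
    """Total time when every host costs one DNS lookup plus one parallel batch,
    and the hosts are visited strictly one after another."""
    return host_count * (DNS_LOOKUP_SECONDS + DOWNLOAD_BATCH_SECONDS)


five_hosts = total_load_time(5)
one_host = total_load_time(1)
print(f"5 hosts: {five_hosts:.1f} s, 1 host: {one_host:.1f} s")
```

Under these assumptions the cost grows linearly with the number of hosts, which is the whole point of the comparison.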

3. Avoid Redirects: Redirects, whether 301 or 302, consume time! The request headers are sent to the server, the server answers with a redirect, and the browser then has to send the headers all over again to the new URL and repeat the whole process. Moreover, search engines "hate" needless redirects: your SEO value can drop to the floor, and you lose rank and the trust of search engines. Do you want that to happen? Search engines are like girlfriends; you should not do things like these to piss them off.

4. Flush the buffer: Well, generally any dynamic scripting language takes about 400-500 milliseconds to respond with any HTML to the browser, and during this time the browser sits idle. However, we can avoid that by sending the output of the partly executed PHP script to the browser, so it has something to start rendering with. You can accomplish this by doing a <?php flush(); ?>
So, if you do that, you give the browser something to start with while you generate the rest of your page. Say the browser waits 500 milliseconds to start receiving your HTML and then takes 5 seconds to render it: your total time spent is 5 seconds + 500 milliseconds. If you flush your headers early instead, you save those 500 milliseconds. Worth the saving for one extra line of code ( <?php flush(); ?> ), isn't it?
This comes in really handy when your server is under heavy load and processing a page takes quite some time!
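
The same idea exists outside PHP. Here is a minimal sketch in Python, assuming a WSGI application where each yielded chunk plays the role of a flush: the head of the page goes out first, and the slow part follows later. The function `slow_page_body`, the stylesheet path and the 50 ms delay are made up for illustration; whether the first chunk really reaches the browser early also depends on the server and any proxy buffering in between.

```python
import time
from typing import Callable, Iterator


def slow_page_body() -> str:
    """Hypothetical stand-in for the slow, dynamic part of the page."""
    time.sleep(0.05)  # pretend this is the 400-500 ms of server-side work
    return "<p>Hello, world</p>"


def app(environ: dict, start_response: Callable) -> Iterator[bytes]:
    """WSGI app that hands over the <head> before the slow work starts."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    # First chunk: produced right away, so the browser can start fetching the CSS.
    yield b"<html><head><link rel='stylesheet' href='/style.css'></head><body>"
    # Second chunk: only produced after the slow work has finished.
    yield slow_page_body().encode("utf-8")
    yield b"</body></html>"
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library should do.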

5. Use a CDN (Content Delivery Network): Generally, you serve content to users from your own server, and your server is based in, say, the US, while the person requesting the page can be in Korea, India or anywhere else. So for bigger files such as images, JS and CSS, network latency is a big factor. This is where a CDN comes into play. A CDN is nothing but a set of servers spread across the earth, ideally one in every country or so, meant for serving content. So now, when a user opens your website, you serve the plain HTML from your server (in the US), but you serve images, CSS and JS through the CDN, where the CDN server nearest to the user's location is picked to deliver the content, which cuts network latency to a minimum.
Now, the big question: having so many servers across the world is a huge undertaking. It's okay for firms like Yahoo, Facebook, etc. to have that, but how will you do it? Well, the answer is as simple as: upload your images, CSS and other files as a Google application on Google App Engine (Google effectively serves App Engine files like a CDN, since it has its own set of servers around the world).
Thus, you leverage the power of Google to serve content, and it's free too!
(And hey, did you realize you will also save on the bandwidth allocated to you by your hosting provider, since you are not serving images and other big files from your server?)

Another positive of a CDN is that it caches files (some do, some do not): it does not read the file from the file system (its hard disk) each time a user requests it; instead it keeps the file in memory (RAM) and delivers it from there. Owing to limited RAM, only the most popular files are cached, but it still speeds up the whole thing to a great extent!

6. Compress your content: Well, as you know, when you archive (compress) text, it shrinks to a very small chunk of data. All modern browsers are capable of receiving HTML and text in compressed form and uncompressing it on the client side. They pass the following header to the server:
Accept-Encoding: gzip, deflate
If the web server sees this header in the request, it may compress the response using one of the methods listed by the client. The web server notifies the web client of this via the Content-Encoding header in the response.
Content-Encoding: gzip
Yet another way to speed up the website!
(Note: text compression can turn 100 KB into, say, 3-4 KB. That means you save 30 times or more of your hosting bandwidth, while you also deliver the content about 30 times faster to the user.)
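
To get a feel for what gzip does to markup, here is a minimal sketch in Python using the standard `gzip` module. The sample page is made up and deliberately repetitive, so the ratio it prints will be far better than what a real page gets; treat it as an illustration of the mechanism, not as a benchmark.

```python
import gzip

# Hypothetical page: very repetitive markup, which is what makes HTML compress well.
html = ("<li class='item'><a href='/article'>Some article title</a></li>\n" * 500).encode("utf-8")

# What the server does when the request says "Accept-Encoding: gzip".
compressed = gzip.compress(html)
ratio = len(html) / len(compressed)
print(f"original: {len(html)} bytes, gzipped: {len(compressed)} bytes, ratio: {ratio:.0f}x")

# What the browser does on "Content-Encoding: gzip": it gets the same bytes back.
restored = gzip.decompress(compressed)
print(restored == html)
```

In practice you rarely compress by hand like this; the web server is usually configured to do it for text responses.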

7. Stylesheets at top and scripts at bottom: As a good development practice, we should include all the stylesheets at the start of an HTML page and all the JavaScript at the bottom. The reason is very simple: scripts add functionality and run code in the browser, which tends to slow things down, so it's a good idea to do that once the page is already visible to the user in a good format, so that the user doesn't get a bad impression of your website. Also, for the HTML to show up in its best (visual) form, it's good to have all styles available before rendering starts. So include your CSS files first, i.e. within <head></head>, and include your JavaScript files in the footer (or at the bottom of your page).

8. MySQL Scaling: So, with all the stuff said above, we haven't talked about how to scale the database so that it can handle huge loads. The site will still hit the db a million times for a million users, with thousands of requests per second! ... gosh ... it will be a brutal murder of the MySQL db!!
So, to scale the MySQL db, we should run multiple database servers on different machines, keeping a couple of them as masters and the rest as slaves. Whenever we have to write anything to the db, it should be written to the master dbs only! And whenever we want to read anything, it should be read from a slave. Among themselves, the master dbs should periodically replicate their current data state to the slaves so that the slaves have up-to-date data!
What this achieves is that instead of one db handling everything, the load is distributed among multiple dbs: the two master dbs only do writes and are busy with that, while the slaves fetch the data whenever data is requested.
This shares the load and lets MySQL scale easily to handle huge loads (notice that with more dbs we get a bigger connection pool and more execution power).
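
The write-to-master, read-from-slave rule can be sketched as a tiny router. This is a minimal sketch in Python, assuming a single master for simplicity (the setup described above uses a couple of them) and a plain round robin over the slaves. The class name and the host names are made up for illustration, and the sketch only decides which host a statement should go to; it does not open any real connection.

```python
import itertools
from typing import Iterator, List


class ReadWriteRouter:
    """Sends writes to the master and spreads reads over the slaves, round robin."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master: str, slaves: List[str]) -> None:
        self.master = master
        self._slaves: Iterator[str] = itertools.cycle(slaves)

    def host_for(self, sql: str) -> str:
        """Return the (hypothetical) host name that should run this statement."""
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._slaves)


router = ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2"])
print(router.host_for("INSERT INTO comments (body) VALUES ('hi')"))
print(router.host_for("SELECT * FROM articles WHERE id = 1"))
print(router.host_for("SELECT * FROM articles WHERE id = 2"))
```

One thing such a router cannot hide: a slave may lag a little behind the master, so a read issued right after a write can still return the old data.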

9. Caching MySQL Reads: Well, it has been noticed that reads are many times more frequent than writes (for example, only 10 of the users reading an article may post a comment, while the other 100 will only visit the article and move on, which translates into 10 writes and 110 reads on the db). So, this slave thing may get exhausted too! So what to do now? Well, we have a solution based on the observation that, since 100 of the 110 reads got the same content back, it makes sense to have that content cached somewhere (in RAM) instead of querying it from the db each time. So we write "intelligent" PHP scripts that read the content from the db only if it's not cached and, once it's read, cache that content in RAM, so that the next time someone tries to read it, they get the content without even touching the database! Well, we just reduced those 110 reads to 10 reads from the db.
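
Here is a minimal sketch of that "read from the db only if it's not cached" logic, written in Python rather than PHP. A plain dict stands in for the RAM cache, and `fake_db_lookup` stands in for a real query; both, like the class name, are made up for illustration. The loop replays the 110 reads from the example and assumes a write lands before every 11th read, wiping the cached copy.

```python
from typing import Callable, Dict


class ArticleCache:
    """Keeps query results in a dict (standing in for a cache in RAM)."""

    def __init__(self, load_from_db: Callable[[int], str]) -> None:
        self._load_from_db = load_from_db
        self._cache: Dict[int, str] = {}
        self.db_reads = 0

    def get_article(self, article_id: int) -> str:
        if article_id not in self._cache:  # cache miss: go to the db once
            self.db_reads += 1
            self._cache[article_id] = self._load_from_db(article_id)
        return self._cache[article_id]

    def invalidate(self, article_id: int) -> None:
        """Call after a write (e.g. a new comment) so readers see fresh data."""
        self._cache.pop(article_id, None)


def fake_db_lookup(article_id: int) -> str:
    """Hypothetical stand-in for a real SELECT against a slave."""
    return f"article {article_id}"


cache = ArticleCache(fake_db_lookup)
for i in range(110):
    if i % 11 == 0:  # assume a comment was just written: drop the cached copy
        cache.invalidate(42)
    cache.get_article(42)
print(cache.db_reads)
```

Dropping the cached copy on every write is the simplest way to keep readers from seeing stale content; a real setup would also need some limit on how much it keeps in RAM.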

10. Use Clean URLs (almost always): Well, instead of having URLs like www.xyz.com/category.php?id=10, have them like www.xyz.com/category/books/. This is not related to speeding up the website or to scalability; the simple reason is SEO. Having such clean URLs lets search engines index your URLs and tag them using the names in the URL (such as category -> books), hence giving you full SEO value (much more than having stuffed keywords or a super meta description in your page).

11. URL Rewrites in Apache config rather than .htaccess: If you can change Apache's httpd.conf, or link your rewrite file to it somehow, you must do it! Always avoid keeping URL rewrite rules in an .htaccess file. The reason is the performance hit: rules in the Apache config are loaded only once, when the server starts, while .htaccess files get looked up and parsed again for each request, which can cause a severe performance hit!

I will update my article with some links as well, maybe at a later time.

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and it also provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.

Installation of Hive is pretty straightforward and easy. With the least chit-chat, I will get down to business for ya!

Prerequisites

Sun Java 6

Hadoop (0.17.x – 0.19.x)

We must have Hadoop already up and running (support for 0.20.x is still in progress, so 0.17.x to 0.19.x is preferable)! If you don't have Hadoop installed already, try deploying it by going through the following tutorials:

I would have written a guide for Hadoop installation, but I really find Michael's tutorial very cool for anyone to follow and get along with Hadoop! So if you haven't installed Hadoop, that's the place to learn and do it, fellas!

Note:
a) For this tutorial purpose, we will be referring to a Single Node Hadoop installation

SVN

SVN, aka Subversion, is an open source version control system. Most of the Apache projects are hosted on SVN. Thus, it's a good idea to have it on your system if you don't already.

For the current tutorial, you will need it to grab the code from the Hive SVN repository.

Search for "recursion" on Google (without the quotes) and, while showing you the results, the Google search engine also suggests: Did you mean: Recursion

So, after all your careful investigation, matching your query string against Google's suggestion, you finally come to believe that your eyes must have missed some hidden UTF-8 or yedda format character for which Google is trying to suggest a correct spelling. Now you click on the suggestion. But whoa! You get the same search results with the same suggestion back again. (Recursion, isn't it?)

Well like I said, you may also utter the same … it appears Google is Lost in Translation Recursion …

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.