Posts tagged with web

A week of innovation

First of all I want to thank Lifesum for having another "Innovation Week", it is a great opportunity and I hope that more companies will start following it. In a few words the idea is to allow everyone from the company to freely pick a project or idea that they want to develop and and work on it for one week.
The benefits range from just making people happy because of the break of the routines and the opportunity to work on something a bit different, to seeing some pretty amazing prototypes that can be easily implemented in the company product.

What is Service Discovery?

In summary service discovery is the possibility of the separate services in scalable infrastructure to communicate with each other and to the outside world. In other words - how to route the requests to the corresponding service while providing balanced load on the instances in the pool and monitoring their health.
Sounds simple, right?

Well, unfortunately service discovery in the real world is not that simple. In my presentation from the Stockholm Python MeetUp I talked a bit more about the complexity of service discovery, the suboptimal solutions and Smartstack - a solutions invented at AirBnB for simplifying the whole process.
You can see more about it in my presentation:

However, the whole idea sounded so awesome that me and my colegue Esma decided to team up on that and try to explore a bit more the opportunities that Smarstack provides for us.
In the matter of fact we decided to explore two different approaches: Smartstack and Consul. However we had some issue with the Consul setup and we found that it is not acting exactly the way we so at the end we focused all our attention to Smarstack.

How does it work?

Smartstack consists of two main components - Nerve and Synapse.
Never handles the service registration while Synapse reads the information about the available services and configures a local HAProxy that plays as a load balancer for the service pool.
For our tests we used Zookeeper as register for the services.

What have we built?

We created a small project consisting of a pool of Zookeeper instances, two node for service A and one node for service B. We tested multiple scenarios of crashes of one or more nodes, both zookepers instances and service instances, how the systems operates during the crashes and how it recovers after the nodes are brought back on.

Results

As a result we created a public repository service-discovery giving details about the whole setup process, our tests and their results. Actually now is the time to praise Esma and her awesome work on conducting the tests and writing the great documentation from the repo above.
So, if you have ever wondered about how service discovery works or you just want to test Smartstack, just clone the repo and follow the instruction. We would be happy to hear about your experience.

Or how to use the benefits of Django template system during the PSD to HTML phase

There are two main approaches to start designing a new project - Photoshop mock-up or an HTML prototype. The first one is more traditional and well established in the web industry. The second one is more alternative and (maybe)modern. I remember a video of Jason Fried from 37 Signals where he talks about design and creativity. You can see it at http://davegray.nextslide.com/jason-fried-on-design. There he explains how he stays away from the Photoshop in the initial phase to concetrate on the things that you can interact with instead of focusing on design details.

I am not planning to argue which is the better method, the important thing here is that sooner or later you get to the point where you have to start the HTML coding. Unfortunately frequently this happens in a pure HTML/CSS environment outside of the Django project and then we waste some extra amount of time to convert it to Django templates.

Wouldn't be awesome if you can give the front-end developers something that they can install/run with a simple command and still to allow them to work in the Django environment using all the benefits it provides - templates nesting and including, sekizai tags etc.

I have been planning to do this for a long time and finally it is ready and is available at Django for Prototyping. Currently the default template includes Modernizr, jQuery and jQuery UI but you can easily modify it according to your needs. I would be glad of any feedback and ideas of improvement so feel free to try it and comment.

Preface: Nine months ago(I can't believe it was that long) I created a script called Simple Site Checker to ease the check of sitemaps for broken links. The script code if publicly available at Github. Yesterday(now when I finally found time to finish this post it must be "A few weeks ago") I decided to run it again on this website and nothing happened - no errors, no warning, nothing. Setting the output level to DEBUG showed the following message "Loading sitemap ..." and exited.
Here the fault was mine, I have missed a corner case in the error catching mechanism i.e. when the sitemap URL returns something different from "200 OK" or "500 internal server error". Just a few second and the mistake was fix.

Problem and Solution: I ran the script again and what a surprise the sitemap URL was returning "403 Forbidden". At the same time the sitemap was perfectly accessible via my browser. After some thinking I remembered about that some security plugins block the access to the website if there is not User-Agent header supplied. The reason for this is to block the access of simple script. In my case even an empty User-Agent did the trick to delude the plugin.

Final words: As a result of the issue mention above one bug in simple site checker was found fixed. At the same time another issue about missing status and progress was raised, more details can be found at Github but in a few words an info message was added to each processed URL to indicate the progress.

If you have any ideas for improvement or anything else feel free to comment, create issues and/or fork the script.

During the development of Simple Site Checker I realised that it would be useful for test purposes if there is a website returning all possible HTTP status codes. Thanks to Google App Engine and webapp2 framework building such website was a piece of cake.

The home page provides a list of all HTTP status codes and their names and if you want to get an HTTP response with a specific status code just add the code after the slash, example:http://httpstatuscodes.appspot.com/200 - returns 200 OKhttp://httpstatuscodes.appspot.com/500 - returns 500 Internal Server Error
Also at the end of each page is located the URL of the HTTP protocol Status Codes Definitions with detailed explanation for each one of them.

... a command line tool to monitor your sitemap links

I was thinking to make such tool for a while and fortunately I found some time so here it is.

Simple Site Checker is a command line tool that allows you to run a check over the links in you XML sitemap.

How it works: The script requires a single attribute - a URL or relative/absolute path to xml-sitemap. It loads the XML, reads all loc-tags in it and start checking the links in them one by one.
By default you will see no output unless there is an error - the script is unable to load the sitemap or any link check fails.
Using the verbosity argument you can control the output, if you need more detailed information like elapsed time, checked links etc.
You can run this script through a cron-like tool and get an e-mail in case of error.

... or how to avoid duplicate content by keeping the current language in the URL

Preface: Earlier this year I posted about Django CMS 2.2 features that I want to see and one of the things mentioned there was that once you have chosen the language of the site there is no matter whether you will open "/my_page/" or "/en/my_page/" - it just shows the same content. The problem is that this can be considered both duplicate and inconsistent content.
Duplicate because you see the same content with and without the language code in the URL and inconsistent because for the same URL you can get different language versions i.e. different content.

Solution: This can be easy fixed by using a custom middleware that will redirect the URL that does not contain language code. In my case the middleware is stored in "middleware/URLMiddlewares.py"(the path is relative to my project root directory) and contains the following code.

Now a little explanation on what happens in this middleware.
Note: If you are not familiar with how middlewares work go and check Django Middlewares.
Back to the code. First we split the URL by '/' and take the second element(this is where our language code should be) and store in lang_path(8).
URLS_WITHOUT_LANGUAGE_REDIRECT is just a list of URLs that should not be redirected, if lang_path matches any of the URLs we return None i.e. the request is not changed(9-10). This is used for sections of the site that are not language specific for example media stuff.
Then we get language based on the request(11-13).
If lang_path is empty then the user has requested the home page and we redirect him to the correct language version of it(14-15).
If lang_path does not match any of the declared languages this mean that the language code is missing from the URL and the user is redirected to the correct language version of this page(16-17).
To make the middleware above to work you have to update your settings.py.
First add the middleware to your MIDDLEWARE_CLASSES - in my case the path is 'middleware.URLMiddlewares.CustomMultilingualURLMiddleware'.
Second add URLS_WITHOUT_LANGUAGE_REDIRECT list and place there the URLs that should not be redirected, example:

URLS_WITHOUT_LANGUAGE_REDIRECT=['css','js',]

Specialties: If the language code is not in the URL and there is no language cookie set your browser settings will be used to determine your preferred language. Unfortunately most of the users do not know about this option and it often stays set to its default value. If you want this setting to be ignored just add the following code after line 10 in the middleware above:

It removed the HTTP_ACCEPT_LANGUAGE header sent from the browser and Django uses the language set in its settings ad default.

URLS_WITHOUT_LANGUAGE_REDIRECT is extremely useful if you are developing using the built in dev server and serve the media files trough it. But once you put your website on production I strongly encourage you to serve these files directly by the web server instead of using Django static serve.

Final words: In Django 1.4 there will be big changes about multilingual URLs but till then you can use this code will improve your website SEO. Any ideas of improvement will be appreciated.

... or how to pull data about page visits instead of implementing custom counter

Preface: OK, so you have a website, right? And you are using Google Analytics to track your page views, visitors and so on?(If not you should reconsider to start using it. It is awesome, free and have lost of features as custom segments, map overlay, AdSense integration and many more.)
So you know how many people have visited your each page of your website, the bounce rate, the average time they spend on the page etc. And this data is only for you or for a certain amount whom you have granted access.

Problem: But what happens if one day you decided to show a public statistic about visitors on your website. For example: How many people have opened the "Product X" page?
Of course you can add a custom counter that increases the views each time when the page is open. Developed, tested and deployed in no time. Everyone is happy until one day someones cat took a nap on his keyboard and "accidentally" kept the F5 button pressed for an hour. The result is simple - one of you pages has 100 times more visits than the other. OK, you can fix this with adding cookies, IP tracking etc. But all this is reinventing the wheel. You already have all this data in your Google Analytics, the only thing you have to do is to stretch hand and take it.

Solution: In our case "the hand" will be an HTTP request via the Google Data API. First you will need to install the Python version of the API:

sudoeasy_installgdata

Once you have the API installed you have to build a client and authenticate:

SOURCE_APP_NAME is the name of the application that makes the request. You can set it to anything you like.
After you build the client(2) you must authenticate using your Google account(3-9). If you have both Google and Google APPs account with the same username be sure to provide the correct account type(8).
Now you have authenticated and it is time to build the request. Obviously you want to filter the data according some rules. The easiest way is to use the Data Feed Query Explorer to build your filter and test it and then to port it to the code. Here is an example how to get the data about the page views for specific URL for a single month(remember to update the PROFILE_ID according to your profile).

Final words: As you see it is relatively easy to get the data from Google but remember that this code makes two request to Google each time it is executed. So you will need to cache the result. The GA data is not real-time so you may automate the process to pull the data(if I remember correctly the data is updated once an hour) and store the results at your side which will really improve the speed. Also have in mind that this is just an example how to use the API instead of pulling the data page by page(as show above) you may pull the results for multiple URLs at once and compute the feed to get your data. It is all in your hands.
You have something to add? Cool I am always open to hear(read) you comments and ideas.

... five major question to ask yourself before using cache

After we learned about Why, Where, What and How of caching web sites now it is time to see when to use it.
The application cache is mainly used for speeding up and/or decreasing the load of frequently executed and(but not necessary) heavy resource using methods. So the first question is:

1) If you have method the consumes lots of CPU time and/or memory can it be optimised?

If you can optimize your code and make the method run faster and consume less resource than do this first and then reconsider whether you still need to cache it..

2) Will caching save you load/wait?

Have in mind that accessing cache has its own load. So caching the result of relatively light operations is pointless. Try to find where your biggest load/wait came from and use cache there.

3) For how long the cache should be valid?

This depend from how often the data is changed. We can split it in 3 major cases.

Case 1: The data changes on equal amounts of time, for example it is updated by cron. In this case you can set the expire time equal to the time interval in which the cron is runned. Example - you are reading a feed with the news from the last hour.

Case 2: The data is changed on random intervals of time - but not real time(i.e. from minutes to hours). In this case you should choose an average amount of time for which you think the data is persistent or even if it is changed you can serve the outdated one to the client. In this case the best way is to be able to invalidate cache when data changes. This is usable if the data is from the same type, for example - news list. Just add a line that invalidates cache on every news affecting operation(add/edit/delete). If you have composite data, for example mix of news, weather cast and currency rates for the day you'll just have to wait the cache to expire by it self.

Case 3: The data is updated real-time(every second, sometimes more than once in a second). If you are watching/playing on the market you really need real-time info for the current rates. But if don`t need the real-time date, for example you are displaying just informative graph with the daily changes you can cache and update it rarely(for example on every few minutes) then it changes.

So you have determined how long your cache should stay valid and the next question is:

4) Will the data be accessed more than once for the cache period?

If not then don't cache it or reconsider to increase cache validity time(question 2).

5) How big is the data to be cached?

Memcached has 1MB limit per cached item. For web(no matter is it site or application) this limit is fair enough for most/all of the cases. If you plan to store something bigger thing again and be sure you are not doing something wrong. If you really need to cache bigger amount of data consider to use another cache storage - database or file system.

If you are an advanced developer you will subconsciously know whether to use cache or not. But this questions may be really helpful if you are a novice.
I am open to hear what you ask yourself before using cache.

I am happy to say that I successfully passed the entry tests for a PhD student. So now I am part of SULSIT(State University of Library Studies and Information Technologies) PhD program. The program`s name is "Automated Systems for Information processing and information management" and my thesis will be "Research of the current methods and technologies for web sites and web application development". The dissertation have to be in Bulgarian but I will try to POST the table of content translated as well with a short resume for each chapter.
As always any ideas and suggestions are welcome.