There's also the question of what architecture to use (MVC or the like).
–
Ivan Sep 15 '11 at 23:38

Without knowing more about what exactly you are going to launch, it is very difficult to give an answer, but keep in mind "cloud services"; your application probably fits a kind of SaaS model (it is centralized).
–
B4NZ41 Sep 16 '11 at 0:04

8 Answers

Wow, this is a simple question with a huge array of possible answers. The more explicit part of your question asks whether it is more scalable to interface with your database directly or through a web service. That answer is simple: query the database directly. Going through the web service adds a whole bunch of latency that is completely unnecessary for code operating behind a firewall (by and large). A web service, for example, requires some component to receive a request, deserialize it, query the DB, serialize a response and return it. So if your code is all operating behind a firewall, save yourself the trouble and just query the DB directly.

Making a web site scalable however goes way beyond the question you initially posed. So forgive me if I go off on a tangent here, but I thought it might be useful considering that you mentioned Facebook in particular.

I would recommend you read up on the work and tools built by Brad Fitzpatrick (founder of LiveJournal and now at Google). When I worked with him at Six Apart, here are some of the things I learned from him, and about LiveJournal's architecture that made it so scalable.

Use narrow database tables as opposed to wide ones. What was fascinating about this was learning what motivated the architecture: creating a system that could be upgraded easily and quickly. If you use wide tables, in which each field or property of an object is a column, then when it comes time to upgrade the database schema, for example by adding a new column, the system will need to lock the table while the schema change is implemented. When operating at scale this means a simple change to the database schema could result in a large database outage. Which sucks, obviously. A narrow table, on the other hand, simply stores each individual property associated with an object as a single row in the database. Therefore, when you want to add a new "column" all you need to do is INSERT records into a table, which is a non-locking operation. Ok, that is a little background; let's see how this model actually translates in a working system like LiveJournal.
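A minimal sketch of the two layouts, using SQLite in Python purely for illustration (LiveJournal ran MySQL, whose ALTER TABLE behavior at the time is what motivated this; all table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Wide layout: every property is a column; adding a property later
# means ALTER TABLE, which locked the whole table on MySQL of that era.
conn.execute("""CREATE TABLE entries_wide (
    id INTEGER PRIMARY KEY, author_id INTEGER,
    posted_at TEXT, title TEXT, body TEXT)""")

# Narrow layout: one row per (entry, property). "Adding a column"
# is just INSERTing rows with a new key, with no schema change at all.
conn.execute("""CREATE TABLE entry_props (
    entry_id INTEGER, key TEXT, value TEXT,
    PRIMARY KEY (entry_id, key))""")

# A brand-new "mood" property for entry 1, with no ALTER TABLE:
conn.execute("INSERT INTO entry_props VALUES (1, 'mood', 'cheerful')")
row = conn.execute(
    "SELECT value FROM entry_props WHERE entry_id = 1 AND key = 'mood'"
).fetchone()
print(row[0])  # cheerful
```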

Let's say you want to load the last 10 journal entries on a person's blog, and let's say each journal entry has ten properties. In a classic wide table layout, each property would correlate to a column on a table. A user would then query the table once to fetch all of the data they need. The query would return 10 rows, and each row would have all the data they need (e.g. SELECT * FROM entries ORDER BY date LIMIT 10). In a narrow table layout, however, things are a bit different. In this example there are actually two tables: the first table (table A) stores simple criteria one would want to search by, e.g. the id of the entry, the id of the author, the date of the entry, etc. A second table (table B) then stores all of the properties associated with an entry. This second table has three columns: entry_id, key and value. For every row in table A, there would be 10 rows in table B (one row for each property). Therefore, in order to fetch and display the last ten entries, you would need 11 queries. The first query gives you the list of entry IDs, and then the next ten queries fetch the properties associated with each of the entries returned by the first query.
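Here is roughly what that 1 + 10 query pattern looks like in code. This is a sketch using SQLite in Python; the schema and names are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, "
           "author_id INTEGER, posted_at TEXT)")
db.execute("CREATE TABLE entry_props (entry_id INTEGER, key TEXT, value TEXT)")

# Seed 12 entries, each with a couple of properties in the narrow table.
for i in range(1, 13):
    db.execute("INSERT INTO entries VALUES (?, 1, ?)", (i, f"2011-09-{i:02d}"))
    db.executemany("INSERT INTO entry_props VALUES (?, ?, ?)",
                   [(i, "title", f"Post {i}"), (i, "mood", "ok")])

# Query 1: the cheap, index-friendly table gives us just the IDs.
ids = [r[0] for r in db.execute(
    "SELECT id FROM entries ORDER BY posted_at DESC LIMIT 10")]

# Queries 2..11: one property fetch per entry
# (these are the queries memcache will absorb later).
posts = {}
for eid in ids:
    posts[eid] = dict(db.execute(
        "SELECT key, value FROM entry_props WHERE entry_id = ?", (eid,)))

print(len(ids), posts[ids[0]]["title"])  # 10 Post 12
```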

"Holy moly!" you say, "how on Earth can that be more scalable?!" Its totally counter-intuitive right? In the first scenario we just had one database query, but in the second "more scalable" solution we have 11 database queries. That makes no sense. The answer to that question relies entirely upon the next bullet.

Use memcache liberally. In case you were not aware, memcache is a distributed, stateless, low-latency, network-based caching system. It is used by Facebook, Google, Yahoo, and just about every popular and scalable web site on the planet. It was invented by Brad Fitzpatrick partially to help offset the database overhead inherent in a narrow table design. Let's take a look at the same example discussed above, but this time let's introduce memcache.

Let's begin when a user first visits a page and nothing is in the cache. You begin by querying table A, which returns the IDs of the 10 entries you want to display on the page. For each of those entries you then query the database to retrieve its properties, and use those properties to constitute an object that your code can interface with. You then stash that object (or a serialized form of it) in memcache.

The second time someone loads the same page, you begin the same way: by querying table A for the list of entry IDs you will display. For each entry you first go to memcache and say, "do you have entry #X in the cache?" If yes, then memcache returns the entry object to you. If not, then you need to query the database again to fetch its properties, constitute the object and stash it in memcache. Most of the time, the second time someone visits the same page there is only one database query; all other data is pulled straight from memcache.
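The cache-aside flow described above can be sketched like this. A plain dict stands in for memcache here, but a real client (e.g. pymemcache) exposes the same get/set shape:

```python
cache = {}       # stand-in for memcache
db_calls = []    # records every time we actually hit the database

def load_entry_from_db(entry_id):
    db_calls.append(entry_id)  # pretend this is a real query
    return {"id": entry_id, "title": f"Post {entry_id}"}

def get_entry(entry_id):
    key = f"entry:{entry_id}"
    obj = cache.get(key)                    # 1. ask memcache first
    if obj is None:
        obj = load_entry_from_db(entry_id)  # 2. cache miss: hit the DB
        cache[key] = obj                    # 3. stash it for next time
    return obj

get_entry(7)
get_entry(7)
get_entry(7)
print(len(db_calls))  # 1 -- only the first call touched the database
```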

In practice, what ended up happening for most of LiveJournal is that most of the system's data, especially the less volatile data, was cached in memcache and the extra queries to the database needed to support the narrow table schema were all but completely offset.

This design made solving the problem associated with assembling a list of posts associated with all of your friends into a stream, or "wall" much, much easier.

Next, consider partitioning your database. The model discussed above surfaces yet another problem: your narrow tables will tend to be very large/long, and the more rows those tables have, the harder other administrative tasks become. To offset this, it might make sense to manage the size of your tables by partitioning them in some way, so that one cluster of users is served by one database and another cluster of users is served by a separate database. This distributes load on the database and keeps queries efficient.
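A hypothetical sketch of such a partitioning scheme: a stable hash of the user id maps each user to one of several database hosts (the host names are made up):

```python
import hashlib

# Invented shard hosts, purely for illustration.
SHARDS = ["db-a.internal", "db-b.internal", "db-c.internal"]

def shard_for(user_id: int) -> str:
    # md5 of the id gives a stable, evenly spread bucket. Avoid the
    # builtin hash() here: it can vary between processes/runs.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same user always lands on the same shard:
print(shard_for(42) == shard_for(42))  # True
print(shard_for(42) in SHARDS)         # True
```

A real deployment also needs a plan for re-sharding as you add hosts (e.g. consistent hashing or a user-to-shard lookup table), but the routing idea is this simple.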

Finally, you need awesome indexes. The speed of your queries will depend largely upon how well indexed your database's tables are. I won't spend too much time discussing what an index is, except to say that it is a lot like a giant card catalog that makes finding needles in a haystack more efficient. If you use MySQL, then I recommend turning on the slow query log to monitor for queries that take a long time to fulfill. When a query pops up on your radar (e.g. because it is slow), figure out what index you need to add to the table to speed it up.
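To make the index point concrete, here is a small SQLite-based sketch; the idea carries straight over to MySQL, where you would first find the offending query in the slow query log:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, "
           "author_id INTEGER, posted_at TEXT)")

# Without an index, filtering on author_id has to scan the whole table.
plan_before = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM entries WHERE author_id = 1").fetchall()

# Add an index on the slow column, then re-check the plan.
db.execute("CREATE INDEX idx_entries_author ON entries (author_id)")
plan_after = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM entries WHERE author_id = 1").fetchall()

print(plan_before[0][-1])  # a full SCAN of entries
print(plan_after[0][-1])   # a SEARCH ... USING INDEX idx_entries_author
```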

"Thank you for all of this great background, but holy crud, that is a lot of code I will have to write."

Not necessarily. Many libraries have been written that make interfacing with memcache really easy. Still other libraries have codified the entire process described above; Data::ObjectDriver in Perl is just such a library. As for other languages, you will need to do your own research.

I hope you found this answer helpful. What I have found more often than not is that the scalability of a system often comes down less and less to code, and more and more to a sound data storage and management strategy/technical design.

+1 I really love this: "Wow, this is a simple question with a huge array of possible answers."
–
Pankaj Upadhyay Sep 16 '11 at 6:12

"The more explicit part of your question asks whether it is more scalable to interface with your database directly or through a web service. That answer is simple: query the database directly." Actually, not simple -- keep in mind that this strategy trades other qualities such as maintainability and modifiablity, and may make other design decisions more difficult (or easier). These things may not be important in this case, but in my experience answers are rarely this cut and dry.
–
Michael Sep 16 '11 at 14:07

Michael, excellent point. I considered rewriting my answer at some point to mention that Twitter is a good example of a service that publishes an API used by both external AND internal development teams. They live by the mantra "eat your own dog food," and I admire them for that. It took them years, however, to solve the scalability problem. Once they did, though, they had a system that was scalable, well documented, and had all of the positive qualities inherent in a web-service-oriented platform.
–
Byrne Reese Sep 16 '11 at 15:51

I completely disagree with 'query the database directly'. You mention partitioning the database for performance when it would be easier to implement a single-master, multiple-slave architecture behind an API interface. The benefit of decoupling the DB from the application is that the API layer can distribute requests however you want. The API is an abstraction that allows you to change the underlying implementation and/or reuse the data without breaking the application.
–
Evan Plaice Jun 4 '12 at 23:35

(cont) Serialization will always add overhead but only in the API layer which will most likely consist of multiple instances running simultaneously. If you're worried about transfer speeds across the wire, convert to JSON and it will most likely be compressed with gzip anyway. The easiest performance gains can be found when work is pushed from the server to the client. The important question to ask is, would you rather distribute requests within the application or at the server level? Which is easier to duplicate?
–
Evan Plaice Jun 4 '12 at 23:48

Scalability is not a function of specific implementation strategies, but rather of designing your application architecture so that the data access layer can evolve without massive refactoring and rewriting.

An important technique in building a system that scales is to understand your high-level data access requirements and build an interface contract around them. For example, you might have the requirement to get one user or to list the 50 photos posted most recently by any user.

You don't necessarily need a network channel between your application business logic and the data access logic; a method call indirection with one method per logical operation would do just fine to start.

Make these data access methods as simple as possible to start with. It's very hard to predict where the performance problems will lie until your application is serving real usage patterns and you are collecting data about where you have bottlenecks.

By having a well defined data access interface you can evolve your data access implementation without making broad changes to your entire application. You can also decide to switch to a web service architecture transparently to your business logic.
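One way to sketch such an interface contract in code; the names and methods here are illustrative, not from the answer:

```python
from typing import Protocol

class UserStore(Protocol):
    """The contract: business logic depends only on these signatures,
    so the implementation behind them can move from direct SQL to
    memcache or a web service without callers changing."""
    def get_user(self, user_id: int) -> dict: ...
    def recent_photos(self, user_id: int, limit: int = 50) -> list: ...

class InMemoryUserStore:
    """Simplest possible implementation to start with."""
    def __init__(self):
        self._users = {1: {"id": 1, "name": "alice"}}
        self._photos = {1: [f"photo-{n}" for n in range(60)]}

    def get_user(self, user_id):
        return self._users[user_id]

    def recent_photos(self, user_id, limit=50):
        return self._photos.get(user_id, [])[:limit]

def profile_page(store: UserStore, user_id: int) -> str:
    # Business logic never sees SQL, caches, or HTTP; only the contract.
    user = store.get_user(user_id)
    photos = store.recent_photos(user_id)
    return f"{user['name']}: {len(photos)} photos"

print(profile_page(InMemoryUserStore(), 1))  # alice: 50 photos
```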

Many of the answers above give some great advice on how to proceed once you've discovered your performance bottlenecks, but if you apply these too early you can become hamstrung by the complexity of your code before you know if that complexity is even required.

Any additional step in connecting to the database is just overhead. For example, between UI -> Business Facade -> Business -> Data Access -> Database and UI -> Database, the second approach is faster. However, the more steps you remove, the less maintainable your system becomes and the more duplication appears. Imagine writing the code needed to retrieve the list of friends in the profile, the home page, the friends management page, etc.

So, you should strike a balance here between higher performance (which directly affects scalability) and better maintainability.
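As a tiny sketch of that trade-off, a single shared data-access function keeps the duplication out of the individual pages at the cost of one extra call (all names invented):

```python
# Stand-in for a real database.
FAKE_DB = {"friends": {7: ["ann", "bob", "cat"]}}

def get_friends(user_id):
    # The one place to change the query, add caching, etc.
    return FAKE_DB["friends"].get(user_id, [])

# Every page reuses the same data-access function instead of
# each talking to the database itself.
def profile_page(user_id):
    return f"Profile: {', '.join(get_friends(user_id))}"

def home_page(user_id):
    return f"Home: {len(get_friends(user_id))} friends online"

print(profile_page(7))  # Profile: ann, bob, cat
print(home_page(7))     # Home: 3 friends online
```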

But don't limit yourself to the database connection topic when thinking about highly scalable web sites. Consider these items too:

Choosing the right platform (PHP is faster because of its scripting nature, while ASP.NET needs to compile the requested file on the fly before it can process and serve anything. Node.js is also claimed to be more scalable because of its callback-based, non-blocking architecture.)

Using RESTful architecture instead of web service model (SOA)

Using JSON for data transfer instead of XML (which results in fewer bytes being transferred)
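A quick, rough check of the JSON-versus-XML size claim, using one invented record:

```python
import gzip
import json

record = {"id": 1, "author": "alice", "title": "Hello", "mood": "cheerful"}

as_json = json.dumps(record).encode()
# Hand-written XML equivalent of the same record.
as_xml = (b"<entry><id>1</id><author>alice</author>"
          b"<title>Hello</title><mood>cheerful</mood></entry>")

print(len(as_json) < len(as_xml))  # True: JSON repeats far fewer tag names

# Gzip (as mentioned in the comments above) compresses either on the
# wire; note that on payloads this tiny, gzip's own header overhead
# can outweigh the savings, so the win shows up on realistic sizes.
compressed = gzip.compress(as_xml)
```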

It's accepted wisdom that web applications should be designed with three tiers by default: web (presentation), application and database layers. This division reflects the different requirements at each layer: typically fast disk access/storage for the database, high CPU/memory at the app layer, and high external bandwidth/memory/geographic dispersion at the web layer. The application and database layers are often merged into one until much later in the application's lifecycle, as database machines tend to be massive servers that can also handle the early application load.

The specific number of layers and appropriate architecture for your application, though, doesn't have to match this or any other model.

Plan to measure and monitor all of the activity in your system. Start from a two- or three-tier design, and focus on the portions which, as you're building it, look like they will require the most resources. Let the running application guide your design at this level. The more information you gather, and the more accurate and detailed it is, the better decisions you can make about the application's design as it grows.

Choose a framework and architecture which will, later on, allow you to pivot/make required changes as quickly and painlessly as possible. Even if your data-access/storage/processing and application processing are being performed in the same executable, if they're properly factored, it won't be as difficult to split them out into two layers later on, for example.