Friday, February 29, 2008

Building scalable systems is becoming a hotter and hotter topic. As more and more people use computers these days, both the transaction volume and users' performance expectations have grown tremendously.

Scalability is about reducing the adverse impact of growth on performance, cost, maintainability and many other aspects.

e.g. Running every component in one box gives higher performance when the load is small. But it is not scalable, because performance drops drastically when the load increases beyond the machine's capacity.

Understand the environmental workload conditions that the system is designed for

Server Farm

If there is a large number of independent (potentially concurrent) requests, you can use a server farm, which is basically a set of identically configured machines fronted by a load balancer.

The application itself needs to be stateless so that requests can be dispatched purely based on load conditions and not on other factors.

Incoming requests will be dispatched by the load balancer to different machines and hence the workload is spread and shared across the servers in the farm.

The architecture allows horizontal growth so when the workload increases, you can just add more server instances into the farm.

This strategy is even more effective when combined with cloud computing, as adding more VM instances to the farm is just an API call.
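As a minimal sketch of this setup (the server names and requests below are invented for illustration), a round-robin load balancer dispatching to stateless handlers might look like:

```python
import itertools

class LoadBalancer:
    """Round-robin dispatcher over a farm of identically configured servers."""
    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(range(len(self.servers)))

    def add_server(self, server):
        # Horizontal growth: adding capacity is just registering one more instance.
        self.servers.append(server)
        self._cycle = itertools.cycle(range(len(self.servers)))

    def dispatch(self, request):
        # Because handlers are stateless, any server can take any request.
        server = self.servers[next(self._cycle)]
        return server(request)

# A stateless handler: the response depends only on the request itself.
def make_server(name):
    def handle(request):
        return (name, request.upper())
    return handle

farm = LoadBalancer([make_server("s1"), make_server("s2")])
results = [farm.dispatch(f"req{i}") for i in range(4)]
```

Because the handlers keep no per-client state, the dispatcher is free to pick any server, which is exactly what makes adding a third instance a one-line change.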

Data Partitioning

Spread your data across multiple DBs so that the data access workload can be distributed across multiple servers.

By nature, data is stateful. So there must be a deterministic mechanism to dispatch each data request to the server that hosts the data.

The data partitioning mechanism also needs to take the data access pattern into consideration. Data that needs to be accessed together should stay on the same server. A more sophisticated approach can migrate data continuously as the access pattern shifts.

Most distributed key/value stores do this.
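A toy sketch of deterministic dispatch, assuming a simple hash-of-key scheme (real stores typically use consistent hashing so that adding a shard moves less data):

```python
import hashlib

class PartitionedStore:
    """Hash-partitioned key/value store: the key deterministically picks the shard."""
    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash, so every client routes the same key to the same server.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

store = PartitionedStore(num_shards=4)
for k in ("alice", "bob", "carol"):
    store.put(k, k.upper())
```

The important property is determinism: any node (or client) can compute where a key lives without asking a central coordinator.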

Map / Reduce (Batch Parallel Processing)

The algorithm itself needs to be parallelizable. This usually means the steps of execution should be relatively independent of each other.

Google's Map/Reduce is a good framework for this model. There is also an open source Java framework, Hadoop.
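The programming model can be illustrated with a toy in-memory map/shuffle/reduce pipeline (this is only a sketch of the model, not how Hadoop or Google's system is actually implemented):

```python
from collections import defaultdict

def map_phase(documents, mapper):
    # Each document is mapped independently, so this step parallelizes trivially.
    pairs = []
    for doc in documents:
        pairs.extend(mapper(doc))
    return pairs

def shuffle(pairs):
    # Group intermediate (key, value) pairs by key before reducing.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    # Each key is reduced independently, so this step also parallelizes.
    return {key: reducer(key, values) for key, values in grouped.items()}

# The classic word-count example.
docs = ["to be or not to be", "to scale"]
counts = reduce_phase(
    shuffle(map_phase(docs, lambda doc: [(w, 1) for w in doc.split()])),
    lambda key, values: sum(values),
)
```

The map and reduce steps are independent per document and per key respectively, which is exactly the "relatively independent steps" property the model requires.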

Content Delivery Network (Static Cache)

This is common for static media content. The idea is to create many copies of the content, distributed geographically across servers.

User requests will be routed to the server replica in closest proximity.

Cache Engine (Dynamic Cache)

This is a time vs. space tradeoff. Some executions may use the same set of input parameters over and over again. Therefore, instead of redoing the same execution for the same input parameters, we can remember the previous execution's result.

DB sessions and TCP connections are expensive to create, so reuse them across multiple requests.
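One common form of dynamic caching is memoization. A minimal sketch using Python's standard library (the `expensive` function here is a made-up stand-in for a costly computation):

```python
import functools

call_count = {"n": 0}  # track how often the real computation runs

@functools.lru_cache(maxsize=None)
def expensive(x):
    # Stand-in for a costly execution; repeated inputs hit the cache instead.
    call_count["n"] += 1
    return x * x

results = [expensive(v) for v in (3, 4, 3, 3, 4)]
```

Five calls, but only two actual executions: the space spent remembering results buys back the time of the repeated work.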

Calculate an approximate result

Instead of calculating an accurate answer, see if you can trade off some accuracy for speed.

In real life, some degree of inaccuracy is usually tolerable.
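As one illustration of this tradeoff (the data set and sample size below are arbitrary), a mean can be estimated from a random sample instead of scanning the full data set:

```python
import random

def exact_mean(values):
    # The accurate answer: touches every record.
    return sum(values) / len(values)

def approximate_mean(values, sample_size, seed=0):
    # Trade accuracy for speed: inspect only a random sample of the data.
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    return sum(sample) / sample_size

data = list(range(1_000_000))  # true mean is 499999.5
estimate = approximate_mean(data, sample_size=10_000)
```

Here 1% of the data yields an estimate within a fraction of a percent of the true mean, which is often good enough in practice.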

Filtering at the source

Try to do more processing upstream (where the data is generated) rather than downstream, because it reduces the amount of data being propagated.
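A small sketch of the idea (the sensor readings and threshold are invented for illustration): filter right next to the source, so only the interesting records travel downstream:

```python
def sensor_readings():
    # The data source: yields raw readings as they are generated.
    for value in [12, 105, 7, 230, 15, 99]:
        yield value

def filtered_at_source(threshold):
    # Filtering happens upstream; downstream only ever sees what passed.
    return (v for v in sensor_readings() if v > threshold)

downstream = list(filtered_at_source(threshold=100))
```

In a distributed setting the same principle means pushing predicates to the node that holds the data, rather than shipping everything over the network and filtering on arrival.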

Asynchronous Processing

You make a call which returns a result, but you don't need to use the result until a much later stage of your process. Therefore, you don't need to wait immediately after making the call; instead, you can proceed to do other things until you reach the point where you need the result.

In addition, a waiting thread is idle but still consumes system resources. For a high transaction volume, the number of idle threads is (arrival_rate * processing_time), which can be a very big number if the arrival rate is high. The system is running in a very inefficient mode.

The service call in this example is better handled using an asynchronous processing model. This is typically done in two ways: callback and polling.

In callback mode, the caller needs to provide a response handler when making the call. The call itself returns immediately, before the actual work is done on the server side. When the work is done later, the response comes back on a separate thread, which executes the previously registered response handler. Some kind of coordination may be required between the calling thread and the callback thread.

In polling mode, the call itself returns a "future" handle immediately. The caller can go off to do other things and later poll the "future" handle to see if the response is ready. In this model, no extra callback thread is created on the caller's side, so no extra thread coordination is needed.
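The polling style maps directly onto futures. A minimal sketch using Python's standard `concurrent.futures` (the `slow_service` function is a made-up stand-in for a remote call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_service(x):
    time.sleep(0.1)  # simulated remote processing delay
    return x * 2

with ThreadPoolExecutor(max_workers=1) as pool:
    # The submit call returns a "future" handle immediately.
    future = pool.submit(slow_service, 21)
    # The caller is free to do other useful work instead of blocking...
    other_work = sum(range(100))
    # ...and only collects the response at the point it is actually needed.
    result = future.result()
```

`future.done()` can also be checked periodically if the caller wants to poll without ever blocking.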

Implementation design considerations

Use efficient algorithms and data structures. Analyze the time (CPU) and space (memory) complexity of logic that is executed frequently (i.e. hot spots). For example, carefully decide whether a hash table or a binary tree should be used for lookup.

Analyze your concurrent access scenarios when multiple threads access shared data. Carefully analyze the synchronization scenarios and make sure the locking is fine-grained enough. Also watch for any possibility of deadlock and how you detect or prevent it. A wrong concurrent access model can have a huge impact on your system's scalability. Also consider using lock-free data structures (e.g. Java's concurrent package has a couple of them).
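As one illustration of fine-grained locking (a sketch, not a production-ready concurrent map), locks can be "striped" so that threads working on different keys rarely contend for the same lock:

```python
import threading

class StripedCounter:
    """Counter map with per-stripe locks instead of one global lock."""
    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [dict() for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def increment(self, key):
        i = self._stripe(key)
        # Threads touching keys in different stripes never block each other.
        with self._locks[i]:
            self._counts[i][key] = self._counts[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._counts[i].get(key, 0)

counter = StripedCounter()
threads = [
    threading.Thread(target=lambda: [counter.increment("hits") for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This is the same idea behind Java's `ConcurrentHashMap` segmenting: finer lock granularity means less contention as the thread count grows.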

Analyze the memory usage patterns in your logic. Determine where new objects are created and where they become eligible for garbage collection. Be wary of creating a lot of short-lived temporary objects, as they put a high load on the garbage collector.

However, never trade off code readability for performance (e.g. don't try to bundle too much logic into a single method). Let the VM optimize the execution for you.

Sunday, February 3, 2008

Decision Tree is another model-based learning approach where the output is a tree. Here, we are given a set of data with structure [x1, x2 ..., y] (in this case y is the output). The learning algorithm learns (from the training set) a decision tree and uses it to predict the output y for future unseen data.

x1, x2 ... can be either numeric or categorical; y is categorical.

In the decision tree, a leaf node contains relatively "pure" data represented by a histogram {class_j => count}. An intermediate node is called a "decision node" and contains a test of an input attribute (e.g. x2) against a constant value. Two branches are output according to whether the test evaluates to true or false. If x2 is a categorical value, the test will be an equality test (e.g. if "weather" equals "sunny"). If x2 is numeric, the test will be a greater-than / less-than test (e.g. if "age" >= 40).

Building the Decision Tree

We start at the root node, which contains all training samples. We need to figure out what our first test should be. Our strategy is to pick the test that divides the training samples into two groups with the highest sum of "purity".

A set of data records is "pure" if all their outcomes gravitate towards a particular value; otherwise it is impure. Purity can be measured by the Entropy or Gini impurity function.

Gini impurity is measured by calculating the probability of picking two records from the set such that their outcomes are different.

Entropy is measured by calculating the following: -sum_over_j(P(class_j) * log(P(class_j)))

Note that the term -P * log(P) is close to zero either when P is close to zero or when P is close to one. Entropy is large when P is about 0.5. The higher the entropy, the lower the purity.
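Both purity measures are straightforward to compute. A short sketch (the binary "yes"/"no" labels are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    # -sum_over_j P(class_j) * log2(P(class_j)): 0 for a pure set, maximal at 50/50.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Probability that two records drawn at random have different outcomes.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = ["yes"] * 8            # all one class
mixed = ["yes"] * 4 + ["no"] * 4  # maximally impure 50/50 split
```

For the pure set both measures are 0; for the 50/50 set entropy reaches its binary maximum of 1 bit and Gini reaches 0.5.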

Keep splitting this way, picking the test with the highest purity gain at each node, until the overall purity does not improve further.

Since the training data set is not exactly the same as the actual population, we may be biased towards some characteristics of the training data set that don't exist in the actual population. Such bias is called overfitting, and we want to avoid it.

In the above algorithm, we create an intermediate node only when the purity_gain is significant. We can also do "post pruning": build the tree first and then work backward, trying to merge leaf nodes when the decrease in purity is small.

We can also set aside 1/4 of the training data as validation data. We use the other 3/4 of the training data to build the tree first and then use the 1/4 validation data to prune the tree. This approach reduces the effective training data size, so it is only practical when the training data set is large.

Classification and missing data handling

When a new query point arrives, we traverse the decision tree from its root by answering the question at each decision node until we reach a leaf node; from there we pick the class with the highest instance count. However, what if the values of some attributes (say x3, x5) are missing? When we reach a decision node where x3 needs to be tested, we cannot proceed.

One simple solution is to fill in the most likely value of x3 based on the probability distribution of x3 within the training set. We can also throw a dice based on that probability distribution. Or we can use a three-way branch, adding "missing data" as one of the conditions.

Another more sophisticated solution is to walk down both paths if the attribute under test is missing. We know the probability distribution at the junction point, which is used to calculate the weight of each branch. The final result is aggregated according to these weights.
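A sketch of this weighted walk (the tree structure, attribute names, and counts below are invented for illustration):

```python
def classify(node, record):
    """Walk the tree; when the tested attribute is missing, walk both branches
    and aggregate the class histograms, weighted by the training counts that
    went down each branch at this junction."""
    if "leaf" in node:
        return dict(node["leaf"])  # class -> count histogram
    value = record.get(node["attr"])
    if value is None:
        # Missing attribute: descend both branches with proportional weights.
        total = node["n_true"] + node["n_false"]
        combined = {}
        for child, weight in ((node["true"], node["n_true"] / total),
                              (node["false"], node["n_false"] / total)):
            for cls, count in classify(child, record).items():
                combined[cls] = combined.get(cls, 0.0) + weight * count
        return combined
    branch = "true" if node["test"](value) else "false"
    return classify(node[branch], record)

# A hand-built one-level tree: 6 training records went down the true branch,
# 4 down the false branch, with the class histograms shown at the leaves.
tree = {
    "attr": "age", "test": lambda v: v >= 40, "n_true": 6, "n_false": 4,
    "true": {"leaf": {"buy": 5, "skip": 1}},
    "false": {"leaf": {"buy": 1, "skip": 3}},
}

known = classify(tree, {"age": 55})   # attribute present: single path
missing = classify(tree, {})          # "age" absent: weighted mix of both leaves
```

With "age" missing, the result is 0.6 of the true-branch histogram plus 0.4 of the false-branch histogram, and the predicted class is the one with the highest aggregated weight.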