Monday, May 30, 2016

Percolation - Princeton Algorithm 4th

http://coursera.cs.princeton.edu/algs4/assignments/percolation.htmlThe model. We model a percolation system using an N-by-N grid of sites. Each site is either open or blocked. A full site is an open site that can be connected to an open site in the top row via a chain of neighboring (left, right, up, down) open sites. We say the system percolates if there is a full site in the bottom row. In other words, a system percolates if we fill all open sites connected to the top row and that process fills some open site on the bottom row. (For the insulating/metallic materials example, the open sites correspond to metallic materials, so that a system that percolates has a metallic path from top to bottom, with full sites conducting. For the porous substance example, the open sites correspond to empty space through which water might flow, so that a system that percolates lets water fill open sites, flowing from top to bottom.)

The problem. In a famous scientific problem, researchers are interested in the following question: if sites are independently set to be open with probability p (and therefore blocked with probability 1 − p), what is the probability that the system percolates? When p equals 0, the system does not percolate; when p equals 1, the system percolates. The plots below show the site vacancy probability p versus the percolation probability for 20-by-20 random grid (left) and 100-by-100 random grid (right).

When N is sufficiently large, there is a threshold value p* such that when p < p* a random N-by-N grid almost never percolates, and when p > p*, a random N-by-N grid almost always percolates. No mathematical solution for determining the percolation threshold p* has yet been derived. Your task is to write a computer program to estimate p*.Percolation data type. To model a percolation system, create a data type Percolation with the following API:

Corner cases. By convention, the row and column indices i and j are integers between 1 and N, where (1, 1) is the upper-left site: Throw a java.lang.IndexOutOfBoundsException if any argument to open(), isOpen(), or isFull() is outside its prescribed range. The constructor should throw a java.lang.IllegalArgumentException if N ≤ 0.Performance requirements. The constructor should take time proportional to N2; all methods should take constant time plus a constant number of calls to the union-find methods union(), find(), connected(), andcount().Monte Carlo simulation. To estimate the percolation threshold, consider the following computational experiment:

Initialize all sites to be blocked.

Repeat the following until the system percolates:

Choose a site (row i, column j) uniformly at random among all blocked sites.

Open the site (row i, column j).

The fraction of sites that are opened when the system percolates provides an estimate of the percolation threshold.

For example, if sites are opened in a 20-by-20 lattice according to the snapshots below, then our estimate of the percolation threshold is 204/400 = 0.51 because the system percolates when the 204th site is opened.

By repeating this computation experiment T times and averaging the results, we obtain a more accurate estimate of the percolation threshold. Let xt be the fraction of open sites in computational experiment t. The sample mean μ provides an estimate of the percolation threshold; the sample standard deviation σ measures the sharpness of the threshold.

Assuming T is sufficiently large (say, at least 30), the following provides a 95% confidence interval for the percolation threshold:

To perform a series of computational experiments, create a data type PercolationStats with the following API.

I thought that the solution I coded last weekend was fine because it was correctly computing the percolation thresholds for many different grid sizes. However, when I tested it today using the provided PercolationVisualizer class, I realized that it was suffering from the "backwash" problem.It took me a couple of hours but, in the end, I managed to fix it by using two WeightedQuickUnionUF objects instead of one.

1) The easiest way to tackle this problem is to use two different Weighted Quick Union Union Find objects. The only difference between them is that one has only top virtual site (let’s call it uf_A), the other is the normal object with two virtual sites top one and the bottom one (let’s call it uf_B) suggested in the course, uf_A has no backwash problem because we “block” the bottom virtual site by removing it. uf_B is for the purpose to efficiently determine if the system percolates or not as described in the course. So every time we open a site (i, j) by calling Open(i, j), within this method, we need to do union() twice: uf_A.union() and uf_B.union(). Obviously the bad point of the method is that: semantically we are saving twice the same information which doesn’t seem like a good pattern indeed. The good aspect might be it is the most straightforward and natural approach for people to think of.

A final remark: one key observation here is that to test a site is full or not, we only need to know the status of the root of that connected component containing the site, whether a site is full is the same question interpreted by asking whether the connected component is full or not: this make things simple and saves time, someone in the forum proposed to linear scan the whole connected component to update the status of each site which is not necessary and inefficient. each time we only update the status of the root site rather than each site in that component.

2) And there turned out to be more elegant solutions without using 2 WQUUF if we can modify the API or we just wrote our own UF algorithm from scratch. The solution is from “Bobs Notes 1: Union-Find and Percolation (Version 2)“: Store two bits with each component root. One of these bits is true if the component contains a cell in the top row. The other bit is true if the component contains a cell in the bottom row. Updating the bits after a union takes constant time. Percolation occurs when the component containing a newly opened cell has both bits true, after the unions that result from the cell becoming open. Excellent! However, this one involves the modification of the original API. Based on this and some other discussion from other threads in the discussion forum, I have come up with the following approach which need not to modify the given API but adopt similar idea by associating each connected component root with the information of connection to top and/or bottom sites.

(3) Here we go for the approach based on Bob’s notes while involving no modification of the original given API:

In the original Bob’s notes, it says we have to modify the API to achieve this, but actually it does not have to be like that. We create ONE WQUUF object of size N * N, and allocate a separate array of size N * N to keep the status of each site: blocked, open, connect to top, connect to bottom. I use bit operation for the status so for each site, it could have combined status like Open and connect to top.

The most important operation is open(int i, int j): we need to union the newly opened site (let’s call it site ‘S’) S with the four adjacent neighbor sites if possible. For each possible neighbor site(Let’s call it ‘neighbor’), we first call find(neighbor) to get the root of that connected component, and retrieves the status of that root (Let’s call it ‘status’), next, we do Union(S, neighbor); we do the similar operation for at most 4 times, and we do a 5th find(S) to get the root of the newly (copyright @sigmainfy) generated connected component results from opening the site S, finally we update the status of the new root by combining the old status information into the new root in constant time. I leave the details of how to combine the information to update the the status of the new root to the readers, which would not be hard to think of.

For the isFull(int i, int j), we need to find the the root site in the connected component which contains site (i, j) and check the status of the root.

For the isOpen(int i, int j) we directly return the status.

For percolates(), there is a way to make it constant time even though we do not have virtual top or bottom sites: think about why?

So the most important operation open(int i, int j) will involve 4 union() and 5 find() API calls.