Introduction

Mo's Algorithm has become pretty popular in the past few years and is now considered a standard technique in the world of Competitive Programming. This blog describes a method to generalize Mo's algorithm so that it can maintain information about paths between nodes in a tree.

Prerequisites

Mo's Algorithm — If you do not know this yet, read this amazing article before continuing with this blog.

Preorder Traversal or DFS Order of the Tree.

Problem 1 — Handling Subtree Queries

Consider the following problem. You will be given a rooted Tree T of N nodes where each node is associated with a value A[node]. You need to handle Q queries, each comprising one integer u. In each query you must report the number of distinct values in the subtree rooted at u. In other words, if you store all the values in the subtree rooted at u in a set, what would be the size of this set?

Constraints

1 ≤ N, Q ≤ 10^5

1 ≤ A[node] ≤ 10^9

Solution(s)

Seems pretty simple, doesn't it? One easy way to solve this is to flatten the tree into an array with a preorder traversal and then run Mo's Algorithm on it. Maintain a lookup table that stores the frequency of each value in the current window (compress the values first, or use a hash map, since they can be as large as 10^9). The answer changes only when a frequency moves between 0 and 1, so it can be updated in constant time per step. The complexity of this algorithm would be O((N + Q)√N).
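The flatten-then-Mo approach described above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `subtree_distinct` and the dictionary-based input format (`children`, `value`) are assumptions made for the example, and a Python dict stands in for the lookup table so that values up to 10^9 need no compression.

```python
from math import isqrt

def subtree_distinct(children, value, queries, root=1):
    """Distinct values in each queried subtree, answered offline with Mo's.

    children: node -> list of children, value: node -> A[node]
    (both are hypothetical input formats chosen for this sketch).
    """
    # Iterative preorder: the subtree of u is flat[tin[u] .. tout[u]].
    tin, tout, flat = {}, {}, []
    stack = [(root, False)]
    while stack:
        u, done = stack.pop()
        if done:
            tout[u] = len(flat) - 1
            continue
        tin[u] = len(flat)
        flat.append(value[u])
        stack.append((u, True))
        for c in reversed(children.get(u, [])):
            stack.append((c, False))

    # Standard Mo ordering: by block of the left end, then by right end.
    block = max(1, isqrt(len(flat)))
    order = sorted(range(len(queries)),
                   key=lambda i: (tin[queries[i]] // block, tout[queries[i]]))

    freq = {}        # value -> occurrences inside the current window
    distinct = 0
    lo, hi = 0, -1   # current window is flat[lo..hi], initially empty
    answers = [0] * len(queries)

    def add(pos):
        nonlocal distinct
        v = flat[pos]
        freq[v] = freq.get(v, 0) + 1
        if freq[v] == 1:
            distinct += 1

    def remove(pos):
        nonlocal distinct
        v = flat[pos]
        freq[v] -= 1
        if freq[v] == 0:
            distinct -= 1

    for i in order:
        l, r = tin[queries[i]], tout[queries[i]]
        while lo > l:
            lo -= 1
            add(lo)
        while hi < r:
            hi += 1
            add(hi)
        while lo < l:
            remove(lo)
            lo += 1
        while hi > r:
            remove(hi)
            hi -= 1
        answers[i] = distinct
    return answers
```

The subtree of u occupies positions tin[u] through tout[u] of the flattened array, so each query becomes an ordinary range query.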

Note that you can also solve this in O(N log^2 N) by maintaining a set in each node and always merging the smaller set into the larger one.

Problem 2 — Handling Path Queries

Now let's modify Problem 1 a little. Instead of computing the number of distinct values in a subtree, compute the number of distinct values on the unique path from u to v. I recommend that you pause here and try solving the problem for a while. The constraints of this problem are the same as in Problem 1.

The Issue

An important reason why Problem (1) worked beautifully was that the dfs-order traversal made it possible to represent any subtree as a contiguous range in an array. Thus the problem was reduced to "finding the number of distinct values in a subarray [L, R] of A[]". Note that it is not possible to do so for path queries, as nodes which are O(N) distance apart in the tree might be O(1) distance apart in the flattened tree (represented by array A[]). So a normal dfs-order would not work out.

Observation(s)

Let a node u have k children. Let us number them as v1,v2...vk. Let S(u) denote the subtree rooted at u.

Let us assume that dfs() will visit u's children in the order v1,v2...vk. Let x be any node in S(vi) and y be any node in S(vj) and let i < j. Notice that dfs(y) will be called only after dfs(x) has been completed and S(x) has been explored. Thus, before we call dfs(y), we would have entered and exited S(x). We will exploit this seemingly obvious property of dfs() to modify our existing algorithm and try to represent each query as a contiguous range in a flattened array.

Modified DFS-Order

Let us modify the dfs order as follows. For each node u, maintain the Start and End time of S(u). Let's call them ST(u) and EN(u). The only change you need to make is that you must increment the global timekeeping variable even when you finish traversing some subtree (EN(u) = ++cur). In short, we will maintain 2 values for each node u. One will denote the time when you entered S(u) and the other would denote the time when you exited S(u). Consider the tree in the picture. Given below are the ST() and EN() values of the nodes.

ST(1) = 1, EN(1) = 18

ST(2) = 2, EN(2) = 11

ST(3) = 3, EN(3) = 6

ST(4) = 4, EN(4) = 5

ST(5) = 7, EN(5) = 10

ST(6) = 8, EN(6) = 9

ST(7) = 12, EN(7) = 17

ST(8) = 13, EN(8) = 14

ST(9) = 15, EN(9) = 16

A[] = {1, 2, 3, 4, 4, 3, 5, 6, 6, 5, 2, 7, 8, 8, 9, 9, 7, 1}
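The modified DFS order can be sketched like this. The tree shape below is inferred from the ST()/EN() values listed above (1 has children 2 and 7, 2 has children 3 and 5, and so on); the function name `euler_tour` and the `children` dictionary are assumptions made for the example, and the traversal is iterative only to avoid Python's recursion limit.

```python
def euler_tour(children, root=1):
    """Return (ST, EN, A): entry time, exit time, and the flattened array.

    Times are 1-based, and every node is written into A twice:
    once at ST(u) and once at EN(u).
    """
    ST, EN, A = {}, {}, []
    stack = [(root, False)]
    while stack:
        u, leaving = stack.pop()
        A.append(u)                 # the clock ticks on entry and on exit
        if leaving:
            EN[u] = len(A)
            continue
        ST[u] = len(A)
        stack.append((u, True))     # revisit u after its whole subtree
        for c in reversed(children.get(u, [])):
            stack.append((c, False))
    return ST, EN, A


# Tree inferred from the ST/EN table in the text.
children = {1: [2, 7], 2: [3, 5], 3: [4], 5: [6], 7: [8, 9]}
ST, EN, A = euler_tour(children)
print(A)
```

Running it on this tree reproduces the array A[] and the time stamps given above.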

The Algorithm

Now that we're equipped with the necessary weapons, let's understand how to process the queries.

Let a query be (u, v). We will try to map each query to a range in the flattened array. Assume without loss of generality that ST(u) ≤ ST(v) (swap u and v otherwise), where ST(u) denotes the visit time of node u in T. Let P = LCA(u, v) denote the lowest common ancestor of nodes u and v. There are 2 possible cases:

Case1: P = u

In this case, our query range would be [ST(u), ST(v)]. Why will this work?

Consider any node x that does not lie in the (u, v) path. Notice that x occurs twice or zero times in our specified query range. Therefore, the nodes which occur exactly once in this range are precisely those that are on the (u, v) path! (Try to convince yourself of why this is true : It's all because of dfs() properties.)

This forms the crux of our algorithm. While implementing Mo's, our add/remove function needs to check the number of times a particular node appears in the range. If it occurs twice (or zero times), then we don't take its value into account! This is easy to implement while moving the left and right pointers: every time a pointer passes over a position, flip the state of the node stored there.
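One way to implement this check is a single "toggle" operation that serves as both add and remove. The sketch below is an assumption about how such a function could look, not the code the post links to; the class name `PathWindow` and its fields are hypothetical.

```python
class PathWindow:
    """Distinct values among the nodes that appear exactly once in the window."""

    def __init__(self, value):
        self.value = value     # node -> A[node]
        self.active = set()    # nodes currently appearing exactly once
        self.freq = {}         # value -> number of active nodes carrying it
        self.distinct = 0

    def toggle(self, node):
        """Call whenever either pointer moves across a position holding node."""
        v = self.value[node]
        if node in self.active:
            # The node now appears twice or zero times: drop its value.
            self.active.remove(node)
            self.freq[v] -= 1
            if self.freq[v] == 0:
                self.distinct -= 1
        else:
            # The node now appears exactly once: count its value.
            self.active.add(node)
            self.freq[v] = self.freq.get(v, 0) + 1
            if self.freq[v] == 1:
                self.distinct += 1
```

Because a node occupies at most two positions of the flattened array, extending or shrinking the window over one of them always flips between "appears once" and "appears twice or not at all", which is why one function is enough.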

Case2: P ≠ u

In this case, our query range would be [EN(u), ST(v)] + [ST(P), ST(P)].

The same logic as Case 1 applies here as well. The only difference is that we need to consider the value of P, i.e., the LCA, separately, as it would not be counted in the query range.
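The two cases can be checked directly on the example tree. The following sketch maps a query to its range and lists the nodes that occur exactly once in it; the helper names (`build`, `lca`, `path_nodes`) are hypothetical, and the walk-up LCA is a deliberately naive stand-in for whatever LCA method a real solution would use.

```python
def build(children, root=1):
    """Modified DFS order plus parent/depth tables. A is 1-based (A[0] is padding)."""
    ST, EN, A = {}, {}, [None]
    parent, depth = {root: None}, {root: 0}
    stack = [(root, False)]
    while stack:
        u, leaving = stack.pop()
        A.append(u)
        if leaving:
            EN[u] = len(A) - 1
            continue
        ST[u] = len(A) - 1
        stack.append((u, True))
        for c in reversed(children.get(u, [])):
            parent[c], depth[c] = u, depth[u] + 1
            stack.append((c, False))
    return ST, EN, A, parent, depth


def lca(u, v, parent, depth):
    """Naive walk-up LCA, O(depth) per call; for illustration only."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u


def path_nodes(u, v, ST, EN, A, parent, depth):
    """Nodes on the u-v path, recovered from the flattened range."""
    if ST[u] > ST[v]:
        u, v = v, u
    p = lca(u, v, parent, depth)
    if p == u:                       # Case 1: range [ST(u), ST(v)]
        lo, hi, extra = ST[u], ST[v], set()
    else:                            # Case 2: range [EN(u), ST(v)] plus P
        lo, hi, extra = EN[u], ST[v], {p}
    once = set()
    for node in A[lo:hi + 1]:
        once ^= {node}               # a second occurrence cancels the first
    return once | extra
```

For the query (3, 8) on the example tree, the range [EN(3), ST(8)] does contain all of S(5), but nodes 5 and 6 each appear twice there and cancel out, leaving 3, 2, 7, 8 plus the LCA 1.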

If you aren't sure about some elements of this algorithm, take a look at this neat code.

Conclusion

We have effectively managed to reduce problem (2) to counting distinct values in a subarray by doing some careful bookkeeping. Now we can solve the problem in O((N + Q)√N). This modified DFS order works brilliantly for handling many types of path queries and works well with Mo's algorithm. We can use a similar approach to solve many kinds of path query problems.

For example, consider the question of finding the number of inversions on a (u, v) path in a tree T, where each node has a value associated with it. This can now be solved in O((N + Q)√N log N) by using the above technique and maintaining a BIT or Segment Tree.

This is my first blog and I apologize for any mistakes that I may have made. I would like to thank sidhant for helping me understand this technique.

But in each query there is a new k. I wrote code for this problem, and after the whole implementation I noticed that I had missed the point that there is always a new k for each query. Now I don't see how I can solve this problem!

You can keep two maps. The first is updated on every add/remove and keeps the count of each element on the go. The second is indexed by count: MAP[c] is the number of elements that currently occur exactly c times. For example, if an occurrence of the element 2 is removed and there were x occurrences of 2 before, then MAP[x]-- and MAP[x - 1]++, because 2 no longer contributes to the elements that occur x times. Then, for each query, you store the second map's entry for that query's k as the answer.
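The two-map idea can be sketched as below. The original problem statement is not quoted in this thread, so the sketch assumes the query asks how many values occur exactly k times in the current window; the class and method names are hypothetical.

```python
from collections import defaultdict

class CountOfCounts:
    """cnt[x]: occurrences of x in the window.
    freq[c]: number of values that occur exactly c times."""

    def __init__(self):
        self.cnt = defaultdict(int)
        self.freq = defaultdict(int)

    def _shift(self, x, delta):
        c = self.cnt[x]
        if c > 0:
            self.freq[c] -= 1      # x leaves the "occurs c times" bucket
        c += delta
        self.cnt[x] = c
        if c > 0:
            self.freq[c] += 1      # and joins the neighbouring bucket

    def add(self, x):
        self._shift(x, +1)

    def remove(self, x):
        self._shift(x, -1)

    def exactly(self, k):
        return self.freq[k]
```

Both maps are updated in O(1) per pointer move, so a different k per query costs nothing extra: Mo's algorithm moves the window and each query just reads `exactly(k)`.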

Recently I solved one question using Mo's algorithm, and I remembered this comment here. I overwrote the solution at the same link. Here is the solution for COT2. I think it's self-explanatory how it works.

Maintain a set of values for each node in the tree. Let set(u) be the set of all values in the subtree rooted at u. We want size(set(u)) for all u.

Let a node u have k children, v1, v2...vk. Every time you want to merge set(u) with set(vi), pop out the elements from the smaller set and insert them into the larger one. You can think of it like implementing union find, based on size.

Consider any arbitrary node value. Every time you remove it from a certain set and insert it into some other, the size of the merged set is at least twice the size of the set it came from.

Say you merge sets x and y. Assume size(x) ≤ size(y). Therefore, by the algorithm, you will push all the elements of x into y. Let xy be the merged set. size(xy) = size(x) + size(y). But size(y) ≥ size(x).

So size(xy) ≥ 2 * size(x).

Thus, each value will not move more than log n times. Since each move is done in O(log n), the total complexity for n values amounts to O(n log^2 n).
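The merging procedure described above can be sketched as follows. The function name and the dictionary input format are assumptions for illustration. Note that Python's built-in set is a hash set, so each insert is expected O(1) rather than O(log n); with a balanced-tree set, as in the analysis above, the bound is O(n log^2 n).

```python
def distinct_in_subtrees(children, value, root=1):
    """Number of distinct values in every subtree, by small-to-large merging."""
    # Collect nodes so that each node appears before its descendants.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    sets, answer = {}, {}
    for u in reversed(order):            # children are finished before parents
        big = {value[u]}
        for c in children.get(u, []):
            small = sets.pop(c)          # the child's set is no longer needed
            if len(small) > len(big):
                big, small = small, big  # always move the smaller set
            big.update(small)
        sets[u] = big
        answer[u] = len(big)
    return answer
```

Swapping the two references before calling `update` is what guarantees that an element only moves into a set at least as large as the one it leaves.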

EDIT: I think I understood. For a particular value to be included the maximum number of times in a move operation from set(x) to set(xy), where size(x) <= size(y), this value must be moved at each of its ancestors up to the root. That is only possible if the height of the tree is at most log n.

But the size(xy) >= 2 * size(x) seems incorrect. I think you meant that the size of subtree of parent of x >= 2 * size(x).

For the "Frank Sinatra" problem, how could you find the smallest value not present in the path?

I realize that any value greater than the size of the tree wouldn't change the answer. So, if I have at most 1E5 different values, I can build a BIT with pos[i] = 1 if value i is present in the path. Then I binary search for the smallest value k for which sum[0...k] is less than k. That would be my answer. However, the complexity is O(N*sqrt(N)*log(N)*log(N)) and I think that is excessive.

You can solve the problem in O((N + Q)√N) by doing square root decomposition on the values. Each update would be done in constant time, and you will take an additional O(√N) time per query to find the first block that has a missing value and then scan inside it.
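A minimal sketch of that square root decomposition on values might look like this. The class name `MexTracker` and the `limit` parameter are assumptions for the example; values at or above `limit` are ignored, following the observation above that values larger than the tree size cannot change the answer.

```python
from math import isqrt

class MexTracker:
    """Smallest value in [0, limit) missing from a multiset.

    add/remove are O(1); mex() scans O(sqrt(limit)) blocks, then one block.
    """

    def __init__(self, limit):
        self.limit = limit
        self.block = max(1, isqrt(limit))
        self.cnt = [0] * limit
        # present[b] = number of distinct values of block b in the multiset
        self.present = [0] * (limit // self.block + 1)

    def add(self, x):
        if x < self.limit:
            self.cnt[x] += 1
            if self.cnt[x] == 1:
                self.present[x // self.block] += 1

    def remove(self, x):
        if x < self.limit:
            self.cnt[x] -= 1
            if self.cnt[x] == 0:
                self.present[x // self.block] -= 1

    def mex(self):
        for b, have in enumerate(self.present):
            start = b * self.block
            size = min(self.block, self.limit - start)
            if have < size:                  # first block with a hole
                for x in range(start, start + size):
                    if self.cnt[x] == 0:
                        return x
        return self.limit                    # every value below limit is present
```

Plugged into Mo's algorithm, `add` and `remove` are called on every pointer move and `mex` once per query, which removes both logarithmic factors of the BIT approach.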

BTW, there is a standard solution for the first problem (see this link in Russian). For each color, order all the vertices of this color according to the dfs traversal, and let the vertices be labelled v1, v2, ..., vk. Add +1 to each of these vertices, and add -1 to the LCAs of the neighboring vertices lca(v1, v2), lca(v2, v3), ..., lca(vk - 1, vk). If you sum up the values inside a subtree, you get the number of distinct elements in it.

Since the ordering can be done in O(n), and in theory you can answer lca queries for a static tree in O(1) with O(n) pre-processing, you have a linear solution (assuming 0 ≤ A[x] < N).
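The +1/-1 construction can be sketched as follows. This is an illustration under assumptions: the function name and input format are hypothetical, and the LCA is computed by a naive walk-up rather than the O(1) method mentioned above, so this version is not linear.

```python
def distinct_by_lca(children, value, root=1):
    """Distinct values per subtree via +1 on vertices and -1 on LCAs."""
    parent, depth, order = {root: None}, {root: 0}, []
    stack = [root]
    while stack:                                  # iterative preorder
        u = stack.pop()
        order.append(u)
        for c in reversed(children.get(u, [])):
            parent[c], depth[c] = u, depth[u] + 1
            stack.append(c)

    def lca(u, v):                                # naive walk-up, illustration only
        while depth[u] > depth[v]:
            u = parent[u]
        while depth[v] > depth[u]:
            v = parent[v]
        while u != v:
            u, v = parent[u], parent[v]
        return u

    weight = {u: 1 for u in order}                # +1 on every vertex
    last = {}                                     # value -> previous vertex in dfs order
    for u in order:
        v = value[u]
        if v in last:
            weight[lca(last[v], u)] -= 1          # -1 on the LCA of dfs-neighbours
        last[v] = u

    total = dict(weight)
    for u in reversed(order):                     # push subtree sums up to parents
        if parent[u] is not None:
            total[parent[u]] += total[u]
    return total
```

Scanning the vertices in preorder and remembering the last vertex seen for each value yields exactly the consecutive pairs (v1, v2), (v2, v3), ... for every color without an explicit sort.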

I understand the merging sets optimization! Can you explain how we can utilize this in Problem 1 (unique elements in subtree) to achieve ? As far as I see, lets say then we want, to change which can be done optimally when , how do you propose to do it for ?

If the size of each block is k, then the time complexity of moving the left pointer is O(Q*k) and the time complexity of moving the right pointer is O(N/k*N). The optimal value of k is N/sqrt(Q) which results in total time complexity O(N*sqrt(Q)).
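The choice of k follows from balancing the two terms. A short derivation, using only the quantities named above (N positions, Q queries, block size k):

```latex
T(k) = O\!\left(Qk + \frac{N^2}{k}\right),
\qquad
Qk = \frac{N^2}{k}
\;\Longrightarrow\;
k = \frac{N}{\sqrt{Q}},
\qquad
T\!\left(\frac{N}{\sqrt{Q}}\right)
  = O\!\left(Q \cdot \frac{N}{\sqrt{Q}} + N^2 \cdot \frac{\sqrt{Q}}{N}\right)
  = O\!\left(N\sqrt{Q}\right).
```

When Q is of the same order as N this reduces to the familiar block size √N and running time O(N√N).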

If I flatten the above tree, my array would be: 8 3 1 6 4 7 10 14 13. Suppose I need to use Mo's algorithm for subtrees (assume I need to find the sum of values of each subtree indicated by the query). For a given query 'Vj', how would I find its end range index in the array? E.g., if the given query is node '6', the starting range would be idx 3 and the ending would be idx 5.

For a problem like this: http://lightoj.com/volume_showproblem.php?problem=1348 where I need to return sum of all the nodes in a given path & update the value of a node, how should I approach using this technique of linearizing the tree? I mean since I need to ignore nodes which have occurrence of 2 so the range becomes discontinuous for a segment tree structure.

Consider this case: if we select 3 and 8 on the tree given to explain the DFS-Order, the range [EN(u), ST(v)] contains the whole subtree S(5), which is not on our query path. Are we supposed to judge every node in the range, or am I missing something? Thx!

Thanks a lot, but I don't really understand the meaning of "If it occurs twice (or zero times), then we don't take it's value into account! This can be easily implemented while moving the left and right pointers." I wrote the code following what you said in this essay, but it seems wrong and I don't know where I count the answer incorrectly: my code

"One easy way to solve this is to flatten the tree into an array by doing a Preorder traversal" Isn't Preorder traversal done on a binary tree? The given tree may not be a binary tree. Where am I going wrong?

If the occurrences of an element become 1 from 2, then consider that element ADDED.

If the occurrences of an element become 0 from 1, then consider that element REMOVED.

If the occurrences of an element become 2 from 1, then consider that element REMOVED.

If the occurrences of an element become 1 from 0, then consider that element ADDED.

To sort queries we need to know how to compare them. This can be done by overloading operator < or by writing a bool function (comparator) which takes two queries and returns true if the first must come before the second in the sorted array.

It is sorting the l values according to the BLOCK value of l i.e. BL[l]. If both the values are same then it compares the one which has smaller r value just like in Standard Mo's algorithm. Am I interpreting it right?
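In Python the same ordering is usually expressed as a sort key instead of a comparator. A minimal sketch, assuming queries are (l, r) tuples over an array of length n; the function name `mo_order` is hypothetical:

```python
from math import isqrt

def mo_order(queries, n):
    """Sort (l, r) queries by the block of l, breaking ties by smaller r."""
    block = max(1, isqrt(n))
    return sorted(queries, key=lambda q: (q[0] // block, q[1]))
```

Here `q[0] // block` plays the role of BL[l], so two queries whose left ends fall into the same block are ordered by their right ends, as in standard Mo's algorithm.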

i think DFS-order will be 1-2-3-3-4-4-2-5-6-6-7-7-5-8-9-9-10-10-11-11-8-1, EN[4]=6, ST[11]=19, so the nodes which occur exactly once in this range are 4,2,8,11; then you have to consider also ST[1]=1 .In the end you have 4,2,8,11,1