Suppose we have a ground set of $n$ elements and $m$ sets $S_1, \dots, S_m \subseteq [n]$ defined over it. Consider the following procedure: at each step, pick two of the sets, take their union, and add the union back to the pool of sets. At each step two sets are removed and one is added, so after exactly $m-1$ operations a single set remains, namely the union of all sets $\cup_i S_i$. We want to carry out this procedure so as to minimize the total cost, where taking the union of two sets $A, B$ costs $|A|+|B|$. This is somewhat similar to matrix chain multiplication, except that the order is not predefined.

This procedure can be viewed as a tree: we start with the leaves, which are the original sets, and at each step we merge two nodes and create a parent for them, repeating until we end up with the root. The cost of the tree is defined as the sum of the sizes of all nodes. The problem is to build the tree so that this total cost is minimized. Here is an illustration of a tree for four sets $S_1, S_2, S_3, S_4$:

The internal nodes are $I_3=S_2\cup S_3$, $I_2 = I_3\cup S_4$, and $I_1=S_1\cup I_2$. The total cost is $|S_1|+|S_2|+|S_3|+|S_4|+|I_1|+|I_2|+|I_3|$. The problem asks for a tree of minimum cost. Clearly the $|S_i|$ here are constants, and the construction of the tree only affects the $|I_i|$; the real question is therefore how to minimize the total cost of the internal nodes.
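As a concrete reference point, here is a minimal Python sketch (function and variable names are my own) that evaluates the total cost of a given merge order under the cost model above, counting the sizes of the leaves and of every created union:

```python
def merge_cost(sets, order):
    """sets: list of Python sets. order: list of index pairs into the
    current pool; each merge removes the two sets and appends their union."""
    pool = [set(s) for s in sets]
    cost = sum(len(s) for s in pool)          # leaf contributions
    for i, j in order:
        u = pool[i] | pool[j]
        cost += len(u)                        # internal-node contribution
        pool = [s for k, s in enumerate(pool) if k not in (i, j)]
        pool.append(u)
    return cost

# The four-set tree above: merge S2,S3 first, then with S4, then with S1.
S = [{1, 2}, {2, 3}, {3, 4}, {4, 5}]
print(merge_cost(S, [(1, 2), (1, 2), (0, 1)]))   # → 20
```

Different merge orders on the same input give different costs, which is exactly the degree of freedom the problem optimizes over.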

The motivation for studying this problem is that it can be viewed as a compression scheme for an array of sets. Using this tree we can answer queries such as "does item $i$ belong to $S_j$?". It acts as a compression because we can store the elements of each node in a rank/select data structure and answer a query with rank/select operations by walking from the root down to the leaf corresponding to $S_j$. There are rank/select data structures whose memory usage per node is roughly proportional to the node's size, hence the cost that we are minimizing. Here are possible approaches to tackle this problem:
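To make the query mechanism concrete, here is a toy Python sketch (names and structure are entirely my own; plain Python sets stand in for the succinct rank/select bitvectors a real implementation would use) of answering a membership query by walking root-to-leaf on the four-set example tree:

```python
class Node:
    """A tree node; internal nodes derive their set from their children."""
    def __init__(self, left=None, right=None, elems=None):
        self.left, self.right = left, right
        self.set = elems if elems is not None else left.set | right.set

def member(root, x, path):
    """path: sequence of 'L'/'R' moves from the root to the leaf of S_j.
    Fails as soon as x is absent from a node on the path."""
    node = root
    for step in path:
        if x not in node.set:        # stand-in for a rank/select lookup
            return False
        node = node.left if step == 'L' else node.right
    return x in node.set

# The example tree: I3 = S2 ∪ S3, I2 = I3 ∪ S4, I1 = S1 ∪ I2 (root).
S1, S2, S3, S4 = {1, 2}, {2, 3}, {3, 4}, {4, 5}
I3 = Node(Node(elems=S2), Node(elems=S3))
I2 = Node(I3, Node(elems=S4))
I1 = Node(Node(elems=S1), I2)

print(member(I1, 2, "RLL"))   # is 2 in S2? → True
```

The walk can stop early: once an element is missing from an ancestor, it cannot appear in the leaf, which is what lets each node store only its own elements.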

1. **Greedy.** The obvious greedy approach is to choose the pair $S_u, S_v$ minimizing $|S_u \cup S_v|$. It can be shown that this is not globally optimal. Can it be proven to be within a certain factor of the global optimum?
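A minimal sketch of this greedy rule (my own illustration; an $O(m^2)$ scan per step, with no attempt at clever indexing):

```python
from itertools import combinations

def greedy_merge_cost(sets):
    """Repeatedly merge the pair with the smallest union, accumulating
    the sizes of all nodes (leaves plus every created union)."""
    pool = [set(s) for s in sets]
    cost = sum(len(s) for s in pool)                     # leaf sizes
    while len(pool) > 1:
        i, j = min(combinations(range(len(pool)), 2),
                   key=lambda p: len(pool[p[0]] | pool[p[1]]))
        u = pool[i] | pool[j]
        cost += len(u)
        pool = [s for k, s in enumerate(pool) if k not in (i, j)] + [u]
    return cost

print(greedy_merge_cost([{1, 2}, {2, 3}, {3, 4}, {4, 5}]))   # → 19
```

On this toy chain-of-overlaps instance the greedy tree costs 19, versus 20 for the caterpillar tree in the illustration above; this says nothing about worst-case behavior, which is exactly the open question.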

2. **Reformulating the cost for a balanced tree.** If we denote the left and right children of an internal node $v$ by $l(v), r(v)$ respectively, one can write:
$$|S_v| = |S_{l(v)} \cup S_{r(v)}|= |S_{r(v)}| + |S_{l(v)}| - |S_{l(v)} \cap S_{r(v)}|$$
We can apply this rule recursively, starting from the root down to the leaves. If $h_v$ denotes the height of a node, i.e. its distance from the root, one can show by induction:
$$cost = \sum_{v\in Leaves} h_v|S_v| -\sum_{v\in Inner} h_v|S_{l(v)} \cap S_{r(v)}|$$
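As a numeric sanity check (mine, not part of the post), the identity can be verified on the four-set example tree. Note that it comes out exact under the convention that $h_v$ is one plus the distance from the root, i.e. $h_{I_1}=1$ and the deepest leaves have $h = 4$; I use that convention below:

```python
import random

random.seed(0)
S1, S2, S3, S4 = (set(random.sample(range(40), 12)) for _ in range(4))

# The example tree: I3 = S2 ∪ S3, I2 = I3 ∪ S4, I1 = S1 ∪ I2 (root).
I3 = S2 | S3
I2 = I3 | S4
I1 = S1 | I2

direct = sum(map(len, (S1, S2, S3, S4, I1, I2, I3)))

# leaf depths: S1 -> 2, S2 -> 4, S3 -> 4, S4 -> 3
leaf_term = 2*len(S1) + 4*len(S2) + 4*len(S3) + 3*len(S4)
# inner depths: I1 -> 1, I2 -> 2, I3 -> 3
inner_term = 1*len(S1 & I2) + 2*len(I3 & S4) + 3*len(S2 & S3)

assert direct == leaf_term - inner_term
print(direct)
```

Each node's size expands into its leaf sizes minus the intersections of the merges below it, and a node at depth $h$ appears in the expansions of itself and its $h-1$ ancestors, which is where the depth weights come from.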

The problem with this cost function is that for inner nodes in the middle of the tree we do not have direct access to $|S_{l(v)} \cap S_{r(v)}|$. One way around this is to lower-bound the terms $|S_{l(v)} \cap S_{r(v)}|$, which upper-bounds the cost. One candidate is:
$$|S_{r(v)} \cap S_{l(v)}| \ge \frac{1}{|L(r(v))|\cdot |L(l(v))|} \sum_{u\in L(l(v)),\, w\in L(r(v))} |S_u\cap S_w| $$
in which $L(v)$ denotes the set of leaves of the subtree rooted at $v$. This simply says that the average intersection between leaves of the two subtrees is at most the intersection of the two sets at the top, since each $S_u \cap S_w$ is contained in $S_{l(v)} \cap S_{r(v)}$.
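A quick numeric illustration (my own) of this bound: every pairwise leaf intersection is contained in the top-level intersection, so the average over leaf pairs cannot exceed it:

```python
import random
from statistics import mean

random.seed(1)
# Leaves of the two subtrees hanging below some internal node v.
left_leaves = [set(random.sample(range(50), 15)) for _ in range(4)]
right_leaves = [set(random.sample(range(50), 15)) for _ in range(4)]

top_left = set().union(*left_leaves)      # S_{l(v)}
top_right = set().union(*right_leaves)    # S_{r(v)}

# Average pairwise intersection vs. the intersection at the top.
avg = mean(len(u & w) for u in left_leaves for w in right_leaves)
assert avg <= len(top_left & top_right)
print(avg, len(top_left & top_right))
```

The bound is typically loose; how loose it is determines how good the balanced-tree relaxation below can be.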

If we enforce that the tree is perfectly balanced (and for simplicity assume $m = 2^H$), the search space boils down to the initial permutation of the sets $\sigma:[m]\rightarrow [m]$. Moreover, in a balanced tree the number of leaves below a node is a function of its height: $|L(v)| = 2^{H-h_v}$. Therefore we have:
$$Cost \le \sum_{v\in Leaves} H |S_v| - \sum_{i\neq j\in [m]} \frac{H - \log_2 |\sigma(i)-\sigma(j)|}{(\sigma(i)-\sigma(j))^2}|S_i\cap S_j|$$
In the first term $h_v$ is replaced by $H$ because all leaves are at the predetermined height $H = \log_2(m)$, making that term constant with respect to $\sigma$. Minimizing the upper bound therefore amounts to maximizing the subtracted term:
$$\max_\sigma \sum_{i\neq j\in [m]} \frac{H - \log_2 |\sigma(i)-\sigma(j)|}{(\sigma(i)-\sigma(j))^2}|S_i\cap S_j| $$
This can again be reformulated as the following problem:
$$\max_\sigma \langle M^\sigma, C\rangle $$
in which $M,C\in \mathbb{R}^{m\times m}$ are defined by $C_{i,j} = |S_i\cap S_j|$ and $M_{i,j} = \frac{H - \log_2 |i-j|}{(i-j)^2}$ for $i\neq j$ (and $M_{i,i}=0$), $M^\sigma$ is the matrix $M$ with its rows and columns permuted according to $\sigma$, and $\langle \cdot, \cdot \rangle$ denotes the Frobenius inner product. If we had an oracle that solved this problem efficiently, would it give a good approximation?
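For tiny $m$ the oracle can be simulated by brute force. The following sketch (my own) enumerates all permutations, using the weight $\frac{H-\log_2|i-j|}{(i-j)^2}$ from the balanced-tree bound above:

```python
import math
from itertools import permutations

def best_permutation(C, H):
    """C: symmetric intersection-size matrix as a list of lists.
    Returns the permutation sigma maximizing the weighted objective."""
    m = len(C)
    def score(sigma):
        s = 0.0
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                d = abs(sigma[i] - sigma[j])
                s += (H - math.log2(d)) / d**2 * C[i][j]
        return s
    return max(permutations(range(m)), key=score)

# Toy instance: strong overlap between S_1,S_2 and between S_3,S_4.
C = [[0, 3, 0, 0],
     [3, 0, 1, 0],
     [0, 1, 0, 2],
     [0, 0, 2, 0]]
print(best_permutation(C, H=2))
```

As expected, the optimum places heavily overlapping sets at adjacent positions, so that their intersection is credited high in the tree. The brute force is $O(m!\cdot m^2)$, so this only illustrates the objective, not a usable algorithm.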

@a3nm I haven't tried to prove that, but it very much looks like an NP-hard problem. I'm fine with constant approximations of the problem as well though. – AmeerJ, Oct 13 '18 at 12:13


Ah, so it doesn't reduce to Huffman coding because the underlying sets might not be disjoint. Perhaps this paper shows this problem to be NP-hard and considers approximation algorithms? – Neal Young, Oct 14 '18 at 2:39

@NealYoung Thank you. The method 2 I suggested actually searches within the balanced trees. Do you think it could give a tighter bound than $O(\log m)$? – AmeerJ, Oct 14 '18 at 10:12

Any algorithm that forces the leaves to be at depth $\Omega(\log m)$ cannot give an $o(\log m)$-approximation. Consider an example with one set of size $n-\sqrt n$ and $\sqrt n$ sets of size $1$ (so $m=\sqrt n+1$). OPT is $\Theta(n)$, but any tree where all leaves are at depth $\Omega(\log m)$ will have cost at least $\Omega(n\log m)$. See also Lemma 4.2 in the paper I linked to. – Neal Young, Oct 14 '18 at 12:57