The Junta Problem

Avrim Blum is one of the top experts in computational learning theory, and has also made important contributions to many other aspects of theory. These include approximation algorithms, computational game and economic theory, and complexity theory–to name a few. However, you may not know that he has done extremely influential work on an important AI problem: more about this in a moment.

Today I will talk about the Junta Problem. The problem was known for years before it was given this flashy name, but to me it will always be “Avrim’s Problem.” I am not sure if he created it: I asked him and he says that Lisa Hellerstein was also there at the beginning. In any event it is a terrible problem.

The problem is terrible because it is so natural, so simple to state, and so hard to solve. When I heard about it for the first time I could not believe that it was open. But open it is, very open. We know something about the problem, yet a full solution remains elusive.

As promised, before I start discussing the Junta problem, I want to say something about Avrim’s AI work. Years ago, he and Merrick Furst set out to try to understand what the key problem(s) in AI were. For the rest of this I will call them the “team.”

At the time they set out to discover if theory could help AI, they were both at Carnegie-Mellon, which is one of the top places for doing AI research in the world. Then and still now. They told me that they started to have weekly meetings with the AI folks, trying to discover if theory could help solve any AI problems. As you might imagine, at first this was like a U.N. meeting without the benefit of translators. Initially, it might have been better if the AI researchers spoke German, and the team only understood French. The language and world view of AI and theory are very different: the notions, the methods, the tools, the names, and more are completely different.

However, after many meetings and lots of hard work, the team finally got it. They figured out what the “planning problem” was, and found a viable approach. The planning problem is a central problem in AI: it covers everything from moving robot arms, to driving autonomous vehicles, to using digital agents. Formally, think of a set of objects and a set of operations. The problem is to get the objects from some initial state to some final state. Not surprisingly, the planning problem is not even NP-complete, but is PSPACE-complete. Good luck.

However, their colleagues had problems that they needed to solve, so the team set out to make their approach work on “practical problems.” They succeeded wildly: they created a novel approach to planning that is now the core of all planning methods in AI. Their work is one of the most cited papers in AI, and has helped make planning possible for problems many orders of magnitude larger than before.

Their new idea was to introduce what became known as a planning graph. This was a normal directed graph that captured the planning problem’s essential structure, and reduced the planning problem to a kind of “s-t” connectivity problem. Of course this could not be exactly right since the problem is so hard, but the graph yields new insights that help prune the exponential searches tremendously. Go team.

The Junta Problem

The Junta problem is to learn an unknown boolean function

$f : \{0,1\}^n \rightarrow \{0,1\}$

that only depends on $k$ of the variables from uniform random samples of the form $(x, f(x))$. Thus, for some $i_1 < \cdots < i_k$ and some boolean function $g$,

$f(x_1,\dots,x_n) = g(x_{i_1},\dots,x_{i_k}).$

The variables $x_{i_1},\dots,x_{i_k}$ are called the Junta, and $g$ the hidden function. Note that only uniform random sampling is allowed; no queries are made. This is why the Junta problem is hard, and also why it is interesting.

The problem can be reduced to learning just the Junta, since if the Junta is known, then recovering the function takes time polynomial in $n$ and $2^k$. All such statements can be made more precise, since they are really statements “with high probability”. We will not be so careful; please be sure to check the papers for the exact statements.

The reason this is an important problem is that it represents, perhaps, the simplest case of learning in the presence of a vast amount of extraneous information. An example of this could be trying to discover biological pathways. It may be the case that very few pieces of information are predictors of which proteins are actually in the pathway, but we are overwhelmed by a vast amount of information.

The Junta problem is an attempt to make a clean theory problem that, at least in spirit, captures this type of situation. The fact that the values of $x$ are selected from a uniform distribution makes the problem an idealization of more practical problems. But as I stated earlier it is still very difficult.

The obvious bound is order $n^k$. This is the brute force algorithm that tries all possible subsets of size $k$. Once you guess a subset, checking that it is correct and finding the actual function is easy–with high probability. The primary challenge is to do better than this bound.
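The brute force search can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names (`draw_samples`, `consistent_table`, `brute_force_junta`) are made up for this sketch, samples are noise-free, and "checking that a guess is correct" is taken to mean that the labels never disagree on two samples that agree on the guessed coordinates.

```python
import itertools
import random


def draw_samples(f, n, m, rng):
    """Draw m uniform samples (x, f(x)) with x in {0,1}^n."""
    samples = []
    for _ in range(m):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        samples.append((x, f(x)))
    return samples


def consistent_table(samples, subset):
    """Return the truth table restricted to `subset` if the labels depend
    only on those coordinates in the given samples, otherwise None."""
    table = {}
    for x, y in samples:
        key = tuple(x[i] for i in subset)
        # setdefault stores y the first time a key is seen, then we compare.
        if table.setdefault(key, y) != y:
            return None
    return table


def brute_force_junta(samples, n, k):
    """Try all (n choose k) subsets, roughly n^k of them, and return the
    first one that is consistent with every sample, with its truth table."""
    for subset in itertools.combinations(range(n), k):
        table = consistent_table(samples, subset)
        if table is not None:
            return subset, table
    return None
```

With enough samples (on the order of $2^k$ times a logarithmic factor, which I have not tried to pin down here) a wrong subset is caught with high probability, so the loop over subsets is what dominates the running time.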

A Key Simple Case

There is an important special case of the Junta problem that can be solved in polynomial time. Suppose that the hidden function is a linear boolean function. In this case there is an algorithm that solves the problem in polynomial time–with high probability. The key observation is that for each input $x$ and output value $y$, we can form the equation,

$a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = y \pmod 2.$

In this case, $f$ is just the XOR function of a $k$-sized subset of the $x_i$’s. The subset is determined by the values of the $a_i$’s. Note that, viewed as an equation in the $a_i$’s, this is a linear equation. After sufficiently many random samples the collection of all such equations will have full rank. Essentially, we will get equations like this:

$a_1 x^{(j)}_1 + a_2 x^{(j)}_2 + \cdots + a_n x^{(j)}_n = y^{(j)} \pmod 2, \quad j = 1,\dots,m,$

where $(x^{(j)}, y^{(j)})$ is the $j$-th sample. Solve the resulting equations for the $a_i$’s, which will yield the location of the Junta: $a_i = 1$ if and only if $x_i$ is in the Junta. Then, it is easy to find the actual function, as we pointed out earlier.
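Here is a small sketch of the linear case, assuming noise-free samples and plain Gaussian elimination over GF(2) rather than any fast matrix method. The encoding is my own choice for the illustration: each input is packed into an integer bitmask, and the names `sample_parity` and `solve_gf2` are hypothetical.

```python
import random


def sample_parity(secret_mask, n, m, rng):
    """Draw m samples (x, y) where x is an n-bit mask and y is the XOR of
    the bits of x selected by secret_mask (the hidden Junta)."""
    rows = []
    for _ in range(m):
        x = rng.getrandbits(n)
        rows.append((x, bin(x & secret_mask).count("1") % 2))
    return rows


def solve_gf2(rows, n):
    """Solve sum_i a_i x_i = y (mod 2) for the a_i by Gaussian elimination.
    Returns the list [a_0, ..., a_{n-1}], or None if the samples seen so
    far do not have full rank."""
    pivots = {}  # pivot column -> (mask, y), kept in fully reduced form
    for mask, y in rows:
        # Clear every existing pivot column from the incoming equation.
        for col, (pmask, py) in pivots.items():
            if (mask >> col) & 1:
                mask ^= pmask
                y ^= py
        if mask == 0:
            if y:
                raise ValueError("inconsistent equations (noisy samples?)")
            continue  # equation was a combination of earlier ones
        col = mask.bit_length() - 1
        # Clear the new pivot column from the older pivot rows.
        for c, (pmask, py) in list(pivots.items()):
            if (pmask >> col) & 1:
                pivots[c] = (pmask ^ mask, py ^ y)
        pivots[col] = (mask, y)
    if len(pivots) < n:
        return None
    # At full rank each pivot row reads a_col = y.
    return [pivots[i][1] for i in range(n)]
```

The positions where the returned vector is 1 are the Junta. A handful more than $n$ random samples is enough for full rank with high probability.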

Mossel, O’Donnell, and Servedio

Elchanan Mossel, Ryan O’Donnell, and Rocco Servedio (MOS) made a breakthrough on the Junta problem. The obvious bound, as we already said, is $n^k$; they were the first to break through this. There were some earlier results that were of the form , but theirs was a true breakthrough.

MOS proved a wonderful result. They also named the problem the Junta problem in their paper. It’s rare to see a paper that both gets a great result, and re-names a problem in such a memorable way. Their result is the following.

Theorem: The Junta problem can be solved in time $n^{\alpha k}$ where $\alpha = \frac{\omega}{\omega+1}$.

In the above result, $\omega$ is the Strassen constant.

Their result is based on a general idea from problem solving. When faced with a problem that you cannot solve with one strategy, find two strategies. Suppose one is trying to prove that all $x$ have some property. You then must prove three things:

If $x$ is in the first case, use strategy I;

If $x$ is in the second case, use strategy II;

Finally, all $x$ fall into one of these two cases.

This is an old idea that has been used many times before. One of my personal favorite examples is from number theory. John Littlewood once proved a great theorem in number theory by dividing the world into two cases: (i) the Riemann Hypothesis is true; (ii) the Riemann Hypothesis is false. In the first case primes are “well-behaved” so he could solve his problem. In the second case, they are not “well-behaved”, but he had such a strong hypothesis that he could still solve the problem.

The main idea of MOS’s proof is this. Consider the hidden boolean function $f$. They have two strategies for finding the Junta.

In the first case, suppose that the degree $d$ of $f$ as a boolean function is “small”. They generalize the linear case and collect many samples, and then solve the resulting linear system, on about $n^d$ variables. Of course this can be done in $O(m^\omega)$ where $m$ is the number of equations.

In the second case, suppose that the boolean function has a “small” non-zero Fourier coefficient. Then, they search all small possible Fourier coefficients, and find the Junta.

The notions of small are carefully tuned to make all this work–see their paper for details. The key insight is that one of these two cases must hold, no matter what the hidden boolean function is. They show that it is impossible for a boolean function to have both high degree and also no small non-zero Fourier coefficient. This is essentially an inequality bounding the degree and the size of the smallest Fourier coefficients.
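To make the second strategy a bit more concrete, here is a rough sketch of how one might estimate low-weight Fourier coefficients from samples. It relies on the standard fact that if the coefficient on a set $S$ is non-zero, then every variable in $S$ is in the Junta. The function names, the weight limit `max_weight`, and the `threshold` are all assumptions of this sketch, not the tuned parameters from the MOS paper.

```python
import itertools


def estimate_coefficient(samples, subset):
    """Estimate the Fourier coefficient of f on `subset`, that is the
    average of (-1)^(f(x) + sum of x_i for i in subset), from samples."""
    total = 0
    for x, y in samples:
        parity = sum(x[i] for i in subset) & 1
        total += -1 if (y ^ parity) else 1
    return total / len(samples)


def relevant_from_low_weight(samples, n, max_weight, threshold):
    """Search all non-empty subsets of size at most max_weight and collect
    the variables of every subset whose estimated coefficient is at least
    `threshold` in absolute value."""
    found = set()
    for w in range(1, max_weight + 1):
        for subset in itertools.combinations(range(n), w):
            if abs(estimate_coefficient(samples, subset)) >= threshold:
                found.update(subset)
    return found
```

The search costs about $n^t$ coefficient estimates when the weight limit is $t$, which is why this strategy only pays off when some non-zero coefficient sits on a small set. For a hidden parity on all $k$ Junta variables it finds nothing below weight $k$, and that is exactly where the linear-algebra strategy takes over.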

Lisa Hellerstein told me, recently at STOC, that after reading the MOS paper, years ago, she and her colleagues tried to do some experiments with machine learning algorithms on Junta problems. She said their methods failed badly on certain types of functions. In an attempt to understand why, she also tried to get references on the class of functions that seemed to be the worst case.

A chance conversation with Eric Bach one day led Lisa to find out that the same inequality proved in the MOS paper was already known. It is called Siegenthaler’s inequality, and was proved in the part of the security community that works on crypto-boxes, back in 1984. Further, there is a rich collection of ideas and results known about this lemma, and about which functions are the worst cases for the inequality. See the book by Thomas Cusick and Pantelimon Stănică for a modern treatment of this inequality and related ideas.

I find this connection interesting for two reasons. First, it is a tribute to MOS that they were able to prove the lemma from scratch. Second, it also says something about research: Siegenthaler’s inequality was proved in the security community for the purpose of understanding crypto-boxes. Yet many of us–all?–in the theory community were unaware of this neat result. We owe thanks to Lisa for discovering this connection. Sometimes research requires a bit of luck, or is it persistence?

Symmetric Juntas

One special case that I think is interesting is when the hidden boolean function is symmetric. In this case Mihail Kolountzakis, Evangelos Markakis, and Aranyak Mehta proved a beautiful result on the Junta problem. They show that,

Theorem: In the symmetric case the Junta problem can be solved in time $n^{o(k)}$.

I will not have time to explain the details of the proof, but it is based on Fourier methods only–there is no linear solve strategy needed. They show,

Lemma: Every symmetric boolean function of $k$ variables has a non-zero Fourier coefficient of size at most $O(k/\log k)$.

They conjecture that this bound is far from optimal. It is open whether there is always a non-zero Fourier coefficient of size at most $O(1)$ for symmetric boolean functions.

By the way, I helped start this line of research with Markakis, Mehta, and Nisheeth Vishnoi, we eventually got bounds of on the Fourier coefficients for small . Then, with important insights from Kolountzakis–an expert on Fourier methods–Kolountzakis, Markakis, and Mehta could prove . Finally, we all put several papers together to make the final contribution. The complete story of who did what when is a bit complex, and probably is best left unsaid. Research can be messy.

An Idea for Improvement of MOS

I have an idea that I want to share with you on how compressive sensing technology could play a role in improving the result of MOS. Recall MOS have to solve a linear system of equations as one of their two strategies:

$A a = b.$

The matrix $A$ comes from the values of the random inputs $x$ that are selected and the vector $b$ holds the values of the function $f$. Note, the key point is that the vector $a$ of unknowns is sparse. If $A$ is $m$ by $N$ where $N$ is at most order $n^d$, then the vector $a$ has at most $2^k$ non-zero components. Here $d$ is the degree of the boolean function. I will not explain what compressive sensing is today: see this for lots of information and tutorials. The key to compressive sensing technology is that it has methods for solving linear systems,

$A a = b,$

over the reals, that have sparse solutions. The technology works when the matrix is “random.” Even better, the matrix need not be full rank for the technology to work.

The matrix that arises in the MOS approach to the Junta problem is not a random matrix. However, it does have a large amount of randomness. Is there enough randomness in the matrix to make compressive sensing work? The point can be made clearer if we consider the case where $d = 2$. Then, for each $x$, we get an equation of the form:

$\sum_{i \le j} a_{ij} x_i x_j = y \pmod 2,$

where $x$ is the random input and $y$ is the output. There are order $n^2$ terms, but only $n$ random bits. As additional rows are added, with additional samples, the matrix will be $m$ by order $n^2$. Is this matrix able to make the compressive sensing method work, since it only has order $mn$ random bits?
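To see how little randomness each row carries, it may help to build one row explicitly. The sketch below expands an input $x$ into the values of all monomials of degree at most $d$; including the constant monomial and ordering the columns by degree are my own conventions for the illustration, and the function names are hypothetical.

```python
import itertools
import math


def monomial_row(x, d):
    """Expand the n input bits x into one matrix row: the value of every
    multilinear monomial of degree at most d, ordered by degree."""
    n = len(x)
    row = []
    for w in range(d + 1):
        for subset in itertools.combinations(range(n), w):
            # The empty monomial (w == 0) is the constant 1.
            row.append(int(all(x[i] for i in subset)))
    return row


def num_columns(n, d):
    """Number of columns of the expanded matrix: sum of (n choose w)."""
    return sum(math.comb(n, w) for w in range(d + 1))
```

For $d = 2$ a row has about $n^2/2$ entries, yet it is completely determined by the $n$ random bits of $x$. Whatever the compressive sensing method needs from the matrix has to come from those few bits.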

MOS solve their linear system by using, of course, fast matrix methods, so this part takes matrix-multiplication time. My thought is this: if we could use the fact that the solution vector is sparse compared to the size of the matrix, could we do much better? Note, if we could do the linear solve substantially faster, then we would improve the exponent of the MOS algorithm.

The problem with this approach is that I do not see exactly how to use the compressive sensing technology here. That basic technology is for real numbers and we seem to need to do all our calculations over a finite field. But, perhaps there is some hope. More in a future post on this approach.

Open Problems

Solve the Junta problem, please. Or at least improve the current exponent of MOS. A more tractable problem may be to try solving the Junta problem over other distributions. Or try to solve the problem for other special classes of boolean functions. If you go this route, be aware that I have not given an exhaustive summary of all that is known about the Junta problem: for example, monotone functions are also easy. Or try something else.

Another open problem is to try using compressive sensing. This will require an extension, I believe, to the existing technology. It may not work, but it is worth the effort, since the payoff is large.

Finally, having worked hard on the symmetric case, I would like to see this case solved. The proof of our theorem that symmetric boolean functions have small non-zero Fourier coefficients is not easy, yet we believe that it is far from optimal. Perhaps there is a completely new approach that will solve the Junta problem for symmetric functions?

First of all, I write a small blog on compressive sensing ( http://nuit-blanche.blogspot.com/search/label/CS ) that is mostly geared toward making sense of the different results springing left and right in this new field. In particular, it reaches out to the mathematicians all the way to the applied engineers and therefore I put an emphasis on making sure that everybody speaks French :-)

With regards to compressive sensing, an early idea was the need for the measurement matrix to satisfy the Restricted Isometry Property (RIP or UUP). One of the ideas needed by the engineering community is understanding how one can map the restricted isometry property to the measurement matrix (your A) that is given by their physics. In effect, the concept of RIP is not useful to them except in specific cases when there is an obvious mapping. RIP has several issues including the fact that it is NP-hard to prove for a given matrix and that it is a sufficient condition. Numerical tests have shown however that it was sometimes far from being a necessary condition. All kinds of other properties have sprung up since then (Statistical RIP, GRIP, Null space property….) that sometimes allowed one to consider new families of measurement matrices that were not covered by RIP. Of potential interest to your post is the recent arrival of two techniques devised to check a sufficient condition for the null space property :

Learning Juntas is one of my favorite problems. I’d like to make a few notes and contribute a few problems:

1. The reason that I find learning Juntas to be an important problem is that it is provably easier, but not known to be strictly easier, than learning poly-size DNF, or poly-size decision trees, etc. So it’s a simple barrier standing between us and much-coveted algorithms: a situation analogous to that of Unique Label Cover (the problem from UGC), if you will. Juntas is an incredibly simple but hard problem that can teach us a lot.

2. Another problem similar to Juntas is the problem of Learning Parities with Noise (abbreviated PwN; also see the Blum-Kalai-Wasserman paper, the Feldman-Gopalan-Khot-Ponnuswami paper, and my paper with Kalai and Mansour). The best algorithm for PwN runs in time 2^O(n/logn), even though the non-noisy problem can be solved in matrix-multiplication time (as you outline in the post). It’s striking to see that even 0.001% noise kills your ability to learn parities efficiently (for asymptotically-large n), and that an integer program easily solves PwN, so PwN is readily seen to be a problem about computational complexity. I believe that the hurdles to finding a better algorithm to PwN are similar to the hurdles to finding a better Juntas learner. Essentially both problems are about finding a non-zero Fourier coefficient in a function with a sparse Fourier spectrum (n^k coefficients in the case of Juntas, just 1 coefficient in the case of PwN, if you define the Fourier spectrum of a probabilistic function appropriately).

3. In that spirit, I have a crazy conjecture that I’ve been offering money on for a few years now: prove that if you can learn Parity with Noise in polynomial time, then you can factor integers in polynomial time. The idea is to look at Shor’s algorithm for factoring: the only place where it uses quantumness is in performing the Quantum Fourier Transform (QFT) over Z_N. The QFT is only used in order to find the heavy Fourier Coefficients. However, an algorithm for learning parities will, in particular, give an algorithm for finding all non-negligible (>=1/poly) Fourier coefficients of a boolean function, as beautifully shown in the paper of Feldman et al. Using such an algorithm for PwN, one would presumably be able to simulate the QFT as used by Shor’s algorithm, thus giving a poly-time algorithm that factors integers. Thinking about it a little more, one sees that PwN gives you functionality over Z_2^n, while Shor’s algorithm requires functionality over Z_N=Z_{2^n}, so that’s a problem, but maybe surmountable. Another difficulty is that I could try to convince you in the same way that a poly-time algorithm for PwN would give a classical version of Simon’s algorithm, which is obviously false because no classical algorithm exists for Simon’s problem, by information-theoretical considerations. Thus, it appears that you really need to use some knowledge about the factoring problem itself. Nonetheless, I wouldn’t be surprised if there is such a result. Dinner on me or 128$ to the solver. I’m even willing to fix it to Renminbi by today’s rate, if you want.

4. Another nice attempt for an algorithm for learning Juntas is to write an integer program, relax it to a linear program, and try to use the solution in a useful way. I’ll say in advance that I don’t think that what I’m going to suggest will work, but here goes: The LP has n variables: x_i=1 would indicate that the i-th variable is in the junta (i.e. is influential), otherwise x_i=0. To get the equations of the LP, pick samples from the example oracle. Wait to get both a 1-labeled example (x,1) and a 0-labeled example (y,0). Look at the coordinates where x and y differ. At least one of them is in the junta, so write an inequality to this effect. Keep picking pairs like this. You need only roughly logn*k2^k pairs like this, but you can pick as many as poly(n,2^k) if you want. The objective function is to minimize the sum of the x_i’s. It’s not hard to prove that if you view this as an IP, it has a unique solution which is indeed the Junta, assuming you picked at least logn*k2^k pairs or so. I don’t know about the LP relaxation of this. My best hunch for rounding is to do iterated rounding: put variable i in your solution with probability proportional to x_i, repeat by picking new samples, writing a new LP (this time choosing only samples with the i-th coordinate equal to a fixed bit b), etc. Repeat k times, get a solution, check if it works. If it doesn’t work, chuck it and try again. I don’t believe this works, and I’m wondering whether the simple Junta which is a parity of the k relevant variables will kill this. In any case, I wonder if this gives a faster algorithm than MOS’s algorithm. I believe not.

5. The algorithm I suggested above points to a connection between set cover and learning Juntas, since the IP you get is a set-cover IP. If you could solve set cover exactly and efficiently, then you would be able to learn Juntas efficiently. I think that even an approximation ratio of o(logn) would learn Juntas efficiently. I vaguely remember seeing this connection between Juntas and Set Cover in the literature, but I don’t remember where. Another interesting connection is to the k-clique problem, i.e. finding a clique of size k in a graph. This problem is believed to take n^{Omega(k)} time. It’s W[1]-hard. Personally, I believe that learning Juntas is very similar to k-Clique, and should also take n^{Omega(k)} time. I was never able to find a reduction in either direction. (The more likely direction: if k-CLIQUE takes n^{Omega(k)} time, then so does learning Juntas).

6. This is getting long, but here’s another question. As you mention, Monotone Juntas can be learned efficiently. Also, if you can perform membership queries (MQ) , then you can also learn Juntas efficiently. The same effect occurs for Decision Trees: MQ give you learning decision trees in polynomial time (this is a special case of Jackson’s seminal poly-time algorithm for learning DNF), and also, if you only have a random-sample oracle but you know that the decision tree is monotone, then you can learn it in polynomial time (this is a beautiful result of O’Donnell and Servedio). The same effect does not hold for DNF as far as we know: DNF can be efficiently learned using MQ (again, Jackson) but we don’t know how to learn monotone DNF efficiently (the best is in a paper by Servedio, but somewhat superpolynomial; This problem of learning monotone DNF is very interesting in itself. No one has a guess whether it’s possible). Now, here is my question: Prove that any function that is learnable in the monotone setting, is also learnable in the MQ setting. This is of course not true in general: I think that for juntas of log^2(n) variables, if you know that they’re monotone, you can learn them in poly(n) time (using a result of Bshouty, and you also need to compute the influences of variables), but having MQ does not help you, and you’re stuck with n^{log(n)} time because of information-theoretic reasons. But I think that the principle should work in general: monotone learning should morally imply MQ learning. Any kind of evidence will be appreciated.

7. And just because I can’t stop myself, another question on the same topic: Algorithms for learning monotone functions typically use the monotonicity for just one thing: for computing the influences of all of the variables. One striking exception is the O’Donnell-Servedio learner for monotone decision trees, that uses monotonicity also in order to prove that the average sensitivity is small, and then uses a result of Friedgut about functions with small average sensitivity. Note that it’s easy to calculate influences of variables in a monotone function (exercise!) but information-theoretically hard for a non-monotone function. My question: Suppose I gave you a wizard (=oracle) that given any function, computes the influence of each of its variables. Can you reproduce known monotone-learning results? Obviously, most just need to know influences so they follow automatically. But what about the O’Donnell-Servedio result? I vaguely remember having proved once that if you want to learn a decision tree f, and if you get along with the input the influences of f’s coordinates, then that doesn’t help you at all, i.e. it’s as hard as learning decision trees without extra knowledge about influences (I can try to recall the proof if there’s interest). But what happens if you get a wizard that computes influences of variables in any black-box function of your choice? My guess is that it doesn’t help, but I can’t prove it. Maybe an oracle separation-like result? That will also teach the blog’s owner a lesson.

8. About Lipton’s approach to Juntas: If we want to avoid the use of matrix multiplication to solve the low-binary-degree case, a good place to start looking would be in trying to learn a linear function (i.e. a parity) faster than matrix-multiplication time. Currently, as far as I know, the best algorithm runs in matrix-multiplication time. (Correct me if I’m wrong). Would you say it’s possible to learn parities in less, or to avoid matrix-multiplication (and/or avoid Gaussian-eliminating the samples)? This question is orthogonal to Lipton’s question, I think, since I think that his whole point is that for higher degrees, the matrix becomes sparser and sparser. However, I think it might be interesting to look at the really low-degree case (ie. linear, quadratic), figure out whether we think that could be improved, and if we think it can’t, trying to take those arguments and convincing ourselves that things are not likely to work for higher degrees as well. Or maybe to find a possible reason that things could work for higher degrees, inspired by the comparison to the really-low-degree case.

9. There’s something very old-school about calling the matrix-multiplication exponent the “Strassen Constant”. And somewhat confusing — as Strassen’s constant should logically refer to the constant that Strassen got for Fast Matrix Multiplication.

10. Do you know the story how the beautifully-named paper “Learning Juntas” got renamed, in its Journal version, to “Learning functions of k relevant variables” ?
