Jacob Ziv is best known for developing the Lempel-Ziv algorithm for data compression with his colleague, Abraham Lempel. He was educated at the Technion (Israeli Institute of Technology) in Haifa, Israel and earned his doctorate in information theory at MIT in 1961. He worked for Bell Telephone Labs and the Israeli Ministry of Defense in addition to teaching at the Technion. The Lempel-Ziv algorithm was in many ways ahead of its time, and was not fully utilized by the commercial sector until later technological developments such as the Internet and cellular telephony.

+

[[Image:Jacob Ziv 2320.jpg|thumb|left|Jacob Ziv]]

−

The interview begins with Ziv’s education and work experience. Ziv then discusses MIT’s program in information theory during the time he was a student there, and the important intellectual influences at MIT (particularly Claude Shannon) that shaped his thinking. He goes on to describe the workings of the Lempel-Ziv algorithm and other data compression algorithms, as well as their practical applications in various fields, such as the Internet, cellular telephony, and the transmission of photographs from Mars. The interview concludes with a discussion of Ziv’s activities with the IEEE and the Information Theory Society and his thoughts on the IEEE as a unique organization.

+

Jacob Ziv is best known for developing the Lempel-Ziv algorithm for data compression with his colleague, [[Abraham Lempel]]. The [[Milestones:Lempel-Ziv Data Compression Algorithm, 1977|Lempel-Ziv Data Compression Algorithm]] was designated as an IEEE milestone in 2004.

−

&nbsp;

+

Jacob Ziv was educated at the Technion (Israeli Institute of Technology) in Haifa, Israel, and earned his doctorate in information theory at MIT in 1961. He worked for Bell Telephone Labs and the Israeli Ministry of Defense in addition to teaching at the Technion. The Lempel-Ziv algorithm was in many ways ahead of its time, and was not fully utilized by the commercial sector until later technological developments such as the Internet and cellular telephony.

−

== About the Interview ==

+

The interview begins with Ziv’s education and work experience. Ziv then discusses MIT’s program in information theory during the time he was a student there, and the important intellectual influences at MIT (particularly [[Claude Shannon|Claude Shannon]]) that shaped his thinking. He goes on to describe the workings of the Lempel-Ziv algorithm and other data compression algorithms, as well as their practical applications in various fields, such as the Internet, cellular telephony, and the transmission of photographs from Mars. The interview concludes with a discussion of Ziv’s activities with the IEEE and the [[IEEE Information Theory Society History|Information Theory Society]] and his thoughts on the IEEE as a unique organization.

−

Jacob Ziv: An interview conducted by Robert Colburn, IEEE History Center, 25 March 2004.

+

== About the Interview ==

−

Interview # 437 for the IEEE History Center, The Institute of Electrical and Electronics Engineers, Inc.,

+

Jacob Ziv: An interview conducted by Robert Colburn, IEEE History Center, 25 March 2004.

−

and Rutgers, The State University of New Jersey.

+

Interview # 437 for the IEEE History Center, The Institute of Electrical and Electronics Engineers, Inc.

−

&nbsp;

+

== Copyright Statement ==

−

== Copyright Statement ==

+

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of IEEE History Center.

−

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of IEEE History Center.

+

Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, 39 Union Street, New Brunswick, NJ 08901-8538 USA. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

−

<br>Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, Rutgers - The State University, 39 Union Street, New Brunswick, NJ 08901-8538 USA. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

+

It is recommended that this oral history be cited as follows:

−

<br>It is recommended that this oral history be cited as follows:<br>Ziv, Jacob, Electrical Engineer, an oral history conducted in 2004 by Robert Colburn, IEEE History Center, Rutgers University, New Brunswick, NJ, USA.

+

Ziv, Jacob, an oral history conducted in 2004 by Robert Colburn, IEEE History Center, New Brunswick, NJ, USA.

Colburn: I would like to ask a number of questions, partly to go through your life history, your early education and schooling, and the things that led you to the things that you’ve done, and also to ask about the creative process in developing the algorithm. <br>Ziv: Okay, let me do my best. I’m not sure I’m completely aware of what happened insofar as the creative process is concerned. <br>I got my primary education and secondary education in Israel. I was born in Israel. I got my first degree in Electrical Engineering at the Technion in Haifa, Israel. Technion is the Israeli Institute of Technology. It’s the oldest university in Israel and the main engineering school in Israel until now. I also got my Master’s degree at the Technion, but then I went to the United States and got my Ph.D. from MIT in Cambridge, Massachusetts. <br>The area of my research at MIT was information theory. Prior to going to MIT, I was an R&amp;D engineer with the Research Establishment of the Ministry of Defense in Israel. After coming back from MIT, I then joined the Technion and the Technion faculty.<br>Colburn: What year was that?<br>Ziv: I got my Ph.D., I believe, in 1961. I then spent a few more years at the Research Establishment of the Ministry of Defense. Then around 1970, after spending a sabbatical at the Bell Telephone Labs, which in a way was my second research home for quite a long time—I used to spend, for long periods of times, summers and sabbaticals at Bell Labs. But then in 1970, I joined Technion in Israel. Since then, this is my home university base. <br>While working as a communication engineer, I was fascinated by early books about information theory. It was pretty clear to me that this was what I wanted to do. Since Shannon was at MIT, it was pretty clear that that was the place to go to. Shannon is the father of information theory. And it was indeed the right choice, because at that time it was the center, the best place for that research activity in the world, I would say. Not only that it was the best place, but they were able to give you the feeling there that you are in the best place, which is almost as important. I had the benefit of working there with leading figures in information theory at the time, Jack Wozencraft and Bob Fano. <br>At that time, a book by Jack Wozencraft and Erwin Jacobs, who is now, by the way, the president of Qualcomm, was the bible of communication theory. There were many others there. Peter Elias and so forth. Bob Gallager was a young professor there—all the people who were either then superstars in information theory or who became superstars later on. So I was lucky in making the right choice then.<br>Since then, I’m in this area. When I returned to Israel, as I told you, I was going back and forth during summers and a few sabbaticals to Bell Labs. I collaborated a lot there with the late Aaron Wyner, who is one of the most important contributors to information theory. While at Bell Labs, and also at Technion, I got interested in one branch of information theory, which almost didn’t exist then. The classical theory of Shannon, which is a beautiful mathematical theory, not only was a theory, but it so happened that it was very applicable, especially for the space channel where the noise is well-behaved, namely it’s White-gaussian noise, something that everybody knew how to deal with mathematically. It wasn’t just an idealistic modeling for a communication channel; it was a real model. <br>The basic assumption there was that if you know what the channel is, statistically, if you know the statistics that govern the channel, namely the noise statistics and so forth, you know the probabilistic rule that governs the output given the input, then Claude Shannon came with a beautiful idea and proved it, that by doing coding, namely, using longer and longer messages, you can transmit data practically error-free, as long as the rate is below “capacity.” The only cost was delay, of course. This was a novel idea, but it was all based on the fact that you know what a channel is, probabilistically. <br>The natural problem is two-fold. The simpler case is that where you want to compress data, this is another idea that Shannon dealt with. If you know the statistics of the source, you can represent the source with less bits per second than the rate at which the original data is coming in. By clever mapping, which is based on the old idea by Morse, you attach a long code for a rare message or rare letter and a short one for something that is not rare because it’s being used more frequently. So, on the average, you can transmit at a lower rate. This was known, and this idea was analyzed by Huffman and Shannon and then beautifully developed by Qualcomm and so forth. But it was all based on the assumption that you know what the statistics are. I was bothered by this a-priori assumption for a long time.<br>Coming back to the Lempel-Ziv algorithm you referred to earlier in your question, the first step was to try and understand how to cope with the case where you don’t know what the statistics are and you have to cleverly learn what it is while dealing in compressing the data that comes in. This finally led to a whole theory of universal data compression. Again, the algorithm that was developed jointly with Abraham Lempel, here at Technion, and led to an optimal solution in which we were able to show that asymptotically - when you deal with longer and longer sequences - you actually can compress the data to the very same minimal rate that you could get even if somebody would have told you what the statistics are exactly. That’s nice.<br>Now, in science, you don’t have to be only clever; you have to be lucky too. And we were lucky in the sense that the tools by which we were able to prove this nice result also led to a very efficient algorithm which was linear in time. Namely, if you need to compress a sequence of, say, n letters, the number of computations that you have to do was proportional to n, rather than much more than that in other schemes, such as the Huffman code. Even in the Huffman scheme, which is very efficient, complexity is bigger, despite of the fact that you do know what the statistics are. The very fact that this mathematical tool by which we were able to prove these results led to a very efficient algorithm. This really made it so popular and famous.<br>Let me mention another interesting example where the very same thing happened. It’s the Viterbi algorithm, the very famous Viterbi algorithm, which is one of the cornerstones in communication systems. Again, Andrew Viterbi developed it because he was interested in proving some mathematical theorem which had to do with convolutional codes, which was a form of coding which was at that time very popular. Toward that, he derived an algorithm, which basically was a mathematical tool to prove a theorem, but turned out to be a very efficient algorithm which everybody is using. What made it famous, again, its simplicity.<br>When the Lempel-Ziv algorithm was developed, the practical impact was not felt because at that time, people weren’t that much interested—there was no real need for data compression for communication. It was used in the late 1970s for magnetic decoding and so forth, but insofar as communication there was no real need until, of course, when the Internet came in and there was a huge need for it. This is why now this algorithm is integrated in all the classical communication algorithms that have been used over the Internet for transmitting data error free, as well as in JPEGs and other algorithms. <br>In practice, you deal with data, which is usually man-made. Sometimes it’s rather foolish, even, to try to model it statistically. By the way, the theory that we developed with Abraham Lempel is able to cope with individual sequences. The results had to refer to an individual sequence, even if it’s not being generated by some probabilistic source. So the probabilistic assumption was really not needed at all. Mathematically, it was a very strong result from that respect. <br>This has to do with sources. I then got interested in similar problems which had to do with the case where the channel statistics is not known. What about the channel? What do you do then? I had been working on that for a number of years, and I was able to show that there are ways by which you can transmit error-free over a wide class of channels, channels with vanishing memory. Channels which do have memory, but not an infinite one, a fading memory. So for those channels, and most of real channels are like that, it was possible to show that you can get error-free transmission over the channel without a priori knowing what the channel statistics are. Not only that you can do it, but if the messages are long enough, you can get an error decay which was identical to the rate which was possible to get when you do know what the channel statistics are. So asymptotically, you again didn’t lose anything. <br>That was another effort that I spent a lot of energy on—and later, together with others. This led me to what I do more recently. This has to do with cases where you want to do other jobs, like for example, prediction. Suppose, based on observing a past sequence, you want to guess what the next letter will be. If you know the statistics, it’s known exactly what one should do and what the prediction error is. But what if you don’t know the statistics? The classical approach in information theory in many problems uses a technique and philosophy established by Shannon, and it’s a very important one. The idea is the following: How do you cope with a typical problem in information theory? You first try to set a converse theorem. A converse theorem is something that tells you what you cannot ever do, by any means. For example, if we talk about transmission over a channel, you can show that if the rate of information is above a certain value, namely capacity, no one in the world, no algorithms can yield a probability of error which vanishes in any way.<br>The same can apply to prediction. You can show that there is a lower bound below which the prediction error must be big. And then there is something which is sometimes called a coding theorem, where you derive an algorithm by which you can demonstrate that you can come very close to the lower bound, which is guaranteed by the converse theorem. If the two bounds come close to each other, then you know that you are done, because you know that you have something which is optimal or as good as the optimal solution, or very close to the optimal solution. The same technique has been used not only for data compression and for transmission in an error-free fashion over an unknown channel, but also for the case where you want to do prediction, and also classification. <br>Classification is another interesting case. Suppose I give you a full description of a certain source of information probabilistically. You know the full probabilistic description of that source. Now somebody comes and hands you a sequence, and you have to guess or judge whether this sequence is emerging from a source, which is different from this particular source or it’s not, if it comes from something which is different from that source, namely a different source with a different probabilistic orientation, according to some distance criterion, you should be able to tell that they are different. Again, if you have to judge between two known sources, you know exactly how to do good classification. By classification, I mean you have to declare whether it’s coming from Source A or Source B and you want to minimize the probability of error. But what if you don’t know anything about the source that is generating the sequence that you have to decide about? Then again, there is an optimal solution. The technique is that classical approach of information theory that basically all information theorists are using since Shannon, and that is setting a converse theorem which tells you what you cannot do. Then, if you are lucky, you get an actual algorithm that approaches that lower bound. Then you are happy, because you know that nobody can beat you with that, and I was lucky. <br>This is the main thrust of my work. There are other interesting results which have to do with the borderline between problems in communication theory and the estimations and so forth. There was a nice result that Professor Mosche Zakai from Technion and myself developed. It’s called the Ziv-Zakai Bound. It has to do with an interesting relation between the minimal probability of error which is associated with communication, when you want to transmit letters via a noisy channel, and the estimation error, where you want to estimate a certain value of the message—not to decide whether it’s A or B, but to decide what the analog value of the message is. It’s a happy marriage between two different areas in communication. Basically one which has to do with analog communication, the other one with digital communication. This bound is very useful for a variety of problems in communication, in radar theory, and so forth. <br>There are many other results, but I would say that main thrust of my work, what I did and what I was trying to do throughout the years, is to get to develop an information theory which is universal in the sense that you don’t really have to know the exact statistics of either the source or the channel. This is, I think, a short summary of what I did.<br>Colburn: What are some of the tools that you used to do this in terms of computer machinery? When you were working on this, what sort of things did you use, and how did you test it?<br>Ziv: Let’s go back. The best example is, again, that of the Lempel-Ziv algorithm. It had to do with the following: Suppose you meet a long sequence in the street, and you don’t know what it is, you don’t know who generated it. You want to compress it in the best way that’s possible. Is that a silly question or not? It turned out that if you limit the class of computers or machines that are going to compress this data to a class with finite memory, then theoretically, there is an answer. Not every sequence can be compressed. And actually we demonstrated that there exist some sequences which are not compressible at all by any machine with limited memory. By the way, this is important also for cryptography, because when you need good keys for cryptographic purposes, this is exactly what you need, something which will look really random. If something is really random, it means that it’s incompressible. <br>Anyhow, once we tried to solve that and we did solve that, then we realized there is a nice algorithm that goes with it. Then we published a second paper, which stressed the algorithmic meaning or the algorithmic usefulness of that method. Then we tried to play with it ourselves by doing some simulation. All we had at that time was small PCs. Even PCs were not around; they weren’t too popular then. It was in the mid-1970s. Then what happened was that both Professor Abraham Lempel and myself went for a sabbatical. I went to Bell Labs and he went to what used to be then Sperry-Rand Research in Boston. Then we could do much more, because I then was benefited by the help of some excellent programmers at Bell Labs; he did the same at Sperry. Although Abraham Lempel himself is a computer scientist—he knew more about programming than I did. We kept on working together. Then we came with a second version, which is much more efficient. And then the first patent was issued by Sperry-Rand on that. This patent now belongs to Unisys. There were many disputes about this patent. <br>Even before going to a sabbatical, we tried to convince a company in Israel to start a project, because we were aware of the fact that it’s useful commercially, too, even in the late 1970s, because in Germany, unlike in the States, at that time, they charged you by the number of bits rather than by the duration. This means that if you do compression, you pay less. So there was a really good reason for trying to make a commercial product out of it. We set up together with a company in Israel, a small R&amp;D group. But then this company collapsed because of other reasons. So nothing came out of it. <br>Anyhow, we were aware of the commercial importance of that aside from the theoretical part of it. But we were really hoping that either Bell Labs or Sperry-Rand or both would do something about it. I must say that it was very difficult. I was at Bell Labs and when I tried to convince them to issue a patent, they said that as long as Sperry-Rand is doing it, it’s okay with them, because they had some kind of a patent agreement between the two companies. At that time, there was another good reason. At that time it wasn’t clear—the law was changed later—it wasn’t clear whether you could actually get a patent on software. It had to be a real piece of hardware for you to be able to get a patent. Sperry-Rand did propose a hardware realization and applied and got a patent, and then a second patent. <br>Later on, there were other derivatives of this algorithm. Abraham Lempel spent some time at that time at HP, so he continued to develop it there. But then other companies took over, Microsoft and other companies, because at Microsoft became part of the operating system. Since then, both of us didn’t really follow up all the details of the different derivatives of that algorithm.<br>The nice thing about this algorithm, by the way, is that it’s so simple that it’s very easy for somebody to get his own version of this algorithm. Everybody is aware of the fact of the phenomenon of “not invented here.” It is so simple an algorithm that it’s also very easy to change it slightly and make it more optimal for a specific use. This is why many people took over. Now there’s a whole family of the LZ algorithm. Even nowadays, it’s the most popular and most common data compression algorithm that people are using for noiseless data compression. By noiseless, I mean this is the case where you don’t allow for any errors in the data compression. <br>But this is not the only noiseless compression algorithm. A very excellent example is the local algorithm, which was developed at HP and which was used in the Mars probe recently. This is to transmit these beautiful pictures from Mars. So by now, there are other algorithms, but they are much more complicated computationally. You have to pay the price—in the case of NASA, where every bit that you save is worth a million. For every day use, the old LZ algorithm is still the most popular one because it’s so simple, and very fast too. But indeed, I must emphasize the fact that it’s not the only one. There are other algorithms that have been developed since then. This gave a real thrust to the whole area of universal data compression. <br>I can mention other well-known algorithms for data compression: the PPM algorithm, which is mostly used when you want to compress huge archives and so forth, CTW and so forth. But they are all based on the same philosophy, as that of the Lempel-Ziv algorithm. The idea is to establish a dictionary or a list of things that you have observed already. In a sense it’s like a TV Guide. If you take a TV Guide and there’s an old movie that they show three or four times in the same week, they don’t print the full description of that movie again. They just use a pointer that tells you to go to two or three pages earlier and look for it there. Once you have observed something (and I’ll tell you a little more about what this something is), then you list it as an entry in the dictionary. And when you observe it again, you don’t have to copy it. All you have to do is to send the pointer and tell the other side where to copy from—the dictionary which both sides have. The whole idea is that this dictionary is being developed dynamically by both the sender and the receiver as it goes. <br>How you do it is simply the following. When you have a sequence, and let’s say that this sequence starts with a zero. You have never seen a zero before, so this is your first entry into your dictionary. Now let’s say that the second letter is a zero. Then you say, “Hey, wait a minute. I saw that already, so I don’t have to copy it.” Although in the beginning a zero can be represented by a small number of bits; it doesn’t really matter whether you copy it or not. It’s just one letter. Let’s say that after the second zero, there’s a one. So you say, “Aha! Zero-one is something that I’ve never seen in the past, so let me put that in the dictionary.” Slowly, you are generating a dictionary which consists of phrases of the data that are longer and longer. If such a phrase, sometimes it’s very long, repeats itself, you don’t have to copy it; all you have to do is to send the pointer and tell the other side where to copy this phrase from. That’s the whole idea. I think that the TV Guide explanation is the right one. The idea is generating dynamically a dictionary and pointers. You transmit the pointers, not the information. <br>The other algorithms that were mentioned—PPM, CTW—they’re all based on that idea. The new algorithm, which has been used in the Mars probe, is the LOCO-I algorithm, which is a very clever idea. But it’s only for the noiseless compression of pictures. If you want to transmit pictures noiselessly, then this is a very efficient algorithm. <br>There is another branch of data compression, which is called noisy data compression. In most cases, you allow yourself to introduce distortion. If you are pressed on rate, if you want to minimize the number of bits, some distortion may be acceptable to you. This calls for a different kind of data compression schemes, although sometimes a building block in such compression schemes is again the Lempel-Ziv algorithm. You first introduce distortion into the picture and compress is a little bit, and then go on and use the Lempel-Ziv algorithm. <br>Coming back to your question, I myself am not an expert in computer programming. But again, there are so many versions that there were enough jobs for many good programmers that came up with a variety of improvements.<br>Colburn: You mentioned that at first, people didn’t see the application. What was the first time that you heard of or that you knew of, the first application that somebody had taken it and used it?<br>Ziv: On the Net, people were aware of it and starting to use it for their own purposes. That’s what I did, too. At that time, memory was expensive. Currently, you have a huge memory even in a hand-held computer. But at that time, memory was expensive, so you had no choice but to compress if you wanted to store a lot of data. I was using it for my own purposes, and many other people started to use it for the same purpose. People were aware of it mainly in information theory, and then computer science discovered it; also some versions of it were available on the Net and were free for everybody to use for data storage purposes. <br>But the big push came when people were starting to use data compression on the Internet. There was no other choice because of the huge traffic. People just had to use it. This is when Sperry-Rand was bought by Unisys. Unisys, for years, didn’t tell anybody that they had that patent. Only when everybody was using it for free that they started to get after people, charging them. There was a big fuss about it. Finally, there was some sort of an agreement when the Lempel-Ziv algorithm was declared to be part of the approved standard of the FCC, v. 42-bis. Anyhow, when they adopted that as a standard, they forced Unisys to agree to a reasonable arrangement where any vendor of modems which wanted to use it, or had to use it as part of its modem, was paying an initial single payment of I think $25,000 dollars to Unisys and that was it. <br>For a while, everything was quiet. But then there was a second phase where it became part of GIF. Are you yourself familiar with it?<br>Colburn: I’ve been doing a fair amount of reading, as much reading as I could.<br>Ziv: Oh, okay. GZIP is a free version of the Lempel-Ziv algorithm, which is very good. People were threatening Unisys that if they won’t stop running after people and suing them or charging them, they’ll just switch to a new standard. I think that at that time they gave up. I’m not that familiar with the details, or what’s going on today. But definitely, when the World Wide Web became popular, this gave a huge push to the Lempel-Ziv algorithm, because it became an integral part of the algorithm of the World Wide Web. Without data compression, it was impossible to do anything at all at that time, and even nowadays. It’s amazing. The more bandwidth you get—you can always feel that there’ll be no problems. But then there’s so much more data being generated that data compression is always needed and always will be needed. The same applies to computers. You have a huge amount of storage space, but still, it’s being filled up very quickly, as you know. And Microsoft is not doing its best to make things efficiently, so this helps the data compression business. <br>So this is it, in a nutshell, the more applicable side of it. But I’m not really an expert on the most recent versions and applications of that algorithm and the other algorithms. Basically, I’m interested in information theory, although my basic education is that of a communication engineer, and I was involved with R&amp;D of real communication systems. But I’ve drifted into doing more and more theoretical work, which again, became applicable.<br>Colburn: One of the things that the other Marconi Fellows have talked about is that a lot of great ideas occur at the borderlines.<br>Ziv: I agree. This is one point that I always emphasize to my students, too. What happens is that as a scientist, you acquire and develop certain mathematical tools. If you stick to the original application of the tools, then you find yourself being pushed more and more into a corner which nobody is interested in. The whole idea is that once you acquire the tools, you should always ask yourself, “What is it that one can do with these tools in other fields?” At least in my case, it was always a fruitful approach. I have mentioned to you the borderline between information theory and prediction, information theory and classification. More generally, the borderline between information theory and statistics, information theory and pattern recognition, all these areas, the borderline is really where the action is, and should be, I think. And you are right. I’ve heard that being mentioned by other Marconi Fellows as well.<br>Colburn: I read somewhere recently that the Lempel-Ziv algorithm is used for music, that there are a lot of uses for it in recording of music.<br>Ziv: Yes, that’s right. Yes, definitely. When Sony-Philips came out with the first DVDs, the Lempel-Ziv algorithm is built into that. I’m pretty sure that its still is. In hardware I mean, not just software. It has been used in lots of places. Let me mention other interesting applications. For example, if you listen to a recorded piece of music and you want to classify it and find out whether it’s Mozart or Beethoven, there are some successes with that too, trying to use Lempel-Ziv as a classification means. <br>Let me tell you how it works. The idea is the following. Remember I told you earlier about developing a dictionary. If you treat a piece of music as a sequence of notes and so forth, these are letters too. You build up a dictionary. The dictionary would be typical to that particular music. If you try to use that dictionary in order to compress another piece of music which was composed by a different composer, it won’t work because it’s different. All you have to do is to check and see if that piece of music is compressible as compared with another piece of music (which was used for developing the dictionary), or not? If it compresses well, there’s a good chance it’s the same composer. If not, it’s not. And it has been used, and quite successfully. <br>Let me mention another application, which was just developed by a group of students at the Technion. Everybody is worried about communication security nowadays. Their idea was the following. Everyone has a differing typing pattern when he touches the keyboard. They were able to use a version of the Lempel-Ziv algorithm to identify persons by the way they type on the keyboard. The typing style on the keyboard apparently is like a fingerprint. I think that they’re making a product out of it. It started as a student project—not in our department, in the EE department. It was developed at the Computer Science department. You can apply it to music, you can apply it to typing styles, and so forth.<br>Colburn: Sort of like ham radio operators can recognize other operators by their Morse code “fist.”<br>Ziv: Exactly. Right.<br>Colburn: That’s very interesting. <br>Ziv: Let me give you another example. In the Bible, the book of Isaiah, there are two parts, Isaiah I and Isaiah II. The question is whether it was written by the same person or not. People, I think at Bar-Ilan University here in Israel, were using Lempel-Ziv to demonstrate that it was not written by the same person because it’s a different style. You can check if something is an original Shakespeare or not, but it takes miles of data. It’s not easy. It takes a while, because suppose the differences are in the syntax. It takes many words before you can find out something. But in principle, it’s working. Now with the big machines that we have nowadays, this is definitely feasible. <br>This is exactly related to the classification problem that I mentioned earlier. Classification is exactly that. In your memory, you have a piece of Shakespeare, a book of Shakespeare, and then I feed the computer with something which was written by somebody else. The job of the classifier, namely the computer that does the classification, is just to tell you whether it’s Shakespeare or not Shakespeare. Mathematically, it’s feasible. This I know because I have the theorems to prove it. Of course, when it comes to application, you need big machines to do that. But now we can do a lot. If not now, ten years from now.<br>By the way, if you’re interested in the Lempel-Ziv implications on fields other than pure data compression, there’s a very nice book by Bob Lucky. He’s another Marconi Fellow. He did a beautiful job. For example, I can give you another very interesting philosophical question. People are arguing about the fact as to whether we are born with the ability to understand languages, or is it something that we learn by ourselves? Noam Chomsky, for MIT, a very famous linguist, came up years ago with a theory that we are born with certain abilities when it comes to languages. He came to the conclusion that this is the case because he was able to show that there are similar structures of different languages, including music, by the way. So it doesn’t really matter whether you’re born in Africa or in the States. It doesn’t really matter what language you’re speaking. There are certain similar structures to languages. <br>Bob Lucky poses the following problem. Take all the books in the Library of Congress and apply the Lempel-Ziv algorithm to the series of books, making them one huge sequence. You’ll find out that the average length of a word in that dictionary is not too big, I think he mentioned something like 16 letters, 17 letters. But we know we are much more clever than that. After all, sometimes if I tell you two words, you can guess the whole sentence. Where is it coming from, this ability? The Lempel-Ziv model is that of a huge computer which knows nothing—all it can do is learn from its past experience. But apparently, our capability to comprehend much more than what’s contained in a word of 17 letters has to do with something which we are born with. If you wish, you can go philosophical about it and go to other areas as well.<br>Colburn: Yes, indeed. Again, another borderline.<br>Ziv: Yes, sure. <br>Colburn: Another area I wanted to ask you about was your IEEE activities and connections. I also wanted to mention to you that the Lempel-Ziv algorithm is being proposed as an IEEE milestone.<br>Ziv: Yes, I see. Yes, I know that there were some initiatives from the IEEE in Israel. Is that what you are referring to?<br>Colburn: Yes. The proposal has been submitted and it’s going to be discussed at this Sunday’s meeting. <br>Ziv: Is the purpose of the interview its relevance to the Marconi?<br>Colburn: To Marconi, yes.<br>Ziv: Okay. Because if you want to get more information about the algorithm itself, you should also interview Abraham Lempel. He is in Israel, but he time shares between Israel and HP in Palo Alto, California. He’s in charge of their research group in computer science and a few other fields worldwide. But he’s also the director of HP Labs in Israel. If you want to put the accent on that, you can check with him as well.<br>Colburn: I would like to. I hope to be able to.<br>Ziv: Yes. We collaborated very closely at least six years or so on the notion of universal coding and data compression, and also the development of the algorithm, of course. <br>Now about the IEEE. Well, I think I’m a loyal member. I did my best. I was on the board of what is now called the IEEE Information Theory Society. I was co-chairman of one international symposium, which was held here in Israel in Ashkelon. By the way, we had the most prestigious award, which is given by the Information Theory Society, called the Shannon Award. The first one to get a Shannon Award was Claude Shannon. Actually, I must confess to you that Aaron Wyner, who was co-chair of that symposium with me, and myself thought of ways how to convince Claude Shannon to come to Israel, of all places. He wasn’t that well then, too. This is how the Shannon Award came about, because this convinced him to come. He gave the first Shannon Lecture, which is the most important event until today in the international symposium on information theory. He was the first Shannon awardee. But the truth behind it was how to drag Claude Shannon from Boston to Ashkelon, Israel.<br>Colburn: And how did you do that?<br>Ziv: First of all, it was an international symposium on information theory, which switches from one place to others around the world. So he did participate in those symposiums from time to time, but at that time already, he did it very rarely. We approached him and told him that we want to establish the Shannon Lecture and the Shannon Award. Becoming a Shannon Lecturer is the highest appreciation that one can get in our society. He liked the idea and volunteered to give the first lecture, and that was it. Since then, there’s a whole line of Shannon awardees. I am glad to be one of them, and so is Aaron Wyner, but this came much later. All the big names like Bob Gallagher and Fanno and Elias are there, of course.<br>The Information Theory Society is relatively small as compared to huge societies like computer science and so forth. But it’s very family-like, I must say. It’s a very high quality society. The symposiums are really of very high quality. The rejection rates of papers is always very, very high, even today. <br>Colburn: Yes, they put out a very impressive Transactions.<br>Ziv: Yes, that’s right. I remember once a meeting in the mid-‘80s in Florida where the topic was “Is Information Theory Dead?” Look what happened since then—Internet! Now, I search on the Internet. The cellular telephony is the latest development in information theory, without which there would have been no cellular telephony, especially wide band. Everything is in there—error correction of the highest quality. <br>By the way, talking about unknown channels, this is it. When you talk about the cellular communications with all the reflections and the fadings, nobody really knows what a channel is statistically, exactly. So theoretical results find their place, finally, because things are getting more and more complicated. There’s time for everything under the sun. <br>Colburn: Did you ever imagine that the algorithm would be used, for example, transmitting pictures from Mars?<br>Ziv: No. By the way, the group of people at HP that developed it are students of Abraham Lempel and myself. But that was 25 years ago. What they did has nothing to do with the algorithm, although they got good education, I must say. The LOCO algorithm is beautiful. It’s an amazing thing. I got an email at six a.m. in the morning from that group, telling me about it. They were proud of the fact that it was working, that NASA adopted their algorithm. I was very proud of them, of course. <br>I’m not implying it’s a derivative of Lempel-Ziv. It’s not. It’s called LOCO: Lossless Image Compression. I don’t even remember the letters, but you can find it. It’s really amazing. By the way, an interesting side issue is: one might ask himself why is it that it’s important to have lossless data compression of pictures? After all, some distortion won’t matter, right? But the point is that when you go to Mars, you don’t know what to expect; and when you throw away information, you don’t know what it is that you’re losing. The same thing applied to medical information. When you compress, you want to compress tomographic pictures and X-rays and so forth. The medical doctors won’t let you discard anything because they’re afraid of being sued, of course. Somebody who might then sue them for ignoring the fact that they had cancer and they didn’t see it. <br>This is another avenue where noiseless data compression is being used heavily. Although the compression rate saving might be only 60% or so, but nevertheless, when you have to store a huge amount of data… I think that by US federal laws, you must keep the record of any US citizen which is older than 21. You are not supposed to destroy data. So where do you store it? So, if you save 60% of the data, it means not building another wing to a hospital, which is a huge amount of money. <br>Then it’s interesting that when you go to Mars, this is another place where you don’t allow yourself to throw out information. Although I’m sure that the next time they’ll go there, they’ll know more about the background pictures and so forth and they might be able to use more efficient algorithms with some distortions. When you go to something which is completely unknown to you, that’s the only thing you can do. The quality of the picture is really amazing.<br>Colburn: It is amazing, yes.<br>Ziv: I’m really surprised that there was no more coverage in the media about it. It’s an immense technological success. People are getting used to everything, I guess.<br>Colburn: I think that’s what it is. The Voyager pictures were so spectacular that after that, we have gotten used to it.<br>Ziv: That’s right. Anyhow, coming back to IEEE, I think I paid my dues to the IEEE society in many respects. The IEEE organization is really unique. I think that it’s one of the few organizations where you can really say that it’s an international organization, and where the center of activities in different areas varies from time to time from one country to the other. This is not the case in many of the other organizations. There are scientific societies, but the IEEE is an organization with interests other than just the scientific part of it. It encourages collaboration of all sorts of ways. <br>Colburn: I’ve always been very proud to work for it. Thank you so much for your time.<br>Ziv: By the way, I might be in New Jersey on the weekend of April 17th. I’m on the Technical Advisory Board of a young company named Klarion, so I’m going to a Technical Advisory Board meeting there. If needed, on Saturday I might have some time. <br>Colburn: That would be wonderful, and it would be an honor to meet you. By the way, we will be sending you a copy of the transcript of the oral history interview so you can edit it and make changes and so forth. <br>Ziv: Okay, fine. Don’t hesitate to call on me if you have more questions. <br>Colburn: Thank you very much.<br>Ziv: Okay. Fine. You’re welcome. It was nice talking to you.<br>Colburn: Thank you very, very much, again.

+

I would like to ask a number of questions, partly to go through your life history, your early education and schooling, and the things that led you to the things that you’ve done, and also to ask about the creative process in developing the algorithm.

+

+

'''Ziv:'''

+

+

Okay, let me do my best. I’m not sure I’m completely aware of what happened insofar as the creative process is concerned.

+

+

I got my primary education and secondary education in Israel. I was born in Israel. I got my first degree in Electrical Engineering at the Technion in Haifa, Israel. Technion is the Israeli Institute of Technology. It’s the oldest university in Israel and the main engineering school in Israel until now. I also got my Master’s degree at the Technion, but then I went to the United States and got my Ph.D. from MIT in Cambridge, Massachusetts.

+

+

The area of my research at MIT was information theory. Prior to going to MIT, I was an R&amp;D engineer with the Research Establishment of the Ministry of Defense in Israel. After coming back from MIT, I then joined the Technion and the Technion faculty.

+

+

'''Colburn:'''

+

+

What year was that?

+

+

'''Ziv:'''

+

+

I got my Ph.D., I believe, in 1961. I then spent a few more years at the Research Establishment of the Ministry of Defense. Then around 1970, after spending a sabbatical at the Bell Telephone Labs, which in a way was my second research home for quite a long time—I used to spend, for long periods of times, summers and sabbaticals at Bell Labs. But then in 1970, I joined Technion in Israel. Since then, this is my home university base.

+

+

While working as a communication engineer, I was fascinated by early books about information theory. It was pretty clear to me that this was what I wanted to do. Since Shannon was at MIT, it was pretty clear that that was the place to go to. Shannon is the father of information theory. And it was indeed the right choice, because at that time it was the center, the best place for that research activity in the world, I would say. Not only that it was the best place, but they were able to give you the feeling there that you are in the best place, which is almost as important. I had the benefit of working there with leading figures in information theory at the time, Jack Wozencraft and Bob Fano.

+

+

At that time, a book by [[John M. Wozencraft|Jack Wozencraft]] and [[Irwin M. Jacobs|Irwin Jacobs]], who is now, by the way, the president of Qualcomm, was the bible of communication theory. There were many others there. [[Peter Elias]] and so forth. [[Robert G. Gallager|Bob Gallager]] was a young professor there—all the people who were either then superstars in information theory or who became superstars later on. So I was lucky in making the right choice then.

+

+

Since then, I’m in this area. When I returned to Israel, as I told you, I was going back and forth during summers and a few sabbaticals to [[Bell Labs|Bell Labs]]. I collaborated a lot there with the late Aaron Wyner, who is one of the most important contributors to information theory. While at [[Bell Labs|Bell Labs]], and also at Technion, I got interested in one branch of information theory, which almost didn’t exist then. The classical theory of Shannon, which is a beautiful mathematical theory, not only was a theory, but it so happened that it was very applicable, especially for the space channel where the noise is well-behaved, namely it’s White-gaussian noise, something that everybody knew how to deal with mathematically. It wasn’t just an idealistic modeling for a communication channel; it was a real model.

+

+

The basic assumption there was that if you know what the channel is, statistically, if you know the statistics that govern the channel, namely the noise statistics and so forth, you know the probabilistic rule that governs the output given the input, then Claude Shannon came with a beautiful idea and proved it, that by doing coding, namely, using longer and longer messages, you can transmit data practically error-free, as long as the rate is below “capacity.” The only cost was delay, of course. This was a novel idea, but it was all based on the fact that you know what a channel is, probabilistically.

+

+

The natural problem is two-fold. The simpler case is that where you want to compress data, this is another idea that Shannon dealt with. If you know the statistics of the source, you can represent the source with less bits per second than the rate at which the original data is coming in. By clever mapping, which is based on the old idea by Morse, you attach a long code for a rare message or rare letter and a short one for something that is not rare because it’s being used more frequently. So, on the average, you can transmit at a lower rate. This was known, and this idea was analyzed by Huffman and Shannon and then beautifully developed by Qualcomm and so forth. But it was all based on the assumption that you know what the statistics are. I was bothered by this a-priori assumption for a long time.

+

+

=== Lempel-Ziv algorithm, Viterbi algorithm, and Ziv-Zakai bound ===

+

+

'''Ziv:'''

+

+

<flashmp3>437 - ziv - clip 1.mp3</flashmp3>

+

+

Coming back to the Lempel-Ziv algorithm you referred to earlier in your question, the first step was to try and understand how to cope with the case where you don’t know what the statistics are and you have to cleverly learn what it is while dealing in compressing the data that comes in. This finally led to a whole theory of universal data compression. Again, the algorithm that was developed jointly with [[Abraham Lempel]], here at Technion, and led to an optimal solution in which we were able to show that asymptotically - when you deal with longer and longer sequences - you actually can compress the data to the very same minimal rate that you could get even if somebody would have told you what the statistics are exactly. That’s nice.

+

+

Now, in science, you don’t have to be only clever; you have to be lucky too. And we were lucky in the sense that the tools by which we were able to prove this nice result also led to a very efficient algorithm which was linear in time. Namely, if you need to compress a sequence of, say, n letters, the number of computations that you have to do was proportional to n, rather than much more than that in other schemes, such as the Huffman code. Even in the Huffman scheme, which is very efficient, complexity is bigger, despite of the fact that you do know what the statistics are. The very fact that this mathematical tool by which we were able to prove these results led to a very efficient algorithm. This really made it so popular and famous.

+

+

Let me mention another interesting example where the very same thing happened. It’s the [[Andrew J. Viterbi|Viterbi]] algorithm, the very famous [[Andrew J. Viterbi|Viterbi]] algorithm, which is one of the cornerstones in communication systems. Again, [[Andrew J. Viterbi|Andrew Viterbi]] developed it because he was interested in proving some mathematical theorem which had to do with convolutional codes, which was a form of coding which was at that time very popular. Toward that, he derived an algorithm, which basically was a mathematical tool to prove a theorem, but turned out to be a very efficient algorithm which everybody is using. What made it famous, again, its simplicity.

+

+

When the Lempel-Ziv algorithm was developed, the practical impact was not felt because at that time, people weren’t that much interested—there was no real need for data compression for communication. It was used in the late 1970s for magnetic decoding and so forth, but insofar as communication there was no real need until, of course, when the Internet came in and there was a huge need for it. This is why now this algorithm is integrated in all the classical communication algorithms that have been used over the Internet for transmitting data error free, as well as in JPEGs and other algorithms.

+

+

In practice, you deal with data, which is usually man-made. Sometimes it’s rather foolish, even, to try to model it statistically. By the way, the theory that we developed with Abraham Lempel is able to cope with individual sequences. The results had to refer to an individual sequence, even if it’s not being generated by some probabilistic source. So the probabilistic assumption was really not needed at all. Mathematically, it was a very strong result from that respect.

+

+

This has to do with sources. I then got interested in similar problems which had to do with the case where the channel statistics is not known. What about the channel? What do you do then? I had been working on that for a number of years, and I was able to show that there are ways by which you can transmit error-free over a wide class of channels, channels with vanishing memory. Channels which do have memory, but not an infinite one, a fading memory. So for those channels, and most of real channels are like that, it was possible to show that you can get error-free transmission over the channel without a priori knowing what the channel statistics are. Not only that you can do it, but if the messages are long enough, you can get an error decay which was identical to the rate which was possible to get when you do know what the channel statistics are. So asymptotically, you again didn’t lose anything.

+

+

That was another effort that I spent a lot of energy on—and later, together with others. This led me to what I do more recently. This has to do with cases where you want to do other jobs, like for example, prediction. Suppose, based on observing a past sequence, you want to guess what the next letter will be. If you know the statistics, it’s known exactly what one should do and what the prediction error is. But what if you don’t know the statistics? The classical approach in information theory in many problems uses a technique and philosophy established by Shannon, and it’s a very important one. The idea is the following: How do you cope with a typical problem in information theory? You first try to set a converse theorem. A converse theorem is something that tells you what you cannot ever do, by any means. For example, if we talk about transmission over a channel, you can show that if the rate of information is above a certain value, namely capacity, no one in the world, no algorithms can yield a probability of error which vanishes in any way.

+

+

The same can apply to prediction. You can show that there is a lower bound below which the prediction error must be big. And then there is something which is sometimes called a coding theorem, where you derive an algorithm by which you can demonstrate that you can come very close to the lower bound, which is guaranteed by the converse theorem. If the two bounds come close to each other, then you know that you are done, because you know that you have something which is optimal or as good as the optimal solution, or very close to the optimal solution. The same technique has been used not only for data compression and for transmission in an error-free fashion over an unknown channel, but also for the case where you want to do prediction, and also classification.

+

+

Classification is another interesting case. Suppose I give you a full description of a certain source of information probabilistically. You know the full probabilistic description of that source. Now somebody comes and hands you a sequence, and you have to guess or judge whether this sequence is emerging from a source, which is different from this particular source or it’s not, if it comes from something which is different from that source, namely a different source with a different probabilistic orientation, according to some distance criterion, you should be able to tell that they are different. Again, if you have to judge between two known sources, you know exactly how to do good classification. By classification, I mean you have to declare whether it’s coming from Source A or Source B and you want to minimize the probability of error. But what if you don’t know anything about the source that is generating the sequence that you have to decide about? Then again, there is an optimal solution. The technique is that classical approach of information theory that basically all information theorists are using since Shannon, and that is setting a converse theorem which tells you what you cannot do. Then, if you are lucky, you get an actual algorithm that approaches that lower bound. Then you are happy, because you know that nobody can beat you with that, and I was lucky.

+

+

This is the main thrust of my work. There are other interesting results which have to do with the borderline between problems in communication theory and the estimations and so forth. There was a nice result that Professor Mosche Zakai from Technion and myself developed. It’s called the Ziv-Zakai Bound. It has to do with an interesting relation between the minimal probability of error which is associated with communication, when you want to transmit letters via a noisy channel, and the estimation error, where you want to estimate a certain value of the message—not to decide whether it’s A or B, but to decide what the analog value of the message is. It’s a happy marriage between two different areas in communication. Basically one which has to do with analog communication, the other one with digital communication. This bound is very useful for a variety of problems in communication, in radar theory, and so forth.

+

+

There are many other results, but I would say that main thrust of my work, what I did and what I was trying to do throughout the years, is to get to develop an information theory which is universal in the sense that you don’t really have to know the exact statistics of either the source or the channel. This is, I think, a short summary of what I did.

+

+

'''Colburn:'''

+

+

What are some of the tools that you used to do this in terms of computer machinery? When you were working on this, what sort of things did you use, and how did you test it?

+

+

'''Ziv:'''

+

+

Let’s go back. The best example is, again, that of the Lempel-Ziv algorithm. It had to do with the following: Suppose you meet a long sequence in the street, and you don’t know what it is, you don’t know who generated it. You want to compress it in the best way that’s possible. Is that a silly question or not? It turned out that if you limit the class of computers or machines that are going to compress this data to a class with finite memory, then theoretically, there is an answer. Not every sequence can be compressed. And actually we demonstrated that there exist some sequences which are not compressible at all by any machine with limited memory. By the way, this is important also for cryptography, because when you need good keys for cryptographic purposes, this is exactly what you need, something which will look really random. If something is really random, it means that it’s incompressible.

+

+

Anyhow, once we tried to solve that and we did solve that, then we realized there is a nice algorithm that goes with it. Then we published a second paper, which stressed the algorithmic meaning or the algorithmic usefulness of that method. Then we tried to play with it ourselves by doing some simulation. All we had at that time was small PCs. Even PCs were not around; they weren’t too popular then. It was in the mid-1970s. Then what happened was that both Professor Abraham Lempel and myself went for a sabbatical. I went to [[Bell Labs|Bell Labs]] and he went to what used to be then Sperry-Rand Research in Boston. Then we could do much more, because I then was benefited by the help of some excellent programmers at [[Bell Labs|Bell Labs]]; he did the same at Sperry. Although Abraham Lempel himself is a computer scientist—he knew more about programming than I did. We kept on working together. Then we came with a second version, which is much more efficient. And then the first patent was issued by Sperry-Rand on that. This patent now belongs to Unisys. There were many disputes about this patent.

+

+

Even before going to a sabbatical, we tried to convince a company in Israel to start a project, because we were aware of the fact that it’s useful commercially, too, even in the late 1970s, because in Germany, unlike in the States, at that time, they charged you by the number of bits rather than by the duration. This means that if you do compression, you pay less. So there was a really good reason for trying to make a commercial product out of it. We set up together with a company in Israel, a small R&amp;D group. But then this company collapsed because of other reasons. So nothing came out of it.

+

+

Anyhow, we were aware of the commercial importance of that aside from the theoretical part of it. But we were really hoping that either [[Bell Labs|Bell Labs]] or Sperry-Rand or both would do something about it. I must say that it was very difficult. I was at [[Bell Labs|Bell Labs]] and when I tried to convince them to issue a patent, they said that as long as Sperry-Rand is doing it, it’s okay with them, because they had some kind of a patent agreement between the two companies. At that time, there was another good reason. At that time it wasn’t clear—the law was changed later—it wasn’t clear whether you could actually get a patent on software. It had to be a real piece of hardware for you to be able to get a patent. Sperry-Rand did propose a hardware realization and applied and got a patent, and then a second patent.

+

+

Later on, there were other derivatives of this algorithm. Abraham Lempel spent some time at that time at HP, so he continued to develop it there. But then other companies took over, Microsoft and other companies, because at Microsoft became part of the operating system. Since then, both of us didn’t really follow up all the details of the different derivatives of that algorithm.

+

+

The nice thing about this algorithm, by the way, is that it’s so simple that it’s very easy for somebody to get his own version of this algorithm. Everybody is aware of the fact of the phenomenon of “not invented here.” It is so simple an algorithm that it’s also very easy to change it slightly and make it more optimal for a specific use. This is why many people took over. Now there’s a whole family of the LZ algorithm. Even nowadays, it’s the most popular and most common data compression algorithm that people are using for noiseless data compression. By noiseless, I mean this is the case where you don’t allow for any errors in the data compression.

+

+

But this is not the only noiseless compression algorithm. A very excellent example is the local algorithm, which was developed at HP and which was used in the Mars probe recently. This is to transmit these beautiful pictures from Mars. So by now, there are other algorithms, but they are much more complicated computationally. You have to pay the price—in the case of NASA, where every bit that you save is worth a million. For every day use, the old LZ algorithm is still the most popular one because it’s so simple, and very fast too. But indeed, I must emphasize the fact that it’s not the only one. There are other algorithms that have been developed since then. This gave a real thrust to the whole area of universal data compression.

+

+

I can mention other well-known algorithms for data compression: the PPM algorithm, which is mostly used when you want to compress huge archives and so forth, CTW and so forth. But they are all based on the same philosophy, as that of the Lempel-Ziv algorithm. The idea is to establish a dictionary or a list of things that you have observed already. In a sense it’s like a TV Guide. If you take a TV Guide and there’s an old movie that they show three or four times in the same week, they don’t print the full description of that movie again. They just use a pointer that tells you to go to two or three pages earlier and look for it there. Once you have observed something (and I’ll tell you a little more about what this something is), then you list it as an entry in the dictionary. And when you observe it again, you don’t have to copy it. All you have to do is to send the pointer and tell the other side where to copy from—the dictionary which both sides have. The whole idea is that this dictionary is being developed dynamically by both the sender and the receiver as it goes.

+

+

How you do it is simply the following. When you have a sequence, and let’s say that this sequence starts with a zero. You have never seen a zero before, so this is your first entry into your dictionary. Now let’s say that the second letter is a zero. Then you say, “Hey, wait a minute. I saw that already, so I don’t have to copy it.” Although in the beginning a zero can be represented by a small number of bits; it doesn’t really matter whether you copy it or not. It’s just one letter. Let’s say that after the second zero, there’s a one. So you say, “Aha! Zero-one is something that I’ve never seen in the past, so let me put that in the dictionary.” Slowly, you are generating a dictionary which consists of phrases of the data that are longer and longer. If such a phrase, sometimes it’s very long, repeats itself, you don’t have to copy it; all you have to do is to send the pointer and tell the other side where to copy this phrase from. That’s the whole idea. I think that the TV Guide explanation is the right one. The idea is generating dynamically a dictionary and pointers. You transmit the pointers, not the information.

+

+

The other algorithms that were mentioned—PPM, CTW—they’re all based on that idea. The new algorithm, which has been used in the Mars probe, is the LOCO-I algorithm, which is a very clever idea. But it’s only for the noiseless compression of pictures. If you want to transmit pictures noiselessly, then this is a very efficient algorithm.

+

+

There is another branch of data compression, which is called noisy data compression. In most cases, you allow yourself to introduce distortion. If you are pressed on rate, if you want to minimize the number of bits, some distortion may be acceptable to you. This calls for a different kind of data compression schemes, although sometimes a building block in such compression schemes is again the Lempel-Ziv algorithm. You first introduce distortion into the picture and compress is a little bit, and then go on and use the Lempel-Ziv algorithm.

+

+

Coming back to your question, I myself am not an expert in computer programming. But again, there are so many versions that there were enough jobs for many good programmers that came up with a variety of improvements.

+

+

==== Applications ====

+

+

'''Colburn:'''

+

+

You mentioned that at first, people didn’t see the application. What was the first time that you heard of or that you knew of, the first application that somebody had taken it and used it?

+

+

'''Ziv:'''

+

+

<flashmp3>437 - ziv - clip 2.mp3</flashmp3>

+

+

On the Net, people were aware of it and starting to use it for their own purposes. That’s what I did, too. At that time, memory was expensive. Currently, you have a huge memory even in a hand-held computer. But at that time, memory was expensive, so you had no choice but to compress if you wanted to store a lot of data. I was using it for my own purposes, and many other people started to use it for the same purpose. People were aware of it mainly in information theory, and then computer science discovered it; also some versions of it were available on the Net and were free for everybody to use for data storage purposes.

+

+

But the big push came when people were starting to use data compression on the Internet. There was no other choice because of the huge traffic. People just had to use it. This is when Sperry-Rand was bought by Unisys. Unisys, for years, didn’t tell anybody that they had that patent. Only when everybody was using it for free that they started to get after people, charging them. There was a big fuss about it. Finally, there was some sort of an agreement when the Lempel-Ziv algorithm was declared to be part of the approved standard of the FCC, v. 42-bis. Anyhow, when they adopted that as a standard, they forced Unisys to agree to a reasonable arrangement where any vendor of modems which wanted to use it, or had to use it as part of its modem, was paying an initial single payment of I think $25,000 dollars to Unisys and that was it.

+

+

For a while, everything was quiet. But then there was a second phase where it became part of GIF. Are you yourself familiar with it?

+

+

'''Colburn:'''

+

+

I’ve been doing a fair amount of reading, as much reading as I could.

+

+

'''Ziv:'''

+

+

Oh, okay. GZIP is a free version of the Lempel-Ziv algorithm, which is very good. People were threatening Unisys that if they won’t stop running after people and suing them or charging them, they’ll just switch to a new standard. I think that at that time they gave up. I’m not that familiar with the details, or what’s going on today. But definitely, when the World Wide Web became popular, this gave a huge push to the Lempel-Ziv algorithm, because it became an integral part of the algorithm of the World Wide Web. Without data compression, it was impossible to do anything at all at that time, and even nowadays. It’s amazing. The more bandwidth you get—you can always feel that there’ll be no problems. But then there’s so much more data being generated that data compression is always needed and always will be needed. The same applies to computers. You have a huge amount of storage space, but still, it’s being filled up very quickly, as you know. And Microsoft is not doing its best to make things efficiently, so this helps the data compression business.

+

+

So this is it, in a nutshell, the more applicable side of it. But I’m not really an expert on the most recent versions and applications of that algorithm and the other algorithms. Basically, I’m interested in information theory, although my basic education is that of a communication engineer, and I was involved with R&amp;D of real communication systems. But I’ve drifted into doing more and more theoretical work, which again, became applicable.

+

+

'''Colburn:'''

+

+

One of the things that the other [[Oral-History:Marconi Fellows Oral Histories|Marconi Fellows]] have talked about is that a lot of great ideas occur at the borderlines.

+

+

'''Ziv:'''

+

+

I agree. This is one point that I always emphasize to my students, too. What happens is that as a scientist, you acquire and develop certain mathematical tools. If you stick to the original application of the tools, then you find yourself being pushed more and more into a corner which nobody is interested in. The whole idea is that once you acquire the tools, you should always ask yourself, “What is it that one can do with these tools in other fields?” At least in my case, it was always a fruitful approach. I have mentioned to you the borderline between information theory and prediction, information theory and classification. More generally, the borderline between information theory and statistics, information theory and pattern recognition, all these areas, the borderline is really where the action is, and should be, I think. And you are right. I’ve heard that being mentioned by other Marconi Fellows as well.

+

+

'''Colburn:'''

+

+

I read somewhere recently that the Lempel-Ziv algorithm is used for music, that there are a lot of uses for it in recording of music.

+

+

'''Ziv:'''

+

+

Yes, that’s right. Yes, definitely. When Sony-Philips came out with the first DVDs, the Lempel-Ziv algorithm is built into that. I’m pretty sure that its still is. In hardware I mean, not just software. It has been used in lots of places. Let me mention other interesting applications. For example, if you listen to a recorded piece of music and you want to classify it and find out whether it’s Mozart or Beethoven, there are some successes with that too, trying to use Lempel-Ziv as a classification means.

+

+

Let me tell you how it works. The idea is the following. Remember I told you earlier about developing a dictionary. If you treat a piece of music as a sequence of notes and so forth, these are letters too. You build up a dictionary. The dictionary would be typical to that particular music. If you try to use that dictionary in order to compress another piece of music which was composed by a different composer, it won’t work because it’s different. All you have to do is to check and see if that piece of music is compressible as compared with another piece of music (which was used for developing the dictionary), or not? If it compresses well, there’s a good chance it’s the same composer. If not, it’s not. And it has been used, and quite successfully.

+

+

Let me mention another application, which was just developed by a group of students at the Technion. Everybody is worried about communication security nowadays. Their idea was the following. Everyone has a differing typing pattern when he touches the keyboard. They were able to use a version of the Lempel-Ziv algorithm to identify persons by the way they type on the keyboard. The typing style on the keyboard apparently is like a fingerprint. I think that they’re making a product out of it. It started as a student project—not in our department, in the EE department. It was developed at the Computer Science department. You can apply it to music, you can apply it to typing styles, and so forth.

+

+

'''Colburn:'''

+

+

Sort of like [[Amateur Radio|ham radio]] operators can recognize other operators by their [[Morse Code|Morse code]] “fist.”

+

+

'''Ziv:'''

+

+

Exactly. Right.

+

+

'''Colburn:'''

+

+

That’s very interesting.

+

+

'''Ziv:'''

+

+

Let me give you another example. In the Bible, the book of Isaiah, there are two parts, Isaiah I and Isaiah II. The question is whether it was written by the same person or not. People, I think at Bar-Ilan University here in Israel, were using Lempel-Ziv to demonstrate that it was not written by the same person because it’s a different style. You can check if something is an original Shakespeare or not, but it takes miles of data. It’s not easy. It takes a while, because suppose the differences are in the syntax. It takes many words before you can find out something. But in principle, it’s working. Now with the big machines that we have nowadays, this is definitely feasible.

+

+

This is exactly related to the classification problem that I mentioned earlier. Classification is exactly that. In your memory, you have a piece of Shakespeare, a book of Shakespeare, and then I feed the computer with something which was written by somebody else. The job of the classifier, namely the computer that does the classification, is just to tell you whether it’s Shakespeare or not Shakespeare. Mathematically, it’s feasible. This I know because I have the theorems to prove it. Of course, when it comes to application, you need big machines to do that. But now we can do a lot. If not now, ten years from now.

+

+

By the way, if you’re interested in the Lempel-Ziv implications on fields other than pure data compression, there’s a very nice book by [[Oral-History:Robert Lucky|Bob Lucky]]. He’s another Marconi Fellow. He did a beautiful job. For example, I can give you another very interesting philosophical question. People are arguing about the fact as to whether we are born with the ability to understand languages, or is it something that we learn by ourselves? Noam Chomsky, for MIT, a very famous linguist, came up years ago with a theory that we are born with certain abilities when it comes to languages. He came to the conclusion that this is the case because he was able to show that there are similar structures of different languages, including music, by the way. So it doesn’t really matter whether you’re born in Africa or in the States. It doesn’t really matter what language you’re speaking. There are certain similar structures to languages.

+

+

Bob Lucky poses the following problem. Take all the books in the Library of Congress and apply the Lempel-Ziv algorithm to the series of books, making them one huge sequence. You’ll find out that the average length of a word in that dictionary is not too big, I think he mentioned something like 16 letters, 17 letters. But we know we are much more clever than that. After all, sometimes if I tell you two words, you can guess the whole sentence. Where is it coming from, this ability? The Lempel-Ziv model is that of a huge computer which knows nothing—all it can do is learn from its past experience. But apparently, our capability to comprehend much more than what’s contained in a word of 17 letters has to do with something which we are born with. If you wish, you can go philosophical about it and go to other areas as well.

+

+

'''Colburn:'''

+

+

Yes, indeed. Again, another borderline.

+

+

'''Ziv:'''

+

+

Yes, sure.

+

+

=== IEEE ===

+

+

'''Colburn:'''

+

+

Another area I wanted to ask you about was your IEEE activities and connections. I also wanted to mention to you that the Lempel-Ziv algorithm is being proposed as an IEEE milestone.

+

+

'''Ziv:'''

+

+

Yes, I see. Yes, I know that there were some initiatives from the IEEE in Israel. Is that what you are referring to?

+

+

'''Colburn:'''

+

+

Yes. The proposal has been submitted and it’s going to be discussed at this Sunday’s meeting.

+

+

'''Ziv:'''

+

+

Is the purpose of the interview its relevance to the Marconi?

+

+

'''Colburn:'''

+

+

To Marconi, yes.

+

+

'''Ziv:'''

+

+

Okay. Because if you want to get more information about the algorithm itself, you should also interview Abraham Lempel. He is in Israel, but he time shares between Israel and HP in Palo Alto, California. He’s in charge of their research group in computer science and a few other fields worldwide. But he’s also the director of HP Labs in Israel. If you want to put the accent on that, you can check with him as well.

+

+

'''Colburn:'''

+

+

I would like to. I hope to be able to.

+

+

'''Ziv:'''

+

+

Yes. We collaborated very closely at least six years or so on the notion of universal coding and data compression, and also the development of the algorithm, of course.

+

+

Now about the IEEE. Well, I think I’m a loyal member. I did my best. I was on the board of what is now called the [[IEEE Information Theory Society History|IEEE Information Theory Society]]. I was co-chairman of one international symposium, which was held here in Israel in Ashkelon. By the way, we had the most prestigious award, which is given by the Information Theory Society, called the Shannon Award. The first one to get a Shannon Award was [[Claude Shannon|Claude Shannon]]. Actually, I must confess to you that Aaron Wyner, who was co-chair of that symposium with me, and myself thought of ways how to convince [[Claude Shannon|Claude Shannon]] to come to Israel, of all places. He wasn’t that well then, too. This is how the Shannon Award came about, because this convinced him to come. He gave the first Shannon Lecture, which is the most important event until today in the international symposium on information theory. He was the first Shannon awardee. But the truth behind it was how to drag Claude Shannon from Boston to Ashkelon, Israel.

+

+

'''Colburn:'''

+

+

And how did you do that?

+

+

'''Ziv:'''

+

+

First of all, it was an international symposium on information theory, which switches from one place to others around the world. So he did participate in those symposiums from time to time, but at that time already, he did it very rarely. We approached him and told him that we want to establish the Shannon Lecture and the Shannon Award. Becoming a Shannon Lecturer is the highest appreciation that one can get in our society. He liked the idea and volunteered to give the first lecture, and that was it. Since then, there’s a whole line of Shannon awardees. I am glad to be one of them, and so is Aaron Wyner, but this came much later. All the big names like Bob Gallagher and Fanno and Elias are there, of course.

+

+

The Information Theory Society is relatively small as compared to huge societies like computer science and so forth. But it’s very family-like, I must say. It’s a very high quality society. The symposiums are really of very high quality. The rejection rates of papers is always very, very high, even today.

+

+

'''Colburn:'''

+

+

Yes, they put out a very impressive Transactions.

+

+

'''Ziv:'''

+

+

Yes, that’s right. I remember once a meeting in the mid-‘80s in Florida where the topic was “Is Information Theory Dead?” Look what happened since then—Internet! Now, I search on the Internet. The cellular telephony is the latest development in information theory, without which there would have been no cellular telephony, especially wide band. Everything is in there—error correction of the highest quality.

+

+

By the way, talking about unknown channels, this is it. When you talk about the cellular communications with all the reflections and the fadings, nobody really knows what a channel is statistically, exactly. So theoretical results find their place, finally, because things are getting more and more complicated. There’s time for everything under the sun.

+

+

'''Colburn: ''' <br>

+

+

Did you ever imagine that the algorithm would be used, for example, transmitting pictures from Mars?

+

+

'''Ziv:'''

+

+

No. By the way, the group of people at HP that developed it are students of Abraham Lempel and myself. But that was 25 years ago. What they did has nothing to do with the algorithm, although they got good education, I must say. The LOCO algorithm is beautiful. It’s an amazing thing. I got an email at six a.m. in the morning from that group, telling me about it. They were proud of the fact that it was working, that NASA adopted their algorithm. I was very proud of them, of course.

+

+

I’m not implying it’s a derivative of Lempel-Ziv. It’s not. It’s called LOCO: Lossless Image Compression. I don’t even remember the letters, but you can find it. It’s really amazing. By the way, an interesting side issue is: one might ask himself why is it that it’s important to have lossless data compression of pictures? After all, some distortion won’t matter, right? But the point is that when you go to Mars, you don’t know what to expect; and when you throw away information, you don’t know what it is that you’re losing. The same thing applied to medical information. When you compress, you want to compress tomographic pictures and X-rays and so forth. The medical doctors won’t let you discard anything because they’re afraid of being sued, of course. Somebody who might then sue them for ignoring the fact that they had cancer and they didn’t see it.

+

+

This is another avenue where noiseless data compression is being used heavily. Although the compression rate saving might be only 60% or so, but nevertheless, when you have to store a huge amount of data… I think that by US federal laws, you must keep the record of any US citizen which is older than 21. You are not supposed to destroy data. So where do you store it? So, if you save 60% of the data, it means not building another wing to a hospital, which is a huge amount of money.

+

+

Then it’s interesting that when you go to Mars, this is another place where you don’t allow yourself to throw out information. Although I’m sure that the next time they’ll go there, they’ll know more about the background pictures and so forth and they might be able to use more efficient algorithms with some distortions. When you go to something which is completely unknown to you, that’s the only thing you can do. The quality of the picture is really amazing.

+

+

'''Colburn:'''

+

+

It is amazing, yes.

+

+

'''Ziv:'''

+

+

I’m really surprised that there was no more coverage in the media about it. It’s an immense technological success. People are getting used to everything, I guess.

+

+

'''Colburn:'''

+

+

I think that’s what it is. The [[Pioneer and Voyager Missions|Voyager]] pictures were so spectacular that after that, we have gotten used to it.

+

+

'''Ziv:'''

+

+

That’s right. Anyhow, coming back to IEEE, I think I paid my dues to the IEEE society in many respects. The IEEE organization is really unique. I think that it’s one of the few organizations where you can really say that it’s an international organization, and where the center of activities in different areas varies from time to time from one country to the other. This is not the case in many of the other organizations. There are scientific societies, but the IEEE is an organization with interests other than just the scientific part of it. It encourages collaboration of all sorts of ways.

+

+

'''Colburn:'''

+

+

I’ve always been very proud to work for it. Thank you so much for your time.

+

+

'''Ziv:'''

+

+

By the way, I might be in New Jersey on the weekend of April 17th. I’m on the Technical Advisory Board of a young company named Klarion, so I’m going to a Technical Advisory Board meeting there. If needed, on Saturday I might have some time.

+

+

'''Colburn:'''

+

+

That would be wonderful, and it would be an honor to meet you. By the way, we will be sending you a copy of the transcript of the oral history interview so you can edit it and make changes and so forth.

Jacob Ziv was educated at the Technion (Israeli Institute of Technology) in Haifa, Israel, and earned his doctorate in information theory at MIT in 1961. He worked for Bell Telephone Labs and the Israeli Ministry of Defense in addition to teaching at the Technion. The Lempel-Ziv algorithm was in many ways ahead of its time, and was not fully utilized by the commercial sector until later technological developments such as the Internet and cellular telephony.

The interview begins with Ziv’s education and work experience. Ziv then discusses MIT’s program in information theory during the time he was a student there, and the important intellectual influences at MIT (particularly Claude Shannon) that shaped his thinking. He goes on to describe the workings of the Lempel-Ziv algorithm and other data compression algorithms, as well as their practical applications in various fields, such as the Internet, cellular telephony, and the transmission of photographs from Mars. The interview concludes with a discussion of Ziv’s activities with the IEEE and the Information Theory Society and his thoughts on the IEEE as a unique organization.

About the Interview

Jacob Ziv: An interview conducted by Robert Colburn, IEEE History Center, 25 March 2004.

Interview # 437 for the IEEE History Center, The Institute of Electrical and Electronics Engineers, Inc.

Copyright Statement

This manuscript is being made available for research purposes only. All literary rights in the manuscript, including the right to publish, are reserved to the IEEE History Center. No part of the manuscript may be quoted for publication without the written permission of the Director of IEEE History Center.

Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, 39 Union Street, New Brunswick, NJ 08901-8538 USA. It should include identification of the specific passages to be quoted, anticipated use of the passages, and identification of the user.

It is recommended that this oral history be cited as follows:

Ziv, Jacob, an oral history conducted in 2004 by Robert Colburn, IEEE History Center, New Brunswick, NJ, USA.

Interview

Interview: Jacob Ziv

Interviewer: Robert Colburn

Date: 25 March 2004

Place: Telephone Interview, Tel Aviv, Israel

Life History, Graduate Work at MIT, and Career Path

Colburn:

I would like to ask a number of questions, partly to go through your life history, your early education and schooling, and the things that led you to the things that you’ve done, and also to ask about the creative process in developing the algorithm.

Ziv:

Okay, let me do my best. I’m not sure I’m completely aware of what happened insofar as the creative process is concerned.

I got my primary education and secondary education in Israel. I was born in Israel. I got my first degree in Electrical Engineering at the Technion in Haifa, Israel. Technion is the Israeli Institute of Technology. It’s the oldest university in Israel and the main engineering school in Israel until now. I also got my Master’s degree at the Technion, but then I went to the United States and got my Ph.D. from MIT in Cambridge, Massachusetts.

The area of my research at MIT was information theory. Prior to going to MIT, I was an R&D engineer with the Research Establishment of the Ministry of Defense in Israel. After coming back from MIT, I then joined the Technion and the Technion faculty.

Colburn:

What year was that?

Ziv:

I got my Ph.D., I believe, in 1961. I then spent a few more years at the Research Establishment of the Ministry of Defense. Then around 1970, after spending a sabbatical at the Bell Telephone Labs, which in a way was my second research home for quite a long time—I used to spend, for long periods of times, summers and sabbaticals at Bell Labs. But then in 1970, I joined Technion in Israel. Since then, this is my home university base.

While working as a communication engineer, I was fascinated by early books about information theory. It was pretty clear to me that this was what I wanted to do. Since Shannon was at MIT, it was pretty clear that that was the place to go to. Shannon is the father of information theory. And it was indeed the right choice, because at that time it was the center, the best place for that research activity in the world, I would say. Not only that it was the best place, but they were able to give you the feeling there that you are in the best place, which is almost as important. I had the benefit of working there with leading figures in information theory at the time, Jack Wozencraft and Bob Fano.

At that time, a book by Jack Wozencraft and Irwin Jacobs, who is now, by the way, the president of Qualcomm, was the bible of communication theory. There were many others there. Peter Elias and so forth. Bob Gallager was a young professor there—all the people who were either then superstars in information theory or who became superstars later on. So I was lucky in making the right choice then.

Since then, I’m in this area. When I returned to Israel, as I told you, I was going back and forth during summers and a few sabbaticals to Bell Labs. I collaborated a lot there with the late Aaron Wyner, who is one of the most important contributors to information theory. While at Bell Labs, and also at Technion, I got interested in one branch of information theory, which almost didn’t exist then. The classical theory of Shannon, which is a beautiful mathematical theory, not only was a theory, but it so happened that it was very applicable, especially for the space channel where the noise is well-behaved, namely it’s White-gaussian noise, something that everybody knew how to deal with mathematically. It wasn’t just an idealistic modeling for a communication channel; it was a real model.

The basic assumption there was that if you know what the channel is, statistically, if you know the statistics that govern the channel, namely the noise statistics and so forth, you know the probabilistic rule that governs the output given the input, then Claude Shannon came with a beautiful idea and proved it, that by doing coding, namely, using longer and longer messages, you can transmit data practically error-free, as long as the rate is below “capacity.” The only cost was delay, of course. This was a novel idea, but it was all based on the fact that you know what a channel is, probabilistically.

The natural problem is two-fold. The simpler case is that where you want to compress data, this is another idea that Shannon dealt with. If you know the statistics of the source, you can represent the source with less bits per second than the rate at which the original data is coming in. By clever mapping, which is based on the old idea by Morse, you attach a long code for a rare message or rare letter and a short one for something that is not rare because it’s being used more frequently. So, on the average, you can transmit at a lower rate. This was known, and this idea was analyzed by Huffman and Shannon and then beautifully developed by Qualcomm and so forth. But it was all based on the assumption that you know what the statistics are. I was bothered by this a-priori assumption for a long time.

Lempel-Ziv algorithm, Viterbi algorithm, and Ziv-Zakai bound

Ziv:

Coming back to the Lempel-Ziv algorithm you referred to earlier in your question, the first step was to try and understand how to cope with the case where you don’t know what the statistics are and you have to cleverly learn what it is while dealing in compressing the data that comes in. This finally led to a whole theory of universal data compression. Again, the algorithm that was developed jointly with Abraham Lempel, here at Technion, and led to an optimal solution in which we were able to show that asymptotically - when you deal with longer and longer sequences - you actually can compress the data to the very same minimal rate that you could get even if somebody would have told you what the statistics are exactly. That’s nice.

Now, in science, you don’t have to be only clever; you have to be lucky too. And we were lucky in the sense that the tools by which we were able to prove this nice result also led to a very efficient algorithm which was linear in time. Namely, if you need to compress a sequence of, say, n letters, the number of computations that you have to do was proportional to n, rather than much more than that in other schemes, such as the Huffman code. Even in the Huffman scheme, which is very efficient, complexity is bigger, despite of the fact that you do know what the statistics are. The very fact that this mathematical tool by which we were able to prove these results led to a very efficient algorithm. This really made it so popular and famous.

Let me mention another interesting example where the very same thing happened. It’s the Viterbi algorithm, the very famous Viterbi algorithm, which is one of the cornerstones in communication systems. Again, Andrew Viterbi developed it because he was interested in proving some mathematical theorem which had to do with convolutional codes, which was a form of coding which was at that time very popular. Toward that, he derived an algorithm, which basically was a mathematical tool to prove a theorem, but turned out to be a very efficient algorithm which everybody is using. What made it famous, again, its simplicity.

When the Lempel-Ziv algorithm was developed, the practical impact was not felt because at that time, people weren’t that much interested—there was no real need for data compression for communication. It was used in the late 1970s for magnetic decoding and so forth, but insofar as communication there was no real need until, of course, when the Internet came in and there was a huge need for it. This is why now this algorithm is integrated in all the classical communication algorithms that have been used over the Internet for transmitting data error free, as well as in JPEGs and other algorithms.

In practice, you deal with data, which is usually man-made. Sometimes it’s rather foolish, even, to try to model it statistically. By the way, the theory that we developed with Abraham Lempel is able to cope with individual sequences. The results had to refer to an individual sequence, even if it’s not being generated by some probabilistic source. So the probabilistic assumption was really not needed at all. Mathematically, it was a very strong result from that respect.

This has to do with sources. I then got interested in similar problems which had to do with the case where the channel statistics is not known. What about the channel? What do you do then? I had been working on that for a number of years, and I was able to show that there are ways by which you can transmit error-free over a wide class of channels, channels with vanishing memory. Channels which do have memory, but not an infinite one, a fading memory. So for those channels, and most of real channels are like that, it was possible to show that you can get error-free transmission over the channel without a priori knowing what the channel statistics are. Not only that you can do it, but if the messages are long enough, you can get an error decay which was identical to the rate which was possible to get when you do know what the channel statistics are. So asymptotically, you again didn’t lose anything.

That was another effort that I spent a lot of energy on—and later, together with others. This led me to what I do more recently. This has to do with cases where you want to do other jobs, like for example, prediction. Suppose, based on observing a past sequence, you want to guess what the next letter will be. If you know the statistics, it’s known exactly what one should do and what the prediction error is. But what if you don’t know the statistics? The classical approach in information theory in many problems uses a technique and philosophy established by Shannon, and it’s a very important one. The idea is the following: How do you cope with a typical problem in information theory? You first try to set a converse theorem. A converse theorem is something that tells you what you cannot ever do, by any means. For example, if we talk about transmission over a channel, you can show that if the rate of information is above a certain value, namely capacity, no one in the world, no algorithms can yield a probability of error which vanishes in any way.

The same can apply to prediction. You can show that there is a lower bound below which the prediction error must be big. And then there is something which is sometimes called a coding theorem, where you derive an algorithm by which you can demonstrate that you can come very close to the lower bound, which is guaranteed by the converse theorem. If the two bounds come close to each other, then you know that you are done, because you know that you have something which is optimal or as good as the optimal solution, or very close to the optimal solution. The same technique has been used not only for data compression and for transmission in an error-free fashion over an unknown channel, but also for the case where you want to do prediction, and also classification.

Classification is another interesting case. Suppose I give you a full description of a certain source of information probabilistically. You know the full probabilistic description of that source. Now somebody comes and hands you a sequence, and you have to guess or judge whether this sequence is emerging from a source, which is different from this particular source or it’s not, if it comes from something which is different from that source, namely a different source with a different probabilistic orientation, according to some distance criterion, you should be able to tell that they are different. Again, if you have to judge between two known sources, you know exactly how to do good classification. By classification, I mean you have to declare whether it’s coming from Source A or Source B and you want to minimize the probability of error. But what if you don’t know anything about the source that is generating the sequence that you have to decide about? Then again, there is an optimal solution. The technique is that classical approach of information theory that basically all information theorists are using since Shannon, and that is setting a converse theorem which tells you what you cannot do. Then, if you are lucky, you get an actual algorithm that approaches that lower bound. Then you are happy, because you know that nobody can beat you with that, and I was lucky.

This is the main thrust of my work. There are other interesting results which have to do with the borderline between problems in communication theory and the estimations and so forth. There was a nice result that Professor Mosche Zakai from Technion and myself developed. It’s called the Ziv-Zakai Bound. It has to do with an interesting relation between the minimal probability of error which is associated with communication, when you want to transmit letters via a noisy channel, and the estimation error, where you want to estimate a certain value of the message—not to decide whether it’s A or B, but to decide what the analog value of the message is. It’s a happy marriage between two different areas in communication. Basically one which has to do with analog communication, the other one with digital communication. This bound is very useful for a variety of problems in communication, in radar theory, and so forth.

There are many other results, but I would say that main thrust of my work, what I did and what I was trying to do throughout the years, is to get to develop an information theory which is universal in the sense that you don’t really have to know the exact statistics of either the source or the channel. This is, I think, a short summary of what I did.

Colburn:

What are some of the tools that you used to do this in terms of computer machinery? When you were working on this, what sort of things did you use, and how did you test it?

Ziv:

Let’s go back. The best example is, again, that of the Lempel-Ziv algorithm. It had to do with the following: Suppose you meet a long sequence in the street, and you don’t know what it is, you don’t know who generated it. You want to compress it in the best way that’s possible. Is that a silly question or not? It turned out that if you limit the class of computers or machines that are going to compress this data to a class with finite memory, then theoretically, there is an answer. Not every sequence can be compressed. And actually we demonstrated that there exist some sequences which are not compressible at all by any machine with limited memory. By the way, this is important also for cryptography, because when you need good keys for cryptographic purposes, this is exactly what you need, something which will look really random. If something is really random, it means that it’s incompressible.

Anyhow, once we tried to solve that and we did solve that, then we realized there is a nice algorithm that goes with it. Then we published a second paper, which stressed the algorithmic meaning or the algorithmic usefulness of that method. Then we tried to play with it ourselves by doing some simulation. All we had at that time was small PCs. Even PCs were not around; they weren’t too popular then. It was in the mid-1970s. Then what happened was that both Professor Abraham Lempel and myself went for a sabbatical. I went to Bell Labs and he went to what used to be then Sperry-Rand Research in Boston. Then we could do much more, because I then was benefited by the help of some excellent programmers at Bell Labs; he did the same at Sperry. Although Abraham Lempel himself is a computer scientist—he knew more about programming than I did. We kept on working together. Then we came with a second version, which is much more efficient. And then the first patent was issued by Sperry-Rand on that. This patent now belongs to Unisys. There were many disputes about this patent.

Even before going to a sabbatical, we tried to convince a company in Israel to start a project, because we were aware of the fact that it’s useful commercially, too, even in the late 1970s, because in Germany, unlike in the States, at that time, they charged you by the number of bits rather than by the duration. This means that if you do compression, you pay less. So there was a really good reason for trying to make a commercial product out of it. We set up together with a company in Israel, a small R&D group. But then this company collapsed because of other reasons. So nothing came out of it.

Anyhow, we were aware of the commercial importance of that aside from the theoretical part of it. But we were really hoping that either Bell Labs or Sperry-Rand or both would do something about it. I must say that it was very difficult. I was at Bell Labs and when I tried to convince them to issue a patent, they said that as long as Sperry-Rand is doing it, it’s okay with them, because they had some kind of a patent agreement between the two companies. At that time, there was another good reason. At that time it wasn’t clear—the law was changed later—it wasn’t clear whether you could actually get a patent on software. It had to be a real piece of hardware for you to be able to get a patent. Sperry-Rand did propose a hardware realization and applied and got a patent, and then a second patent.

Later on, there were other derivatives of this algorithm. Abraham Lempel spent some time at that time at HP, so he continued to develop it there. But then other companies took over, Microsoft and other companies, because at Microsoft became part of the operating system. Since then, both of us didn’t really follow up all the details of the different derivatives of that algorithm.

The nice thing about this algorithm, by the way, is that it’s so simple that it’s very easy for somebody to get his own version of this algorithm. Everybody is aware of the fact of the phenomenon of “not invented here.” It is so simple an algorithm that it’s also very easy to change it slightly and make it more optimal for a specific use. This is why many people took over. Now there’s a whole family of the LZ algorithm. Even nowadays, it’s the most popular and most common data compression algorithm that people are using for noiseless data compression. By noiseless, I mean this is the case where you don’t allow for any errors in the data compression.

But this is not the only noiseless compression algorithm. A very excellent example is the local algorithm, which was developed at HP and which was used in the Mars probe recently. This is to transmit these beautiful pictures from Mars. So by now, there are other algorithms, but they are much more complicated computationally. You have to pay the price—in the case of NASA, where every bit that you save is worth a million. For every day use, the old LZ algorithm is still the most popular one because it’s so simple, and very fast too. But indeed, I must emphasize the fact that it’s not the only one. There are other algorithms that have been developed since then. This gave a real thrust to the whole area of universal data compression.

I can mention other well-known algorithms for data compression: the PPM algorithm, which is mostly used when you want to compress huge archives and so forth, CTW and so forth. But they are all based on the same philosophy, as that of the Lempel-Ziv algorithm. The idea is to establish a dictionary or a list of things that you have observed already. In a sense it’s like a TV Guide. If you take a TV Guide and there’s an old movie that they show three or four times in the same week, they don’t print the full description of that movie again. They just use a pointer that tells you to go to two or three pages earlier and look for it there. Once you have observed something (and I’ll tell you a little more about what this something is), then you list it as an entry in the dictionary. And when you observe it again, you don’t have to copy it. All you have to do is to send the pointer and tell the other side where to copy from—the dictionary which both sides have. The whole idea is that this dictionary is being developed dynamically by both the sender and the receiver as it goes.

How you do it is simply the following. When you have a sequence, and let’s say that this sequence starts with a zero. You have never seen a zero before, so this is your first entry into your dictionary. Now let’s say that the second letter is a zero. Then you say, “Hey, wait a minute. I saw that already, so I don’t have to copy it.” Although in the beginning a zero can be represented by a small number of bits; it doesn’t really matter whether you copy it or not. It’s just one letter. Let’s say that after the second zero, there’s a one. So you say, “Aha! Zero-one is something that I’ve never seen in the past, so let me put that in the dictionary.” Slowly, you are generating a dictionary which consists of phrases of the data that are longer and longer. If such a phrase, sometimes it’s very long, repeats itself, you don’t have to copy it; all you have to do is to send the pointer and tell the other side where to copy this phrase from. That’s the whole idea. I think that the TV Guide explanation is the right one. The idea is generating dynamically a dictionary and pointers. You transmit the pointers, not the information.

The other algorithms that were mentioned—PPM, CTW—they’re all based on that idea. The new algorithm, which has been used in the Mars probe, is the LOCO-I algorithm, which is a very clever idea. But it’s only for the noiseless compression of pictures. If you want to transmit pictures noiselessly, then this is a very efficient algorithm.

There is another branch of data compression, which is called noisy data compression. In most cases, you allow yourself to introduce distortion. If you are pressed on rate, if you want to minimize the number of bits, some distortion may be acceptable to you. This calls for a different kind of data compression schemes, although sometimes a building block in such compression schemes is again the Lempel-Ziv algorithm. You first introduce distortion into the picture and compress is a little bit, and then go on and use the Lempel-Ziv algorithm.

Coming back to your question, I myself am not an expert in computer programming. But again, there are so many versions that there were enough jobs for many good programmers that came up with a variety of improvements.

Applications

Colburn:

You mentioned that at first, people didn’t see the application. What was the first time that you heard of or that you knew of, the first application that somebody had taken it and used it?

Ziv:

On the Net, people were aware of it and starting to use it for their own purposes. That’s what I did, too. At that time, memory was expensive. Currently, you have a huge memory even in a hand-held computer. But at that time, memory was expensive, so you had no choice but to compress if you wanted to store a lot of data. I was using it for my own purposes, and many other people started to use it for the same purpose. People were aware of it mainly in information theory, and then computer science discovered it; also some versions of it were available on the Net and were free for everybody to use for data storage purposes.

But the big push came when people were starting to use data compression on the Internet. There was no other choice because of the huge traffic. People just had to use it. This is when Sperry-Rand was bought by Unisys. Unisys, for years, didn’t tell anybody that they had that patent. Only when everybody was using it for free that they started to get after people, charging them. There was a big fuss about it. Finally, there was some sort of an agreement when the Lempel-Ziv algorithm was declared to be part of the approved standard of the FCC, v. 42-bis. Anyhow, when they adopted that as a standard, they forced Unisys to agree to a reasonable arrangement where any vendor of modems which wanted to use it, or had to use it as part of its modem, was paying an initial single payment of I think $25,000 dollars to Unisys and that was it.

For a while, everything was quiet. But then there was a second phase where it became part of GIF. Are you yourself familiar with it?

Colburn:

I’ve been doing a fair amount of reading, as much reading as I could.

Ziv:

Oh, okay. GZIP is a free version of the Lempel-Ziv algorithm, which is very good. People were threatening Unisys that if they won’t stop running after people and suing them or charging them, they’ll just switch to a new standard. I think that at that time they gave up. I’m not that familiar with the details, or what’s going on today. But definitely, when the World Wide Web became popular, this gave a huge push to the Lempel-Ziv algorithm, because it became an integral part of the algorithm of the World Wide Web. Without data compression, it was impossible to do anything at all at that time, and even nowadays. It’s amazing. The more bandwidth you get—you can always feel that there’ll be no problems. But then there’s so much more data being generated that data compression is always needed and always will be needed. The same applies to computers. You have a huge amount of storage space, but still, it’s being filled up very quickly, as you know. And Microsoft is not doing its best to make things efficiently, so this helps the data compression business.

So this is it, in a nutshell, the more applicable side of it. But I’m not really an expert on the most recent versions and applications of that algorithm and the other algorithms. Basically, I’m interested in information theory, although my basic education is that of a communication engineer, and I was involved with R&D of real communication systems. But I’ve drifted into doing more and more theoretical work, which again, became applicable.

Colburn:

One of the things that the other Marconi Fellows have talked about is that a lot of great ideas occur at the borderlines.

Ziv:

I agree. This is one point that I always emphasize to my students, too. What happens is that as a scientist, you acquire and develop certain mathematical tools. If you stick to the original application of the tools, then you find yourself being pushed more and more into a corner which nobody is interested in. The whole idea is that once you acquire the tools, you should always ask yourself, “What is it that one can do with these tools in other fields?” At least in my case, it was always a fruitful approach. I have mentioned to you the borderline between information theory and prediction, information theory and classification. More generally, the borderline between information theory and statistics, information theory and pattern recognition, all these areas, the borderline is really where the action is, and should be, I think. And you are right. I’ve heard that being mentioned by other Marconi Fellows as well.

Colburn:

I read somewhere recently that the Lempel-Ziv algorithm is used for music, that there are a lot of uses for it in recording of music.

Ziv:

Yes, that’s right. Yes, definitely. When Sony-Philips came out with the first DVDs, the Lempel-Ziv algorithm is built into that. I’m pretty sure that its still is. In hardware I mean, not just software. It has been used in lots of places. Let me mention other interesting applications. For example, if you listen to a recorded piece of music and you want to classify it and find out whether it’s Mozart or Beethoven, there are some successes with that too, trying to use Lempel-Ziv as a classification means.

Let me tell you how it works. The idea is the following. Remember I told you earlier about developing a dictionary. If you treat a piece of music as a sequence of notes and so forth, these are letters too. You build up a dictionary. The dictionary would be typical to that particular music. If you try to use that dictionary in order to compress another piece of music which was composed by a different composer, it won’t work because it’s different. All you have to do is to check and see if that piece of music is compressible as compared with another piece of music (which was used for developing the dictionary), or not? If it compresses well, there’s a good chance it’s the same composer. If not, it’s not. And it has been used, and quite successfully.

Let me mention another application, which was just developed by a group of students at the Technion. Everybody is worried about communication security nowadays. Their idea was the following. Everyone has a differing typing pattern when he touches the keyboard. They were able to use a version of the Lempel-Ziv algorithm to identify persons by the way they type on the keyboard. The typing style on the keyboard apparently is like a fingerprint. I think that they’re making a product out of it. It started as a student project—not in our department, in the EE department. It was developed at the Computer Science department. You can apply it to music, you can apply it to typing styles, and so forth.

Let me give you another example. In the Bible, the book of Isaiah, there are two parts, Isaiah I and Isaiah II. The question is whether it was written by the same person or not. People, I think at Bar-Ilan University here in Israel, were using Lempel-Ziv to demonstrate that it was not written by the same person because it’s a different style. You can check if something is an original Shakespeare or not, but it takes miles of data. It’s not easy. It takes a while, because suppose the differences are in the syntax. It takes many words before you can find out something. But in principle, it’s working. Now with the big machines that we have nowadays, this is definitely feasible.

This is exactly related to the classification problem that I mentioned earlier. Classification is exactly that. In your memory, you have a piece of Shakespeare, a book of Shakespeare, and then I feed the computer with something which was written by somebody else. The job of the classifier, namely the computer that does the classification, is just to tell you whether it’s Shakespeare or not Shakespeare. Mathematically, it’s feasible. This I know because I have the theorems to prove it. Of course, when it comes to application, you need big machines to do that. But now we can do a lot. If not now, ten years from now.

By the way, if you’re interested in the Lempel-Ziv implications on fields other than pure data compression, there’s a very nice book by Bob Lucky. He’s another Marconi Fellow. He did a beautiful job. For example, I can give you another very interesting philosophical question. People are arguing about the fact as to whether we are born with the ability to understand languages, or is it something that we learn by ourselves? Noam Chomsky, for MIT, a very famous linguist, came up years ago with a theory that we are born with certain abilities when it comes to languages. He came to the conclusion that this is the case because he was able to show that there are similar structures of different languages, including music, by the way. So it doesn’t really matter whether you’re born in Africa or in the States. It doesn’t really matter what language you’re speaking. There are certain similar structures to languages.

Bob Lucky poses the following problem. Take all the books in the Library of Congress and apply the Lempel-Ziv algorithm to the series of books, making them one huge sequence. You’ll find out that the average length of a word in that dictionary is not too big, I think he mentioned something like 16 letters, 17 letters. But we know we are much more clever than that. After all, sometimes if I tell you two words, you can guess the whole sentence. Where is it coming from, this ability? The Lempel-Ziv model is that of a huge computer which knows nothing—all it can do is learn from its past experience. But apparently, our capability to comprehend much more than what’s contained in a word of 17 letters has to do with something which we are born with. If you wish, you can go philosophical about it and go to other areas as well.

Colburn:

Yes, indeed. Again, another borderline.

Ziv:

Yes, sure.

IEEE

Colburn:

Another area I wanted to ask you about was your IEEE activities and connections. I also wanted to mention to you that the Lempel-Ziv algorithm is being proposed as an IEEE milestone.

Ziv:

Yes, I see. Yes, I know that there were some initiatives from the IEEE in Israel. Is that what you are referring to?

Colburn:

Yes. The proposal has been submitted and it’s going to be discussed at this Sunday’s meeting.

Ziv:

Is the purpose of the interview its relevance to the Marconi?

Colburn:

To Marconi, yes.

Ziv:

Okay. Because if you want to get more information about the algorithm itself, you should also interview Abraham Lempel. He is in Israel, but he time shares between Israel and HP in Palo Alto, California. He’s in charge of their research group in computer science and a few other fields worldwide. But he’s also the director of HP Labs in Israel. If you want to put the accent on that, you can check with him as well.

Colburn:

I would like to. I hope to be able to.

Ziv:

Yes. We collaborated very closely at least six years or so on the notion of universal coding and data compression, and also the development of the algorithm, of course.

Now about the IEEE. Well, I think I’m a loyal member. I did my best. I was on the board of what is now called the IEEE Information Theory Society. I was co-chairman of one international symposium, which was held here in Israel in Ashkelon. By the way, we had the most prestigious award, which is given by the Information Theory Society, called the Shannon Award. The first one to get a Shannon Award was Claude Shannon. Actually, I must confess to you that Aaron Wyner, who was co-chair of that symposium with me, and myself thought of ways how to convince Claude Shannon to come to Israel, of all places. He wasn’t that well then, too. This is how the Shannon Award came about, because this convinced him to come. He gave the first Shannon Lecture, which is the most important event until today in the international symposium on information theory. He was the first Shannon awardee. But the truth behind it was how to drag Claude Shannon from Boston to Ashkelon, Israel.

Colburn:

And how did you do that?

Ziv:

First of all, it was an international symposium on information theory, which switches from one place to others around the world. So he did participate in those symposiums from time to time, but at that time already, he did it very rarely. We approached him and told him that we want to establish the Shannon Lecture and the Shannon Award. Becoming a Shannon Lecturer is the highest appreciation that one can get in our society. He liked the idea and volunteered to give the first lecture, and that was it. Since then, there’s a whole line of Shannon awardees. I am glad to be one of them, and so is Aaron Wyner, but this came much later. All the big names like Bob Gallagher and Fanno and Elias are there, of course.

The Information Theory Society is relatively small as compared to huge societies like computer science and so forth. But it’s very family-like, I must say. It’s a very high quality society. The symposiums are really of very high quality. The rejection rates of papers is always very, very high, even today.

Colburn:

Yes, they put out a very impressive Transactions.

Ziv:

Yes, that’s right. I remember once a meeting in the mid-‘80s in Florida where the topic was “Is Information Theory Dead?” Look what happened since then—Internet! Now, I search on the Internet. The cellular telephony is the latest development in information theory, without which there would have been no cellular telephony, especially wide band. Everything is in there—error correction of the highest quality.

By the way, talking about unknown channels, this is it. When you talk about the cellular communications with all the reflections and the fadings, nobody really knows what a channel is statistically, exactly. So theoretical results find their place, finally, because things are getting more and more complicated. There’s time for everything under the sun.

Colburn:

Did you ever imagine that the algorithm would be used, for example, transmitting pictures from Mars?

Ziv:

No. By the way, the group of people at HP that developed it are students of Abraham Lempel and myself. But that was 25 years ago. What they did has nothing to do with the algorithm, although they got good education, I must say. The LOCO algorithm is beautiful. It’s an amazing thing. I got an email at six a.m. in the morning from that group, telling me about it. They were proud of the fact that it was working, that NASA adopted their algorithm. I was very proud of them, of course.

I’m not implying it’s a derivative of Lempel-Ziv. It’s not. It’s called LOCO: Lossless Image Compression. I don’t even remember the letters, but you can find it. It’s really amazing. By the way, an interesting side issue is: one might ask himself why is it that it’s important to have lossless data compression of pictures? After all, some distortion won’t matter, right? But the point is that when you go to Mars, you don’t know what to expect; and when you throw away information, you don’t know what it is that you’re losing. The same thing applied to medical information. When you compress, you want to compress tomographic pictures and X-rays and so forth. The medical doctors won’t let you discard anything because they’re afraid of being sued, of course. Somebody who might then sue them for ignoring the fact that they had cancer and they didn’t see it.

This is another avenue where noiseless data compression is being used heavily. Although the compression rate saving might be only 60% or so, but nevertheless, when you have to store a huge amount of data… I think that by US federal laws, you must keep the record of any US citizen which is older than 21. You are not supposed to destroy data. So where do you store it? So, if you save 60% of the data, it means not building another wing to a hospital, which is a huge amount of money.

Then it’s interesting that when you go to Mars, this is another place where you don’t allow yourself to throw out information. Although I’m sure that the next time they’ll go there, they’ll know more about the background pictures and so forth and they might be able to use more efficient algorithms with some distortions. When you go to something which is completely unknown to you, that’s the only thing you can do. The quality of the picture is really amazing.

Colburn:

It is amazing, yes.

Ziv:

I’m really surprised that there was no more coverage in the media about it. It’s an immense technological success. People are getting used to everything, I guess.

Colburn:

I think that’s what it is. The Voyager pictures were so spectacular that after that, we have gotten used to it.

Ziv:

That’s right. Anyhow, coming back to IEEE, I think I paid my dues to the IEEE society in many respects. The IEEE organization is really unique. I think that it’s one of the few organizations where you can really say that it’s an international organization, and where the center of activities in different areas varies from time to time from one country to the other. This is not the case in many of the other organizations. There are scientific societies, but the IEEE is an organization with interests other than just the scientific part of it. It encourages collaboration of all sorts of ways.

Colburn:

I’ve always been very proud to work for it. Thank you so much for your time.

Ziv:

By the way, I might be in New Jersey on the weekend of April 17th. I’m on the Technical Advisory Board of a young company named Klarion, so I’m going to a Technical Advisory Board meeting there. If needed, on Saturday I might have some time.

Colburn:

That would be wonderful, and it would be an honor to meet you. By the way, we will be sending you a copy of the transcript of the oral history interview so you can edit it and make changes and so forth.