How do Java programs deal with vast quantities of data? Many of the data structures and algorithms that work with introductory toy examples break when applications process real, large data sets. Efficiency is critical, but how do we achieve it, and how do we even measure it?
This is an intermediate Java course. We recommend this course to learners who have previous experience in software development or a background in computer science, and in particular, we recommend that you have taken the first course in this specialization (which also requires some previous experience with Java).
In this course, you will use and analyze data structures that are used in industry-level applications, such as linked lists, trees, and hashtables. You will explain how these data structures make programs more efficient and flexible. You will apply asymptotic Big-O analysis to describe the performance of algorithms and evaluate which strategy to use for efficient data retrieval, addition of new data, deletion of elements, and/or memory usage.
The program you will build throughout this course allows its user to manage, manipulate and reason about large sets of textual data. This is an intermediate Java course, and we will build on your prior knowledge. This course is designed around the same video series as in our first course in this specialization, including explanations of core content, learner videos, student and engineer testimonials, and support videos -- to better allow you to choose your own path through the course!

AE

Very solid review of data structures. In fact, I learned a lot of new things along the way like how to benchmark test Java code and how to properly use JUnit and other unit testing frameworks.

YK

Jan 11, 2016

Filled StarFilled StarFilled StarFilled StarFilled Star

I really liked this course and the way it was taught. Although I am a CS major, I really learnt a lot of other stuff from this course apart from learning the core content of Data Structures.

À partir de la leçon

Efficiency Analysis and Benchmarking

Welcome to week 3! The text-editor application you worked with last week does something, but it doesn't do it particularly fast. This week we'll start talking about efficiency. We'll introduce the concept of "Big-O" notation, which sounds a little silly, but is really a powerful (and extremely common) way of analyzing a program's efficiency, independent of the system that it's running on and the exact details of how it's implemented. Then we'll go the other direction and dive into the details, talking about how to measure the actual running time of a piece of code to get an idea of how it really performs in practice.

Enseigné par

Christine Alvarado

Associate Teaching Professor

Mia Minnes

Assistant Teaching Professor

Leo Porter

Associate Teaching Professor

Transcription

All right, now you've gotten warmed up with a project and now we want to think about performance, and this is a common theme throughout the entire course. We'll be writing code but then we'll also be talking about analyzing it, thinking about if it's good enough, and what that even means. So, by the end of this first video, you'll have a better sense of what it means to talk about the performance of an algorithm and why it's important. And then we'll also describe some factor that impact the performance of an algorithm. So, first of all, before we delve into an example, let's remind ourselves that an algorithm is a strategy for solving a problem. So when we talked about the performance of an algorithm, we're talking about how good that strategy is for coming at the solution of the problem that we're thinking about. So, let's think about an example that comes from the project. And in the project we're analyzing a whole bunch of text, and we're trying to understand how readable it is. That's the first part of the project. And Christine suggested that we used the Flesch score and she talked about that in some of the earlier videos. And so we have a way of computing some numerical representation of the readability of a text, based on the number of words, the number of sentences, and the number of syllables. Okay, so that's great. We read in a whole bunch of text, we compute this Flesch score and then we have a description of how readable the text is. But the question is, how long does it take? So, from when we've read in the text to when we have the Flesch score available for our user. And if we want to answer that question, we might just time it. So, we might start a stopwatch and press go when our algorithm to compute the Flesch score starts, and then stop the stopwatch when we've gotten the answer. The issue, though, is that this time that we get might not be a great representation of how good our algorithm is. So we're asking how good is the algorithm for computing flesh score? And so, before we get to answering that question, let's think about what might be the problem with just looking at the stopwatch time. So if we think about how much time it takes for our code to run on a particular input, what might impact that time? Well, so, for example, the computer that we're running on, will have a huge part to play in how quickly the code runs. So, if we run on really old hardware, then we expect our code to take longer than if we run it on a really snazzy new machine. Or if the computer's really bogged down with a whole lot of other programs, then we expect it to be a little bit slower than one that is devoting all of its resources to just the one program that we're running. Other factors that might come into play, include the compiler that we use to compile our code into machine language. But also decisions that we make about what libraries to use and what optimizations were made in those libraries for the particular commands that we want. So, just running our code on a particular input and timing it doesn't really give us a big picture of our overall strategy for the code. So you can think about the problem solving paradigm, as we're given a problem, compute the Flesch score of given text. And we've got a strategy for computing that Flesch score that maybe says, read in all the text and compute how many words there are. Read in all the text and compute the number of sentences. Read in all the text and compute the number of syllables. And then once we've got those numbers, then we do something with it and compute the Flesch score. And so, that high-level description is like our algorithm, and then we've coded it up in Java, or you've done it in the project, and we have a particular implementation. And what we'd like to do is talk about the performance of the algorithm and sometimes that's tied into the implementation as well. And so, just the time for running the specific code on a specific machine on a specific input doesn't give us the whole picture. So, should we just give up trying to measure performance? Well we shouldn't because performance is so important. Performance is the difference between running a really cool algorithm that let's us answer very hard questions in very short amount of time. And therefore, processing more and more data that then leads to analyze even bigger trends and answering more important questions. And if we don't have handle on the size of the data that we can process with our algorithms, then maybe we think we have a really cool solution but if we try to run it, then the program's not gonna stop until the end of time. And so, having a sense of how our solutions interact with larger data and more complicated data is at the heart of solving these problems. So we don't want to give up, but what can we do? Well, maybe instead of measuring the time, we're going to count operations. Because hopefully we have a sense of, perhaps if we have some sort of basic operation, then the amount of time that it takes to make that operation, well, we can't control it. But, we'll know that our algorithm takes not too many operations. And so, it won't take that long. Okay, so, we are deciding that we're just going to focus on what we can control, which is the number of operations we asked the computer to do, rather than how long the computer takes to do those operations. And, so we might go ahead and try to count them. In our example of the Flesch score, we want to start at the first part of our text, and go through it. And for example, we're going to count the number of syllables in the document. And so we want to count the number of operations that happens. The thing is, though, the number of operations that we have to do in order to count the syllables, depends on our input. So, if we're looking at the whole American tax code, that's going to be a whole lot of operations to count all the syllables in there. As opposed to, if we just looked at a short snippet, like in the example that Christine did, where it's still a complicated text, but we're just looking at a small part of it. And so the number of operations that we execute will be smaller. Now, notice that we're performing the exact same algorithm, whether we're looking at the whole tax code or just a small part of it. And so our measure of performance should be about the algorithm shouldn't be talking about the input. So what we need to do, is focus on how the performance scales, which means how does the number of operations grow as our input grows? So for example, we'll ask questions like, if the list, which is the string of text that we're processing, is twice as long, how many more operations will it take us to process it? Okay, so, we've moved away from just timing using the clock time, and we've gone to looking at the number of operations our algorithm takes, and we're not just going to look at the number as an absolute number. We're thinking about how it grows and changes as the input size changes. Okay, but is size all that matters? For example, if I have two collections of text, say the American tax code and then something else that is equally long, that are standing right in front of me and I try to process both of these texts one after the other. Will the number of operations that's required to process the IRS tax code, be exactly the same as the number of operations required to process something else? Maybe, maybe not. This is a worthwhile question to consider, and so we're going to take that into account as well, when we think about performance. So, our three strategies for talking about performance include, #1, count the operations instead of the time. So we abstract away from the machine considerations and the software considerations that aren't part of the algorithm design problem that we faced. #2, is focus on how the performance scales. Because we care about really big inputs and so we wanna know about what happens eventually as the output gets bigger and bigger and bigger, how does our algorithm handle it? And #3, go beyond the input size because what we'd like to know, is how the algorithm responds in all sorts of situations. And so we'd like our performance analysis to be able to capture not just the size of the input but also what might happen because of internal structure to the input. Okay so, in the next video what we're going to do, is delve right in to counting operations for specific codes snippets.