>>> By enrolling in this course you agree to the End User License Agreement as set out in the FAQ. Once enrolled you can access the license in the Resources area <<<
This course, Applied Artificial Intelligence with Deep Learning, is part of the IBM Advanced Data Science Certificate, which IBM is currently creating, and gives you easy access to invaluable insights into the deep learning models used by experts in Natural Language Processing, Computer Vision, Time Series Analysis, and many other disciplines. We'll learn about the fundamentals of Linear Algebra and Neural Networks. Then we introduce the most popular deep learning frameworks, like Keras, TensorFlow, PyTorch, DeepLearning4J and Apache SystemML. Keras and TensorFlow make up the greatest portion of this course. We learn about Anomaly Detection, Time Series Forecasting, Image Recognition and Natural Language Processing by building models with Keras on real-life examples from IoT (Internet of Things), financial market data, literature, and image databases. Finally, we learn how to scale those artificial brains using Kubernetes, Apache Spark and GPUs.
IMPORTANT: THIS COURSE ALONE IS NOT SUFFICIENT TO OBTAIN THE "IBM Watson IoT Certified Data Scientist certificate". You need to take three other courses, two of which are currently being built. The Specialization will be ready in late spring or early summer 2018.
Using these approaches, no matter what your skill level in the topics you would like to master, you can change your thinking and change your life. If you're already an expert, this peek under the mental hood will give you ideas for turbocharging the successful creation and deployment of deep learning models. If you're struggling, you'll find a structured treasure trove of practical techniques that walk you through what you need to do to get on track. If you've ever wanted to become better at anything, this course will serve as your guide.
Prerequisites: Some coding skills are necessary. Preferably Python, but any other programming language will do fine. Some basic understanding of math (linear algebra) is also a plus, but we will cover that part in the first week as well.
If you choose to take this course and earn the Coursera course certificate, you will also earn an IBM digital badge. To find out more about IBM digital badges, follow the link ibm.biz/badging.

Taught by

Romeo Kienzler

Chief Data Scientist, Course Lead

Niketan Pansare

Senior Software Engineer

Tom Hanlon

Training Director

Max Pumperla

Deep Learning Engineer

Ilja Rasin

Data Scientist

Transcript

So welcome to the chat about gradient descent updater strategies. This is a very important lecture, because choosing the correct gradient descent updater strategy can heavily influence your learning. So remember, in gradient descent we start at one random value, because that value is generated by the random initialization of the weights. And from there, you have to find the next step. The next step you find, of course, by computing the first derivative of the cost function, and you go in the direction of the steepest descent, but there are some catches. So let's start with gradient descent as it is. In gradient descent, you update the weights theta by subtracting a value from them. The value which you are subtracting is this one, so let's have a look at how it is computed. Again, it starts with a cost function J. The cost function J is of course dependent on theta, the weights, but also on the training data. The training data I have exemplified here using a matrix X and a matrix Y, so this is basically the complete data set. And once this cost function is computed on the complete data set, we take the first derivative. We get the gradient, then we multiply by the learning rate, so the update will be smaller, and then we subtract this from the actual value of theta t, and that gives us theta t+1. So that's how gradient descent works. There is a variation called stochastic gradient descent. The only difference is that you are not computing the gradient on the complete matrix X and the complete matrix Y; you just take one training example, x(i) and y(i). So once you've computed the gradient, you already update the parameters theta. A variation of this is the so-called mini-batch gradient descent, which is somewhere in between: you don't use the complete data set, and you don't use just a single instance of your data. You use a batch.
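To make these three variants concrete, here is a minimal NumPy sketch; the toy linear-regression cost and all concrete numbers are my own illustrative choices, not from the lecture. The only difference between the variants is how much of the data each update sees:

```python
import numpy as np

# Toy problem: cost J(theta) = mean((X @ theta - y)**2), my own example
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

def gradient(theta, X_part, y_part):
    # First derivative of the mean squared error on the given slice of data
    return 2.0 / len(y_part) * X_part.T @ (X_part @ theta - y_part)

eta = 0.1                       # learning rate
theta = np.zeros(3)             # stands in for the random weight initialization

# Gradient descent: one update computed on the COMPLETE data set
theta = theta - eta * gradient(theta, X, y)

# Stochastic gradient descent: one update per single example (x(i), y(i))
i = 0
theta = theta - eta * gradient(theta, X[i:i+1], y[i:i+1])

# Mini-batch gradient descent: one update per batch, e.g. 32 examples
batch = slice(0, 32)
theta = theta - eta * gradient(theta, X[batch], y[batch])
```

All three share the same update rule, theta_{t+1} = theta_t - eta * gradient; only the slice of data fed into the gradient changes.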
The batch size we have to define; usually you take values between 32 and 1024 or so, and then you compute the gradient for that particular batch. Once this gradient has been computed, you update the parameter matrix theta. Another way of doing this is called momentum. The idea of momentum is that we also take the update of a past time step into consideration when computing the update of the current time step. So now we have here a variable called nu. Nu t is computed from nu t-1, and usually we take a gamma of 0.9. To this we add the actual gradient, scaled by the learning rate. Once we've computed nu t, we subtract nu t from theta t and we get the updated theta t+1. There is a variation of momentum called the Nesterov accelerated gradient. The only addition here is that inside the cost function, we already subtract this term here from theta t. You should think of this as a ball rolling down the hill, and it's a smart ball: whenever the slope starts to increase, the ball stops accelerating and brakes a bit. Adagrad, which stands for adaptive gradient, tries to change the learning rate over time, and not only globally; it tries to change the learning rate for each individual parameter. So you see here, this formula is dependent on i, where i stands for the individual parameter. We see here that the learning rate eta is modified, and it's modified by this term here. Gt,ii is the diagonal element of a matrix which contains information on the past gradients per parameter, and taking this into account, it just reduces the learning rate. And note that there is a little term epsilon, so that we avoid division by zero. This whole thing cannot only be computed per parameter, but also using matrix multiplication, and therefore we can omit the i, because Gt is a diagonal matrix. So don't worry if you don't understand this; maybe you have to revisit the linear algebra basics. Anyway, it's just a way to compute the whole math in one go.
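The momentum and Adagrad updates just described can be sketched in a few lines of NumPy. This is my own illustration: the gradient vector and starting weights are made up purely to show one step of each rule.

```python
import numpy as np

grad = np.array([0.4, -0.2])    # pretend gradient at the current step (made up)
theta = np.array([1.0, 1.0])
eta = 0.01                      # learning rate

# Momentum: nu_t = gamma * nu_{t-1} + eta * grad, then theta_{t+1} = theta_t - nu_t
gamma = 0.9                     # the 0.9 mentioned in the lecture
nu = np.zeros_like(theta)       # nu_{t-1}, starts at zero
nu = gamma * nu + eta * grad    # nu_t
theta = theta - nu              # theta_{t+1}

# Adagrad: per-parameter learning rate, divided by the root of the
# accumulated squared gradients (the diagonal entries G_t,ii)
G = np.zeros_like(theta)        # running sum of squared gradients per parameter
eps = 1e-8                      # the little epsilon that avoids division by zero
G = G + grad**2
theta = theta - eta / np.sqrt(G + eps) * grad
```

Note how in the Adagrad step a parameter with a large accumulated squared gradient automatically gets a smaller effective learning rate.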
Adadelta is just a variation of Adagrad; the difference is that it doesn't use the matrix G, but continuously computes a decaying mean of the historic gradients. There are many, many other variations of gradient descent updaters, and I recommend you have a look at Sebastian Ruder's blog later, where everything is described really nicely. The key take-home point here is that the gradient descent updater strategy is a very important knob you have to tune. And as usual, tuning neural networks is considered a bit of black magic, or trial and error, so you just have to try a couple of those. So let's actually have a look at Sebastian's blog. The best thing is this funny animation here. You see here the trajectories of different learning curves using different gradient descent updater strategies. It's pretty interesting, because this is a very important type of parameter you can tune. We have covered most of those, but what I wanted to show you are two little figures here. Those two are really interesting: you see different trajectories in optimization space using different gradient descent updaters. The red one here is stochastic gradient descent, and you see that in both problems it performs really poorly and even gets stuck. Maybe in the first image it won't get stuck, but it won't converge for ages; and in the second, which is a saddle point, it even gets stuck. And you can see here that, for example, Adadelta is performing best among all of those. Okay, I think that's enough for now. I hope that you now understood that this is an important type of parameter which you can tune, and that's basically it.
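A single Adadelta-style step, replacing Adagrad's ever-growing sum G with decaying running averages, could be sketched like this. Again this is my own NumPy illustration with a made-up gradient; the rho and epsilon values are common illustrative choices, not taken from the lecture.

```python
import numpy as np

rho, eps = 0.95, 1e-6           # decay rate and smoothing term (my own choices)
theta = np.array([1.0, 1.0])
Eg2 = np.zeros_like(theta)      # decaying average of squared gradients
Edx2 = np.zeros_like(theta)     # decaying average of squared updates

grad = np.array([0.4, -0.2])    # pretend gradient (made up for illustration)

# Accumulate the squared gradient into a decaying mean instead of a sum
Eg2 = rho * Eg2 + (1 - rho) * grad**2
# Update: ratio of the two running root-mean-squares replaces a fixed eta
dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
Edx2 = rho * Edx2 + (1 - rho) * dx**2
theta = theta + dx
```

Because old gradients decay away instead of piling up, the effective learning rate does not shrink monotonically to zero the way it does in Adagrad.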