This course will teach you how to perform data analysis using MongoDB's powerful Aggregation Framework.
You'll begin this course by building a foundation of essential aggregation knowledge. By understanding these features of the Aggregation Framework, you will learn how to ask complex questions of your data. This will lay the groundwork for the remainder of the course, where you'll dive deep and learn about schema design, relational data migrations, and machine learning with MongoDB.
By the end of this course you'll understand how to best use MongoDB and its Aggregation Framework in your own data science workflow.

From the lesson

Machine Learning with MongoDB

This module focuses on demonstrating how MongoDB can be used in different machine learning workflows. You'll learn how to perform machine learning directly in MongoDB, how to prepare data for machine learning with MongoDB, and how to analyze data with MongoDB in preparation for doing machine learning in Python.

Taught by:

Nathan Leniz

Curriculum Engineer

Kirby Kohlmorgen

Senior Curriculum Engineer

Transcript

Let's see what this looks like from a programming standpoint. So first things first, we're going to go ahead and import our dependencies, and connect to our Atlas cluster. And now, we're going to use another UCI Machine Learning data set. In this case, the data are the results of a chemical analysis of different wines grown in the same region of Italy, but derived from three different cultivars. We'll see what the data looks like in just one moment.
So we go ahead and get that collection. And we're going to go ahead and remove the _id field with a projection. And then we can go ahead and just marshal our data into a dataframe, so we can take a peek at it. So here are the thirteen fields of our data, and so each one of these rows is a measurement of a wine from one of the three different cultivars. And this Alcohol column tells us which cultivar is which. And so these first five documents are all from cultivar 1. So we can go ahead and then create a new dataframe, X, where we will go ahead and remove this column telling us which row is which one.
And then before we dive into PCA, we're going to go ahead and scale our data to remove biases in the different fields. And this is where PCA really begins. I will point out, however, that we're taking the covariance matrix of the transpose of our matrix. And this is because PCA is performed on rows as features, whereas right now our features are columns. So, go ahead and calculate that.
And now, we can go ahead and use NumPy's linear algebra module, and go ahead and calculate the eigenvalues and eigenvectors of this covariance matrix. Let's take a peek at what these eigenvalues look like. And here they are. These eigenvalues represent how much variance is accounted for by each respective eigenvector. And so we really want to find which eigenvectors contain the most variance. So we want to go ahead and zip the eigenvalues and vectors together into a variable called eigen_map.
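The scaling, covariance, and eigendecomposition steps described so far can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the lesson's notebook: the Atlas connection and the wine collection are not reproduced, so a small synthetic array `X` stands in for the data, the scaling is done by hand, and `np.linalg.eigh` is one possible choice, since the lesson only says that NumPy's linear algebra module is used. Note that `np.cov` treats each row as a variable by default, which is why the transpose is passed.

```python
import numpy as np

# Stand-in data (assumption): 100 samples, 4 features on very different scales.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = latent * np.array([1.0, 10.0, 0.1, 5.0]) + rng.normal(size=(100, 4))

# Scale every column to zero mean and unit variance so that no single
# feature dominates just because of its units.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# np.cov expects one variable per ROW by default, while our features are
# columns, so we hand it the transpose. The result is a 4 x 4 matrix.
cov = np.cov(X_scaled.T)

# eigh is meant for symmetric matrices such as a covariance matrix.
# The eigenvectors come back as the COLUMNS of the second array.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Pair each eigenvalue with its eigenvector (one column per eigenvalue).
eigen_map = list(zip(eigenvalues, eigenvectors.T))

print(eigenvalues)
```

Each entry of `eigen_map` is an `(eigenvalue, eigenvector)` pair, which is the shape the sorting step in the lesson relies on.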
And then we can go ahead and sort them by eigenvalue. And now that these pairs are sorted, we can go ahead and pull them both back out, and take a look at these sorted eigenvalues. And you can see they're now clearly sorted in descending order.
And now we can go ahead and put the eigenvectors into a dataframe, and set the columns to those of the original dataframe, obviously, minus Alcohol. So we can see how the different components of each eigenvector map to the original features of our data. And you can see that we have 12 different eigenvectors for the 12 different dimensions of our data, excluding the one dimension that tells us which variety each wine is. Now since these vectors are sorted by their eigenvalues, this means that the top vector up here represents the most variance in our data, and the bottom one the least. And when we look at the absolute values of each of these different terms, we can see how much weight each feature carries. So it looks like flavanoids and nonflavanoid phenols represent the most variance in our data.
Now the question is how many eigenvectors should we keep? Which eigenvectors should we keep to retain the most information from our original data set? Because that is the whole point of PCA, right? We want to reduce the dimensionality of our data. But we have to figure out how much we want to reduce it to. So to do this, we're going to look at our eigenvalues. And we're going to quantify how much variance each vector represents. So we're going to go ahead and sum all the values up, and then calculate the percentage of the total for each value. And then we'll use the cumulative sum function to progressively add up each of these percentages. And they should obviously add up to 100% at the end. And then real quick, we're going to go ahead and get a variable that tells us how many dimensions we have in our data. This will be useful for a bunch of plotting functions.
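The sorting and cumulative-variance steps above can be illustrated with a tiny example. The eigenvalues and eigenvectors below are hypothetical values chosen so the arithmetic is easy to follow by hand; they are not the wine data, and the variable names other than `eigen_map` are assumptions.

```python
import numpy as np

# Hypothetical eigenvalues and eigenvectors (not from the wine data).
eigenvalues = np.array([0.5, 4.0, 1.5, 2.0])
eigenvectors = np.eye(4)  # column i is paired with eigenvalue i

# Zip the pairs together, then sort by eigenvalue, largest first.
# Sorting on the eigenvalue alone avoids comparing arrays on ties.
eigen_map = list(zip(eigenvalues, eigenvectors.T))
eigen_map.sort(key=lambda pair: pair[0], reverse=True)

# Pull the sorted values and vectors back out.
sorted_values = np.array([value for value, _ in eigen_map])
sorted_vectors = np.array([vector for _, vector in eigen_map])

# Share of the total variance per component, then the running total.
percent = sorted_values / sorted_values.sum() * 100
cumulative = np.cumsum(percent)

# Number of dimensions, handy for axis ticks when plotting.
num_dims = len(sorted_values)

print(sorted_values)  # 4.0, 2.0, 1.5, 0.5
print(cumulative)     # 50, 75, 93.75, 100
```

Plotting `cumulative` against `range(1, num_dims + 1)` gives the kind of curve that the elbow method is read from.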
And now we can just very simply go ahead and plot our cumulative sum array. And now we can use this graph to tell us how many principal components we should keep. Each dot here shows, for a given number of top principal components, how much variance is explained. So with the first three components we can explain about 67% of the variance in our data. And obviously with all twelve components we are going to represent 100% of our data.
The rule of thumb here is called the elbow method. We ask, where in this data do we have an elbow? Where in the data do we have diminishing returns after some point? And so our elbow here actually isn't super sharp, but you can see that the first three components explain almost 70% of the variance in our data. And so for the sake of this lesson, to keep it simple, we're just going to go ahead and take the top two, which do account for over 50% of the variance in our data, which is actually pretty good.
So we'll go ahead and pull out these first two eigenvectors. So this will be PC1 and PC2. And then we can go ahead and put these two vectors into their own matrix. And here's what this matrix looks like. And so this matrix only has two dimensions to it. We now want to go ahead and apply these two vectors, or this matrix, against our original dataset, so we can project it in terms of these two principal components. And to do that, we'll go ahead and just take the dot product of our original matrix, our original set of data, against this new matrix, and we'll call that Y.
And then let's go ahead and plot, see what this looks like. And there's actually some pretty cool clustering going on. And we can now visually see relationships that account for over 50% of the variance in our data, but only using two dimensions. Now, in practice you're not going to do all these steps manually, and that's because scikit-learn can do it for us.
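The projection step, keeping the top two eigenvectors and taking a dot product, might look something like this. As before, this is only a sketch: the synthetic `X` is a stand-in for the wine measurements, and the names `order` and `W` are assumptions introduced for illustration (only `Y` comes from the lesson).

```python
import numpy as np

# Stand-in data (assumption): 100 samples, 4 features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = latent * np.array([1.0, 10.0, 0.1, 5.0]) + rng.normal(size=(100, 4))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix (features as rows for np.cov).
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_scaled.T))

# Indices of the eigenvalues from largest to smallest.
order = np.argsort(eigenvalues)[::-1]

# Keep PC1 and PC2 as the two columns of a 4 x 2 matrix.
W = eigenvectors[:, order[:2]]

# Project the scaled data onto the two principal components.
Y = X_scaled.dot(W)

print(Y.shape)  # (100, 2)
```

A useful sanity check is that the variance of each column of `Y` matches the corresponding eigenvalue, which is what "variance explained by a component" means here.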
Performing PCA is really as simple as just importing PCA from sklearn, and then saying how many components you want, and then fitting it. And now we're going to go ahead and plot this using the output from scikit-learn. You'll see that we get the same graph, but it's just upside down.
And now, let's see if we can go ahead and apply some machine learning to this transformed dataset. So, we'll first pull out the Alcohol column, split up our data. And here I'm just going to use the logistic regression function. And then we'll go ahead and fit our training data. And you can see when we run this machine learning algorithm on our original data set, we get 93% accuracy. But when we do the same thing on our dimensionality-reduced one, we get 96%, and this is because we have effectively removed the noise in our data with PCA.
So we covered a lot of content in this lesson, so let's go ahead and recap what we've learned. We saw we can use PCA to reduce the dimensionality of our data. And we saw that this is performed by finding the eigenvectors of the covariance matrix of our data. And then once we have these vectors, we can go ahead and sort them by their eigenvalue, and then keep some number of the top components. And this will allow us to represent the majority of our data with significantly fewer dimensions.
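The scikit-learn shortcut mentioned in this lesson, and the "upside down" plot, can be sketched as follows. This assumes scikit-learn is installed and again uses a synthetic stand-in for the wine data; the comparison against a manual projection is an illustration added here, not part of the lesson's notebook. The flipped plot is consistent with the fact that an eigenvector and its negation describe the same direction, so two implementations may disagree on sign.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (assumption): 100 samples, 4 features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = latent * np.array([1.0, 10.0, 0.1, 5.0]) + rng.normal(size=(100, 4))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# scikit-learn: say how many components you want, then fit and transform.
pca = PCA(n_components=2)
Y_sklearn = pca.fit_transform(X_scaled)

# Manual route for comparison: top two eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_scaled.T))
order = np.argsort(eigenvalues)[::-1]
Y_manual = X_scaled.dot(eigenvectors[:, order[:2]])

# The two projections agree except possibly for the sign of each component,
# which is why a plot can come out mirrored or upside down.
same_up_to_sign = np.allclose(np.abs(Y_sklearn), np.abs(Y_manual))
print(same_up_to_sign)
```

The fitted `pca.explained_variance_` attribute holds the same top eigenvalues that the manual approach sorts by.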