Yellowbrick is a suite of visual diagnostic tools called "Visualizers" that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines scikit-learn with matplotlib in the best tradition of the scikit-learn documentation, but to produce visualizations for your models! For more on Yellowbrick, please see the About.

Visualizers are estimators (objects that learn from data) whose primary objective is to create visualizations that allow insight into the model selection process. In Scikit-Learn terms, they can be similar to transformers when visualizing the data space or wrap a model estimator similar to how the "ModelCV" (e.g. RidgeCV, LassoCV) methods work. The primary goal of Yellowbrick is to create a sensible API similar to Scikit-Learn. Some of our most popular visualizers include:

Yellowbrick is a welcoming, inclusive project in the tradition of matplotlib and scikit-learn. Similar to those projects, we follow the Python Software Foundation Code of Conduct. Please don't hesitate to reach out to us for help or if you have any contributions or bugs to report!

The primary way to ask for help with Yellowbrick is to post on our Google Groups Listserv. This is an email list/forum where members of the community can join and respond to each other; you should be able to receive the quickest response here. Please also consider joining the group so you can respond to questions! You can also ask questions on Stack Overflow and tag them with "yellowbrick", or you can add issues on GitHub. You can also tweet or direct message us on Twitter @scikit_yb.

If youâre new to Yellowbrick, this guide will get you started and help you include visualizers in your machine learning workflow. Before we begin, however, there are several notes about development environments that you should consider.

Yellowbrick has two primary dependencies: scikit-learn and matplotlib. If you do not have these Python packages, they will be installed alongside Yellowbrick. Note that Yellowbrick works best with scikit-learn version 0.20 or later and matplotlib version 3.0.1 or later. Both of these packages require some C code to be compiled, which can be difficult on some systems, like Windows. If you're having trouble, try using a distribution of Python that includes these packages, like Anaconda.

Yellowbrick is also commonly used inside of a Jupyter Notebook alongside Pandas data frames. Notebooks make it especially easy to coordinate code and visualizations; however, you can also use Yellowbrick inside of regular Python scripts, either saving figures to disk or showing figures in a GUI window. If you're having trouble with this, please consult matplotlib's backends documentation.

Note

Jupyter, Pandas, and other ancillary libraries like NLTK for text visualizers are not installed with Yellowbrick and must be installed separately.

There is a known bug installing matplotlib on Linux with Anaconda. If you're having trouble, please let us know on GitHub.

Once installed, you should be able to import Yellowbrick without an error, both in Python and inside of Jupyter notebooks. Note that because of matplotlib, Yellowbrick does not work inside of a virtual environment on macOS without jumping through some hoops.

The Yellowbrick API is specifically designed to play nicely with scikit-learn. The primary interface is therefore a Visualizer: an object that learns from data to produce a visualization. Visualizers are scikit-learn Estimator objects and have a similar interface along with methods for drawing. To use a visualizer, follow the same workflow as with a scikit-learn model: import the visualizer, instantiate it, call the visualizer's fit() method, then, to render the visualization, call the visualizer's poof() method, which does the magic!

For example, there are several visualizers that act as transformers, used to perform feature analysis prior to fitting a model. The following example visualizes a high-dimensional data set with parallel coordinates:
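A minimal sketch of such an example, modeled on the parallel coordinates section later in this document (the occupancy dataset and the sample and shuffle arguments are illustrative choices):

from yellowbrick.features import ParallelCoordinates
from yellowbrick.datasets import load_occupancy

# Load the classification dataset (features X and target y)
X, y = load_occupancy()

# Instantiate the visualizer, fit and transform the data, then render
visualizer = ParallelCoordinates(sample=0.05, shuffle=True)
visualizer.fit_transform(X, y)
visualizer.poof()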

As you can see, the workflow is very similar to using a scikit-learn transformer, and visualizers are intended to be integrated along with scikit-learn utilities. Arguments that change how the visualization is drawn can be passed into the visualizer upon instantiation, similarly to how hyperparameters are included with scikit-learn models.

The poof() method finalizes the drawing (adding titles, axes labels, etc.) and then renders the image on your behalf. If you're in a Jupyter notebook, the image should just appear. If you're in a Python script, a GUI window should open with the visualization in interactive form. However, you can also save the image to disk by passing in a file path as follows:

visualizer.poof(outpath="pcoords.png")

The extension of the filename will determine how the image is rendered. In addition to the .png extension, .pdf is also commonly used for high-quality publication ready images.

Note

Data input to Yellowbrick is identical to that of scikit-learn. Datasets are
usually described with a variable X (sometimes referred to simply as data) and an optional variable y (usually referred to as the target). The required data X is a table that contains instances (or samples) which are described by features. X is therefore a two-dimensional matrix with a shape of (n, m) where n is the number of instances (rows) and m is the number of features (columns). X can be a Pandas DataFrame, a NumPy array, or even a Python list of lists.

The optional target data, y, is used to specify the ground truth in supervised machine learning. y is a vector (a one-dimensional array) that must have length n, the same number of elements as rows in X. y can be a Pandas Series, a NumPy array, or a Python list.

Visualizers can also wrap scikit-learn models for evaluation, hyperparameter tuning and algorithm selection. For example, to produce a visual heatmap of a classification report, displaying the precision, recall, F1 score, and support for each class in a classifier, wrap the estimator in a visualizer as follows:
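A sketch of that workflow, assuming the occupancy dataset and a naive Bayes classifier for illustration:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

from yellowbrick.classifier import ClassificationReport
from yellowbrick.datasets import load_occupancy

# Load a classification dataset and create train/test splits
X, y = load_occupancy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Wrap the estimator in the visualizer, fit on train, score on test, render
model = GaussianNB()
visualizer = ClassificationReport(model, classes=["unoccupied", "occupied"])
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()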

Only two additional lines of code are required to add visual evaluation of the classifier model: the instantiation of a ClassificationReport visualizer that wraps the classification estimator, and a call to its poof() method. In this way, Visualizers enhance the machine learning workflow without interrupting it.

The class-based API is meant to integrate with scikit-learn directly; however, there are times when you just need a quick visualization. Yellowbrick supports quick methods for exactly this purpose. For example, the two visual diagnostics above could instead have been implemented as follows:
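For instance, the parallel coordinates example could be drawn in a single call (a sketch, assuming the parallel_coordinates quick method is importable from yellowbrick.features, as the parallel coordinates section later in this document suggests):

from yellowbrick.features import parallel_coordinates
from yellowbrick.datasets import load_occupancy

# Load the data and draw the visualization in one call; the quick method
# instantiates, fits, and poofs the visualizer on your behalf
X, y = load_occupancy()
parallel_coordinates(X, y, sample=0.05, shuffle=True)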

Letâs consider a regression analysis as a simple example of the use of visualizers in the machine learning workflow. Using a bike sharing dataset based upon the one uploaded to the UCI Machine Learning Repository, we would like to predict the number of bikes rented in a given hour based on features like the season, weather, or if itâs a holiday.

Note

We have updated the dataset from the UCI ML repository to make it a bit easier to load into Pandas; make sure you download the Yellowbrick version of the dataset using the load_bikeshare method below. Please also note that Pandas is required to follow the supplied code. Pandas can be installed using pip install pandas if you haven't already installed it.

We can load our data using the yellowbrick.datasets module as follows:
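A minimal sketch using the load_bikeshare method mentioned above, which returns the feature table and target vector described in the note:

from yellowbrick.datasets import load_bikeshare

# Load the bike sharing dataset: X is the feature table, y the rental counts
X, y = load_bikeshare()
print(X.head())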

The machine learning workflow is the art of creating model selection triples, a combination of features, algorithm, and hyperparameters that uniquely identifies a model fitted on a specific data set. As part of our feature selection, we want to identify features that have a linear relationship with each other, potentially introducing covariance into our model and breaking OLS (guiding us toward removing features or using regularization). We can use the Rank Features visualizer to compute Pearson correlations between all pairs of features as follows:
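A sketch of the Rank2D call described above, mirroring the Rank2D examples later in this document and assuming X and y were loaded with load_bikeshare:

from yellowbrick.features import Rank2D

# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(algorithm='pearson')

visualizer.fit(X, y)       # Fit the data to the visualizer
visualizer.transform(X)    # Transform the data
visualizer.poof()          # Draw the correlation heatmap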

This figure shows us the Pearson correlation between pairs of features such that each cell in the grid represents two features identified in order on the x and y axes and whose color displays the magnitude of the correlation. A Pearson correlation of 1.0 means that there is a strong positive, linear relationship between the pairs of variables and a value of -1.0 indicates a strong negative, linear relationship (a value of zero indicates no relationship). Therefore we are looking for dark red and dark blue boxes to investigate further.

In this chart, we see that the features temp and feelslike have a strong correlation and also that the feature season has a strong correlation with the feature month. This seems to make sense; the apparent temperature we feel outside depends on the actual temperature and other air quality factors, and the season of the year is described by the month! To dive in deeper, we can use the Direct Data Visualization (JointPlotVisualizer) to inspect those relationships.
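A minimal sketch, assuming the JointPlotVisualizer signature that accepts feature and target keyword arguments naming the two columns to compare:

from yellowbrick.features import JointPlotVisualizer

# Compare the measured temperature to the apparent temperature
visualizer = JointPlotVisualizer(feature='temp', target='feelslike')
visualizer.fit(X['temp'], X['feelslike'])
visualizer.poof()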

This visualizer plots a scatter diagram of the apparent temperature on the y axis and the actual measured temperature on the x axis and draws a line of best fit using a simple linear regression. Additionally, univariate distributions are shown as histograms above the x axis for temp and next to the y axis for feelslike. The JointPlotVisualizer gives an at-a-glance view of the very strong positive correlation of the features, as well as the range and distribution of each feature. Note that the axes are normalized to the space between zero and one, a common technique in machine learning to reduce the impact of one feature over another.

This plot is very interesting because there appear to be some outliers in the dataset. These instances may need to be manually removed in order to improve the quality of the final model: they may represent data input errors and could train the model on a skewed dataset, which would return unreliable predictions. The first instance of outliers occurs in the temp data where the feelslike value is approximately equal to 0.25, showing a horizontal line of data likely created by input error.

We can also see that more extreme temperatures create an exaggerated effect in perceived temperature; the colder it is, the colder people are likely to believe it to be, and the warmer it is, the warmer it is perceived to be, with moderate temperatures generally having little effect on individual perception of comfort. This gives us a clue that feelslike may be a better feature than temp - promising a more stable dataset, with less risk of running into outliers or errors.

We can ultimately confirm the assumption by training our model on either value, and scoring the results. If the temp value is indeed less reliable, we should remove the temp variable in favor of feelslike. In the meantime, we will use the feelslike value due to the absence of outliers and input error.

At this point, we can train our model; let's fit a linear regression to our data and plot the residuals.
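A sketch using the ResidualsPlot visualizer from yellowbrick.regressor, with a held-out test split (the split size is an assumption):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from yellowbrick.regressor import ResidualsPlot

# Create training and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Wrap the linear model in the visualizer, fit on train, score on test
visualizer = ResidualsPlot(LinearRegression())
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()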

The residuals plot shows the error against the predicted value (the number of riders), and allows us to look for heteroskedasticity in the model; e.g. regions in the target where the error is greatest. The shape of the residuals can strongly inform us where OLS (ordinary least squares) is being most strongly affected by the components of our model (the features). In this case, we can see that the lower predicted number of riders results in lower model error, and conversely that the higher predicted number of riders results in higher model error. This indicates that our model has more noise in certain regions of the target or that two variables are collinear, meaning that they are injecting error as the noise in their relationship changes.

The residuals plot also shows how the model is injecting error: the bold horizontal line at residuals = 0 is no error, and any point above or below that line indicates the magnitude of error. For example, most of the residuals are negative, and since the score is computed as actual - expected, this means that the expected value is bigger than the actual value most of the time; e.g. our model is primarily guessing higher than the actual number of riders. Moreover, there is a very interesting boundary along the top right of the residuals graph, indicating an interesting effect in model space; possibly that some feature is strongly weighted in that region of the model.

Finally, the residuals are colored by training and test set. This helps us identify errors in creating train and test splits. If the test error doesn't match the train error, then our model is either overfit or underfit. Otherwise it could be an error in shuffling the dataset before creating the splits.

Along with generating the residuals plot, we also measured the performance by "scoring" our model on the test data, e.g. the code snippet visualizer.score(X_test, y_test). Because we used a linear regression model, the scoring consists of finding the R-squared value of the data, which is a statistical measure of how close the data are to the fitted regression line. The R-squared value of any model may vary slightly between prediction/test runs, however it should generally be comparable. In our case, the R-squared value for this model was only 0.328, suggesting that a linear model may not be the most appropriate for fitting this data. Let's see if we can fit a better model using regularization, and explore another visualizer at the same time.

When exploring model families, the primary thing to consider is how the model becomes more complex. As the model increases in complexity, the error due to variance increases because the model is becoming more overfit and cannot generalize to unseen data. However, the simpler the model is the more error there is likely to be due to bias; the model is underfit and therefore misses its target more frequently. The goal therefore of most machine learning is to create a model that is just complex enough, finding a middle ground between bias and variance.

For a linear model, complexity comes from the features themselves and their assigned weight according to the model. Linear models therefore expect the fewest features that achieve an explanatory result. One technique to achieve this is regularization, the introduction of a parameter called alpha that normalizes the weights of the coefficients with each other and penalizes complexity. Alpha and complexity have an inverse relationship: the higher the alpha, the lower the complexity of the model, and vice versa.

The question therefore becomes how you choose alpha. One technique is to fit a number of models using cross-validation and select the alpha that has the lowest error. The AlphaSelection visualizer allows you to do just that, with a visual representation that shows the behavior of the regularization. As you can see in the figure above, the error decreases as the value of alpha increases up until our chosen value (in this case, 3.181), where the error starts to increase. This allows us to target the bias/variance trade-off and to explore the relationship of regularization methods (for example Ridge vs. Lasso).
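A sketch of code that could produce a figure like the one described above, wrapping RidgeCV in the AlphaSelection visualizer (the alpha search space is an assumption):

import numpy as np

from sklearn.linear_model import RidgeCV
from yellowbrick.regressor import AlphaSelection

# Search a logarithmic range of alphas with cross-validated ridge regression
alphas = np.logspace(-10, 1, 200)
visualizer = AlphaSelection(RidgeCV(alphas=alphas))
visualizer.fit(X, y)
visualizer.poof()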

We can now train our final model and visualize it with the PredictionError visualizer:
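A sketch using the alpha selected above and the train/test splits created earlier:

from sklearn.linear_model import Ridge
from yellowbrick.regressor import PredictionError

# Fit a ridge model with the selected alpha and plot actual vs. predicted
visualizer = PredictionError(Ridge(alpha=3.181))
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()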

The prediction error visualizer plots the actual (measured) vs. expected (predicted) values against each other. The dotted black line is the 45 degree line that indicates zero error. Like the residuals plot, this allows us to see where error is occurring and in what magnitude.

In this plot, we can see that most of the instance density is less than 200 riders. We may want to try orthogonal matching pursuit or splines to fit a regression that takes into account more regionality. We can also note that the weird topology from the residuals plot seems to be fixed using the Ridge regression, and that there is a bit more balance in our model between large and small values. Potentially the Ridge regularization cured a covariance issue we had between two features. As we move forward in our analysis using other model forms, we can continue to utilize visualizers to quickly compare and see our results.

Hopefully this workflow gives you an idea of how to integrate Visualizers into machine learning with scikit-learn and inspires you to use them in your work and write your own! For additional information on getting started with Yellowbrick, check out the Model Selection Tutorial. After that you can get up to speed on specific visualizers detailed in the Visualizers and API.

Discussions of machine learning are frequently characterized by a singular focus on model selection. Be it logistic regression, random forests, Bayesian methods, or artificial neural networks, machine learning practitioners are often quick to express their preference. The reason for this is mostly historical. Though modern third-party machine learning libraries have made the deployment of multiple models appear nearly trivial, traditionally the application and tuning of even one of these algorithms required many years of study. As a result, machine learning practitioners tended to have strong preferences for particular (and likely more familiar) models over others.

However, model selection is a bit more nuanced than simply picking the "right" or "wrong" algorithm. In practice, the workflow includes:

selecting and/or engineering the smallest and most predictive feature set

choosing a set of algorithms from a model family, and

tuning the algorithm hyperparameters to optimize performance.

The model selection triple was first described in a 2015 SIGMOD paper by Kumar et al. In their paper, which concerns the development of next-generation database systems built to anticipate predictive modeling, the authors cogently express that such systems are badly needed due to the highly experimental nature of machine learning in practice. "Model selection," they explain, "is iterative and exploratory because the space of [model selection triples] is usually infinite, and it is generally impossible for analysts to know a priori which [combination] will yield satisfactory accuracy and/or insights."

Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively home in on quality models than exhaustive search. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid pitfalls and traps.

The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process, providing visual diagnostics throughout the transformation of high dimensional data.

This tutorial uses a modified version of the mushroom dataset from
the UCI Machine Learning Repository.
Our objective is to predict if a mushroom is poisonous or edible based on
its characteristics.

The data include descriptions of hypothetical samples corresponding to
23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each
species was identified as definitely edible, definitely poisonous, or of
unknown edibility and not recommended (this latter class was combined
with the poisonous one).

Our data, including the target, is categorical. We will need to change
these values to numeric ones for machine learning. In order to extract
this from the dataset, we'll have to use Scikit-Learn transformers to
transform our input dataset into something that can be fit to a model.
Luckily, Scikit-Learn does provide a transformer for converting
categorical labels into numeric integers:
sklearn.preprocessing.LabelEncoder.
Unfortunately it can only transform a single vector at a time, so we'll
have to adapt it in order to apply it to multiple columns.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class EncodeCategorical(BaseEstimator, TransformerMixin):
    """
    Encodes a specified list of columns or all columns if None.
    """

    def __init__(self, columns=None):
        # Guard against None so that all columns can be selected during fit
        self.columns = [col for col in columns] if columns is not None else None
        self.encoders = None

    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to encode.
        """
        # Encode all columns if columns is None
        if self.columns is None:
            self.columns = data.columns

        # Fit a label encoder for each column in the data frame
        self.encoders = {
            column: LabelEncoder().fit(data[column])
            for column in self.columns
        }
        return self

    def transform(self, data):
        """
        Uses the encoders to transform a data frame.
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.transform(data[column])
        return output

Precision is the number of correct positive results divided by the
number of all positive results (e.g. How many of the mushrooms we
predicted would be edible actually were?).

Recall is the number of correct positive results divided by the
number of positive results that should have been returned (e.g. How
many of the mushrooms that were poisonous did we accurately predict were
poisonous?).

The F1 score is a measure of a test's accuracy. It considers both
the precision and the recall of the test to compute the score. The F1
score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst at 0.

Letâs build a way to evaluate multiple estimators â first using
traditional numeric scores (which weâll later compare to some visual
diagnostics from the Yellowbrick library).

from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


def model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
        ('label_encoding', EncodeCategorical(X.keys())),
        ('one_hot_encoder', OneHotEncoder()),
        ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    model.fit(X, y)

    expected = y
    predicted = model.predict(X)

    # Compute and return the F1 score (the harmonic mean of precision and recall)
    return f1_score(expected, predicted)
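The function can then be called with any scikit-learn classifier; for instance (the estimator choices here are illustrative, assuming X and y hold the mushroom features and labels loaded earlier):

from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Score a couple of candidate estimators numerically
print(model_selection(X, y, LinearSVC()))
print(model_selection(X, y, LogisticRegression()))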

Now letâs refactor our model evaluation function to use Yellowbrickâs
ClassificationReport class, a model visualizer that displays the
precision, recall, and F1 scores. This visual model analysis tool
integrates numerical scores as well as color-coded heatmaps in order to
support easy interpretation and detection, particularly the nuances of
Type I and Type II error, which are very relevant (lifesaving, even) to
our use case!

Type I error (or a "false positive") is detecting an effect that
is not present (e.g. determining a mushroom is poisonous when it is in
fact edible).

Type II error (or a "false negative") is failing to detect an
effect that is present (e.g. believing a mushroom is edible when it is
in fact poisonous).

Note

When running in a Jupyter Notebook, be sure to add the following line at the top of the notebook: %matplotlib notebook. This will ensure the figures are rendered correctly. For those running this code with a Python script, the figure should appear in a secondary window.

import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from yellowbrick.classifier import ClassificationReport


def visual_model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
        ('label_encoding', EncodeCategorical(X.keys())),
        ('one_hot_encoder', OneHotEncoder()),
        ('estimator', estimator)
    ])

    # Create a new figure to draw the classification report on
    _, ax = plt.subplots()

    # Instantiate the classification model and visualizer
    visualizer = ClassificationReport(model, ax=ax, classes=['edible', 'poisonous'])
    visualizer.fit(X, y)
    visualizer.score(X, y)

    # Note that to save the figure to disk, you can specify an outpath
    # argument to the poof method!
    visualizer.poof()
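It is called the same way as the numeric version; for instance (again, the estimator is illustrative):

from sklearn.svm import LinearSVC

# Evaluate an estimator visually rather than with a single number
visual_model_selection(X, y, LinearSVC())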

Welcome to the API documentation for Yellowbrick! This section contains a complete listing of the currently available, production-ready visualizers along with code examples of how to use them. You may use the following links to navigate to the reference material for each visualization.

Yellowbrick hosts several datasets wrangled from the UCI Machine
Learning Repository to present the
examples used throughout this documentation. If you haven't downloaded the data, you can do so by
running:

$ python -m yellowbrick.download

This should create a folder named data in your current working directory that contains all of the datasets. You can load a specified dataset with pandas.read_csv as follows:

import pandas as pd

data = pd.read_csv('data/concrete/concrete.csv')

The following code snippet can be found at the top of the examples/examples.ipynb notebook in Yellowbrick. Please reference this code when trying to load a specific data set:

import os
import pandas as pd

from yellowbrick.download import download_all

## The path to the test data sets
FIXTURES = os.path.join(os.getcwd(), "data")

## Dataset loading mechanisms
datasets = {
    "bikeshare": os.path.join(FIXTURES, "bikeshare", "bikeshare.csv"),
    "concrete": os.path.join(FIXTURES, "concrete", "concrete.csv"),
    "credit": os.path.join(FIXTURES, "credit", "credit.csv"),
    "energy": os.path.join(FIXTURES, "energy", "energy.csv"),
    "game": os.path.join(FIXTURES, "game", "game.csv"),
    "mushroom": os.path.join(FIXTURES, "mushroom", "mushroom.csv"),
    "occupancy": os.path.join(FIXTURES, "occupancy", "occupancy.csv"),
    "spam": os.path.join(FIXTURES, "spam", "spam.csv"),
}


def load_data(name, download=True):
    """
    Loads and wrangles the passed in dataset by name.
    If download is specified, this method will download any missing files.
    """
    # Get the path from the datasets
    path = datasets[name]

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        if download:
            download_all()
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))

    # Return the data frame
    return pd.read_csv(path)

Unless otherwise specified, most of the examples currently use one or more of the listed datasets. Each dataset has a README.md with detailed information about the data source, attributes, and target. Here is a complete listing of all datasets in Yellowbrick and their associated analytical tasks:

Checks to see if the dataset archive file exists in the data home directory,
found with get_data_home. By specifying the signature, this function
also checks to see if the archive is the latest version by comparing the
sha256sum of the local archive with the specified signature.

Parameters:

dataset:str

The name of the dataset; should either be a folder in data home or
specified in the yellowbrick.datasets.DATASETS variable.

signature:str

The SHA 256 signature of the dataset, used to determine if the archive
is the latest version of the dataset or not.

data_home:str, optional

The path on disk where data is stored. If not passed in, it is looked
up from YELLOWBRICK_DATA or the default returned by get_data_home.

Looks up the path to the dataset specified in the data home directory,
which is found using the get_data_home function. By default data home
is colocated with the code, but can be modified with the YELLOWBRICK_DATA
environment variable, or passing in a different directory.

The file returned will be by default, the name of the dataset in compressed
CSV format. Other files and extensions can be passed in to locate other data
types or auxiliary files.

If the dataset is not found a DatasetsError is raised by default.

Parameters:

dataset:str

The name of the dataset; should either be a folder in data home or
specified in the yellowbrick.datasets.DATASETS variable.

data_home:str, optional

The path on disk where data is stored. If not passed in, it is looked
up from YELLOWBRICK_DATA or the default returned by get_data_home.

fname:str, optional

The filename to look up in the dataset path, by default it will be the
name of the dataset. The fname must include an extension.

ext:str, default: '.csv.gz'

The extension of the data to look up in the dataset path, if the fname
is specified then the ext parameter is ignored. If ext is None then
the directory of the dataset will be returned.

raises:bool, default: True

If the path does not exist, raises a DatasetsError unless this flag is set
to False, at which point None is returned (e.g. for checking if the
path exists or not).

Returns:

path:str or None

A path to the requested file, guaranteed to exist if an exception is
not raised during processing of the request (unless None is returned).

Return the path of the Yellowbrick data directory. This folder is used by
dataset loaders to avoid downloading data several times.

By default, this folder is colocated with the code in the install directory
so that data shipped with the package can be easily located. Alternatively
it can be set by the YELLOWBRICK_DATA environment variable, or
programmatically by giving a folder path. Note that the '~' symbol is
expanded to the user home directory, and environment variables are also
expanded when resolving the path.

Feature analysis visualizers are designed to visualize instances in data
space in order to detect features or targets that might impact
downstream fitting. Because ML operates on high-dimensional data sets
(usually at least 35), the visualizers focus on aggregation,
optimization, and other techniques to give overviews of the data. It is
our intent that the steering process will allow the data scientist to
zoom and filter and explore the relationships between their instances
and between dimensions.

At the moment we have the following feature analysis visualizers implemented:

Feature analysis visualizers implement the Transformer API from
scikit-learn, meaning they can be used as intermediate transform steps
in a Pipeline (particularly a VisualPipeline). They are
instantiated in the same way, and then fit and transform are called on
them, which draws the instances correctly. Finally poof or show
is called which displays the image.

# Feature Analysis Imports
# NOTE that all these are available for import directly from the ``yellowbrick.features`` module
from yellowbrick.features.rankd import Rank1D, Rank2D
from yellowbrick.features.radviz import RadViz
from yellowbrick.features.pcoords import ParallelCoordinates
from yellowbrick.features.jointplot import JointPlotVisualizer
from yellowbrick.features.pca import PCADecomposition
from yellowbrick.features.manifold import Manifold
from yellowbrick.features.importances import FeatureImportances
from yellowbrick.features.rfecv import RFECV

RadViz is a multivariate data visualization algorithm that plots each
feature dimension uniformly around the circumference of a circle then
plots points on the interior of the circle such that the point
normalizes its values on the axes from the center to each arc. This
mechanism allows as many dimensions as will easily fit on a circle,
greatly expanding the dimensionality of the visualization.

Data scientists use this method to detect separability between classes.
E.g. is there an opportunity to learn from the feature set or is there
just too much noise?

If your data contains rows with missing values (numpy.nan), those missing
values will not be plotted. In other words, you may not get the entire
picture of your data. RadViz will raise a DataWarning to inform you of the
percent missing.

If you do receive this warning, you may want to look at imputation strategies.
A good starting place is the scikit-learn Imputer.

from yellowbrick.datasets import load_occupancy
from yellowbrick.features import RadViz

# Load the classification dataset
X, y = load_occupancy()

# Specify the target classes
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer
visualizer = RadViz(classes=classes)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.poof()         # Draw/show/poof the data

RadViz is a multivariate data visualization algorithm that plots each
axis uniformly around the circumference of a circle then plots points on
the interior of the circle such that the point normalizes its values on
the axes from the center to each arc.

Parameters:

ax:matplotlib Axes, default: None

The axis to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

features:list, default: None

a list of feature names to use
If a DataFrame is passed to fit and features is None, feature
names are selected as the columns of the DataFrame.

classes:list, default: None

a list of class names for the legend
If classes is None and a y value is passed to fit then the classes
are selected from the target vector.

color:list or tuple, default: None

optional list or tuple of colors to colorize lines
Use either color to colorize the lines on a per class basis or
colormap to color them on a continuous scale.

colormap:string or cmap, default: None

optional string or matplotlib cmap to colorize lines
Use either color to colorize the lines on a per class basis or
colormap to color them on a continuous scale.

alpha:float, default: 1.0

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered points more visible.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Notes

These parameters can be influenced later on in the visualization
process, but can and should be set as early as possible.

Rank1D and Rank2D evaluate single features or pairs of features using a variety of metrics that score the features on the scale [-1, 1] or [0, 1] allowing them to be ranked. A similar concept to SPLOMs, the scores are visualized on a lower-left triangle heatmap so that patterns between pairs of features can be easily discerned for downstream analysis.

In this example, weâll use the credit default data set from the UCI Machine Learning repository to rank features. The code below creates our instance matrix and target vector.

A one-dimensional ranking of features utilizes a ranking algorithm that takes into account only a single feature at a time (e.g. histogram analysis). By default we utilize the Shapiro-Wilk algorithm to assess the normality of the distribution of instances with respect to the feature. A barplot is then drawn showing the relative ranks of each feature.

from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank1D

# Load the credit dataset
X, y = load_credit()

# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(algorithm='shapiro')

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.poof()         # Draw/show/poof the data

A two-dimensional ranking of features utilizes a ranking algorithm that takes into account pairs of features at a time (e.g. joint plot analysis). The pairs of features are then ranked by score and visualized using the lower left triangle of a feature co-occurrence matrix.

from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank2D

# Load the credit dataset
X, y = load_credit()

# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(algorithm='pearson')

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.poof()         # Draw/show/poof the data

Alternatively, we can utilize the covariance ranking algorithm, which attempts to compute the mean value of the product of deviations of variates from their respective means. Covariance loosely attempts to detect a colinear relationship between features. Compare the output from Pearson above to the covariance ranking below.

from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank2D

# Load the credit dataset
X, y = load_credit()

# Instantiate the visualizer with the covariance ranking algorithm
visualizer = Rank2D(algorithm='covariance')

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.poof()         # Draw/show/poof the data

Parallel coordinates is a multi-dimensional feature visualization technique where the vertical axis is duplicated horizontally for each feature. Instances are displayed as a single line segment drawn between the vertical axes, crossing each axis at the location representing the instance's value for that feature. This allows many dimensions to be visualized at once; in fact, given infinite horizontal space (e.g. a scrolling window), technically an infinite number of dimensions can be displayed!

Data scientists use this method to detect clusters of instances that have similar classes, and to note features that have high variance or different distributions. We can see this in action after first loading our occupancy classification dataset.

Note

These visualizations can be produced with either the ParallelCoordinates visualizer or by using the parallel_coordinates quick method.

from yellowbrick.features import ParallelCoordinates
from yellowbrick.datasets import load_occupancy

# Load the classification data set
X, y = load_occupancy()

# Specify the features of interest and the classes of the target
features = ["temperature", "relative humidity", "light", "CO2", "humidity"]
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer
visualizer = ParallelCoordinates(
    classes=classes, features=features, sample=0.05, shuffle=True
)

# Fit and transform the data to the visualizer
visualizer.fit_transform(X, y)

# Finalize the title and axes then display the visualization
visualizer.poof()

By inspecting the visualization closely, we can see that the combination of transparency and overlap gives us the sense of groups of similar instances, sometimes referred to as "braids". If there are distinct braids of different classes, it suggests that there is enough separability that a classification algorithm might be able to discern between each class.

Unfortunately, as we inspect this visualization, we can see that the domain of each feature may make it hard to interpret. In the above visualization, the domain of the light feature is in [0, 1600], far larger than the range of temperature in [50, 96]. To solve this problem, each feature should be scaled or normalized so they are approximately in the same domain.

Normalization techniques can be directly applied to the visualizer without pre-transforming the data (though you could also do this) by using the normalize parameter. Several transformers are available; try using minmax, maxabs, standard, l1, or l2 normalization to change perspectives in the parallel coordinates as follows:

from yellowbrick.features import ParallelCoordinates
from yellowbrick.datasets import load_occupancy

# Load the classification data set
X, y = load_occupancy()

# Specify the features of interest and the classes of the target
features = ["temperature", "relative humidity", "light", "CO2", "humidity"]
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer
visualizer = ParallelCoordinates(
    classes=classes, features=features,
    normalize='standard', sample=0.05, shuffle=True,
)

# Fit the visualizer and display it
visualizer.fit_transform(X, y)
visualizer.poof()

Now we can see that each feature is approximately in the range [-3, 3], where the mean of each feature is set to zero and it is scaled to unit variance (because we're using the StandardScaler via the standard normalize parameter). This version of parallel coordinates gives us a much better sense of the distribution of the features and whether any features are highly variable with respect to any one class.

Parallel coordinates can take a long time to draw since each instance is represented by a line for each feature. Worse, this time is not well spent since a lot of overlap in the visualization makes the parallel coordinates less understandable. We propose two solutions to this:

Use sample=0.2 and shuffle=True parameters to shuffle and sample the dataset being drawn on the figure. The sample parameter will perform a uniform random sample of the data, selecting the percent specified.

Use the fast=True parameter to enable "fast drawing mode".

The âfastâ drawing mode vastly improves the performance of the parallel coordinates drawing algorithm by drawing each line segment by class rather than each instance individually. However, this improved performance comes at a cost, as the visualization produced is subtly different; compare the visualizations in fast and standard drawing modes below:

As you can see, the "fast" drawing algorithm does not have the same build up of color density where instances of the same class intersect. Because there is only one line per class, there is only a darkening effect between classes. This can lead to a different interpretation of the plot, though it still may be effective for analytical purposes, particularly when you're plotting a lot of data. Needless to say, the performance benefits are dramatic:

Parallel coordinates displays each feature as a vertical axis spaced
evenly along the horizontal, and each instance as a line drawn between
each individual axis. This allows you to detect braids of similar instances
and separability that suggests a good classification problem.

Parameters:

ax:matplotlib Axes, default: None

The axis to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

features:list, default: None

a list of feature names to use
If a DataFrame is passed to fit and features is None, feature
names are selected as the columns of the DataFrame.

classes:list, default: None

a list of class names for the legend
If classes is None and a y value is passed to fit then the classes
are selected from the target vector.

normalize:string or None, default: None

specifies which normalization method to use, if any
Current supported options are 'minmax', 'maxabs', 'standard', 'l1',
and 'l2'.

sample:float or int, default: 1.0

specifies how many examples to display from the data
If int, specifies the maximum number of samples to display.
If float, specifies a fraction between 0 and 1 to display.

random_state:int, RandomState instance or None

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random; only used if shuffle is True and sample < 1.0

shuffle:boolean, default: True

specifies whether sample is drawn randomly

color:list or tuple, default: None

optional list or tuple of colors to colorize lines
Use either color to colorize the lines on a per class basis or
colormap to color them on a continuous scale.

colormap:string or cmap, default: None

optional string or matplotlib cmap to colorize lines
Use either color to colorize the lines on a per class basis or
colormap to color them on a continuous scale.

alpha:float, default: None

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered lines more visible.
If None, the alpha is set to 0.5 in "fast" mode and 0.25 otherwise.

fast:bool, default: False

Fast mode improves the performance of the drawing time of parallel
coordinates but produces an image that does not show the overlap of
instances in the same class. Fast mode should be used when drawing all
instances is too burdensome and sampling is not an option.

vlines:boolean, default: True

flag to determine vertical line display

vlines_kwds:dict, default: None

options to style or display the vertical lines, default: None

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Attributes

âââ

n_samples_:int

number of samples included in the visualization object

Notes

These parameters can be influenced later on in the visualization
process, but can and should be set as early as possible.

Draw the instances colored by the target y such that each line is a
single class. This is the "fast" mode of drawing, since the number of
lines drawn equals the number of classes, rather than the number of
instances. However, this drawing method sacrifices inter-class density
of points using the alpha parameter.

Draw the instances colored by the target y such that each line is a
single instance. This is the "slow" mode of drawing, since each
instance has to be drawn individually. However, in so doing, the
density of instances in braids is more apparent since lines have an
independent alpha that is compounded in the figure.

This is the default method of drawing.

Parameters:

X:ndarray of shape n x m

A matrix of n instances with m features

y:ndarray of length n

An array or series of target or class values

Notes

This method can be used to draw additional instances onto the parallel
coordinates before the figure is finalized.

The PCA Decomposition visualizer utilizes principal component analysis to decompose high dimensional data into two or three dimensions so that each instance can be plotted in a scatter plot. The use of PCA means that the projected dataset can be analyzed along axes of principal variation and can be interpreted to determine if spherical distance metrics can be utilized.

import numpy as np

# Load the classification data set
data = load_data('credit')

# Specify the features of interest and the target
target = "default"
features = [col for col in data.columns if col != target]

# Extract the instance data and the target
X = data[features]
y = data[target]

# Create a list of colors to assign to points in the plot
colors = np.array(['r' if yi else 'b' for yi in y])
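A sketch of the two-dimensional projection using the PCADecomposition visualizer imported above (the color argument is passed through to the scatter plot):

from yellowbrick.features.pca import PCADecomposition

# Project the instances onto their first two principal components
visualizer = PCADecomposition(scale=True, color=colors)
visualizer.fit_transform(X, y)
visualizer.poof()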

The PCA projection can be enhanced to a biplot whose points are the projected instances and whose vectors represent the structure of the data in high dimensional space. By using the proj_features=True flag, vectors for each feature in the dataset are drawn on the scatter plot in the direction of the maximum variance for that feature. These structures can be used to analyze the importance of a feature to the decomposition or to find features of related variance for further analysis.

# Load the regression data set
data = load_data('concrete')

# Specify the features of interest and the target
target = "strength"
features = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']

# Extract the instance data and the target
X = data[features]
y = data[target]
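A sketch of the biplot, enabling proj_features as described above:

from yellowbrick.features.pca import PCADecomposition

# Draw the projected instances along with the feature vectors
visualizer = PCADecomposition(scale=True, proj_features=True)
visualizer.fit_transform(X, y)
visualizer.poof()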

Produce a two or three dimensional principal component plot of a data array
projected onto itâs largest sequential principal components. It is common
practice to scale the data array X before applying a PC decomposition.
Variable scaling can be controlled using the scale argument.

Parameters:

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in, the current axes
will be used (or generated if required).

features: list, default: None

a list of feature names to use
If a DataFrame is passed to fit and features is None, feature
names are selected as the columns of the DataFrame.

scale:bool, default: True

Boolean that indicates if user wants to scale data.

proj_dim:int, default: 2

Dimension of the PCA visualizer.

proj_features:bool, default: False

Boolean that indicates if the user wants to project the features
in the projected space. If True the plot will be similar to a biplot.

color:list or tuple of colors, default: None

Specify the colors for each individual class.

colormap:string or cmap, default: None

Optional string or matplotlib cmap to colorize lines.
Use either color to colorize the lines on a per class basis or
colormap to color them on a continuous scale.

random_state:int, RandomState instance or None, default: None

If input data is larger than 500x500 and the number of components to
extract is lower than 80% of the smallest dimension of the data, then
the more efficient randomized solver is enabled; this parameter sets
the random state on this solver.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

The fitting or transformation process usually calls draw (not the
user). This function is implemented for developers to hook into the
matplotlib interface and to create an internal representation of the
data the visualizer was trained on in the form of a figure or axes.

The Manifold visualizer provides high dimensional visualization using
manifold learning
to embed instances described by many dimensions into 2, thus allowing the
creation of a scatter plot that shows latent structures in data. Unlike
decomposition methods such as PCA and SVD, manifolds generally use
nearest-neighbors approaches to embedding, allowing them to capture non-linear
structures that would be otherwise lost. The projections that are produced
can then be analyzed for noise or separability to determine if it is possible
to create a decision space in the data.

The Manifold visualizer allows access to all currently available
scikit-learn manifold implementations by specifying the manifold as a string to the visualizer. The currently implemented default manifolds are as follows:

"isomap"

Isomap seeks a lower dimensional embedding that maintains
geometric distances between each instance.

"mds"

MDS: multi-dimensional scaling uses similarity to plot
points that are near to each other close in the embedding.

"spectral"

Spectral Embedding: a discrete approximation of the low
dimensional manifold using a graph representation.

"tsne"

t-SNE: converts the similarity of points into probabilities
then uses those probabilities to create an embedding.

Each manifold algorithm produces a different embedding and takes advantage of
different properties of the underlying data. Generally speaking, it requires
multiple attempts on new data to determine the manifold that works best for
the structures latent in your data. Note however, that different manifold
algorithms have different time, complexity, and resource requirements.

Manifolds can be used on many types of problems, and the color used in the
scatter plot can describe the target instance. In an unsupervised or
clustering problem, a single color is used to show structure and overlap. In
a classification problem discrete colors are used for each class. In a
regression problem, a color map can be used to describe points as a heat map
of their regression values.

In a classification or clustering problem, the instances can be described by
discrete labels - the classes or categories in the supervised problem, or the
clusters they belong to in the unsupervised version. The manifold visualizes
this by assigning a color to each label and showing the labels in a legend.

# Load the classification data set
data = load_data('occupancy')

# Specify the features of interest
features = ["temperature", "relative humidity", "light", "CO2", "humidity"]

# Extract the instances and target
X = data[features]
y = data.occupancy
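A sketch of the discrete manifold visualization (the choice of t-SNE here is an assumption; any of the manifolds listed above would work):

from yellowbrick.features import Manifold

# Embed the instances in two dimensions, colored by the discrete target
visualizer = Manifold(manifold='tsne', target='discrete')
visualizer.fit_transform(X, y)
visualizer.poof()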

The visualization also displays the amount of time it takes to generate the
embedding; as you can see, this can take a long time even for relatively
small datasets. One tip is to scale your data using the StandardScaler;
another is to sample your instances (e.g. using train_test_split to
preserve class stratification) or to filter features to decrease sparsity in
the dataset.

One common mechanism is to use SelectKBest to select the features that have
a statistical correlation with the target dataset. For example, we can use
the f_classif score to find the 3 best features in our occupancy dataset.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

from yellowbrick.features import Manifold

model = Pipeline([
    ("selectk", SelectKBest(k=3, score_func=f_classif)),
    ("viz", Manifold(manifold='isomap', target='discrete')),
])

# Load the classification dataset
data = load_data("occupancy")

# Specify the features of interest
features = ["temperature", "relative humidity", "light", "CO2", "humidity"]

# Extract the instances and target
X = data[features]
y = data.occupancy

model.fit(X, y)
model.named_steps['viz'].poof()

For a regression target or to specify color as a heat-map of continuous
values, specify target='continuous'. Note that by default the param
target='auto' is set, which determines if the target is discrete or
continuous by counting the number of unique values in y.

# Specify the features of interest
feature_names = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target_name = 'strength'

# Get the X and y data from the DataFrame
X = data[feature_names]
y = data[target_name]
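A sketch of the continuous case (the isomap manifold is an assumption):

from yellowbrick.features import Manifold

# Color the embedding as a heat map of the continuous target
visualizer = Manifold(manifold='isomap', target='continuous')
visualizer.fit_transform(X, y)
visualizer.poof()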

The Manifold visualizer provides high dimensional visualization for feature
analysis by embedding data into 2 dimensions using the sklearn.manifold
package for manifold learning. In brief, manifold learning algorithms are
unsupervised approaches to non-linear dimensionality reduction (unlike PCA
or SVD) that help visualize latent structures in data.

The manifold algorithm used to do the embedding in scatter plot space can
either be a transformer or a string representing one of the already
specified manifolds as follows:

Each of these algorithms embeds non-linear relationships in different ways,
allowing for an exploration of various structures in the feature space.
Note however, that each of these algorithms has different time, memory and
complexity requirements; take special care when using large datasets!

The Manifold visualizer also shows the specified target (if given) as the
color of the scatter plot. If a classification or clustering target is
given, then discrete colors will be used with a legend. If a regression or
continuous target is specified, then a colormap and colorbar will be shown.

Parameters:

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None, the current axes will be used
or generated if required.

manifold:str or Transformer, default: 'lle'

Specify the manifold algorithm to perform the embedding. Either one of
the strings listed in the table above, or an actual scikit-learn
transformer. The constructed manifold is accessible with the manifold
property, so as to modify hyperparameters before fit.

n_neighbors:int, default: 10

Many manifold algorithms are nearest neighbors based, for those that
are, this parameter specifies the number of neighbors to use in the
embedding. If the manifold algorithm doesn't use nearest neighbors,
then this parameter is ignored.

colors:str or list of colors, default: None

Specify the colors used, though note that the specification depends
very much on whether the target is continuous or discrete. If
continuous, colors must be the name of a colormap. If discrete, then
colors can be the name of a palette or a list of colors to use for each
class in the target.

target:str, default: 'auto'

Specify the type of target as either 'discrete' (classes) or 'continuous'
(real numbers, usually for regression). If 'auto', the Manifold will
attempt to determine the type by counting the number of unique values.

If the target is discrete, points will be colored by the target class
and a legend will be displayed. If continuous, points will be displayed
with a colormap and a color bar will be displayed. In either case, if
no target is specified, only a single color will be drawn.

alpha:float, default: 0.7

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered points more visible.

random_state:int or RandomState, default: None

Fixes the random state for stochastic manifold algorithms.

kwargs:dict

Keyword arguments passed to the base class and may influence the
feature visualization properties.

Notes

Specifying the target as 'continuous' or 'discrete' will influence
how the visualizer is finally displayed, don't rely on the automatic
determination from the Manifold!

Scaling your data with the standard scaler before applying it to the
visualizer is a great way of increasing performance. Additionally using
the SelectKBest transformer may also improve performance and lead to
better visualizations.

Warning

Manifold visualizers have extremely varying time, resource, and
complexity requirements. Sampling data or features may be necessary
in order to finish a manifold computation.

Fits the manifold on X and transforms the data to plot it on the axes.
The optional y specified can be used to declare discrete colors. If
the target is set to 'auto', this method also determines the target
type, and therefore what colors will be used.

Note also that fit records the amount of time it takes to fit the
manifold and reports that information in the visualization.

Parameters:

X:array-like of shape (n, m)

A matrix or data frame with n instances and m features where m > 2.

y:array-like of shape (n,), optional

A vector or series with target values for each instance in X. This
vector is used to determine the color of the points in X.

The feature engineering process involves selecting the minimum required
features to produce a valid model because the more features a model contains,
the more complex it is (and the more sparse the data), therefore the more
sensitive the model is to errors due to variance. A common approach to
eliminating features is to describe their relative importance to a model,
then eliminate weak features or combinations of features and re-evaluate to
see if the model fares better during cross-validation.

Many model forms describe the underlying impact of features relative to each
other. In scikit-learn, Decision Tree models and ensembles of trees such as
Random Forest, Gradient Boosting, and Ada Boost provide a
feature_importances_ attribute when fitted. The Yellowbrick
FeatureImportances visualizer utilizes this attribute to rank and plot
relative importances. Let's start with an example; first load a
classification dataset.

Then we can create a new figure (this is optional; if an Axes isn't
specified, Yellowbrick will use the current figure or create one). We can
then fit a FeatureImportances visualizer with a GradientBoostingClassifier
to visualize the ranked features.
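A minimal sketch of that workflow (the occupancy dataset is an assumed stand-in for any classification dataset; the explicit figure creation is optional):

import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier
from yellowbrick.features import FeatureImportances
from yellowbrick.datasets import load_occupancy

# Load a classification dataset
X, y = load_occupancy()

# Create a new figure and axes (optional)
fig = plt.figure()
ax = fig.add_subplot(111)

# Rank the features using an ensemble that exposes feature_importances_
viz = FeatureImportances(GradientBoostingClassifier(), ax=ax)
viz.fit(X, y)
viz.poof()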

The above figure shows the features ranked according to the explained variance
each feature contributes to the model. In this case the features are plotted
against their relative importance, that is, their percent importance relative
to the most important feature. The visualizer also contains features_ and
feature_importances_ attributes to get the ranked numeric values.

For models that do not support a feature_importances_ attribute, the
FeatureImportances visualizer will also draw a bar plot for the coef_
attribute that many linear models provide.

When using a model with a coef_ attribute, it is better to set
relative=False to draw the true magnitude of the coefficient (which may
be negative). We can also specify our own set of labels if the dataset does
not have column names or to provide better titles. In the example below we
title case our features for better readability:

from sklearn.linear_model import Lasso
from yellowbrick.datasets import load_concrete
from yellowbrick.features import FeatureImportances

# Load the regression dataset
dataset = load_concrete(return_dataset=True)
X, y = dataset.to_data()

# Title case the features for better display and create the visualizer
labels = list(map(lambda s: s.title(), dataset.meta['features']))
viz = FeatureImportances(Lasso(), labels=labels, relative=False)

# Fit and show the feature importances
viz.fit(X, y)
viz.poof()

Some estimators return a multi-dimensional array for either feature_importances_ or coef_ attributes. For example, the LogisticRegression classifier returns a coef_ array in the shape of (n_classes, n_features) in the multiclass case. These coefficients map the importance of the feature to the prediction of the probability of a specific class. Although the interpretation of multi-dimensional feature importances depends on the specific estimator and model family, the data is treated the same in the FeatureImportances visualizer; namely, the importances are averaged.

Taking the mean of the importances may be undesirable for several reasons. For example, a feature may be more informative for some classes than others. Multi-output estimators also do not benefit from having averages taken across what are essentially multiple internal models. In this case, use the stack=True parameter to draw a stacked bar chart of importances as follows:
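A minimal sketch, assuming the iris data and a LogisticRegression classifier (both illustrative choices, not requirements):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from yellowbrick.features import FeatureImportances

# Load a multiclass dataset; coef_ will have shape (n_classes, n_features)
data = load_iris()
X, y = data.data, data.target

# stack=True draws one bar segment per class instead of averaging
viz = FeatureImportances(
    LogisticRegression(multi_class="auto", solver="liblinear"),
    labels=data.feature_names, stack=True, relative=False
)
viz.fit(X, y)
viz.poof()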

Generalized linear models compute a predicted dependent variable via the
linear combination of an array of coefficients with an array of independent
variables. GLMs are fit by modifying the coefficients so as to minimize error,
and regularization techniques specify how the model modifies coefficients in
relation to each other. As a result, an opportunity presents itself: larger
coefficients are necessarily "more informative" because they contribute a
greater weight to the final prediction in most cases.

Additionally, we may say that instance features may also be more or less
"informative" depending on the product of the instance feature value with
the feature coefficient. This creates two possibilities:

We can compare models based on ranking of coefficients, such that a higher coefficient is "more informative".

We can compare instances based on ranking of feature/coefficient products such that a higher product is "more informative".

In both cases, because the coefficient may be negative (indicating a strong negative correlation) we must rank features by the absolute values of their coefficients. Visualizing a model or multiple models by most informative feature is usually done via a bar chart where the y-axis is the feature names and the x-axis is the numeric value of the coefficient, such that the x-axis has both a positive and negative quadrant. The bigger the size of the bar, the more informative that feature is.

This method may also be used for instances; but generally there are very many instances relative to the number of models being compared. Instead, a heatmap grid is a better choice to inspect the influence of features on individual instances. Here the grid is constructed such that the x-axis represents individual features, and the y-axis represents individual instances. The color of each cell (an instance, feature pair) represents the magnitude of the product of the instance value with the feature's coefficient for a single model. Visual inspection of this diagnostic may reveal a set of instances for which one feature is more predictive than another; or other types of regions of information in the model itself.

Displays the most informative features in a model by showing a bar chart
of features ranked by their importances. Although primarily a feature
engineering mechanism, this visualizer requires a model that has either a
coef_ or feature_importances_ parameter after fit.

Note: Some classification models, such as LogisticRegression, return
coef_ as a multidimensional array of shape (n_classes, n_features).
In this case, the FeatureImportances visualizer computes the mean of the
coef_ values by class for each feature.

Parameters:

model:Estimator

A Scikit-Learn estimator that learns feature importances. Must support
either coef_ or feature_importances_ parameters.

ax:matplotlib Axes, default: None

The axis to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

labels:list, default: None

A list of feature names to use. If a DataFrame is passed to fit and
features is None, feature names are selected as the column names.

relative:bool, default: True

If true, the features are described by their relative importance as a
percentage of the strongest feature component; otherwise the raw
numeric description of the feature importance is shown.

absolute:bool, default: False

Make all coefficients absolute to more easily compare negative
coefficients with positive ones.

xlabel:str, default: None

The label for the X-axis. If None, it is automatically determined by the
underlying model and options provided.

stack:bool, default: False

If true and the classifier returns multi-class feature importance,
then a stacked bar plot is plotted; otherwise the mean of the
feature importance across classes is plotted.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. Features are ranked by the model's coef_ or feature_importances_ attributes, and by recursively eliminating a small number of features per loop, RFE attempts to eliminate dependencies and collinearity that may exist in the model.

RFE requires a specified number of features to keep; however, it is often not known in advance how many features are valid. To find the optimal number of features, cross-validation is used with RFE to score different feature subsets and select the best scoring collection of features. The RFECV visualizer plots the number of features in the model along with their cross-validated test score and variability, and visualizes the selected number of features.

To show how this works in practice, weâll start with a contrived example using a dataset that has only 3 informative features out of 25.
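A sketch of such an experiment, using scikit-learn's make_classification to build the contrived dataset. The SVC estimator and the generator arguments are assumptions, and depending on your Yellowbrick version RFECV may need to be imported from yellowbrick.model_selection instead:

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from yellowbrick.features import RFECV

# Build a dataset where only 3 of 25 features carry signal
X, y = make_classification(
    n_samples=1000, n_features=25, n_informative=3, n_redundant=2,
    n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0
)

# Recursively eliminate features with a linear SVM, scored by cross-validation
viz = RFECV(SVC(kernel="linear", C=1))
viz.fit(X, y)
viz.poof()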

This figure shows an ideal RFECV curve: the curve jumps to an excellent accuracy when the three informative features are captured, then gradually decreases in accuracy as the non-informative features are added into the model. The shaded area represents the variability of cross-validation, one standard deviation above and below the mean accuracy score drawn by the curve.

Exploring a real dataset, we can see the impact of RFECV on a credit default binary classifier.
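A sketch of that analysis, assuming the bundled credit dataset, a RandomForestClassifier, and an f1-weighted scorer (all illustrative choices):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from yellowbrick.features import RFECV
from yellowbrick.datasets import load_credit

# Load the binary classification dataset
X, y = load_credit()

# Use stratified folds so every feature-subset score sees the same class balance
cv = StratifiedKFold(5)
viz = RFECV(RandomForestClassifier(n_estimators=10), cv=cv, scoring="f1_weighted")
viz.fit(X, y)
viz.poof()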

In this example we can see that 19 features were selected, though there doesn't appear to be much improvement in the f1 score of the model after around 5 features. Selection of the features to eliminate plays a large role in determining the outcome of each recursion; modifying the step parameter to eliminate more than one feature at each step may help to eliminate the worst features early, strengthening the remaining features (and can also be used to speed up feature elimination for datasets with a large number of features).

See also

This visualizer is based on the visualization in the scikit-learn documentation: recursive feature elimination with cross-validation. However, the Yellowbrick version does not use sklearn.feature_selection.RFECV but instead wraps sklearn.feature_selection.RFE models. The fitted model can be accessed on the visualizer using the viz.rfe_estimator_ attribute, and in fact the visualizer acts as the fitted model when using predict() or score().

Selects the best subset of features for the supplied estimator by removing
0 to N features (where N is the number of features) using recursive
feature elimination, then selecting the best subset based on the
cross-validation score of the model. Recursive feature elimination
eliminates n features from a model by fitting the model multiple times and
at each step, removing the weakest features, determined by either the
coef_ or feature_importances_ attribute of the fitted model.

The visualization plots the score relative to each subset and shows trends
in feature elimination. If the feature elimination CV score is flat, then
potentially there are not enough features in the model. An ideal curve is
one where the score jumps from low to high as the number of features
increases, then slowly decreases again past the optimal number of
features.

Parameters:

model:a scikit-learn estimator

An object that implements fit and provides information about the
relative importance of features with either a coef_ or
feature_importances_ attribute.

Note that the object is cloned for each validation.

ax:matplotlib.Axes object, optional

The axes object to plot the figure on.

step:int or float, optional (default=1)

If greater than or equal to 1, then step corresponds to the (integer)
number of features to remove at each iteration. If within (0.0, 1.0),
then step corresponds to the percentage (rounded down) of features to
remove at each iteration.

groups:array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into
train/test set.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Notes

This model wraps sklearn.feature_selection.RFE and not
sklearn.feature_selection.RFECV because access to the internals of the
CV and RFE estimators is required for the visualization. The visualizer
does take similar arguments, however it does not expose the same internal
attributes.

Additionally, the RFE model can be accessed via the rfe_estimator_
attribute. Once fitted, the visualizer acts as a wrapper for this
estimator and not for the original model passed in. This way the
visualizer model can be used to make predictions.

Caution

This visualizer requires a model that has either a coef_
or feature_importances_ attribute when fitted.

Attributes:

n_features_:int

The number of features in the selected subset

support_:array of shape [n_features]

A mask of the selected features

ranking_:array of shape [n_features]

The feature ranking, such that ranking_[i] corresponds to the
ranked position of feature i. Selected features are assigned rank 1.

cv_scores_:array of shape [n_subsets_of_features, n_splits]

The cross-validation scores for each subset of features and splits in
the cross-validation strategy.

rfe_estimator_:sklearn.feature_selection.RFE

A fitted RFE estimator wrapping the original estimator. All estimator
functions such as predict() and score() are passed through to
this estimator (it rewraps the original model).

n_feature_subsets_:array of shape [n_subsets_of_features]

The number of features removed on each iteration of RFE, computed by the
number of features in the dataset and the step parameter.

Sometimes for feature analysis you simply need a scatter plot to determine the distribution of data. Machine learning operates on high dimensional data, so the number of dimensions must be reduced to two for plotting. As a result these visualizations are typically used as the base for larger visualizers; however, you can also use them to quickly plot data during ML analysis.

Joint plots are useful for machine learning on multi-dimensional data, allowing for
the visualization of complex interactions between different data dimensions, their
varying distributions, and even their relationships to the target variable for
prediction.

The Yellowbrick JointPlot can be used both for pairwise feature analysis and
feature-to-target plots. For pairwise feature analysis, the columns argument can
be used to specify the index of the two desired columns in X. If y is also
specified, the plot can be colored with a heatmap or by class. For feature-to-target
plots, the user can provide either X and y as 1D vectors, or a columns
argument with an index to a single feature in X to be plotted against y.

Histograms can be included by setting the hist argument to True for a
frequency distribution, or to "density" for a probability density function. Note
that histograms require matplotlib 2.0.2 or greater.
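A minimal sketch of a feature-to-target plot on the concrete dataset ("cement" is an assumed column name from that dataset):

from yellowbrick.features import JointPlot
from yellowbrick.datasets import load_concrete

# Load a regression dataset
X, y = load_concrete()

# Plot the "cement" feature against the target with histograms on the margins
viz = JointPlot(columns="cement", hist=True)
viz.fit(X, y)
viz.poof()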

Parameters:

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes will be
used (or generated if required). This is considered the base axes where
the primary joint plot is drawn. It will be shifted and two additional axes
added above (xhax) and to the right (yhax) if hist=True.

columns:int, str, [int, int], [str, str], default: None

Determines what data is plotted in the joint plot and acts as a selection index
into the data passed to fit(X,y). This data therefore must be indexable by
the column type (e.g. an int for a numpy array or a string for a DataFrame).

If None is specified then either both X and y must be 1D vectors and they will
be plotted against each other or X must be a 2D array with only 2 columns. If a
single index is specified then the data is indexed as X[columns] and plotted
jointly with the target variable, y. If two indices are specified then they are
both selected from X, additionally in this case, if y is specified, then it is
used to plot the color of points.

Note that these names are also used as the x and y axes labels if they aren't
specified in the joint_kws argument.

correlation:str, default: "pearson"

The algorithm used to compute the relationship between the variables in the
joint plot, one of: "pearson", "covariance", "spearman", "kendalltau".

kind:str in {"scatter", "hex"}, default: "scatter"

The type of plot to render in the joint axes. Note that when kind="hex" the
target cannot be plotted by color.

hist:{True, False, None, "density", "frequency"}, default: True

Draw histograms showing the distribution of the variables plotted jointly.
If set to "density", the probability density function will be plotted.
If set to True or "frequency" then the frequency will be plotted.
Requires Matplotlib >= 2.0.2.

alpha:float, default: 0.65

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered points more visible.

{joint, hist}_kws:dict, default: None

Additional keyword arguments for the plot components.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Frequently, machine learning problems in the real world suffer from the curse of dimensionality; you have fewer training instances than you'd like and the predictive signal is distributed (often unpredictably!) across many different features.

Sometimes when your target variable is continuously-valued, there simply aren't enough instances to predict these values to the precision of regression. In this case, we can sometimes transform the regression problem into a classification problem by binning the continuous values into makeshift classes.

To help the user select the optimal number of bins, the BalancedBinningReference visualizer takes the target variable y as input and generates a histogram with vertical lines indicating the recommended value points to ensure that the data is evenly distributed into each bin.

from yellowbrick.target import BalancedBinningReference

# Load a regression dataset
data = load_data("concrete")

# Extract the target of interest
y = data["strength"]

# Instantiate the visualizer
visualizer = BalancedBinningReference()

visualizer.fit(y)        # Fit the data to the visualizer
visualizer.poof()        # Draw/show/poof the data

One of the biggest challenges for classification models is an imbalance of classes in the training data. Severe class imbalances may be masked by relatively good F1 and accuracy scores; the classifier is simply guessing the majority class and not making any evaluation of the underrepresented class.

There are several techniques for dealing with class imbalance such as stratified sampling, down sampling the majority class, weighting, etc. But before these actions can be taken, it is important to understand what the class balance is in the training data. The ClassBalance visualizer supports this by creating a bar chart of the support for each class, that is, the frequency of the classes' representation in the dataset.
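For instance, a minimal sketch in balance mode on the game dataset discussed below (the dataset and labels are illustrative choices):

from yellowbrick.target import ClassBalance
from yellowbrick.datasets import load_game

# Load a multiclass dataset whose classes are imbalanced
X, y = load_game()

# Balance mode: pass only the target to fit()
visualizer = ClassBalance(labels=["win", "loss", "draw"])
visualizer.fit(y)
visualizer.poof()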

The resulting figure allows us to diagnose the severity of the balance issue. In this figure we can see that the "win" class dominates the other two classes. One potential solution might be to create a binary classifier, "win" vs "notwin", combining the "loss" and "draw" classes into one class.

Warning

The ClassBalance visualizer interface has changed in version 0.9; a classification model is no longer required to instantiate the visualizer, and it can operate on data only. Additionally, the signature of the fit method has changed from fit(X, y=None) to fit(y_train, y_test=None); passing in X is no longer required.

If a class imbalance must be maintained during evaluation (e.g. the event being classified is actually as rare as the frequency implies) then stratified sampling should be used to create train and test splits. This ensures that the test data has roughly the same proportion of classes as the training data. While scikit-learn does this by default in train_test_split and other cv methods, it can be useful to compare the support of each class in both splits.

The ClassBalance visualizer has a "compare" mode, where the train and test data can be passed to fit(), creating a side-by-side bar chart instead of a single bar chart as follows:

from sklearn.model_selection import train_test_split
from yellowbrick.target import ClassBalance

# Load the classification dataset
data = load_data('occupancy')

# Specify the features of interest and the target
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
classes = ["unoccupied", "occupied"]

# Extract the instances and target
X = data[features]
y = data["occupancy"]

# Create the train and test data
_, _, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the classification model and visualizer
visualizer = ClassBalance(labels=classes)

visualizer.fit(y_train, y_test)
visualizer.poof()

This visualization allows us to do a quick check to ensure that the proportion of each class is roughly similar in both splits. This visualization should be a first stop particularly when evaluation metrics are highly variable across different splits.

One of the biggest challenges for classification models is an imbalance of
classes in the training data. The ClassBalance visualizer shows the
relationship of the support for each class in both the training and test
data by displaying how frequently each class occurs as a bar graph.

The ClassBalance visualizer can be displayed in two modes:

Balance mode: show the frequency of each class in the dataset.

Compare mode: show the relationship of support in train and test data.

These modes are determined by what is passed to the fit() method.

Parameters:

ax:matplotlib Axes, default: None

The axis to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

labels: list, optional

A list of class names for the x-axis if the target is already encoded.
Ensure that the labels are ordered lexicographically with respect to
the values in the target. A common use case is to pass
LabelEncoder.classes_ as this parameter. If not specified, the labels
in the data will be used.

kwargs: dict, optional

Keyword arguments passed to the super class. Here, used
to colorize the bars in the histogram.

Attributes:

classes_:array-like

The actual unique classes discovered in the target.

support_:array of shape (n_classes,) or (2, n_classes)

A table representing the support of each class in the target. It is a
vector when in balance mode, or a table with two rows in compare mode.

This visualizer calculates Pearson correlation coefficients and mutual information between features and the dependent variable.
This visualization can be used in feature selection to identify features with high correlation or large mutual information with the dependent variable.

Mutual information between features and the dependent variable is calculated with sklearn.feature_selection.mutual_info_classif when method='mutual_info-classification' and mutual_info_regression when method='mutual_info-regression'.
It is very important to specify discrete features when calculating mutual information because the calculations for continuous and discrete variables are different.
See the scikit-learn documentation for more details.

By fitting with a pandas DataFrame, the feature labels are automatically obtained from the column names.
This visualizer also allows sorting of the bar plot according to the calculated mutual information (or Pearson correlation coefficients) and selecting features to plot by specifying the names of the features or the feature index.
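A minimal sketch of both methods, assuming the FeatureCorrelation visualizer in yellowbrick.target and scikit-learn's diabetes dataset (illustrative choices):

from sklearn.datasets import load_diabetes
from yellowbrick.target import FeatureCorrelation

# Load a regression dataset with named features
data = load_diabetes()
X, y = data.data, data.target

# Pearson correlation between each feature and the target
viz = FeatureCorrelation(labels=data.feature_names)
viz.fit(X, y)
viz.poof()

# Mutual information for a regression target, sorted by score
viz = FeatureCorrelation(
    method="mutual_info-regression", labels=data.feature_names, sort=True
)
viz.fit(X, y)
viz.poof()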

Regression models attempt to predict a target in a continuous space.
Regressor score visualizers display the instances in model space to
better understand how the model is making predictions. We currently have
implemented three regressor evaluations:

Residuals Plot: plot the difference between the expected and actual
values

Prediction Error Plot: plot the expected vs. actual values in model
space

Alpha Selection: visual tuning of regularization hyperparameters

Estimator score visualizers wrap Scikit-Learn estimators and expose
the Estimator API such that they have fit(), predict(), and
score() methods that call the appropriate estimator methods under
the hood. Score visualizers can wrap an estimator and be passed in as
the final step in a Pipeline or VisualPipeline.

Residuals, in the context of regression models, are the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. The residuals plot shows the difference between residuals on the vertical axis and the dependent variable on the horizontal axis, allowing you to detect regions within the target that may be susceptible to more or less error.

from sklearn.model_selection import train_test_split

# Load the data
df = load_data('concrete')

# Identify the feature and target columns
feature_names = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target_name = 'strength'

# Separate the instance data from the target data
X = df[feature_names]
y = df[target_name]

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot

# Instantiate the linear model and visualizer
ridge = Ridge()
visualizer = ResidualsPlot(ridge)

visualizer.fit(X_train, y_train)    # Fit the training data to the model
visualizer.score(X_test, y_test)    # Evaluate the model on the test data
visualizer.poof()                   # Draw/show/poof the data

A common use of the residuals plot is to analyze the variance of the error of the regressor. If the points are randomly dispersed around the horizontal axis, a linear regression model is usually appropriate for the data; otherwise, a non-linear model is more appropriate. In the case above, we see a fairly random, uniform distribution of the residuals against the target in two dimensions. This seems to indicate that our linear model is performing well. We can also see from the histogram that our error is normally distributed around zero, which also generally indicates a well fitted model.

Note that if the histogram is not desired, it can be turned off with the hist=False flag:
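For instance, a sketch reusing the Ridge model and the train/test split created above:

from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot

# Instantiate the visualizer with the histogram turned off
visualizer = ResidualsPlot(Ridge(), hist=False)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()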

hist:{True, False, None, "density", "frequency"}, default: True

Draw a histogram showing the distribution of the residuals on the
right side of the figure. Requires Matplotlib >= 2.0.2.
If set to "density", the probability density function will be plotted.
If set to True or "frequency" then the frequency will be plotted.

train_color:color, default: "b"

Residuals for training data are plotted with this color but also
given an opacity of 0.5 to ensure that the test data residuals
are more visible. Can be any matplotlib color.

test_color:color, default: "g"

Residuals for test data are plotted with this color. In order to
create generalizable models, reserved test data residuals are of
the most analytical interest, so these points are highlighted by
having full opacity. Can be any matplotlib color.

line_color:color, default: dark grey

Defines the color of the zero error line, can be any matplotlib color.

alpha:float, default: 0.75

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered points more visible.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Notes

ResidualsPlot is a ScoreVisualizer, meaning that it wraps a model and
its primary entry point is the score() method.

Draw the residuals against the predicted value for the specified split.
It is best to draw the training split first, then the test split so
that the test split (usually smaller) is above the training split;
particularly if the histogram is turned on.

Parameters:

y_pred:ndarray or Series of length n

An array or series of predicted target values

residuals:ndarray or Series of length n

An array or series of the difference between the predicted and the
target values

train:boolean, default: False

If False, draw assumes that the residual points being plotted
are from the test data; if True, draw assumes the residuals
are the train data.

A prediction error plot shows the actual targets from the dataset against the predicted values generated by our model. This allows us to see how much variance is in the model. Data scientists can diagnose regression models using this plot by comparing against the 45 degree line, where the prediction exactly matches the actual value.

from sklearn.model_selection import train_test_split

# Load the regression data set
data = load_data('concrete')

# Specify the features of interest and the target
features = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target = 'strength'

# Extract the instances and target
X = data[features]
y = data[target]

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError

# Instantiate the linear model and visualizer
lasso = Lasso()
visualizer = PredictionError(lasso)

visualizer.fit(X_train, y_train)    # Fit the training data to the visualizer
visualizer.score(X_test, y_test)    # Evaluate the model on the test data
g = visualizer.poof()               # Draw/show/poof the data

The prediction error visualizer plots the actual targets from the dataset
against the predicted values generated by our model(s). This visualizer is
used to detect noise or heteroscedasticity along a range of the target
domain.

Parameters:

model:a Scikit-Learn regressor

Should be an instance of a regressor, otherwise will raise a
YellowbrickTypeError exception on instantiation.

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

shared_limits:bool, default: True

If shared_limits is True, the range of the X and Y axis limits will
be identical, creating a square graphic with a true 45 degree line.
In this form, it is easier to diagnose under- or over- prediction,
though the figure will become more sparse. To localize points, set
shared_limits to False, but note that this will distort the figure
and should be accounted for during analysis.

bestfit:bool, default: True

Draw a linear best fit line to estimate the correlation between the
predicted and measured value of the target variable. The color of
the bestfit line is determined by the line_color argument.

identity: bool, default: True

Draw the 45 degree identity line, y=x in order to better show the
relationship or pattern of the residuals. E.g. to estimate if the
model is over- or under- estimating the given values. The color of the
identity line is a muted version of the line_color argument.

point_color:color

Defines the color of the error points; can be any matplotlib color.

line_color:color

Defines the color of the best fit line; can be any matplotlib color.

alpha:float, default: 0.75

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered points more visible.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Notes

PredictionError is a ScoreVisualizer, meaning that it wraps a model and
its primary entry point is the score() method.

The score function is the hook for visual interaction. Pass in test
data and the visualizer will create predictions on the data and
evaluate them with respect to the test values. The evaluation will
then be passed to draw() and the result of the estimator score will
be returned.

Regularization is designed to penalize model complexity, therefore the higher the alpha, the less complex the model, decreasing the error due to variance (overfit). Alphas that are too high on the other hand increase the error due to bias (underfit). It is important, therefore, to choose an optimal alpha such that the error is minimized in both directions.

The AlphaSelection Visualizer demonstrates how different values of alpha influence model selection during the regularization of linear models. Generally speaking, alpha increases the effect of regularization, e.g. if alpha is zero there is no regularization, and the higher the alpha, the more the regularization parameter influences the final model.

# Load the regression data set
df = load_data('concrete')

# Specify the features of interest and the target
features = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
target = 'strength'

# Extract the instances and target
X = df[features]
y = df[target]

import numpy as np

from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection

# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)

# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)

visualizer.fit(X, y)
g = visualizer.poof()

The Alpha Selection Visualizer demonstrates how different values of alpha
influence model selection during the regularization of linear models.
Generally speaking, alpha increases the effect of regularization, e.g. if
alpha is zero there is no regularization and the higher the alpha, the
more the regularization parameter influences the final model.

Regularization is designed to penalize model complexity, therefore the
higher the alpha, the less complex the model, decreasing the error due to
variance (overfit). Alphas that are too high on the other hand increase
the error due to bias (underfit). It is important, therefore, to choose an
optimal alpha such that the error is minimized in both directions.

To do this, typically you would use one of the "RegressionCV" models
in Scikit-Learn. E.g. instead of using the Ridge (L2) regularizer, you
can use RidgeCV and pass a list of alphas, which will be selected
based on the cross-validation score of each alpha. This visualizer wraps
a "RegressionCV" model and visualizes the alpha/error curve. Use this
visualization to detect if the model is responding to regularization, e.g.
as you increase or decrease alpha, the model responds and error is
decreased. If the visualization shows a jagged or random plot, then
potentially the model is not sensitive to that type of regularization and
another is required (e.g. L1 or Lasso regularization).

Parameters:

model:a Scikit-Learn regressor

Should be an instance of a regressor, and specifically one whose name
ends with "CV"; otherwise it will raise a YellowbrickTypeError exception
on instantiation. To use non-CV regressors see:
ManualAlphaSelection.

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Notes

This class expects an estimator whose name ends with "CV". If you wish to
use some other estimator, please see the ManualAlphaSelection
Visualizer for manually iterating through all alphas and selecting the
best one.

This Visualizer hooks into the Scikit-Learn API during fit(). In
order to pass a fitted model to the Visualizer, call the draw() method
directly after instantiating the visualizer with the fitted model.

Note, each "RegressorCV" module has many different methods for storing
alphas and error. This visualizer attempts to get them all and is known
to work for RidgeCV, LassoCV, LassoLarsCV, and ElasticNetCV. If your
favorite regularization method doesn't work, please submit a bug report.

The AlphaSelection visualizer requires a "RegressorCV", that is, a
specialized class that performs cross-validated alpha-selection on behalf
of the model. If the regressor you wish to use doesn't have an associated
"CV" estimator, or for some reason you would like to specify more control
over the alpha selection process, then you can use this manual alpha
selection visualizer, which is essentially a wrapper for
cross_val_score, fitting a model for each alpha specified.

Parameters:

model:a Scikit-Learn regressor

Should be an instance of a regressor, and specifically one whose name
doesn't end with "CV". The regressor must support a call to
set_params(alpha=alpha) and be fit multiple times. If the
regressor name ends with "CV", a YellowbrickValueError is raised.

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

The fit method is the primary entry point for the manual alpha
selection visualizer. It sets the alpha param for each alpha in the
alphas list on the wrapped estimator, then scores the model using the
passed in X and y data set. Those scores are then aggregated and
drawn using matplotlib.
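A minimal sketch, reusing the X and y from the concrete example above; the cv and scoring values are illustrative assumptions:

import numpy as np

from sklearn.linear_model import Ridge
from yellowbrick.regressor import ManualAlphaSelection

# Create a list of alphas to try; the model is fit once per alpha
alphas = np.logspace(-10, 1, 400)

visualizer = ManualAlphaSelection(
    Ridge(), alphas=alphas, cv=12, scoring="neg_mean_squared_error"
)
visualizer.fit(X, y)
visualizer.poof()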

Classification models attempt to predict a target in a discrete space, that is, to assign an instance of dependent variables to one or more categories. Classification score visualizers display the differences between classes as well as a number of classifier-specific visual evaluations. We currently have implemented the following classifier evaluations:

Classification Report: A visual classification report that displays precision, recall, and F1 per-class as a heatmap.

Confusion Matrix: A heatmap view of the confusion matrix of pairs of classes in multi-class classification.

ROCAUC: Graphs the receiver operating characteristics and area under the curve.

Class Balance: Visual inspection of the target to show the support of each class to the final estimator.

Class Prediction Error: An alternative to the confusion matrix that shows both support and the difference between actual and predicted classes.

Discrimination Threshold: Shows precision, recall, f1, and queue rate over all thresholds for binary classifiers that use a discrimination probability or score.

Estimator score visualizers wrap scikit-learn estimators and expose the
Estimator API such that they have fit(), predict(), and score()
methods that call the appropriate estimator methods under the hood. Score
visualizers can wrap an estimator and be passed in as the final step in
a Pipeline or VisualPipeline.

The classification report visualizer displays the precision, recall, F1, and support scores for the model. In order to support easier interpretation and problem detection, the report integrates numerical scores with a color-coded heatmap. All heatmaps are in the range (0.0,1.0) to facilitate easy comparison of classification models across different classification reports.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

from yellowbrick.classifier import ClassificationReport
from yellowbrick.datasets import load_occupancy

# Load the classification data set
X, y = load_occupancy()

# Specify the classes of the target
classes = ["unoccupied", "occupied"]

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the classification model and visualizer
bayes = GaussianNB()
visualizer = ClassificationReport(bayes, classes=classes, support=True)

visualizer.fit(X_train, y_train)    # Fit the visualizer and the model
visualizer.score(X_test, y_test)    # Evaluate the model on the test data
visualizer.poof()                   # Draw/show/poof the data

The classification report shows a representation of the main classification metrics on a per-class basis. This gives a deeper intuition of the classifier behavior over global accuracy, which can mask functional weaknesses in one class of a multiclass problem. Visual classification reports are used to compare classification models to select models that are "redder", e.g. have stronger classification metrics, or that are more balanced.

The metrics are defined in terms of true and false positives, and true and false negatives. Positive and negative in this case are generic names for the classes of a binary classification problem. In the example above, we would consider true and false occupied and true and false unoccupied. Therefore a true positive is when the actual class is positive as is the estimated class. A false positive is when the actual class is negative but the estimated class is positive. Using this terminology the metrics are defined as follows:

precision

Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives. Said another way, "for all instances classified positive, what percent was correct?"

recall

Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives. Said another way, "for all instances that were actually positive, what percent was classified correctly?"

f1 score

The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

support

Support is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing. Support doesn't change between models but instead diagnoses the evaluation process.

The ConfusionMatrix visualizer is a ScoreVisualizer that takes a
fitted scikit-learn classifier and a set of test X and y values and
returns a report showing how each of the test values' predicted classes
compare to their actual classes. Data scientists use confusion matrices
to understand which classes are most easily confused. These provide
similar information as what is available in a ClassificationReport, but
rather than top-level scores, they provide deeper insight into the
classification of individual data points.

Below are a few examples of using the ConfusionMatrix visualizer; more
information can be found by looking at the
scikit-learn documentation on confusion matrices.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ConfusionMatrix

# We'll use the handwritten digits data set from scikit-learn.
# Each feature of this dataset is an 8x8 pixel image of a handwritten number.
# Digits.data converts these 64 pixels into a single array of features
digits = load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=11)

model = LogisticRegression(multi_class="auto", solver="liblinear")

# The ConfusionMatrix visualizer takes a model
cm = ConfusionMatrix(model, classes=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
cm.fit(X_train, y_train)

# To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
# and then creates the confusion_matrix from scikit-learn.
cm.score(X_test, y_test)

# How did we do?
cm.poof()

Class names can be added to a ConfusionMatrix plot using the label_encoder argument. The label_encoder can be a sklearn.preprocessing.LabelEncoder (or anything with an inverse_transform method that performs the mapping), or a dict with the encoding-to-string mapping as in the example below:
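A sketch using the iris data, where a plain dict maps the encoded target values back to class names (the dataset and model are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ConfusionMatrix

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)

# Map the integer encoding back to human-readable labels
mapping = {0: "setosa", 1: "versicolor", 2: "virginica"}

cm = ConfusionMatrix(
    LogisticRegression(multi_class="auto", solver="liblinear"),
    classes=["setosa", "versicolor", "virginica"],
    label_encoder=mapping
)
cm.fit(X_train, y_train)
cm.score(X_test, y_test)
cm.poof()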

Creates a heatmap visualization of the sklearn.metrics.confusion_matrix().
A confusion matrix shows each combination of the true and predicted
classes for a test data set.

The default color map uses a yellow/orange/red color scale. The user can
choose between displaying values as the percent of true (cell value
divided by sum of row) or as direct counts. If percent of true mode is
selected, 100% accurate predictions are highlighted in green.

Requires a classification model.

Parameters:

model:estimator

Must be a classifier, otherwise raises YellowbrickTypeError

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

sample_weight: array-like of shape = [n_samples], optional

Passed to confusion_matrix to weight the samples.

percent: bool, default: False

Determines whether or not the confusion_matrix is displayed as counts
or as a percent of true predictions. Note, if specifying a subset of
classes, percent should be set to False or inaccurate figures will be
displayed.

classes:list, default: None

a list of class names to use in the confusion_matrix.
This is passed to the labels parameter of
sklearn.metrics.confusion_matrix(), and follows the behaviour
indicated by that function. It may be used to reorder or select a
subset of labels. If None, classes that appear at least once in
y_true or y_pred are used in sorted order.

label_encoder:dict or LabelEncoder, default: None

When specifying the classes argument, the input to fit()
and score() must match the expected labels. If the X and y
datasets have been encoded prior to training and the labels must be
preserved for the visualization, use this argument to provide a
mapping from the encoded class to the correct label. Because typically
a Scikit-Learn LabelEncoder is used to perform this operation, you
may provide it directly to the class to utilize its fitted encoding.

cmap:string, default: 'YlOrRd'

Specify a colormap to define the heatmap of the predicted class
against the actual class in the confusion matrix.

fontsize:int, default: None

Specify the fontsize of the text in the grid and labels to make the
matrix a bit easier to read. Uses rcParams font size by default.

A ROCAUC (Receiver Operating Characteristic/Area Under the Curve) plot allows the user to visualize the tradeoff between the classifier's sensitivity and specificity.

The Receiver Operating Characteristic (ROC) is a measure of a classifier's predictive quality that compares and visualizes the tradeoff between the model's sensitivity and specificity. When plotted, a ROC curve displays the true positive rate on the Y axis and the false positive rate on the X axis on both a global average and per-class basis. The ideal point is therefore the top-left corner of the plot: false positives are zero and true positives are one.

This leads to another metric, area under the curve (AUC), which is a computation of the relationship between false positives and true positives. The higher the AUC, the better the model generally is. However, it is also important to inspect the "steepness" of the curve, as this describes the maximization of the true positive rate while minimizing the false positive rate.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from yellowbrick.classifier import ROCAUC
from yellowbrick.datasets import load_occupancy

# Load the classification data set
X, y = load_occupancy()

# Specify the classes of the target
classes = ["unoccupied", "occupied"]

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the visualizer with the classification model
visualizer = ROCAUC(
    LogisticRegression(multi_class="auto", solver="liblinear"), classes=classes
)

visualizer.fit(X_train, y_train)    # Fit the training data to the visualizer
visualizer.score(X_test, y_test)    # Evaluate the model on the test data
visualizer.poof()                   # Draw/show/poof the data

Versions of Yellowbrick <= v0.8 had a bug
that triggered an IndexError when attempting binary classification using
a Scikit-learn-style estimator with only a decision_function. This has been
fixed as of v0.9, where the micro, macro, and per-class parameters of
ROCAUC are set to False for such classifiers.

Yellowbrickâs ROCAUC Visualizer does allow for plotting multiclass classification curves.
ROC curves are typically used in binary classification, and in fact the Scikit-Learn roc_curve metric is only able to perform metrics for binary classifiers. Yellowbrick addresses this by binarizing the output (per-class) or to use one-vs-rest (micro score) or one-vs-all (macro score) strategies of classification.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

from yellowbrick.classifier import ROCAUC
from yellowbrick.datasets import load_game

# Load multi-class classification dataset
X, y = load_game()
classes = ["win", "loss", "draw"]

# Encode the non-numeric columns
X = OrdinalEncoder().fit_transform(X)
y = LabelEncoder().fit_transform(y)

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

visualizer = ROCAUC(RidgeClassifier(), classes=classes)

visualizer.fit(X_train, y_train)    # Fit the training data to the visualizer
visualizer.score(X_test, y_test)    # Evaluate the model on the test data
visualizer.poof()                   # Draw/show/poof the data

Note: the target y must be numeric for this figure to work, or you must update to the latest version of sklearn.

By default with multi-class ROCAUC visualizations, a curve for each class is plotted, in addition to the micro- and macro-average curves. This enables the user to inspect the tradeoff between sensitivity and specificity on a per-class basis. Note that for multi-class ROCAUC, at least one of the micro, macro, or per_class parameters must be set to True (by default, all are set to True).

Receiver Operating Characteristic (ROC) curves are a measure of a
classifier's predictive quality that compares and visualizes the tradeoff
between the model's sensitivity and specificity. The ROC curve displays
the true positive rate on the Y axis and the false positive rate on the
X axis on both a global average and per-class basis. The ideal point is
therefore the top-left corner of the plot: false positives are zero and
true positives are one.

This leads to another metric, area under the curve (AUC), a computation
of the relationship between false positives and true positives. The higher
the AUC, the better the model generally is. However, it is also important
to inspect the "steepness" of the curve, as this describes the
maximization of the true positive rate while minimizing the false positive
rate. Generalizing "steepness" usually leads to discussions about
convexity, which we do not get into here.

Parameters:

model:estimator

Must be a classifier, otherwise raises YellowbrickTypeError

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

classes:list

A list of class names for the legend. If classes is None and a y value
is passed to fit then the classes are selected from the target vector.
Note that the curves must be computed based on what is in the target
vector passed to the score() method. Class names are used for
labeling only and must be in the correct order to prevent confusion.

micro:bool, default = True

Plot the micro-averages ROC curve, computed from the sum of all true
positives and false positives across all classes. Micro is not defined
for binary classification problems with estimators with only a
decision_function method.

macro:bool, default = True

Plot the macro-averages ROC curve, which simply takes the average of
curves across all classes. Macro is not defined for binary
classification problems with estimators with only a decision_function
method.

per_class:bool, default = True

Plot the ROC curves for each individual class. This should be set
to false if only the macro or micro average curves are required. Per-
class classification is not defined for binary classification problems
with estimators with only a decision_function method.

kwargs:keyword arguments passed to the super class.

Currently passing in hard-coded colors for the Receiver Operating
Characteristic curve and the diagonal.
These will be refactored to a default Yellowbrick style.

Notes

ROC curves are typically used in binary classification, and in fact the
Scikit-Learn roc_curve metric is only able to perform metrics for
binary classifiers. As a result it is necessary to binarize the output or
to use one-vs-rest or one-vs-all strategies of classification. The
visualizer does its best to handle multiple situations, but exceptions can
arise from unexpected models or outputs.

Another important point is the relationship of class labels specified on
initialization to those drawn on the curves. The classes are not used to
constrain ordering or filter curves; the ROC computation happens on the
unique values specified in the target vector to the score method. To
ensure the best quality visualization, do not use a LabelEncoder for this
and do not pass in class labels.

Precision-Recall curves are a metric used to evaluate a classifier's quality,
particularly when classes are very imbalanced. The precision-recall curve
shows the tradeoff between precision, a measure of result relevancy, and
recall, a measure of how many relevant results are returned. A large area
under the curve represents both high recall and precision, the best case
scenario for a classifier, showing a model that returns accurate results
for the majority of classes it selects.

The base case for precision-recall curves is the binary classification case, and this case is also the most visually interpretable. In the figure above we can see the precision plotted on the y-axis against the recall on the x-axis. The larger the filled-in area, the stronger the classifier is. The red line annotates the average precision, a summary of the entire plot computed as the weighted average of precision achieved at each threshold, such that the weight is the difference in recall from the previous threshold.

To support multi-label classification, the estimator is wrapped in a OneVsRestClassifier to produce binary comparisons for each class (e.g. the positive case is the class and the negative case is any other class). The Precision-Recall curve is then computed as the micro-average of the precision and recall for all classes:
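A sketch of the multiclass micro-averaged case, assuming the game dataset and a RandomForestClassifier (illustrative choices; the categorical features must be encoded first):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

from yellowbrick.classifier import PrecisionRecallCurve
from yellowbrick.datasets import load_game

# Load and encode the multiclass dataset
X, y = load_game()
X = OrdinalEncoder().fit_transform(X)
y = LabelEncoder().fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# micro=True (the default) draws a single micro-averaged curve
viz = PrecisionRecallCurve(RandomForestClassifier(n_estimators=10))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()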

A more complex Precision-Recall curve can be computed, however, displaying each curve individually, along with F1-score ISO curves (e.g. that show the relationship between precision and recall for various F1 scores).
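A sketch of that per-class variant, continuing from the encoded game data above; the MultinomialNB estimator is an illustrative choice:

from sklearn.naive_bayes import MultinomialNB
from yellowbrick.classifier import PrecisionRecallCurve

# Draw one curve per class, with ISO F1 reference curves and no fill
viz = PrecisionRecallCurve(
    MultinomialNB(), per_class=True, iso_f1_curves=True,
    fill_area=False, micro=False
)
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.poof()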

Precision-Recall curves are a metric used to evaluate a classifier's quality,
particularly when classes are very imbalanced. The precision-recall curve
shows the tradeoff between precision, a measure of result relevancy, and
recall, a measure of how many relevant results are returned. A large area
under the curve represents both high recall and precision, the best case
scenario for a classifier, showing a model that returns accurate results
for the majority of classes it selects.

Parameters:

model:the Scikit-Learn estimator

A classification model to score the precision-recall curve on.

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

classes:list

A list of class names for the legend. If classes is None and a y value
is passed to fit then the classes are selected from the target vector.
Note that the curves must be computed based on what is in the target
vector passed to the score() method. Class names are used for
labeling only and must be in the correct order to prevent confusion.

fill_area:bool, default=True

Fill the area under the curve (or curves) with the curve color.

ap_score:bool, default=True

Annotate the graph with the average precision score, a summary of the
plot that is computed as the weighted mean of precisions at each
threshold, with the increase in recall from the previous threshold used
as the weight.

micro:bool, default=True

If multi-class classification, draw the precision-recall curve for the
micro-average of all classes. In the multi-class case, either micro or
per-class must be set to True. Ignored in the binary case.

iso_f1_curves:bool, default=False

Draw ISO F1-Curves on the plot to show how close the precision-recall
curves are to different F1 scores.

iso_f1_values:tuple, default=(0.2, 0.4, 0.6, 0.8)

Values of f1 score for which to draw ISO F1-Curves

per_class:bool, default=False

If multi-class classification, draw the precision-recall curve for
each class using a OneVsRestClassifier to compute the recall on a
per-class basis. In the multi-class case, either micro or per-class
must be set to True. Ignored in the binary case.

fill_opacity:float, default=0.2

Specify the alpha or opacity of the fill area (0 being transparent,
and 1.0 being completely opaque).

line_opacity:float, default=0.8

Specify the alpha or opacity of the lines (0 being transparent, and
1.0 being completely opaque).

Either "binary" or "multiclass" depending on the type of target
fit to the visualizer. If "multiclass" then the estimator is
wrapped in a OneVsRestClassifier classification strategy.

score_:float or dict of floats

Average precision, a summary of the plot as a weighted mean of
precision at each threshold, weighted by the increase in recall from
the previous threshold. In the multiclass case, a mapping of class/metric
to the average precision score.

precision_:array or dict of array with shape=[n_thresholds + 1]

Precision values such that element i is the precision of predictions
with score >= thresholds[i] and the last element is 1. In the multiclass
case, a mapping of class/metric to precision array.

recall_:array or dict of array with shape=[n_thresholds + 1]

Decreasing recall values such that element i is the recall of
predictions with score >= thresholds[i] and the last element is 0.
In the multiclass case, a mapping of class/metric to recall array.

The Yellowbrick ClassPredictionError plot is a twist on other, sometimes more familiar, classification model diagnostic tools like the Confusion Matrix and Classification Report. Like the Classification Report, this plot shows the support (number of training samples) for each class in the fitted classification model as a stacked bar chart. Each bar is segmented to show the proportion of predictions (including false negatives and false positives, like a Confusion Matrix) for each class. You can use a ClassPredictionError to visualize which classes your classifier is having a particularly difficult time with, and more importantly, what incorrect answers it is giving on a per-class basis. This can often enable you to better understand the strengths and weaknesses of different models and the particular challenges unique to your dataset.

The class prediction error chart provides a way to quickly understand how good your classifier is at predicting the right classes.

In the above example, while the RandomForestClassifier appears to be fairly good at correctly predicting apples based on the features of the fruit, it often incorrectly labels pears as kiwis and mistakes kiwis for bananas.

By contrast, in the following example, the RandomForestClassifier does a great job at correctly predicting accounts in default, but it is a bit of a coin toss in predicting account holders who stayed current on bills.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.datasets import load_credit

X, y = load_credit()
classes = ['account in default', 'current with bills']

# Perform 80/20 training/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Instantiate the classification model and visualizer
visualizer = ClassPredictionError(RandomForestClassifier(n_estimators=10), classes=classes)

# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)

# Evaluate the model on the test data
visualizer.score(X_test, y_test)

# Draw visualization
visualizer.poof()

Class Prediction Error chart that shows the support for each class in the
fitted classification model displayed as a stacked bar. Each bar is
segmented to show the distribution of predicted classes for each
class. It is initialized with a fitted model and generates a
class prediction error chart on draw.

Parameters:

ax: axes

The axis to plot the figure on.

model: estimator

Scikit-Learn estimator object. Should be an instance of a classifier,
else __init__() will raise an exception.

classes: list

A list of class names for the legend. If classes is None and a y value
is passed to fit then the classes are selected from the target vector.

kwargs: dict

Keyword arguments passed to the super class. Here, used
to colorize the bars in the histogram.

Notes

These parameters can be influenced later on in the visualization
process, but can and should be set as early as possible.

Attributes:

score_:float

Global accuracy score

predictions_:ndarray

An ndarray of predictions whose rows are the true classes and
whose columns are the predicted classes

A visualization of precision, recall, f1 score, and queue rate with respect to the discrimination threshold of a binary classifier. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. Generally, this is set to 50% but the threshold can be adjusted to increase or decrease the sensitivity to false positives or to other application factors.

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold
from yellowbrick.datasets import load_spam

# Load a binary classification dataset
X, y = load_spam()

# Instantiate the classification model and visualizer
model = LogisticRegression(multi_class="auto", solver="liblinear")
visualizer = DiscriminationThreshold(model)

visualizer.fit(X, y)    # Fit the data to the visualizer
visualizer.poof()       # Draw/show/poof the data

One common use of binary classification algorithms is to use the score or probability they produce to determine cases that require special treatment. For example, a fraud prevention application might use a classification algorithm to determine if a transaction is likely fraudulent and needs to be investigated in detail. In the figure above, we present an example where a binary classifier determines if an email is "spam" (the positive case) or "not spam" (the negative case). Emails that are detected as spam are moved to a hidden folder and eventually deleted.

Many classifiers use either a decision_function to score the positive class or a predict_proba function to compute the probability of the positive class. If the score or probability is greater than some discrimination threshold then the positive class is selected, otherwise, the negative class is.
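To make the mechanics concrete, here is a short sketch that applies a custom threshold by hand; the fitted model, the test data, and the 0.43 threshold are assumptions for illustration:

import numpy as np

# Probability of the positive class for each test instance
probs = model.predict_proba(X_test)[:, 1]

# Select the positive class only when its probability clears the threshold
y_pred = (probs >= 0.43).astype(int)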

Generally speaking, the threshold is balanced between cases and set to 0.5 or 50% probability. However, this threshold may not be the optimal threshold: often there is an inverse relationship between precision and recall with respect to a discrimination threshold. By adjusting the threshold of the classifier, it is possible to tune the F1 score (the harmonic mean of precision and recall) to the best possible fit or to adjust the classifier to behave optimally for the specific application. Classifiers are tuned by considering the following metrics:

Precision: An increase in precision is a reduction in the number of false positives; this metric should be optimized when the cost of special treatment is high (e.g. wasted time in fraud prevention or missing an important email).

Recall: An increase in recall decreases the likelihood that the positive class is missed; this metric should be optimized when it is vital to catch the case even at the cost of more false positives.

F1 Score: The F1 score is the harmonic mean between precision and recall. The fbeta parameter determines the relative weight of precision and recall when computing this metric, by default set to 1 or F1. Optimizing this metric produces the best balance between precision and recall.

Queue Rate: The âqueueâ is the spam folder or the inbox of the fraud investigation desk. This metric describes the percentage of instances that must be reviewed. If review has a high cost (e.g. fraud prevention) then this must be minimized with respect to business requirements; if it doesnât (e.g. spam filter), this could be optimized to ensure the inbox stays clean.

In the figure above we see the visualizer tuned to look for the optimal F1 score, which is annotated as a threshold of 0.43. The model is run multiple times over multiple train/test splits in order to account for the variability of the model with respect to the metrics (shown as the fill area around the median curve).

Visualizes how precision, recall, f1 score, and queue rate change as the
discrimination threshold increases. For probabilistic, binary classifiers,
the discrimination threshold is the probability at which you choose the
positive class over the negative. Generally this is set to 50%, but
adjusting the discrimination threshold will adjust sensitivity to false
positives which is described by the inverse relationship of precision and
recall with respect to the threshold.

The visualizer also accounts for variability in the model by running
multiple trials with different train and test splits of the data. The
variability is visualized using a band such that the curve is drawn as the
median score of each trial and the band is from the 10th to 90th
percentile.

The visualizer is intended to help users determine an appropriate
threshold for decision making (e.g. at what threshold do we have a human
review the data), given a tolerance for precision and recall or limiting
the number of records to check (the queue rate).

Caution

This method only works for binary, probabilistic classifiers.

Parameters:

model:Classification Estimator

A binary classification estimator that implements predict_proba or
decision_function methods. Will raise TypeError if the model
cannot be used with the visualizer.

ax:matplotlib Axes, default: None

The axis to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

n_trials:integer, default: 50

Number of times to shuffle and split the dataset to account for noise
in the threshold metrics curves. Note if cv provides > 1 splits,
the number of trials will be n_trials * cv.get_n_splits()

cv:float or cross-validation generator, default: 0.1

Determines the splitting strategy for each trial. Possible inputs are:

float, to specify the percent of the test split

object to be used as cross-validation generator

This attribute is meant to give flexibility with stratified splitting
but if a splitter is provided, it should only return one split and
have shuffle set to True.

fbeta:float, 1.0 by default

The strength of recall versus precision in the F-score.

argmax:str, default: "fscore"

Annotate the threshold maximized by the supplied metric (see exclude
for the possible metrics to use). If None, will not annotate the
graph.

exclude:str or list, optional

Specify metrics to omit from the graph, can include:

"precision"

"recall"

"queue_rate"

"fscore"

Excluded metrics will not be displayed in the graph, nor will they
be available in thresholds_; however, they will still be computed on fit.

quantiles:sequence, default: np.array([0.1, 0.5, 0.9])

Specify the quantiles to view model variability across a number of
trials. Must be monotonic and have three elements such that the first
element is the lower bound, the second is the drawn curve, and the
third is the upper bound. By default the curve is drawn at the median,
and the bounds from the 10th percentile to the 90th percentile.

random_state:int, optional

Used to seed the random state for shuffling the data while composing
different train and test splits. If supplied, the random state is
incremented in a deterministic fashion for each split.

Note that if a splitter is provided, itâs random state will also be
updated with this random state, even if it was previously set.

kwargs:dict

Keyword arguments that are passed to the base visualizer class.

Notes

The term âdiscrimination thresholdâ is rare in the literature. Here, we
use it to mean the probability at which the positive class is selected
over the negative class in binary classification.

Classification models must implement either a decision_function or
predict_proba method in order to be used with this class. A
YellowbrickTypeError is raised otherwise.

Fit is the entry point for the visualizer. Given instances described
by X and binary classes described in the target y, fit performs n
trials by shuffling and splitting the dataset then computing the
precision, recall, f1, and queue rate scores for each trial. The
scores are aggregated by the specified quantiles and then drawn.

Parameters:

X:ndarray or DataFrame of shape n x m

A matrix of n instances with m features

y:ndarray or Series of length n

An array or series of target or class values. The target y must
be a binary classification target.

Clustering models are unsupervised methods that attempt to detect patterns in unlabeled data. There are two primary classes of clustering algorithm: agglomerative clustering links similar data points together, whereas centroidal clustering attempts to find centers or partitions in the data. Yellowbrick provides the yellowbrick.cluster module to visualize and evaluate clustering behavior. Currently we provide several visualizers to evaluate centroidal mechanisms, particularly K-Means clustering, that help us to discover an optimal \(K\) parameter in the clustering metric:

Elbow Method: visualize the clusters according to some scoring function, look for an "elbow" in the curve.

Because it is very difficult to score a clustering model, Yellowbrick visualizers wrap scikit-learn clusterer estimators via their fit() method. Once the clustering model is trained, then the visualizer can call poof() to display the clustering evaluation metric.

The KElbowVisualizer implements the "elbow" method to help data scientists select the optimal number of clusters by fitting the model with a range of values for \(K\). If the line chart resembles an arm, then the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.

To demonstrate, in the following example the KElbowVisualizer fits the KMeans model for a range of \(K\) values from 4 to 11 on a sample two-dimensional dataset with 8 random clusters of points. When the model is fit with 8 clusters, we can see an "elbow" in the graph, which in this case we know to be the optimal number.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4, 12))

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

By default, the scoring parameter metric is set to distortion, which
computes the sum of squared distances from each point to its assigned center.
However, two other metrics can also be used with the KElbowVisualizer: silhouette and calinski_harabaz. The silhouette score calculates the mean Silhouette Coefficient of all samples, while the calinski_harabaz score computes the ratio of dispersion between and within clusters.

The KElbowVisualizer also displays the amount of time to train the clustering model per \(K\) as a dashed green line, but it can be hidden by setting timings=False. In the following example, we'll use the calinski_harabaz score and hide the time to fit the model.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4, 12), metric='calinski_harabaz', timings=False)

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

The K-Elbow Visualizer implements the "elbow" method of selecting the
optimal number of clusters for K-means clustering. K-means is a simple
unsupervised machine learning algorithm that groups data into a specified
number (k) of clusters. Because the user must specify in advance what k to
choose, the algorithm is somewhat naive: it assigns all members to k
clusters even if that is not the right k for the dataset.

The elbow method runs k-means clustering on the dataset for a range of
values for k (say from 1-10) and then for each value of k computes an
average score for all clusters. By default, the distortion score is
computed, the sum of square distances from each point to its assigned
center. Other metrics can also be used such as the silhouette score,
the mean silhouette coefficient for all samples or the
calinski_harabaz score, which computes the ratio of dispersion between
and within clusters.

When these overall metrics for each model are plotted, it is possible to
visually determine the best value for k. If the line chart looks like an
arm, then the "elbow" (the point of inflection on the curve) is the best
value of k. The "arm" can be either up or down, but if there is a strong
inflection point, it is a good indication that the underlying model fits
best at that point.

Parameters:

model:a Scikit-Learn clusterer

Should be an instance of a clusterer, specifically KMeans or
MiniBatchKMeans. If it is not a clusterer, an exception is raised.

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

k:integer, tuple, or iterable

The k values to compute silhouette scores for. If a single integer
is specified, then will compute the range (2,k). If a tuple of 2
integers is specified, then k will be in np.arange(k[0], k[1]).
Otherwise, specify an iterable of integers to use as values for k.

metric:string, default: "distortion"

Select the scoring metric to evaluate the clusters. The default is the
mean distortion, defined by the sum of squared distances between each
observation and its closest centroid. Other metrics include:

distortion: mean sum of squared distances to centers

silhouette: mean ratio of intra-cluster and nearest-cluster distance

calinski_harabaz: ratio of within to between cluster dispersion

timings:bool, default: True

Display the fitting time per k to evaluate the amount of time required
to train the clustering model.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Notes

If you get a visualizer that doesnât have an elbow or inflection point,
then this method may not be working. The elbow method does not work well
if the data is not very clustered; in this case, you might see a smooth
curve and the value of k is unclear. Other scoring methods, such as BIC or
SSE, also can be used to explore if clustering is a correct choice.

The Silhouette Coefficient is used when the ground-truth about the dataset is unknown, instead measuring the density of the clusters produced by the model. The score is computed by averaging the silhouette coefficient for each sample, which is the difference between the average intra-cluster distance and the mean nearest-cluster distance for the sample, normalized by the maximum value. This produces a score between 1 and -1, where 1 indicates highly dense clusters and -1 indicates completely incorrect clustering.
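Concretely, for a sample with mean intra-cluster distance \(a\) and mean nearest-cluster distance \(b\), the silhouette coefficient is \(s = \frac{b - a}{\max(a, b)}\), and the overall score is the mean of \(s\) across all samples.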

The Silhouette Visualizer displays the silhouette coefficient for each sample on a per-cluster basis, visualizing which clusters are dense and which are not. This is particularly useful for determining cluster imbalance, or for selecting a value for \(K\) by comparing multiple visualizers.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import SilhouetteVisualizer

# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

# Instantiate the clustering model and visualizer
model = MiniBatchKMeans(6)
visualizer = SilhouetteVisualizer(model)

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

The Silhouette Visualizer displays the silhouette coefficient for each
sample on a per-cluster basis, visually evaluating the density and
separation between clusters. The score is calculated by averaging the
silhouette coefficient for each sample, computed as the difference
between the average intra-cluster distance and the mean nearest-cluster
distance for each sample, normalized by the maximum value. This produces a
score between -1 and +1, where scores near +1 indicate high separation
and scores near -1 indicate that the samples may have been assigned to
the wrong cluster.

In SilhouetteVisualizer plots, clusters with higher scores have wider
silhouettes, but clusters that are less cohesive will fall short of the
average score across all clusters, which is plotted as a vertical dotted
red line.

This is particularly useful for determining cluster imbalance, or for
selecting a value for K by comparing multiple visualizers.

Parameters:

model:a Scikit-Learn clusterer

Should be an instance of a centroidal clustering algorithm (KMeans
or MiniBatchKMeans).

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved, e.g. the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default, they are sized by membership, e.g. the number of instances that belong to each center. This gives a sense of the relative importance of clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import InterclusterDistance

# Generate synthetic dataset with 12 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=12, random_state=42)

# Instantiate the clustering model and visualizer
model = KMeans(6)
visualizer = InterclusterDistance(model)

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

Intercluster distance maps display an embedding of the cluster centers in
2 dimensions with the distance to other centers preserved, e.g. the closer
two centers are in the visualization, the closer they are in the original
feature space. The clusters are sized according to a scoring metric. By
default, they are sized by membership, e.g. the number of instances that
belong to each center. This gives a sense of the relative importance of
clusters. Note, however, that two clusters overlapping in the 2D space
does not imply that they overlap in the original feature space.

Parameters:

model:a Scikit-Learn clusterer

Should be an instance of a centroidal clustering algorithm (or a
hierarchical algorithm with a specified number of clusters). Also
accepts some other models like LDA for text clustering.
If it is not a clusterer, an exception is raised.

ax:matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in the current axes
will be used (or generated if required).

min_size:int, default: 400

The size, in points, of the smallest cluster drawn on the graph.
Cluster sizes will be scaled between the min and max sizes.

max_size:int, default: 25000

The size, in points, of the largest cluster drawn on the graph.
Cluster sizes will be scaled between the min and max sizes.

embedding:default: âmdsâ

The algorithm used to embed the cluster centers in 2 dimensional space
so that the distance between clusters is represented equivalently to
their relationship in feature space.
Embedding algorithm options include:

mds: multidimensional scaling

tsne: stochastic neighbor embedding

scoring:default: âmembershipâ

The scoring method used to determine the size of the clusters drawn on
the graph so that the relative importance of clusters can be viewed.
Scoring method options include:

membership: number of instances belonging to each cluster

legend:bool, default: True

Whether or not to draw the size legend onto the graph, omit the legend
to more easily see clusters that overlap.

legend_loc:str, default: âlower leftâ

The location of the legend on the graph, used to move the legend out
of the way of clusters into open space. The same legend location
options for matplotlib are used here.

kwargs:dict

Keyword arguments passed to the base class and may influence the
feature visualization properties.

Notes

Currently the only two embeddings supported are MDS and TSNE. Soon to
follow will be PCoA and a customized version of PCoA for LDA. The only
supported scoring metric is membership, but in the future, silhouette
scores and cluster diameter will be added.

In terms of algorithm support, right now any clustering algorithm that has
a learned cluster_centers_ and labels_ attribute will work with
the visualizer. In the future, we will update this to work with hierarchical
clusterers that have n_components and LDA.

Searches for or creates cluster centers for the specified clustering
algorithm. This algorithm ensures that the centers are
appropriately drawn and scaled so that distances between clusters
are maintained.

Finalize the visualization to create an âorigin gridâ feel instead of
the default matplotlib feel. Set the title, remove spines, and label
the grid with components. This function also adds a legend from the
sizes if required.

Returns the legend axes, creating it only on demand by creating a 2"
by 2" inset axes that has no grid, ticks, spines or face frame (e.g.
it is mostly invisible). The legend can then be drawn on this axes.

Yellowbrick visualizers are intended to steer the model selection process. Generally, model selection is a search problem defined as follows: given N instances described by numeric properties and (optionally) a target for estimation, find a model described by a triple composed of features, an algorithm and hyperparameters that best fits the data. For most purposes the "best" triple refers to the triple that receives the best cross-validated score for the model type.

The yellowbrick.model_selection package provides visualizers for inspecting the performance of cross validation and hyper parameter tuning. Many visualizers wrap functionality found in sklearn.model_selection and others build upon it for performing multi-model comparisons.

The currently implemented model selection visualizers are as follows:

Validation Curve: visualizes how the adjustment of a hyperparameter influences training and test scores to tune the bias/variance trade-off.

Learning Curve: shows how the size of training data influences the model to diagnose if a model suffers more from variance error vs. bias error.

Model selection makes heavy use of cross validation to measure the performance of an estimator. Cross validation splits a dataset into a training data set and a test data set; the model is fit on the training data and evaluated on the test data. This helps avoid a common pitfall, overfitting, where the model simply memorizes the training data and does not generalize well to new or unknown input.

Model validation is used to determine how effective an estimator is on data that it has been trained on as well as how generalizable it is to new input. To measure a modelâs performance we first split the dataset into training and test splits, fitting the model on the training data and scoring it on the reserved test data.

In order to maximize the score, the hyperparameters that best allow the model to operate in the specified feature space must be selected. Most models have multiple hyperparameters and the best way to choose a combination of those parameters is with a grid search. However, it is sometimes useful to plot the influence of a single hyperparameter on the training and test data to determine if the estimator is underfitting or overfitting for some hyperparameter values.

In our first example, weâll explore using the ValidationCurve visualizer with a regression dataset and in the second, a classification dataset. Note that any estimator that implements fit() and predict() and has an appropriate scoring mechanism can be used with this visualizer.

import numpy as np

from yellowbrick.datasets import load_energy
from yellowbrick.model_selection import ValidationCurve
from sklearn.tree import DecisionTreeRegressor

# Load a regression dataset
X, y = load_energy()

viz = ValidationCurve(
    DecisionTreeRegressor(), param_name="max_depth",
    param_range=np.arange(1, 11), cv=10, scoring="r2"
)

# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()

After loading and wrangling the data, we initialize the ValidationCurve with a DecisionTreeRegressor. Decision trees become more overfit the deeper they are because at each level of the tree the partitions are dealing with a smaller subset of data. One way to deal with this overfitting process is to limit the depth of the tree. The validation curve explores the relationship of the "max_depth" parameter to the R2 score with 10 shuffle split cross-validation. The param_range argument specifies the values of max_depth, here from 1 to 10 inclusive.

We can see in the resulting visualization that a depth limit of less than 5 levels severely underfits the model on this data set because the training score and testing score climb together in this parameter range, and because of the high variability of cross validation on the test scores. After a depth of 7, the training and test scores diverge; this is because deeper trees are beginning to overfit the training data, providing no generalizability to the model. However, because the cross validation score does not necessarily decrease, the model is not suffering from high error due to variance.

In the next visualizer, we will see an example that more dramatically visualizes the bias/variance tradeoff.

After loading data and one-hot encoding it using the Pandas get_dummies function, we create a stratified k-folds cross-validation strategy. The hyperparameter of interest is the gamma of a support vector classifier, the coefficient of the RBF kernel. Gamma controls how much influence a single example has; the larger gamma is, the tighter the support vector is around single points (overfitting the model).
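A sketch of that setup, assuming the game dataset is available through the same load_data helper used in the other examples (the gamma range here is an illustrative choice):

import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import ValidationCurve

# Load and one-hot encode the game dataset
data = load_data('game')
target = "outcome"
features = [col for col in data.columns if col != target]
X = pd.get_dummies(data[features])
y = data[target]

# Stratified k-folds cross-validation over a log-spaced gamma range
cv = StratifiedKFold(12)
param_range = np.logspace(-6, -1, 12)

viz = ValidationCurve(
    SVC(), param_name="gamma", param_range=param_range,
    logx=True, cv=cv, scoring="f1_weighted", n_jobs=8,
)
viz.fit(X, y)
viz.poof()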

In this visualization we see a definite inflection point around gamma=0.1. At this point the training score climbs rapidly as the SVC memorizes the data, while the cross-validation score begins to decrease as the model cannot generalize to unseen data.

Warning

Note that running this and the next example may take a long time. Even with parallelism using n_jobs=8, it can take several hours to go through all the combinations. Reducing the parameter range and minimizing the amount of cross-validation can speed up the validation curve visualization.

Validation curves can be performance intensive since they are training n_params*n_splits models and scoring them. It is critically important to ensure that the specified hyperparameter range is correct, as we will see in the next example.

from sklearn.neighbors import KNeighborsClassifier

cv = StratifiedKFold(4)
param_range = np.arange(3, 20, 2)

oz = ValidationCurve(
    KNeighborsClassifier(), param_name="n_neighbors",
    param_range=param_range, cv=cv, scoring="f1_weighted", n_jobs=4,
)

# Using the same game dataset as in the SVC example
oz.fit(X, y)
oz.poof()

The k nearest neighbors (kNN) model is commonly used when similarity is important to the interpretation of the model. Choosing k is difficult: the higher k is, the more data is included in a classification, creating more complex decision topologies, whereas the lower k is, the simpler the model and the less it may generalize. Using a validation curve seems like an excellent strategy for choosing k, and often it is. However, in the example above, all we can see is a decreasing variability in the cross-validated scores.

This validation curve poses two possibilities: first, that we do not have the correct param_range to find the best k and need to expand our search to larger values. The second is that other hyperparameters (such as uniform or distance based weighting, or even the distance metric) may have more influence on the default model than k by itself does. Although validation curves can give us some intuition about the performance of a model to a single hyperparameter, grid search is required to understand the performance of a model with respect to multiple hyperparameters.

See also

This visualizer is based on the validation curve described in the scikit-learn documentation: Validation Curves. The visualizer wraps the validation_curve function and most of the arguments are passed directly to it.

Visualizes the validation curve for both test and training data for a
range of values for a single hyperparameter of the model. Adjusting the
value of a hyperparameter adjusts the complexity of a model. Less complex
models suffer from increased error due to bias, while more complex models
suffer from increased error due to variance. By inspecting the training
and cross-validated test score error, it is possible to estimate a good
value for a hyperparameter that balances the bias/variance trade-off.

The visualizer evaluates cross-validated training and test scores for the
different hyperparameters supplied. The curve is plotted so that the
x-axis is the value of the hyperparameter and the y-axis is the model
score. This is similar to a grid search with a single hyperparameter.

The cross-validation generator splits the dataset k times, and scores are
averaged over all k runs for the training and test subsets. The curve
plots the mean score, and the filled in area suggests the variability of
cross-validation by plotting one standard deviation above and below the
mean for each split.

Parameters:

model:a scikit-learn estimator

An object that implements fit and predict, can be a
classifier, regressor, or clusterer so long as there is also a valid
associated scoring metric.

Note that the object is cloned for each validation.

param_name:string

Name of the parameter that will be varied.

param_range:array-like, shape (n_values,)

The values of the parameter that will be evaluated.

ax:matplotlib.Axes object, optional

The axes object to plot the figure on.

logx:boolean, optional

If True, plots the x-axis with a logarithmic scale.

groups:array-like, with shape (n_samples,)

Optional group labels for the samples used while splitting the dataset
into train/test sets.

A learning curve shows the relationship of the training score vs the cross validated test score for an estimator with a varying number of training samples. This visualization is typically used to show two things:

How much the estimator benefits from more data (e.g. do we have "enough data" or will the estimator get better if used in an online fashion).

If the estimator is more sensitive to error due to variance vs. error due to bias.

Consider the following learning curves (generated with Yellowbrick, but from Plotting Learning Curves in the scikit-learn documentation):

If the training and cross validation scores converge together as more data is added (shown in the left figure), then the model will probably not benefit from more data. If the training score is much greater than the validation score then the model probably requires more training examples in order to generalize more effectively.

The curves are plotted with the mean scores, however variability during cross-validation is shown with the shaded areas that represent a standard deviation above and below the mean for all cross-validations. If the model suffers from error due to bias, then there will likely be more variability around the training score curve. If the model suffers from error due to variance, then there will be more variability around the cross validated score.

Note

Learning curves can be generated for all estimators that have fit() and predict() methods as well as a single scoring metric. This includes classifiers, regressors, and clustering as we will see in the following examples.

In the following example we show how to visualize the learning curve of a classification model. After loading a DataFrame and performing categorical encoding, we create a StratifiedKFold cross-validation strategy to ensure all of our classes in each split are represented with the same proportion. We then fit the visualizer using the f1_weighted scoring metric as opposed to the default metric, accuracy, to get a better sense of the relationship of precision and recall in our classifier.

import numpy as np
import pandas as pd

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import LearningCurve

# Load a classification data set
data = load_data('game')

# Specify the features of interest and the target
target = "outcome"
features = [col for col in data.columns if col != target]

# Encode the categorical data with one-hot encoding
X = pd.get_dummies(data[features])
y = data[target]

# Create the learning curve visualizer
cv = StratifiedKFold(12)
sizes = np.linspace(0.3, 1.0, 10)

viz = LearningCurve(
    MultinomialNB(), cv=cv, train_sizes=sizes,
    scoring='f1_weighted', n_jobs=4
)

# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()

This learning curve shows high test variability and a low score up to around 30,000 instances, however after this level the model begins to converge on an F1 score of around 0.6. We can see that the training and test scores have not yet converged, so potentially this model would benefit from more training data. Finally, this model suffers primarily from error due to variance (the CV scores for the test data are more variable than for training data) so it is possible that the model is overfitting.

Building a learning curve for a regression is straightforward and very similar. In the below example, after loading our data and selecting our target, we explore the learning curve score according to the coefficient of determination or R2 score.

from sklearn.linear_model import RidgeCV

# Load a regression dataset
data = load_data('energy')

# Specify features of interest and the target
targets = ["heating load", "cooling load"]
features = [col for col in data.columns if col not in targets]

# Extract the instances and target
X = data[features]
y = data[targets[0]]

# Create the learning curve visualizer, fit and poof
viz = LearningCurve(RidgeCV(), train_sizes=sizes, scoring='r2')
viz.fit(X, y)
viz.poof()

This learning curve shows a very high variability and much lower score until about 350 instances. It is clear that this model could benefit from more data because it is converging at a very high score. Potentially, with more data and a larger alpha for regularization, this model would become far less variable in the test data.

Learning curves also work for clustering models and can use metrics that specify the shape or organization of clusters such as silhouette scores or density scores. If the membership is known in advance, then rand scores can be used to compare clustering performance as shown below:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Create a dataset of blobs
X, y = make_blobs(n_samples=1000, centers=5)

viz = LearningCurve(KMeans(), train_sizes=sizes, scoring="adjusted_rand_score")
viz.fit(X, y)
viz.poof()

Unfortunately, with random data these curves are highly variable, but they serve to point out some clustering-specific items. First, note that the y-axis is very narrow; roughly speaking, these curves have converged and the clustering algorithm is actually performing very well. Second, for clustering, convergence for data points is not necessarily a bad thing; in fact we want to ensure that as more data is added, the training and cross-validation scores do not diverge.

See also

This visualizer is based on the learning curve described in the scikit-learn documentation: Learning Curves. The visualizer wraps the learning_curve function and most of the arguments are passed directly to it.

Visualizes the learning curve for both test and training data for
different training set sizes. These curves can act as a proxy to
demonstrate the implied learning rate with experience (e.g. how much data
is required to make an adequate model). They also demonstrate if the model
is more sensitive to error due to bias vs. error due to variance and can
be used to quickly check if a model is overfitting.

The visualizer evaluates cross-validated training and test scores for
different training set sizes. These curves are plotted so that the x-axis
is the training set size and the y-axis is the score.

The cross-validation generator splits the whole dataset k times, scores
are averaged over all k runs for the training subset. The curve plots the
mean score for the k splits, and the filled in area suggests the
variability of the cross-validation by plotting one standard deviation
above and below the mean for each split.

Parameters:

model:a scikit-learn estimator

An object that implements fit and predict, can be a
classifier, regressor, or clusterer so long as there is also a valid
associated scoring metric.

Note that the object is cloned for each validation.

ax:matplotlib.Axes object, optional

The axes object to plot the figure on.

groups:array-like, with shape (n_samples,)

Optional group labels for the samples used while splitting the dataset
into train/test sets.

train_sizes:array-like, shape (n_ticks,), default: np.linspace(0.1, 1.0, 5)

Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the dtype is float, it is regarded as
a fraction of the maximum size of the training set, otherwise it is
interpreted as absolute sizes of the training sets.

random_state:int, RandomState instance or None, optional

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random. Used when shuffle is True.

kwargs:dict

Keyword arguments that are passed to the base class and may influence
the visualization as defined in other Visualizers.

Notes

This visualizer is essentially a wrapper for the
sklearn.model_selection.learning_curve utility, discussed in the
validation curves documentation.

See also

The documentation for the
learning_curve
function, which this visualizer wraps.

Generally we determine whether a given model is optimal by looking at its F1, precision, recall, and accuracy (for classification), or its coefficient of determination (R2) and error (for regression). However, real world data is often distributed somewhat unevenly, meaning that the fitted model is likely to perform better on some sections of the data than on others. Yellowbrick's CVScores visualizer enables us to visually explore these variations in performance using different cross validation strategies.

Cross-validation starts by shuffling the data (to prevent any unintentional ordering errors) and splitting it into k folds. Then k models are fit on \(\frac{k-1} {k}\) of the data (called the training split) and evaluated on \(\frac {1} {k}\) of the data (called the test split). The results from each evaluation are averaged together for a final score, then the final model is fit on the entire dataset for operationalization.

In Yellowbrick, the CVScores visualizer displays cross-validated scores as a bar chart (one bar for each fold) with the average score across all folds plotted as a horizontal dotted line.

In the following example we show how to visualize cross-validated scores for a classification model. After loading a DataFrame, we create a StratifiedKFold cross-validation strategy to ensure all of our classes in each split are represented with the same proportion. We then fit the CVScores visualizer using the f1_weighted scoring metric as opposed to the default metric, accuracy, to get a better sense of the relationship of precision and recall in our classifier across all of our folds.

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import CVScores

# Load the classification data set
data = load_data("occupancy")

# Specify the features of interest
features = ["temperature", "relative humidity", "light", "C02", "humidity"]

# Extract the instances and target
X = data[features]
y = data.occupancy

# Create a new figure and axes
_, ax = plt.subplots()

# Create a cross-validation strategy
cv = StratifiedKFold(12)

# Create the cv score visualizer
oz = CVScores(MultinomialNB(), ax=ax, cv=cv, scoring='f1_weighted')

oz.fit(X, y)
oz.poof()

Our resulting visualization shows that while our average cross-validation score is quite high, there are some splits for which our fitted MultinomialNB classifier performs significantly less well.

In this next example we show how to visualize cross-validated scores for a regression model. After loading our energy data into a DataFrame, we instantiate a simple KFold cross-validation strategy. We then fit the CVScores visualizer using the r2 scoring metric, to get a sense of the coefficient of determination for our regressor across all of our folds.

from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Load the regression data set
data = load_data("energy")

# Specify the features of interest and the target
targets = ["heating load", "cooling load"]
features = [col for col in data.columns if col not in targets]

# Extract the instances and target
X = data[features]
y = data[targets[1]]

# Create a new figure and axes
_, ax = plt.subplots()

cv = KFold(12)
oz = CVScores(Ridge(), ax=ax, cv=cv, scoring='r2')

oz.fit(X, y)
oz.poof()

As with our classification CVScores visualization, our regression visualization suggests that our Ridge regressor performs very well (e.g. produces a high coefficient of determination) across nearly every fold, resulting in another fairly high overall R2 score.

Yellowbrick provides the yellowbrick.text module for text-specific visualizers. The TextVisualizer class specifically deals with datasets that are corpora and not simple numeric arrays or DataFrames, providing utilities for analyzing word dispersion and distribution, showing document similarity, or simply wrapping some of the other standard visualizers with text-specific display properties.

As in the previous sections, Yellowbrick has provided a sample dataset to run the following cells. In particular, we are going to use a text corpus wrangled from the Baleen RSS Corpus to present the following examples. If you haven't already downloaded the data, you can do so by running:

$ python -m yellowbrick.download

Note that this will create a directory called data in your current working directory that contains subdirectories with the provided datasets.

Note

If youâve already followed the instructions from downloading example datasets, you donât have to repeat these steps here. Simply check to ensure there is a directory called hobbies in your data directory.

The following code snippet creates a utility that will load the corpus from disk into a scikit-learn Bunch object. This method creates a corpus that is exactly the same as the one found in the "working with text data" example on the scikit-learn website, hopefully making the examples easier to use.

import os

from sklearn.datasets.base import Bunch


def load_corpus(path):
    """
    Loads and wrangles the passed in text corpus by path.
    """
    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        raise ValueError((
            "'{}' dataset has not been downloaded, "
            "use the yellowbrick.download module to fetch datasets"
        ).format(path))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files = []   # holds the file names relative to the root
    data = []    # holds the text read from the file
    target = []  # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())

    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

This is a fairly long bit of code, so letâs walk through it step by step. The data in the corpus directory is stored as follows:

Each of the documents in the corpus is stored in a text file labeled with its hash signature in a directory that specifies its label or category. Therefore the first step, after checking to make sure the specified path exists, is to list all the directories in the hobbies directory; this gives us each of our categories, which we will store later in the bunch.

The second step is to create placeholders for holding filenames, text data, and labels. We can then loop through the list of categories, list the files in each category directory, add those files to the files list, add the category name to the target list, then open and read the file to add it to data.

To load the corpus into memory, we can simply use the following snippet:
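# Load the hobbies corpus from the data directory created by the downloader
corpus = load_corpus("data/hobbies")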

A method for visualizing the frequency of tokens within and across corpora is frequency distribution. A frequency distribution tells us the frequency of each vocabulary item in the text. In general, it could count any kind of observable event. It is a distribution because it tells us how the total number of word tokens in the text are distributed across the vocabulary items.

Note that the FreqDistVisualizer does not perform any normalization or vectorization; it expects text that has already been count vectorized.

We first instantiate a FreqDistVisualizer object, and then call fit() on that object with the count vectorized documents and the features (i.e. the words from the corpus), which computes the frequency distribution. The visualizer then plots a bar chart of the top 50 most frequent terms in the corpus, with the terms along the x-axis and frequency counts along the y-axis. As with other Yellowbrick visualizers, when the user invokes poof(), the finalized visualization is shown.
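A minimal sketch of that workflow, assuming the corpus loaded above and scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer

# Count vectorize the documents in the corpus
vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(corpus.data)
features = vectorizer.get_feature_names()

visualizer = FreqDistVisualizer(features=features)
visualizer.fit(docs)
visualizer.poof()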

It is also interesting to explore the differences in tokens across a corpus. The hobbies corpus that comes with Yellowbrick has already been categorized (try corpus['categories']), so let's visually compare the differences in the frequency distributions for two of the categories: "cooking" and "gaming".
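One way to do this, sketched below, is to filter the documents by category before vectorizing; repeat with the other category to compare the two plots:

from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer

# Restrict the corpus to a single category, e.g. "cooking"
hobby = "cooking"
docs_in_cat = [
    doc for doc, cat in zip(corpus.data, corpus.target) if cat == hobby
]

vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(docs_in_cat)

visualizer = FreqDistVisualizer(features=vectorizer.get_feature_names())
visualizer.fit(docs)
visualizer.poof()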

A frequency distribution tells us the frequency of each vocabulary
item in the text. In general, it could count any kind of observable
event. It is a distribution because it tells us how the total
number of word tokens in the text are distributed across the
vocabulary items.

Parameters:

features:list, default: None

The list of feature names from the vectorizer, ordered by index. E.g.
a lexicon that specifies the unique vocabulary of the corpus. This
can be typically fetched using the get_feature_names() method of
the transformer in Scikit-Learn.

The fit method is the primary drawing input for the frequency
distribution visualization. It requires vectorized lists of
documents and a list of features, which are the actual words
from the original corpus (needed to label the x-axis ticks).

Parameters:

X:ndarray or DataFrame of shape n x m

A matrix of n instances with m features representing the corpus
of frequency vectorized documents.

One very popular method for visualizing document similarity is to use t-distributed stochastic neighbor embedding, t-SNE. Scikit-learn implements this decomposition method as the sklearn.manifold.TSNE transformer. By decomposing high-dimensional document vectors into 2 dimensions using probability distributions from both the original dimensionality and the decomposed dimensionality, t-SNE is able to effectively cluster similar documents. By decomposing to 2 or 3 dimensions, the documents can be visualized with a scatter plot.

Unfortunately, TSNE is very expensive, so typically a simpler decomposition method such as SVD or PCA is applied ahead of time. The TSNEVisualizer creates an inner transformer pipeline that applies such a decomposition first (SVD with 50 components by default), then performs the t-SNE embedding. The visualizer then plots the scatter plot, coloring by cluster or by class, or neither if a structural analysis is required.
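A minimal sketch, assuming the hobbies corpus loaded earlier and a TF-IDF vectorization:

from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

# TF-IDF vectorize the corpus and project it, colored by category
tfidf = TfidfVectorizer()
docs = tfidf.fit_transform(corpus.data)
labels = corpus.target

tsne = TSNEVisualizer()
tsne.fit(docs, labels)
tsne.poof()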

This means we donât have to use class labels at all. Instead we can use cluster membership from K-Means to label each document. This will allow us to look for clusters of related text by their contents:

Display a projection of a vectorized corpus in two dimensions using TSNE,
a nonlinear dimensionality reduction method that is particularly well
suited to embedding in two or three dimensions for visualization as a
scatter plot. TSNE is widely used in text analysis to show clusters or
groups of documents or utterances and their relative proximities.

TSNE will return a scatter plot of the vectorized corpus, such that each
point represents a document or utterance. The distance between two points
in the visual space is embedded using the probability distribution of
pairwise similarities in the higher dimensionality; thus TSNE shows
clusters of similar documents and the relationships between groups of
documents as a scatter plot.

TSNE can be used with either clustering or classification; by specifying
the classes argument, points will be colored based on their similar
traits. For example, by passing cluster.labels_ as y in fit(), all
points in the same cluster will be grouped together. This extends the
neighbor embedding with more information about similarity, and can allow
better interpretation of both clusters and classes.

random_state:int, RandomState instance or None, optional

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random. The random state is applied to the preliminary
decomposition as well as tSNE.

alpha:float, default: 0.7

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered points more visible.

Called from the fit method, this method draws the TSNE scatter plot,
from a set of decomposed points in 2 dimensions. This method also
accepts a third dimension, target, which is used to specify the colors
of each of the points. If the target is not specified, then the points
are plotted as a single cloud to show similar documents.

The fit method is the primary drawing input for the TSNE projection
since the visualization requires both X and an optional y value. The
fit method expects an array of numeric vectors, so text documents must
be vectorized before passing them to this method.

Parameters:

X:ndarray or DataFrame of shape n x m

A matrix of n instances with m features representing the corpus of
vectorized documents to visualize with tsne.

y:ndarray or Series of length n

An optional array or series of target or class values for
instances. If this is specified, then the points will be colored
according to their class. Often cluster labels are passed in to
color the documents in cluster space, so this method is used both
for classification and clustering methods.

Creates an internal transformer pipeline to project the data set into
2D space using TSNE, applying a pre-decomposition technique ahead of
embedding if necessary. This method will reset the transformer on the
class, and can be used to explore different decompositions.

Parameters:

decompose:string or None, default: 'svd'

A preliminary decomposition is often used prior to TSNE to make
the projection faster. Specify "svd" for sparse data or "pca"
for dense data. If decompose is None, the original data set will
be used.

decompose_by:int, default: 50

Specify the number of components for preliminary decomposition, by
default this is 50; the more components, the slower TSNE will be.

UMAP is a nonlinear
dimensionality reduction method that is well suited to embedding in two
or three dimensions for visualization as a scatter plot. UMAP is a
relatively new technique but very effective for visualizing clusters or
groups of data points and their relative proximities. It does a good job
of learning the local structure within your data but also attempts to
preserve the relationships between your groups as can be seen in its
exploration of
MNIST.
It is fast, scalable, and can be applied directly to sparse matrices,
eliminating the need to run a truncatedSVD as a pre-processing step.
Additionally, it supports a wide variety of distance measures allowing
for easy exploration of your data. For a more detailed explanation of
the algorithm, see the UMAP paper.

In this example we represent documents via a term frequency inverse
document frequency (TF-IDF) vector, then use UMAP to find a low
dimensional representation of these documents. The Yellowbrick
visualizer then plots the scatter plot, coloring by cluster or by
class, or neither if a structural analysis is required.
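A sketch of the vectorization step, again assuming the hobbies corpus from the earlier examples; the docs and labels names defined here are used in the snippets that follow:

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorize the corpus; UMAP can consume the sparse matrix directly
tfidf = TfidfVectorizer()
docs = tfidf.fit_transform(corpus.data)
labels = corpus.target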

Now that the corpus is vectorized we can visualize it, showing the
distribution of classes.

from yellowbrick.text import UMAPVisualizer

umap = UMAPVisualizer()
umap.fit(docs, labels)
umap.poof()

Alternatively, if we believed that cosine distance was a more
appropriate metric on our feature space, we could specify that via a
metric parameter passed through to the underlying UMAP function by
the UMAPVisualizer.

umap = UMAPVisualizer(metric='cosine')
umap.fit(docs, labels)
umap.poof()

If we omit the target during fit, we can visualize the whole dataset to
see if any meaningful patterns are observed.

This means we don't have to use class labels at all. Instead we can use
cluster membership from K-Means to label each document. This will allow
us to look for clusters of related text by their contents:
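
from sklearn.cluster import KMeans

from yellowbrick.text import UMAPVisualizer

# A sketch: assumes docs is the vectorized corpus from the examples
# above; the cluster count of 10 is an illustrative choice
clusters = KMeans(n_clusters=10)
clusters.fit(docs)

# Color each document by its k-means cluster membership instead of class
umap = UMAPVisualizer()
umap.fit(docs, ["c{}".format(c) for c in clusters.labels_])
umap.poof()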

On one hand, these clusters aren't particularly well concentrated in the
two dimensional embedding of UMAP; on the other hand, the true labels
for this data are. That is a good indication that your data does indeed
live on a manifold in your TF-IDF space and that structure is being
ignored by the k-means algorithm. Clustering can be quite tricky in high
dimensional spaces and it is often a good idea to reduce your dimension
before running clustering algorithms on your data.

UMAP, it should be noted, is a manifold learning technique and as such
does not seek to preserve the distances between your data points in the
high dimensional space, but instead to learn the distances along an
underlying manifold on which your data points lie. As such, one shouldn't
be too surprised when it disagrees with a non-manifold based clustering
technique. A detailed explanation of this phenomenon can be found in the
UMAP documentation.

Display a projection of a vectorized corpus in two dimensions using UMAP,
a nonlinear dimensionality reduction method that is particularly well
suited to embedding in two or three dimensions for visualization as a
scatter plot. UMAP is a relatively new technique but is often used to visualize clusters or
groups of data points and their relative proximities. It typically is fast, scalable,
and can be applied directly to sparse matrices eliminating the need to run a truncatedSVD
as a pre-processing step.

UMAP will return a scatter plot of the vectorized corpus, such that each
point represents a document or utterance. By default, the distance between two points
in the visual space is embedded using the cosine distance between the high dimensional feature
vectors. Thus, UMAP shows the clusters of similar documents and the relationships between
groups of documents as a scatter plot.

UMAP can be used with either clustering or classification; by specifying
the classes argument, points will be colored based on their similar
traits. For example, by passing cluster.labels_ as y in fit(), all
points in the same cluster will be grouped together. This extends the
neighbor embedding with more information about similarity, and can allow
better interpretation of both clusters and classes.

The current default distance metric for UMAP is Euclidean. Hellinger
distance would be a more appropriate metric to use with CountVectorizer
data and is slated for a future release of UMAP. In the meantime, cosine
distance is likely a better text default than Euclidean and can be set
using the keyword argument metric='cosine'.

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random. The random state is applied to the preliminary
decomposition as well as UMAP.

alpha:float, default: 0.7

Specify a transparency where 1 is completely opaque and 0 is completely
transparent. This property makes densely clustered points more visible.

Called from the fit method, this method draws the UMAP scatter plot from
a set of decomposed points in 2 dimensions. This method also
accepts a third dimension, target, which is used to specify the colors
of each of the points. If the target is not specified, the points
are plotted as a single cloud to show similar documents.

The fit method is the primary drawing input for the UMAP projection
since the visualization requires both X and an optional y value. The
fit method expects an array of numeric vectors, so text documents must
be vectorized before passing them to this method.

Parameters:

X:ndarray or DataFrame of shape n x m

A matrix of n instances with m features representing the corpus of
vectorized documents to visualize with UMAP.

y:ndarray or Series of length n

An optional array or series of target or class values for
instances. If this is specified, then the points will be colored
according to their class. Often cluster labels are passed in to
color the documents in cluster space, so this method is used both
for classification and clustering methods.

A word's importance can be weighed by its dispersion in a corpus. Lexical dispersion is a measure of a word's homogeneity across the parts of a corpus. This plot notes the occurrences of a word and how many words from the beginning of the corpus each occurrence appears.

from yellowbrick.text import DispersionPlot
from yellowbrick.datasets import load_hobbies

# Load the text data
corpus = load_hobbies()

# Create a list of words from the corpus text
text = [doc.split() for doc in corpus.data]

# Choose words whose occurrence in the text will be plotted
target_words = ['Game', 'player', 'score', 'oil', 'Man']

# Create the visualizer and draw the plot
visualizer = DispersionPlot(target_words)
visualizer.fit(text)
visualizer.poof()

DispersionPlotVisualizer allows for visualization of the lexical dispersion
of words in a corpus. Lexical dispersion is a measure of a word's
homogeneity across the parts of a corpus. This plot notes the occurrences
of a word and how many words from the beginning it appears.

Parameters:

target_words:list

A list of target words whose dispersion across a corpus passed at fit
will be visualized.

ax:matplotlib axes, default: None

The axes to plot the figure on.

labels:list of strings

The names of the classes in the target, used to create a legend.
Labels must match names of classes in sorted order.

colors:list or tuple of colors

Specify the colors for each individual class

colormap:string or matplotlib cmap

Qualitative colormap for discrete target

ignore_case:boolean, default: False

Specify whether input will be case-sensitive.

annotate_docs:boolean, default: False

Specify whether document boundaries will be displayed. Vertical lines
are positioned at the end of each document.

The yellowbrick.contrib package contains a variety of extra tools and experimental visualizers that are outside of core support or are still in development. Here is a listing of the contrib modules currently available:

DecisionBoundariesVisualizer is a bivariate data visualization algorithm
that plots the decision boundaries of each class.

Parameters:

model:the Scikit-Learn estimator

Should be an instance of a classifier; otherwise __init__ will
raise an error.

x:string, default: None

The feature name that corresponds to a column name or index position
in the matrix that will be plotted against the x-axis

y:string, default: None

The feature name that corresponds to a column name or index position
in the matrix that will be plotted against the y-axis

classes:a list of class names for the legend, default: None

If classes is None and a y value is passed to fit then the classes
are selected from the target vector.

features:list of strings, default: None

The names of the features or columns

show_scatter:boolean, default: True

If boolean is True, then a scatter plot with points will be drawn
on top of the decision boundary graph

step_size:float percentage, default: 0.0025

Determines the step size for creating the numpy meshgrid that will
later become the foundation of the decision boundary graph. The
default value of 0.0025 means that the step size for constructing
the meshgrid will be 0.25% of the difference between the max and min
of x and y for each feature.

Called from the fit method, this method creates a decision boundary
plot, and if self.scatter is True, it overlays a scatter plot that
draws each instance as a class or target colored point, whose location
is determined by the feature data set.

Sometimes for feature analysis you simply need a scatter plot to determine the distribution of data. Machine learning operates on high dimensional data, so the number of dimensions must be reduced to two for plotting. As a result these visualizations are typically used as the base for larger visualizers; however, you can also use them to quickly plot data during ML analysis.

A scatter visualizer simply plots two features against each other and colors the points according to the target. This can be useful in assessing the relationship of pairs of features to an individual target.

from yellowbrick.contrib import ScatterVisualizer
from yellowbrick.datasets import load_occupancy

# Load the classification dataset
X, y = load_occupancy()

# Specify the target classes
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer
visualizer = ScatterVisualizer(x="light", y="CO2", classes=classes)

visualizer.fit(X, y)       # Fit the data to the visualizer
visualizer.transform(X)    # Transform the data
visualizer.poof()          # Draw/show/poof the data

ScatterVisualizer is a bivariate feature data visualization algorithm that
plots using the Cartesian coordinates of each point.

Parameters:

ax:a matplotlib plot, default: None

The axis to plot the figure on.

x :string, default: None

The feature name that corresponds to a column name or index position
in the matrix that will be plotted against the x-axis

y :string, default: None

The feature name that corresponds to a column name or index position
in the matrix that will be plotted against the y-axis

features :a list of two feature names to use, default: None

List of two features that correspond to the columns in the array.
The order of the two features corresponds to the X and Y axes on the
graph. More than two feature names or columns will raise an error.
If a DataFrame is passed to fit and features is None, feature names
are selected that are the columns of the DataFrame.

classes :a list of class names for the legend, default: None

If classes is None and a y value is passed to fit then the classes
are selected from the target vector.

Yellowbrick believes that visual diagnostics are more effective if visualizations are appealing. As a result, we have borrowed familiar styles from Seaborn and use the new matplotlib 2.0 styles. We hope that these out-of-the-box styles will make your visualizations publication ready, though you can also still customize your own look and feel by directly modifying the visualizations with matplotlib.

For most visualizers, Yellowbrick prioritizes color in its visualizations. There are two types of color sets that can be provided to a visualizer: a palette and a sequence. Palettes are discrete color values usually of a fixed length and are typically used for classification or clustering by showing each class, cluster, or topic. Sequences are continuous color values that do not have a fixed length but rather a range and are typically used for regression or clustering, showing all possible values in the target or distances between items in clusters.

In order to make the distinction easy, most matplotlib colors (both palettes and sequences) can be referred to by name. A complete listing can be imported as follows:
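
from yellowbrick.style.palettes import PALETTES, SEQUENCES

# A sketch: assumes the PALETTES and SEQUENCES dictionaries are exposed
# by yellowbrick.style.palettes in your installed version
print(sorted(PALETTES.keys()))
print(sorted(SEQUENCES.keys()))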

Refer to the API listing of each visualizer for specifications on how the color argument is handled. In the next two sections, we will show every possible color palette and sequence currently available in Yellowbrick.

Color palettes are discrete color lists that have a fixed length. The most common palettes are ordered as "blue", "green", "red", "maroon", "yellow", "cyan", and an optional "key". This allows you to specify these named colors by the first character, e.g. "bgrmyck" for matplotlib visualizations.

To change the global color palette, use the set_palette function as follows:

from yellowbrick.style import set_palette

set_palette('flatui')

Color palettes are most often used for classifiers to show the relationship between discrete class labels. They can also be used for clustering algorithms to show membership in discrete clusters.

A complete listing of the Yellowbrick color palettes can be visualized as follows:
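
import matplotlib.pyplot as plt

from yellowbrick.style.palettes import PALETTES, color_palette

# A sketch: assumes PALETTES and color_palette are available from
# yellowbrick.style.palettes; draws a row of swatches per named palette
fig, axes = plt.subplots(nrows=len(PALETTES), figsize=(6, len(PALETTES) * 0.4))
for ax, name in zip(axes, sorted(PALETTES)):
    for i, color in enumerate(color_palette(name)):
        ax.bar(i, 1, color=color, width=1.0)
    ax.set_ylabel(name, rotation=0, ha="right", va="center", fontsize=8)
    ax.set_xticks([])
    ax.set_yticks([])
plt.show()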

Color sequences are continuous representations of color and are usually defined as a fixed number of steps between a minimum and maximal value. Sequences must be created with a total number of bins (or length) before plotting to ensure that the values are assigned correctly. In the listing below, each sequence is shown with varying lengths to describe the range of colors in detail.

Color sequences are most often used in regressions to show the distribution in the range of target values. They can also be used in clustering and distribution analysis to show distance or histogram data.

Below is a complete listing of all the sequence names available in Yellowbrick:

Generates a list of colors based on common color arguments, for example
the name of a colormap or palette or another iterable of colors. The list
is then truncated (or multiplied) to the specific number of requested
colors.

Parameters:

n_colors:int, default: None

Specify the length of the list of returned colors, which will either
truncate or multiply the colors available. If None the length of the
colors will not be modified.

colormap:str, default: None

The name of the matplotlib color map with which to generate colors.

colors:iterable, default: None

A collection of colors to use specifically with the plot.

Returns:

colors:list

A list of colors that can be used in matplotlib plots.

Notes

This function was originally based on a similar function in the pandas
plotting library that has since been removed from the library.

Implements the variety of colors that yellowbrick allows access to by name.
This code was originally based on Seaborn's rcmod.py but has since been
cleaned up to be Yellowbrick-specific and to dereference tools we don't use.
Note that these functions alter the matplotlib rc dictionary on the fly.

Calling this function with palette=None will return the current
matplotlib color cycle.

This function can also be used in a with statement to temporarily
set the color cycle for a plot or set of plots.
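
For example, a sketch of the context manager usage (it assumes a visualizer, X, and y are already defined):

from yellowbrick.style import color_palette

# Temporarily use the 'flatui' palette for just this plot
with color_palette('flatui'):
    visualizer.fit(X, y)
    visualizer.poof()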

Parameters:

palette:None or str or sequence

Name of a palette or None to return the current palette. If a
sequence the input colors are used but possibly cycled.

Available palette names from yellowbrick.colors.palettes are:

accent

dark

paired

pastel

bold

muted

colorblind

sns_colorblind

sns_deep

sns_muted

sns_pastel

sns_bright

sns_dark

flatui

neural_paint

n_colors:None or int

Number of colors in the palette. If None, the default will depend
on how palette is specified. Named palettes default to 6 colors
which allow the use of the names "bgrmyck", though others do have more
or fewer colors; therefore reducing the size of the list can only be
done by specifying this parameter. Asking for more colors than exist
in the palette will cause it to cycle.

Returns:

list(tuple)

Returns a ColorPalette object, which behaves like a list, but can be
used as a context manager and possesses functions to convert colors.

Modifies the matplotlib rcParams in order to make yellowbrick more appealing.
This has been modified from Seaborn's rcmod.py: github.com/mwaskom/seaborn in
order to alter the matplotlib rc dictionary on the fly.

NOTE: matplotlib 2.0 styles mean we can simply convert this to a stylesheet!

Many examples utilize data from the UCI Machine Learning repository. In order to run the accompanying code, make sure to follow the instructions in Example Datasets to download and load the required data.

A guide to finding the visualizer you're looking for: generally speaking, visualizers can be data visualizers which visualize instances relative to the model space; score visualizers which visualize model performance; model selection visualizers which compare multiple model forms against each other; and application-specific visualizers. This can be a bit confusing, so we've grouped visualizers according to the type of analysis they are well suited for.

Feature analysis visualizers are where you'll find the primary implementation of data visualizers. Regression, classification, and clustering analysis visualizers can be found in their respective libraries. Finally, visualizers for text analysis are also available in Yellowbrick! Other utilities, such as styles, best fit lines, and Anscombe's visualization, can also be found in the links above.

We are looking for people to help us Alpha test the Yellowbrick project!
Helping is simple: simply create a notebook that applies the concepts in
this Getting Started guide to a small-to-medium size dataset of your
choice. Run through the examples with the dataset, and try to change
options and customize as much as possible. After you've exercised the
code with your examples, respond to our alpha testing
survey!

Please open the questionnaire to familiarize yourself with the
type of feedback we are looking to receive. We are very interested in
identifying any bugs in Yellowbrick. Please include any cells in your
Jupyter notebook that produce errors so that we may reproduce the
problem.

Select a multivariate dataset of your own. The greater the variety of datasets that we can run through Yellowbrick, the more likely weâll discover edge cases and exceptions! Please note that your dataset must be well-suited to modeling with scikit-learn. In particular, we recommend choosing a dataset whose target is suited to one of the following supervised learning tasks:

There are datasets that are well suited to both types of analysis; either way, you can use the testing methodology from this notebook for either type of task (or both). In order to find a dataset, we recommend you try the following places:

You're more than welcome to choose a dataset of your own, but we do ask that you make at least the notebook containing your testing results publicly available for us to review. If the data is also public (or you're willing to share it with the primary contributors) that will help us figure out bugs and required features much more easily!

You can follow along with our examples directory (check out examples.ipynb) or even create your own custom visualizers! The goal is that you create an end-to-end model from data loading to estimator(s) with visualizers along the way.

IMPORTANT: please make sure you record all errors that you get and any tracebacks you receive for step three!

This form allows us to aggregate multiple submissions and bugs so that we can coordinate the creation and management of issues. If you are the first to report a bug or feature request, we will make sure you're notified (we'll tag you using your GitHub username) about the created issue!

Thank you for helping us make Yellowbrick better! We'd love to see pull requests for features you think should be added to the library. We'll also be doing a user study that we would love for you to participate in. Stay tuned for more great things from Yellowbrick!

Yellowbrick is an open source project that is supported by a community who will gratefully and humbly accept any contributions you might make to the project. Large or small, any contribution makes a big difference; and if you've never contributed to an open source project before, we hope you will start with Yellowbrick!

Principally, Yellowbrick development is about the addition and creation of visualizers: objects that learn from data and create a visual representation of the data or model. Visualizers integrate with scikit-learn estimators, transformers, and pipelines for specific purposes and as a result, can be simple to build and deploy. The most common contribution is a new visualizer for a specific model or model family. We'll discuss in detail how to build visualizers later.

As you can see, there are lots of ways to get involved and we would be very happy for you to join us! The only thing we ask is that you abide by the principles of openness, respect, and consideration of others as described in our Code of Conduct.

Note

If you're unsure where to start, perhaps the best place is to drop the maintainers a note via our mailing list: http://bit.ly/yb-listserv.

Review the code with core contributors who will guide you to a high quality submission.

Merge your contribution into the Yellowbrick codebase.

Note

Please create a pull request as soon as possible, even before you've started coding. This will allow the core contributors to give you advice about where to add your code or utilities and discuss other style choices and implementation details as you go. Don't wait!

We believe that contribution is collaboration and therefore emphasize communication throughout the open source process. We rely heavily on GitHub's social coding tools to allow us to do this.

The first step is to fork the repository into your own account. This will create a copy of the codebase that you can edit and write to. Do so by clicking the "fork" button in the upper right corner of the Yellowbrick GitHub page.

Once forked, use the following steps to get your development environment set up on your computer:

Clone the repository.

After clicking the fork button, you should be redirected to the GitHub page of the repository in your user account. You can then clone a copy of the code to your local machine:
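
$ git clone https://github.com/<yourusername>/yellowbrick
$ cd yellowbrick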

Yellowbrick developers typically use virtualenv (and virtualenvwrapper), pyenv or conda envs in order to manage their Python version and dependencies. Using the virtual environment tool of your choice, create one for Yellowbrick. Hereâs how with virtualenv:

$ virtualenv venv

Install dependencies.

Yellowbrick's dependencies are in the requirements.txt document at the root of the repository. Open this file and uncomment any dependencies marked as for development only. Then install the package in editable mode:

$ pip install -e .

This will add Yellowbrick to your PYTHONPATH so that you don't need to reinstall it each time you make a change during development.

Note that there may be other dependencies required for development and testing; you can simply install them with pip. For example to install
the additional dependencies for building the documentation or to run the
test suite, use the requirements.txt files in those directories:
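
$ pip install -r tests/requirements.txt
$ pip install -r docs/requirements.txt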

The Yellowbrick repository has a develop branch that is the primary working branch for contributions. It is probably already the branch you're on, but you can make sure and switch to it as follows:

$ git fetch
$ git checkout develop

At this point you're ready to get started writing code. If you're going to take on a specific task, we'd strongly encourage you to check out the issue on Waffle and create a pull request before you start coding to better foster communication with other contributors. More on this in the next section.

The Yellowbrick repository is set up in a typical production/release/development cycle as described in "A Successful Git Branching Model." The primary working branch is the develop branch. This should be the branch that you are working on and from, since this has all the latest code. The master branch contains the latest stable version and release, which is pushed to PyPI. No one but core contributors will generally push to master.

You should work directly in your fork. In order to reduce the number of merges (and merge conflicts) we kindly request that you utilize a feature branch off of develop to work in:

$ git checkout -b feature-myfeature develop

We also recommend setting up an upstream remote so that you can easily pull the latest development changes from the main Yellowbrick repository (see configuring a remote for a fork). You can do that as follows:
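
For example, a typical setup (the remote name upstream is a convention, not a requirement):

$ git remote add upstream https://github.com/DistrictDataLabs/yellowbrick.git
$ git fetch upstream
$ git merge upstream/develop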

A pull request (PR) is a GitHub tool for initiating an exchange of code and creating a communication channel for Yellowbrick maintainers to discuss your contribution. In essence, you are requesting that the maintainers merge code from your forked repository into the develop branch of the primary Yellowbrick repository. Once completed, your code will be part of Yellowbrick!

When starting a Yellowbrick contribution, open the pull request as soon as possible. We use your PR issue page to discuss your intentions and to give guidance and direction. Every time you push a commit into your forked repository, the commit is automatically included with your pull request, so we can review as you code. The earlier you open a PR, the more easily we can incorporate your updates; we'd hate for you to do a ton of work only to discover someone else already did it or that you went in the wrong direction and need to refactor.

When you open a pull request, ensure it is from your forked repository to the develop branch of github.com/districtdatalabs/yellowbrick; we will not merge a PR into the master branch. Title your Pull Request so that it is easy to understand what you're working on at a glance. Also be sure to include a reference to the issue that you're working on so that correct references are set up.

Note

All pull requests should be into the yellowbrick/develop branch from your forked repository.

After you open a PR, you should get a message from one of the maintainers. Use that time to discuss your idea and where best to implement your work. Feel free to go back and forth as you are developing with questions in the comment thread of the PR. Once you are ready, please ensure that you explicitly ping the maintainer to do a code review. Before code review, your PR should contain the following:

Your code contribution

Tests for your contribution

Documentation for your contribution

A PR comment describing the changes you made and how to use them

A PR comment that includes an image/example of your visualizer

At this point your code will be formally reviewed by one of the contributors. We use GitHub's code review tool, starting a new code review and adding comments to specific lines of code as well as general global comments. Please respond to the comments promptly, and don't be afraid to ask for help implementing any requested changes! You may have to go back and forth a couple of times to complete the code review.

When the following is true:

Code is reviewed by at least one maintainer

Continuous Integration tests have passed

Code coverage and quality have not decreased

Code is up to date with the yellowbrick develop branch

Then we will "Squash and Merge" your contribution, combining all of your commits into a single commit and merging it into the develop branch of Yellowbrick. Congratulations! Once your contribution has been merged into master, you will be officially listed as a contributor.

In this section, we'll discuss the basics of developing visualizers. This of course is a big topic, but hopefully these simple tips and tricks will help make sense of it. First things first, check out this presentation that we put together on Yellowbrick development; it discusses the expected user workflow, our integration with scikit-learn, our plans and roadmap, etc:

Feature Visualizers are high dimensional data visualizations that are essentially transformers.

Score Visualizers wrap a scikit-learn regressor, classifier, or clusterer and visualize the behavior or performance of the model on test data.

These two basic types of visualizers map well to the two basic objects in scikit-learn:

Transformers take input data and return a new data set.

Estimators are fit to training data and can make predictions.

The scikit-learn API is object oriented, and estimators and transformers are initialized with parameters by instantiating their class. Hyperparameters can also be set using the set_params() method and retrieved with the corresponding get_params() method. All scikit-learn estimators have a fit(X, y=None) method that accepts a two dimensional data array, X, and optionally a vector y of target values. The fit() method trains the estimator, making it ready to transform data or make predictions. Transformers have an associated transform(X) method that returns a new dataset, Xprime, and models have a predict(X) method that returns a vector of predictions, yhat. Models also have a score(X, y) method that evaluates the performance of the model.
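
As a quick illustration of that interface (it assumes X and y are already loaded):

from sklearn.linear_model import LinearRegression

model = LinearRegression()   # instantiate with hyperparameters
model.fit(X, y)              # train the estimator on the data
yhat = model.predict(X)      # vector of predictions
score = model.score(X, y)    # evaluate the performance of the model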

Visualizers interact with scikit-learn objects by intersecting with them at the methods defined above. Specifically, visualizers perform actions related to fit(), transform(), predict(), and score() then call a draw() method which initializes the underlying figure associated with the visualizer. The user calls the visualizer's poof() method, which in turn calls a finalize() method on the visualizer to draw legends, titles, etc. and then poof() renders the figure. The Visualizer API is therefore:

draw(): add visual elements to the underlying axes object

finalize(): prepare the figure for rendering, adding final touches such as legends, titles, axis labels, etc.

poof(): render the figure for the user (or saves it to disk).

Creating a visualizer means defining a class that extends Visualizer or one of its subclasses, then implementing several of the methods described above. A barebones implementation is as follows:
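
import matplotlib.pyplot as plt

from yellowbrick.base import Visualizer


# A minimal sketch of a barebones visualizer following the API described
# above; the line plot in draw() is an arbitrary illustration
class MyVisualizer(Visualizer):

    def __init__(self, ax=None, **kwargs):
        super(MyVisualizer, self).__init__(ax, **kwargs)

    def fit(self, X, y=None):
        # Intersect with the scikit-learn API, then draw the data
        self.draw(X)
        return self

    def draw(self, X):
        # Add visual elements to the underlying axes object
        self.ax.plot(X)

    def finalize(self):
        # Prepare the figure for rendering with final touches
        self.set_title("My Visualizer")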

This visualizer simply draws a line graph for some input dataset X, intersecting with the scikit-learn API at the fit() method. A user would use this visualizer in the typical style:

visualizer = MyVisualizer()
visualizer.fit(X)
visualizer.poof()

Score visualizers work on the same principle but accept an additional required model argument. Score visualizers wrap the model (which can be either instantiated or uninstantiated) and then pass all attributes and methods through to the underlying model, drawing where necessary.

Yellowbrick gives easy access to several datasets that are used for the examples in the documentation and testing. These datasets are hosted in our CDN and must be downloaded for use. Typically, when a user calls one of the data loader functions, e.g. load_bikeshare(), the data is automatically downloaded if it's not already on the user's computer. However, for development and testing, or if you know you will be working without internet access, it might be easier to simply download all the data at once.

The data downloader script can be run as follows:

$ python -m yellowbrick.download

This will download the data to the fixtures directory inside of the Yellowbrick site packages. You can specify the location of the download either as an argument to the downloader script (use --help for more details) or by setting the $YELLOWBRICK_DATA environment variable. This is the preferred mechanism because this will also influence how data is loaded in Yellowbrick.

Note that developers who have downloaded data from Yellowbrick versions earlier than v1.0 may experience some problems with the older data format. If this occurs, you can clear out your data cache as follows:

$ python -m yellowbrick.download --cleanup

This will remove old datasets and download the new ones. You can also use the --no-download flag to simply clear the cache without re-downloading data. Users who are having difficulty with datasets can also use this or they can uninstall and reinstall Yellowbrick using pip.

The test package mirrors the yellowbrick package in structure and also contains several helper methods and base functionality. To add a test to your visualizer, find the corresponding file to add the test case, or create a new test file in the same place you added your code.

Visual tests are notoriously difficult to create: how do you test a visualization or figure? Moreover, testing scikit-learn models with real data can consume a lot of memory. Therefore the primary test you should create is simply to test your visualizer from end to end and make sure that no exceptions occur. To assist with this, we have two primary helpers, VisualTestCase and DatasetMixin. Create your unittest as follows:

import pytest

from tests.base import VisualTestCase
from tests.dataset import DatasetMixin


class MyVisualizerTests(VisualTestCase, DatasetMixin):

    def test_my_visualizer(self):
        """
        Test MyVisualizer on a real dataset
        """
        # Load the data from the fixture
        dataset = self.load_data('occupancy')

        # Get the data
        X = dataset[["temperature", "relative_humidity", "light", "C02", "humidity"]]
        y = dataset['occupancy'].astype(int)

        try:
            visualizer = MyVisualizer()
            visualizer.fit(X)
            visualizer.poof()
        except Exception as e:
            pytest.fail("my visualizer didn't work")

Writing an image based comparison test is only a little more difficult than the simple testcase presented above. We have adapted matplotlib's image comparison test utility into an easy to use assert method: self.assert_images_similar(visualizer)

The main consideration is that you must specify the "baseline", or expected, image in the tests/baseline_images/ folder structure.

For example, create your unittest located in tests/test_regressor/test_myvisualizer.py as follows:
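
from tests.base import VisualTestCase
from tests.dataset import DatasetMixin


# A sketch reusing the occupancy fixture from the end-to-end example
# above; the visualizer and column names are illustrative
class MyVisualizerTests(VisualTestCase, DatasetMixin):

    def test_my_visualizer_output(self):
        """
        Test MyVisualizer output against a baseline image
        """
        dataset = self.load_data('occupancy')
        X = dataset[["temperature", "relative_humidity", "light", "C02", "humidity"]]

        visualizer = MyVisualizer()
        visualizer.fit(X)
        visualizer.poof()

        # Compares the figure against the image stored under
        # tests/baseline_images/test_regressor/test_myvisualizer/
        self.assert_images_similar(visualizer)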

The first time this test is run, there will be no baseline image to compare against, so the test will fail. Copy the output images (in this case tests/actual_images/test_regressor/test_myvisualizer/test_my_visualizer_output.png) to the correct subdirectory of baseline_images tree in the source directory (in this case tests/baseline_images/test_regressor/test_myvisualizer/test_my_visualizer_output.png). Put this new file under source code revision control (with git add). When rerunning the tests, they should now pass.

We also have a helper script, tests/images.py, to clean up and manage baseline images automatically. It is run using the python -m command to execute a module as main, and it takes as an argument the path to your test file. To copy the figures as above:

$ python -m tests.images tests/test_regressor/test_myvisualizer.py

This will move all related test images from actual_images to baseline_images on your behalf (note you'll have had to run the tests at least once to generate the images). You can also clean up images from both actual and baseline as follows:

$ python -m tests.images -C tests/test_regressor/test_myvisualizer.py

This is useful particularly if you're stuck trying to get an image comparison to work. For more information on the images helper script, use python -m tests.images --help.

The initial documentation for your visualizer will be a well structured docstring. Yellowbrick uses Sphinx to build documentation, therefore docstrings should be written in reStructuredText in numpydoc format (similar to scikit-learn). The primary location of your docstring should be right under the class definition, here is an example:

class MyVisualizer(Visualizer):
    """
    This initial section should describe the visualizer and what
    it's about, including how to use it. Take as many paragraphs
    as needed to get as much detail as possible.

    In the next section describe the parameters to __init__.

    Parameters
    ----------
    model : a scikit-learn regressor
        Should be an instance of a regressor, and specifically one whose name
        ends with "CV" otherwise it will raise a YellowbrickTypeError exception
        on instantiation. To use non-CV regressors see: ``ManualAlphaSelection``.

    ax : matplotlib Axes, default: None
        The axes to plot the figure on. If None is passed in the current axes
        will be used (or generated if required).

    kwargs : dict
        Keyword arguments that are passed to the base class and may influence
        the visualization as defined in other Visualizers.

    Examples
    --------
    >>> model = MyVisualizer()
    >>> model.fit(X)
    >>> model.poof()

    Notes
    -----
    In the notes section specify any gotchas or other info.
    """

When your visualizer is added to the API section of the documentation, this docstring will be rendered in HTML to show the various options and functionality of your visualizer!

To add the visualizer to the documentation it needs to be added to the docs/api folder in the correct subdirectory. For example, if your visualizer is a model score visualizer related to regression it would go in the docs/api/regressor subdirectory. If you have a question about where your documentation should be located, please ask the maintainers via your pull request; we'd be happy to help!

There are two primary files that need to be created:

mymodule.rst: the reStructuredText document

mymodule.py: a python file that generates images for the rst document

There are quite a few examples in the documentation on which you can base your files of similar types. The primary format for the API section is as follows:

.. -*- mode: rst -*-

MyVisualizer
============

Intro to my visualizer

.. code:: python

    # Example to run MyVisualizer
    visualizer = MyVisualizer(LinearRegression())
    visualizer.fit(X, y)
    g = visualizer.poof()

.. image:: images/my_visualizer.png

Discussion about my visualizer

API Reference
-------------

.. automodule:: yellowbrick.regressor.mymodule
    :members: MyVisualizer
    :undoc-members:
    :show-inheritance:

This is a pretty good structure for a documentation page: a brief introduction followed by a code example with a visualization included (using the mymodule.py to generate the images into the local directory's images subdirectory). The primary section is wrapped up with a discussion about how to interpret the visualizer and use it in practice. Finally the API Reference section will use automodule to include the documentation from your docstring.

At this point there are several places where you can list your visualizer, but to ensure it is included in the documentation it must be listed in the TOC of the local index. Find the index.rst file in your subdirectory and add your rst file (without the .rst extension) to the .. toctree:: directive. This will ensure the documentation is included when it is built.

Speaking of which, you can build your documentation by changing into the docs directory and running make html; the documentation will be built and rendered in the _build/html directory. You can view it by opening _build/html/index.html then navigating to your documentation in the browser.

There are several other places that you can list your visualizer including:

docs/index.rst for a high level overview of our visualizers

DESCRIPTION.rst for inclusion on PyPI

README.md for inclusion on GitHub

Please ask for the maintainer's advice about how to include your visualizer in these pages.

In this section we discuss more advanced contributing guidelines such as code conventions, the release life cycle, and branch management. This section is intended for maintainers and core contributors of the Yellowbrick project. If you would like to be a maintainer please contact one of the current maintainers of the project.

We use several strategies when reviewing pull requests from contributors to Yellowbrick. If the pull request affects only a single file or a small portion of the code base, it is sometimes sufficient to review the code using GitHub's lightweight code review feature. However, if the changes impact a number of files or modify the documentation, our convention is to add the contributor's fork as a remote, pull, and check out their feature branch locally. From inside your fork of Yellowbrick, this can be done as follows:
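
For example (the remote, username, and branch names are placeholders):

$ git remote add <contributor> https://github.com/<contributor>/yellowbrick.git
$ git fetch <contributor>
$ git checkout -b <feature-branch> <contributor>/<feature-branch>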

This will allow you to inspect their changes, run the tests, and build the docs locally. If the contributor has elected to allow reviewers to modify their feature branch, you will also be able to push changes directly to their branch:
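
Continuing with the placeholder names from above:

$ git push <contributor> <feature-branch>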

As the visualizer API has matured over time, we've realized that there are a number of routine items that must be in place to consider a visualizer truly complete and ready for prime time. This list is also extremely helpful for reviewing code submissions to ensure that visualizers are consistently implemented, tested, and documented. Though we do not expect these items to be checked off on every PR, the below list includes some guidance about what to look for when reviewing or writing a new Visualizer.

The basic principle of the visualizer API is that scikit-learn methods such as fit(), transform(), score(), etc. perform interactions with scikit-learn or other computations and call the draw() method. Calls to matplotlib should happen only in draw() or finalize().

Create a quick method for the visualizer.

In addition to creating the visualizer class, ensure there is an associated quick method that returns the visualizer and creates the visualization in one line of code!

Subclass the correct visualizer.

Ensure that the visualizer is correctly subclassed in the class hierarchy. If you're not sure what to subclass, please ping a maintainer; they'd be glad to help!

Ensure numpy array comparisons are not ambiguous.

Often there is code such as if y: where y is an array. However, this is ambiguous when used with numpy arrays and other data containers. Change this code to y is not None or len(y) > 0, or use np.all or np.any to test if the contents of the array are truthy/falsy.

Add random_state argument to visualizer.

If the visualizer uses/wraps a utility that also has random_state, then the visualizer itself needs to also have this argument which defaults to None and is passed to all internal stochastic behaviors. This ensures that image comparison testing will work and that users can get repeated behavior from visualizers.

Use np.unique instead of set.

If you need the unique values from a list or array, we prefer to use numpy methods wherever possible. We performed some limited benchmarking and believe that np.unique is a bit faster and more efficient.

Use sklearn underscore suffix for learned parameters.

Any parameters that are learned during fit() should only be added to the visualizer when fit() is called (this is also how we determine if a visualizer is fitted or not) and should be identified with an underscore suffix. For example, in classification visualizers, the classes can be either passed in by the user or determined when they are passed in via fit, therefore it should be self.classes_. This is also true for other learned parameters, e.g. self.score_, even though this is not created during fit().

Correctly set the title in finalize.

Use the self.set_title() method to set a default title; this allows the user to specify a custom title in the initialization arguments.

Ensure there is at least one image comparison test per visualizer. This is the primary regression testing of Yellowbrick and these tests catch a lot when changes occur in our dependencies or environment.

Use pytest assertions rather than unittest.

We prefer assert 2 + 2 == 4 rather than self.assertEquals(2 + 2, 4). Though there are a lot of legacy unittest assertions, we've moved to pytest and believe that one day we will have removed the unittest dependency.

Use test fixtures and sklearn dataset generators.

Data is the key to testing with Yellowbrick; often the test package will have fixtures in conftest.py that can be directly used (e.g. binary vs. multiclass in the test_classifier package). If one isn't available feel free to use randomly generated datasets from the sklearn.datasets module e.g. make_classification, make_regression, or make_blobs. For integration testing, please feel free to use one of the Yellowbrick datasets.

Fix all random_state arguments.

Be on the lookout for any method (particularly sklearn methods) that have a random_state argument and be sure to fix them so that tests always pass!

Test a variety of inputs.

Machine learning can be done on a variety of inputs for X and y, ensure there is a test with numpy arrays, pandas DataFrame and Series objects, and with Python lists.

Test that fit() returns self.

When doing end-to-end testing, we like to assert oz.fit() is oz to ensure the API is maintained.

Test that score() is between zero and one.

With visualizers that have a score() method, we like to assert 0.0 <= oz.score() <= 1.0 to ensure the API is maintained.

The visualizer docstring should be present under the class and contain a narrative about the visualizer and its arguments with the numpydoc style.

API Documentation.

All visualizers should have their own API page under docs/api/[yb-module]. This documentation should include an automodule statement. Generally speaking there is also an image generation script of the same name in this folder so that the documentation images can be generated on demand.

Listing the visualizer.

The visualizer should be listed in a number of places including: docs/api/[yb-module]/index.rst, docs/api/index.rst, docs/index.rst, README.md, and DESCRIPTION.rst.

Include a gallery image.

Please also add the visualizer image to the gallery!

Update added to the changelog.

To reduce the time it takes to put together the changelog, we'd like to update it when we add new features and visualizers rather than right before the release.

Our convention is that the person who performs the code review should merge the pull request (since reviewing is hard work and deserves due credit!). Only core contributors have write access to the repository and can merge pull requests. Some preferences for commit messages when merging in pull requests:

Make sure to use the "Squash and Merge" option in order to create a Git history that is understandable.

Keep the title of the commit short and descriptive; be sure it includes the PR #.

Craft a commit message body that is 1-3 sentences, depending on the complexity of the commit; it should explicitly reference any issues being closed or opened using GitHub's commit message keywords.

When ready to create a new release we branch off of develop as follows:

$ git checkout -b release-x.x

This creates a release branch for version x.x. At this point do the version bump by modifying version.py and the test version in tests/__init__.py. Make sure all tests pass for the release and that the documentation is up to date. Note, to build the docs see the documentation notes. There may be style changes or deployment options that have to be done at this phase in the release branch. At this phase you'll also modify the changelog with the features and changes in the release that have not already been marked.

Once the release is ready for prime-time, merge into master:

$ git checkout master
$ git merge --no-ff --no-edit release-x.x

Tag the release in GitHub:

$ git tag -a vx.x
$ git push origin vx.x

You'll have to go to the release page to edit the release with similar information as added to the changelog. Once done, push the release to PyPI:

$ make build
$ make deploy

Check that the PyPI page is updated with the correct version and that pip install -U yellowbrick updates the version and works correctly. Also check the documentation on PyHosted, ReadTheDocs, and on our website to make sure that it was correctly updated. Finally, merge the release into develop and clean up:
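
A sketch of that final merge and cleanup, mirroring the commands above:

$ git checkout develop
$ git merge --no-ff --no-edit release-x.x
$ git branch -d release-x.x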

Yellowbrick generates visualizations by wrapping matplotlib, the most prominent Python scientific visualization library. Because of this, Yellowbrick is able to generate publication-ready images for a variety of GUI backends, image formats, and Jupyter notebooks. Yellowbrick strives to provide well-styled visual diagnostic tools and complete information. However, to customize figures or roll your own visualizers, a strong background in using matplotlib is required.

This graphic from the matplotlib faq is gold. Keep it handy to understand the different terminology of a plot.

Most of the terms are straightforward but the main thing to remember is that the Figure is the final image that may contain 1 or more axes. The Axes represent an individual plot. Once you understand what these are and how to access them through the object oriented API, the rest of the process starts to fall into place.

The other benefit of this knowledge is that you have a starting point when you see things on the web. If you take the time to understand this point, the rest of the matplotlib API will start to make sense.

Matplotlib keeps a global reference to the current figure and axes objects, which can be modified by the pyplot API. To access them, import matplotlib as follows:

import matplotlib.pyplot as plt

axes = plt.gca()

The plt.gca() function gets the current axes so that you can draw on it directly. You can also directly create a figure and axes as follows:

fig = plt.figure()
ax = fig.add_subplot(111)

Yellowbrick will use plt.gca() by default to draw on. You can access the Axes object on a visualizer via its ax property:

from sklearn.linear_model import LinearRegression
from yellowbrick.regressor import PredictionError

# Fit the visualizer
model = PredictionError(LinearRegression())
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Call finalize to draw the final yellowbrick-specific elements
model.finalize()

# Get access to the axes object and modify labels
model.ax.set_xlabel("measured concrete strength")
model.ax.set_ylabel("predicted concrete strength")
plt.savefig("peplot.pdf")

You can also pass an external Axes object directly to the visualizer:

model = PredictionError(LinearRegression(), ax=ax)

Therefore you have complete control of the style and customization of a Yellowbrick visualizer.

The first step with any visualization is to plot the data. Often the simplest way to do this is using the standard pandas plotting function (given a DataFrame called top_10):

top_10.plot(kind='barh',y="Sales",x="Name")

The reason I recommend using pandas plotting first is that it is a quick and easy way to prototype your visualization. Since most people are probably already doing some level of data manipulation/analysis in pandas as a first step, go ahead and use the basic plots to get started.

Assuming you are comfortable with the gist of this plot, the next step is to customize it. Some of the customizations (like adding titles and labels) are very simple to use with the pandas plot function. However, you will probably find yourself needing to move outside of that functionality at some point. That's why it is recommended to create your own Axes first and pass it to the plotting function in Pandas:
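
# Create the Axes first, then hand it to the pandas plotting function
fig, ax = plt.subplots()
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)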

The resulting plot looks exactly the same as the original but we added an additional call to plt.subplots() and passed the ax to the plotting function. Why should you do this? Remember when I said it is critical to get access to the axes and figures in matplotlib? That's what we have accomplished here. Any future customization will be done via the ax or fig objects.

We have the benefit of a quick plot from pandas but access to all the power from matplotlib now. An example should show what we can do now. Also, by using this naming convention, it is fairly straightforward to adapt othersâ solutions to your unique needs.

Suppose we want to tweak the x limits and change some axis labels? Now that we have the axes in the ax variable, we have a lot of control:
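
fig, ax = plt.subplots()
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)

# Tweak the x limits and relabel through the Axes object
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue', ylabel='Customer')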

To further demonstrate this approach, we can also adjust the size of this image. By using the plt.subplots() function, we can define the figsize in inches. We can also remove the legend using ax.legend().set_visible(False):
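
# (the 5x6 inch figure size is an arbitrary choice)
fig, ax = plt.subplots(figsize=(5, 6))
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue', ylabel='Customer')

# Remove the legend
ax.legend().set_visible(False)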

There are plenty of things you probably want to do to clean up this plot. One of the biggest eyesores is the formatting of the Total Revenue numbers. Matplotlib can help us with this through the use of the FuncFormatter. This versatile function can apply a user defined function to a value and return a nicely formatted string to place on the axis.

Here is a currency formatting function to gracefully handle US dollars in the several hundred thousand dollar range:

def currency(x, pos):
    """
    The two args are the value and tick position
    """
    if x >= 1000000:
        return '${:1.1f}M'.format(x * 1e-6)
    return '${:1.0f}K'.format(x * 1e-3)

Now that we have a formatter function, we need to define it and apply it to the x axis. Here is the full code:
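
from matplotlib.ticker import FuncFormatter

fig, ax = plt.subplots()
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue', ylabel='Customer')

# Apply the currency formatter to the x axis ticks
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)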

That's much nicer and shows a good example of the flexibility to define your own solution to the problem.

The final customization feature I will go through is the ability to add annotations to the plot. In order to draw a vertical line, you can use ax.axvline() and to add custom text, you can use ax.text().

For this example, we'll draw a line showing an average and include labels showing three new customers. Here is the full code with comments to pull it all together.

# Create the figure and the axes
fig, ax = plt.subplots()

# Plot the data and get the average
top_10.plot(kind='barh', y="Sales", x="Name", ax=ax)
avg = top_10['Sales'].mean()

# Set limits and labels
ax.set_xlim([-10000, 140000])
ax.set(title='2014 Revenue', xlabel='Total Revenue', ylabel='Customer')

# Add a line for the average
ax.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)

# Annotate the new customers
for cust in [3, 5, 8]:
    ax.text(115000, cust, "New Customer")

# Format the currency
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)

# Hide the legend
ax.legend().set_visible(False)

While this may not be the most exciting plot it does show how much power you have when following this approach.

Up until now, all the changes we have made have been with the individual plot. Fortunately, we also have the ability to add multiple plots on a figure as well as save the entire figure using various options.

If we decided that we wanted to put two plots on the same figure, we should have a basic understanding of how to do it. First, create the figure, then the axes, then plot it all together. We can accomplish this using plt.subplots():

fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(7, 4))

In this example, I'm using nrows and ncols to specify the size because this is very clear to the new user. In sample code you will frequently just see values like 1, 2. I think using the named parameters is a little easier to interpret later on when you're looking at your code.

I am also using sharey=True so that the y-axis will share the same labels.

This example is also kind of nifty because the various axes get unpacked to ax0 and ax1. Now that we have these axes, you can plot them like the examples above but put one plot on ax0 and the other on ax1.

# Get the figure and the axes
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(7, 4))

top_10.plot(kind='barh', y="Sales", x="Name", ax=ax0)
ax0.set_xlim([-10000, 140000])
ax0.set(title='Revenue', xlabel='Total Revenue', ylabel='Customers')

# Plot the average as a vertical line
avg = top_10['Sales'].mean()
ax0.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)

# Repeat for the unit plot
top_10.plot(kind='barh', y="Purchases", x="Name", ax=ax1)
avg = top_10['Purchases'].mean()
ax1.set(title='Units', xlabel='Total Units', ylabel='')
ax1.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)

# Title the figure
fig.suptitle('2014 Sales Analysis', fontsize=14, fontweight='bold');

# Hide the legends
ax1.legend().set_visible(False)
ax0.legend().set_visible(False)

When writing code in a Jupyter notebook you can take advantage of the %matplotlib inline or %matplotlib notebook directives to render figures inline. More often, however, you probably want to save your images to disk. Matplotlib supports many different formats for saving files. You can use fig.canvas.get_supported_filetypes() to see what your system supports:
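
fig.canvas.get_supported_filetypes()
# e.g. (abridged): {'eps': 'Encapsulated Postscript',
#                   'pdf': 'Portable Document Format',
#                   'png': 'Portable Network Graphics',
#                   'svg': 'Scalable Vector Graphics', ...}

# Then save the figure in whichever format you need
fig.savefig('sales.png', dpi=80, bbox_inches='tight')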

For teachers and students of machine learning, Yellowbrick can be used as a framework for teaching and understanding a large variety of algorithms and methods. In fact, Yellowbrick grew out of teaching data science courses at Georgetown's School of Continuing Studies!

The following slide deck presents an approach to teaching students about the machine learning workflow (the model selection triple), including:

Yellowbrick is an open source, pure Python project that extends the scikit-learn API with visual analysis and diagnostic tools. The Yellowbrick API also wraps matplotlib to create publication-ready figures and interactive data explorations while still allowing developers fine-grain control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models and assist in diagnosing problems throughout the machine learning workflow.

Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively hone in on quality models than exhaustive search. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid pitfalls and traps.

The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. It extends the scikit-learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the scikit-learn pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data.

Discussions of machine learning are frequently characterized by a singular focus on model selection. Be it logistic regression, random forests, Bayesian methods, or artificial neural networks, machine learning practitioners are often quick to express their preference. The reason for this is mostly historical. Though modern third-party machine learning libraries have made the deployment of multiple models appear nearly trivial, traditionally the application and tuning of even one of these algorithms required many years of study. As a result, machine learning practitioners tended to have strong preferences for particular (and likely more familiar) models over others.

However, model selection is a bit more nuanced than simply picking the "right" or "wrong" algorithm. In practice, the workflow includes:

selecting and/or engineering the smallest and most predictive feature set

choosing a set of algorithms from a model family

tuning the algorithm hyperparameters to optimize performance

The model selection triple was first described in a 2015 SIGMOD paper by Kumar et al. In their paper, which concerns the development of next-generation database systems built to anticipate predictive modeling, the authors cogently express that such systems are badly needed due to the highly experimental nature of machine learning in practice. "Model selection," they explain, "is iterative and exploratory because the space of [model selection triples] is usually infinite, and it is generally impossible for analysts to know a priori which [combination] will yield satisfactory accuracy and/or insights."

The Yellowbrick package gets its name from the fictional element in the 1900 children's novel The Wonderful Wizard of Oz by American author L. Frank Baum. In the book, the yellow brick road is the path that the protagonist, Dorothy Gale, must travel in order to reach her destination in the Emerald City.

âThe road is first introduced in the third chapter of The Wonderful Wizard of Oz. The road begins in the heart of the eastern quadrant called Munchkin Country in the Land of Oz. It functions as a guideline that leads all who follow it, to the roadâs ultimate destinationâthe imperial capital of Oz called Emerald City that is located in the exact center of the entire continent. In the book, the novelâs main protagonist, Dorothy, is forced to search for the road before she can begin her quest to seek the Wizard. This is because the cyclone from Kansas did not release her farmhouse closely near it as it did in the various film adaptations. After the council with the native Munchkins and their dear friend the Good Witch of the North, Dorothy begins looking for it and sees many pathways and roads nearby, (all of which lead in various directions). Thankfully it doesnât take her too long to spot the one paved with bright yellow bricks.â

Yellowbrick is developed by data scientists who believe in open source and the project enjoys contributions from Python developers all over the world. The project was started by @rebeccabilbro and @bbengfort as an attempt to better explain machine learning concepts to their students; they quickly realized, however, that the potential for visual steering could have a large impact on practical data science and developed it into a high-level Python library.

Yellowbrick is incubated by District Data Labs, an organization that is dedicated to collaboration and open source development. As part of District Data Labs, Yellowbrick was first introduced to the Python Community at PyCon 2016 in both talks and during the development sprints. The project was then carried on through DDL Research Labs (semester-long sprints where members of the DDL community contribute to various data-related projects).

Yellowbrick is an open source project and its license is an implementation of the FOSS Apache 2.0 license by the Apache Software Foundation. In plain English this means that you can use Yellowbrick for commercial purposes, modify and distribute the source code, and even sublicense it. We want you to use Yellowbrick, profit from it, and contribute back if you do cool things with it.

There are, however, a couple of requirements that we ask from you. First, when you copy or distribute Yellowbrick source code, please include our copyright and license found in the LICENSE.txt at the root of our software repository. In addition, if we create a file called "NOTICE" in our project you must also include that in your source distribution. The "NOTICE" file will include attribution and thanks to those who have worked so hard on the project! Finally, you can't hold District Data Labs or any Yellowbrick contributor liable for your use of our software, nor use any of our names, trademarks, or logos.

We think that's a pretty fair deal, and we're big believers in open source. If you make any changes to our software, use it commercially or academically, or have any other interest, we'd love to hear about it.

We hope that Yellowbrick facilitates machine learning of all kinds, and we're particularly fond of academic work and research. If you're writing a scientific publication that uses Yellowbrick, you can cite Bengfort et al. (2018) with the following BibTeX:

@software{bengfort_yellowbrick_2018,
  title = {Yellowbrick},
  rights = {Apache License 2.0},
  url = {http://www.scikit-yb.org/en/latest/},
  abstract = {Yellowbrick is an open source, pure Python project that extends the Scikit-Learn {API} with visual analysis and diagnostic tools. The Yellowbrick {API} also wraps Matplotlib to create publication-ready figures and interactive data explorations while still allowing developers fine-grain control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.},
  version = {0.6},
  author = {Bengfort, Benjamin and Danielsen, Nathan and Bilbro, Rebecca and Gray, Larry and {McIntyre}, Kristen and Richardson, George and Miller, Taylor and Mayfield, Gary and Schafer, Phillip and Keung, Jason},
  date = {2018-03-17},
  doi = {10.5281/zenodo.1206264}
}

You can also find a DOI (digital object identifier) for every version of Yellowbrick on zenodo.org; use the BibTeX on that site to reference specific versions or changes made to the software.

Weâre also currently working on a scientific paper that describes Yellowbrick in the context of steering the model selection process. Stay tuned for a pre-release of this paper on arXiv.

Welcome to our frequently asked questions page. We're glad that you're using Yellowbrick! If your question is not captured here, please submit it to our Google Groups Listserv. This is an email list/forum that you, as a Yellowbrick user, can join and interact with other users to address and troubleshoot Yellowbrick issues. The Google Groups Listserv is where you should be able to receive the quickest response. We would welcome and encourage you to join the group so that you can respond to others' questions! You can also ask questions on Stack Overflow and tag them with "yellowbrick". Finally, you can add issues on GitHub and you can tweet or direct message us on Twitter @scikit_yb.

The Yellowbrick project is an open source, Python affiliated project. As a result, all interactions that occur with Yellowbrick must meet the guidelines described by the Python Software Foundation Code of Conduct. This includes interactions on all websites, tools, and resources used by Yellowbrick members including (but not limited to) mailing lists, issue trackers, GitHub, StackOverflow, etc.

In general this means everyone is expected to be open, considerate, and respectful of others no matter what their position is within the project.

Beyond this code of conduct, Yellowbrick is striving to set a very particular tone for contributors to the project. We show gratitude for any contribution, no matter how small. We don't only offer constructive criticism; we always identify positive feedback as well. When we communicate via text, we write as though we are speaking to each other and our mothers are in the room with us. Our goal is to make Yellowbrick the best possible place to make your first open source contribution, no matter who you are.

Python 2 Deprecation: Please note that this release deprecates Yellowbrick's support for Python 2.7. After careful consideration and following the lead of our primary dependencies (NumPy, scikit-learn, and Matplotlib), we have chosen to move forward with the community and support Python 3.4 and later.

Major Changes:

New JointPlot visualizer that is specifically designed for machine learning. The new visualizer can compare a feature to a target, features to features, and even feature to feature to target using color. The visualizer gives correlation information at a glance and is designed to work on ML datasets.

New datasets module that provides greater support for interacting with Yellowbrick example datasets, including support for Pandas, npz, and text corpora.

UMAPVisualizer as an alternative manifold to TSNE for corpus visualization that is fast enough not to require a preliminary PCA or SVD decomposition and that preserves higher-order similarities and distances.

Added .. plot:: directives to the documentation to automatically build the images along with the docs and keep them as up to date as possible. The directives also include the source code, making it much simpler to recreate examples.

Minor Changes:

Updated Rank2D to include Kendall-Tau metric.

Added target_color_type functionality to determine continuous or discrete color representations based on the type of the target variable.

Added user specification of ISO F1 values to PrecisionRecallCurve and updated the quick method to accept train and test splits.

Added code review checklist and conventions to the documentation and expanded the contributing docs to include other tricks and tips.

New Feature! The FeatureImportances Visualizer enables the user to visualize the most informative (relative and absolute) features in their model, plotting a bar graph of feature_importances_ or coef_ attributes (a short usage sketch follows this list).

New Feature! The ExplainedVariance Visualizer produces a plot of the explained variance resulting from a dimensionality reduction to help identify the best tradeoff between number of dimensions and amount of information retained from the data.

New Feature! The GridSearchVisualizer creates a color plot showing the best grid search scores across two parameters.

New Feature! The ClassPredictionError Visualizer is a heatmap implementation of the class balance visualizer, which provides a way to quickly understand how successfully your classifier is predicting the correct classes.

New Feature! The ThresholdVisualizer allows the user to visualize the bounds of precision, recall and queue rate at different thresholds for binary targets after a given number of trials.

New MultiFeatureVisualizer helper class to provide base functionality for getting the names of features for use in plot annotation.

Adds font size param to the confusion matrix to adjust its visibility.

Added a quick method for the confusion matrix.
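
Here is the sketch referenced above: a minimal, illustrative use of FeatureImportances. The import path reflects Yellowbrick releases of this era (it later moved to yellowbrick.model_selection), and the synthetic data is an assumption for demonstration only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.features import FeatureImportances

# Illustrative synthetic classification data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

viz = FeatureImportances(RandomForestClassifier(random_state=42))
viz.fit(X, y)  # fits the wrapped model and reads feature_importances_
viz.poof()     # render the bar graph of importances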

Tests: In this version, we've switched from using nose to pytest. Image comparison tests have been added and the visual tests are updated to matplotlib 2.2.0. Test coverage has also been improved for a number of visualizers, including JointPlot, AlphaPlot, FreqDist, RadViz, ElbowPlot, SilhouettePlot, ConfusionMatrix, Rank1D, and Rank2D.

Documentation updates, including discussion of Image Comparison Tests for contributors.

Bug Fixes:

Fixes the resolve_colors function. You can now pass in a number of colors and a colormap and get back the correct number of colors.

Fixes TSNEVisualizer Value Error when no classes are specified.

Adds the circle back to RadViz! This visualizer has also been updated to ensure there's a visualization even when there are missing values.

This release is an intermediate version bump in anticipation of the PyCon 2017 sprints.

The primary goals of this version were to (1) update the Yellowbrick dependencies, (2) enhance the Yellowbrick documentation to help orient new users and contributors, and (3) make several small additions and upgrades (e.g., pulling the Yellowbrick utils into a standalone module).

We have updated the scikit-learn and SciPy dependencies from version 0.17.1 or later to 0.18 or later. This primarily entails replacing from sklearn.cross_validation import train_test_split with from sklearn.model_selection import train_test_split.

The updates to the documentation include new Quickstart and Installation guides, as well as updates to the Contributors documentation, which is modeled on the scikit-learn contributing documentation.

This version also included upgrades to the KMeans visualizer, which now supports not only silhouette_score but also distortion_score and calinski_harabaz_score. The distortion_score computes the mean distortion of all samples as the sum of the squared distances between each observation and its closest centroid. This is the metric that KMeans attempts to minimize as it is fitting the model. The calinski_harabaz_score is defined as the ratio between the within-cluster dispersion and the between-cluster dispersion.
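
As a rough illustration of these metrics, here is a minimal sketch using the KElbowVisualizer; the synthetic blob data is purely illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer

# Illustrative synthetic data with four natural clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# metric may also be 'silhouette' or the Calinski-Harabasz score,
# depending on the Yellowbrick version
visualizer = KElbowVisualizer(KMeans(), k=(2, 10), metric='distortion')
visualizer.fit(X)  # fit a KMeans model for each K and record the score
visualizer.poof()  # draw the elbow curve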

Finally, this release includes a prototype of the VisualPipeline, which extends scikit-learn's Pipeline class, allowing multiple Visualizers to be chained or sequenced together.
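
A rough sketch only, assuming the prototype follows scikit-learn's (name, step) constructor convention, that feature visualizers pass data through transform(), and that the pipeline exposes a poof() method to render each visual step; the dataset, step names, and visualizer choices are all illustrative:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from yellowbrick.features import Rank2D
from yellowbrick.regressor import PredictionError
from yellowbrick.pipeline import VisualPipeline

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

pipeline = VisualPipeline([
    ('rank2d', Rank2D(algorithm='pearson')),  # data-space step (assumed pass-through transform)
    ('model', PredictionError(Ridge())),      # model-space step wrapping a regressor
])
pipeline.fit(X, y)
pipeline.score(X, y)  # triggers PredictionError's draw via score()
pipeline.poof()       # render each visual step's figure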

This release is the culmination of the Spring 2017 DDL Research Labs that focused on developing Yellowbrick as a community effort guided by a sprint/agile workflow. We added several more visualizers, did a lot of user testing and bug fixes, updated the documentation, and generally discovered how best to make Yellowbrick a friendly project to contribute to.

Notable in this release is the inclusion of two new feature visualizers that use few, simple dimensions to visualize features against the target. The JointPlotVisualizer graphs a scatter plot of two dimensions in the data set and plots a best-fit line across it. The ScatterVisualizer also uses two features, but colors the graph by the target variable, adding a third dimension to the visualization.

This release also adds support for clustering visualizations, namely the KElbowVisualizer, which implements the elbow method for selecting K, and the SilhouetteVisualizer, which visualizes cluster size and density. The release also adds support for regularization analysis using the AlphaSelection visualizer. The text and classification modules were improved as well, with the inclusion of the PosTagVisualizer and the ConfusionMatrix visualizer, respectively.

This release also added an Anaconda repository and distribution so that users can conda install yellowbrick. Even more notable, we got Yellowbrick stickers! We've also updated the documentation to make it more friendly and a bit more visual, and fixed the API rendering errors. All in all, this was a big release with a lot of contributions, and we thank everyone who participated in the lab!

Intermediate sprint to demonstrate prototype implementations of text visualizers for NLP models. Primary contributions were the FreqDistVisualizer and the TSNEVisualizer.

The TSNEVisualizer displays a projection of a vectorized corpus in two dimensions using TSNE, a nonlinear dimensionality reduction method that is particularly well suited to embedding in two or three dimensions for visualization as a scatter plot. TSNE is widely used in text analysis to show clusters or groups of documents or utterances and their relative proximities.
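
A minimal sketch on a toy corpus follows; the documents and labels are purely illustrative, and decompose=None (skipping the preliminary SVD decomposition, which this tiny vocabulary cannot support) is an assumption about the visualizer's parameters:

from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

base = [
    ("the quick brown fox jumps over the lazy dog", "animals"),
    ("cats and dogs make wonderful pets", "animals"),
    ("machine learning models require careful evaluation", "ml"),
    ("visual diagnostics help steer model selection", "ml"),
]
corpus = [text for text, _ in base] * 20   # repeat to give t-SNE enough samples
labels = [label for _, label in base] * 20

# Densify because t-SNE's default barnes_hut method needs dense input
docs = TfidfVectorizer().fit_transform(corpus).toarray()

tsne = TSNEVisualizer(decompose=None)
tsne.fit(docs, labels)  # project the corpus and color points by label
tsne.poof()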

The FreqDistVisualizer implements a frequency distribution plot that tells us the frequency of each vocabulary item in the text. In general, it could count any kind of observable event. It is a distribution because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
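
For example, a minimal sketch on a toy corpus (get_feature_names reflects the scikit-learn API of this era; newer releases use get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps in the sun",
    "machine learning models require careful evaluation",
]

vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(corpus)

# The visualizer needs the vocabulary to label the bars
viz = FreqDistVisualizer(features=vectorizer.get_feature_names())
viz.fit(docs)
viz.poof()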

Hardened the Yellowbrick API to elevate the idea of a Visualizer to a first principle. This included reconciling shifts in the development of the preliminary versions to the new API, formalizing Visualizer methods like draw() and finalize(), and adding utilities that revolve around scikit-learn. To that end we also performed administrative tasks like refreshing the documentation and preparing the repository for more and varied open source contributions.

This release marks a major change from the previous MVP releases as Yellowbrick moves towards direct integration with scikit-learn for visual diagnostics and steering of machine learning and could therefore be considered the first alpha release of the library. To that end we have created a Visualizer model which extends sklearn.base.BaseEstimator and can be used directly in the ML Pipeline. There are a number of visualizers that can be used throughout the model selection process, including for feature analysis, model selection, and hyperparameter tuning.

In this release specifically, we focused on visualizers in the data space for feature analysis and visualizers in the model space for scoring and evaluating models. Future releases will extend these base classes and add more functionality.

Created an API for visualization with machine learning: Visualizers that are BaseEstimators.

Created a class hierarchy for Visualizers throughout the ML process, particularly for feature analysis and model evaluation.

The Visualizer interface is a draw() method, which can be called multiple times on data or model spaces, and a poof() method to finalize the figure and display it or save it to disk (a usage sketch appears at the end of these notes).

ScoreVisualizers wrap scikit-learn estimators and implement fit() and predict() (pass-throughs to the estimator) as well as a score() method that calls draw() in order to visually score the estimator. If the estimator isn't appropriate for the scoring method, an exception is raised.

ROCAUC is a ScoreVisualizer that plots the receiver operating characteristic curve and displays the area under the curve score.

ClassificationReport is a ScoreVisualizer that renders the confusion matrix of a classifier as a heatmap.

PredictionError is a ScoreVisualizer that plots the actual vs. predicted values and the 45 degree accuracy line for regressors.

ResidualPlot is a ScoreVisualizer that plots the residuals (y - yhat) across the actual values (y) with the zero accuracy line for both train and test sets.

ClassBalance is a ScoreVisualizer that displays the support for each class as a bar plot.

FeatureVisualizers are scikit-learn Transformers that implement fit() and transform() and operate on the data space, calling draw to display instances.

ParallelCoordinates plots instances, colored by class, as line segments across a horizontal space with one vertical position per feature dimension.

RadViz plots instances with class in a circular space where each feature dimension is an arc around the circumference and points are plotted relative to the weight of the feature.

Rank2D plots pairwise scores of features as a heatmap in the space [-1, 1] to show relative importance of features. Currently implemented ranking functions are Pearson correlation and covariance.

Coordinated and added palettes in the bgrmyck space and implemented a version of the Seaborn set_palette and set_color_codes functions as well as the ColorPalette object and other matplotlib.rc modifications.

Inherited Seaborn's notebook context and whitegrid axes style, but made them the default and did not allow the user to modify them (users who would like to will have to import Seaborn). This gives Yellowbrick a consistent look and feel without giving too much work to the user and prepares us for matplotlib 2.0.
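
As referenced earlier in these notes, here is a minimal sketch of the ScoreVisualizer workflow using ROCAUC; the dataset and classifier are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC

# Illustrative binary classification data
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

visualizer = ROCAUC(LogisticRegression())  # wrap the estimator
visualizer.fit(X_train, y_train)           # fit() passes through to the estimator
visualizer.score(X_test, y_test)           # score() calls draw() with the results
visualizer.poof()                          # finalize and display the figure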