Machine Learning with Python - Logistic Regression

Sunday, November 6, 2011

Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets that anyone can use on real problems or real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I learned. The best part is that the posts will include examples with Python, NumPy and SciPy. I hope you enjoy them!

Logistic regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logistic function. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is widely used in many scenarios, such as marketing applications that predict a customer's propensity to purchase a product or cancel a subscription.

Visualizing the Data

Let's explain logistic regression by example. Suppose you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on two exams and the admission decision. We will use logistic regression to build a model that estimates the probability of admission based on the scores from those two exams.

Let's first visualize our data on a 2-dimensional plot as shown below. As you can see, the axes are the two exam scores, and the positive and negative examples are shown with different markers.

Logistic regression models the probability as h(x) = g(theta' * x), where g(z) = 1 / (1 + e^(-z)) is the sigmoid function. The sigmoid maps any real value into the range (0, 1): for large positive values of z it is close to 1, while for large negative values it is close to 0.

Sigmoid Logistic Function
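As a quick illustration of the behaviour described above, here is a minimal NumPy sketch of the sigmoid. The function name `sigmoid` is my own choice for this example and is not necessarily what the code accompanying this post uses.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                      # 0.5, the midpoint of the curve
print(sigmoid(np.array([-10.0, 10.0])))  # very close to 0 and very close to 1
```

For very large negative inputs `np.exp(-z)` can overflow and emit a RuntimeWarning; the result still rounds to 0, but a numerically safer variant may be preferable in production code.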

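The cost function for logistic regression is given below:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and the gradient of the cost is a vector of the same length as theta, where the j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$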

You may note that the gradient looks quite similar to the linear regression gradient. The difference is that linear and logistic regression have different definitions of h(x).
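The cost and gradient can be written in a few vectorized lines. The sketch below assumes labels coded as 0 and 1 and a design matrix `X` whose first column is all ones (the intercept term); the function names and the tiny dataset are made up for illustration and are not taken from the post's source code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average cross-entropy cost J(theta) for logistic regression."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def gradient(theta, X, y):
    """Vector of partial derivatives of J, one entry per parameter."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / y.size

# Made-up toy data: an intercept column plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta0 = np.zeros(2)

print(cost(theta0, X, y))      # log(2), about 0.693, when theta is all zeros
print(gradient(theta0, X, y))
```

With theta set to zero every prediction is 0.5, which is why the cost equals log(2) regardless of the data.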

Now, to find the minimum of this cost function, we will use a SciPy function called fmin_bfgs. It will find the best parameters theta for the logistic regression cost function given a fixed dataset (of X and y values).

The parameters are:

The initial values of the parameters you are trying to optimize;

A function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X,y).
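A minimal sketch of such a call is shown below. It assumes separate `cost` and `gradient` functions and passes the dataset through the `args` parameter of `scipy.optimize.fmin_bfgs`; the toy data is invented and deliberately not linearly separable, so that the minimum is finite. The code that accompanies this post may wire things up differently.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / y.size

# Made-up, non-separable toy data (intercept column plus one feature).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 2.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

initial_theta = np.zeros(X.shape[1])
# fmin_bfgs calls cost(theta, X, y) and gradient(theta, X, y) internally.
theta = fmin_bfgs(cost, initial_theta, fprime=gradient, args=(X, y), disp=False)

print(theta)
print(cost(theta, X, y))
```

Supplying `fprime` is optional; without it fmin_bfgs falls back to a finite-difference approximation of the gradient, which is slower and less precise.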

The final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the figure below.

Evaluating logistic regression

Now that you have learned the parameters of the model, you can use it to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.

But you can go further and evaluate the quality of the parameters that we have found by checking how well the learned model predicts on our training set. If we apply a threshold of 0.5 to the output of the sigmoid function, the model predicts 1 when h(x) >= 0.5 and -1 otherwise, where 1 represents admitted and -1 not admitted.

Going to the code, we can calculate the training accuracy of our classifier, that is, the percentage of examples it got correct. Source code.
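Thresholding and the accuracy computation can be sketched as follows. This example assumes the labels are stored as 0 and 1 (the coding used for the second dataset in this post) rather than -1 and 1, and both the `theta` values and the data are hypothetical placeholders, not the parameters learned from the admissions dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return (sigmoid(X @ theta) >= threshold).astype(int)

def accuracy(theta, X, y):
    """Percentage of examples whose predicted label matches the true label."""
    return 100.0 * np.mean(predict(theta, X) == y)

# Hypothetical parameters and made-up data, for illustration only.
theta = np.array([-4.5, 2.0])
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 2.5], [1.0, 4.0]])
y = np.array([0, 0, 1, 0, 1, 1])

print(predict(theta, X))
print(accuracy(theta, X, y))
```

Note that accuracy measured on the training set tends to be optimistic; a held-out set gives a fairer estimate of how the model generalizes.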

89%, not bad, huh?!

Regularized logistic regression

But what happens when your data cannot be separated into positive and negative examples by a straight line through the plot? Since plain logistic regression is only able to find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.

Suppose you are the product manager of a factory and you have the results of two different tests for some microchips. From these two tests you would like to determine whether the microchips should be accepted or rejected. We have a dataset of test results on past microchips, from which we can build a logistic regression model.

Visualizing the data

Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.

Microchip training set

A model built for this task may predict all of the training data perfectly, and that can sometimes be a cause for concern. Just because the model can perfectly reconstruct the training set does not mean that it has everything figured out. This is known as overfitting. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least some regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action in the next sections.

Feature mapping

One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.

As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary, which will appear nonlinear when drawn in our 2D plot.
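One possible way to build this mapping is sketched below. The function name `map_feature` and the ordering of the columns are assumptions made for this example; any ordering works as long as it is used consistently for training and prediction.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Return all monomials x1^a * x2^b with a + b <= degree, one per column.

    The first column (a = b = 0) is all ones and acts as the intercept term.
    """
    x1 = np.asarray(x1, dtype=float).ravel()
    x2 = np.asarray(x2, dtype=float).ravel()
    columns = []
    for total in range(degree + 1):      # total degree of the monomial
        for j in range(total + 1):       # power assigned to x2
            columns.append(x1 ** (total - j) * x2 ** j)
    return np.column_stack(columns)

# Two made-up points, just to check the shape of the result.
features = map_feature([2.0, 0.5], [3.0, -1.0])
print(features.shape)   # (2, 28): 1 + 2 + 3 + 4 + 5 + 6 + 7 columns
```

The count of 28 follows from having one term of degree 0, two of degree 1, and so on up to seven terms of degree 6.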

Although the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. This is where regularized logistic regression comes in: it fits the data while avoiding the overfitting problem.

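The regularized cost function adds a penalty on the size of the parameters, weighted by the regularization parameter lambda:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Note that you should not regularize the parameter theta_0, so the final summation is for j = 1 to n, not j = 0 to n. The gradient of the cost function is a vector where the j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \quad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \quad \text{for } j \geq 1$$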
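In code, the only change from the unregularized version is the extra penalty term, which skips the first parameter. The sketch below assumes 0/1 labels, an intercept column in `X`, and a regularization strength named `lam` (my own name for lambda); the small dataset is made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Cross-entropy cost plus an L2 penalty on every parameter except theta[0]."""
    m = y.size
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    penalty = lam / (2.0 * m) * np.sum(theta[1:] ** 2)
    return data_term + penalty

def gradient_reg(theta, X, y, lam):
    """Gradient of cost_reg; the intercept entry receives no penalty term."""
    m = y.size
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]
    return grad

# Made-up data: intercept column plus two features.
X = np.array([[1.0, 0.5, -1.0], [1.0, -1.5, 2.0],
              [1.0, 3.0, 0.5], [1.0, 0.2, -0.3]])
y = np.array([0.0, 1.0, 1.0, 0.0])
theta = np.array([0.1, -0.2, 0.3])
lam = 1.0

print(cost_reg(theta, X, y, lam))
print(gradient_reg(theta, X, y, lam))
```

A useful sanity check when writing functions like these is to compare the analytic gradient against a finite-difference approximation of the cost.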

Now let's learn the optimal parameters theta. Using these new cost and gradient functions together with the SciPy optimization function from the previous example, we will be able to learn the parameters theta.
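Putting the pieces together, a rough end-to-end sketch might look like the following. It uses synthetic stand-in data (points inside a circle are labelled positive), not the microchip dataset, and it assumes `fmin_bfgs` as the optimizer with lambda set to 1; the helper names are my own.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_feature(x1, x2, degree=6):
    # All monomials x1^a * x2^b with a + b <= degree (28 columns for degree 6).
    return np.column_stack([x1 ** (t - j) * x2 ** j
                            for t in range(degree + 1) for j in range(t + 1)])

def cost_reg(theta, X, y, lam):
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    return data_term + lam / (2.0 * y.size) * np.sum(theta[1:] ** 2)

def gradient_reg(theta, X, y, lam):
    grad = X.T @ (sigmoid(X @ theta) - y) / y.size
    grad[1:] += lam / y.size * theta[1:]
    return grad

# Synthetic stand-in data (NOT the microchip dataset).
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (points[:, 0] ** 2 + points[:, 1] ** 2 < 0.5).astype(float)
X = map_feature(points[:, 0], points[:, 1])

lam = 1.0
initial_theta = np.zeros(X.shape[1])
theta = fmin_bfgs(cost_reg, initial_theta, fprime=gradient_reg,
                  args=(X, y, lam), disp=False)

# Training accuracy with a 0.5 threshold on the predicted probability.
train_accuracy = 100.0 * np.mean((sigmoid(X @ theta) >= 0.5) == (y == 1.0))
print(train_accuracy)
```

Raising `lam` shrinks the parameters and smooths the decision boundary, while setting it to 0 recovers the unregularized fit, which is more prone to overfitting.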

Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples.

Decision Boundary

As you can see, our model classified the training data with an accuracy of 83.05%. Code

Scikit-learn

Scikit-learn is an amazing tool for machine learning, providing several modules for working with classification, regression and clustering problems. It is built on Python, NumPy and SciPy, and it is open source!

If you want to use logistic regression or linear regression, you should consider scikit-learn. It has several examples and several types of regularization strategies to work with. Take a look at this link and see for yourself! I recommend it!
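For comparison, here is a short sketch of the same kind of model in scikit-learn, using `LogisticRegression` together with `PolynomialFeatures` in a pipeline. The data is synthetic (not one of the datasets from this post), and the choice of `C=1.0` is just an illustrative default; in scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data: points inside a circle are the positive class.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# Polynomial terms up to degree 6, followed by L2-regularized logistic regression.
# include_bias=False because LogisticRegression fits its own intercept.
model = make_pipeline(
    PolynomialFeatures(degree=6, include_bias=False),
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(X, y)

train_score = model.score(X, y)                        # fraction classified correctly
center_prob = model.predict_proba([[0.0, 0.0]])[0, 1]  # P(y = 1) at the origin
print(train_score)
print(center_prob)
```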

Conclusions

Logistic regression has several advantages over linear regression. In particular, it is more robust and does not assume a linear relationship between the predictors and the outcome, since it can handle nonlinear effects. However, it requires much more data to achieve stable, meaningful results. There are other machine learning techniques for handling non-linear problems, and we will see them in the next posts. I hope you enjoyed this article!

This is a great tutorial, but I am confused with the first example. Why is the theta vector of length 3? Shouldn't it be of length 2? The theta vector you are trying to optimize is the slope and y-intercept, correct?

How would I use fmin bfgs if I'm training a Neural Network? The cost function over there has more than 1 theta. How would I provide a list of thetas to the "decorated cost" function? I tried doing it, but I get errors in scipy optimize (the thetas don't change; the program crashes after a couple of iterations).

Hi. Nice post. I am wondering if it is possible to tweak LogisticRegression in scikit-learn a little bit to get a "Regressor" rather than a "Classifier" like LogisticRegression? I went through all the code. It seems that one of the main base classes, BaseLibLinear, can only train a different set of coefficients for each different y. I would really appreciate an answer. Thanks.

I seem to be having an issue with the code. Downloaded from GitHub and run it. I would assume that in log_reg.py that the output from decorated_cost() function would be the theta values defining our boundary. In fact, the code hard codes those theta values rather than using the model output. If you use what is returned by decorated_cost(), it is not accurate. How did you generate the hard coded values? Am I missing something?

Hi all, I solved the issue related to logistic regression. It was a simple misunderstanding: I had replaced the cost_function with the wrong J, since f_min receives only a single value, and I was also using the negative value, which was wrong for the problem (minimization).

Hello Marcel, I can not make either one work. The log_reg.py shows the "RuntimeWarning: overflow encountered in exp". For the log_reg_regular.py, I changed the maxfun to maxiter but it still shows "thetaR = theta[1:, 0] IndexError: too many indices".

I found your article very interesting after some effort to understand logistic regression. I notice that if the threshold for h[it] in the predict function is changed from 0.5 to 0.2 or 0.3, the test accuracy skyrockets to 0.92! Can you explain why? How can we tell whether that is a correct result or not? Thanks for any feedback.

In theory, example 1 should yield better accuracy if we added more features the same way it's done in example 2. After adding additional features, for some reason the minimizing function doesn't want to converge and stays at 60%. Any ideas why?

Hi, I am trying to do event recommendation based on the tags of events, and I want to use logistic regression as the algorithm of the system. But logistic regression uses vectors (x, y), and I could not transform tags into vectors. Can anyone help me?

Marcel Caraciolo

I am a Brazilian data scientist, entrepreneur, Python hacker and technology consultant. Nowadays I work with data-centric applications, especially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high-performance computing and data visualization, as well as educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies: Atepassar.com, a social network for students in Brazil, and PyCursos, an online startup for Python training and online courses. In 2014, I took on a new position as CTO at Genomika Diagnósticos, a Brazilian genetic testing laboratory.