kNN is a good classifier to start with, but it can be slow at prediction time if you have a lot of training data. I recommend trying Support Vector Machines as well, which also have an easy-to-use implementation in scikit-learn. Start with the RBF kernel; it gives good classification on many types of data. Of course, there's no free lunch, so experiment with different classifiers and their parameters!
– Steve L, Mar 20 '13 at 18:40
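
A minimal sketch of the comparison suggested in the comment above, assuming a recent scikit-learn and using the bundled iris dataset as a stand-in for real data. The parameter values (`n_neighbors=5`, `C=1.0`, `gamma="scale"`) are illustrative starting points, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Iris is only a placeholder dataset; swap in your own feature matrix and labels.
X, y = load_iris(return_X_y=True)

# Both kNN and RBF SVMs are distance-based, so features are standardized first.
candidates = {
    "kNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

# 5-fold cross-validated accuracy for each candidate.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

Which of the two comes out ahead depends on the data, so it is worth rerunning this with different values of `n_neighbors`, `C`, and `gamma`.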

Wow, I thought the mlpy documentation was cool, but I tried scikit-learn and it's awesome. I worked through kNN with the iris dataset and I understand what's going on. Thanks for pointing out both scikit-learn and the algorithms. I will look further, as Steve L and Pedrom suggested, after I get my hands dirty. Thanks to all.
– Muhammet Can, Mar 20 '13 at 23:46

You have a wide selection of algorithms implemented in mlpy, so you should be fine. I agree with Steve L that Support Vector Machines are great, but even though they are easy to use, their inner details are not easy to grasp, especially if you are new to ML.

For starters, decision trees have the advantage of producing output that is easy to understand and hence easier to debug.
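
To illustrate how readable a tree's output can be, here is a small sketch using scikit-learn rather than mlpy (an assumption on my part; `export_text` needs scikit-learn 0.21 or later). The iris dataset and `max_depth=2` are chosen only to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set small enough to read at a glance.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the learned splits as indented if/else style rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, so a surprising prediction can be traced back to the exact split that caused it.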

Logistic Regression, on the other hand, can give you good results and scales very well if you need more data.

I would say that in your case, you should look for the algorithm that, after a bit of reading, you find most comfortable to work with. Most of the time, all of them are capable of giving you very decent results. Good luck!

As others mentioned, you can use many algorithms for authorship attribution. kNN is a good starting point. Beyond that, you can try several other algorithms, such as Logistic Regression, the Naïve Bayes classifier, and Neural Networks, which will probably give more accurate predictions.

I’m also interested in authorship attribution and plagiarism detection. In fact, I have used the above techniques for source code authorship attribution. You can read more about these in the following research papers.

Given that you are not familiar with ML, the first three algorithms I would recommend would be:

1- Logistic Regression
2- Naive Bayes
3- Support Vector Machines

If you are only interested in predictive performance, have enough training data, and have no missing values, you will often find that more complex methodologies, such as Bayesian Networks, do not yield statistically significant improvements in predictive performance. Even if they do, you should start with these three (relatively) simple methodologies and use them as reference benchmarks.
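
A rough sketch of how these three could be set up as reference benchmarks for authorship attribution with scikit-learn. Everything specific here is an assumption for illustration: the tiny corpus and the author labels `A` and `B` are invented, character n-gram TF-IDF is just one common feature choice, and a linear-kernel `SVC` stands in for "Support Vector Machines".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy corpus for illustration only: two made-up "authors".
texts = [
    "I reckon the weather will turn, and so we shall wait.",
    "We shall wait, I reckon, until the rain has passed.",
    "I reckon it is best that we shall not hurry.",
    "And so, I reckon, the road shall be dry by noon.",
    "gonna grab coffee then fix the build, no big deal",
    "no big deal, gonna ship it after lunch",
    "build is broken again, gonna fix it, no big deal",
    "gonna grab lunch then review the patch",
]
authors = ["A"] * 4 + ["B"] * 4


def make_features():
    # Character n-grams capture spelling and punctuation habits.
    return TfidfVectorizer(analyzer="char", ngram_range=(2, 3))


benchmarks = {
    "Logistic Regression": make_pipeline(make_features(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": make_pipeline(make_features(), MultinomialNB()),
    "SVM (linear kernel)": make_pipeline(make_features(), SVC(kernel="linear")),
}

# 2-fold stratified cross-validation, only because the toy corpus is so small.
results = {}
for name, model in benchmarks.items():
    results[name] = cross_val_score(model, texts, authors, cv=2).mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```

With a real corpus you would use more folds and many more documents per author; the numbers from these baselines then give you something to compare any more complex model against.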