CS 920 Programming Paradigms (3 credit units)

CS 920, taught by Patricia Hoffman, PhD, is an extensive course on the R statistical language. (The course number and title are due to change.)

It can be taken by students who have never coded in any language before. It starts from the very beginning and continues through advanced concepts. It will give you a great jump start for my machine learning and data mining courses. In fact, the last assignment for this course is similar to the first assignment of my machine learning courses.

Student comments

"I really enjoyed the class and would recommend anyone to do it. It was a
perfect blend of Theory of Data Mining and practical exercises in
language R. Exercises were extremely helpful, to not only understand
the concepts but how to apply them to real problems! It was one of the
most fun classes I have attended in a long time. The best part is that
what I have learned, I am able to directly apply in my work right away!
Thanks Patricia for putting together such an excellent course!" -
Sourabh Satish, Distinguished Engineer, Symantec.

"Thanks a lot for your course. I really enjoy it! I
wish you could teach a second course on machine learning/data mining or
on their applications." - Bo Wu

Dr Hoffman,

Thanks for teaching the class.
I REALLY, REALLY learned a lot from this class, in fact it's the best class I have taken at UCSC. Besides the great class lectures, and
discussion, the homework, and project helped me a lot. The extra work you gave us, made me revise the subject matter once I got home, even
though it did seem like extra work, but that really paid back.

Thanks for teaching this subject to us.

I will stay in touch.
Best Regards
Navindra Yadav, Google

Hi Tricia:

I would like to thank you for
offering such an excellent course. I have learned a lot from your course.

Your course has given me a jump
start on R and a solid foundation in machine learning. You have covered a lot
of breadth, which has helped me understand the broad array of techniques for
machine learning. Your example programs were an excellent source of learning as
well.

Machine Learning automatically recognizes complex, previously unknown, novel, and useful patterns and information in all types of data. Data-driven algorithms are the wave of the future, and their results improve as the amount of data increases. Machine Learning algorithms are used in search engines, image analysis, multimedia database retrieval, bioinformatics, industrial automation, speech recognition, and many other fields. These are survey courses covering the concepts and principles of a large variety of data mining methods. The courses will equip students with a working knowledge of these techniques and prepare them to apply machine learning to real problems. At the end of this sequence, the students will collaborate on a set of projects of their choosing.

These courses require a moderate level of computer programming proficiency, along with an elementary background in probability, statistics, linear algebra, and calculus. These are hands-on courses using the statistical language R for class examples and homework assignments. No prior knowledge of R is assumed, and some of the basics of the open source R language are covered.

Machine Learning 201 and 202

These courses cover topics in greater depth than Machine Learning 101 and 102. After finishing this series, participants are able to read the current literature and apply what they have read to their own work. At the end of this sequence, students present interesting machine learning projects using a wide variety of data sources.

Machine Learning 201 begins with ordinary least squares regression and extends this basic tool in a number of directions. Various regularization approaches are covered, including ridge regression, lasso regression, least angle regression, and elastic net. Logistic regression, including coding categorical inputs and outputs, is discussed. Feature space expansions along with subset selection (both forward and backward step-wise) are detailed. These techniques naturally lead to generalizations of linear regression, known as the "generalized linear model" and the "generalized additive model".
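As a rough illustration of how regularization extends ordinary least squares, the sketch below fits a single centered feature three ways. It is not taken from the course materials (which use R); the function names and the toy data are assumptions made for this example. For one centered feature, ridge shrinks the slope by enlarging the denominator, while lasso soft-thresholds the numerator and can set the slope exactly to zero.

```python
import math


def _centered_sums(x, y):
    """Return (Sxy, Sxx) after subtracting the means of x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    sxy = sum(a * b for a, b in zip(xc, yc))
    sxx = sum(a * a for a in xc)
    return sxy, sxx


def ols_slope(x, y):
    """Minimize 1/2 * sum((y - b*x)^2): b = Sxy / Sxx."""
    sxy, sxx = _centered_sums(x, y)
    return sxy / sxx


def ridge_slope(x, y, lam):
    """Minimize 1/2 * sum((y - b*x)^2) + lam/2 * b^2: b = Sxy / (Sxx + lam)."""
    sxy, sxx = _centered_sums(x, y)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Minimize 1/2 * sum((y - b*x)^2) + lam * |b| via soft-thresholding."""
    sxy, sxx = _centered_sums(x, y)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 4.0, 6.0, 8.0, 10.0]
    print("OLS:  ", ols_slope(x, y))
    print("Ridge:", ridge_slope(x, y, lam=10.0))
    print("Lasso:", lasso_slope(x, y, lam=5.0))
```

With several correlated features the closed forms above no longer apply directly, which is where methods such as least angle regression and coordinate descent come in.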

Participants learn to adapt and execute machine learning algorithms in the MapReduce framework. Participants finishing the class are able to author their own machine learning algorithms for MapReduce and to run them on Amazon Web Services. Amazon provided AWS credits for class participants.

Participants learn to use Python code to author mappers and reducers for "hadoop-streaming". For most of the class, mrjob, an open-source framework developed at Yelp, is used. Employing mrjob enables class members to program mappers and reducers in Python. The mrjob framework then submits the mapper-reducer job to run locally without using Hadoop, to run on Amazon Web Services, or to run on a private Hadoop cluster. This simplifies the programming tasks.
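To make the mapper and reducer roles concrete, here is a minimal word-count sketch in plain Python. It deliberately does not import mrjob or Hadoop; the `run_local` helper is a hypothetical stand-in for the shuffle-and-sort step that the framework would normally perform between the two phases, and all names here are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Sum the counts for each word; input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate the shuffle/sort between map and reduce on one machine."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    print(run_local(["the quick brown fox", "the lazy dog"]))
```

In mrjob the same two functions would typically become `mapper` and `reducer` methods on a subclass of `MRJob`, with the framework handling input splitting, sorting, and job submission.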

Topics included in this course are k-means with canopy clustering. Implementing expectation maximization in the map-reduce paradigm is developed. Generalized Linear Models with regularized regression are covered. Recommender systems along with singular value decomposition are discussed. Frequent item set implementations are also provided.
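As one way to picture k-means in the map-reduce paradigm, the sketch below runs a single iteration in plain Python: the map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid. Canopy clustering (used to choose starting centroids) is not implemented here, and the function names and data are assumptions for illustration rather than the course's own code.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_mapper(points, centroids):
    """Map step: emit (index of nearest centroid, point) for each point."""
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(point, centroids[i]),
        )
        yield nearest, point


def kmeans_reducer(assignments, centroids):
    """Reduce step: replace each centroid by the mean of its assigned points."""
    groups = defaultdict(list)
    for index, point in assignments:
        groups[index].append(point)
    new_centroids = list(centroids)  # centroids with no points stay unchanged
    for index, members in groups.items():
        dim = len(members[0])
        new_centroids[index] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centroids


def kmeans_iteration(points, centroids):
    """One full map-then-reduce pass of k-means."""
    return kmeans_reducer(kmeans_mapper(points, centroids), centroids)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_iteration(pts, [(1.0, 1.0), (9.0, 9.0)]))
```

In a distributed setting each pass would be one job, with the updated centroids fed back in as input to the next pass until they stop moving.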

Machine Learning - Natural Language Text Documents

Machine Learning applied to natural language text documents will be covered, including the use of statistical algorithms for accomplishing machine learning tasks on texts, rather than more traditional rule-based semantics, parsing, etc. The class starts with an introduction to basic text manipulations, and continues with comparisons of statistical techniques to semantic approaches, definition of problems in text mining, and simple text manipulations. Various algorithms are covered for dealing with standard text mining problems, such as indexing, automatic classification (e.g. spam filtering), part of speech identification, topic modeling, sentiment extraction, etc.

"I have taken two classes in machine learning taught by Tricia Hoffman at the Dojo. She is very knowledgeable in the area and does a great job sharing/teaching the methods with hands-on examples. The group homework is a great way to get engaged with the information taught in class and exchange information with fellow machine learners. From these classes I have been able to implement these techniques in my work at a large financial firm. I highly recommend this class."

Check out the great topics from the last Data Mining Camp (October 15, 2011)

http://www.sfbayacm.org/bootcamp/forums/

http://www.djcline.com/2011/10/20/oct-15-2011-acm-data-mining-camp/

Software as a Platform Panel Discussion October 15, 2011

The panel will discuss the benefits of moving to a software platform
distributed over a large number of processors. The risks involved in the
move, along with advice on avoiding the pitfalls, will be given. What are
the costs involved with moving to one of these platforms? The popular
platforms, along with where the market is likely to go, will be addressed.
What software developer backgrounds are companies looking to hire? Do
companies have in-house programs to develop this talent?

Suggested Questions to Cover
- Is big getting smaller? (i.e. is Moore's law allowing hardware to catch up with big data)
- Do we all need new shoes? (should we throw all legacy code onto a bonfire?)
- If not, how can we inter-operate?
- How valuable is it to "own" your own infrastructure all the way down to the bits?
- What kind of data mining analytics are the panelists doing?
- Are there any surprising applications that aren't what people usually list when they describe data mining?

Moderator
Dr. Patricia Hoffman, Research Scientist

Panelists

Ted Dunning has been involved with a number of startups, including
MusicMatch and Veoh Networks, with the latest being MapR Technologies,
where he is Chief Application Architect. He is also a PMC member for the
Apache Zookeeper and Mahout projects. Opinionated about software and
data-mining and passionate about open source, he is an active
participant in the Hadoop and related communities and loves helping projects
get going with new technologies.

Bryan Duxbury leads Rapleaf's Analysis Team, which is responsible for
maintaining and analyzing a database of over 200 billion people data
records. He is also the project chair of Apache Thrift.

Erik Andrejko is a software engineer currently working on large scale statistical climatology and related
agronomic models at Weatherbill. In the past he has built systems at
scale for collaborative filtering, search, rare event modeling and
various other related problems.

Jay Kreps is a Principal Engineer and Engineering Manager at
LinkedIn. One of LinkedIn's software platforms is called Voldemort: http://project-voldemort.com/
Voldemort is the open source data store used extensively at LinkedIn for
online queries, built to overcome the inherent scalability limitations
of a relational database.

Jimmy Retzlaff is a senior software engineer at Yelp, working on ad
targeting. Jimmy regularly gives talks about mrjob, Yelp's open source
library for doing MapReduce in Python. Before Yelp, Jimmy worked on the
Amazon Kindle for nearly 5 years and also developed a system for
generating interactive geographic visualizations of mutual fund sales
activity.

Bayesian Techniques Panel Discussion October 15, 2011

Bayesian Techniques are used to model uncertainty, for inference (explaining data), decision making (decision theory), and risk reduction (predicting the future). A huge advantage of Bayesian Techniques is the ability to use all relevant information and to unify various methods in a probabilistic framework. This panel will discuss the types of problems that are ideally suited for Bayesian Techniques. Current research and new developments will be addressed. The panel will provide guidance for managers considering using these techniques to solve their problems.

Questions to be addressed include:
How can Bayesian Techniques improve the solutions to problems? How do
Bayesian Techniques compare with other Data Mining methods? What types of
problems are Bayesian Networks ideally suited to solving? What about
Bayesian nonparametric models? What recommendations should be followed when
implementing a Bayesian Technique? In practice, how is it possible to
quantify uncertainty?

Moderator
Dr. Patricia Hoffman, Research Scientist

Panelists

John Mark Agosta, PhD, was previously
Chief Scientist at Impermium, a real-time web service for spam elimination
on social networks. Before that, he was a Research Scientist at Intel Labs,
working on opinion mining on the web and distributed detection of computer
viruses. Earlier positions include Edify Corporation (automating customer
interaction using statistical natural language and automated workflow),
Knowledge Industries (diagnostic Bayes nets), and SRI. John Mark did his
thesis work at Stanford on Bayes network models for visual recognition.

Lionel Jouffe, PhD, cofounder and CEO
of France-based Bayesia S.A.S. Lionel Jouffe holds a Ph.D. in Computer Science
and has been working in the field of Artificial Intelligence since the early
1990s. He and his team have been developing BayesiaLab since 1999, and it has
emerged as the leading software package for knowledge discovery, data mining
and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad
acceptance in academic communities as well as in business and industry. The
relevance of Bayesian networks, especially in the context of consumer research,
is highlighted by Bayesia's strategic partnership with Procter & Gamble,
which has deployed BayesiaLab globally since 2007.

David Draper, PhD, Professor at the University of California
at Santa Cruz. Past President of the
International Society for Bayesian Analysis. Authored more than 100 articles in journals of the American Statistical
Association and the Royal Statistical Society, Bayesian Analysis, the New
England Journal of Medicine, and the Journal of the American Medical
Association. His seminal article has
been cited more than 800 times.

Expert Panel Discussion November 13, 2010

The panel was just as lively as last year’s discussion. Follow the full video with all the Q/A to see what was interesting to the experts and students in attendance.

Moderator
Dr. Patricia Hoffman, Research Scientist

Panelists and their signature questions

Dr. Neel Sundaresan, Sr. Director and Head, eBay Research Labs at eBay
How does data mining apply to social and incentive networks?

Mr. Dean Abbott, Chief Scientist and Co-Founder at SmarterRemarketer, LLC
What tools are available to the data miner today?

Dr. Mike Bowles, Research Scientist and Start-up Executive
How has the field of stock market prediction changed over the past two years?

Dr. Hans Dolfing, Pattern Recognition Manager at Apple
What are the current challenges in speech and handwriting recognition?

Dr. Susan Holmes, Stanford Professor, Statistics Department
What recommendations do you have for people starting out in data mining today?

Dr. Omid Madani, Senior Computer Scientist at SRI International
How has data mining changed as the scale of data has increased so dramatically in the last few years?

Dr. Lionel Jouffe, President/CEO at BAYESIA
Can you give us a brief example where a Bayesian network did a great job in diagnostics? [poor audio, the answer to this question not included in video]