4.
Machine Learning Vocabulary
• Feature: A number that represents something about a data point
• Label: A feature of the data we want to predict
• Document: A block of text with a unique ID
• Model: A learned set of parameters that can be used for prediction
• Corpus: A collection of documents
The basic vocabulary of machine learning includes Feature, Label, Document, Model, and Corpus.

5.
What is Apache Spark?
• A library that defines a Resilient Distributed Dataset (RDD) type and a set of transformations
• RDDs are only representations of calculations
• A runtime that can execute RDDs in a distributed manner
• A master process that schedules and monitors executors
• Executors actually do the calculations and can keep results in their memory
• Spark SQL, MLlib, and GraphX define special types of RDDs
Spark is a general-purpose distributed processing platform with components for SQL, machine learning, and graph processing.
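
The point that an RDD only represents a calculation can be illustrated with a minimal sketch. It assumes a local master (`local[2]`) and made-up numbers, neither of which comes from the slides; on a real cluster only the master URL would differ.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local[2] runs the driver and two worker threads in one JVM.
    val conf = new SparkConf().setAppName("lazy-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Transformations only describe a calculation; nothing is executed yet.
      val numbers = sc.parallelize(1 to 10)
      val doubledEvens = numbers.filter(_ % 2 == 0).map(_ * 2)

      // cache() asks the executors to keep the computed partitions in memory.
      doubledEvens.cache()

      // An action triggers scheduling: tasks are sent to the executors.
      val result = doubledEvens.collect()
      println(result.mkString(", "))

      // A second action can reuse the cached partitions instead of recomputing.
      println(doubledEvens.count())
    } finally {
      sc.stop()
    }
  }
}
```

Until `collect()` is called, `doubledEvens` is just a description of the `filter` and `map` steps.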

7.
Spark’s Text-Mining Tools
• LDA for topic extraction
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word counts
• HashingTF and IDF together score how important each word is to a document with respect to the corpus
• And much more
Spark's text-mining tools include LDA, CountVectorizer, and HashingTF-IDF, among others.
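
As a rough sketch of how two of these tools fit together, the following uses the DataFrame-based `spark.ml` API with a tiny made-up English corpus. The corpus, column names, `setNumFeatures(1024)`, and the local master are all assumptions for illustration. Japanese text would first need word segmentation, because `Tokenizer` only splits on whitespace.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TextFeaturesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("text-features-sketch")
      .master("local[2]") // assumption: single-machine demo
      .getOrCreate()
    import spark.implicits._

    try {
      // Hypothetical corpus: each row is a document with a unique ID.
      val corpus = Seq(
        (0L, "spark runs jobs on a cluster"),
        (1L, "the cluster keeps data in memory"),
        (2L, "sushi is rice with fish")
      ).toDF("id", "text")

      // Split each document into lower-cased, whitespace-separated words.
      val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(corpus)

      // CountVectorizer learns a vocabulary, then emits word-count vectors.
      val countModel = new CountVectorizer().setInputCol("words").setOutputCol("counts").fit(words)
      countModel.transform(words).select("id", "counts").show(false)

      // HashingTF hashes words into a fixed-size term-frequency vector;
      // IDF then down-weights words that appear in many documents.
      val hashed = new HashingTF()
        .setInputCol("words").setOutputCol("tf").setNumFeatures(1024)
        .transform(words)
      val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(hashed)
      idfModel.transform(hashed).select("id", "tfidf").show(false)
    } finally {
      spark.stop()
    }
  }
}
```

`CountVectorizer` keeps an explicit vocabulary, so vector indices can be mapped back to words. `HashingTF` skips the vocabulary pass but can suffer hash collisions.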

12.
Word Segmentation
• Since Japanese lacks spaces between words, segmentation is hard even in theory
• A probabilistic approach is necessary
• Thankfully there are libraries that can help
Japanese has no word delimiters, so extracting words requires a probabilistic approach; libraries make this efficient.
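
To make the probabilistic approach concrete, here is a toy unigram segmenter in plain Scala. It picks the split whose words have the highest combined probability, using dynamic programming. The dictionary, its probabilities, and the unknown-character penalty are invented for illustration. A real project would rely on a morphological analysis library such as Kuromoji or MeCab (neither is named in the slides).

```scala
object UnigramSegmenterSketch {
  // Toy unigram probabilities, invented for illustration only.
  val wordProb: Map[String, Double] = Map(
    "東京" -> 0.05, "京都" -> 0.04, "東京都" -> 0.03,
    "東" -> 0.01, "京" -> 0.01, "都" -> 0.02,
    "に" -> 0.10, "住む" -> 0.02
  )
  val maxWordLength: Int = wordProb.keys.map(_.length).max
  // Assumed penalty for a single character that is not in the dictionary.
  val unknownLogProb: Double = math.log(1e-8)

  def segment(text: String): List[String] = {
    val n = text.length
    // best(i): highest log-probability of any segmentation of the first i characters.
    val best = Array.fill(n + 1)(Double.NegativeInfinity)
    // back(i): start index of the last word in that best segmentation.
    val back = Array.fill(n + 1)(0)
    best(0) = 0.0

    for (end <- 1 to n; start <- math.max(0, end - maxWordLength) until end) {
      val candidate = text.substring(start, end)
      val logProb: Option[Double] = wordProb.get(candidate).map(p => math.log(p))
        .orElse(if (candidate.length == 1) Some(unknownLogProb) else None)
      logProb.foreach { lp =>
        if (best(start) + lp > best(end)) {
          best(end) = best(start) + lp
          back(end) = start
        }
      }
    }

    // Walk the back-pointers from the end of the text to recover the words.
    var words = List.empty[String]
    var end = n
    while (end > 0) {
      words = text.substring(back(end), end) :: words
      end = back(end)
    }
    words
  }

  def main(args: Array[String]): Unit =
    println(segment("東京都に住む").mkString(" | "))
}
```

With this toy dictionary, "東京都に住む" is split into 東京都, に, 住む, because that split has a higher combined probability than alternatives such as 東京 + 都 or 東 + 京都.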

29.
Using the LDA model
• Prediction requires a LocalLDAModel
• Use .toLocal if isInstanceOf[DistributedLDAModel]
• Convert the new document to a Vector using the same steps as for training
• Be sure to filter out words not in the vocabulary
• Call topicDistributions to see topic scores
The LDA model is used to predict topics.
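
The steps above can be sketched with the RDD-based `spark.mllib` API. This is a minimal sketch, not the presenter's code: the vocabulary, training documents, `k = 2`, the seed, and the local master are all assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object LdaPredictionSketch {
  // Hypothetical vocabulary; in practice it comes from the training pipeline.
  val vocabulary: Map[String, Int] =
    Seq("spark", "data", "cluster", "sushi", "rice", "fish").zipWithIndex.toMap

  // Same conversion for training and prediction; unknown words are filtered out.
  def toVector(words: Seq[String]): Vector = {
    val counts = words.filter(vocabulary.contains)
      .groupBy(word => vocabulary(word))
      .map { case (index, occurrences) => (index, occurrences.size.toDouble) }
      .toSeq
    Vectors.sparse(vocabulary.size, counts)
  }

  // Prediction needs a LocalLDAModel; the default EM optimizer returns a distributed one.
  def asLocal(model: LDAModel): LocalLDAModel = model match {
    case distributed: DistributedLDAModel => distributed.toLocal
    case local: LocalLDAModel => local
    case other => throw new IllegalArgumentException(s"Unsupported model: ${other.getClass}")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lda-prediction-sketch").setMaster("local[2]"))
    try {
      val trainingDocs = Seq(
        Seq("spark", "data", "cluster", "data"),
        Seq("cluster", "spark", "data"),
        Seq("sushi", "rice", "fish"),
        Seq("fish", "rice", "rice", "sushi")
      )
      val training: RDD[(Long, Vector)] = sc.parallelize(
        trainingDocs.zipWithIndex.map { case (words, id) => (id.toLong, toVector(words)) })

      val trained: LDAModel = new LDA().setK(2).setMaxIterations(20).setSeed(42L).run(training)
      val local = asLocal(trained)

      // "tempura" is not in the vocabulary, so toVector drops it.
      val newDocs = sc.parallelize(Seq((100L, toVector(Seq("fish", "rice", "tempura")))))
      local.topicDistributions(newDocs).collect().foreach { case (id, topics) =>
        println(s"doc $id -> $topics")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Each output vector holds one score per topic. The topic numbering is arbitrary and depends on the training run.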

31.
Now what?
• Find the minimum logLikelihood in a set of documents you know are OK
• Report an anomaly whenever a new document has a lower logLikelihood
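Compute the minimum log-likelihood over a set of documents whose topics were predicted correctly; classify a new document as an anomaly if it falls below that value.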
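
The thresholding rule itself is simple enough to sketch in plain Scala. The scores below are made-up numbers, and the function names are hypothetical. In MLlib, `LocalLDAModel.logLikelihood` takes an RDD of documents and returns a single value, so getting one score per document would presumably mean calling it on one-document RDDs.

```scala
object LogLikelihoodThresholdSketch {
  // The lowest score among documents known to be OK becomes the cut-off.
  def threshold(knownGoodScores: Seq[Double]): Double = {
    require(knownGoodScores.nonEmpty, "need at least one known-good document")
    knownGoodScores.min
  }

  // A new document is an anomaly when it scores below every known-good document.
  def isAnomaly(score: Double, cutoff: Double): Boolean = score < cutoff

  def main(args: Array[String]): Unit = {
    // Made-up log-likelihood values, for illustration only.
    val knownGood = Seq(-120.5, -98.2, -143.7, -110.0)
    val cutoff = threshold(knownGood)

    val incoming = Seq("doc-a" -> -101.3, "doc-b" -> -250.9)
    incoming.foreach { case (id, score) =>
      val verdict = if (isAnomaly(score, cutoff)) "anomaly" else "ok"
      println(s"$id: $score ($verdict)")
    }
  }
}
```

Here the cut-off is -143.7, so `doc-b` is reported as an anomaly and `doc-a` is not.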

41.
Embedding with Vector Concatenation
• Calculate the sum of the vectors of the words in the description
• Add it to the vectors from Word2VecModel.getVectors under a special keyword (e.g. ITEM_1234)
• Create a new Word2VecModel using the constructor
• Note: not state of the art, but can produce reasonable recommendations without user rating data
Embedding by vector concatenation: for each item, sum the vectors of the words it contains.
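
The vector arithmetic can be sketched without Spark, using the same `Map[String, Array[Float]]` shape that `Word2VecModel.getVectors` returns in `spark.mllib`. The helper names, item IDs, and tiny two-dimensional vectors are hypothetical. In a real job, the resulting map would be passed to the `Word2VecModel` constructor.

```scala
object ItemEmbeddingSketch {
  type WordVectors = Map[String, Array[Float]]

  // Sum the vectors of the description words the model knows; unknown words are skipped.
  def sumOfWordVectors(words: Seq[String], wordVectors: WordVectors, dim: Int): Array[Float] = {
    val total = new Array[Float](dim)
    for (word <- words; vector <- wordVectors.get(word); i <- 0 until dim) {
      total(i) += vector(i)
    }
    total
  }

  // Add one "ITEM_<id>" entry per item next to the ordinary word vectors.
  def withItems(wordVectors: WordVectors, descriptions: Map[String, Seq[String]]): WordVectors = {
    require(wordVectors.nonEmpty, "need at least one word vector")
    val dim = wordVectors.head._2.length
    val itemVectors = descriptions.map { case (itemId, words) =>
      s"ITEM_$itemId" -> sumOfWordVectors(words, wordVectors, dim)
    }
    wordVectors ++ itemVectors
  }

  def main(args: Array[String]): Unit = {
    // Made-up 2-dimensional vectors; real ones come from Word2VecModel.getVectors.
    val wordVectors: WordVectors = Map(
      "red" -> Array(1.0f, 0.0f),
      "shoe" -> Array(0.0f, 2.0f)
    )
    val extended = withItems(wordVectors, Map("1234" -> Seq("red", "shoe", "unknown")))
    println(extended("ITEM_1234").mkString("[", ", ", "]"))
    // With Spark on the classpath: new Word2VecModel(extended).findSynonyms("ITEM_1234", 5)
  }
}
```

Because the item keys live in the same vector space as ordinary words, a synonym lookup on an `ITEM_` key can return both similar items and related words. Callers would filter the results by the `ITEM_` prefix to get recommendations.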