3.
Which of these features is “best” to tie together the data?
How do we label groupings in a meaningful manner?
How many groups/how to arrange/visualize them?
Are Descriptions, Keywords representative of the content?
PyGotham 2017 New York City @NoemiDerzsy

18.
Are features pulled from text (such as the title and description fields)
and/or human-supplied keywords
descriptive of the content?
Topic Modeling…

19.
What is Topic Modeling?
An efficient way to make sense of large volumes of text.
Identify topics within a text corpus.
Categorize documents into topics.
Associate words with topics.
Who uses it?
Search engines, marketing, etc.

20.
Latent Dirichlet Allocation (LDA)
• several techniques exist, but LDA is the most common
• Bayesian inference model that associates each document with a probability distribution over topics
• topics are probability distributions over words (probability of the word being generated from that topic for that document)
• clusters words into topics
• clusters documents into mixtures of topics
• scales well with a growing corpus
• before running the LDA algorithm, we have to specify the number of topics: how do we choose the optimal number of topics beforehand?
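The bullets above can be made concrete with a small sketch. The slides do not say how LDA was fitted, so this is only an illustration under stated assumptions: it uses collapsed Gibbs sampling (one common inference method for LDA) written in plain Python, and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all hypothetical. It returns a topic distribution per document and a word distribution per topic, and the number of topics has to be passed in up front.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: topic counts per document, word counts per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]  # random initial topics
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment of this word occurrence.
                z = assignments[d][n]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # theta: per-document distribution over topics.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # phi: per-topic distribution over words.
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    return theta, phi

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["moon", "lunar", "orbit", "moon"],
    ["lunar", "rover", "moon"],
    ["budget", "cost", "contract"],
    ["cost", "budget", "audit", "budget"],
]
theta, phi = lda_gibbs(toy_docs, num_topics=2)
for k, dist in enumerate(phi):
    top_words = sorted(dist, key=dist.get, reverse=True)[:3]
    print("topic", k, top_words)
```

For a real corpus, an established implementation (for example in gensim or scikit-learn) would be the more likely choice; the sketch only shows what "documents as mixtures of topics, topics as distributions over words" means in terms of counts.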

21.
Topic Model Evaluation: Topic Coherence
Q: How to select the top topics?
A: Calculate the UMass topic coherence for each topic. Algorithm from Mimno, Wallach, Talley, Leenders, McCallum: Optimizing Semantic Coherence in Topic Models, EMNLP 2011.
Coherence is the sum of pairwise scores on the words used to describe the topic:
$$\mathrm{Coherence} = \sum_{i<j} \mathrm{score}(w_i, w_j)$$
$$\mathrm{score}_{\mathrm{UMass}}(w_i, w_j) = \log \frac{D(w_i, w_j) + 1}{D(w_i)}$$
where D(w_i) is the count of documents containing the word w_i, D(w_i, w_j) is the count of documents containing both words w_i and w_j, and D is the total number of documents in the corpus.
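As a minimal sketch of the UMass score described above, the following plain-Python function counts document frequencies and sums the pairwise log ratios over the words describing one topic. The function name `umass_coherence` and the toy documents are hypothetical, and the sketch assumes every topic word occurs in at least one document (otherwise D(w_i) would be zero).

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """Sum of pairwise UMass scores for the words describing one topic."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # D(w_i) or D(w_i, w_j): documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    total = 0.0
    # combinations() yields each pair (w_i, w_j) with i < j exactly once.
    for w_i, w_j in combinations(topic_words, 2):
        total += math.log((doc_count(w_i, w_j) + 1) / doc_count(w_i))
    return total

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["moon", "orbit", "lunar"],
    ["moon", "lunar", "rover"],
    ["budget", "cost", "moon"],
    ["budget", "cost", "contract"],
]
coherent = umass_coherence(["moon", "lunar"], toy_docs)
incoherent = umass_coherence(["moon", "contract"], toy_docs)
print(coherent, incoherent)
```

Words that co-occur often give a score near zero, while words that rarely share a document give a strongly negative score, so topics can be ranked by this value.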

22.
openNASA Topics of Highest Coherence

23.
Keywords for Topics
• selected keywords with their most frequently occurring terms

24.
Other Clustering Method: K-Means
• using TF-IDF, the document vectors are put through a K-Means clustering algorithm which
computes the Euclidean distances amongst these documents and clusters nearby documents
together
• the algorithm generates cluster tags, known as cluster centers, which represent the documents
within these clusters
• K-means distance:
• Euclidean
• Cosine
• Fuzzy
• Accuracy comparison:
• silhouette analysis can be used to study the separation distance between the resulting
clusters; can be used to determine the optimal number of clusters (silhouette score)
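The K-Means and silhouette bullets above can be sketched in plain Python. This is an illustration under assumptions rather than the pipeline used in the talk: the TF-IDF weighting (`log(N/df) + 1`, L2-normalized), the random initialization, and the names `tfidf_vectors`, `kmeans`, and `silhouette_score` as defined here are choices made for the example, and the toy documents are hypothetical.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors (one common weighting variant)."""
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(len(docs) / df[w]) + 1.0 for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-Means: assign to nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: euclidean(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def silhouette_score(vectors, labels):
    """Mean silhouette: (b - a) / max(a, b) averaged over all points."""
    scores = []
    for i, v in enumerate(vectors):
        dists = {}
        for j, u in enumerate(vectors):
            if i != j:
                dists.setdefault(labels[j], []).append(euclidean(v, u))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                             # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other cluster
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["moon", "lunar", "orbit"],
    ["moon", "lunar", "rover"],
    ["budget", "cost", "contract"],
    ["budget", "cost", "audit"],
]
vectors = tfidf_vectors(toy_docs)
for k in (2, 3):
    labels, _ = kmeans(vectors, k)
    print(k, labels, round(silhouette_score(vectors, labels), 3))
```

Because the vectors are normalized to unit length, Euclidean distance ranks pairs of documents in the same order as cosine distance, which is one reason the two options listed above often behave similarly on TF-IDF data. Comparing the silhouette score across several values of k is the selection procedure the last bullet describes.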

29.
pyNASA and pyOpenGov Libraries
• Python libraries that load an entire open NASA or other government metadata collection at once
pyNASA
https://github.com/bmtgoncalves/pyNASA
pyOpenGov
https://github.com/nderzsy/pyOpenGov
How to install:
>> pip install pyNASA
>> pip install pyOpenGov