Text Analytics using a pre-built Wikipedia-based Topic Model

In my previous post, Explicit Semantic Analysis (ESA) for Text Analytics, we explored the basics of the ESA algorithm and how to use it in Oracle R Enterprise (ORE) to build a model from scratch and score new text with that model. While creating your own domain-specific model may be necessary in some situations, in others you may benefit from a pre-built model based on millions of Wikipedia articles reduced to 200,000 topics. This model is downloadable here, with details of how to install it here.

Installing the model

The installation link provided above describes other prerequisites, such as a directory object and tablespaces. Once these are in place and you load the model using impdp, you should see something like the following:
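The exact import command depends on your environment; the following is only an illustrative sketch, in which the dump file name, directory object, credentials, and schema names are all assumptions rather than values from the installation instructions:

```shell
# Illustrative only: import the pre-built ESA model dump into the database.
# Substitute the dump file, directory object, credentials, and schemas
# from the model's installation instructions.
impdp system/<password> \
  directory=DATA_PUMP_DIR \
  dumpfile=wiki_model.dmp \
  remap_schema=<source_schema>:<target_schema>
```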

Once the model is installed in the database, we need an ORE object as a proxy for the in-database ESA model. You may know that models created using ORE's ore.odm* functions have underlying in-database first-class objects that are accessed by ORE proxy objects. Since the WIKI_MODEL is also an in-database model, it too needs to be accessed using a proxy object. To enable this, execute the following R function, which constructs a model object with class ore.odmESA.
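The constructor function itself is not reproduced here; as a minimal sketch, such a function might simply wrap the in-database model name in an object of class ore.odmESA. The helper name makeESAModelObject and the object's internal structure below are assumptions for illustration, not the post's actual code:

```r
# Hedged sketch: wrap an existing in-database ESA model in an ORE proxy object.
# The helper name and the internal slot are illustrative assumptions.
makeESAModelObject <- function(model.name) {
  obj <- list(name = model.name)
  class(obj) <- "ore.odmESA"
  obj
}

WikiModel <- makeESAModelObject("WIKI_MODEL")
```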

Like other ORE models, unless the proxy object is saved in an ORE datastore, the model object will be removed and any underlying in-database model will be dropped when the database connection terminates. To avoid having to reinstall the WIKI_MODEL, we save the proxy object in a datastore, here also named WIKI_MODEL; however, any datastore name could be used. The next time we connect to the database using ORE and want to use the WIKI_MODEL, we simply load the proxy object from the datastore using ore.load.
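The save-and-restore cycle can be sketched as follows, using the standard ORE datastore functions ore.save and ore.load with the datastore name from the text:

```r
# Save the proxy object so the in-database WIKI_MODEL survives disconnect
ore.save(WikiModel, name = "WIKI_MODEL")

# In a later session, after ore.connect(...):
ore.load("WIKI_MODEL")   # restores WikiModel into the R session
```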

View model metadata

Now we're ready to use the model. First, let's explore the model metadata. Here we see the results of the functions class, summary, settings, and features as invoked on the proxy object. Note that the list of features starts with the odd-looking text '!!!'. It turns out this corresponds to a dance-punk band with a Wikipedia entry, and so is a valid topic.
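The metadata calls referenced above can be sketched as follows (output omitted; exact results depend on the installed model):

```r
class(WikiModel)              # expected: "ore.odmESA"
summary(WikiModel)            # model settings and attribute summary
settings(WikiModel)           # in-database model settings
head(features(WikiModel))     # first topics, beginning with the band '!!!'
```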

Using the same test data as the previous blog post, we can get topic assignments for these titles, i.e., score them, using the WikiModel object and the predict function. There is a caveat, however: in the previous example, the column containing the text was called "TITLE", whereas WIKI_MODEL requires the input column to be called "TEXT", because the model was built using input data with a column named "TEXT".

The error in the first predict execution below occurs because the ESA_TEXT ore.frame does not have the required column "TEXT". We can easily address this by renaming "TITLE" to "TEXT". The second execution then returns the topics extracted using the pre-built model.
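The rename and rescoring can be sketched as follows, assuming the test data lives in an ore.frame named ESA_TEXT with a single column TITLE, as in the previous post:

```r
# predict(WikiModel, ESA_TEXT, ...) fails: the model expects a column "TEXT"
names(ESA_TEXT) <- "TEXT"     # rename TITLE to TEXT to match the model's input

# Score the titles against the 200,000 Wikipedia-derived topics
pred <- predict(WikiModel, ESA_TEXT, supplemental.cols = "TEXT")
head(pred)
```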

Next, we look at the feature_compare function, which returns an ore.frame containing the similarity measures resulting from the cross-product of the terms or documents provided in the named column, in this case "TEXT", passed to the argument compare.cols. First, we create an ore.frame containing the two terms we want to compare. If more than two terms or documents are provided in the TEXT column, the supplemental.cols argument is essential for knowing which terms are being compared. This can be based on, e.g., an ID column or the actual terms.
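A sketch of the term-level comparison follows; the two terms chosen here are illustrative, not those from the original post:

```r
# Compare two single terms; the ID column identifies each pair in the result
text <- c("street", "avenue")           # illustrative terms
df   <- data.frame(ID = seq(length(text)), TEXT = text)
TERMS <- ore.push(df)
res <- feature_compare(WikiModel, TERMS,
                       compare.cols = "TEXT", supplemental.cols = "ID")
res[res[[1]] < res[[2]], ]              # keep each pair once
```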

In the following examples, we compare larger text documents. Notice that the two documents in the first result are more similar than those in the second, highlighting the model's ability to distinguish between similar and dissimilar documents.

> text <- c('Senior members of the Saudi royal family paid at least $560
+ million to Osama bin Laden's terror group and the Taliban for
+ an agreement his forces would not attack targets in Saudi
+ Arabia, according to court documents. The papers, filed in
+ a $US3000 billion ($5500 billion) lawsuit in the US, allege
+ the deal was made after two secret meetings between Saudi
+ royals and leaders of al-Qa'ida, including bin Laden. The
+ money enabled al-Qa'ida to fund training camps in Afghanistan
+ later attended by the September 11 hijackers. The disclosures
+ will increase tensions between the US and Saudi Arabia.',
+ 'The Saudi Interior Ministry on Sunday confirmed it is holding
+ a 21-year-old Saudi man the FBI is seeking for alleged links to
+ the Sept. 11 hijackers. Authorities are interrogating Saud
+ Abdulaziz Saud al-Rasheed "and if it is proven that he was
+ connected to terrorism, he will be referred to the sharia
+ (Islamic) court," the official Saudi Press Agency quoted an
+ unidentified ministry official as saying.')
> df <- data.frame(ID = seq(length(text)), TEXT = text)
> DOCS <- ore.push(df)
> res <- feature_compare(WikiModel, DOCS, compare.cols = "TEXT", supplemental.cols = "ID")
> res[res[[1]] < res[[2]],]
  ID_A ID_B SIMILARITY
1    1    2    0.58258
>
> text <- c('Senior members of the Saudi royal family paid at least $560
+ million to Osama bin Laden's terror group and the Taliban for
+ an agreement his forces would not attack targets in Saudi
+ Arabia, according to court documents. The papers, filed in
+ a $US3000 billion ($5500 billion) lawsuit in the US, allege
+ the deal was made after two secret meetings between Saudi
+ royals and leaders of al-Qa'ida, including bin Laden. The
+ money enabled al-Qa'ida to fund training camps in Afghanistan
+ later attended by the September 11 hijackers. The disclosures
+ will increase tensions between the US and Saudi Arabia.',
+ 'Russia defended itself against U.S. criticism of its economic
+ ties with countries like Iraq, saying attempts to mix business
+ and ideology were misguided. "Mixing ideology with economic
+ ties, which was characteristic of the Cold War that Russia
+ and the United States worked to end, is a thing of the past,"
+ Russian Foreign Ministry spokesman Boris Malakhov said Saturday,
+ reacting to U.S. Defense Secretary Donald Rumsfeld's statement
+ that Moscow's economic relationships with such countries send a
+ negative signal.')
>
> df <- data.frame(ID = seq(length(text)), TEXT = text)
> DOCS <- ore.push(df)
> res <- feature_compare(WikiModel, DOCS, compare.cols = "TEXT", supplemental.cols = "ID")
> res[res[[1]] < res[[2]],]
  ID_A ID_B SIMILARITY
1    1    2 0.09541735

In this post, we've explored several key aspects of using the pre-built Wikipedia-based ESA model: installing it in the database, creating an ORE proxy object and saving it in a datastore, viewing model metadata, scoring new text with predict, and comparing terms and documents with feature_compare.