Summary statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Test data (used in the sketch below):
1 2 3
10 20 30
100 200 300
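
A minimal Scala sketch for spark-shell (where sc is already available), using the test data above; the variable names are illustrative.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// Build an RDD[Vector] from the three test rows
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(10.0, 20.0, 30.0),
  Vectors.dense(100.0, 200.0, 300.0)))

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)        // column-wise mean: [37.0, 74.0, 111.0]
println(summary.variance)    // column-wise variance
println(summary.max)         // column-wise max: [100.0, 200.0, 300.0]
println(summary.numNonzeros) // nonzeros per column: [3.0, 3.0, 3.0]
println(summary.count)       // total count: 3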

Stratified sampling
Stratified sampling methods, sampleByKey and sampleByKeyExact, can be performed on RDDs of key-value pairs.

The sampleByKey method flips a coin to decide whether each observation is sampled, so it requires only one pass over the data and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but provides the exact sample size with 99.99% confidence.
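
A minimal Scala sketch for spark-shell, with made-up key-value data and per-stratum fractions; both methods live on RDDs of pairs.

// An RDD of (stratum, value) pairs
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)))

// Desired sampling fraction per stratum (key)
val fractions = Map("a" -> 0.5, "b" -> 0.5)

// Approximate sample: one pass over the data, expected sample size per stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions)

// Exact sample: more expensive, but the exact per-stratum sample size with high confidence
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)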

The Problem
I am running the Latent Semantic Analysis (LSA) Wikipedia example from the book Advanced Analytics with Spark in the Spark 1.2 spark-shell.cmd. It depends on the Stanford NLP libraries, so I need to add the Stanford NLP jars to the Scala REPL session - I don't want to add these jars to Spark's spark-shell.cmd. We can use :cp to add a single jar to the current Scala shell session, but since there are multiple jars (actually 7) in the stanford-corenlp-full-2014-10-31 folder, I don't want to add them one by one.

The Solution
import java.io.File

for (file <- new File("E:/jeffery/src/textmining/standfordnlp/stanford-corenlp-full-2014-10-31").listFiles
       .filter(f => f.getName().endsWith(".jar")
         && !f.getName().contains("-sources")
         && !f.getName().contains("-src")
         && !f.getName().contains("-javadoc"))) {
  println(":cp " + file)
}

This prints a :cp command for every stanford-corenlp jar except the source and javadoc jars:

:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\ejml-0.23.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\javax.json.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\joda-time.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\jollyday.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\stanford-corenlp-3.5.0-models.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\stanford-corenlp-3.5.0.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\xom.jar

Then just copy the output and paste it into the Scala shell, and Scala will add these jars to the current shell classpath. Happy hacking.

The Problem
I downloaded Spark 1.2 from GitHub and tried to build it by running sbt assembly. It always failed with the error:

[error] Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader
[error] Use 'last' for the full log.
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Retry didn't work, even though I could reach https://github.com/ScrapCodes/sbt-pom-reader.git and git clone it manually. I'm not sure why it failed.

The Solution
To fix this, I opened a new cmd terminal and ran the following commands to create the staging folder and git clone into the destination folder:

mkdir C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader
git clone https://github.com/ScrapCodes/sbt-pom-reader.git C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader

Then I typed r to retry. As sbt-pom-reader was already there, sbt happily picked it up, and after several minutes Spark built successfully. Happy hacking.

The Goal
In a previous post we introduced how to run Stanford NER (Named Entity Recognition) in UIMA; now we are integrating Stanford Sentiment Analysis into UIMA.

StanfordNLPAnnotator
Feature Structure: org.apache.uima.stanfordnlp.input:action
We use StanfordNLPAnnotator as the gateway or facade: the client uses org.apache.uima.stanfordnlp.input:action to specify what to extract - action=ner to run named entity extraction, or action=sentiment to run sentiment analysis.
The feature org.apache.uima.stanfordnlp.output:type specifies the sentiment of the whole article: very negative, negative, neutral, positive or very positive.
The configuration parameter SentiwordnetFile specifies the path of the SentiWordNet file.

How it Works
First, it ignores sentences that don't contain an opinionated word; it uses SentiWordNet to check whether a sentence contains a non-neutral adjective. Then it calls the Stanford NLP Sentiment Analysis tool to process the text.
Stanford NLP Sentiment Analysis has two model files: edu/stanford/nlp/models/sentiment/sentiment.ser.gz, which maps sentiment to 5 classes (very negative, negative, neutral, positive or very positive), and edu/stanford/nlp/models/sentiment/sentiment.binary.ser.gz, which maps sentiment to 2 classes (negative or positive).
We use edu/stanford/nlp/models/sentiment/sentiment.ser.gz, but it sometimes seems inclined to mistakenly map non-negative text to negative. For example, it maps the following sentence to negative, while the binary model correctly maps it to positive:
I was able to stream video and surf the internet for well over 7 hours without any hiccups.
To fix this, when the 5-class model (sentiment.ser.gz) maps a sentence to negative, we run the binary model to recheck it; if the binary model agrees (also reports negative), there is no change, otherwise we change it to positive. We calculate the score of every sentence and map the average score to the 5 classes, giving negative sentences a smaller value since we don't trust them as much. A sketch of this recheck-and-average logic is shown below.
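
A minimal Scala sketch of the recheck-and-average logic described above. The fiveClass and binary functions are assumptions that stand in for calls into the two Stanford sentiment models; the 0-4 score scale follows Stanford's 5-class convention, and the NegativeWeight value is illustrative.

object SentimentRecheck {
  // Hypothetical per-sentence classifiers backed by sentiment.ser.gz and
  // sentiment.binary.ser.gz; 5-class: 0..4, binary: 0 = negative, 1 = positive.
  type FiveClass = String => Int
  type Binary = String => Int

  // Down-weight negative sentences, since we trust the negative label less.
  val NegativeWeight = 0.5

  def sentenceScore(sentence: String, fiveClass: FiveClass, binary: Binary): Double = {
    val score = fiveClass(sentence)
    if (score < 2) {                                       // 5-class model says (very) negative
      if (binary(sentence) == 0) score * NegativeWeight    // binary agrees: keep, but down-weight
      else 3.0                                             // binary disagrees: treat as positive
    } else score.toDouble
  }

  // Average the per-sentence scores and map the mean back to one of the 5 classes.
  def articleSentiment(sentences: Seq[String], fiveClass: FiveClass, binary: Binary): String = {
    require(sentences.nonEmpty, "need at least one sentence")
    val labels = Array("very negative", "negative", "neutral", "positive", "very positive")
    val avg = sentences.map(sentenceScore(_, fiveClass, binary)).sum / sentences.size
    labels(math.min(4, math.round(avg).toInt))
  }
}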

To improve our text analytics project, after integrating OpenNLP with UIMA, we are now integrating Stanford NLP NER (Named Entity Recognition) into UIMA.

StanfordNLPAnnotator

Feature Structure: org.apache.uima.stanfordnlp.input:action

We use StanfordNLPAnnotator as the gateway or facade: the client uses org.apache.uima.stanfordnlp.input:action to specify what to extract - action=ner to run named entity extraction, or action=sentiment to run sentiment analysis.

We use a dynamic output entity, org.apache.uima.stanfordnlp.output; its type feature specifies whether the entity is a person, an organization, etc.

Here we are using sujitpal's UimaUtils.java: it adds the feature org.apache.uima.stanfordnlp.input:action=ner to the CAS, sends the CAS to the UIMA server, and then checks the org.apache.uima.stanfordnlp.output feature structures in the response. A sketch of this client-side flow is shown below.
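
A minimal Scala sketch of the client side, assuming the type system described in this post (org.apache.uima.stanfordnlp.input with feature action, and org.apache.uima.stanfordnlp.output with feature type). The sendToUimaServer parameter is a placeholder for however the CAS actually reaches the analysis engine (sujitpal's UimaUtils in our case), not a real API.

import org.apache.uima.cas.CAS

object StanfordNerClient {
  def annotateNer(cas: CAS, text: String, sendToUimaServer: CAS => CAS): Unit = {
    cas.setDocumentText(text)

    // Tell StanfordNLPAnnotator which analysis to run: action=ner
    val inputType = cas.getTypeSystem.getType("org.apache.uima.stanfordnlp.input")
    val actionFeature = inputType.getFeatureByBaseName("action")
    val input = cas.createFS(inputType)
    input.setStringValue(actionFeature, "ner")
    cas.addFsToIndexes(input)

    // Run the pipeline, then read the output annotations from the response CAS
    val result = sendToUimaServer(cas)
    val outputType = result.getTypeSystem.getType("org.apache.uima.stanfordnlp.output")
    val typeFeature = outputType.getFeatureByBaseName("type")
    val it = result.getAnnotationIndex(outputType).iterator()
    while (it.hasNext) {
      val ann = it.next()
      println(ann.getStringValue(typeFeature) + ": " + ann.getCoveredText)
    }
  }
}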

The Goal
In my latest project, I need to develop a GAE Java application to crawl a Blogger site and save the index into Lucene on GAE. This post will introduce how to deploy lucene-appengine, use google-http-java-client to parse sitemap.xml to get all posts, crawl each post and save the index to lucene-appengine on GAE, and then use a GAE cron task to index new posts periodically.

Then download the lucene-appengine-examples source code, copy the needed dependencies from its pom.xml, and add google-http-client, google-http-client-appengine and google-http-client-xml to pom.xml.

Using google-http-java-client to Parse sitemap.xml
The google-http-java-client library allows us to easily convert an XML response into a Java object via com.google.api.client.http.HttpResponse.parseAs(SomeClass.class); all we need to do is define the Java class. Check the blog's sitemap.xml, for example the lifelongprogrammer sitemap.xml. A sketch of fetching and parsing it is shown below.
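
A minimal Scala sketch, assuming the standard sitemap namespace (http://www.sitemaps.org/schemas/sitemap/0.9); SitemapUrl and UrlSet are classes we define ourselves with @Key-annotated fields, they are not part of the library.

import com.google.api.client.http.GenericUrl
import com.google.api.client.http.javanet.NetHttpTransport
import com.google.api.client.util.Key
import com.google.api.client.xml.{XmlNamespaceDictionary, XmlObjectParser}
import scala.annotation.meta.field
import scala.collection.JavaConverters._

// Data classes mirroring <urlset><url><loc>...</loc><lastmod>...</lastmod></url></urlset>
class SitemapUrl {
  @(Key @field) var loc: String = _
  @(Key @field) var lastmod: String = _
}
class UrlSet {
  @(Key @field) var url: java.util.List[SitemapUrl] = _
}

object SitemapParser {
  private val namespaces =
    new XmlNamespaceDictionary().set("", "http://www.sitemaps.org/schemas/sitemap/0.9")

  def fetchPostUrls(sitemapUrl: String): Seq[String] = {
    val requestFactory = new NetHttpTransport().createRequestFactory()
    val request = requestFactory.buildGetRequest(new GenericUrl(sitemapUrl))
    request.setParser(new XmlObjectParser(namespaces))
    // Parse the XML response directly into our UrlSet class
    val urlSet = request.execute().parseAs(classOf[UrlSet])
    Option(urlSet.url).map(_.asScala.map(_.loc).toSeq).getOrElse(Seq.empty)
  }
}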

BloggerCrawler Servlet
We can call the BloggerCrawler servlet manually to test our crawler. When we test or call the servlet manually, we set maxseconds to a small value because of the GAE request handler time limit; when we call it from the cron task, we set it to 8 minutes (the time limit for a task is 10 minutes). A sketch of how the servlet might read this parameter is shown below.
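
A small Scala sketch of a hypothetical servlet skeleton reading maxseconds; the class name, default value, and the crawl step are illustrative assumptions, only the parameter-handling idea comes from the post.

import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

class BloggerCrawlerServlet extends HttpServlet {
  override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    // Manual test calls pass a small maxseconds; the cron task passes 480 (8 minutes).
    val maxSeconds = Option(req.getParameter("maxseconds")).map(_.toInt).getOrElse(30)
    val deadline = System.currentTimeMillis() + maxSeconds * 1000L
    // crawlUntil(deadline)  // hypothetical: crawl new posts until the deadline is reached
    resp.getWriter.println("crawling for up to " + maxSeconds + " seconds")
  }
}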

Scheduled Crawler with GAE Cron
We can use GAE cron to call the crawler servlet periodically, for example every 12 hours. All we need to do is add the cron task into cron.xml:
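For example, an entry along these lines (the /bloggercrawler path and maxseconds value are assumptions; only the 12-hour schedule comes from the post):

<cronentries>
  <cron>
    <url>/bloggercrawler?maxseconds=480</url>
    <description>Index new blog posts</description>
    <schedule>every 12 hours</schedule>
  </cron>
</cronentries>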
Check Scheduled Tasks With Cron for Java for more about GAE cron.
Notice that the local development server does not execute cron jobs and does not show the Cron Jobs link; the actual App Engine will show cron jobs and execute them.

The Problem
My application uses Apache HttpClient 4.2, but when it sends requests to some web pages, the response is garbled characters. Executing the request with Fiddler's Composer, I found the response is gzipped:
Content-Encoding: gzip

The Solution

In Apache HttpClient 4.2, DefaultHttpClient doesn't support compression, so it doesn't decompress the response. We have to wrap it in DecompressingHttpClient, as sketched below.
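
A minimal Scala sketch, assuming HttpClient 4.2; DecompressingHttpClient wraps the underlying client, sends Accept-Encoding, and transparently decompresses the gzipped body. The URL and method name are illustrative.

import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.{DecompressingHttpClient, DefaultHttpClient}
import org.apache.http.util.EntityUtils

object GzipAwareClient {
  def fetch(url: String): String = {
    // Wrap DefaultHttpClient so gzip/deflate responses are decompressed automatically
    val client = new DecompressingHttpClient(new DefaultHttpClient())
    val response = client.execute(new HttpGet(url))
    EntityUtils.toString(response.getEntity)
  }
}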

The Goal
In my latest project, I use crawler4j to crawl websites and a Solr summarizer to add a summary to each article. Now I want to use Solr Classification to categorize articles into different categories, such as Java, Linux, News, etc.

Using Solr Classifier
There are two steps when using Solr Classification:

Train
First we add docs with a known category. We can crawl known websites; for example, assign java to the cat field for articles from javarevisited, linux for articles from linuxcommando, solr for articles from solrpl, and so on.

localhost:23456/solr/crawler/crawler?action=create,start&name=linuxcommando.blogspot&seeds=http://linuxcommando.blogspot.com/&maxCount=50&parsePaths=http://linuxcommando.blogspot.com/\d{4}/\d{2}/.*&constants=cat:linux
localhost:23456/solr/crawler/crawler?action=create,start&name=javarevisited.blogspot&seeds=http://javarevisited.blogspot.com/&maxCount=50&parsePaths=http://javarevisited.blogspot.com/\d{4}/\d{2}/.*&constants=cat:java

Classify
We set doClassifer=true, and the ClassfierUpdateProcessorFactory calls the Solr Classifier to assign a label to the category field. A sketch of the underlying classification call is shown below.
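
This is not the author's ClassfierUpdateProcessorFactory; it is just a minimal Scala sketch of the Lucene classification module that such Solr-side classification is typically built on, assuming Lucene 4.x (SimpleNaiveBayesClassifier), an existing index directory, and the field names maincontent and cat.

import java.io.File
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.classification.SimpleNaiveBayesClassifier
import org.apache.lucene.index.{DirectoryReader, SlowCompositeReaderWrapper}
import org.apache.lucene.store.FSDirectory
import org.apache.lucene.util.Version

object CategoryClassifier {
  def classify(indexDir: String, text: String): String = {
    val reader = DirectoryReader.open(FSDirectory.open(new File(indexDir)))
    try {
      // Train a naive-bayes classifier on the already-indexed, labeled documents
      val classifier = new SimpleNaiveBayesClassifier()
      classifier.train(SlowCompositeReaderWrapper.wrap(reader),
        "maincontent", "cat", new StandardAnalyzer(Version.LUCENE_47))
      // Assign a cat label to the new document's text
      classifier.assignClass(text).getAssignedClass.utf8ToString()
    } finally reader.close()
  }
}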

From the result, we can see some articles are assigned to Java, some go to Linux, and some go to Solr.

About Accuracy
The accuracy of Solr Classification is worse than Mahout's, but its performance is much better and it's good enough for my application.

Normalize Html Text and Get Main Content: MainContentUpdateProcessorFactory
First, I use JSoup to normalize the html text: remove links, as they are usually used for navigation or contain javascript code, and also remove invisible blocks (style~=display:\\s*none). To help SOLR-3975 pick important sentences, I add a period (.) after div, span, and textarea elements whose own text doesn't end with a period. A sketch of this normalization is shown below.
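
A minimal Scala sketch of the JSoup normalization described above; only the selectors mentioned in the post are used, and the class and method names are illustrative.

import org.jsoup.Jsoup

object HtmlNormalizer {
  def normalize(html: String): String = {
    val doc = Jsoup.parse(html)
    // Remove links: they are usually navigation or contain javascript code
    doc.select("a").remove()
    // Remove invisible blocks such as style="display:none"
    doc.select("[style~=display:\\s*none]").remove()
    // Add a period after div, span and textarea whose own text doesn't end with one,
    // so the summarizer (SOLR-3975) can split sentences properly
    val it = doc.select("div, span, textarea").iterator()
    while (it.hasNext) {
      val el = it.next()
      val text = el.ownText().trim
      if (text.nonEmpty && !text.endsWith(".")) el.appendText(".")
    }
    doc.text()
  }
}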

It parses summary.fromField (maincontent in this case), gets the most important summary.count (3) sentences, and puts them into summary.summaryField (summary in this case). summary.hl_start and summary.hl_end are empty, as we just need the text and don't want to use html tags (like em or bold) to highlight important words. summary.simpleformat is an internally used argument telling the summarizer to only return the highlighted section: no stats, terms or sentences sections.

DocumentSummaryUpdateProcessorFactory
As some web pages define og:description, which gives one to two sentences, we can use it directly. If og:description is defined, then we use the summarizer to get the most important summary.count (3) - 1 = 2 sentences. A sketch of this is shown below.
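
A small Scala sketch of the og:description handling described above; the summarize function is a placeholder for the SOLR-3975 summarizer, not the actual DocumentSummaryUpdateProcessorFactory.

import org.jsoup.Jsoup

object OgDescription {
  // Returns og:description if the page defines it, otherwise None
  def extract(html: String): Option[String] = {
    val content = Jsoup.parse(html).select("meta[property=og:description]").attr("content")
    if (content.nonEmpty) Some(content) else None
  }

  // If og:description exists, use it plus (count - 1) summarizer sentences; otherwise count sentences
  def buildSummary(html: String, text: String, count: Int,
                   summarize: (String, Int) => Seq[String]): Seq[String] =
    extract(html) match {
      case Some(desc) => desc +: summarize(text, count - 1)
      case None       => summarize(text, count)
    }
}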

Summarizer in Action
Now let's use our crawler to crawl one web page, "Official: Debris Sign of Spaceship Breaking Up", and check the summarization.

curl "http://localhost:23456/solr/crawler/crawler?action=start&seeds=http://abcnews.go.com/Health/wireStory/investigators-branson-spacecraft-crash-site-26619288&maxCount=1&constants=cat:news"

The summaries saved in the doc:

<arr name="summary">
<str>
Investigators looking into what caused the crash of a Virgin Galactic prototype spacecraft that killed one of two test pilots said a 5-mile path of debris across the California desert indicates the aircraft broke up in flight. "When the wreckage is dispersed like that, it indicates the...
</str>
<str>
"We are determined to find out what went wrong," he said, asserting that safety has always been the top priority of the program that envisions taking wealthy tourists six at a time to the edge of space for a brief experience of weightlessness and a view of Earth below.
</str>
<str>
In grim remarks at the Mojave Air and Space Port, where the craft known as SpaceShipTwo was under development, Branson gave no details of Friday's accident and deferred to the NTSB, whose team began its first day of investigation Saturday.
</str>
</arr>