R and rJava can sometimes cause trouble, especially when you need to train a topic model using mallet or other text-mining packages:

I recommend reading this post on configuring Java properly:
https://www.r-statistics.com/2012/08/how-to-load-the-rjava-package-after-the-error-java_home-cannot-be-determined-from-the-registry/

Also, on Windows R checks the package implementation for both architectures, 32-bit and 64-bit. This was hard to discover during the automatic build in RStudio. There were two solutions to this problem:
Set the no-multiarch argument in the check options:
devtools::check(document = FALSE, args = "--no-multiarch")
or properly install Java for both architectures, 32-bit and 64-bit.

I have spent a lot of time on technical issues, such as how to export functions already implemented in other packages as S3 methods for a new class I add, and how to write an assignment function appropriately.
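For readers facing the same two issues, here is a minimal base-R sketch of both patterns: an S3 method for an existing generic, and a replacement ("assignment") function. The class name tmCorpus and its fields are illustrative, not the package's actual internals.

```r
# Illustrative constructor for a hypothetical tmCorpus class.
tmCorpus <- function(docs) {
  structure(list(docs = as.list(docs)), class = "tmCorpus")
}

# An S3 method for the base generic print(); inside a package this
# would be registered via an S3method() entry in NAMESPACE.
print.tmCorpus <- function(x, ...) {
  cat("A tmCorpus with", length(x$docs), "documents\n")
  invisible(x)
}

# A replacement function: `content<-` lets users write
#   content(corpus) <- new_docs
# R rewrites that call to corpus <- `content<-`(corpus, new_docs).
`content<-` <- function(x, value) {
  x$docs <- as.list(value)
  x
}

corp <- tmCorpus(c("first doc", "second doc"))
content(corp) <- c("a", "b", "c")
length(corp$docs)  # 3
```

The key detail is that the replacement function's last argument must be named `value`, and it must return the modified object.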

Another big step was the systematization of different types of functions and the output they produce. The most important part was unifying the train and predict functions for topic models from different packages and applying their output to the topic_wordcloud and topic_network functions.
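The unification idea can be sketched with plain S3 dispatch. This is a hypothetical outline, not the package's real API: the back-end names, the `k` parameter, and the fitted-object fields are all assumptions made for illustration.

```r
# A generic train() that every back-end implements.
train <- function(x, ...) UseMethod("train")

train.default <- function(x, method = c("lda", "mallet"), k = 10, ...) {
  method <- match.arg(method)
  # Each back-end would be called here; we only record a uniform result
  # so that all fitted models share the tmTopicModel class.
  structure(list(method = method, k = k, data = x),
            class = c(paste0("tm_", method), "tmTopicModel"))
}

# One predict method then serves every back-end, because all fitted
# objects inherit from tmTopicModel.
predict.tmTopicModel <- function(object, newdata, ...) {
  # Placeholder logic: assign topic ids cyclically to new documents.
  rep_len(seq_len(object$k), length(newdata))
}

model <- train(letters, method = "lda", k = 3)
predict(model, c("doc1", "doc2"))
```

With this shape, downstream functions such as topic_wordcloud only need to know the common tmTopicModel interface, never which back-end produced the model.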

I met Maciej Eder at the Digital Humanities Summer School in Leipzig. Thanks to his help, and the many insights he gained while working on the stylo R package, he saved me a lot of time on code refactoring and preparing for CRAN submission.

After that we held another GSoC meeting with my mentors in Wrocław. We discussed progress in GSOC 2016 and decided what steps to take next.

For now I plan to finish work on the documentation of existing objects, prepare for the CRAN check, and clean up the code. Because my approach for GSoC was test-driven development, I have almost all tests written already, and thanks to that most of the examples are prepared as well.

Short summary of the last two weeks:
Almost all functionalities included in my proposal have already been implemented.
I left most of the documentation work and some refactoring for the last month.
I will follow the good practices summarised in these posts:
https://bradfults.com/the-best-api-documentation-b9e46400379a
http://blog.parse.com/learn/engineering/designing-great-api-docs/
http://www.programmableweb.com/news/web-api-documentation-best-practices/2010/08/12

Finally, I also plan to include short vignettes.

After the project I still plan to continue developing the package by adding functionalities such as:
– sentiment analysis
– dimension reduction
– word embedding
– more use-cases

At the moment we have 10 issues open in the GitHub repository.
One concerns returning uniform output for topic models from different packages.

The most important functionalities implemented over the last weeks:

– Adjusted tm_map to work for more complex cases
– a c function for tmCorpus data
– terms for the mallet topic model
– refactored train and predict functions for tmTopicModel
– included TreeTagger in the package through koRpus (a check whether TreeTagger is installed on the computer still needs to be added)
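The missing installation check could look roughly like the sketch below. The binary name `tree-tagger` and the `TREETAGGER_HOME` environment variable are assumptions about a typical installation, not something the package currently defines.

```r
# Hypothetical helper: detect whether the TreeTagger binary is reachable
# before handing work to koRpus.
treetagger_available <- function() {
  # Case 1: the tree-tagger binary is on the user's PATH.
  on_path <- nzchar(Sys.which("tree-tagger"))
  # Case 2: the user points at an install dir via TREETAGGER_HOME
  # (an assumed convention for this sketch).
  env_dir <- Sys.getenv("TREETAGGER_HOME", unset = "")
  in_env  <- nzchar(env_dir) &&
    file.exists(file.path(env_dir, "bin", "tree-tagger"))
  on_path || in_env
}

if (!treetagger_available()) {
  message("TreeTagger not found; skipping part-of-speech tagging.")
}
```

Failing softly with a message, rather than an error, keeps the rest of the workflow usable on machines without TreeTagger.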

Also, I performed an initial sentiment analysis on Twitter data and prepared some more short vignettes.

This week I struggled a little with modifying interior elements of an object inside an lapply loop. This Stack Overflow thread helped:
http://stackoverflow.com/questions/21740244/is-it-possible-to-modify-list-elements
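The pitfall boils down to the fact that lapply() works on copies and never modifies its input in place. A small base-R sketch (the document structure is made up for illustration):

```r
docs <- list(
  list(text = "First Document",  meta = list(id = 1)),
  list(text = "Second Document", meta = list(id = 2))
)

# lapply() operates on copies, so this call alone leaves `docs` untouched:
modified <- lapply(docs, function(d) {
  d$text <- tolower(d$text)
  d
})
docs[[1]]$text   # still "First Document"

# The fix is simply to assign the result back:
docs <- modified
docs[[1]]$text   # "first document"
```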

Over the last week I focused on integrating some of the existing methods from packages such as NLP and tm with the tmCorpus object.

Among others, I integrated methods such as content(x), which for tmCorpus returns a list of documents.
I also prepared a version of the tm_map function that can be applied to a tmCorpus. Finally, it is now possible to create a Term-Document Matrix and a Document-Term Matrix directly from a tmCorpus object. The last commit fixed a bug that forced the use of English stop-words; the function is now more universal.
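To make the idea concrete, here is a toy base-R sketch of a content() accessor and a tiny term-document matrix built from it. Everything here is illustrative: the real package delegates to tm's TermDocumentMatrix(), and this toy version records binary term presence rather than counts.

```r
# A content() generic plus a method for a hypothetical tmCorpus class.
content <- function(x, ...) UseMethod("content")
content.tmCorpus <- function(x, ...) x$docs

tmCorpus <- function(docs) {
  structure(list(docs = as.list(docs)), class = "tmCorpus")
}

# Toy term-document matrix: rows are terms, columns are documents.
# The stopwords argument shows the bug fix mentioned above: the word
# list is a parameter instead of being hard-coded to English.
toy_tdm <- function(corpus, stopwords = character()) {
  tokens <- lapply(content(corpus), function(d)
    setdiff(strsplit(tolower(d), "\\W+")[[1]], stopwords))
  terms <- sort(unique(unlist(tokens)))
  m <- vapply(tokens, function(tok) as.integer(terms %in% tok),
              integer(length(terms)))
  rownames(m) <- terms
  m
}

corp <- tmCorpus(c("cats and dogs", "dogs and birds"))
toy_tdm(corp, stopwords = "and")
```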

For this week I plan to:

July 4 – July 12

Start working on the “Integration of TreeTagger for topic modelling within a specific part of speech” task.
Test the koRpus package for this purpose.
Design an API to include TreeTagger without using the koRpus package.

Also, I plan to invest some time into sentiment analysis and other packages for LDA and visualizations.

This week was full of work, and we finally got to the point where simple analysis is possible. Over the last week I applied the changes mentioned in the proposal to enable easy use of the mallet package.

The train and predict functions for topic models have been implemented. The visualization techniques are also available in the repository. One can see the structure of words across different topics and the corresponding wordclouds.

The picture shows the wordcloud of one of the topics, most probably created from the romance books,

and the network of words within different topics.

If you want to try how my package works at the moment, take a look at the first tests provided in:

This week I plan to implement some more basic transformation functions. The idea is to give tmCorpus an interface similar to the existing one for VCorpus. Functions such as tm_map, tm_filter, and tm_reduce are very well designed and users are used to them, which is the main reason for not changing them.

Also, this week I plan to have a look at other packages that were not mentioned in the proposal. The plan is to extend the package's possibilities as much as possible during the project time.

Last week I moved my work to the official repository. My work focused on naming new commits, developing a class from an earlier sketch, and a few simple functions. Also, instead of writing my own testing functions, I used testthat for this purpose.

There are a few concerns still to be solved. The most important is to incorporate a basic class containing a single document with metadata.

The API is now planned to consist of 4 basic classes:

text corpora

parsed text corpora

table text corpora

tagged text corpora

June 6 – June 12

Examples of usage for the train and predict functions. An application of this to the mallet package already exists in the application repo.

Start working on the “Integration of the existing packages to obtain complete workflow for the text mining tasks” topic. This will involve transforming some existing classes into one of our basic classes.

This week was my first approach to writing code using Test-Driven Development. This is a software development process built on a very short development cycle: first you write tests for the new feature, then you implement only the code needed to make those tests pass.
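The cycle can be illustrated in a few lines of base R (in the package itself the tests use testthat). The word_count() function here is a made-up example, not part of the package.

```r
# Step 1 of TDD: write the test before any implementation exists.
test_word_count <- function() {
  stopifnot(word_count("one two three") == 3)
  stopifnot(word_count("") == 0)
}

# Step 2: write just enough code to make the test pass.
word_count <- function(x) {
  if (!nzchar(trimws(x))) return(0L)
  length(strsplit(trimws(x), "\\s+")[[1]])
}

test_word_count()  # passes silently; a failure would stop with an error
```

From here the cycle repeats: add a new failing test for the next requirement, then extend the implementation until it passes again.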

This week I am going to start my coding project with Google Summer of Code! Over the last weeks I have been getting deeper into visualization and analytical tools for text mining. This included sentiment analysis with syuzhet, visualizations of topic models using LDA and LDAvis, and dfr-browser, the topic browser made by Andrew Goldstone.
Also, I studied the methodology of Test-Driven Development. This software development process consists of short development cycles. The author of Test-Driven Development, Kent Beck, remarked that TDD encourages simple designs and inspires confidence.

Plans for this week:

May 23 – May 29

Start the work on the “Interface for the mallet package”.
A solution to this problem has been partially implemented by Andrew Goldstone. Some interface functions are also available in my GitHub repository.

Designing the API and preparing the skeleton for the most important functions.
The basic data structures are already under discussion with my mentors.

This year, on April 24th, I was accepted into the Google Summer of Code program. I will be working on the integration of text mining and topic modelling tools for the R Project for Statistical Computing.

Google Summer of Code, often abbreviated to GSoC, is an international program in which students build free and open-source software during the summer. The program is open to university students aged 18 and over.

Over the next couple of months I will be writing here about the progress of my work.

Here is an abstract of my coding project:
The goal of this project is to create a user-friendly API for an integrated workflow performing typical text mining, natural language processing, and topic modelling tasks. This would include the complete process of topic modelling:

Loading data, including loading text files from a local filesystem, as well as harvesting texts from the internet (via the stylo/tm packages)

In the first stage, I plan to integrate a few packages, as mentioned above. Future development assumes building a package that integrates more tools, in a similar fashion to caret for predictive modelling. Google Summer of Code is planned to be just the outset of a bigger project.