The Data Daily

Data Science for Java Developers With Tablesaw - DZone Big Data

Tablesaw is like an open-source Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Come learn all about it!
Aug. 20, 17 · Big Data Zone
Data science is one of the hottest areas in computing today. Most people learn data science using either Python or R. Both are excellent languages for crunching and analyzing data.
But many Java developers feel left behind. There are great Java libraries for machine learning, especially for jobs that require distributed computing, but there's no simple path for Java developers to learn and apply data science. By minimizing the number of things you need to learn, the open-source Tablesaw provides a gateway.
Think of Tablesaw as a Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Used interactively or embedded in an application, its focus is to make data science as easy in Java as in R or Python. If you've done some data science, you may think of it as a data frame.
Tablesaw is easy to learn, but it's not a toy. Tables can be large: up to two billion rows. Performance is brisk: on my laptop, I can retrieve 500 records from a table of half a billion rows in two milliseconds. It is open source under the business-friendly Apache 2 license.
What Makes Tablesaw Beginner-Friendly?
It builds on what you know: For Java developers who want to do data science, it's a huge advantage to not have to also learn a new language.
It's easy to get started: Simply add Tablesaw as a Maven dependency for your project and you’re up and running. We’ll walk through an example below to show you how.
It's not distributed: Unlike many machine learning libraries, Tablesaw is not a distributed system. This removes enormous complexity and makes machine learning accessible to those without deep engineering experience or support.
The code is clear: There's a fluent API so you’ll understand your code the next time you read it.
It provides fast feedback: Tablesaw is designed to be used interactively for exploratory analysis.
Introductory Example
Here, I’ll show you some of Tablesaw’s basic data manipulation features. Future posts will address visualization, machine learning, the Kotlin API and REPL, and the Tablesaw architecture. The code for this example can be found here.
Up and Running
To begin, create a Java project and add the Tablesaw core library as a Maven dependency. The current dependency is:
tech.tablesaw
Next, create a class with a main method like so:
public class Foo {
    public static void main(String[] args) {
        // rest of code goes here
    }
}
The rest of our code will go in this method. Now add a table. Tablesaw can load data from relational databases, but we will create our table from a flat file:
Table table1 = Table.createFromCsv("data/BushApproval.csv");
Table objects can provide a lot of information:
table1.name(); returns BushApproval.csv since it named the table after the file.
table1.shape(); returns 323 rows X 3 cols.
table1.structure(); returns a table of column metadata:
Index  Column Name  Column Type
0      date         LOCAL_DATE
1      approval     SHORT_INT
2      who          CATEGORY
Note that we've inferred the column types from the data.
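To give a feel for what inferring a column type from the data can involve, here is a minimal sketch in plain Java. It is not Tablesaw's implementation: the class TypeInferenceSketch, its methods, and the rule "try dates first, then short integers, then fall back to categories" are all assumptions made for illustration. Only the three type names come from the structure output above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of column type inference; not Tablesaw's actual code.
class TypeInferenceSketch {

    // Returns the first type, in the order tried here, that every value fits.
    static String inferType(List<String> values) {
        if (values.stream().allMatch(TypeInferenceSketch::isIsoDate)) {
            return "LOCAL_DATE";
        }
        if (values.stream().allMatch(TypeInferenceSketch::isShort)) {
            return "SHORT_INT";
        }
        return "CATEGORY"; // fall back to treating the values as labels
    }

    static boolean isIsoDate(String s) {
        try {
            LocalDate.parse(s); // accepts ISO dates such as 2004-02-04
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    static boolean isShort(String s) {
        try {
            Short.parseShort(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inferType(Arrays.asList("2004-02-04", "2004-01-21")));
        System.out.println(inferType(Arrays.asList("53", "58")));
        System.out.println(inferType(Arrays.asList("fox", "gallup")));
    }
}
```

A real reader would likely sample rows and support more formats; the sketch only shows the general idea of testing candidate types against the values.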
table1.first(3); returns a new table containing only the first three rows.
BushApproval.csv
date        approval  who
2004-02-04  53        fox
2004-01-21  53        fox
2004-01-07  58        fox
Inevitably, we want to work with the data, and for that, we need columns. Each has a data type, and usually, you’ll want it by that type and not as a generic column because typed columns have more power. For example, to get the approval column, you can use:
ShortColumn approval = table1.shortColumn("approval");
Each column sub-type supports numerous operations. As a rule, operations on a column are applied to every element without explicit loops. Some call these “vector operations.” For example, operations like count(), min(), and contains() produce a single value for a column of data: approval.min();.
Other operations return a new column. The method dayOfYear() applied to a DateColumn returns a short integer column with each element the day of the year from 1 to 366.
Some column-returning operations take a scalar value as an input. For example, if cd is a DateColumn, cd.plusDays(4); adds four days to every element.
Others take a second column as an argument. These process the two columns in order, applying each integer value from the argument to the corresponding element in the receiver.
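The three flavors of column operation described above (reduce to one value, map to a new column, combine with a scalar or a second column) can be sketched in plain Java. This is only an illustration under stated assumptions: VectorOpsSketch and its method names are hypothetical, and plain arrays and lists stand in for Tablesaw's column types.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical stand-ins for column operations; not Tablesaw's API.
class VectorOpsSketch {

    // Reduces a whole column to a single value (throws if the column is empty).
    static int min(int[] column) {
        return Arrays.stream(column).min().getAsInt();
    }

    // Returns a new column holding the day of the year (1 to 366) of each date.
    static List<Integer> dayOfYear(List<LocalDate> dates) {
        return dates.stream().map(LocalDate::getDayOfYear).collect(Collectors.toList());
    }

    // Scalar argument: adds the same number of days to every element.
    static List<LocalDate> plusDays(List<LocalDate> dates, long days) {
        return dates.stream().map(d -> d.plusDays(days)).collect(Collectors.toList());
    }

    // Column argument: adds days[i] to dates.get(i), element by element.
    static List<LocalDate> plusDaysEach(List<LocalDate> dates, int[] days) {
        if (dates.size() != days.length) {
            throw new IllegalArgumentException("Columns must have the same length");
        }
        return IntStream.range(0, days.length)
                .mapToObj(i -> dates.get(i).plusDays(days[i]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(LocalDate.of(2004, 2, 4), LocalDate.of(2004, 1, 21));
        System.out.println(min(new int[] {53, 53, 58}));
        System.out.println(dayOfYear(dates));
        System.out.println(plusDays(dates, 4));
        System.out.println(plusDaysEach(dates, new int[] {1, 10}));
    }
}
```

In each case the caller writes no loop; the iteration lives inside the operation, which is the point of the "vector operation" style.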
Boolean operations like isMonday() don’t return a boolean column directly, but a Selection instead. Selections can be used to filter tables by the values in their columns, so we’ll see them again:
Selection selection = table1.dateColumn("date").isMonday();
You can, of course, get a boolean column if you want it. You simply pass the Selection and the original column length to a BooleanColumn constructor, along with a name for the new column:
BooleanColumn mondays = new BooleanColumn("mondays", selection, table1.rowCount());
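Why does the constructor need the column length? One way to picture it, as a sketch only, is to treat a selection as a set of matching row indexes. Such a set knows which rows matched but not how many rows there were in total. The code below uses java.util.BitSet for that set; SelectionSketch and its methods are hypothetical and make no claim about how Tablesaw stores a Selection.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical model of a selection as a set of row indexes.
class SelectionSketch {

    // Marks the index of every date that falls on a Monday.
    static BitSet isMonday(List<LocalDate> dates) {
        BitSet selected = new BitSet(dates.size());
        for (int i = 0; i < dates.size(); i++) {
            if (dates.get(i).getDayOfWeek() == DayOfWeek.MONDAY) {
                selected.set(i);
            }
        }
        return selected;
    }

    // Expands the index set into one boolean per row; length is the row count.
    static boolean[] toBooleanColumn(BitSet selection, int length) {
        boolean[] column = new boolean[length];
        for (int i = selection.nextSetBit(0); i >= 0 && i < length; i = selection.nextSetBit(i + 1)) {
            column[i] = true;
        }
        return column;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(
                LocalDate.of(2004, 2, 2), LocalDate.of(2004, 2, 4), LocalDate.of(2004, 2, 9));
        BitSet mondays = isMonday(dates);
        System.out.println(mondays);
        System.out.println(Arrays.toString(toBooleanColumn(mondays, dates.size())));
    }
}
```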
There are hundreds of methods available for column manipulation, but let's turn now to tables. Operations exist for creating, describing, modifying, sorting, querying, and summarizing tables. Here we'll cover sorting, querying, and summarizing.
Queries
For queries, we need a helper. It is called QueryHelper, and it’s best to do a static import wherever you will use it. The method selectWhere() gets the job done.
Usually, you will pass a Filter to selectWhere(), and a Filter is easy to create inline:
Table highApproval = table1.selectWhere(column("approval").isGreaterThan(80));
The segment column("approval").isGreaterThan(80) creates the filter.
Remember Selection objects from columns? You can also use those as arguments to selectWhere(), allowing you to use column-specific logic to query a table. In the next line, date is the table's date column, obtained with table1.dateColumn("date").
Table Q3 = table1.selectWhere(date.isInQ3());
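Conceptually, both kinds of query boil down to "find the matching row indexes, then keep those rows." The following plain-Java sketch shows that idea under stated assumptions: QuerySketch and its methods are hypothetical, arrays and lists stand in for columns, and Q3 is taken to mean July through September.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch of filtering rows by a column condition.
class QuerySketch {

    // Returns the indexes of rows whose value is strictly greater than the threshold.
    static int[] isGreaterThan(int[] column, int threshold) {
        return IntStream.range(0, column.length)
                .filter(i -> column[i] > threshold)
                .toArray();
    }

    // Column-specific logic: true when the date falls in July, August, or September.
    static boolean isInQ3(LocalDate date) {
        int month = date.getMonthValue();
        return month >= 7 && month <= 9;
    }

    // Keeps only the elements at the given row indexes, in that order.
    static <T> List<T> selectRows(List<T> column, int[] rows) {
        return Arrays.stream(rows).mapToObj(column::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int[] approval = {53, 88, 90, 58};
        List<String> who = Arrays.asList("fox", "gallup", "zogby", "fox");
        int[] rows = isGreaterThan(approval, 80);
        System.out.println(selectRows(who, rows));
        System.out.println(isInQ3(LocalDate.of(2004, 8, 15)));
    }
}
```

The approval values in main are made up for the demonstration.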
Sorting
There are a number of ways to sort a table, but the easiest is sortOn(). This code gets it done:
table1.sortOn("who", "approval");
"who" and "approval" are column names, and the sort is ascending. To sort in descending order, use sortDescendingOn().
To sort in mixed order, you can prepend a minus sign to a column name to indicate a descending sort on that column. For example, table1.sortOn("who", "-approval"); sorts on "who" in ascending order, and on "approval" in descending order.
Finally, you can write your own sort logic as an IntComparator, giving you full control over the ordering.
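The minus-sign convention maps naturally onto chained comparators. As a rough sketch, assuming a hypothetical Row class with just the who and approval fields (nothing here is Tablesaw's own sorting code), the sort keys could be turned into a java.util.Comparator like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of mixed-order sorting driven by "-"-prefixed column names.
class SortSketch {

    static final class Row {
        final String who;
        final int approval;

        Row(String who, int approval) {
            this.who = who;
            this.approval = approval;
        }

        @Override
        public String toString() {
            return who + " " + approval;
        }
    }

    // Ascending comparator for one known column name.
    static Comparator<Row> columnComparator(String name) {
        switch (name) {
            case "who":
                return Comparator.comparing((Row r) -> r.who);
            case "approval":
                return Comparator.comparingInt((Row r) -> r.approval);
            default:
                throw new IllegalArgumentException("Unknown column: " + name);
        }
    }

    // Chains one comparator per key; a leading minus reverses that key only.
    static Comparator<Row> sortOn(String... keys) {
        Comparator<Row> result = null;
        for (String key : keys) {
            boolean descending = key.startsWith("-");
            Comparator<Row> next = columnComparator(descending ? key.substring(1) : key);
            if (descending) {
                next = next.reversed();
            }
            result = (result == null) ? next : result.thenComparing(next);
        }
        if (result == null) {
            throw new IllegalArgumentException("At least one sort key is required");
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>(Arrays.asList(
                new Row("gallup", 60), new Row("fox", 53), new Row("fox", 58)));
        rows.sort(sortOn("who", "-approval"));
        System.out.println(rows);
    }
}
```

Writing your own comparator, as mentioned above, amounts to supplying the ordering logic directly instead of having it derived from column names.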
Summarizing
Now, we’ll cover summarization techniques like pivot tables (cross tabs). If you want to simply calculate group statistics for a table, the summarize() method works nicely. There are a large number of statistics available, including range, as shown below.
Table summary = table1.summarize("approval", range).by("who");
BushApproval.csv summary
who       Range [approval]
fox       42.0
gallup    41.0
newsweek  40.0
time.cnn  37.0
upenn     10.0
zogby     37.0
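For readers who want to see what a grouped range involves, here is a minimal plain-Java sketch: group the approval values by pollster, then subtract each group's minimum from its maximum. SummarizeSketch and rangeBy are hypothetical names, parallel arrays stand in for the two columns, and the sample values in main are made up.

```java
import java.util.IntSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "range of approval, grouped by who".
class SummarizeSketch {

    // Returns max - min of the values for each distinct group label.
    static Map<String, Double> rangeBy(String[] groups, int[] values) {
        if (groups.length != values.length) {
            throw new IllegalArgumentException("Columns must have the same length");
        }
        Map<String, IntSummaryStatistics> stats = new TreeMap<>();
        for (int i = 0; i < groups.length; i++) {
            stats.computeIfAbsent(groups[i], k -> new IntSummaryStatistics()).accept(values[i]);
        }
        Map<String, Double> ranges = new TreeMap<>();
        stats.forEach((group, s) -> ranges.put(group, (double) (s.getMax() - s.getMin())));
        return ranges;
    }

    public static void main(String[] args) {
        String[] who = {"fox", "fox", "gallup", "fox"};
        int[] approval = {53, 58, 60, 90};
        System.out.println(rangeBy(who, approval));
    }
}
```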
Cross tabs are useful for producing counts or frequencies of the number of observations in a combination of categories. First, let's get two categorical columns:
CategoryColumn who = table1.categoryColumn("who");
// date is the table's date column, e.g. table1.dateColumn("date")
CategoryColumn month = date.month();
table1.addColumn(month);
Now, we can calculate the raw counts for each combination:
Table xtab = CrossTab.xTabCount(table1, month, who);
Crosstab Counts: date month x who
           fox  gallup  newsweek  time.cnn  upenn  zogby  total
APRIL        6      10         3         1      0      3     23
AUGUST       3       8         2         1      0      2     16
DECEMBER     4       9         4         3      2      5     27
FEBRUARY     7       9         4         4      1      4     29
JANUARY      7      13         6         3      5      8     42
JULY         6       9         4         3      0      4     26
JUNE         6      11         1         1      0      4     23
MARCH        5      12         4         3      0      6     30
MAY          4       9         5         3      0      1     22
NOVEMBER     4       9         6         3      1      1     24
OCTOBER      7      10         8         2      1      3     31
SEPTEMBER    5      10         8         3      0      4     30
Total       64     119        55        30     10     45    323
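Under the hood, a count cross tab is just a tally over pairs of category values. The sketch below shows one way to compute it in plain Java with nested maps. CrossTabSketch is a hypothetical class, not the CrossTab used above, and it leaves out the total row and column.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of counting observations per (row label, column label) pair.
class CrossTabSketch {

    // Outer key: row label (e.g. month); inner key: column label (e.g. who).
    static Map<String, Map<String, Integer>> xTabCount(String[] rowLabels, String[] colLabels) {
        if (rowLabels.length != colLabels.length) {
            throw new IllegalArgumentException("Columns must have the same length");
        }
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i < rowLabels.length; i++) {
            counts.computeIfAbsent(rowLabels[i], k -> new TreeMap<>())
                  .merge(colLabels[i], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] month = {"JANUARY", "JANUARY", "FEBRUARY", "JANUARY"};
        String[] who = {"fox", "fox", "fox", "gallup"};
        System.out.println(xTabCount(month, who));
    }
}
```

Combinations that never occur are simply absent from the inner map here, whereas the table above prints them as 0.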
If you prefer to see the relative frequency for each combination, pass your crosstab table to the tablePercents() method:
CrossTab.tablePercents(xtab);
Crosstab Table Proportions:
           fox          gallup       newsweek      time.cnn      upenn         zogby         total
APRIL      0.01857585   0.030959751  0.009287925   0.0030959751  0.0           0.009287925   0.071207434
AUGUST     0.009287925  0.024767801  0.0061919503  0.0030959751  0.0           0.0061919503  0.049535602
DECEMBER   0.012383901  0.027863776  0.012383901   0.009287925   0.0061919503  0.015479876   0.083591335
FEBRUARY   0.021671826  0.027863776  0.012383901   0.012383901   0.0030959751  0.012383901   0.08978328
JANUARY    0.021671826  0.04024768   0.01857585    0.009287925   0.015479876   0.024767801   0.13003096
JULY       0.01857585   0.027863776  0.012383901   0.009287925   0.0           0.012383901   0.08049536
JUNE       0.01857585   0.03405573   0.0030959751  0.0030959751  0.0           0.012383901   0.071207434
MARCH      0.015479876  0.0371517    0.012383901   0.009287925   0.0           0.01857585    0.09287926
MAY        0.012383901  0.027863776  0.015479876   0.009287925   0.0           0.0030959751  0.06811146
NOVEMBER   0.012383901  0.027863776  0.01857585    0.009287925   0.0030959751  0.0030959751  0.0743034
OCTOBER    0.021671826  0.030959751  0.024767801   0.0061919503  0.0030959751  0.009287925   0.095975235
SEPTEMBER  0.015479876  0.030959751  0.024767801   0.009287925   0.0           0.012383901   0.09287926
Total      0.19814241   0.36842105   0.17027864    0.09287926    0.030959751   0.13931888    1.0
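Each proportion is a cell count divided by the grand total. For instance, the APRIL/fox count of 6 out of 323 observations gives 6 / 323, or about 0.0186. A minimal plain-Java sketch of that calculation follows; TablePercentsSketch is a hypothetical class operating on a bare array of counts, without the total row and column.

```java
import java.util.Arrays;

// Hypothetical sketch: convert a grid of counts into proportions of the grand total.
class TablePercentsSketch {

    static double[][] tablePercents(int[][] counts) {
        long total = 0;
        for (int[] row : counts) {
            for (int cell : row) {
                total += cell;
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("Table has no observations");
        }
        double[][] proportions = new double[counts.length][];
        for (int r = 0; r < counts.length; r++) {
            proportions[r] = new double[counts[r].length];
            for (int c = 0; c < counts[r].length; c++) {
                proportions[r][c] = (double) counts[r][c] / total;
            }
        }
        return proportions;
    }

    public static void main(String[] args) {
        int[][] counts = {{6, 10}, {3, 1}}; // made-up 2 x 2 table
        System.out.println(Arrays.deepToString(tablePercents(counts)));
    }
}
```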
What's Next?
I hope this has encouraged some of you to give Tablesaw a try. As I mentioned, future posts will cover visualization, machine learning, and more. Since you're a Java developer, consider taking a look at our contributor's page. Tablesaw is a work in progress. Help us make Java a great platform for data science.