Writing and Searching for POJOs in MarkLogic

In this tutorial, find out how to store and search POJOs in a MarkLogic database without giving up consistency, reliability, or scale.

With traditional relational databases, persisting your
in-memory data structures requires complex ORM (Object-Relational
Mapping) tools to handle the well-known impedance mismatch.
Next-generation NoSQL databases that support variety in stored
information can provide a simpler solution.

The latest version of MarkLogic Server adds the
MarkLogic Java API to make it easy to take advantage of the server
in your Java applications. For this tutorial, you’ll download the
free version of MarkLogic Server. We’ll work through some typical
data discovery scenarios with a music dataset, executing queries
both to answer specific questions and to get a better overall
understanding of the dataset. To make things simple, we’ll work
with data in a POJO representation. The setup steps consist of
installing MarkLogic Server, downloading the tutorial, and running
a bootstrapping utility that defines a couple of users and creates
the database and REST server.

Installing and Starting MarkLogic Server

Download and install the latest version of MarkLogic from
http://developer.marklogic.com/products.
Once you’ve installed and started MarkLogic, go to the
browser-based administrative interface (at http://localhost:8001/), which will walk you through
getting an Express license and creating an admin user. (This
tutorial assumes you’ll be running MarkLogic on your local machine;
if that’s not the case, just substitute your server name whenever
you see “localhost” in this tutorial.)

Downloading the Tutorial

After starting the server,
download the tutorial source code from http://developer.marklogic.com/media/pojo-tutorial-01.zip.
Unzip the distribution. You’ll find a standard Maven source
structure that you can use, for instance, in m2e. You can, of
course, work with the sources and classes without Maven if you
prefer: look for the sources under the src/main/java directory and
for the runtime environment under the target/classes directory.

In the following sections, we’ll
only show the highlights from the source code and output. To get
the most out of this tutorial, you should view the complete
examples in your IDE or editor and run the examples to see the
complete output.

To run the tutorial examples,
you’ll need to set up a Java 6 runtime environment (preferably the
latest stable distribution). You configure your CLASSPATH in the
usual way:

From the command line, specify the root
directory for the tutorial classes and the jars for the Java API
and its lib dependencies on your CLASSPATH.

In an IDE such as Eclipse, create a
project with the tutorial classes in the source directory. Either
add the jars for the Java API and its lib dependencies to your
build path or use the Maven POM in the tutorial distribution to
download these dependencies to your Maven repository.

Setting up the Tutorial’s Server Environment

This tutorial focuses on application programming rather than
MarkLogic Server administration. Therefore, this tutorial provides a
utility to set up the server environment in one step. Before you
start, find and check the values in tutorial.properties.
The default values should be correct for your setup; simply ensure
that the values for tutorial.bootstrap_user
and tutorial.bootstrap_password
match the administrative credentials for the MarkLogic server. Be
wary of modifying the other values shipped with tutorial.properties.
To bootstrap the REST server’s environment, run the following
command at the command line:

Bootstrapping the Tutorial’s server-side environment

Alternatively, use an IDE to execute this class’s main method.
When it’s done, this command will have completed the following:

Created two users for running your application.

The rest-admin user is permitted to configure
the application. The bootstrapper sets up this user with the
password “x”.

The rest-writer user is allowed to write and
update documents, as well as execute searches and retrieve
documents. This user is also created with the password
“x”.

Created a new database called “TopSongs” for the application
data, and “TopSongs-modules” to hold extension code.

Added two range indexes to the “TopSongs” database to support
some of the searches below.

Later, when you want to set
up your own database, REST server, and indexes, go to
http://localhost:8000/appservices/, click the New Database button,
select the database, and click the Configure button. Now we’re ready for a
quick look at the dataset.

Annotating the POJO Classes

The dataset for this
tutorial consists of top songs extracted from Wikipedia
(http://en.wikipedia.org/wiki/Category:Lists_of_number-one_songs_in_the_United_States). Each song is described by a standalone tree
structure modelled with nested POJOs (similar to JSON but with
strong typing). To enable processing by JAXB, the POJO classes have
two JAXB annotations: one on the root class for the tree structure
and one on the descr property.

JAXB Annotation

The descr property contains marked-up
text as a target for full-text search. Other key properties include
exactly one artist as well as zero or many writers, producers, genres,
and weeks.
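To make that shape concrete, here is a minimal plain-Java sketch of what such a root POJO might look like. The class name TopSong and the field types are assumptions made for illustration (the property names come from the tutorial output), and the two JAXB annotations are only indicated in comments; the real classes are in the tutorial source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the root POJO; the real classes ship with the tutorial source.
// The tutorial places one JAXB annotation on the root class (not reproduced here).
class TopSong {
    private String title;
    private String artist;                               // exactly one artist
    private List<String> writers = new ArrayList<>();    // zero or many
    private List<String> producers = new ArrayList<>();  // zero or many
    private List<String> genres = new ArrayList<>();     // zero or many
    private List<String> weeks = new ArrayList<>();      // ISO dates such as 1967-06-03
    // The tutorial places its second JAXB annotation on this property.
    private String descr;                                // marked-up text (type assumed)

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getArtist() { return artist; }
    public void setArtist(String artist) { this.artist = artist; }
    public List<String> getWriters() { return writers; }
    public void setWriters(List<String> writers) { this.writers = writers; }
    public List<String> getProducers() { return producers; }
    public void setProducers(List<String> producers) { this.producers = producers; }
    public List<String> getGenres() { return genres; }
    public void setGenres(List<String> genres) { this.genres = genres; }
    public List<String> getWeeks() { return weeks; }
    public void setWeeks(List<String> weeks) { this.weeks = weeks; }
    public String getDescr() { return descr; }
    public void setDescr(String descr) { this.descr = descr; }

    public static void main(String[] args) {
        TopSong song = new TopSong();
        song.setTitle("Respect");
        song.setArtist("Aretha Franklin");
        song.setWriters(Arrays.asList("Otis Redding"));
        song.setWeeks(Arrays.asList("1967-06-03", "1967-06-10"));
        System.out.println(song.getTitle() + " | " + song.getArtist() + " | " + song.getWriters());
    }
}
```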

Writing POJOs to the Database

The tutorial source provides
the serialized POJOs in XML files. Aside from the descr property, the
POJOs are vanilla Java beans and could be
loaded from a Java object input stream or any other
source.

The POJOWriter example creates a database client and iterates over the
serialized POJO files, using JAXB to write the POJOs to the
database as separate documents. Each document has a unique URI and
contains a root object and its subordinate objects. Here’s the
source code condensed to focus on the important parts (which will
also be true of subsequent examples).

Every application using the
API creates a DatabaseClient before interacting with the database and
releases the
client afterward. Subsequent examples will omit these statements to
focus on new ideas.

The POJOWriter example calls
the XMLDocumentManager.write() method to persist each POJO as a document
in the database. The JAXBHandle class adapts JAXB for integration into
the API. The API
uses adapters like JAXBHandle to integrate standard content
representations as diverse as binary InputStream, character String,
and StAX XMLStreamReader.

Reading a POJO from the Database

The POJOReader example confirms the previous load by calling the
XMLDocumentManager.read() method to get a POJO from the database,
again using JAXB.

The example prints out the POJO
properties, producing the following output:

document: /topsongs/Aretha-Franklin+Respect.xml

title | Respect

artist | Aretha Franklin

writers | Otis Redding

producers | Steve Cropper

genres | Soul

weeks | 1967-06-03 | 1967-06-10

Subsequent examples will
search these properties and the text of the descr property.

Searching for the Value of a Property

Now we’re ready to
investigate the top songs dataset. Looking at the output for Respect,
we might wonder whether Otis Redding wrote any other hit
songs.

The KeyValueSearcher example finds all documents where the writer
element contains the exact value Otis Redding. Such
searches resemble equals predicates in the WHERE clause of an SQL
database but can operate on varied document structures instead of
rigid relational tables.
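As a rough mental model of what an exact-value match means, the following plain-Java sketch filters an in-memory map of titles to writers. It does not use the MarkLogic API, and the second entry in the sample data is made up for illustration; the server answers the same kind of question from its indexes rather than by scanning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExactValueMatch {
    // Made-up in-memory stand-in for the stored documents: title -> writers.
    static Map<String, List<String>> sample() {
        Map<String, List<String>> writersByTitle = new LinkedHashMap<>();
        writersByTitle.put("Respect", Arrays.asList("Otis Redding"));
        writersByTitle.put("Hypothetical Song", Arrays.asList("Jane Doe", "John Roe"));
        return writersByTitle;
    }

    // Returns the titles that have a writer equal to the given value (no partial matches).
    static List<String> titlesWrittenBy(Map<String, List<String>> writersByTitle, String writer) {
        List<String> titles = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : writersByTitle.entrySet()) {
            if (entry.getValue().contains(writer)) {
                titles.add(entry.getKey());
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(titlesWrittenBy(sample(), "Otis Redding"));
        System.out.println(titlesWrittenBy(sample(), "Otis"));
    }
}
```

In this sketch the partial value Otis matches nothing, which is the behavior an equals predicate would have as well.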

All queries use a QueryManager. (Subsequent examples skip its
construction.) The KeyValueQueryDefinition class specifies the query
criteria. The call
to QueryManager.search() searches the database. SearchHandle parses
the results into a Java structure
reflecting documents matched by the query and locations matched
within each document. You can also get search results in JSON or
XML if you prefer.

The example iterates over the
matched documents and locations to generate the following output,
which answers the question: Otis Redding wrote two top songs.

For JSON documents, you can
search on the value of a key in much the same way.

Searching for Terms in Text

When investigating a dataset, one
question often leads to another. We might wonder whether Aretha
Franklin and Otis Redding collaborated on other top songs. We can
start with a simple string search.

A string search expresses
query criteria including phrases and Booleans similar to the Google
search box. You can prompt a user for the criteria, but it’s also
convenient for specifying static criteria in an application. Like a
search engine, the StringSearcher example matches documents that
contain both of the
phrases Aretha Franklin and Otis Redding in any location.

The search matched phrases
mentioning Aretha Franklin and Otis Redding in the description,
which doesn’t indicate whether they collaborated on the song.

Searching for Combinations of Properties

To get a definitive answer
for our question, we need to constrain our phrase search to
the artist and writer properties. We define constraints with query
options. Query
options specify the static parts of a query, including not only
constraints but also the result page length and so on. You write query
options to the database before executing a search; the search itself
supplies the dynamic parts of the query, including the criteria, the
result page number, and so on.

The
ConstrainedSearcher example builds the query
options as a data structure in Java:

As you might expect, the API
provides a QueryOptionsManager to write, read, and delete query
options. To build
options as a Java structure, you use QueryOptionsBuilder and
QueryOptionsHandle. In particular, the call to
QueryOptionsHandle.withConstraints() specifies constraints on
the artist and writer properties. That makes it possible to restrict
search
phrases to these properties (similar to the key-value search shown
earlier). The QueryOptionsManager.writeOptions() call saves the query
options
under the name constraints.

By the way, because query options
are typically set up by an experienced developer and used by other
developers in applications, writing them requires a higher level of
permissions. While we’ll show how to build query options in Java,
you can also write query options as JSON or XML documents if you
prefer.

Now we can use the query
options to constrain the POJO properties where the search matches
the phrases. The ConstrainedSearcher example specifies the constraints
query options when constructing the StringQueryDefinition object and
then prefixes the Aretha Franklin phrase
with the artist constraint and the Otis Redding phrase with the writer
constraint.
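To picture what constraint-prefixed criteria look like as text, here is a small plain-Java sketch that assembles such a string. It assumes the default MarkLogic string search grammar, in which a constraint name is followed by a colon and a quoted phrase and terms are combined with AND; the helper names are hypothetical, and the exact string used by the tutorial’s ConstrainedSearcher is not shown here.

```java
class ConstrainedCriteria {
    // Prefixes a quoted phrase with a constraint name, e.g. artist:"Aretha Franklin".
    // This sketch does not escape quotation marks inside the phrase.
    static String constrained(String constraintName, String phrase) {
        return constraintName + ":\"" + phrase + "\"";
    }

    // Requires every term to match by joining the terms with AND.
    static String allOf(String... terms) {
        return String.join(" AND ", terms);
    }

    public static void main(String[] args) {
        String criteria = allOf(
                constrained("artist", "Aretha Franklin"),
                constrained("writer", "Otis Redding"));
        // Prints: artist:"Aretha Franklin" AND writer:"Otis Redding"
        System.out.println(criteria);
    }
}
```

The constraint names only have meaning on the server if query options defining them have been written beforehand.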

Apart from adding the query
options and constraint prefixes, the ConstrainedSearcher example is
unchanged from the StringSearcher example. The result output,
however, is much more
precise:

Constrained Search Output

document: /topsongs/Aretha-Franklin+Respect.xml

location: /topSong/artist

matched: Aretha Franklin

location: /topSong/writers

matched: Otis Redding

Only one song had this combination
of artist and writer, yielding our definitive answer.

Modifying Criteria Dynamically with Structured Search

From time to time, you might need
to modify or inspect criteria programmatically. Examples include
providing a GUI editor for search criteria, adding hidden criteria,
checking for invalid or unauthorized criteria, or generating
criteria to reflect the current state of an external resource.

As with query options, you use a
builder to create a Java structure. The
StructuredSearcher example builds a structured
search for the same constrained criteria that the previous example
expressed as a string.

The example uses
StructuredQueryBuilder to create a
StructuredQueryDefinition specifying the criteria
for the artist and writer constraints defined by the constraints
query options. Aside from using StructuredQueryDefinition instead
of StringQueryDefinition, this example is the same as the previous
example, qualifies the same documents, and produces the same
output. A Java program, however, could easily change one of the
terms or add new complex Boolean conditions without string
parsing.

If you prefer, you can also write
a structured query as a JSON or XML document. While the rest of the
tutorial will stick with string queries for consistency, in each
case, the search criteria could have been specified with a
structured query.

Analyzing a Dataset with Faceted Search

So far, the examples have answered
specific questions. To help frame questions, it’s also useful to
get a broad overview of the dataset. Facet analysis meets that
requirement by performing counts or other aggregates on the entire
dataset or a subset of interest. The next example supports facet
analysis by genre or over time.
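Conceptually, a facet is a count per distinct value. The plain-Java sketch below computes such counts over made-up genre lists and orders them by descending count. It is only an in-memory illustration with hypothetical names; MarkLogic computes facets from range indexes, and the sketch assumes each song counts once per distinct genre.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FacetCountSketch {
    // Counts how many documents carry each value, ordered by descending count.
    static Map<String, Integer> facetCounts(List<List<String>> valuesPerDocument) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> values : valuesPerDocument) {
            // Assumption of this sketch: a document counts once per distinct value.
            for (String value : new HashSet<String>(values)) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Stable sort: ties keep the alphabetical order inherited from the TreeMap.
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        Map<String, Integer> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            ordered.put(entry.getKey(), entry.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Made-up genre lists, one per song.
        List<List<String>> genresPerSong = Arrays.asList(
                Arrays.asList("Pop", "R&B"),
                Arrays.asList("Pop"),
                Arrays.asList("Soul"),
                Arrays.asList("R&B", "Pop"));
        for (Map.Entry<String, Integer> facet : facetCounts(genresPerSong).entrySet()) {
            System.out.println(facet.getKey() + " = " + facet.getValue());
        }
    }
}
```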

When you ran the bootstrapping utility at
the start of this tutorial, it configured the TopSongs
database. The configuration created range indexes on the
genre and week elements. A range index provides a basis for
calculating facets. Now, we’re ready to take advantage of those
genre and week range indexes.

As with the artist and writer
constraints in a previous example, the
FacettedSearcher example creates constraints for
the genre and week indexes in query options. The constraints
identify the range indexes and their datatypes. The example sorts
the genres in descending order by number of songs in the genre.

The source code fragment skips
over the construction of the QueryOptionsBuilder
and QueryOptionsHandle builder, which remains the
same as the earlier example. The call to
QueryOptionsHandle.setReturnResults() modifies
searches to return just the facet analysis and not a page of search
results.

The facetsongs query options have
done the heavy lifting of defining the facets. The
FacettedSearcher example specifies the facetsongs
query options when constructing the string definition. The example
performs the facet analysis on the subset of the songs that contain
the Grammy term anywhere in the
document. A search could use complex Booleans for a smaller subset
or no criteria for the entire dataset.

As with search results, SearchHandle parses the
list of facets into a Java structure with the values and their
aggregate counts. You can also read facets as JSON or XML.

The example output analyzes the
genres and weeks for all songs having the Grammy term.

Facet Output

facet: genre

Pop = 79

R&B = 71

…

Rhythm And Blues = 2

…

facet: week

1940-07-27 = 1

1940-08-03 = 1

1940-08-10 = 1

…

The output shows that
consolidating genre values like R&B and Rhythm And
Blues would improve the quality of the dataset. That’s fine
and to be expected from real-world Big Data. Cleaning up those
blemishes won’t change the big picture, so we can get value from
our dataset immediately. If later applications could benefit from
fixing these flaws, the facet analysis has shown us what to fix. We
can refine the dataset in place without getting in the way of
existing applications. Such flexible, progressive refinement
differs from traditional databases, where changes to data
structures and associations have a disruptive impact on
applications.

Summarizing a Dataset with Limits and Buckets

For some purposes, facet analysis
provides too much detail. To get a fast summary of a dataset, you
might want to aggregate ranges of values and eliminate
outliers.

Query options can limit the number
of facet values. When facet values are ordered by descending
frequency, the effect is to return the top values. Query options
can also define buckets for grouping facet values. The
BuckettedSearcher example refines the previous
query options to add a limit and buckets:

Other than referring to the
revised query options, the BuckettedSearcher
example has exactly the same search code as the previous example.
Because of the query options changes, however, the example produces
only the top genres and groups songs by decade instead of by
week.

Facet Output Limits & Buckets

facet: genre

Pop = 79

R&B = 71

…

Country = 8

facet: week

40s = 4

50s = 11

…

00s = 67
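The grouping and limiting visible in the output above can be mimicked in plain Java. This sketch is an in-memory illustration only, with hypothetical names and made-up dates: it derives the decade from each date, whereas the tutorial defines its buckets and limit in query options on the server.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BucketSketch {
    // Groups ISO week dates into decade buckets labelled like "40s" or "00s".
    static Map<String, Integer> countByDecade(List<String> isoWeeks) {
        Map<Integer, Integer> byDecadeStart = new TreeMap<>();   // e.g. 1940 -> 2
        for (String week : isoWeeks) {
            int decadeStart = LocalDate.parse(week).getYear() / 10 * 10;
            byDecadeStart.merge(decadeStart, 1, Integer::sum);
        }
        Map<String, Integer> labelled = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> entry : byDecadeStart.entrySet()) {
            // Two-digit labels assume the data spans less than a century.
            labelled.put(String.format("%02ds", entry.getKey() % 100), entry.getValue());
        }
        return labelled;
    }

    // Keeps only the first `limit` entries of an already ordered facet.
    static Map<String, Integer> limit(Map<String, Integer> orderedCounts, int limit) {
        Map<String, Integer> top = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : orderedCounts.entrySet()) {
            if (top.size() == limit) {
                break;
            }
            top.put(entry.getKey(), entry.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up week values for illustration.
        List<String> weeks = Arrays.asList("1940-07-27", "1940-08-03", "1955-01-01", "2005-03-05");
        Map<String, Integer> decades = countByDecade(weeks);
        System.out.println(decades);
        System.out.println(limit(decades, 2));
    }
}
```

Applying the limit to a facet ordered by descending frequency keeps the top values, which is how the genre list above ends up shorter.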

Counting Property Values for a Dataset

The broad understanding of the
dimensions of the dataset gained through facet analysis can frame
the investigation of specific questions. Knowing the genres for the
song dataset suggests that, if we want to investigate the breadth of
Quincy Jones’s career, we could look at the genres for the songs he
has produced. Such questions can be answered quickly based on a
range index.

First, the
ValuesLister example defines a producer constraint
(much like the artist and writer constraints in a previous
example). The query options also identify the range index supplying
the list of values (in this case, the genre values).

To query for the values, the
ValuesLister example constructs a
ValuesDefinition with the name of the values list
(genre) specified in the query options
(valuesongs). The example also
constructs a StringQueryDefinition, prefixes
Quincy Jones with the producer constraint (as with Aretha Franklin and the artist constraint previously), and initializes the
ValuesDefinition with the StringQueryDefinition to constrain the
values list to the songs produced by Quincy Jones.

The call to QueryManager.values() reads the
index and ValuesHandle parses the list into a Java
structure reflecting the values for the constrained subset. That’s
similar to the search() method with a SearchHandle in previous
examples, but in this case, reading directly from the index. As
elsewhere, you can also get the values list as JSON or XML. The
example iterates over the list to get each count and value.

The output shows that Quincy Jones
has produced a surprising diversity of hit songs:

Values Output

Hit songs per genre for producer Quincy Jones:

1 Country Soul

1 Dance

…

1 Glam Metal

2 Hard Rock

1 Jazz

…

1 West Coast Hip Hop

Counting Property Co-Occurrence for a Dataset

A top song is a hit in one or more
weeks and can be classified in one or more genres; thus, each top
song associates weeks with genres. These associations of weeks and
genres (called co-occurrence or, when read from the database,
tuples) can demonstrate trends over time for genres. For instance,
we can investigate the trend for songs produced by Quincy
Jones.

In the query options, the producer
constraint remains the same as the previous example (and so isn’t
included in the fragment below). The TuplesLister
example builds the week-genre tuple
list over the week and genre range indexes (instead of a values
list for one range index).

To query for the tuples, the
TuplesLister example constructs a
ValuesDefinition with the name of the tuples list
(weeks-genre) specified in the query
options (tuplesongs). The example
constrains the query to songs produced by Quincy Jones with the
same StringQueryDefinition as the previous example (and so doesn’t
include those statements in the fragment below).

The call to
QueryManager.tuples() reads the indexes and
TuplesHandle parses the tuples into a Java
structure reflecting the values for the constrained subset. The
example iterates over the tuples to get each value, formatting the
date values for weeks using a Java DateFormat.

The output satisfies the goal of
the investigation by showing that Quincy Jones started by producing
Country Soul / R&B songs and transitioned through other genres
to Hip Hop.

Tuples Output

Hit song genres by week for producer Quincy Jones:

1 1962-06-02 Country Soul

1 1962-06-02 R&B

…

1 1996-07-13 West Coast Hip Hop

1 1996-07-20 West Coast Hip Hop
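The co-occurrence counting behind output like the above can be pictured in plain Java: every week of a song is paired with every genre of that song, and each pair is counted. The sketch below is an in-memory illustration with hypothetical names and sample songs; on the server, the tuples are read from the week and genre range indexes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CoOccurrenceSketch {
    // Minimal stand-in for a song: only the two properties being paired.
    static final class Song {
        final List<String> weeks;
        final List<String> genres;

        Song(List<String> weeks, List<String> genres) {
            this.weeks = weeks;
            this.genres = genres;
        }
    }

    // Counts each (week, genre) pair; ISO dates sort chronologically as strings.
    static Map<String, Integer> weekGenreTuples(List<Song> songs) {
        Map<String, Integer> tuples = new TreeMap<>();
        for (Song song : songs) {
            for (String week : song.weeks) {
                for (String genre : song.genres) {
                    tuples.merge(week + " " + genre, 1, Integer::sum);
                }
            }
        }
        return tuples;
    }

    public static void main(String[] args) {
        // Sample songs for illustration; the second one is hypothetical.
        List<Song> songs = Arrays.asList(
                new Song(Arrays.asList("1967-06-03", "1967-06-10"), Arrays.asList("Soul")),
                new Song(Arrays.asList("1962-06-02"), Arrays.asList("Country Soul", "R&B")));
        for (Map.Entry<String, Integer> tuple : weekGenreTuples(songs).entrySet()) {
            System.out.println(tuple.getValue() + " " + tuple.getKey());
        }
    }
}
```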

Summary and Resources

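This tutorial provided a quick
overview of how to use the Java API to persist and query POJOs in
MarkLogic Server. In particular, you learned how to write and read
POJOs, search on property values, text, and combinations of
properties, build structured searches, analyze a dataset with facets,
limits, and buckets, and count values and co-occurrences from range
indexes.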

The MarkLogic Java API works with
other kinds of content besides POJOs. You can perform CRUD
operations on binary (including PDF and video), JSON, XML, and text
documents with collection, permission, and property metadata. You
can use multi-statement transactions and optimistic locking control
for CRUD operations. In search, you can take advantage of
geospatial search and faceting, aggregate functions over indexes
(including user-defined aggregates), and flexible snippeting and
element extraction for search results. Finally, you can extend the
API with server-side transforms and new resource services.

MarkLogic Server has too many
capabilities to explore in one tutorial, including Hadoop
integration, reverse queries and alerting, server-side content
processing pipelines (for conversion, enrichment, or metadata
extraction), flexible replication, and ingestion and monitoring
tools. Major corporations and government agencies have used
MarkLogic Server in mission-critical solutions for years. Whether
evaluating NoSQL platforms for an enterprise solution or rolling
out a fast implementation on the Express license for a great
idea all your own, you can learn more at http://developer.marklogic.com.