forceSnapshot - If set to “true” then a snapshot will be generated
regardless of the profile’s last_run timestamp. This option is passed from
the Discovery Search Engine if a user has checked the
Restart feed from fresh snapshot now checkbox on the Feeds tab of the
administrative web interface.

dryRun - If set to “true” then the last_run timestamp for the specified
profile will not be updated. Use this parameter when testing changesets so
that repeated calls can deliver the same items for a delta changeset.
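
Both options are passed as query string parameters when requesting a changeset. For
example (the host, port, publisher, and profile names here are illustrative):

# Force a fresh snapshot for the profile
curl 'http://localhost:8080/ws/publisher/myPublisher?profile=myProfile&forceSnapshot=true'

# Generate a changeset without updating the profile's last_run timestamp
curl 'http://localhost:8080/ws/publisher/myPublisher?profile=myProfile&dryRun=true'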

Note that the XML schema for the configuration is versioned. Earlier versions of the configuration XML
continue to be supported; however, we recommend using the latest version of the XML schema in order to take
full advantage of the newest features.

The examples/v7 directory of the distribution contains sample XML configuration
files for various popular databases. You can copy any of these to
discovery_datatool.xml (in the same directory as datatool.properties) to get started.
Examples for the earlier schema versions are in the examples/v* directories.

As the Discovery Datatool is changed, we strive to maintain backwards compatibility. One way that we do this
is by introducing new XML schema versions as new features are added.
The Data Tool continues to support earlier versions of the configuration XML
by using the schema namespace to recognize how to parse the file.

In general, moving to the newer schema is trivial.
There was, however, a major structural change in the XML schema between version 3 and version 4.
To ease migration, utilities/configuration_converter_v3_to_v4.xslt can be used to migrate from
XML schema version 3 to version 4. You should do this if you want to use any of the features added since version 3,
such as subqueries, bulk/full feeds, or vertical table support.

Here is an example of how you would use this stylesheet with xsltproc and xmllint to convert IntegrationTest-3.xml.
This assumes that you have the command line tools xsltproc and xmllint installed:
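
# A sketch of the conversion; the output file name IntegrationTest-4.xml is
# just an example
xsltproc utilities/configuration_converter_v3_to_v4.xslt IntegrationTest-3.xml \
  | xmllint --format - > IntegrationTest-4.xml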

The configuration XML file contains four main sections: accessControl, dataSources, profiles and publishers.
It contains all the SQL statements that are used, giving you full control
to optimize them as needed.
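
As a rough sketch, the file has the following overall shape. The root element name and
namespace shown here are placeholders for those of the schema version you are using, and
the section contents are elided:

<config xmlns="...">
  <accessControl> ... </accessControl>
  <dataSources> ... </dataSources>
  <profiles> ... </profiles>
  <publishers> ... </publishers>
</config>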

<sqlProfile name="profile" dataSource="dataSource">
  <!-- createSql is optional. By including it we enable automatic profile creation. -->
  <createSql>
    insert into changeset_profile (name) values (:name)
  </createSql>
  <retrieveSql startColumn="startTime" endColumn="endTime">
    select last_run as startTime, CURRENT_TIMESTAMP as endTime
    from changeset_profile
    where name = :name
  </retrieveSql>
  <updateSql>
    update changeset_profile
    set last_run = :lastRun
    where name = :name
  </updateSql>
</sqlProfile>

The interaction with the SqlProfile is as follows.

Whenever a request for a /ws/publisher/PUBLISHER_NAME?profile=PROFILE_NAME
is made, the retrieveSql is executed. It must return start and end timestamps
in the columns corresponding to retrieveSql/@startColumn and
retrieveSql/@endColumn. The query must either return one row or no rows.

If no rows are returned and there is no createSql, then the changeset pull will
fail with a NoSuchProfileException. If there is a createSql, then it is executed
with the profile name bound as the “name” parameter. As you might expect,
this query is responsible for creating the named changeset profile. After that
query executes, then the retrieveSql is executed to obtain the start and end
timestamps.

After the successful generation of the changeset, the updateSql is executed
in order to update the named profile’s lastRun to the end timestamp
that was used in generating the changeset. The updateSql is passed the
profile name as the “name” parameter and the new last run timestamp as the “lastRun”
parameter. The updateSql is expected to modify exactly one row. If it does not,
then a RuntimeException is thrown.
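
For reference, a backing table shaped roughly like the following would satisfy the queries
in the example above; the exact column types depend on your database:

-- Illustrative table backing the sqlProfile queries above
create table changeset_profile (
    name     varchar(255) not null primary key,
    last_run timestamp
);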

A sqlPublisher element requires a dataSource. A profile is optional.
If no profile is specified, then the publisher will never have a start or end time available for its queries.
An optional propertyCase attribute specifies how the case of item properties should be handled.

The sqlPublisher element optionally contains an element specifying the queries for a snapshot, full, or bulk changeset,
as well as any number of elements defining the queries for incremental (delta) changesets. These elements ultimately contain the
SQL queries used to create set-item, add-to-item, and remove-item elements in the resulting changeset.

Starting with schema version 7, you can specify if and how you want item property names to be normalized.
With schema version 7, the default is to preserve case.
Older schema versions continue to have the legacy behavior of lowercasing property names
derived from column names, but not those derived from pivoted values from merge-columns.
The valid values for propertyCase are:

preserve

Use whatever case we get back from the database

lower

All property names are converted to locale lower case

upper

All property names are converted to locale upper case

legacy

Property names from sql column labels are converted to lower case, but property names from vertical table values are preserved.

A SqlPublisher can be configured to produce two categories of changesets.

Complete changesets: one of snapshot, bulk, or full

Incremental changesets: delta

A SqlPublisher may optionally contain one complete changeset definition. A complete changeset
definition is provided by the snapshot, bulk, and full elements. As the name suggests, a complete changeset
should include all the items in your dataset.

A SqlPublisher may contain any number of incremental changeset definitions, each one defined in a delta element.
A delta changeset includes only the items that have been modified or deleted since the previous incremental or
complete changeset was created for a particular profile.

The changeset category elements contain one or more set-item elements, which in turn contain a query and optional
merge-columns and subquery elements.
The snapshot and delta elements may also contain add-to-item and remove-item elements, which in turn contain a query and,
in the case of add-to-item, optional merge-columns and subquery elements.
The bulk and full elements cannot contain any remove-item or add-to-item elements, but can contain a query and optional
merge-columns and subquery elements.
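
Putting these pieces together, a publisher definition has roughly the following shape. The
element nesting follows the description above; the name attribute, propertyCase value, and
the SQL placeholders are illustrative:

<sqlPublisher name="publisher" dataSource="dataSource" profile="profile" propertyCase="preserve">
  <snapshot>
    <set-item>
      <query> <!-- SQL returning the full set of items --> </query>
      <!-- optional merge-columns and subquery elements -->
    </set-item>
  </snapshot>
  <delta>
    <set-item>
      <query> <!-- SQL returning items modified during the changeset window --> </query>
    </set-item>
    <remove-item>
      <query> <!-- SQL returning items deleted during the changeset window --> </query>
    </remove-item>
  </delta>
</sqlPublisher>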

When the Data Tool processes a changeset for a SqlPublisher, the queries inside of the set-item elements are used to
create set-item elements in the resulting changeset to represent items that were created or modified.
The queries inside of the add-to-item elements are used to create
add-to-item elements in the resulting changeset to represent properties that
are set on preexisting items.
The queries inside of the remove-item elements are used to create remove-item elements in the resulting changeset
to record when items are deleted.

In all queries and subqueries, the column names are used for property names unless they are overridden by specifying a column alias.

The SQL is processed by a org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate
which supports named parameters.

The following bound parameters are available for your queries. Bound parameter names are not case sensitive.
Note that start and end are null for snapshot, bulk and full queries, so you probably do not want to include them in
your sql:

start
The inclusive start of the date range for this changeset as a java.util.Date.

end
The exclusive end of the date range for this changeset as a java.util.Date.

kind
‘snapshot’ or ‘delta’
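
For example, a delta set-item query might use the start and end parameters like this
(the table and column names are illustrative):

select p.id, p.name, p.description
from product p
where p.last_modified >= :start
  and p.last_modified < :end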

Each set-item can contain one query element, and zero or more subquery elements. Each set-item is executed in sequence.
This supports creating changesets that union across various different views or tables.

Each add-to-item can contain one query element, and zero or more subquery elements. Each add-to-item is executed in sequence.
This supports creating changesets that union across various different views or tables.

If your database uses a vertical table design, then you will want to pivot multiple rows from your vertical table into one
conceptual set-item in the generated changeset. The Data Tool supports this kind of pivot with one or more merge-columns
elements.

With merge-columns, your query is expected to return multiple rows for each item. For each row returned,
the merge-columns element tells the Data Tool to use the value from keyColumn for the item property key and the value from
valueColumn for the item property value. Multiple values returned across multiple rows for the same item and property
are combined into an array structure, as shown for the “ratings” entry in the example below.
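
The following sketch illustrates the idea; the table, column names, and values are
illustrative:

<set-item>
  <query><![CDATA[
    select id, property_name, property_value
    from product_properties
    order by id
  ]]></query>
  <merge-columns keyColumn="property_name" valueColumn="property_value"/>
</set-item>

Rows such as (1, ratings, PG) and (1, ratings, G) would be pivoted into a single set-item
for id 1 whose “ratings” property contains both values as an array.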

The engine requires that data available through one-to-many relationships be “flattened” or denormalized
as part of the main item property query. A subquery is typically used to pull in values from
child tables. Subqueries provide a very flexible means for managing these relationships.

Subqueries are optional. Each set-item element can contain zero or more subquery elements.
Subqueries are executed after the single main required query.

Subqueries are quite powerful and flexible. Here are some ways that you might use a subquery.

Map a list of values from a single column in a child table to an item property as an array of values.

Map a list of values from a single column in a child table to an item property as a single string of delimited values.

Map a list of subquery rows to multiple item properties with each column mapped to an item property. Each property containing an array of values.

Map a list of subquery rows to multiple item properties with each column mapped to an item property. Each property containing a single string of delimited values.

Map a list of subquery rows to a single item property containing a map with each column as a map key for a string of delimited values.

Map a list of subquery rows to a single item property containing a list of maps (one per subquery row), with each column as a map key for that column’s value.

Use a discriminator column returned from the subquery to map a list of subquery rows to a single item property containing a map with each discriminator value used as a map key with its value being a map of column name to value for that discriminator’s row.

The results from each subquery are merged into the set-item element. The column
names or aliases are used as property names, overwriting any entries previously
populated by the main query. The subquery results can optionally be scoped by
specifying either a property or propertyPrefix attribute on the subquery.
Specifying a property means use that as the set-item property name. Specifying
a propertyPrefix means prepend the prefix to the column name(s) or alias(es)
when creating set-item property names.

Subquery results can be represented as arrays of values, a delimited string,
arrays of maps, maps of arrays, maps of delimited strings, or mapped to
multiple entries on the main set-item with the values as arrays or delimited
strings. The choice of how multiple values are mapped is controlled by the
type attribute.

If a subquery returns more than one row, then the row values are represented as
an array. If a delimiter is specified, then the row values returned by the
subquery are joined by the delimiter into a single value.

If a subquery returns more than one column, and it is mapped using a property,
then the property value will be represented by a dictionary (map) or an array of
dictionaries, with each subquery row consisting of a single dictionary of
column name to column value. Each column is mapped to an entry in the struct.
Column aliases are respected, so you can re-map the database column names to
arbitrary strings in the returned dictionaries.

By specifying a propertyPrefix, each column from the subquery is merged into the
main set-item using the propertyPrefix combined with the column name or column
alias to create the set-item property name. The values are either a single
value (when the subquery returns a single row), an array of values, or a string
of delimited values, depending on whether the type specified was array or
delimited.

Subqueries that return multiple columns may use the discriminator feature. When
a discriminator is used, the values of each row are mapped to a dictionary with the
discriminator value as the key. If multiple subquery rows
return the same discriminator value, then the values from the last row returned
overwrite any previous values for that discriminator.

The default subquery type is array - which builds an array element in the
set-item structure for the multiple values for each column returned. The
other valid choice for type is delimited - which builds a single delimited
string containing the multiple values. For a type of delimited, a delimiter
can be specified. If unspecified, the default delimiter is a comma.

All the columns or aliases returned by the main query are available to the
subquery as named bound parameters. The bound parameters from the main query
(:start, :end, and :kind) are not available to the subquery. Bound
parameter names are not case sensitive. If the main query returns null or an
empty string for a column, then the corresponding bound parameter will have a
null value. Bound parameters in a subquery that do not have a corresponding
column in the main query will have a null value.

The subquery examples below use the id returned from the main query as a
bound parameter.

<subquery><![CDATA[
  select name as color from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery type="delimited" delimiter="|"><![CDATA[
  select name as color_delim from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-default"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-array" type="array"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-delimited" type="delimited"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="color-delimited" type="delimited" delimiter="|"><![CDATA[
  select name from SubqueryColors where parent_id=:id order by name
]]></subquery>
<subquery property="address" discriminator="addr_type"><![CDATA[
  select addr_type, street, city, state, zipcode from SubqueryAddress where parent_id=:id order by id
]]></subquery>
<subquery propertyPrefix="product"><![CDATA[
  select id, price, description from order_lines where order_id=:id order by id
]]></subquery>

Create or obtain an RSA key and server certificate (potentially signed by a
certificate authority).

Note that the Common Name on the certificate must match the hostname in
the feed URL to be used by the discovery engine.

Import the key and certificate into a Java Key Store for the Data Tool.

If the server certificate is not signed by a root certificate
authority, then you will need to add it to your server’s shared trusted
key store: place a copy of the certificate (public key)
with a file extension of .pem or .crt into engines/xxxx/certs.

Configure the Data Tool start script to pass in the command line options
for the https-port, keystore-file, keystore-pass and key-pass. Exactly how
to do this is detailed below.

The full process for creating an RSA key, X.509 certificate, and Java Key Store
is well documented on the Internet. These minimal steps will get you started
with a self-signed certificate. You can substitute the name datatool
with anything you want.
Note that you need to have openssl and Java 1.5 or newer installed.

# Generate RSA key.
# Note that the Common Name must match the host name that will be in the public
# feed URL for the Data Tool
openssl genrsa -des3 -out datatool.key
# Create x509 certificate from the key
openssl req -new -x509 -key datatool.key -out datatool.crt
# Combine the key(s) and the certificate into a single pkcs12 file
openssl pkcs12 -inkey datatool.key -in datatool.crt -export -out datatool.pkcs12
# create a Java key store and import the key and certificate
keytool -importkeystore -srckeystore datatool.pkcs12 -srcstoretype PKCS12 \
-destkeystore datatool.jks

You will need to place the keystore file datatool.jks in a location where the
Data Tool will be able to read it when it starts up.

Maximum heap to allocate in megabytes. Uses JVM default if not specified.

jvm.logging

Override for specifying logging configuration. Example -Djava.util.logging.config.file=/special_location/logging.properties. If unspecified and a logging.properties file exists in the same directory as the datatool.properties, then that file is used, otherwise the logging.properties from the Discovery Data Tool release directory is used.

validate_sql

Whether the Discovery Data Tool should validate its sqlPublishers’ SQL statements against the database. While no results are retrieved, this option can slow application start-up. Defaults to true. Set to false to disable validation.
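
For example, the corresponding entries in datatool.properties might look like this (the
logging configuration path is illustrative):

jvm.logging=-Djava.util.logging.config.file=/special_location/logging.properties
validate_sql=false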

Start by determining what your key business entities are. How are they related?
What do you plan to search and display? Remember that the Discovery Engine
cannot perform joins, so each item that you include in your changeset must
include all the data that you want to use for search or display
- including any data from related entities. Based on your proposed searches, go
ahead and create a dimensions.xml that you’ll use to validate your
initial changeset. It’s an iterative process, so this is just a starting point.

If you want to support processing incremental changes in the Discovery Engine,
then your entities will need a last-modified timestamp and you will need some
mechanism for tracking when items are deleted. Database triggers can be useful
to ensure that last-modified timestamps are updated appropriately - especially
if you choose to cascade updates on dependent objects to your main items (e.g.
updating a broker updates the last-modified timestamp on all of the broker’s
listings).

For tracking deleted items, two strategies are soft deletion and a
deleted items table. To implement soft deletion, rows are not actually
deleted; a status flag is updated instead. All queries include the status
flag so that you don’t inadvertently include deleted items. If you want to
implement a hard delete, where your items are actually deleted, a database
trigger can add a row to a deleted items table each time an item is
deleted. The deleted items table contains, at a minimum, the item
identifier and the timestamp of when the item was deleted. With either of
these mechanisms, it is trivial to create SQL queries that return the list of
item identifiers that were deleted during a specific timeframe.
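
A sketch of the deleted items approach follows; the table and column names are illustrative,
and the second statement is the kind of query you would place in a remove-item element:

-- Table populated (for example, by a trigger) each time an item is deleted
create table deleted_items (
    item_id    varchar(255) not null,
    deleted_at timestamp    not null
);

-- Items deleted during a delta changeset's window
select item_id
from deleted_items
where deleted_at >= :start
  and deleted_at < :end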

Item identifiers in the Discovery Engine must be
unique. In an SQL database, it is common for each table to have its own primary
key sequence. In the engine, all items share a global identifier namespace (as
if they were all in one “table”), so you cannot simply use the database primary
key as your item id. You can, however, trivially modify your database
identifier by appending or prepending a type discriminator (e.g. “broker_1”
instead of just “1”).
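
For example (the table and column names are illustrative, and string concatenation syntax
varies by database):

select 'broker_' || b.id as id,
       b.name,
       b.phone
from broker b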

Once you’ve considered all the above points, you’re ready to build and test the
SQL that you will use to export your entities in a denormalized form for your
changesets, and the SQL that you will use to export the current deleted items
list. Pay particular attention to the scalability of your queries as the size
of your data may increase over time. You can test these initial SQL queries out
in the tool of your choice.

Remember that there are two kinds of changesets that you can choose to support:
complete changesets and incremental changesets.

Complete changesets (snapshot, bulk, full), contain all the data for all the
items that should be live. As such, there is no need to include a
remove-item query - unless you need to remove items that you just created
(and this is not permitted in bulk changesets). Complete changesets have an
:end timestamp, but do not have a :start timestamp.

Incremental changesets (delta) only contain information about items that were
modified or deleted since the last incremental changeset. Use the :start
and :end bound parameters in your SQL. Note that the general approach is
WHERE lastUpdated >= :start AND lastUpdated < :end.

After you write your basic SQL, create a discovery_datatool.xml with the
building blocks described in this document. If you want to create incremental
changesets, then you must define a sqlProfile. Create one or more
sqlPublisher elements and wire everything together so that you can test your SQL in
the Data Tool and test the changesets it generates with your engine. After all,
what good is a changeset if you cannot search or retrieve the data as you
planned?

For testing, you can use curl on the command line, redirecting the output
to a file. You can test the data tool without parameters to get a complete set
of data, and test using a profile. I often use the dryRun=true parameter
when testing a profile so that the profile will not be updated in the database.
This way you can repeatedly produce delta changesets with the same :start
and :end to test your SQL changes against the same subset of items.

If you have access to modify the database, then you can also delete or update
the test profiles that you used to test incremental feeds. If you only have
read-only db access, then you can still test the delta queries even without a
profile by using the startDate and endDate parameters. It should now be
clear that many of these parameters exist solely to enable testing.
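
For example, to pull a delta changeset without permanently updating the profile, or to test
the delta queries without a profile at all (host, port, publisher, and profile names are
illustrative, and the date values are placeholders in whatever format your configuration
expects):

# Delta for a profile, leaving its last_run timestamp untouched
curl -q 'http://localhost:8080/ws/publisher/myPublisher?profile=test&dryRun=true' > delta.xml

# Delta without a profile, using explicit start and end dates
curl -q 'http://localhost:8080/ws/publisher/myPublisher?startDate=...&endDate=...' > delta.xml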

To test a complete changeset (e.g. full, snapshot, or a delta with all data):
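
# Complete changeset, redirected to a file (host, port, and publisher name are illustrative)
curl -q 'http://localhost:8080/ws/publisher/myPublisher' > complete.xml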

When you use curl -qi, the HTTP response headers are included in the
output. This is useful for confirming that the changeset type is as expected
(delta, full, etc.) by looking at the X-t11e-type header. If you use curl -qi, you
will need to edit the output to remove the headers in order to use
the changeset (or it will not be valid XML).

Query string parameters useful for testing are:

startDate

Test delta with :start SQL bound parameter in lieu of using a profile

endDate

Test delta with :end SQL bound parameter in lieu of using a profile

profile

Refers to the profile stored in the database. Depending on the datatool configuration, profiles may be auto-created

dryRun=true

Do not update the profile database row. This way you can repeatedly test the same delta timespan window

The datatool will validate the sql at startup using the database/driver to
parse it, and fail to start if invalid. To fully validate the sql you also need
to pull data through it. If you have subqueries, then you’ll want to test
scenarios where your subqueries return 0, 1, and >1 rows. This is just basic
branch coverage of your queries. If your sql has conditions (e.g. case, if)
you’ll want to test all the permutations.

The changeset contains all the expected data in the structure that you expect
(e.g. if you’re using subqueries, did you want parallel arrays of primitives
but got an array of dictionaries)

Deltas using a profile only contain the modified and deleted items that you
expect

All the publishers and sql are used. If you have different SQL for a <full> and
<delta> in a publisher, then you need to request and generate both kinds of
changesets in order to test the SQL for both.

The XML is valid by running it through xmllint

The changeset can be processed by the engine as expected. You can manually
upload the XML via the admin UI. This sometimes uncovers errors that you
missed, for example XML that is invalid because an exception trace was embedded
in the output.

Upload a dimensions file, and verify that the data in the changeset can be
indexed and queried as expected. The data might be in the changeset, but not in
a manner that can be indexed, or maybe there’s a mismatch in property names
between the dimension and the changeset.

Incremental feeds to the engine work as expected. Configure a recurring feed
for the engine via the admin UI using your data tool, make data changes, and
verify that the changes reach your engine.

If you request a changeset and the data tool encounters a runtime error in the
middle (e.g. sql that failed in a subquery part of the way through), then the
resulting changeset will be invalid with the exception trace included in the
generated output.

You will see most Data Tool errors in the log/datatool-0.log file,
especially errors validating the SQL at startup. The datatool will not start if
the SQL fails to validate. When posting a changeset to your Discovery Engine,
you may see some errors processing the changeset immediately in the Discovery
Engine HTTP response, but some errors will only appear in the Discovery Engine
log file log/discovery-0.log.