Native functions are a convenient way to expose a function implemented in a native language to code written directly in SPL. Such code can be used in a Custom operator, SPL functions, or primitive operators that allow custom logic. In InfoSphere Streams 3.0.0, native functions must be implemented or wrapped in C++ (e.g., using a library with C/C++ bindings). The C++ implementation is then compiled into a shared library, which is linked to an SPL program at compile time.

When writing native functions, one might want to use a complex SPL tuple type as a return value. In its current version, the SPL compiler does not allow a native function to return a tuple type. This is because native functions are compiled into a shared library, separately from the SPL program. When compiling a shared library, developers do not have access to specific SPL tuple types, which are only generated when compiling an SPL program. Even if one forces the compiler to generate an SPL tuple type before compiling the shared library, the SPL compiler does not guarantee that the generated tuple types will not change or move. As a result, shared libraries can only access tuples via their base class SPL::Tuple, which is not a concrete type. This means that one cannot use an SPL::Tuple when concrete types are needed in C++, such as in the return of a method or in STL containers.

A trick to overcome this limitation is to pass a mutable SPL::Tuple reference as a native function parameter and then use the reflective API to access tuple attributes. A mutable parameter in SPL is equivalent to a non-const reference parameter in the C++ API.
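The out-parameter pattern can be sketched in plain C++ without the Streams headers. The classes below are hypothetical stand-ins, not the SPL runtime API: TupleBase plays the role of an abstract tuple base class, GeneratedTuple plays the role of a type that only the main program knows about, and the setter names are invented for illustration. The attribute names anInt and anIntList follow the tuple type used in this post.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an abstract tuple base class: attributes are
// reachable only by name, and the class cannot be returned by value.
class TupleBase {
public:
    virtual ~TupleBase() = default;
    virtual void setInt(const std::string& name, int value) = 0;
    virtual void setIntList(const std::string& name,
                            const std::vector<int>& value) = 0;
};

// Hypothetical concrete tuple, playing the role of a type that exists only
// in the main program. The library function below never names this type.
class GeneratedTuple : public TupleBase {
public:
    int anInt = 0;
    std::vector<int> anIntList;

    void setInt(const std::string& name, int value) override {
        if (name != "anInt") throw std::invalid_argument("no int attribute: " + name);
        anInt = value;
    }
    void setIntList(const std::string& name,
                    const std::vector<int>& value) override {
        if (name != "anIntList") throw std::invalid_argument("no list attribute: " + name);
        anIntList = value;
    }
};

// Library side: fills the caller's tuple through a non-const reference
// instead of returning a tuple, using only the base-class interface.
void populateTuple(TupleBase& out) {
    out.setInt("anInt", 42);
    out.setIntList("anIntList", {1, 2, 3});
}
```

The caller owns the concrete object and passes it in, so the library never needs to construct, copy, or return a type it cannot name.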

The following code shows an example of a Custom operator that uses a native function named populateTuple (line 19) to fill up a tuple of type tuple<int32 anInt, list<int32> anIntList> (line 14).

The SPL reflective APIs can also help when one needs to use native functions to populate an SPL collection that contains tuples. The example below shows an SPL invocation of a native function that receives a map as a parameter (line 21).

A new feature of InfoSphere Streams 3.0 allows dynamic filter expressions for applications that use Export and Import operators. The Export operator allows an application to publish a stream that can be consumed by other applications. By using the Import operator, the consuming application can subscribe to such a stream. These operators are especially useful in two scenarios: (i) to dynamically compose applications, where different segments of the processing graph can start and stop during runtime, and (ii) to share common stream processing segments among different applications and, therefore, avoid redundant computation. One example scenario is an application that converts social media data from an external feed to SPL tuples and then exports that stream. Different applications can then import the social media stream and execute specific analytics for a given topic of interest. As topics of interest can vary over time, the analytics applications can be brought up and down without affecting other applications that are still consuming from the common stream source. The Export and Import operators are ideal for such cases.

A recurring application sharing scenario is when different consuming applications are interested in processing different subsets of the exported stream. Prior to Streams 3.0, the importing application would receive all tuples available in the exported stream. This would result in waste of network resources, as the whole stream was transmitted but a filtering operation on the consuming application side would immediately discard many tuples. In a scenario where there are many different consumers, transferring the full stream multiple times wastes a significant amount of resources.

To reduce network transfers, developers can take advantage of dynamic filter expressions in Import operators. During application instantiation, the filtering expression is effectively shipped to the Export operator side. During runtime, the Export operator evaluates the filtering expression to decide which tuples should be transmitted to the consuming application.
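As a rough model of exporter-side filtering (not the Streams implementation), the sketch below represents each importer's filter as a predicate that the exporter evaluates before sending. The Record fields mirror the Schema used in this post; the names Filter and countTransmitted are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Mirrors the post's Schema: int64 streamSubset, rstring stringSubset, uint32 random.
struct Record {
    std::int64_t streamSubset;
    std::string stringSubset;
    std::uint32_t random;
};

// Hypothetical representation of one importer's filter expression.
using Filter = std::function<bool(const Record&)>;

// For each subscriber, counts the tuples the exporter would send when it
// evaluates that subscriber's filter before transmitting.
std::vector<std::size_t> countTransmitted(const std::vector<Record>& stream,
                                          const std::vector<Filter>& filters) {
    std::vector<std::size_t> sent(filters.size(), 0);
    for (const Record& r : stream) {
        for (std::size_t i = 0; i < filters.size(); ++i) {
            if (filters[i](r)) ++sent[i];
        }
    }
    return sent;
}
```

Without filtering, every subscriber would receive stream.size() tuples; with it, each subscriber receives only the tuples its own predicate accepts.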

The following figures show some SPL code using this new feature. All examples use a stream of type Schema declared as "int64 streamSubset, rstring stringSubset, uint32 random".

The segment below shows an Export operator that exports the stream produced by a Custom operator, which consumes a stream produced by a FileSource operator. In this example, the Custom operator forwards downstream all incoming tuples without doing any specific transformation. In reality, developers may substitute this operator with any arbitrary SPL topology. The invocation of the Export operator does not need to change from prior versions of Streams.

On the Import side, one must now use the filter parameter, as in the example below. This instance of the Import operator receives only tuples where the streamSubset attribute has value 1 and the stringSubset attribute has value “streams”. This filtered, imported stream is processed by a Custom operator, which then sends the output directly to a FileSink. As in the example above, the Custom operator just illustrates a sample topology. The filter parameter in Import allows the construction of more complex expressions, similar to the subscription parameter.

The figure below shows the Streams instance graph when running the applications above. To illustrate the power of the dynamic filter expressions, we also run two other applications. The applications are similar to the importer application above, but use the following filtering expressions:

As highlighted by the red rectangle, the Custom operator of one of the importing applications (right side) receives only 2 out of the 10 tuples submitted to the Export operator (10 lines in ‘sample.dat’). The filtering for the “streams” keyword allows 4 tuples to be transmitted and the filtering for “sources” or “sinks” keywords allows 3 tuples to be transmitted. The total number of tuples transmitted by the Export operator using filtering is 9, while a configuration without filtering would transfer 30 tuples.

Summary: When the application consumes only a subset of the tuples of an exported stream, use the filter parameter of the Import operator to save network resources.

Some operators can be implemented directly in SPL using Custom operators, rather than defining them as native operators in C++ or Java. Implementing logic using Custom operators is typically suitable for operations that do not need to call out to pre-existing libraries, and whose logic operates on, and can be fully expressed with, SPL data types. Writing Custom operators requires less code and simplifies development as there is no need to switch to another language and write an operator model.

For example, consider parsing system messages from /var/log/messages on a Linux system. A typical example looks like the following:

We would like to write an SPL operator that parses such messages. The input tuples will contain a single string, which contains a single message. The output will be a tuple where each entry in our informal grammar above is its own string attribute:

Our example application reads messages directly from /var/log/messages, line by line. To parse these messages, we separate the raw line into tokens separated by spaces. From there, we can associate indices with attributes. For the date attribute, we know that entries 0, 1 and 2 are part of the date, so we combine them (effectively un-tokenizing them). The message itself can have any number of tokens separated by spaces, but we know that it must start at index 5.
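The index-based parsing described above can be sketched in C++ as follows. This is an illustration of the logic rather than the SPL Custom operator itself. It assumes that index 3 holds the hostname and index 4 the service (the text only fixes indices 0 to 2 and 5), and the struct and function names are hypothetical. Splitting with the stream extraction operator collapses runs of whitespace, which may differ from how a given tokenizer treats repeated spaces.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One string attribute per part of the log line.
struct ServiceMessage {
    std::string date, hostname, service, message;
};

// Splits `raw` on whitespace and maps token positions to attributes.
// Returns false when the line is too short to hold date, hostname and service.
bool parseLine(const std::string& raw, ServiceMessage& out) {
    std::istringstream in(raw);
    std::vector<std::string> tokens;
    for (std::string token; in >> token;) tokens.push_back(token);
    if (tokens.size() < 5) return false;

    out.date = tokens[0] + " " + tokens[1] + " " + tokens[2];  // re-join entries 0..2
    out.hostname = tokens[3];  // assumed position
    out.service = tokens[4];   // assumed position
    out.message.clear();
    for (std::size_t i = 5; i < tokens.size(); ++i) {  // message starts at index 5
        if (i > 5) out.message += " ";
        out.message += tokens[i];
    }
    return true;
}
```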

Implementing this logic directly in SPL with a Custom operator is easier than dropping down to C++ or Java. However, as written, there is a problem: we cannot reuse this operator. If we wanted to parse messages in the same way in this or in another application, we would need to copy this code. It exists only inside of this Custom operator; there is no way to "name" this custom logic.

Wrapping Custom operators in composites solves this problem. By making a Custom operator invocation the only part of a composite's stream graph, we can use the composite operator as a way to "name" that logic. For example:

Note that the ParseMessages composite knows about the ServiceMessage type. Because of this fact, our composite is not type generic. In future posts, we will explore the various kinds of genericity available to composites.

Technically, however, the above code does have some type genericity, just not much. We will explore what that genericity is in a moment, but first let's go over the kinds of genericity there are in SPL, and which ones can apply to a composite operator.

Operators that can handle any number of input and output ports are port number generic. Composites cannot be port number generic; composites must define the exact number of input and output ports they provide. Primitive operators, however, can be port number generic. In our example, ParseMessages defines one input and one output port.

Composites can be type generic, which means that they can handle streams of any type. ParseMessages is not type generic, because the type of the stream Out is fully specified to have the type ServiceMessage.

The type for In, however, is partially type generic. The type is not fully specified, although we have made one assumption about it: that it contains an attribute named raw that is of type rstring. In the example application we previously developed, that attribute was the only attribute in the stream type, but that does not need to be true in general. For example, we could invoke ParseMessages in this way:

Even though ParseMessages does not know about the attributes processedTime and networkName, it can still handle the type AugmentedRawMessage on its input stream because it has an rstring attribute named raw.

However, we can still make ParseMessages attribute generic. We can do this by modifying the composite to take an attribute as a parameter:

Composites that take attributes as parameters are attribute generic because they make no assumptions about an attribute's name. The type of the attribute, however, cannot be generic. In the above version of ParseMessages, the attribute we provide upon invocation must have type rstring, like this example:

If we tried to provide an attribute that was not an rstring (such as processedTime), the compiler would raise an error the first time it tried to use the inappropriately typed attribute.
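For readers who think in C++, a loose analogy to attribute genericity is a function template that takes a pointer to member: the caller chooses which member is read, but its type is fixed. This is only an analogy, not how SPL composites are implemented. The struct reuses the attribute names from the AugmentedRawMessage example (with long standing in for whatever type processedTime has), and lineLength is a hypothetical helper.

```cpp
#include <cstddef>
#include <string>

// Hypothetical input type with attributes the function knows nothing about.
struct AugmentedRawMessage {
    long processedTime;
    std::string networkName;
    std::string raw;
};

// Generic in the record type T and in which member is read, but the member
// must be a std::string: passing &AugmentedRawMessage::processedTime
// is rejected at compile time.
template <typename T>
std::size_t lineLength(const T& record, std::string T::* attribute) {
    return (record.*attribute).size();
}
```

As with the composite, the name of the attribute is chosen at the call site, and a wrongly typed attribute is caught by the compiler rather than at runtime.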

Using a similar idea, we can still make ParseMessages even more generic. While we want to ensure that the output stream has the specific attributes date, hostname, service and message, there is no reason for us to *limit* the output stream's type to those attributes. However, because we fully specified the type name, we have forced that to be the case. We can remove that restriction by not fully specifying the type:

Note that we can no longer create a tuple literal of our output tuple type: creating a tuple literal requires knowing the full type of a tuple, but we want to remain partially type generic. To do so, we only require that the output stream's type contains the attributes date, hostname, service and message, and that they are of type rstring. We invoke this composite in this way:

When processing data, it is common to perform data enrichment. Enrichment is useful when the data source contains only partial information, but the analytics require additional information that is available only in other data sources. The InfoSphere Streams database toolkit contains two enrichment operators: ODBCEnrich and SolidDBEnrich. These operators require the data used for enrichment to be in a database.

In this post, we illustrate how to develop an SPL composite that serves as a generic file-based enricher. In this solution, we use a FileSource to scan the enrichment data from a file, and then store it in an in-memory map in a Custom operator. This map is keyed by the attributes used to correlate incoming tuples with the enrichment data. If the enrichment data fully fits in memory, this solution can be more efficient than querying the database every time a tuple must be enriched.

The code below shows a sample invocation of the file-based enricher (operator FileEnrich). In this example, the program generates a stream called Data, which has attributes id and city. Data is then consumed by FileEnrich, which outputs an enriched stream using the id attribute as a key. The output stream contains both Data attributes (id and city) and EnrichT attributes (id and name). Note that because EnrichT and Data share the id attribute, id appears only once in the EnrichedData stream. The FileEnrich operator receives the following parameters:

key - attribute in the input stream that is used for correlation

enrichmentType - tuple type describing the enrichment data

enrichmentKey - attribute of enrichmentType that is used for correlation. This parameter is passed as a string.

enrichmentKeyType - type of enrichmentKey. This type must match the key type.

We now show the code for the generic FileEnrich composite operator. This composite is developed using 2 primitive operators and 1 Custom operator. The first operator is a FileSource (line 11). The FileSource uses the enrichmentFile parameter and produces a stream of type enrichmentType. Using a parameter to establish the type of the FileSource output stream gives users the option to use a CSV file with any set of attributes. The second operator is a Switch (line 16), which serves exclusively to control when the input stream (In) can start flowing into the downstream operator. By default, this operator has an initial status of false (i.e., blocking tuples). The status parameter indicates the action taken once a tuple arrives in the second input stream. In this case, a true value indicates that the switch will open when a tuple arrives. The third operator (lines 21-22), implemented as a Custom, is the one responsible for doing the data enrichment itself.

The Custom operator has two phases of execution. First, it builds a map based on EnrichmentData, the stream generated by the FileSource. To create this map, this operator uses the enrichmentKeyType and the enrichmentType itself (line 25). To populate the map, the operator uses the function getTupleAttributeValue (line 29) to get the value of a tuple attribute when that attribute is specified as a string. This function is defined as a native function so that we can use the C++ reflection APIs to inspect the specific tuple attribute and get its corresponding value. The XML segment below shows the SPL signature for the native function.

The C++ code below shows the implementation of this function. This function returns an error flag in two cases: (i) if the attribute provided as a string does not exist, and (ii) if the attribute type does not match the type of value.

The second stage of execution happens after the stream produced by the FileSource is fully processed, which is indicated by a final punctuation. At this point, the Custom operator notifies the Switch (line 38), and the data enrichment process starts. For enrichment, the Custom operator merges the attributes of Data and EnrichmentData using the assignFrom function (lines 45-46). This function assigns all matching fields from one tuple to the other, so be careful when naming the attributes of the enrichment type. If attribute names overlap, the last assigned value will prevail, which in this case is the value available in the enrichment tuple (line 46).
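The two phases can be modeled in a few lines of C++. This is a simplified sketch, not the FileEnrich code: tuples are represented as maps from attribute name to string value, the names Attrs, Index, buildIndex and enrich are hypothetical, and returning the input unchanged when no enrichment record matches is an assumption rather than documented behavior.

```cpp
#include <map>
#include <string>
#include <vector>

using Attrs = std::map<std::string, std::string>;  // attribute name -> value
using Index = std::map<std::string, Attrs>;        // key value -> enrichment record

// Phase 1: index the enrichment records by the attribute named in `keyName`.
// Records that lack the attribute are skipped.
Index buildIndex(const std::vector<Attrs>& records, const std::string& keyName) {
    Index index;
    for (const Attrs& record : records) {
        Attrs::const_iterator key = record.find(keyName);
        if (key != record.end()) index[key->second] = record;
    }
    return index;
}

// Phase 2: start from the input attributes, then copy the enrichment
// attributes over them, so the enrichment value wins when names overlap.
Attrs enrich(const Attrs& input, const std::string& keyName, const Index& index) {
    Attrs result = input;
    Attrs::const_iterator key = input.find(keyName);
    if (key == input.end()) return result;       // assumption: pass through
    Index::const_iterator match = index.find(key->second);
    if (match == index.end()) return result;     // assumption: pass through
    for (const auto& attr : match->second) result[attr.first] = attr.second;
    return result;
}
```

The order of assignment in enrich is what makes the overlap rule visible: whichever side is written last determines the final value.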

In summary, the FileEnrich composite has three characteristics that make it generic:

A type parameter (enrichmentType), which allows the enrichment file to have any stream type;

An attribute parameter (key) that allows users to specify externally which attribute of the input stream should be used as an enrichment key;

A string parameter, which allows users to choose which attribute of the enrichment type must be used for enrichment correlation. Passing the parameter as a string allows users to externally reference a stream attribute name from a stream that is created only inside the composite. In addition, this string can be used by native functions, where one can leverage the power of the reflection APIs.

In a previous post, we discussed composites and the kinds of genericity
they can have. A kind of genericity we did not cover is operator
genericity, which is when at least one operator in a composite's stream
graph is passed in as a parameter.

Building on the example code from composite genericity, suppose we have the
following application:

The purpose of this application is to find "suspect" remote hosts in a log file,
where we define a suspect as any remote host from which 10 or more failed login
attempts have originated. The steps the application takes to do this are:

Read raw log messages from a source.

Parse the raw log messages into service-agnostic messages where the
service related message remains unparsed.

Find all sshd service messages that also indicate an authentication
error.

Parse the message from the sshd service. (The definition of this
operator, ParseFailures, is not shown, but it is similar to the definition
for ParseMessages, which was shown in the previous post.)

After establishing that a message is a failed login, we want to aggregate
together all failed logins that come from the same remote host. When we
receive 10 of them, we emit a tuple that contains the remote host and the
user who failed to log in.

Finally, we write our suspects to a file.
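The aggregation step in the list above can be sketched as a per-host counter. This is a C++ illustration of the idea, not the SPL application: the class and member names are hypothetical, and resetting the count after emitting (so that a further batch of failures is needed before the host is reported again) is an assumption about the windowing behavior.

```cpp
#include <map>
#include <string>

struct Suspect {
    std::string remoteHost;
    std::string user;
};

class SuspectFinder {
public:
    explicit SuspectFinder(int threshold) : threshold_(threshold) {}

    // Records one failed login. Returns true and fills `out` when the host
    // reaches the threshold; `out` then carries the user of the latest failure.
    bool onFailedLogin(const std::string& remoteHost, const std::string& user,
                       Suspect& out) {
        int& count = counts_[remoteHost];  // starts at 0 for a new host
        ++count;
        if (count < threshold_) return false;
        count = 0;  // assumption: start a fresh batch for this host
        out = Suspect{remoteHost, user};
        return true;
    }

private:
    int threshold_;
    std::map<std::string, int> counts_;
};
```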

One problem with our application as it is written is that it assumes the source
and the sink will always be on the filesystem. It's easy to imagine wanting to
perform this operation on data sources that come from the network, and wanting to
report the results over the network. However, we don't want to write another
version of this composite that just has a different source and sink.

Now, the FindSuspects composite is operator generic in its source and sink.
However, we have one problem: most source and sink adapters require parameters
to configure where they should send data to or receive data from. There is no
way to make these parameters fully generic. Of course, if we have, say, a file
name parameter, we can always parameterize the value of that parameter. But we
cannot parameterize the parameter itself.

Wrapping operator invocations in a composite solves this problem. Given the
above definition of FindSuspects, we could invoke it with:

The composite FindSuspectsFromFiles would be invoked as the main composite for
an application. The power of this technique comes from being able to define and
provide different kinds of sources and sinks:

When FindSuspectsFromTCP is used as the main composite for an application,
it retrieves log data from logs.company.com on port 514, and sends
suspects to suspects.company.com on port 514. Of course, one could read data
from a TCP source, yet still write to a file. And, more importantly, it's easy
to define new kinds of sources and sinks for FindSuspects, including operators
that are not actually edge adapters. Operator genericity gives SPL programmers
the power to abstract out the structure of their applications.
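The same separation can be expressed in C++ by passing the source and sink as callables, with small factory functions playing the role of the wrapper composites that bind adapter-specific parameters. This is an analogy only: Source, Sink, makeMemorySource and makeCollectingSink are hypothetical names, and the "Failed" substring check is a placeholder for the real parsing and aggregation steps.

```cpp
#include <functional>
#include <string>
#include <vector>

using Source = std::function<std::vector<std::string>()>;
using Sink = std::function<void(const std::string&)>;

// The fixed part of the graph: it never names a concrete source or sink.
void findSuspects(const Source& source, const Sink& sink) {
    for (const std::string& line : source()) {
        // Placeholder for the real filtering logic.
        if (line.find("Failed") != std::string::npos) sink(line);
    }
}

// Wrapper that binds source-specific configuration (here, in-memory lines),
// so findSuspects never sees those parameters.
Source makeMemorySource(std::vector<std::string> lines) {
    return [lines]() { return lines; };
}

// Wrapper that binds sink-specific configuration (here, a target vector,
// which must outlive the returned Sink).
Sink makeCollectingSink(std::vector<std::string>& collected) {
    return [&collected](const std::string& line) { collected.push_back(line); };
}
```

Swapping in a file-backed or network-backed factory changes only the call site, which mirrors how a different wrapper composite is supplied to FindSuspects.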

The text toolkit can be used to generate Streams code to wrap an InfoSphere BigInsights 2.0 Text Analytics extractor, either one described as source AQL modules or as compiled tam modules. The createTypes.pl script, located in the bin directory of the text toolkit, can generate Streams types that match the types of the extractor's output views, and optionally, a composite invoking the TextExtract operator.

This can be a useful way to create a starting-point Streams application from a BigInsights Text Analytics extractor. The types and the composite may also be created from a makefile to reduce the work in keeping a Streams application in sync with a Text Analytics extractor. This post first describes how to create Streams types from the extractor, then how to make a composite, and finally shows how to make an end-to-end application that you can compile and run.

You need to run the following command from the toolkit's bin directory ($STREAMS_INSTALL/toolkits/com.ibm.streams.text/bin), or from the bin directory in a copy of the toolkit.

The most basic use case for the createTypes.pl script is to create the types of the output views of the Text Analytics modules. This example uses the getNames module included in the FeatureDemo sample application from the text toolkit. The getNames module extracts titles followed by full names from text. The output view of the module is FullNameWithTitle, and it is created as follows:

There are two fields, title and fullName, both of type span (internally, a span is represented as a begin and end offset into a string). As someone working with Streams, you may not be familiar enough with AQL to determine the corresponding Streams types, or the AQL source may be unavailable (if it is provided as a tam module), leaving you in the dark. This is where the createTypes.pl script can help.
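To make the span representation concrete, here is a small C++ sketch. It assumes half-open offsets (begin inclusive, end exclusive), which the text does not specify, and the names Span, spanText and gapBetween are hypothetical rather than part of the toolkit.

```cpp
#include <cstddef>
#include <string>

// Assumed representation: half-open [begin, end) character offsets.
struct Span {
    std::size_t begin;
    std::size_t end;
};

// Recovers the covered text, which is what a span-to-string conversion yields.
std::string spanText(const std::string& text, const Span& span) {
    return text.substr(span.begin, span.end - span.begin);
}

// Number of characters between two spans (0 if they touch or overlap).
// This kind of computation needs the offsets, not just the covered text.
std::size_t gapBetween(const Span& a, const Span& b) {
    if (a.end <= b.begin) return b.begin - a.end;
    if (b.end <= a.begin) return a.begin - b.end;
    return 0;
}
```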

Creating Streams types

Let’s assume that you are in the toolkit bin directory, that the FeatureDemo sample has been copied to your home directory, and that we want to build the application in ~/tryAQL. Then to generate the types, do

(The outputDir parameter is where the compiled .tam file will go when you run the application with Streams. It need not be part of your Streams application.) This command compiles and inspects the BigInsights Text Analytics module, and then creates a simple Streams file, sample.spl, with the SPL types corresponding to each of the output views. The entire file in this case is:

type toPrint0getNamesType = rstring title, rstring fullName;

By default, createTypes.pl maps spans to rstrings. Since we’re just going to print them out, strings make sense. But if you want to compare distances between mentions or perform other operations in which you need the offset in the text, you may want the output as tuples. To do this, supply the --noconvertspan option, and the generated file is:

type toPrint0getNamesType = tuple title, tuple fullName;

Similarly, the --inttype and --floattype options allow you to specify which Streams int or float type to use.

It's not generally a good practice to have all your Streams files in the default namespace. Manually adding a namespace is a minor inconvenience if you are editing by hand, but could be difficult if you're including this command in a makefile that automatically generates the types, so we provide a namespace option. Let’s say you want this in the namespace myaql, in the filename mytypes.spl (if not supplied, it defaults to sample.spl), and that you want to build the application in the tryAQL subdirectory of your home directory. Make sure the directories exist, and then:

Now you can reference these types (in this case, the type toPrint0getNamesType) in your Streams application. In doing so, your application is somewhat insulated from changes in the AQL: adding a field to your output view changes the type definition, but doesn't affect the rest of your Streams application.

Creating a composite

It may also be convenient to create a composite that applies the BigInsights Text Analytics Module. To create such a composite, add on the --makecomposite option, with an optional --compositename option.

This creates the Main.spl file in the current directory; you'll have to move it to the right place for your application (in this example ~/tryAQL). Once there, you can compile, run, and then look at the output. To standalone compile (from your ~/tryAQL directory):

sc -T -M Main -t $STREAMS_INSTALL/toolkits/com.ibm.streams.text

Here's how you'd run it on Chapter 1 of Sense and Sensibility, included as sample data in the toolkit.