The Eventuate Spark adapter allows applications to consume events from event logs and to process them in Apache Spark. Writing processed events back to event logs is not possible yet but will be supported in future versions.

Note

The Spark adapter is only available for Scala 2.11 at the moment (see Download).

import akka.actor.ActorSystem
import com.rbmhtechnology.eventuate.DurableEvent
import com.rbmhtechnology.eventuate.adapter.spark.SparkBatchAdapter
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

implicit val system = ActorSystem("spark-example")

val sparkConfig = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.connection.port", "9042")
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
  .setAppName("adapter")
  .setMaster("local[4]")

val logId = "example"

val sparkContext: SparkContext = new SparkContext(sparkConfig)

// Create an Eventuate Spark batch adapter
val sparkBatchAdapter: SparkBatchAdapter = new SparkBatchAdapter(sparkContext, system.settings.config)

// Expose all events of given event log as Spark RDD
val events: RDD[DurableEvent] = sparkBatchAdapter.eventBatch(logId)

// Expose events of given event log as Spark RDD, starting at sequence number 3
val eventsFrom: RDD[DurableEvent] = sparkBatchAdapter.eventBatch(logId, fromSequenceNr = 3L)

A SparkBatchAdapter is instantiated with a SparkContext that is configured for connecting to a Cassandra storage backend, and with a Custom event serialization configuration (if any). The eventBatch method exposes the event log with the given logId as an RDD[DurableEvent], optionally starting from a custom sequence number.

Event logs can span several partitions in a Cassandra cluster, and the batch adapter reads from these partitions concurrently. Events in the resulting RDD are therefore only ordered per partition, not across partitions. Applications that require a total order by localSequenceNr can sort the resulting RDD:

// By default, events are sorted by sequence number *per partition*.
// Use .sortBy(_.localSequenceNr) to create a totally ordered RDD.
val eventsSorted: RDD[DurableEvent] = events.sortBy(_.localSequenceNr)

Exposing Spark DataFrames directly is not possible yet but will be supported in future versions. In the meantime, applications should convert RDDs to DataFrames or Datasets as shown in the following example:

import org.apache.spark.sql.{Dataset, DataFrame, SQLContext}

case class DomainEvent(sequenceNr: Long, payload: String)

val sqlContext: SQLContext = new SQLContext(sparkContext)
import sqlContext.implicits._

// Create a DataFrame from RDD[DurableEvent]
val eventsDF: DataFrame = events
  .map(event => DomainEvent(event.localSequenceNr, event.payload.toString))
  .toDF()

// Create a Dataset from RDD[DurableEvent]
val eventDS: Dataset[DomainEvent] = events
  .map(event => DomainEvent(event.localSequenceNr, event.payload.toString))
  .toDS()

A SparkStreamAdapter is instantiated with a Spark StreamingContext and a Custom event serialization configuration (if any). The eventStream method exposes the event log with the given logName as a DStream[DurableEvent]. The stream is updated by interacting with the event log’s replication endpoint at the given host and port.

The stream starts from the given fromSequenceNr and is updated with both replayed events and newly written events. The storage level of events in Spark can be set with the storageLevel parameter. Applications that want to enforce event processing in strict event log storage order should repartition the stream with .repartition(1), as shown in the example below.
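The following is a minimal sketch of such a stream setup. The endpoint id, host, port and log name ("s1", "127.0.0.1", 2552, "L") are example values, and the exact eventStream parameter list should be verified against the SparkStreamAdapter API:

import com.rbmhtechnology.eventuate.DurableEvent
import com.rbmhtechnology.eventuate.adapter.spark.SparkStreamAdapter
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkStreamingContext = new StreamingContext(sparkContext, Seconds(1))

// Create an Eventuate Spark stream adapter
val sparkStreamAdapter = new SparkStreamAdapter(sparkStreamingContext, system.settings.config)

// Expose the event log as a DStream by connecting to its replication endpoint
// ("s1", "127.0.0.1", 2552 and log name "L" are example values)
val stream: DStream[DurableEvent] = sparkStreamAdapter.eventStream(
  "s1", "127.0.0.1", 2552, "L",
  fromSequenceNr = 1L, storageLevel = StorageLevel.MEMORY_ONLY)

// Repartition to a single partition to process events in strict
// event log storage order
stream.repartition(1).foreachRDD(rdd => rdd.foreach(println))

// Start stream processing
sparkStreamingContext.start()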

For persisting the stream processing progress, an application should store the last processed sequence number at a custom place. When the application is restarted, the stored sequence number should be used as argument to the eventStream call. Later versions will additionally support internal storage of event processing progress.
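A sketch of this pattern could look as follows. readProgress, writeProgress and processEvent are hypothetical, application-defined helpers, the storage location of the progress is entirely up to the application, and fromSequenceNr is assumed to be inclusive (as with eventBatch above):

// Hypothetical helpers for storing progress at a custom place
// (file, database, key-value store, ...) - not part of the Eventuate API
def readProgress(): Long = ???
def writeProgress(sequenceNr: Long): Unit = ???
def processEvent(event: DurableEvent): Unit = ???

// On (re)start, resume the stream after the stored sequence number
// (assumes fromSequenceNr is inclusive)
val resumedStream: DStream[DurableEvent] = sparkStreamAdapter.eventStream(
  "s1", "127.0.0.1", 2552, "L",
  fromSequenceNr = readProgress() + 1L, storageLevel = StorageLevel.MEMORY_ONLY)

resumedStream.repartition(1).foreachRDD { rdd =>
  rdd.foreach(processEvent)
  // Persist the highest processed sequence number of this batch
  if (!rdd.isEmpty()) writeProgress(rdd.map(_.localSequenceNr).max())
}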