This is fundamentally a messaging framework, like IBM MQ, perhaps tuned to the IoT world

satya - 7/21/2017, 10:47:01 AM

Likely use cases

Apply it as a component of IoT

Treat an enterprise's state changes as a special case of IoT, so that the most current data is available in real time

Real-time enable an enterprise

It is easier imagined than done

It may be a component of an overall enterprise IoT framework

satya - 7/21/2017, 10:49:09 AM

Clustering provides fault tolerance

satya - 7/21/2017, 10:50:43 AM

Consumers are organized into groups, and messages are delivered across a group's members for load balancing

satya - 7/21/2017, 10:52:43 AM

queues help you scale with workers

satya - 7/21/2017, 10:53:01 AM

pub-sub has no inherent scaling primitives

satya - 7/21/2017, 10:58:34 AM

Kafka says

it can support both the queuing and pub/sub models

stronger ordering

Kafka can provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
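
A minimal sketch of that model, using a recent Kafka Java consumer client: every worker process that shares the same group.id splits the topic's partitions among the group, which gives the queue-style load balancing above, while a second group with a different group.id would independently receive the full stream, which is the pub/sub side. The broker address, topic name, and group name here are illustrative assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        // Workers sharing this group.id divide the topic's partitions among themselves
        // (queue semantics); a different group.id gets its own full copy of the stream (pub/sub).
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-workers");            // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));           // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Within its assigned partition, this consumer is the only reader
                    // in the group, so it sees the records in order.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}

Starting two or three copies of this process with the same group.id spreads the partitions across them; more instances than partitions would leave some idle, as noted above.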

By taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special-purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
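
Because the broker keeps the log and the client owns its read position, a consumer can rewind and re-read retained history. A small sketch, assuming a topic named orders with partition 0 on a local broker:

import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class Replay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign a partition explicitly; the client, not the broker, decides
            // where in the retained log reading starts.
            TopicPartition tp = new TopicPartition("orders", 0);              // assumed topic/partition
            List<TopicPartition> partitions = Collections.singletonList(tp);
            consumer.assign(partitions);

            consumer.seekToBeginning(partitions); // rewind to the oldest retained record
            // consumer.seek(tp, 42L);            // or jump to an arbitrary offset

            consumer.poll(Duration.ofSeconds(1)).forEach(record ->
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value()));
        }
    }
}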

It has a dedicated API (Kafka Streams) for processing streams and outputting the results

Effectively, a system like this allows storing and processing historical data as well.

By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
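
A minimal Kafka Streams sketch of that idea: the topology below starts from whatever history is retained in the input topic and then simply keeps running as new records arrive, so the same code covers both the "batch" and the "streaming" case. The topic names, application id, and filtering logic are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ProductViewFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "product-view-filter"); // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Reads the input topic from its stored history onward; the topology does not
        // "finish" at the last record, it keeps waiting for future ones.
        KStream<String, String> views = builder.stream("page-views");           // assumed topic
        views.filter((user, page) -> page != null && page.startsWith("/product"))
             .to("product-views");                                              // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}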

We (LinkedIn) run several clusters of Kafka brokers for different purposes in each data center. We have nearly 1400 brokers in our current deployments across LinkedIn that receive over two petabytes of data every week. We generally run off Apache Kafka trunk and cut a new internal release every quarter or so.

satya - 7/23/2017, 12:13:25 PM

How schema registries are used

We (LinkedIn) have standardized on Avro as the lingua franca within our data pipelines. So each producer encodes Avro data, registers Avro schemas in the schema registry, and embeds a schema ID in each serialized message. Consumers fetch the schema corresponding to the ID from the schema registry service in order to deserialize the Avro messages. While there are multiple schema registry instances across our data centers, these are backed by a single (replicated) database that contains the schemas.
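
LinkedIn's registry and serializers are internal, so the following is only a sketch of the general pattern described above: register the schema, get back an ID, and prepend that ID to the Avro-encoded payload so consumers can look the schema up before deserializing. The SchemaRegistryClient interface, the subject name, and the 4-byte ID header are assumptions, not LinkedIn's actual wire format.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEnvelope {

    // Hypothetical registry client: returns the ID the registry assigned to this schema.
    interface SchemaRegistryClient {
        int register(String subject, Schema schema);
    }

    // Serialize a record as [4-byte schema ID][Avro binary payload]; a consumer reads
    // the ID, fetches the schema from the registry, and then decodes the payload.
    static byte[] encode(SchemaRegistryClient registry, String subject, GenericRecord record)
            throws IOException {
        int schemaId = registry.register(subject, record.getSchema());

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
        encoder.flush();

        byte[] payload = out.toByteArray();
        return ByteBuffer.allocate(4 + payload.length).putInt(schemaId).put(payload).array();
    }
}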

satya - 7/23/2017, 12:14:36 PM

Interesting: Nuage

Nuage is the self-service portal for online data-infrastructure resources at LinkedIn, and we have recently worked with the Nuage team to add support for Kafka within Nuage. This offers a convenient place for users to manage their topics and associated metadata. Nuage delegates topic CRUD operations to Kafka REST, which abstracts the nuances of Kafka's administrative utilities.

Nuage is a service that exposes database provisioning functionality through a rich user interface and set of APIs. Through this new user interface, developers can specify the characteristics of the datastore they want to create, and Nuage will interact with the database system to provision your datastore on a pre-existing cluster.

The underlying database system needs to support multi-tenancy and needs to be elastic such that it can expand its capacity automatically when the load on the system increases.

It is ultra-simple to use, and it takes little or no time to set up the data layer for the application you're building.

It is elastic and scales automatically; new nodes are provisioned automatically, as needed.

It has no single point of failure, is highly available, and fixes itself.

It is highly operable such that hundreds of thousands of nodes can be managed by a handful of administrators.