Copy data from Cassandra using Azure Data Factory

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from a Cassandra database. It builds on the copy activity overview article that presents a general overview of copy activity.

Supported capabilities

You can copy data from a Cassandra database to any supported sink data store. For a list of data stores that the copy activity supports as sources or sinks, see the Supported data stores table.

Specifically, this Cassandra connector supports:

Cassandra versions 2.x and 3.x.

Copying data using Basic or Anonymous authentication.

Note

For activities running on a Self-hosted Integration Runtime, Cassandra 3.x is supported in IR version 3.7 and later.

Prerequisites

To copy data from a Cassandra database that is not publicly accessible, you need to set up a Self-hosted Integration Runtime. See the Self-hosted Integration Runtime article for details. The Integration Runtime provides a built-in Cassandra driver, so you don't need to install any driver manually when copying data from Cassandra.

Getting started

You can use any of the supported tools or SDKs to create a pipeline that uses the Copy Activity.

The Integration Runtime to be used to connect to the data store. You can use Self-hosted Integration Runtime or Azure Integration Runtime (if your data store is publicly accessible). If not specified, it uses the default Azure Integration Runtime.

When you use a SQL query, specify the table as keyspace name.table name, that is, the keyspace name and the table name joined by a dot.
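The keyspace-qualified table name can be illustrated with a minimal Python sketch that assembles a copy activity source fragment. The helper name `cassandra_source` is hypothetical, and the property names `type` and `query` and the value `CassandraSource` are assumptions about the connector's JSON shape rather than something stated in this article.

```python
import json


def cassandra_source(keyspace: str, table: str, columns: list[str]) -> dict:
    """Build a source fragment whose query uses a keyspace-qualified table name.

    Hypothetical helper: the "type" and "query" keys are assumed property names.
    """
    # The table is always referenced as <keyspace name>.<table name>.
    qualified_table = f"{keyspace}.{table}"
    query = f"SELECT {', '.join(columns)} FROM {qualified_table}"
    return {"type": "CassandraSource", "query": query}


# Example with made-up keyspace, table, and column names.
source = cassandra_source("mykeyspace", "mytable", ["id", "firstname", "lastname"])
print(json.dumps(source, indent=2))
```

Building the qualified name in one place keeps the keyspace from being dropped by accident when queries are generated for several tables.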

consistencyLevel

The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. See Configuring data consistency for details.
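To make "how many replicas must respond" concrete, the following Python sketch computes the replica count for a few common Cassandra consistency levels. It is an illustration only: the function name `replicas_required` is hypothetical, it assumes a single datacenter, and it covers only the levels ONE, TWO, THREE, QUORUM, and ALL. The quorum formula (half the replication factor, rounded down, plus one) is the standard Cassandra definition.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Return how many replicas must answer a read at the given consistency level.

    Illustrative sketch: single datacenter, subset of levels only.
    """
    fixed_counts = {"ONE": 1, "TWO": 2, "THREE": 3}
    level = level.upper()
    if level in fixed_counts:
        return fixed_counts[level]
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"Consistency level not covered by this sketch: {level}")


# With a replication factor of 3, QUORUM waits for 2 replicas and ALL for 3.
for name in ("ONE", "QUORUM", "ALL"):
    print(name, replicas_required(name, replication_factor=3))
```

Higher levels trade read latency and availability for a stronger guarantee that the returned data is current.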

Data type mapping for Cassandra

When copying data from Cassandra, the following mappings are used from Cassandra data types to Azure Data Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the source schema and data type to the sink.

The lengths of Binary and String columns cannot be greater than 4000.

Work with collections using virtual table

Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your Cassandra database. For collection types including map, set and list, the driver renormalizes the data into corresponding virtual tables. Specifically, if a table contains any collection columns, the driver generates the following virtual tables:

A base table, which contains the same data as the real table except for the collection columns. The base table uses the same name as the real table that it represents.

A virtual table for each collection column, which expands the nested data. The virtual tables that represent collections are named using the name of the real table, the separator "_vt_", and the name of the column.

Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. See the Example section for details. You can access the content of Cassandra collections by querying and joining the virtual tables.
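The naming rule can be sketched in a few lines of Python. The function name `virtual_table_names` is hypothetical; the rule itself (real table name, then "_vt_", then column name) follows the table names used in the Example section.

```python
def virtual_table_names(table: str, collection_columns: list[str]) -> list[str]:
    """List the tables the driver would expose for one real table.

    The base table keeps the real table's name; each collection column
    gets its own virtual table named <table>_vt_<column>.
    """
    return [table] + [f"{table}_vt_{column}" for column in collection_columns]


print(virtual_table_names("ExampleTable", ["List", "Map", "StringSet"]))
```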

Example

For example, the following "ExampleTable" is a Cassandra database table that contains an integer primary key column named "pk_int", a text column named "Value", a list column named "List", a map column named "Map", and a set column named "StringSet".

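| pk_int | Value            | List                         | Map                    | StringSet       |
|--------|------------------|------------------------------|------------------------|-----------------|
| 1      | "sample value 1" | ["1", "2", "3"]              | {"S1": "a", "S2": "b"} | {"A", "B", "C"} |
| 3      | "sample value 3" | ["100", "101", "102", "105"] | {"S1": "t"}            | {"A", "E"}      |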

The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual table row corresponds to.

The first virtual table is the base table, named "ExampleTable", which is shown in the following table:

| pk_int | Value            |
|--------|------------------|
| 1      | "sample value 1" |
| 3      | "sample value 3" |

The base table contains the same data as the original database table except for the collections, which are omitted from this table and expanded in other virtual tables.

The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet columns. The columns with names that end with "_index" or "_key" indicate the position of the data within the original list or map. The columns with names that end with "_value" contain the expanded data from the collection.

Table "ExampleTable_vt_List":

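| pk_int | List_index | List_value |
|--------|------------|------------|
| 1      | 0          | 1          |
| 1      | 1          | 2          |
| 1      | 2          | 3          |
| 3      | 0          | 100        |
| 3      | 1          | 101        |
| 3      | 2          | 102        |
| 3      | 3          | 105        |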

Table "ExampleTable_vt_Map":

| pk_int | Map_key | Map_value |
|--------|---------|-----------|
| 1      | S1      | a         |
| 1      | S2      | b         |
| 3      | S1      | t         |

Table "ExampleTable_vt_StringSet":

| pk_int | StringSet_value |
|--------|-----------------|
| 1      | A               |
| 1      | B               |
| 1      | C               |
| 3      | A               |
| 3      | E               |
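The expansion shown in this example can be imitated in plain Python to see how the base table and the virtual tables relate. This is a sketch of the idea, not the driver's implementation: the function names `expand_collections` and `join_on_pk` are hypothetical, Python lists, dicts, and sets stand in for Cassandra list, map, and set columns, and set members are sorted only to make the output order predictable.

```python
def expand_collections(table_name: str, pk: str, rows: list[dict]) -> dict:
    """Split rows into a base table plus one virtual table per collection column."""
    tables = {table_name: []}
    for row in rows:
        base_row = {}
        for col, val in row.items():
            vt_name = f"{table_name}_vt_{col}"
            if isinstance(val, list):
                # Lists keep the element position in <col>_index.
                out = tables.setdefault(vt_name, [])
                for index, item in enumerate(val):
                    out.append({pk: row[pk], f"{col}_index": index, f"{col}_value": item})
            elif isinstance(val, dict):
                # Maps keep the entry key in <col>_key.
                out = tables.setdefault(vt_name, [])
                for key, item in val.items():
                    out.append({pk: row[pk], f"{col}_key": key, f"{col}_value": item})
            elif isinstance(val, (set, frozenset)):
                # Sets have no position or key, only <col>_value.
                out = tables.setdefault(vt_name, [])
                for item in sorted(val):
                    out.append({pk: row[pk], f"{col}_value": item})
            else:
                base_row[col] = val
        tables[table_name].append(base_row)
    return tables


def join_on_pk(base: list[dict], virtual: list[dict], pk: str) -> list[dict]:
    """Join virtual-table rows back to their base rows through the primary key."""
    base_by_pk = {row[pk]: row for row in base}
    return [{**base_by_pk[v[pk]], **v} for v in virtual]


rows = [
    {"pk_int": 1, "Value": "sample value 1", "List": ["1", "2", "3"],
     "Map": {"S1": "a", "S2": "b"}, "StringSet": {"A", "B", "C"}},
    {"pk_int": 3, "Value": "sample value 3", "List": ["100", "101", "102", "105"],
     "Map": {"S1": "t"}, "StringSet": {"A", "E"}},
]
tables = expand_collections("ExampleTable", "pk_int", rows)
for joined_row in join_on_pk(tables["ExampleTable"], tables["ExampleTable_vt_Map"], "pk_int"):
    print(joined_row)
```

The join at the end mirrors what a query against the virtual tables does: each row of "ExampleTable_vt_Map" is matched to its base row through "pk_int", so every map entry appears next to the scalar columns of the row it came from.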

Next steps

For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported data stores.
