In this post, you will learn how to automatically design a complete data warehouse solution on top of Impala using Indyco, a tool for designing, exploring, and understanding your business model (recently named a Cloudera Certified Partner for the Impala platform). You will also discover how to design a new analytic database schema for Impala in three steps using Indyco's features.

Indyco combines the Dimensional Fact Model, a well-known conceptual modeling formalism for representing data, with considerable domain expertise. Using Indyco, IT and business people can sit together and make sense of their data model without being too concerned about implementation details.

Step 1: Deploy the Business Model

First of all, start Indyco Builder (you can download a free trial here) and open a project. For this how-to you will use a sample project, so click on “Open sample project” directly on the Indyco Start Page.

Step 1.1: Select Target

As a first step, choose the query-engine target and the default storage type for your files on Impala. You can override the storage type for each table later.

Step 1.2: Edit Preferences

In the preferences section, you can define your naming conventions (like name prefixes or default types) or leave the default ones.

Step 1.3: Choose Datamarts

Choose the datamart(s) you want to generate on Impala.

Step 1.4: Choose Database Options

Insert the name of your target database on Impala. If you want, Indyco will create it for you in Impala's default location or wherever you prefer.

Step 1.5: Choose Export Options

Before generating your model on Impala, you can review the result of the logical modeling and fine-tune it if needed. In particular, you can override every setting of your tables: mark them as external, specify a different location, or change their storage type.

Step 1.6: Your Model on Impala

After you click Finish, Indyco asks for a folder in which to save the resulting scripts, which you can then copy and paste into your preferred Hadoop manager.


/*
-- ----------------------------------------------------------------
-- Indyco Builder - Auto-Generated DDL Script --
-- ----------------------------------------------------------------
--
-- Project Name . . . . . . : Indyco Tutorial
-- + Exported Datamarts. .  : Commercial, Inventory
-- + Export Timestamp. . .  : 2015-06-17 15:06:01
--
-- Selected Profile . . . . : Star Schema
-- Create database . . . .  : True
-- Database name . . . . .  : indyco_sample_project
--
-- Target DB. . . . . . . . : Impala
--
-- ----------------------------------------------------------------
*/

CREATE DATABASE `indyco_sample_project` LOCATION '/path/of/my/db';

CREATE TABLE `indyco_sample_project`.`DT_AGENT`
(
  `ID_AGENT` INT,
  `AGENT` STRING,
  `AREA_MANAGER` STRING,
  `ID_AREA_MANAGER` INT
)
STORED AS AVRO
TBLPROPERTIES(
  'avro.schema.literal'='{
    "type": "record",
    "name": "DT_AGENT",
    "fields": [
      {"name": "ID_AGENT", "type": "int"},
      {"name": "AGENT", "type": "string"},
      {"name": "AREA_MANAGER", "type": "string"},
      {"name": "ID_AREA_MANAGER", "type": "int"}
    ]}');

CREATE EXTERNAL TABLE `indyco_sample_project`.`BT_AGENT_CONTROLLER`
(
  `ID_AGENT` INT,
  `ID_CONTROLLER` INT,
  `WEIGHT` INT
)
STORED AS PARQUET
LOCATION '/different/path';

CREATE TABLE `indyco_sample_project`.`DT_AGENT_CONTROLLER`
(
  `ID_CONTROLLER` INT,
  `CONTROLLER` STRING
)
STORED AS PARQUET;
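Once saved, the generated script can be run against your cluster from any Impala client. For example, with impala-shell (the host name and script path below are placeholders, not values produced by Indyco):

```shell
# Run the generated DDL against an Impala daemon.
# -i selects the impalad to connect to; -f executes a script file.
# Replace the host and the path with your own values.
impala-shell -i impalad-host.example.com:21000 -f /tmp/indyco_generated_ddl.sql
```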

Step 2: Migrate Your Schema to Impala

With the Reverse engineering wizard, you can create a DFM project from an existing data warehouse.

To open the Reverse engineering wizard select the “Reverse engineering” menu item under the “Tools” section.

The wizard has the following steps:

Step 2.1: Connection String

Connection string configuration comprises two steps:

Indicate the provider, the connection string, and the owner of the database

Enable inference of primary- and foreign-key constraints. In other words, if the primary- and foreign-key constraints don't exist in the physical schema but only in the logical schema, the reverse-engineering process tries to find the logical constraints. To enable primary-key inference, you have to set:

A string pattern for the primary-key name. (A pattern such as {table}_ids establishes that each primary-key name is composed of its table name and the suffix _ids. For example, for the table PRODUCT, the primary-key name will be PRODUCT_ids.)

The prefix for fact tables, dimension tables, and bridge tables with the following pattern: <fact-table-prefix>,<dimension-table-prefix>,<bridge-table-prefix>
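The two inference rules above can be sketched in a few lines of code. The following Python sketch is purely illustrative (the helper names and the fact-table prefix FT are assumptions, not Indyco internals): it derives the expected primary-key name from the configured pattern and classifies a table by its name prefix.

```python
# Illustrative sketch of the inference rules described above.
# Function names and the "FT" fact prefix are assumptions, not Indyco's API.

def primary_key_name(table: str, pattern: str = "{table}_ids") -> str:
    """Derive the expected primary-key name from the configured pattern."""
    return pattern.format(table=table)

def classify_table(table: str, prefixes: str = "FT,DT,BT") -> str:
    """Classify a table as fact, dimension, or bridge by its name prefix,
    given the <fact>,<dimension>,<bridge> prefix triple."""
    fact, dim, bridge = prefixes.split(",")
    if table.startswith(fact + "_"):
        return "fact"
    if table.startswith(dim + "_"):
        return "dimension"
    if table.startswith(bridge + "_"):
        return "bridge"
    return "unknown"

print(primary_key_name("PRODUCT"))            # PRODUCT_ids
print(classify_table("DT_AGENT"))             # dimension
print(classify_table("BT_AGENT_CONTROLLER"))  # bridge
```

With these rules, the DT_ and BT_ tables in the generated script above would be recognized as dimension and bridge tables respectively.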

Step 2.2: Fact Table Selection

You can select the fact tables that will become fact schemas at the end of the reverse-engineering process.

Step 2.3: Logical Modeling

In this step, for each fact table you can see a preview of the logical table schema (on the left) alongside the corresponding fact schema (on the right).

Only logical constraints can be added, removed, or changed; physical constraints are unchangeable.

Every change in a logical schema is propagated to the other logical schemas where needed (for example, when a change deletes a relationship between two dimension tables) and is also propagated to the corresponding fact schemas.

Finally, by clicking on “Finish,” you save a new DFM project with the same fact schemas included in the logical model.

Step 3: Model and Migrate

Now you have one place to govern, maintain, evolve, and improve your business model.

Next, you can easily create your database schema on CDH by following the steps described at the beginning of this post!