Now that you know how to create a new project, we recommend checking out our Carbon Emissions tutorial to explore what Meltano is capable of. If you are feeling more adventurous, feel free to skip it and move on!

This is the Carbon Intensity API (carbon emissions/footprint) and SQLite tutorial. This data source was chosen because it is public, free, and does not require credentials to access. The tutorial guides you through extracting data from the Carbon Intensity API, loading the extracted entities into a SQLite database, and analyzing the results.

TIP

This tutorial is perfect if your goal is to get Meltano up and running as quickly as possible.

In your terminal, navigate to the directory where you want your Meltano project to be created. Then run the following commands:

Remember

If you haven't already, run source venv/bin/activate to use the meltano executable installed in your virtual environment (venv).

Initialize a new project with a folder called carbon:

meltano init carbon

Change directory into your new carbon project:

cd carbon

Let's see what extractors and loaders are available:

meltano discover all

Run the extractor (tap) and loader (target):

meltano elt tap-carbon-intensity target-sqlite

Ensure Meltano UI will know how to use data from ELT:

meltano add model model-carbon-intensity-sqlite

Congratulations! You have just extracted all the data from the Carbon Intensity API and loaded it into your local SQLite database.

TIP

Meltano extracts data from various sources like Salesforce, Zendesk, and Google Analytics and then loads that data into the database of your choice. You can use community extractors and loaders or write your own too.

Meltano's ELT pipeline empowers you to aggregate data from various sources and then gather insights from them using Meltano UI with its automatic SQL generation.

This is the Salesforce API and Postgres database tutorial. It guides you through data extraction from your Salesforce account, loading extracted entities to a Postgres DB, transforming the raw data, and analyzing the results.

You can leave SFDC_URL and SFDC_CLIENT_ID as they are in the example above, but you have to set SFDC_USERNAME, SFDC_PASSWORD, SFDC_SECURITY_TOKEN, and SFDC_START_DATE according to your instance and preferences.
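For example, a minimal sketch of those user-specific entries, exported in your shell or added to the project's .env; every value below is a placeholder, and the exact SFDC_START_DATE format should follow the extractor's documentation:

export SFDC_USERNAME='me@example.com'
export SFDC_PASSWORD='<your-password>'
export SFDC_SECURITY_TOKEN='<your-security-token>'
export SFDC_START_DATE='2020-01-01T00:00:00Z'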

A Salesforce account may have more than 100 different entities. In order to see the list of available entities, please run:

meltano select tap-salesforce --list --all

In this tutorial, we are going to work with a few of the most common ones and show you how to select specific entities to extract from the API: Account, Contact, Lead, User, Opportunity, and Opportunity History:
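A sketch of the corresponding select commands; the exact entity names (for example, whether Opportunity History is listed as OpportunityHistory) should match what meltano select tap-salesforce --list --all reports:

meltano select tap-salesforce "Account" "*"
meltano select tap-salesforce "Contact" "*"
meltano select tap-salesforce "Lead" "*"
meltano select tap-salesforce "User" "*"
meltano select tap-salesforce "Opportunity" "*"
meltano select tap-salesforce "OpportunityHistory" "*"

With the entities selected, the full import (extract, load, and transform) is then run with:

meltano elt tap-salesforce target-postgres --transform run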

Depending on your Account, the aforementioned command may take from a couple of minutes to a couple of hours. That's why we recommend setting SFDC_START_DATE to a fairly recent date for your first test.

You could also extract and load the data and then run the transformations at a later point (examples below):

Only run the Extract and Load steps:

meltano elt tap-salesforce target-postgres

Only run the Transform Step:

meltano elt tap-salesforce target-postgres --transform only

The transform step uses the dbt transforms defined by Meltano's Salesforce dbt package.
When meltano elt tap-salesforce target-postgres --transform run is executed, both the default and any custom dbt transformations in the transform/ directory (a folder created upon project initialization) are performed.

In order to visualize the data with existing transformations in the UI, the final step would be to add models:
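Assuming the bundled Salesforce model plugin is what you need (the plugin name below mirrors the carbon tutorial above and is our assumption; check the output of meltano discover all for the exact name), that would look like:

meltano add model model-salesforce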

By default, Meltano pulls all of the data in the ELT process. This behavior is perfect for getting started because of its simplicity. However, some datasets are too big to query as a whole: the solution is incremental ELT.

Incremental ELT persists the extraction cursor (called state) so that any subsequent ELT run only pulls the data that changed after this cursor. This feature is implemented by the extractors themselves and is simple to set up in your Meltano project.

WARNING

Support for incremental ELT varies from extractor to extractor.

To enable it, Meltano must know which cursor to use for the ELT run; this is set using the --job_id parameter on the meltano elt command.
Alternatively, you can use the MELTANO_JOB_ID environment variable. For each subsequent ELT run, Meltano will look for a previous cursor to start from.
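For example, a one-off run that passes the cursor name through the environment instead of the flag might look like this (using the same extractor and loader as the examples below):

MELTANO_JOB_ID=gitlab meltano elt tap-gitlab target-postgres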

The first run will create a cursor state:

meltano elt --job_id=gitlab tap-gitlab target-postgres

Subsequent runs will start from this cursor:

meltano elt --job_id=gitlab tap-gitlab target-postgres

WARNING

Schedules currently only support the MELTANO_JOB_ID environment variable, which needs to be set manually in meltano.yml.

As much as we'd like to support all the data sources out there, we'll need your help to get there. If you find a data source that Meltano doesn't support right now, it might be time to get your hands dirty.

We aim to make Meltano as thin as possible on top of the components it abstracts, so adding a new plugin should be straightforward.

When creating a new plugin, you'll often have to expose some settings to the user so that Meltano can generate the correct configuration to run your plugin.

To expose such a setting, you'll need to define it as follows:

name: Identifier of this setting in the configuration.
The name is the most important field of a setting, as it defines how the value will be passed down to the underlying component.
Nesting can be represented using the . separator.

We assume that you have also run the ELT steps in the Salesforce tutorial.
Nevertheless, you can follow this tutorial to create entirely new transformations and run the whole ELT process at the end.

In order to have the results of the transformations materialized in the analytics schema, we need to update sfdc-project/transform/dbt_project.yml. For more details on materialization options, please check dbt's documentation.

Finally, update the project's .env to add the proper settings for the source and target databases. The TAP_PG_* variables are used by the Tap (i.e. they define the source DB the data are extracted from), while the PG_* variables are used by the Target (i.e. they define the target DB the data will be loaded into).
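As a sketch, those .env entries might look like the following. The PG_* names match the loader settings used elsewhere in these tutorials; the TAP_PG_* names only illustrate the prefix convention, so check your extractor's settings for the exact names. All values are placeholders:

# Source database, read by the tap (illustrative variable names)
TAP_PG_ADDRESS=source-db.example.com
TAP_PG_PORT=5432
TAP_PG_USERNAME=source_user
TAP_PG_PASSWORD=<source-password>
TAP_PG_DATABASE=warehouse

# Target database, written to by target-postgres
PG_ADDRESS=localhost
PG_PORT=5432
PG_USERNAME=analytics_user
PG_PASSWORD=<target-password>
PG_DATABASE=warehouse
PG_SCHEMA=<schema-for-the-raw-data>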

This step is required if you don't want to export everything from the source db. You can skip it if you just want to export all tables.

We can use meltano select to select which entities will be exported by the Tap from the source DB. You can find more info on how meltano select works in the Meltano CLI commands documentation.

In the case of tap-postgres, the names of the Entities (or streams, as they are called in the Singer.io Specification) are the same as the table names in the source DB, prefixed by the DB name and the schema they are defined in: {DB NAME}-{SCHEMA NAME}-{TABLE NAME}.

For example, assume that you want to export the users table and selected attributes from the issues table that reside in the tap_gitlab schema in warehouse DB. The following meltano select commands will only export those two tables and data for the selected attributes:
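A sketch of those commands, assuming the meltano select <extractor> <entity> <attribute> form used in the Salesforce tutorial; the issue attributes listed here are only illustrative:

meltano select tap-postgres "warehouse-tap_gitlab-users" "*"
meltano select tap-postgres "warehouse-tap_gitlab-issues" "id"
meltano select tap-postgres "warehouse-tap_gitlab-issues" "project_id"
meltano select tap-postgres "warehouse-tap_gitlab-issues" "title"
meltano select tap-postgres "warehouse-tap_gitlab-issues" "state"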

We have all the data for our very successful startup GitFlix, a git based video streaming service, in CSV files.

We export our user data from our CRM, the episode information from our CMS and the streaming data from our own custom streaming system.

It's time for us to move all our data to a Postgres Database, so that we can have everything together, run some advanced analysis and compare the results.

We have a pretty simple scenario: Users stream episodes from various TV series.

For each user we have their name, age, their lifetime value to GitFlix (total subscriptions until today) and some additional data on how often they log in to GitFlix and their total logins since they subscribed.

For episodes, we store their number (e.g. '304' for episode 4 of season 3), title, the TV series they belong to (e.g. 'Star Trek TNG'), the rating the episode got in IMDb and the expected ad revenue per minute on ad supported plans (the streaming wars have forced GitFlix to offer both a paid and an ad supported free subscription).

Each input CSV file used with tap-csv must be a traditionally-delimited CSV (comma-separated columns, newlines indicating new rows, double-quoted values) as defined by the defaults of the Python csv library. The first row is the header defining the attribute name for each column, and each attribute will result in a column of the same name in the database. You can check the downloaded files as examples of valid CSV files (they were generated by exporting Google Sheets to CSV).

You should replace the example warehouse value with the name of your database and the user and password with your own Postgres credentials, and change the address and port if Postgres is not running locally on the default port.

PG_SCHEMA is the schema the raw data will be imported into, and TAP_CSV_FILES_DEFINITION (csv_files.json in the example) points to a JSON file listing all the CSV files to be loaded.
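For instance, a sketch of the relevant .env entries with placeholder values (csv_imports is the schema used for the raw data later in this tutorial):

PG_ADDRESS=localhost
PG_PORT=5432
PG_USERNAME=<your-user>
PG_PASSWORD=<your-password>
PG_DATABASE=warehouse
PG_SCHEMA=csv_imports
TAP_CSV_FILES_DEFINITION=csv_files.json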

All the records in our CSV files were loaded successfully into our database, but we still have some issues:

Everything is a string in a CSV, so all values were loaded as character varying to our Postgres DB. Strings are not very useful for calculating aggregates, so we want to convert all numerical measures to floats or integers accordingly.

Similarly, we want to run more complex value transformations; for example, convert a value like $2,638,765.21 to 2638765.21 by removing the $ and commas and converting the result to a float.

Column names like clv (short for customer lifetime value), avg_logins (short for average logins per day) or ad_rev (short for expected ad revenue per minute on ad supported plans) can be part of raw formatting but are not very useful to a high level user. We want to provide proper, descriptive, attribute names.

Some of our columns have errors like #DIV/0! (division by zero), which we want to clean and convert to NULL values or 0s (depending on the business logic).

We are going to add some simple custom transforms in order to clean and normalize the data.

with source as (
    select * from {{ env_var('PG_SCHEMA') }}.episodes
),
renamed as (
    select
        -- Primary Key: Cast to integer and provide unique name
        CAST(id as integer) as episode_id,
        -- Keep the Episode Number as a string
        no as episode_number,
        title as title,
        tv_series as tv_series,
        -- Fix empty values and cast the rating to float
        CAST(nullif(rating, '') as float) as imdb_rating,
        -- Remove the $ from the start and the commas
        -- and then cast to float
        CAST(nullif(replace(substring(ad_rev from 2), ',', ''), '') AS float) as ad_revenue_per_minute
    from source
    where
        -- Make sure that we keep only episodes with valid IDs
        id is NOT NULL
        -- and that we remove all the test entries
        and title NOT LIKE 'test_the_test_%'
)
select * from renamed

In the transforms above we could have defined more complicated transformations, additional filters, etc. If you are interested in more examples, you can check the more advanced transformations in Meltano's default transforms for Zuora.

Also, we could have hard-coded the schema the raw tables reside in (in this example csv_imports), but instead we make use of the fact that it is defined by the environment variable PG_SCHEMA. That means that even if you change the configuration and load the data into a different schema, the Transforms will not have to change.

You should now be able to follow the same steps to import your own CSV files and generate complex reports in Meltano UI:

Prepare your CSV files so that they have a header in the first line with the attribute names.

Update csv_files.json to link your CSV files and use the proper entity name and key(s) for each.

Import and check the raw data (an example command is sketched after this list).

Add custom Transforms and Models by following the Gitflix example or any other Transforms and Models provided by Meltano. You can check the Meltano Group for projects that define default transforms or models for various supported APIs if you want to see real world examples.
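For the import step above, a sketch of the ELT command, assuming the extractor is tap-csv and the loader is target-postgres as in the rest of this tutorial:

meltano elt tap-csv target-postgres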

Once the meltano elt pipeline has successfully completed and data extracted from an API or a Data Source have been transformed and loaded to the analytics schema of your Data Warehouse, you can use Meltano UI or any data exploration tool to analyze and generate reports.

In this tutorial, we are going to present how to connect Jupyter Notebook to a Meltano Project that uses Postgres to store the transformed data.

If you have Jupyter already installed in your system, you can skip this step.

The most common options for installing Jupyter Notebook are by either using Anaconda or pip. We are going to use pip in this tutorial, as Meltano also uses pip for its installation.

Remember

If you used a virtual environment (venv) to install and run Meltano, don't forget to first navigate to the directory with your venv and run source venv/bin/activate to enable it.

The following commands will install Jupyter Notebook and the most common Python libraries required to connect to a database (psycopg2, sqlalchemy), manipulate data (pandas), and generate some ad hoc plots (matplotlib):
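A sketch of those commands; psycopg2-binary is used here as the prebuilt distribution of psycopg2, but plain psycopg2 also works if you have the build prerequisites installed:

pip install jupyter
pip install psycopg2-binary sqlalchemy pandas matplotlib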

(Optional) Navigate to your Meltano Project and make the credentials you used with Meltano available to the environment the Jupyter Notebook will run in:

cd /path/to/my/meltano/project
set -a
source .env
set +a

This is an optional step, but allows us to use the same credentials (e.g. for connecting to Postgres) from inside Jupyter Notebook without entering them again and, more importantly, without exposing any sensitive information inside the Notebook in case you want to share the Notebook with others.
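Next, start the Notebook server from inside your project directory:

jupyter notebook

This opens the Notebook Dashboard in your browser (by default at http://localhost:8888), which is where the next step takes place.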

While on the Notebook Dashboard, you can start a new Notebook by selecting Python 3 from the New drop down menu.

We are going to showcase the simplest and most straightforward way to connect to your analytics schema, fetch some transformed data, and generate some plots.

The first step for a data exploration Notebook is to import the libraries required for data exploration and manipulation and then set up the connection to the database (Postgres in our case) so that we can fetch data:

Cell 1

# Import required libraries
import pandas as pd
import psycopg2
import sqlalchemy
import matplotlib.pyplot as plt
import os
from sqlalchemy import create_engine
%matplotlib inline

# Get the Postgres username, password, and database name from the Environment
# You can also set them directly here, but it's better not to include passwords
# or parameters specific to you inside the Notebook
POSTGRES_ADDRESS = os.getenv("PG_ADDRESS")
POSTGRES_PORT = os.getenv("PG_PORT")
POSTGRES_DBNAME = os.getenv("PG_DATABASE")
POSTGRES_USERNAME = os.getenv("PG_USERNAME")
POSTGRES_PASSWORD = os.getenv("PG_PASSWORD")

# Connect to the analytics schema, not one of the schemas with the raw data extracted
PG_SCHEMA = 'analytics'

# A long string that contains the necessary Postgres login information
postgres_str = ('postgresql://{username}:{password}@{ipaddress}:{port}/{dbname}'
                .format(username=POSTGRES_USERNAME,
                        password=POSTGRES_PASSWORD,
                        ipaddress=POSTGRES_ADDRESS,
                        port=POSTGRES_PORT,
                        dbname=POSTGRES_DBNAME))

# Create the connection
cnx = create_engine(postgres_str)

Click |>| Run and we are set to start exploring the data brought in with Meltano.

You can then write queries and generate plots at will.

As an example, assume that you have loaded data from your Zendesk account by using tap-zendesk. You can then check the most important ticket statistics by month:

Cell 2

# Query to send to the Database
sql_query = f'''
SELECT
created_year || '-' || created_month as month,
COUNT(*) as total_tickets,
SUM(ticket_unsolved_counter) as unsolved_tickets,
SUM(ticket_solved_counter) as solved_tickets,
SUM(ticket_one_touch_counter) as one_touch_tickets,
SUM(ticket_reopened_counter) as reopened_tickets,
SUM(replies) as total_replies,
ROUND(AVG(replies), 2) as avg_replies,
ROUND(AVG(full_resolution_time_in_minutes_business), 2) as avg_res_time_mins
FROM {PG_SCHEMA}.zendesk_tickets_xf
GROUP BY created_year, created_month
ORDER BY created_year, created_month;
'''

# Execute the query and store the result in a pandas dataframe
result = pd.read_sql_query(sql_query, cnx)

# Print the result to the output to check what the query brought in
result