In dataduct, data is shared between two activities using S3. After a
step finishes, it saves its output to S3 for subsequent steps to read.
Input and output nodes abstract this process: they represent the S3
directories in which the data is stored. A step’s input node determines
which S3 data it reads as input, and its output node determines where
it stores its output. In most cases, dataduct manages this input-output
node chain for you, but there are situations where you may want finer
control over the process.
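
Consider a minimal two-step pipeline, sketched after the dataduct
examples (treat the exact property layout as illustrative rather than
definitive):

    steps:
    -   step_type: extract-local
        path: data/test_table1.tsv

    # the extract-local step's output node implicitly becomes this
    # step's input node
    -   step_type: create-load-redshift
        table_definition: dev.test_table.sql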

Here, the output of the extract-local step is fed into the
create-load-redshift step, so the pipeline loads the data found in
data/test_table1.tsv into the table defined by dev.test_table.sql. This
behaviour can be made explicit through the name and input_node
properties, as in the following sketch.
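
A sketch of the same pipeline with the link spelled out; the step name
extract_node is just an illustrative label, and the assumption here is
that input_node can reference a step’s output node by that name:

    steps:
    -   step_type: extract-local
        name: extract_node
        path: data/test_table1.tsv

    -   step_type: create-load-redshift
        # read from the output node of the step named extract_node
        input_node: extract_node
        table_definition: dev.test_table.sql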

When one step’s output node is linked to another step’s input node,
whether implicitly or explicitly, a dependency between the two steps is
created automatically. This behaviour can also be made explicit through
the depends_on property, as sketched below.
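
A minimal sketch of an explicit dependency, assuming depends_on accepts
a list of step names as in the dataduct examples:

    steps:
    -   step_type: extract-local
        name: extract_node
        path: data/test_table1.tsv

    -   step_type: create-load-redshift
        input_node: extract_node
        # restates the dependency that the node link above already implies
        depends_on:
        -   extract_node
        table_definition: dev.test_table.sql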

Dataduct usually handles a step’s output nodes automatically, saving
the step’s output to a default path in S3. You can set this default
path through your dataduct configuration file. However, some steps also
have an optional output_path property, allowing you to choose the S3
directory in which the step’s output is stored.
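
A minimal sketch of overriding the default location; the bucket and
prefix in output_path are hypothetical:

    steps:
    -   step_type: extract-local
        path: data/test_table1.tsv
        # any S3 directory you control would work here
        output_path: s3://my-example-bucket/pipelines/extract_output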

Transform steps allow you to run your own scripts. If you want to save
the results of your script, you can store data in the output node by
writing to the directory specified by the OUTPUT1_STAGING_DIR
environment variable.
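
For instance, a minimal sketch of a transform step whose shell command
writes into the staging directory; the command property follows the
dataduct transform examples, and the command itself is only an
illustration:

    steps:
    -   step_type: transform
        # anything written under $OUTPUT1_STAGING_DIR ends up in the
        # step's output node
        command: echo "hello" > $OUTPUT1_STAGING_DIR/hello.txt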

A transform step can also declare more than one output node. In that
case, the script must save data to subdirectories with names matching
the output nodes. In the sketch below, generate_data.py must save data
in the OUTPUT1_STAGING_DIR/foo_data and OUTPUT1_STAGING_DIR/bar_data
directories. If the subdirectory and output node names are mismatched,
the output nodes will not be generated correctly.
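
A minimal sketch of such a step; the output_node property used to list
the nodes and the script path are assumptions made for illustration:

    steps:
    -   step_type: transform
        script: scripts/generate_data.py
        # generate_data.py is expected to write into
        # $OUTPUT1_STAGING_DIR/foo_data and $OUTPUT1_STAGING_DIR/bar_data
        # so that these nodes are populated correctly
        output_node:
        -   foo_data
        -   bar_data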