Parsing command line arguments and passing them around in your Flink application

Almost all Flink applications, both batch and streaming, rely on external configuration parameters.
They are used to specify input and output sources (like paths or addresses), system parameters (parallelism, runtime configuration), and application-specific parameters (typically used within user functions).

Flink provides a simple utility called ParameterTool that offers basic tooling for solving these problems.
Please note that you don’t have to use the ParameterTool described here. Other frameworks such as Commons CLI and
argparse4j also work well with Flink.

Getting your configuration values into the ParameterTool

The ParameterTool provides a set of predefined static methods for reading the configuration. Internally, the tool expects a Map<String, String>, so it is very easy to integrate it with your own configuration style.
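For example, here is a minimal sketch of integrating an existing configuration source via such a map (the keys and values are just illustrative):

Map<String, String> config = new HashMap<>();
config.put("input", "hdfs:///mydata");
config.put("expectedCount", "42");
ParameterTool parameter = ParameterTool.fromMap(config);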

From .properties files

The following method will read a Properties file and provide the key/value pairs:
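A minimal sketch (the file path is just an example; fromPropertiesFile throws IOException, so declare or handle it):

String propertiesFilePath = "/path/to/myjob.properties";
ParameterTool parameter = ParameterTool.fromPropertiesFile(propertiesFilePath);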

From system properties

When starting a JVM, you can pass system properties to it: -Dinput=hdfs:///mydata. You can also initialize the ParameterTool from these system properties:

ParameterTool parameter = ParameterTool.fromSystemProperties();

Using the parameters in your Flink program

Now that we’ve got the parameters from somewhere (see above), we can use them in various ways.

Directly from the ParameterTool

The ParameterTool itself has methods for accessing the values.

ParameterTool parameters = // ...
parameters.getRequired("input");
parameters.get("output", "myDefaultValue");
parameters.getLong("expectedCount", -1L);
parameters.getNumberOfParameters();
// .. there are more methods available.

You can use the return values of these methods directly in the main() method of the client submitting the application.
For example, you could set the parallelism of an operator like this:
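A sketch of that idea, assuming a data set named text and the Tokenizer function shown further below:

ParameterTool parameters = ParameterTool.fromArgs(args);
int parallelism = parameters.getInt("mapParallelism", 2);
DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).setParallelism(parallelism);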

Since the ParameterTool is serializable, you can also pass it to a function itself and use it inside the function to read values from the command line.
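A sketch of that approach; the constructor that stores the ParameterTool inside the Tokenizer is an assumption here and is not part of the class shown below:

ParameterTool parameters = ParameterTool.fromArgs(args);
DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer(parameters));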

Register the parameters globally

Parameters registered as global job parameters in the ExecutionConfig can be accessed as configuration values from the JobManager web interface and in all functions defined by the user.

Register the parameters globally:

ParameterTool parameters = ParameterTool.fromArgs(args);

// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(parameters);

Access them in any rich user function:

public static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        ParameterTool parameters = (ParameterTool)
                getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
        parameters.getRequired("input");
        // .. do more ..
    }
}

Naming large TupleX types

It is recommended to use POJOs (Plain old Java objects) instead of TupleX for data types with many fields.
Also, POJOs can be used to give large Tuple-types a name.

Example

Instead of using:

Tuple11<String, String, ..., String> var = new ...;

It is much easier to create a custom type that extends the large Tuple type.
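A minimal sketch of that idea; the class name CustomType and the all-String field types are just illustrative:

public static class CustomType extends Tuple11<String, String, String, String, String, String,
        String, String, String, String, String> {
    // the subclass only gives the wide tuple a readable name;
    // fields are still accessed as f0 ... f10 (or via getField(i))
}

CustomType var = new CustomType();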

Use Logback when running Flink out of the IDE / from a Java application

When you build your application with Maven, apply the following changes to the pom file:

Exclude all log4j dependencies from all Flink dependencies: this causes Maven to ignore Flink’s transitive dependencies to log4j.

Exclude the slf4j-log4j12 artifact from Flink’s dependencies: since we are going to use the slf4j-to-logback binding, we have to remove the slf4j-to-log4j binding.

Add the Logback dependencies: logback-core and logback-classic

Add dependencies for log4j-over-slf4j. log4j-over-slf4j is a tool that allows legacy applications that use the Log4j APIs directly to work against the Slf4j interface. Flink depends on Hadoop, which uses Log4j directly for logging. Therefore, we need to redirect all logger calls from Log4j to Slf4j, which in turn logs to Logback.

Please note that you need to manually add the exclusions to all new Flink dependencies you are adding to the pom file.

You may also need to check if other (non-Flink) dependencies are pulling in log4j bindings. You can analyze the dependencies of your project with mvn dependency:tree.

Use Logback when running Flink on a cluster

This tutorial is applicable when running Flink on YARN or as a standalone cluster.

In order to use Logback instead of Log4j with Flink, you need to remove log4j-1.2.xx.jar and slf4j-log4j12-xxx.jar from the lib/ directory.

Next, you need to put the following jar files into the lib/ folder:

logback-classic.jar

logback-core.jar

log4j-over-slf4j.jar: This bridge needs to be present in the classpath for redirecting logging calls from Hadoop (which uses Log4j) to Slf4j.

Note that you need to explicitly set the lib/ directory when using a per-job YARN cluster.

The command to submit Flink on YARN with a custom logger is: ./bin/flink run -yt $FLINK_HOME/lib <... remaining arguments ...>