Currently, the character encoding for source files needs to be configured individually for each plugin that processes source files. In this context, a source file is a plain text file that, unlike an XML file, lacks intrinsic means to specify its encoding. Java source files are the most prominent example of such text files. Velocity templates, BeanShell scripts and APT documents are further examples.

Life would become easier if there were a dedicated POM element, accessible as ${project.build.sourceEncoding}, that specifies the encoding once for the entire project. Every plugin could then use it as the default value of its encoding parameter.

Adding this element to the POM structure can only happen in Maven 2.1.

For Maven 2.0.x, the value can be defined as an equivalent property named project.build.sourceEncoding in the properties section of the POM.

Thus plugins could immediately be modified to use the ${project.build.sourceEncoding} expression, whichever Maven version is used.

Motivation

Why bother with file encoding at all? Well, a file encoding (aka charset) is required to solve the following discrepancy: A file stored on disk or transmitted via network is merely a stream of bytes/octets. In contrast, text is a stream of characters. However, a character is not a byte.

To further illustrate this, consider the Unicode standard chosen for a Java String. Unicode defines more than 65,000 characters, whereas a single byte can take only 256 distinct values, so the characters cannot be mapped to a single byte each. Hence, one needs a reversible transformation that defines how to map a character to bytes and vice versa. This transformation is called a file/character encoding.

Now, there are different encodings, each potentially yielding different bytes for the same character. For example, the common encoding ASCII maps the character 'A' to the byte with the hex code 0x41. The same character is mapped to the byte 0xC1 when using the encoding EBCDIC. Another example is the character 'ü' (small letter u with umlaut), which maps to the single byte 0xFC when using ISO-8859-1 but to the two-byte sequence 0xC3 0xBC when using UTF-8.
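The byte values quoted above can be checked with a few lines of Java. This is only a minimal sketch: the class name EncodingBytesDemo and the helper hex are made up for illustration, and it assumes Java 7 or later for StandardCharsets. EBCDIC is left out because the availability of EBCDIC charsets depends on the JRE in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytesDemo {

    /** Encodes the text with the given charset and renders each byte as hex. */
    public static String hex(String text, Charset charset) {
        StringBuilder result = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (result.length() > 0) {
                result.append(' ');
            }
            // Mask to 0..255 so that bytes above 0x7F are not printed as negative ints.
            result.append(String.format("0x%02X", b & 0xFF));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The Unicode escape keeps this example independent of the encoding
        // used to store the source file itself.
        String uUmlaut = "\u00FC";

        System.out.println(hex("A", StandardCharsets.US_ASCII));       // 0x41
        System.out.println(hex(uUmlaut, StandardCharsets.ISO_8859_1)); // 0xFC
        System.out.println(hex(uUmlaut, StandardCharsets.UTF_8));      // 0xC3 0xBC
    }
}
```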

It should be clear by now that encoding a character with one encoding and later on decoding it with a different encoding can corrupt the character. To avoid such errors, it is crucial that all developers of a project have agreed to use the same encoding when editing the project sources and running the build.
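To make the corruption concrete, the following sketch writes 'ü' as UTF-8 and then reads the bytes back as ISO-8859-1. The class and method names are hypothetical and Java 7 or later is assumed. Each of the two UTF-8 bytes is decoded as a separate character, so one character silently turns into two.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    /** Encodes the text with one charset and decodes the resulting bytes with another. */
    public static String recode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String original = "\u00FC"; // 'ü'

        // Same charset on both sides: the round trip is lossless.
        String same = recode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(original)); // true

        // Written as UTF-8 (0xC3 0xBC) but read as ISO-8859-1:
        // the two bytes become the two characters U+00C3 and U+00BC.
        String mixed = recode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(mixed.length());                // 2
        System.out.println(mixed.equals("\u00C3\u00BC"));  // true
    }
}
```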

Default Value

Following a user poll on the mailing list and the numerous comments on this article, this proposal has been revised: plugins should use the platform default encoding if no explicit file encoding has been provided in the plugin configuration.

Since usage of the platform encoding yields platform-dependent and hence potentially irreproducible builds, plugins should output a warning to inform the user about this threat. This way, users can smoothly update their POMs to follow best practices.
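The fallback with a warning might look roughly like the sketch below. It is not taken from any existing plugin: the class EncodingFallback, the method effectiveEncoding and the wording of the warning are assumptions, and a plain Consumer<String> (Java 8 or later) stands in for whatever logger the plugin actually uses.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class EncodingFallback {

    /**
     * Returns the configured encoding, or the platform default together
     * with a warning if no encoding has been configured.
     */
    public static String effectiveEncoding(String configured, Consumer<String> warn) {
        if (configured != null && !configured.trim().isEmpty()) {
            return configured;
        }
        String platform = Charset.defaultCharset().name();
        // Hypothetical message text; a real plugin would pick its own wording.
        warn.accept("No source encoding configured, using platform encoding "
                + platform + "; the build is platform dependent.");
        return platform;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();

        // An explicit value is used as-is and produces no warning.
        System.out.println(effectiveEncoding("UTF-8", warnings::add));
        System.out.println(warnings.size()); // 0

        // A missing value falls back to the platform default and warns once.
        System.out.println(effectiveEncoding(null, warnings::add));
        System.out.println(warnings.size()); // 1
    }
}
```

Returning the platform default rather than failing keeps existing builds working, while the warning nudges users towards declaring the encoding explicitly.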

Code Spots to Review for Proper Encoding Handling

The following classes and/or methods indicate usage of the JVM's default encoding and hence should be reviewed:

String(byte[])

String.getBytes()

FileReader

FileWriter

PrintWriter(File) (new in JDK 5)

PrintWriter(OutputStream)

InputStreamReader(InputStream)

OutputStreamWriter(OutputStream)

ReaderFactory.newPlatformReader()

WriterFactory.newPlatformWriter()

FileUtils.fileRead(String)

FileUtils.fileRead(File)

FileUtils.fileWrite(String, String)

FileUtils.fileAppend(String, String)

IOUtils.toString(InputStream)

IOUtils.toString(InputStream, int)
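As an illustration of what a review might change, the sketch below replaces FileReader and FileWriter with InputStreamReader and OutputStreamWriter constructed with an explicit encoding name. The class ExplicitEncodingIo and its methods are hypothetical helpers, not part of any of the libraries listed above, and the code assumes Java 8 or later.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

public class ExplicitEncodingIo {

    /** Writes the text using the named encoding instead of the JVM default. */
    public static void write(File file, String text, String encoding) {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), encoding)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file using the named encoding instead of the JVM default. */
    public static String read(File file, String encoding) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file), encoding)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** Creates a temporary file that is removed when the JVM exits. */
    public static File newTempFile() {
        try {
            File file = File.createTempFile("encoding-demo", ".txt");
            file.deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = newTempFile();
        write(file, "\u00FC", "UTF-8");
        System.out.println(file.length());                        // 2
        System.out.println(read(file, "UTF-8").equals("\u00FC")); // true
    }
}
```

Because the encoding is passed in explicitly, the result no longer depends on the platform on which the build runs.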

Plugins to Modify

Build plugins are highlighted because the change affects the built artifact more critically for them than for reporting plugins.

References

Please see [0] for the related thread from the mailing list, [1] for some further descriptions and [2] for a similar feature request in JIRA. Also note a related proposal for the output encoding of reports [3].