Affected Version

Analysis

The core ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using Tika, the Apache text extraction library. Let's take a look at the source code of the ingest attachment plugin.

    /** subset of parsers for types we support */
    private static final Parser PARSERS[] = new Parser[] {
        // documents
        new org.apache.tika.parser.html.HtmlParser(),
        new org.apache.tika.parser.rtf.RTFParser(),
        new org.apache.tika.parser.pdf.PDFParser(),
        new org.apache.tika.parser.txt.TXTParser(),
        new org.apache.tika.parser.microsoft.OfficeParser(),
        new org.apache.tika.parser.microsoft.OldExcelParser(),
        ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES),
        new org.apache.tika.parser.odf.OpenDocumentParser(),
        new org.apache.tika.parser.iwork.IWorkPackageParser(),
        new org.apache.tika.parser.xml.DcXMLParser(),
        new org.apache.tika.parser.epub.EpubParser(),
    };

Several of these parsers handle XML. In Tika's XML handling, the object returned by SAXParserFactory.newInstance() gets only a very limited security restriction:

factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

According to Oracle's documentation, enabling XMLConstants.FEATURE_SECURE_PROCESSING instructs the implementation to process XML securely. This may set limits on XML constructs to avoid conditions such as denial of service attacks.

In practice, however, this setting cannot completely prevent XML Entity Expansion. An attacker can still send a crafted XML document to the ES server and cause a denial of service through the XML Entity Expansion vulnerability, as shown in the Proof of Concept section. I checked the Apache Tika source code and found that it only enables the XMLConstants.FEATURE_SECURE_PROCESSING feature. The developers may have assumed that this protects against this type of attack, per this article, since it limits the number of entity expansions to 64,000 by default. But those 64,000 entity expansions are still enough to build an amplified XML object in memory. For example, the ES_XML.xml file in the POC is only around 64KB, but once it is uploaded to the backend Tika server on the ES server, Tika has to allocate at least 6MB of memory to parse it (6MB / 64KB is roughly 100 times), which makes CPU utilization rise rapidly.
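
The amplification can be reproduced at a much smaller scale with the JDK's own SAX parser. The following is a minimal sketch, not the POC file, and it does not involve Tika or Elasticsearch. The class name EntityAmplificationSketch, its method names, and the entity sizes are assumptions chosen for illustration. It builds a document with two levels of nested entities, parses it with FEATURE_SECURE_PROCESSING enabled, and counts the characters the parser delivers.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and sizes are illustrative, not taken from Tika.
final class EntityAmplificationSketch {

    /** Builds a document whose body expands to leafSize * fanOut * refs characters. */
    static String buildDocument(int leafSize, int fanOut, int refs) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>\n<!DOCTYPE r [\n");
        // Entity "a": a plain string of leafSize characters.
        xml.append("<!ENTITY a \"");
        for (int i = 0; i < leafSize; i++) xml.append('x');
        xml.append("\">\n");
        // Entity "b": fanOut references to "a".
        xml.append("<!ENTITY b \"");
        for (int i = 0; i < fanOut; i++) xml.append("&a;");
        xml.append("\">\n]>\n<r>");
        // Body: refs references to "b".
        for (int i = 0; i < refs; i++) xml.append("&b;");
        return xml.append("</r>").toString();
    }

    /** Parses with secure processing enabled and returns the number of characters delivered. */
    static long expandedLength(String xml) {
        final long[] total = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            total[0] += length; // count expanded character data
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return total[0];
    }

    public static void main(String[] args) {
        String doc = buildDocument(100, 10, 20);
        System.out.println("document size (chars): " + doc.length());
        System.out.println("expanded size (chars): " + expandedLength(doc));
    }
}
```

The sketch uses only 220 entity expansions, far below the 64,000 default, so secure processing does not reject it. The ratio between the expanded size and the document size grows with the entity sizes and reference counts.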

Therefore, as I understand it, one possible fix is to set the following system property, which according to the Oracle documentation would greatly limit the impact:

jdk.xml.entityExpansionLimit=1
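
The effect of this property can be checked in isolation. The following is a minimal sketch, assuming the JDK's built-in JAXP implementation, which reads the system property when a parser is created. The class EntityLimitSketch, the parses helper, and the SAMPLE document are hypothetical names used only for illustration, and the property is set programmatically here instead of with -D on the command line.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika or Elasticsearch.
final class EntityLimitSketch {

    /** A small document that references the entity "a" ten times. */
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE r [<!ENTITY a \"x\">]>\n"
            + "<r>&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;</r>";

    /** Returns true if the document parses, false if the parser reports a fatal parse error. */
    static boolean parses(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            try {
                parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                        new DefaultHandler());
                return true;
            } catch (SAXParseException e) {
                // Fatal parse error, for example an exceeded entity expansion limit.
                return false;
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("parses with current settings: " + parses(SAMPLE));
        // Must be set before the parser is created to take effect.
        System.setProperty("jdk.xml.entityExpansionLimit", "1");
        System.out.println("parses with limit 1: " + parses(SAMPLE));
    }
}
```

A limit this strict also rejects legitimate documents that use more than a trivial number of entity references, so it trades compatibility for safety.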

Proof of Concept

The following demonstration was tested and verified on Elasticsearch 6.3.1, which uses Tika 1.18:

0x01 Analysis

What is SAML

Security Assertion Markup Language (SAML, pronounced sam-el[1]) is an open standard for exchanging authentication and authorization data between parties, in particular, between an identity provider and a service provider. As its name implies, SAML is an XML-based markup language for security assertions (statements that service providers use to make access-control decisions). SAML is also:
A set of XML-based protocol messages
A set of protocol message bindings
A set of profiles (utilizing all of the above)