Load URL content stream on-demand, rather than automatically

Details

Description

I think the remote streaming feature should be limited to update request processors. I'm not sure if there is even any use of using it on a /select, but even if there is, it's an unintended security risk. Observe this URL that is roughly the equivalent of an SQL injection attack:

Activity

I don't think we should restrict non-update request handlers from handling streams. Consider DocumentAnalysisRequestHandler - it's handy to be able to stream in text for analysis. There are other request handlers that leverage this as well.

Perhaps, though, a solution is to not allow streams to be resolved unless they are truly needed by the request handler? In your example, there is no need for the standard search request handler to access any streams and thus that URL shouldn't be hit.

Erik Hatcher
added a comment - 26/Oct/11 14:03 I don't think we should restrict non-update request handlers from handling streams. Consider DocumentAnalysisRequestHandler - it's handy to be able to stream in text for analysis. There are other request handlers that leverage this as well.
Perhaps, though, a solution is to not allow streams to be resolved unless they are truly needed by the request handler? In your example, there is no need for the standard search request handler to access any streams and thus that URL shouldn't be hit.

Ryan McKinley
added a comment - 26/Oct/11 14:19 Here is a quick totally untested patch that should behave as Erik describes. Rather then create the URLConnection in the constructor, it waits for someone to call getStream()
this will make effectively limit streaming to requests that hit something that uses it

+1 on Ryan's approach. David, does that work for you? Can someone drum up some test cases to add along with this?

Also, this isn't quite like a SQL injection attack where malicious strings can be put right into the query. It would require some application level flaws to send an arbitrary stream.url parameter over with user/data input, especially to a /select search request. But certainly the point is taken that bad things can happen with stream.url abuse.

Erik Hatcher
added a comment - 26/Oct/11 15:51 +1 on Ryan's approach. David, does that work for you? Can someone drum up some test cases to add along with this?
Also, this isn't quite like a SQL injection attack where malicious strings can be put right into the query. It would require some application level flaws to send an arbitrary stream.url parameter over with user/data input, especially to a /select search request. But certainly the point is taken that bad things can happen with stream.url abuse.

This patch simply adds a test class with a few tests. One of those tests tries to use remote streaming with a select URL and it fails – by design. Once a fix for this issue works, this test should succeed.

David Smiley
added a comment - 26/Oct/11 22:54 This patch simply adds a test class with a few tests. One of those tests tries to use remote streaming with a select URL and it fails – by design. Once a fix for this issue works, this test should succeed.
This is Test Driven Development, by the way.
I don't have time left at the moment to see if Ryan's patch works.

David - Thanks for the strong TDD example! Thanks a lot for that, srsly.

Ryan - Thanks to you for the quick fix.

I tried out the test patch first, got the failure, applied Ryan's patch, test passes. TDD by the book.

I've committed this to trunk, with the change history log of: "Now load URL content stream data (via stream.url) when called for during request handling, rather than loading URL content streams automatically regardless of use."

I think the security aspect of this is a separate issue. What we've done here is only load URL content (file, etc content streams I double-checked, they late load already as it should) when a component calls out for it. So someone could still send in that same evil stream.url to /analysis/document. Let's spin off another issue for something like "Enable fine grained control over allowed content streams", such that one could disable URL content streams, but leave local file content streams possible, say. Not sure that entirely satisfies this issue though, as it certainly is the case that one would have situations where stream.url to load content is really handy, but you certainly don't want any loopback (or fan-out) from malicious data to kill a system either. What do others think about how to address this appropriately on the Solr side (even if that means simply making it clearer what stream.url really does underneath)?

Erik Hatcher
added a comment - 27/Oct/11 05:02 David - Thanks for the strong TDD example! Thanks a lot for that, srsly.
Ryan - Thanks to you for the quick fix.
I tried out the test patch first, got the failure, applied Ryan's patch, test passes. TDD by the book.
I've committed this to trunk, with the change history log of: "Now load URL content stream data (via stream.url) when called for during request handling, rather than loading URL content streams automatically regardless of use."
I think the security aspect of this is a separate issue. What we've done here is only load URL content (file, etc content streams I double-checked, they late load already as it should) when a component calls out for it. So someone could still send in that same evil stream.url to /analysis/document. Let's spin off another issue for something like "Enable fine grained control over allowed content streams", such that one could disable URL content streams, but leave local file content streams possible, say. Not sure that entirely satisfies this issue though, as it certainly is the case that one would have situations where stream.url to load content is really handy, but you certainly don't want any loopback (or fan-out) from malicious data to kill a system either. What do others think about how to address this appropriately on the Solr side (even if that means simply making it clearer what stream.url really does underneath)?

What's left on this issue? I suppose backporting it to 3.x is desirable for the masses? Do these patches work out of the box for 3.x? (if not, can someone whip that up?) Is this particular issue done now? Should we rename it to "Load URL content streams when needed, rather than automatically regardless"?

Erik Hatcher
added a comment - 27/Oct/11 05:10 What's left on this issue? I suppose backporting it to 3.x is desirable for the masses? Do these patches work out of the box for 3.x? (if not, can someone whip that up?) Is this particular issue done now? Should we rename it to "Load URL content streams when needed, rather than automatically regardless"?

Yonik - oops, you're right. I was running the newly added test with -Dtestcase and just forgot to run the full suite before committing. Thank goodness for continuous integration build.

I committed a fix for the test, since ContentStreamBase.URLStream does not set the size until getStream/getReader is called. But looking at where we use stream.getSize() (and other getters) other problems are caused as once a ContentStream is instantiated it is assumed to have a size and a type, but this isn't the case after Ryan's patch. Should we adjust all the places where we use content streams to getStream/getReader before anything else? I think so. And note on getStream/getReader that it must be called prior to getting any details of the stream like size and type.

Erik Hatcher
added a comment - 27/Oct/11 13:08 Yonik - oops, you're right. I was running the newly added test with -Dtestcase and just forgot to run the full suite before committing. Thank goodness for continuous integration build.
I committed a fix for the test, since ContentStreamBase.URLStream does not set the size until getStream/getReader is called. But looking at where we use stream.getSize() (and other getters) other problems are caused as once a ContentStream is instantiated it is assumed to have a size and a type, but this isn't the case after Ryan's patch. Should we adjust all the places where we use content streams to getStream/getReader before anything else? I think so. And note on getStream/getReader that it must be called prior to getting any details of the stream like size and type.

I've been looking at the usage of content streams and the only place I found of concern is ExtractingDocumentLoader, which calls stream.getName()/getSize()/getContentType() before calling getStream(), but this seems to work fine on trunk despite the changes to lazy load the stream and those parameters. This doesn't seem like it should work properly. No?

Erik Hatcher
added a comment - 27/Oct/11 14:53 I've been looking at the usage of content streams and the only place I found of concern is ExtractingDocumentLoader, which calls stream.getName()/getSize()/getContentType() before calling getStream(), but this seems to work fine on trunk despite the changes to lazy load the stream and those parameters. This doesn't seem like it should work properly. No?

Yonik Seeley
added a comment - 27/Oct/11 15:10 This doesn't seem like it should work properly. No?
Definitely should be investigated... exception being swallowed somewhere? getStream() being called more than once?

Here's a patch that fixes the extracting request handler to pull the stream properties after getStream() is called. Stepping through with a debugger showed that before this change the metadata values were being set to null.

Erik Hatcher
added a comment - 27/Oct/11 16:14 Here's a patch that fixes the extracting request handler to pull the stream properties after getStream() is called. Stepping through with a debugger showed that before this change the metadata values were being set to null.

Definitely should be investigated... exception being swallowed somewhere? getStream() being called more than once?

No exception or multiple getStream() calls. Turns out the metadata attributes (which are ignored in the example /update/extract config) were being set to null (as evidenced by stepping through with a debugger attached). The latest patch fixes this. All seems to be well otherwise in investigating the use of other content stream usage where getStream() is called first. I'm going to commit this patch and add a comment to the getStream() method to note that it should be called before the other properties.

Erik Hatcher
added a comment - 27/Oct/11 16:18 Definitely should be investigated... exception being swallowed somewhere? getStream() being called more than once?
No exception or multiple getStream() calls. Turns out the metadata attributes (which are ignored in the example /update/extract config) were being set to null (as evidenced by stepping through with a debugger attached). The latest patch fixes this. All seems to be well otherwise in investigating the use of other content stream usage where getStream() is called first. I'm going to commit this patch and add a comment to the getStream() method to note that it should be called before the other properties.

Regarding future steps to take to make Solr more secure with regards to remote streaming: Personally, I think that, by default, the only handlers that should be able to use this are /update/ registered handlers. That makes Solr easier to secure and is also the biggest use case for this feature. I'd like it to be clearer in solrconfig.xml exactly which handlers can use remote streaming. Presently, you have to have internal knowledge to know that /analysis/document will use it – and that's not cool from a security perspective. You suggested limiting specific URLs or files or files vs URLs but I don't think that is important.

David Smiley
added a comment - 27/Oct/11 17:01 Regarding future steps to take to make Solr more secure with regards to remote streaming: Personally, I think that, by default, the only handlers that should be able to use this are /update/ registered handlers. That makes Solr easier to secure and is also the biggest use case for this feature. I'd like it to be clearer in solrconfig.xml exactly which handlers can use remote streaming. Presently, you have to have internal knowledge to know that /analysis/document will use it – and that's not cool from a security perspective. You suggested limiting specific URLs or files or files vs URLs but I don't think that is important.

Attached is a patch for 3x. I saw the 4 commits Erik made to trunk and I applied it to my 3x checkout. I also added a comment in solrconfig.xml to say:
SearchRequestHandler won't fetch it, but some others do.

David Smiley
added a comment - 01/Dec/11 21:04 Attached is a patch for 3x. I saw the 4 commits Erik made to trunk and I applied it to my 3x checkout. I also added a comment in solrconfig.xml to say:
SearchRequestHandler won't fetch it, but some others do.
The patch was made with git; I hope that won't create a problem.