[SPARK-23599][SQL] Add a UUID generator from Pseudo-Random Numbers
## What changes were proposed in this pull request?
This patch adds a UUID generator based on pseudo-random numbers. We can use it later to implement a deterministic `UUID()` expression.
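The idea can be sketched in a few lines: draw two longs from a seeded PRNG and force the RFC 4122 version and variant bits. This is a minimal illustration, not Spark's actual implementation; the class name is hypothetical.

```scala
import java.util.UUID
import scala.util.Random

// Hypothetical sketch: a deterministic version-4 UUID generator.
// Same seed => same UUID sequence, which is what a deterministic
// UUID() expression needs.
class SeededUuidGenerator(seed: Long) {
  private val rng = new Random(seed)

  def next(): UUID = {
    // Clear the version nibble and set it to 4 (random-based UUID).
    val msb = (rng.nextLong() & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L
    // Clear the top two bits and set the IETF variant (binary 10xx...).
    val lsb = (rng.nextLong() & 0x3FFFFFFFFFFFFFFFL) | Long.MinValue
    new UUID(msb, lsb)
  }
}
```

Two generators built from the same seed produce identical sequences, while the forced bits keep every value a well-formed version-4 UUID.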
## How was this patch tested?
Added unit tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20817 from viirya/SPARK-23599.

[SPARK-23683][SQL] FileCommitProtocol.instantiate() hardening
## What changes were proposed in this pull request?
With SPARK-20236, `FileCommitProtocol.instantiate()` looks for a three argument constructor, passing in the `dynamicPartitionOverwrite` parameter. If there is no such constructor, it falls back to the classic two-arg one.
When `InsertIntoHadoopFsRelationCommand` passes down that `dynamicPartitionOverwrite` flag to `FileCommitProtocol.instantiate()`, it assumes that the instantiated protocol supports the specific requirements of dynamic partition overwrite. It does not notice when this does not hold, so the output generated may be incorrect.
This patch changes `FileCommitProtocol.instantiate()` so that when `dynamicPartitionOverwrite == true`, it requires the protocol implementation to have a 3-arg constructor. Classic 2-arg constructors are still supported when it is false.
It also adds some debug-level logging for anyone trying to understand what's going on.
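The hardened lookup can be sketched with plain reflection. This is a hedged simplification (it takes a `Class` instead of a class name, and ignores the real method's other concerns), not the actual Spark code.

```scala
import scala.util.Try

// Sketch of the hardened constructor lookup described above.
def instantiate[T](clazz: Class[T], jobId: String, outputPath: String,
                   dynamicPartitionOverwrite: Boolean): T = {
  // Prefer the 3-arg constructor, which understands dynamic partition overwrite.
  Try(clazz.getConstructor(classOf[String], classOf[String], classOf[Boolean])).toOption match {
    case Some(ctor) =>
      ctor.newInstance(jobId, outputPath, Boolean.box(dynamicPartitionOverwrite))
    case None =>
      // Hardening: never fall back silently when dynamic overwrite is required.
      require(!dynamicPartitionOverwrite,
        s"${clazz.getName} has no 3-arg constructor; cannot support dynamic partition overwrite")
      clazz.getConstructor(classOf[String], classOf[String]).newInstance(jobId, outputPath)
  }
}
```

The key change is the `require`: the 2-arg fallback is only reachable when the caller did not ask for dynamic partition overwrite.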
## How was this patch tested?
Unit tests verify that:
* classes with only a 2-arg constructor cannot be used with dynamic overwrite
* classes with only a 2-arg constructor can be used without dynamic overwrite
* classes with a 3-arg constructor can be used with both
* the fallback to a 2-arg constructor takes place only after the attempt to load the 3-arg constructor
* passing in invalid class types fails as expected (regression tests on expected behavior)
Author: Steve Loughran <stevel@hortonworks.com>
Closes #20824 from steveloughran/stevel/SPARK-23683-protocol-instantiate.

[SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list
Committed 2018-03-16 by Holden Karau <holden@pigscanfly.ca>; commit 8a72734f33f6a0abbd3207b0d661633c8b25d9ad
## What changes were proposed in this pull request?
Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version. Introduced a common param base class `_CountVectorizerParams` to allow the Python model to also own the parameters. This now matches the Scala class hierarchy.
## How was this patch tested?
Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #16770 from BryanCutler/pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009.

[SPARK-23623][SS] Avoid concurrent use of cached consumers in CachedKafkaConsumer
## What changes were proposed in this pull request?
CachedKafkaConsumer in the `kafka-0-10-sql` project is designed to maintain a pool of KafkaConsumers that can be reused. However, it was built on the assumption that only one task will try to read the same Kafka TopicPartition at a time. Hence, the cache is keyed by the TopicPartition a consumer is supposed to read, and for any cases where this assumption may not hold, there is a SparkPlan flag to disable use of the cache. So it was up to the planner to correctly identify when it was not safe to use the cache and set the flag accordingly.
Fundamentally, this is the wrong way to approach the problem. It is HARD for a high-level planner to reason about the low-level execution model, and whether there will be multiple tasks in the same query trying to read the same partition. Case in point: 2.3.0 introduced stream-stream joins, and you can build a streaming self-join query on Kafka. It is pretty non-trivial to figure out how this leads to two tasks reading the same partition, possibly concurrently. And because of that non-triviality, it is hard to detect this in the planner and set the flag to avoid the cache / consumer pool. This can inadvertently lead to ConcurrentModificationException, or worse, silent reading of incorrect data.
Here is a better way to design this. The planner shouldn't have to understand these low-level optimizations. Rather, the consumer pool should be smart enough to avoid concurrent use of a cached consumer. Currently, it tries to do so, but incorrectly (the `inuse` flag is not checked when returning a cached consumer, see [this](https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala#L403)). If there is another request for the same partition as a currently in-use consumer, the pool should automatically return a fresh consumer that is closed when the task is done. Then the planner does not need a flag to avoid reuse.
This PR is a step towards that goal. It does the following.
- There are effectively two kinds of consumer that may be generated
- Cached consumer - this should be returned to the pool at task end
- Non-cached consumer - this should be closed at task end
- A trait called KafkaConsumer is introduced to hide this difference from the users of the consumer, so that client code does not have to reason about whether to stop or release. It simply calls `val consumer = KafkaConsumer.acquire` and later `consumer.release()`.
- If there is a request for a consumer that is in use, then a new consumer is generated.
- If there is a concurrent attempt of the same task, then a new consumer is generated, and the existing cached consumer is marked for close upon release.
- In addition, I renamed the classes because CachedKafkaConsumer is a misnomer given that what it returns may or may not be cached.
This PR does not remove the planner flag that avoids reuse, to make this patch safe enough for merging into branch-2.3. That can be done later, in master only.
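The intended pool behaviour can be sketched as follows. All names here are hypothetical and the real Kafka consumer is replaced by a stub; the point is only the acquire/release protocol described above.

```scala
import scala.collection.mutable

// Stub standing in for a real Kafka consumer. `cached` records whether
// this instance lives in the pool or was handed out as a one-off.
class PooledConsumer(val topicPartition: String, val cached: Boolean) {
  var inUse = true
  var closed = false
  def close(): Unit = closed = true
}

object ConsumerPool {
  private val cache = mutable.Map[String, PooledConsumer]()

  def acquire(tp: String): PooledConsumer = synchronized {
    cache.get(tp) match {
      case Some(c) if !c.inUse =>
        c.inUse = true; c                       // reuse the cached consumer
      case Some(_) =>
        new PooledConsumer(tp, cached = false)  // concurrent use: fresh, non-cached
      case None =>
        val c = new PooledConsumer(tp, cached = true)
        cache(tp) = c                           // first request: cache it
        c
    }
  }

  def release(c: PooledConsumer): Unit = synchronized {
    c.inUse = false
    if (!c.cached) c.close() // non-cached consumers are closed, not pooled
  }
}
```

The planner never sees any of this: a second request for an in-use partition transparently gets a throwaway consumer, so no safety flag is needed.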
## How was this patch tested?
A new stress test that verifies it is safe to concurrently get consumers for the same partition from the consumer pool.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #20767 from tdas/SPARK-23623.

[SPARK-23680] Fix entrypoint.sh to properly support Arbitrary UIDs
## What changes were proposed in this pull request?
As described in SPARK-23680, entrypoint.sh returns an error code from a command pipeline execution in OpenShift environments, where arbitrary UIDs are used to run containers.
## How was this patch tested?
This patch was manually tested by using the docker-image-tool.sh script to generate a Spark driver image and running an example against an OpenShift cluster.
Author: Ricardo Martinelli de Oliveira <rmartine@rmartine.gru.redhat.com>
Closes #20822 from rimolive/rmartine-spark-23680.

[SPARK-23581][SQL] Add interpreted unsafe projection
## What changes were proposed in this pull request?
We currently can only create unsafe rows using code generation. This is a problem for situations in which code generation fails. There is no fallback, and as a result we cannot execute the query.
This PR adds an interpreted version of `UnsafeProjection`. The implementation is modeled after `InterpretedMutableProjection`. It stores the expression results in a `GenericInternalRow`, and it then uses a conversion function to convert the `GenericInternalRow` into an `UnsafeRow`.
This PR does not implement the actual code generated to interpreted fallback logic. This will be done in a follow-up.
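The two-step shape described above (evaluate into a generic row, then convert) can be illustrated with toy types. These are not Spark's classes; `Expr`, `Col`, and `Add` are stand-ins invented for the example.

```scala
// Toy expression tree standing in for Catalyst expressions.
trait Expr { def eval(input: Seq[Any]): Any }
case class Col(i: Int) extends Expr {
  def eval(input: Seq[Any]): Any = input(i)
}
case class Add(l: Expr, r: Expr) extends Expr {
  def eval(input: Seq[Any]): Any =
    l.eval(input).asInstanceOf[Int] + r.eval(input).asInstanceOf[Int]
}

// Interpreted projection: step 1 stores expression results in a generic
// row; step 2 runs a conversion function to the target row format
// (here just an Array, standing in for UnsafeRow).
class InterpretedProjection(exprs: Seq[Expr], convert: Seq[Any] => Array[Any]) {
  def apply(input: Seq[Any]): Array[Any] = {
    val genericRow = exprs.map(_.eval(input))
    convert(genericRow)
  }
}
```

Because evaluation is a plain tree walk, this path works even when code generation fails, at the cost of the intermediate generic row.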
## How was this patch tested?
I am piggybacking on existing `UnsafeProjection` tests, and I have added an interpreted version for each of them.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #20750 from hvanhovell/SPARK-23581.

[SPARK-18371][STREAMING] Spark Streaming backpressure generates batch with large number of records
## What changes were proposed in this pull request?
Omit rounding of the backpressure rate. Effects:
- no batch with a large number of records is created when the rate from the PID estimator is one
- the number of records per batch and partition is more fine-grained, improving backpressure accuracy
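The effect of the rounding can be seen with illustrative numbers (assumed for this example, not taken from the patch): a PID-estimated rate of 1 record/sec spread over 4 partitions with a 5-second batch interval.

```scala
val batchIntervalMs = 5000L
val rateFromPid    = 1.0   // records/sec estimated by the PID controller
val numPartitions  = 4

// Without rounding: the fractional per-partition rate yields an accurate limit.
val perPartitionRate = rateFromPid / numPartitions                   // 0.25 records/sec
val exactLimit = (perPartitionRate * batchIntervalMs / 1000).toLong  // 1 record

// With rounding (the behaviour being removed): the per-partition rate is
// forced up to a whole record per second, inflating the batch 5x here.
val roundedLimit = math.max(math.round(perPartitionRate), 1L) * batchIntervalMs / 1000
```

The gap grows with the partition count and batch interval, which is how a rate of one from the estimator could still produce batches with a large number of records.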
## How was this patch tested?
This was tested by running:
- `mvn test -pl external/kafka-0-8`
- `mvn test -pl external/kafka-0-10`
- a streaming application which was suffering from the issue
The contribution is my original work and I license the work to the project under the project’s open source license
Author: Sebastian Arzt <sebastian.arzt@plista.com>
Closes #17774 from arzt/kafka-back-pressure.

[SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default`
Committed 2018-03-16 by gatorsmile <gatorsmile@gmail.com>; commit 5414abca4fec6a68174c34d22d071c20027e959d
## What changes were proposed in this pull request?
Currently, some tests assume that `spark.sql.sources.default=parquet`. In fact, that is a correct assumption today, but it makes it difficult to test a new data source format.
This PR aims to
- Make the test suites more robust and easy to use when testing new data sources in the future.
- Test new native ORC data source with the full existing Apache Spark test coverage.
As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The value should be `parquet` when this PR is accepted.
## How was this patch tested?
Pass the Jenkins with updated tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #20705 from dongjoon-hyun/SPARK-23553.

[SPARK-23635][YARN] AM env variable should not overwrite same name env variable set through spark.executorEnv.
Committed 2018-03-16 by jerryshao <sshao@hortonworks.com>; commit c952000487ee003200221b3c4e25dcb06e359f0a
## What changes were proposed in this pull request?
In the current Spark on YARN code, the AM always copies and overwrites its env variables onto executors, so we cannot set different values for executors.
To reproduce the issue, a user can start spark-shell like:
```
./bin/spark-shell --master yarn-client --conf spark.executorEnv.SPARK_ABC=executor_val --conf spark.yarn.appMasterEnv.SPARK_ABC=am_val
```
Then check executor env variables by
```
sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq }.collect.foreach(println)
```
We will always get `am_val` instead of `executor_val`. So we should not let the AM overwrite specifically set executor env variables.
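The intended precedence can be sketched with plain maps (an assumed simplification of the env merge, not the actual YARN client code): the AM's environment acts only as defaults, and executor-specific settings win on key collisions.

```scala
// Environment copied from the AM: provides defaults for executors.
val amEnv = Map("SPARK_ABC" -> "am_val", "JAVA_HOME" -> "/usr/java")

// Set via spark.executorEnv.*: must take precedence over AM values.
val executorEnv = Map("SPARK_ABC" -> "executor_val")

// `++` keeps the right-hand side on duplicate keys, so merging with the
// executor env last gives it priority while keeping the AM defaults.
val merged = amEnv ++ executorEnv
```

The bug was effectively the opposite merge order, with AM values applied last so they clobbered `spark.executorEnv.*` settings.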
## How was this patch tested?
Added UT and tested in local cluster.
Author: jerryshao <sshao@hortonworks.com>
Closes #20799 from jerryshao/SPARK-23635.

[SPARK-23644][CORE][UI] Use absolute path for REST call in SHS
Committed 2018-03-16 by jerryshao <sshao@hortonworks.com>; commit ca83526de55f0f8784df58cc8b7c0a7cb0c96e23
## What changes were proposed in this pull request?
SHS uses a relative path for the REST API call that retrieves the list of applications. When SHS is consumed through a proxy, this can be an issue if the path doesn't end with a "/".
Therefore, we should use an absolute path for the REST call as it is done for all the other resources.
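Why the trailing "/" matters follows from standard relative-reference resolution (RFC 3986, as implemented by `java.net.URI`); the proxy host below is a made-up example:

```scala
import java.net.URI

// With a trailing slash, the relative reference is appended to the base path.
val withSlash = new URI("https://proxy.example.com/shs/").resolve("api/v1/applications")

// Without it, the last path segment is replaced, silently dropping "/shs".
val noSlash = new URI("https://proxy.example.com/shs").resolve("api/v1/applications")
```

Behind a proxy, the second form escapes the SHS prefix entirely, which is why an absolute path for the REST call is the robust choice.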
## How was this patch tested?
manual tests
Before the change:
![screen shot 2018-03-10 at 4 22 02 pm](https://user-images.githubusercontent.com/8821783/37244190-8ccf9d40-2485-11e8-8fa9-345bc81472fc.png)
After the change:
![screen shot 2018-03-10 at 4 36 34 pm 1](https://user-images.githubusercontent.com/8821783/37244201-a1922810-2485-11e8-8856-eeab2bf5e180.png)
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #20794 from mgaido91/SPARK-23644.

[SPARK-23608][CORE][WEBUI] Add synchronization in SHS between attachSparkUI and detachSparkUI functions to avoid concurrent modification issue to Jetty Handlers
Jetty handlers are dynamically attached/detached while SHS is running. But the attach and detach operations might take place at the same time due to the asynchronous load/clear operations in the Guava cache.
## What changes were proposed in this pull request?
Add synchronization between attachSparkUI and detachSparkUI in SHS.
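The shape of the fix can be sketched as follows (hypothetical names; the handler list stands in for Jetty's): both paths take the same lock, so an attach can never interleave with a detach.

```scala
import scala.collection.mutable

object HandlerRegistry {
  // Stand-in for the Jetty handler collection mutated by the SHS.
  private val handlers = mutable.ListBuffer[String]()

  // attachSparkUI / detachSparkUI analogues: one shared monitor guards
  // every read and write, removing the concurrent-modification window.
  def attach(h: String): Unit = handlers.synchronized { handlers += h }
  def detach(h: String): Unit = handlers.synchronized { handlers -= h }
  def count: Int = handlers.synchronized { handlers.size }
}
```

Coarse-grained locking is acceptable here because UI attach/detach is rare compared to request serving, so contention on this lock is negligible.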
## How was this patch tested?
With this patch, the missing Jetty handlers issue no longer occurs in our production cluster's SHS.
Author: Ye Zhou <yezhou@linkedin.com>
Closes #20744 from zhouyejoe/SPARK-23608.

[SPARK-23533][SS] Add support for changing ContinuousDataReader's startOffset
## What changes were proposed in this pull request?
As discussed in #20675, we need to add a new interface `ContinuousDataReaderFactory` to support the requirement of setting the start offset in Continuous Processing.
## How was this patch tested?
Existing UT.
Author: Yuanjian Li <xyliyuanjian@gmail.com>
Closes #20689 from xuanyuanking/SPARK-23533.

[SPARK-23642][DOCS] AccumulatorV2 subclass isZero scaladoc fix
Added/corrected scaladoc for `isZero` on the DoubleAccumulator, CollectionAccumulator, and LongAccumulator subclasses of AccumulatorV2, particularly noting where there are requirements beyond having a value of zero in order to return true.
## What changes were proposed in this pull request?
Three scaladoc comments are updated in AccumulatorV2.scala
No changes outside of comment blocks were made.
## How was this patch tested?
Running `sbt unidoc`, fixing the style errors found, and reviewing the resulting local scaladoc in Firefox.
Author: smallory <s.mallory@gmail.com>
Closes #20790 from smallory/patch-1.