Activity


Prabin Banka
added a comment - 19/Mar/14 02:18

We can write a simple setup.py file for the pyspark source distribution.
Any end user who intends to use the pyspark modules needs to pip install pyspark and set the SPARK_HOME environment variable before importing pyspark into their code.
Also, we could introduce one more environment variable, say SPARK_VERSION, to be validated against the installed pyspark version at import time. A dictionary could be maintained in a text file under spark/python to record which pyspark and Spark versions are compatible.
Will this be sufficient?
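The import-time validation described above could be sketched roughly as follows. This is a hypothetical illustration, not actual pyspark code: the function name, the exception type, and the inlined version table (which the proposal would keep in a text file under spark/python) are all assumptions.

```python
import os

# Hypothetical compatibility table: pyspark version -> compatible Spark versions.
# In the proposal this would be loaded from a text file under spark/python;
# it is inlined here to keep the sketch self-contained.
COMPATIBLE_VERSIONS = {
    "0.9.0": ["0.9.0", "0.9.1"],
    "1.0.0": ["1.0.0"],
}

def check_spark_compatibility(pyspark_version):
    """Validate the SPARK_VERSION env variable against the installed pyspark."""
    spark_version = os.environ.get("SPARK_VERSION")
    if spark_version is None:
        raise RuntimeError("SPARK_VERSION environment variable is not set")
    compatible = COMPATIBLE_VERSIONS.get(pyspark_version, [])
    if spark_version not in compatible:
        raise RuntimeError(
            "pyspark %s is not compatible with Spark %s"
            % (pyspark_version, spark_version)
        )
```

A call like `check_spark_compatibility(pyspark.__version__)` would then run once, at the top of pyspark's `__init__.py`, and fail fast on a mismatch.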


Alex Gaudio
added a comment - 02/Aug/14 04:24

I'm all for pip-installable pyspark, but I'm confused about the ideal way to install the pyspark code. I'd also prefer to avoid introducing an extra variable, SPARK_VERSION. It seems to me that if we had a typical setup.py file that downloaded code from PyPI, users would have to deal with differences between the Python code on PyPI and the code pointed to by SPARK_HOME. Additionally, users would still need to download the Spark jars or set SPARK_HOME, which means two (possibly different) versions of the Python code are flying around. The fact that users have to manage the version, download Spark into SPARK_HOME, and pip install pyspark doesn't seem quite right.
What do you think about this: We create a setup.py file that requires SPARK_HOME be set in the environment (requiring that the user have downloaded Spark) BEFORE the pyspark code gets installed.
An additional idea we could consider: when pip or a user calls "python setup.py install", we redirect to "python setup.py develop." This installs pyspark in "development mode" and means that the pyspark code pointed to by $SPARK_HOME/python is the source of truth (more about development mode here: https://pythonhosted.org/setuptools/setuptools.html#development-mode). My thinking is that since users need to specify SPARK_HOME anyway, we might as well keep the Python library with the Spark code (as it currently is) to avoid potential compatibility conflicts. As maintainers, we also wouldn't need to update PyPI with the latest version of pyspark. That said, using development mode as the default may be a bad idea, and I also don't know how to automatically prefer "setup.py develop" over "setup.py install".
Last, and perhaps most obvious: if we create a setup.py file, we could probably also stop bundling the py4j egg in the Spark downloads, since we'd rely on setuptools to provide the external libraries.


Davies Liu
added a comment - 31/Oct/14 19:11

Because PySpark depends on Spark's packages, a Python user cannot use it after 'pip install pyspark' alone, so there is not much benefit in this.
Also, once we release PySpark separately from Spark, we would have to keep compatibility across versions of PySpark and Spark, which would be a nightmare for us (we could not move fast to improve the implementation of PySpark).
So I don't think we can do this in the near future. Prabin Banka, do you mind closing the PR?