Zeppelin + Spark On EMR?

Zeppelin + Spark On EMR?

shahab
Hi,

I am trying to use Zeppelin with Spark on Amazon EMR. I used the script provided by Anders (https://gist.github.com/andershammar/224e1077021d0ea376dd) to set up Zeppelin. Zeppelin can connect to Spark, but when I run the tutorials I get the following error:

...FileNotFoundException: File file:/home/hadoop/zeppelin/interpreter/spark/dep/zeppelin-spark-dependencies-0.6.0-incubating-SNAPSHOT.jar does not exist

However, the above file does exist at that path on the master node.

I would appreciate it if anyone could share their experience setting up Zeppelin on EMR.

best,
/Shahab


Re: Zeppelin + Spark On EMR?

Phil Wills
Anders' script is a bit out of date if you're using the latest version of EMR. Here's my fork:


which worked OK for me fairly recently.

Phil  


Re: Zeppelin + Spark On EMR?

shahab
Thanks Phil, it works. Great job and well done!

best,
/Shahab


Re: Zeppelin + Spark On EMR?

Eugene
Here's a bit shorter alternative, too


Best regards,
Eugene.

Re: Zeppelin + Spark On EMR?

Anders Hammar
Hi,

Thank you, Phil, for updating my script to support the latest version of EMR.
I have edited my gist to include some of your updates, plus a few additional changes.


While on the subject, has anyone been able to get Zeppelin to work with Amazon's Spark installation on Amazon EMR 4.x (by exporting SPARK_HOME and HADOOP_HOME instead)? When I try this, I get the following exception:

org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
    at org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$8.apply(SparkConf.scala:444)
    at org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$8.apply(SparkConf.scala:442)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:442)
    at org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:430)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:430)
    ...

From a quick look, the problem seems to be that Amazon's Spark installation uses SPARK_CLASSPATH to add additional libraries (/etc/spark/conf/spark-env.sh), while Zeppelin uses "spark-submit --driver-class-path" (zeppelin/bin/interpreter.sh).
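
A possible workaround (an untested sketch on my part; it assumes spark-env.sh sets the variable as export SPARK_CLASSPATH="..." and that spark-defaults.conf already contains a spark.driver.extraClassPath line) would be to merge the variable into the property and then remove the export:

# Untested sketch: Spark aborts when both spark.driver.extraClassPath and
# SPARK_CLASSPATH are set, so fold the env var into the property and drop it.
EXTRA=$(grep '^export SPARK_CLASSPATH' /etc/spark/conf/spark-env.sh | cut -d'"' -f2)
sudo sed -i "s|^\(spark.driver.extraClassPath.*\)|\1:${EXTRA}|" /etc/spark/conf/spark-defaults.conf
sudo sed -i '/^export SPARK_CLASSPATH/d' /etc/spark/conf/spark-env.sh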

Any ideas?

Best regards,
Anders


Re: Zeppelin + Spark On EMR?

Eugene
Hi Anders,

I also hit the error you mention and got around it by:
  1. using the Spark installation bundled with Zeppelin, and
  2. copying properties such as "spark.executor.instances", "spark.executor.cores", and "spark.default.parallelism" from spark-defaults.conf into conf/interpreter.json, parsing that file with parts of your gist.
The code looks like this:

cd ~/zeppelin/conf/

# Pull the executor settings out of the EMR-generated spark-defaults.conf.
SPARK_DEFAULTS=~/emr-spark-defaults.conf
SPARK_EXECUTOR_INSTANCES=$(grep spark.executor.instances $SPARK_DEFAULTS | awk '{print $2}')
SPARK_EXECUTOR_CORES=$(grep spark.executor.cores $SPARK_DEFAULTS | awk '{print $2}')
SPARK_EXECUTOR_MEMORY=$(grep spark.executor.memory $SPARK_DEFAULTS | awk '{print $2}')
SPARK_DEFAULT_PARALLELISM=$(grep spark.default.parallelism $SPARK_DEFAULTS | awk '{print $2}')

# Write them into the Spark interpreter's properties ("2B188AQ5T" is the Spark
# interpreter's ID in my interpreter.json). jq cannot edit a file in place, so
# write to a temporary file and copy it back.
jq ".interpreterSettings.\"2B188AQ5T\".properties.\"spark.executor.instances\" = \"${SPARK_EXECUTOR_INSTANCES}\"
  | .interpreterSettings.\"2B188AQ5T\".properties.\"spark.executor.cores\" = \"${SPARK_EXECUTOR_CORES}\"
  | .interpreterSettings.\"2B188AQ5T\".properties.\"spark.executor.memory\" = \"${SPARK_EXECUTOR_MEMORY}\"
  | .interpreterSettings.\"2B188AQ5T\".properties.\"spark.default.parallelism\" = \"${SPARK_DEFAULT_PARALLELISM}\"" interpreter.json > interpreter.json_
cat interpreter.json_ > interpreter.json
rm interpreter.json_
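
One caveat: "2B188AQ5T" is just the ID the Spark interpreter happens to have in my interpreter.json, and the ID differs between installations. An untested sketch like the following could look it up by name instead of hard-coding it (it assumes each interpreterSettings entry carries a "name" field):

# Untested sketch: resolve the Spark interpreter's ID from interpreter.json.
SPARK_INTERPRETER_ID=$(jq -r '.interpreterSettings | to_entries[] | select(.value.name == "spark") | .key' interpreter.json)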


Best regards,
Eugene.

Re: Zeppelin + Spark On EMR?

Ophir Cohen
Did anyone else encounter this problem?

I removed the --driver-class-path "${CLASSPATH}" option from the bin/interpreter.sh script, and now it starts the SparkContext as expected.
The problem is that it no longer picks up my local hive-site.xml, which points to an external metastore, and it tries to use the local one instead :(
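
A possible workaround (an untested sketch; the paths assume the EMR defaults and the ~/zeppelin install location used earlier in this thread) is to copy hive-site.xml into Zeppelin's conf directory, which should still be on the interpreter's classpath, and restart Zeppelin:

# Untested sketch: make hive-site.xml visible to the interpreter without
# relying on --driver-class-path.
cp /etc/hive/conf/hive-site.xml ~/zeppelin/conf/
~/zeppelin/bin/zeppelin-daemon.sh restart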


Re: Zeppelin + Spark On EMR?

Jonathan
Zeppelin is now supported on EMR in release emr-4.1.0 without the need for any bootstrap action like this. See https://aws.amazon.com/blogs/aws/amazon-emr-release-4-1-0-spark-1-5-0-hue-3-7-1-hdfs-encryption-presto-oozie-zeppelin-improved-resizing/ for the emr-4.1.0 announcement.

BTW, the version of Zeppelin bundled with emr-4.1.0 is a SNAPSHOT version of 0.6.0, built from commit a345f768471e9b8c89f4eb4d3aba6b684bff75b3.
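
For example, such a cluster could be launched from the AWS CLI with something like this (a sketch only: the cluster name, instance settings, and key name are placeholders, and on the 4.x release series Zeppelin is listed under the application name "Zeppelin-Sandbox"):

# Sketch: launch an emr-4.1.0 cluster with Spark and Zeppelin preinstalled.
# Instance settings and the key name below are placeholders.
aws emr create-cluster \
  --name "zeppelin-test" \
  --release-label emr-4.1.0 \
  --applications Name=Spark Name=Zeppelin-Sandbox \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles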

~ Jonathan


Re: Zeppelin + Spark On EMR?

Anders Hammar
That is great!

Note that executor settings still need to be configured manually in Zeppelin, though.

"Zeppelin does not use some of the settings defined in your cluster’s spark-defaults.conf configuration file (though it will instruct YARN to dynamically allocate executors if you have enabled that setting). You must set executor settings (e.g. memory and cores) in the Interpreter tab and then restart the interpreter for them to be used."


/Anders


Re: Zeppelin + Spark On EMR?

Jonathan
Right, though that is something that could probably be fixed in a later release: Zeppelin's executor core/memory settings could simply default to the Spark defaults. You might, of course, want Zeppelin's settings to differ from your Spark defaults, but it seems reasonable to start from the same values and let you change them after the fact if necessary. In the future, we could also make Zeppelin's interpreter settings configurable via the Configuration API. What are your opinions on this?

Thanks,
Jonathan
