Using separate SPARK_HOME in Zeppelin

Patrik Iselind-2
Hi,

I'm trying to build a Docker image for Zeppelin from which I can use a Spark standalone cluster. As I understand it, I need to include a Spark installation in the image and point to it with the environment variable SPARK_HOME. I think I've done this correctly, but it doesn't work. I hope someone on this list can see what I'm missing.

I have a base image for Zeppelin:
```Dockerfile for zeppelin:alpine
FROM       alpine:3.8

ARG        DIST_MIRROR=http://archive.apache.org/dist/zeppelin
ARG        VERSION=0.8.2
ENV        ZEPPELIN_HOME=/opt/zeppelin \
    JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk \
    PATH=$PATH:/usr/lib/jvm/java-1.8-openjdk/jre/bin:/usr/lib/jvm/java-1.8-openjdk/bin
RUN        apk add --no-cache bash curl jq openjdk8 py3-pip && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    mkdir -p ${ZEPPELIN_HOME} && \
    curl ${DIST_MIRROR}/zeppelin-${VERSION}/zeppelin-${VERSION}-bin-all.tgz | tar xvz -C ${ZEPPELIN_HOME} && \
    mv ${ZEPPELIN_HOME}/zeppelin-${VERSION}-bin-all/* ${ZEPPELIN_HOME} && \
    rm -rf ${ZEPPELIN_HOME}/zeppelin-${VERSION}-bin-all && \
    rm -rf *.tgz
EXPOSE     8080
VOLUME     ${ZEPPELIN_HOME}/logs \
    ${ZEPPELIN_HOME}/notebook
WORKDIR    ${ZEPPELIN_HOME}
CMD        ./bin/zeppelin.sh run
```

On top of this base image I add Spark 3.0.1, copied from the same Bitnami image that my Spark cluster uses.
``` Dockerfile for zeppelin:latest
FROM docker.io/bitnami/spark:3.0.1-debian-10-r32 AS sparkimage

FROM zeppelin:alpine

COPY --from=sparkimage /opt/bitnami/spark /opt/spark

RUN cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh && \
    echo "export SPARK_HOME=/opt/spark" >> conf/zeppelin-env.sh && \
    echo "export PYTHONPATH=\$SPARK_HOME/python/" >> conf/zeppelin-env.sh && \
    echo "export PYTHONPATH=\$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:\$PYTHONPATH" >> conf/zeppelin-env.sh && \
    echo "export PYSPARK_PYTHON=python3" >> conf/zeppelin-env.sh && \
    echo "export PYSPARK_DRIVER_PYTHON=python3" >> conf/zeppelin-env.sh

RUN cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml

# From 0.8.2, the Zeppelin server binds to 127.0.0.1 by default instead of
# 0.0.0.0. Set the zeppelin.server.addr property or the ZEPPELIN_ADDR env
# variable to change this.
ENV ZEPPELIN_ADDR="0.0.0.0"
```

Now I start zeppelin:latest without changing any interpreter settings; no changes are needed to reproduce the issue. Later, once starting a pyspark interpreter works, I intend to set spark.master to spark://spark-master:7077.

Open a new notebook.

```example
%python

import pyspark
print(pyspark.version.__version__)
```
prints
```output
3.0.1
```
This is exactly what I expect.
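
The import above only shows that PYTHONPATH resolves. A slightly broader sanity check of the copied Spark installation could be sketched as follows. This is a minimal sketch, not part of my setup: the helper name `check_spark_home` is hypothetical, and it assumes that the interpreter process inherits SPARK_HOME from conf/zeppelin-env.sh and that the installation keeps `bin/spark-submit`, `python/pyspark` and `python/lib/py4j-*-src.zip` in their usual places.

```python
import os
from pathlib import Path


def check_spark_home(env=None):
    """Return a list of problems found with the SPARK_HOME setup (empty if none).

    Hypothetical helper for illustration; the expected layout is an assumption.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    root = Path(spark_home)
    problems = []
    # The launcher script that is normally used to start a Spark driver.
    if not (root / "bin" / "spark-submit").is_file():
        problems.append(f"no bin/spark-submit under {root}")
    # The pyspark package that PYTHONPATH points at.
    if not (root / "python" / "pyspark").is_dir():
        problems.append(f"no python/pyspark package under {root}")
    # The py4j source zip; the exact version differs between Spark releases.
    if not list((root / "python" / "lib").glob("py4j-*-src.zip")):
        problems.append(f"no py4j-*-src.zip under {root / 'python' / 'lib'}")
    return problems


if __name__ == "__main__":
    for line in check_spark_home() or ["SPARK_HOME looks complete"]:
        print(line)
```

In a `%python` paragraph, calling `check_spark_home()` and printing the result would list anything missing from the copied directory. An empty list only says the files are present, not that the versions are compatible with Zeppelin.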

Now comes the troublesome part.

```example
%pyspark

print(sc)
```
prints
```output
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
        at org.apache.zeppelin.spark.BaseSparkScalaInterpreter.getUserJars(BaseSparkScalaInterpreter.scala:382)
        at org.apache.zeppelin.spark.SparkScala211Interpreter.open(SparkScala211Interpreter.scala:71)
        at org.apache.zeppelin.spark.NewSparkInterpreter.open(NewSparkInterpreter.java:102)
        at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:62)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
        at org.apache.zeppelin.spark.PySparkInterpreter.getSparkInterpreter(PySparkInterpreter.java:664)
        at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:260)
        at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:194)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:616)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
        at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
```
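
The trace goes through `SparkScala211Interpreter`, and a `NoSuchMethodError` on `scala.Predef$.refArrayOps` is often a sign that code compiled for one Scala binary version is running against another. One check that might narrow this down is to see which Scala library the copied Spark ships with. The sketch below assumes the distribution keeps a `scala-library-<version>.jar` under `$SPARK_HOME/jars`; the function names are hypothetical and only for illustration.

```python
import os
import re
from pathlib import Path


def scala_library_version(spark_home):
    """Return the Scala version found in <spark_home>/jars, or None.

    Assumes a jar named scala-library-<major>.<minor>.<patch>.jar exists there.
    """
    jars_dir = Path(spark_home, "jars")
    if not jars_dir.is_dir():
        return None
    for jar in sorted(jars_dir.glob("scala-library-*.jar")):
        match = re.fullmatch(r"scala-library-(\d+\.\d+\.\d+)\.jar", jar.name)
        if match:
            return match.group(1)
    return None


def binary_version(version):
    """Reduce a full Scala version such as '2.12.10' to its binary version '2.12'."""
    return ".".join(version.split(".")[:2])


if __name__ == "__main__":
    # /opt/spark is the location used in the Dockerfile above.
    found = scala_library_version(os.environ.get("SPARK_HOME", "/opt/spark"))
    print("Scala library:", found)
    if found is not None:
        print("Scala binary version:", binary_version(found))
```

If the reported binary version is not 2.11, that would at least be consistent with the error above, since the interpreter class being opened is the Scala 2.11 one.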

What am I missing to get the %pyspark interpreter to work?


===========
Patrik Iselind, IDD

If anything is unclear, don't hesitate to ask more.