Restart zeppelin spark interpreter


Jung, Soonoh
Hi everyone,

I am using Zeppelin on AWS EMR (Zeppelin 0.6.1, Spark 2.0 on YARN).
The Zeppelin Spark interpreter's Spark application does not finish after a notebook executes, and it looks like it is still occupying a lot of memory in my YARN cluster.
Is there a way to restart the Spark interpreter automatically (or programmatically) every time I run a notebook, in order to release that memory in my YARN cluster?

Regards,
Soonoh
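
For reference, Zeppelin does expose a REST endpoint for restarting an interpreter setting, which is one way to do this programmatically. A minimal sketch, assuming Zeppelin is on its default port 8080 and the setting id is looked up first (the id shown is hypothetical; substitute the one returned by the first call):

```shell
# List interpreter settings to find the id of the Spark interpreter setting.
curl -s http://localhost:8080/api/interpreter/setting

# Restart that interpreter setting, which kills its YARN application
# and releases the memory it was holding.
curl -s -X PUT http://localhost:8080/api/interpreter/setting/restart/2ANGGHHMQ
```

A call like the second one could be scripted to run after each notebook execution, at the cost of paying interpreter startup time on the next run.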

Re: Restart zeppelin spark interpreter

Jonathan
On the last several releases of EMR, Spark dynamicAllocation is enabled automatically; it allows longer-running apps like Zeppelin's Spark interpreter to keep running in the background without holding resources for any executors unless Spark jobs are actively running.

However, if you are seeing resources still being used even after some idle time, you may be using maximizeResourceAllocation (which makes any Spark job use 100% of the cluster's resources, with one executor per slave node). maximizeResourceAllocation effectively disables dynamicAllocation because it causes spark.executor.instances to be set. If you still want to use dynamicAllocation along with maximizeResourceAllocation, just set spark.dynamicAllocation.enabled to true in the spark-defaults configuration classification. This signals the maximizeResourceAllocation feature not to set spark.executor.instances, so dynamicAllocation will be used.
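
For example, the relevant configuration classifications might look like this (a sketch in the same JSON shape used for --configurations later in this thread; EMR documents maximizeResourceAllocation under the "spark" classification):

```json
[
  {"Classification": "spark",
   "Properties": {"maximizeResourceAllocation": "true"}},
  {"Classification": "spark-defaults",
   "Properties": {"spark.dynamicAllocation.enabled": "true"}}
]
```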

Keep in mind that this might not be the most ideal way to use dynamicAllocation, though (especially if you don't have many nodes in the cluster), because maximizeResourceAllocation makes the executors very coarse-grained, since there is only one per node. It would still allow multiple applications to run at once, because executors from one application can spin down when idle, allowing another application to spin up executors.
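
The spin-down behavior described above is governed by a couple of Spark properties worth knowing. The values below are the Spark 2.0 defaults; note in particular that executors holding cached data are never reclaimed by default, which can keep memory pinned after jobs finish:

```json
{"Classification": "spark-defaults",
 "Properties": {
   "spark.dynamicAllocation.executorIdleTimeout": "60s",
   "spark.dynamicAllocation.cachedExecutorIdleTimeout": "infinity"
 }}
```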

Hope this helps,
Jonathan

Re: Restart zeppelin spark interpreter

Jung, Soonoh
Hi Jonathan,

Thank you for the information!
Yes, I am using maximizeResourceAllocation. I will try turning it off and using dynamicAllocation alone.

Regards,
Soonoh


Re: Restart zeppelin spark interpreter

Jung, Soonoh
Hi Jonathan,

I reinstalled EMR without maximizeResourceAllocation, but I still have the same problem.

I changed the default Spark interpreter to the "isolated" option.
After I executed some notes and all of them finished, the cluster still uses a lot of memory.
The Spark history UI says all jobs completed, but on the executors tab of one app (application_1475616447166_0014) there are lots of active executors, while the other three Spark apps each have only one active executor.
I wonder why those executors are not removed and the YARN cluster memory is not released.

[screenshots of the YARN console and Spark history UI executors tabs]

Here is how I created the AWS EMR cluster:

aws emr create-cluster \
     --termination-protected \
     --applications Name=Ganglia Name=Spark Name=Zeppelin Name=Hive \
     --service-role EMR_DefaultRole \
     --enable-debugging \
     --release-label emr-5.0.0 \
     --name "${EMR_NAME}" \
     --instance-groups '[
{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master Instance Group"},
{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"r3.xlarge","Name":"Core Instance Group"},
{"InstanceCount":6,"BidPrice":"0.15","InstanceGroupType":"TASK","InstanceType":"r3.xlarge","Name":"Task instance group - 6"}]' \
     --configurations '[
{"Classification":"hadoop-env",
    "Properties":{},
    "Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},
{"Classification":"spark-env",
    "Properties":{"maximizeResourceAllocation":"false"},
    "Configurations":[{"Classification":"export","Properties":{"JAVA_HOME":"/usr/lib/jvm/java-1.8.0"},"Configurations":[]}]},
{"Classification":"zeppelin-env",
    "Properties":{},
    "Configurations":[
        {"Classification":"export","Properties":{
            "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
            "ZEPPELIN_NOTEBOOK_S3_BUCKET":"zeppelin-notebook",
            "ZEPPELIN_NOTEBOOK_S3_USER":"zeppelin-user"}
    }]
}]'

There is no other manual configuration on the Zeppelin, Spark, or YARN side.

Regards,
Soonoh

