Zeppelin best practices/ efficiencies


Zeppelin best practices/ efficiencies

Shan Potti

Hello Zeppelin users,

 

I’m reaching out to you for some guidance on best practices. We currently use Zeppelin 0.7.0 on EMR, and I have a few questions about gaining efficiencies with this setup that I would like to get addressed. I would really appreciate it if any of you could help me with these issues or point me to the right person or team.

 

1. Interpreter Settings

 

I understand that the newer versions (we are currently on Zeppelin 0.7) offer different interpreter binding modes, such as Scoped, Isolated, and Shared.

Multiple users on our team use the Zeppelin application by creating separate notebooks. Sometimes jobs run endlessly, fail to execute, or time out because memory is maxed out. We tend to restart the interpreter, or are sometimes forced to restart the Zeppelin application on the EMR master node, to resume operations. Is this the best way to deal with such issues?
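For reference, I assume the affected interpreter can also be restarted through Zeppelin's REST API rather than through the UI or a full Zeppelin restart; a rough sketch, with <zeppelin-port> standing in for whatever port Zeppelin listens on and <SETTING_ID> a placeholder for the Spark interpreter setting's ID:

    # list interpreter settings to find the Spark setting's ID (used as <SETTING_ID> below)
    curl -s http://localhost:<zeppelin-port>/api/interpreter/setting
    # restart only that interpreter setting instead of the whole Zeppelin service
    curl -s -X PUT http://localhost:<zeppelin-port>/api/interpreter/setting/restart/<SETTING_ID>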

We currently use the ‘Scoped’ interpreter setting, i.e. it sets up an interpreter instance per note.

Would you recommend that we continue to use this interpreter setting, or do you think we would be better served by one of the other available settings? I did take a look at the Zeppelin documentation for information on these settings, but anything additional would be greatly helpful.

 

Also, is there a way to accurately determine how much of the available memory is being used by the various jobs on Zeppelin? The ‘Job’ tab shows which jobs are running in the various notebooks, but it gives us no insight into the memory or compute power being used.

 

Ideally, I would like to figure out the root cause of why my queries are not running: is it because memory is maxed out on Zeppelin, HDFS, or Spark, or because there are not enough compute nodes?
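I assume the closest we can get today is the YARN-level view; a rough sketch of what that would look like from the master node, assuming the usual EMR defaults (YARN ResourceManager UI on port 8088, Spark history server on 18080):

    # per-application memory/vcore usage as reported by YARN (run on the EMR master node)
    yarn application -list     # running applications with their tracking URLs
    yarn top                   # live memory and vcore usage per application
    # the ResourceManager UI (http://<master-dns>:8088) links to each application's Spark UI,
    # which breaks usage down by driver and executors; the Spark history server
    # (http://<master-dns>:18080) covers finished applications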

 

I would really appreciate it if you could share any documentation that can guide me on these aspects.

 

2. Installation Ports

By default, Zeppelin on EMR is installed on port 8890. However, to be compliant with our security policies, we needed to use a different port. We made this change by editing the Zeppelin configuration file over SSH. I’m concerned that this approach may have cloned the application onto other ports and that it may be restricting my use of Zeppelin. Is this the right way to run Zeppelin on another port?
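For reference, the edit we made was essentially the following; I assume zeppelin.server.port in zeppelin-site.xml is the relevant property (<approved-port> is a placeholder for the port our security policy allows, and the config path is where EMR puts it on our master node):

    <!-- /etc/zeppelin/conf/zeppelin-site.xml -->
    <property>
      <name>zeppelin.server.port</name>
      <value><approved-port></value>
    </property>

followed by restarting the service on the master node (sudo stop zeppelin && sudo start zeppelin on this EMR release). I assume the same setting could also be made via export ZEPPELIN_PORT=<approved-port> in zeppelin-env.sh.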

 

I appreciate any pointers you may have. Please see below for more information on the cluster and the applications running on it.

 

Thanks,

Shan

 

Cluster Details:

Release label: emr-5.4.0

Applications: Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, HBase 1.3.0, Zeppelin 0.7.0, Oozie 4.3.0, Mahout 0.12.2


Re: Zeppelin best practices/ efficiencies

Jeff Zhang

Regarding the interpreter memory issue: Zeppelin's Spark interpreter only supports yarn-client mode, which means the driver runs on the same host as the Zeppelin server. So it is pretty easy to run out of memory when many users share the same driver host (the scoped mode you use). You can try the Livy interpreter, which supports yarn-cluster mode, so that the driver runs on a remote host and each user gets an isolated Spark application.
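Roughly, the setup would look like this; a sketch assuming Livy is installed on the cluster and listening on its default port 8998 (exact property names can differ a bit between Livy versions):

    # livy.conf on the Livy server: run the Spark driver on the YARN cluster,
    # not on the Zeppelin host
    livy.spark.master = yarn
    livy.spark.deploy-mode = cluster

    # Zeppelin interpreter setting for the livy interpreter group
    zeppelin.livy.url = http://<livy-host>:8998

Then in a note you use %livy.spark (or %livy.pyspark, %livy.sql) instead of %spark, and each user's paragraphs run in their own Spark application on YARN rather than sharing a driver on the Zeppelin host.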




Re: Zeppelin best practices/ efficiencies

Shan Potti
Thanks, Jeff!

I'll look into this solution.
