spark.r interpreter becomes unresponsive after some time and R process quits silently


spark.r interpreter becomes unresponsive after some time and R process quits silently

Pietro Pugni
Hi all,
I am facing a strange issue on two different machines that act as servers. Each of them runs an instance of Zeppelin installed as a systemd service.
The configuration is:
 - Ubuntu Server 16.04.2 LTS
 - Spark 2.1.0
 - Microsoft Open R 3.3.2
 - Zeppelin 0.7.1 (0.7.0 gave the same problems)

zeppelin-env.sh has the following settings:
export SPARK_HOME="/spark/home/directory"

spark-env.sh has the following settings:
export LANG="en_US"
export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"

spark-defaults.conf is set as:
spark.executor.memory                   21g
spark.driver.memory                     21g
spark.python.worker.memory              4g
spark.sql.autoBroadcastJoinThreshold    0

I use Spark in stand-alone mode and it works perfectly. It also works correctly with Zeppelin, but this is what happens:
1) Start Zeppelin on the server using the command service zeppelin start
2) Connect to port 8080 from a client using Mozilla Firefox
3) Enter username and password (I enabled Shiro authentication)
4) Open a notebook
5) Execute the following code:
%spark.r
2+2
6) The code runs correctly and I can see that R is running as a process.
7) Repeat steps 2-5 after some time (say 2 or 3 hours): Zeppelin stays on “Running” forever or, if even more time has elapsed since the last run (for example 1 day), it returns “Error”. The time it takes to become unresponsive seems random and unpredictable. Also, R is no longer present in the list of running processes (see the process check below). The Spark session remains active, because I can access the Spark UI on port 4040 and the application name is “Zeppelin”, so it’s the Spark instance created by Zeppelin.
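
A quick way to check which of these processes are still alive (the match patterns are assumptions about how they appear in the process list; adjust them to your install):

pgrep -x R                         # the R process launched by the SparkR interpreter
pgrep -f RemoteInterpreterServer   # the Zeppelin interpreter JVM
pgrep -f ZeppelinServer            # the Zeppelin server itself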

I observed that sometimes I can simply restart the interpreter from the Zeppelin UI, but many other times that doesn’t work and I have to restart Zeppelin (service zeppelin restart).

This issue affects both 0.7.0 and 0.7.1, but I haven’t tried previous versions. It also happens if Zeppelin isn’t installed as a service.

I can’t provide more detail because I can’t see any error or warning in the logs, which is really strange.

Thank you all.
Kind regards
 Pietro Pugni

spark.r interpreter becomes unresponsive after some time and R process quits silently

Paul Brenner
Great work documenting repeatable steps for this hard-to-nail-down problem. I see similar problems running the spark (scala) interpreter, but I haven’t been as systematic about hunting down the issue as you.

I do wonder if this is somehow related to https://issues.apache.org/jira/browse/ZEPPELIN-1832, which seems to have addressed killing off zombie processes, but I’m not sure it covered where the zombie processes come from. Perhaps we need to open a ticket for this?

In the meantime, if you don’t have the ability to restart Zeppelin every time you run into this problem, you can probably just kill the interpreter process (a sketch follows). I find myself doing that multiple times in a normal workday.
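
For example, something like this (a sketch; the class name is an assumption about how the interpreter JVM appears in the process list):

pgrep -af RemoteInterpreterServer   # list PID and command line of the interpreter JVM
kill <pid>                          # SIGTERM first; use kill -9 only as a last resort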

Paul Brenner
DATA SCIENTIST
(217) 390-3033  

PlaceIQ:Location Data Accuracy


Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Pietro Pugni
I know for sure that the R process gets killed (or quits), but I don't know whether its parent process (interpreter.sh) gets killed too.

I noticed that I can always restart the interpreter on 0.7.1, while on 0.7.0 it was sometimes impossible (I had to manually restart the zeppelin service). Probably that JIRA improved the situation a little.

Now I'm running a bash script that tracks the start and stop times of the R process in order to shed some light on this issue; a minimal sketch is below. I also enabled DEBUG logging in the log4j properties file.
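
A minimal sketch of the tracking script (it assumes the SparkR process appears as a plain "R" process, which is what the pattern matches):

#!/usr/bin/env bash
# Print a timestamped line whenever an R process appears or disappears.
state="stopped"
while true; do
    if pgrep -x R > /dev/null; then now="started"; else now="stopped"; fi
    if [ "$now" != "$state" ]; then
        echo "$(date) >>> R $now"
        state="$now"
    fi
    sleep 1
done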



Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Pietro Pugni
I'm reposting this because it didn’t appear on the mailing list board.

These are the steps needed to reproduce the error and track down the log messages.

1) I started a brand-new instance of Zeppelin by issuing:
service zeppelin start

and started a bash script that tracks R process activity.
After running a simple R script from Zeppelin, the R interpreter process started:

Mon May  8 11:27:59 CEST 2017 >>> R started

2) I left the browser open, then closed it at 12:26:15. Zeppelin logged the connection being closed:
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null

3) At 13:08:00 the R process was closed. My script returned:
Mon May  8 13:08:00 CEST 2017 >>> R stopped

This is the output from the interpreter log file (non-useful lines deleted):
INFO [2017-05-08 11:27:43,632] ({Thread-0} RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on port 45227
INFO [2017-05-08 11:27:44,600] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkInterpreter
INFO [2017-05-08 11:27:44,624] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
INFO [2017-05-08 11:27:44,629] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.DepInterpreter
INFO [2017-05-08 11:27:44,640] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.PySparkInterpreter
INFO [2017-05-08 11:27:44,643] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkRInterpreter
...
INFO [2017-05-08 11:28:00,188] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1494235664723 finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter2097894179
DEBUG [2017-05-08 11:28:00,819] ({pool-1-thread-3} RemoteInterpreterServer.java[resourcePoolGetAll]:911) - Request getAll from ZeppelinServer
DEBUG [2017-05-08 13:08:00,187] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Error in handleErrors(returnStatus, conn) : 
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:  No status is returned. Java SparkR backend might have failed.
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Calls: <Anonymous> -> invokeJava -> handleErrors
DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Execution halted

This is the output from the zeppelin log file (it didn't record the R interpreter failure):
INFO [2017-05-08 11:28:00,221] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2056) - Job 20170506-145151_1585482989 is finished successfully, status: FINISHED
INFO [2017-05-08 11:28:00,675] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job paragraph_1494075111996_-1250116940 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session2130846287
INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null
INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:271) - Validating all active sessions...
INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:304) - Finished session validation.  No sessions were stopped.
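
The “No status is returned” error comes from SparkR’s R-side client: handleErrors() raises it when the socket to the JVM backend returns nothing, i.e. the backend connection has gone away. One observation worth checking: R stopped exactly 6000 seconds (100 minutes) after the last paragraph finished (11:28:00 → 13:08:00), which matches the default value of spark.r.backendConnectionTimeout introduced in Spark 2.1. This is only a hypothesis, but if it is the cause, raising the timeout in spark-defaults.conf should keep an idle backend alive longer:

# hypothesis: the idle SparkR backend times out after the default 6000 s
spark.r.backendConnectionTimeout    604800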

Hope this helps. 
Any hints?


Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Jongyoul Lee
Hi, thanks for this detailed debugging.

First, the NotebookServer log doesn't give any clue about this symptom, because that component only sits between the browser and the Zeppelin server.

I don't know why R stopped unexpectedly. Is there any log related to R? I'm not actually familiar with R.

BTW, I'll install R and test it on my local machine.

--
이종열, Jongyoul Lee, 李宗烈

Re: spark.r interpreter becomes unresponsive after some time and R process quits silently

Pietro Pugni
I opened a JIRA with all the details (logs, etc.):

Thank you
 Pietro Pugni
