Implementing run all paragraphs sequentially


Implementing run all paragraphs sequentially

Belousov Maksim Eduardovich

Hello, users!

At the moment our analysts often mix interpreters in their notes.

For example, they prepare data using %jdbc and then use it in %pyspark. They also often use scheduling for regular reporting. To wait for the data from %jdbc, they have to do something like `time.sleep()`. This doesn't guarantee the result and isn't elegant.
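A minimal sketch of the difference, in plain Python rather than Zeppelin code: instead of sleeping for a guessed duration, the runner waits for each paragraph to finish before starting the next one. `prepare_data`, `use_data` and `run_sequentially` are hypothetical stand-ins for the %jdbc paragraph, the %pyspark paragraph and the note runner. None of them is a Zeppelin API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def prepare_data(store):
    # Stands in for the %jdbc paragraph; the consumer cannot know how long it takes.
    time.sleep(0.05)
    store["rows"] = [1, 2, 3]


def use_data(store):
    # Stands in for the %pyspark paragraph that needs the prepared rows.
    return sum(store["rows"])


def run_sequentially(paragraphs, store):
    """Start each paragraph only after the previous one has finished."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for paragraph in paragraphs:
            # .result() blocks until the paragraph is done, so no sleep is needed.
            results.append(pool.submit(paragraph, store).result())
    return results


results = run_sequentially([prepare_data, use_data], {})
print(results)  # [None, 6]
```

With this ordering the consumer never has to guess how long the producer will take.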

 

You can find early attempts to implement sequential running of all paragraphs in [1].

We are really interested in getting issue [2] implemented and are ready to work on it.

It seems a good idea to discuss the requirements first.

My idea is to introduce a note setting that defines the run mode (parallel or sequential) and to keep "Run all" as the only button that runs all the paragraphs in the note. This makes sequential or parallel running a `note option` rather than a `run option`.

The option would be controlled by a nearby button, as shown:

https://lh6.googleusercontent.com/jwnb7xfb0fPbFg1CWPoMSqovu7ecSMv4pJfuP4zdKVZbyAUDwzAT2GJ5EiemXVYrqMW73yklemTpjXNyLRJABpTCoHi6us2ZI_AxWKHwZpBEA7MjpMP0-7Nk8saaJQfIF4yBMPfS

 

 

For new notes the default state would be "Run sequential all"; for existing notes it would be "Run parallel for interpreters".
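The proposed defaults could be sketched as follows. This is only an illustration in Python: the `runMode` key, the `RunMode` enum and both helper functions are assumptions made for the example and do not correspond to existing Zeppelin settings.

```python
from enum import Enum


class RunMode(Enum):
    SEQUENTIAL = "sequential"
    PARALLEL = "parallel"


def new_note_config():
    # New notes are created with sequential running switched on.
    return {"runMode": RunMode.SEQUENTIAL.value}


def effective_run_mode(note_config):
    # Notes saved before the setting existed have no "runMode" key,
    # so they keep the old behaviour (parallel across interpreters).
    return RunMode(note_config.get("runMode", RunMode.PARALLEL.value))


print(effective_run_mode(new_note_config()))  # RunMode.SEQUENTIAL
print(effective_run_mode({}))                 # RunMode.PARALLEL
```

Storing the mode in the note means scheduled runs and manual "Run all" clicks behave the same way.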

We would be glad to hear any thoughts.

Thank you.

 

[1] https://issues.apache.org/jira/browse/ZEPPELIN-1165

[2] https://issues.apache.org/jira/browse/ZEPPELIN-2368

 

 


Maksim Belousov

 


Re: Implementing run all paragraphs sequentially

herval
+1, our internal users at Twitter also often request this


In reply to the message from Belousov Maksim Eduardovich, sent Thursday, September 28, 2017 8:28:58 AM.

Re: Implementing run all paragraphs sequentially

moon
Administrator
This is going to be really useful!

Curious: why do you prefer a 'note option' over a 'run option'?
Could you compare their pros and cons?

Thanks,
moon

In reply to the message from Herval Freire, sent Thu, Sep 28, 2017 at 8:32 AM.

Re: Implementing run all paragraphs sequentially

Darren Govoni
We've been needing this feature as well. Very frustrating the way it currently works.




In reply to the message from moon soo Lee, sent Fri, Sep 29, 2017 at 12:04 AM -0400.

Re: Implementing run all paragraphs sequentially

afancy
In reply to this post by Belousov Maksim Eduardovich
+1

I think this is one of the most important features. I don't know why this requirement has been skipped.

/afancy


Re: Implementing run all paragraphs sequentially

Jeff Zhang

I don't think a two-valued note setting (parallel/sequential) is sufficient for paragraph scheduling. Take the Spark tutorial note as an example: we should run the paragraph that loads the bank data first, and only then can all the SQL paragraphs run in parallel. So the key is how we define the dependency relationships between paragraphs. The paragraphs of a note can form a DAG (directed acyclic graph); sequential running is just one special kind of DAG (a linked list).

I believe we discussed this before in the community. My proposal is to add an attribute to the interpreter indicator of each paragraph so that the user can specify the paragraph's dependencies (if the user doesn't specify any, the default dependency is the paragraph ahead of it). Take the Spark tutorial note again. We have 3 paragraphs: the first one loads the bank data, and the second and third query it. So paragraphs 2 and 3 can run in parallel but must run after paragraph 1, and we need to specify that dependency in their interpreter indicators. Of course, the user doesn't need to specify dependencies to run all the paragraphs sequentially, because the default dependency is the paragraph ahead of it.

Paragraph 1.

%spark
// code to load bank data

Paragraph 2.

%spark.sql(deps=p1)
// query the bank data

Paragraph 3.

%spark.sql(deps=p1)
// query the bank data
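A small Python sketch of how such dependencies could be turned into a run order. It assumes Python 3.9+ for the standard-library `graphlib` module. `build_graph` and `parallel_waves` are hypothetical helpers written for this illustration, not part of Zeppelin, and the `deps=` syntax above is still only a proposal.

```python
from graphlib import TopologicalSorter


def build_graph(paragraphs):
    """paragraphs: list of (paragraph_id, deps or None) in note order.

    A paragraph without explicit deps depends on the paragraph ahead of it.
    """
    graph = {}
    previous = None
    for paragraph_id, deps in paragraphs:
        if deps is None:
            deps = [previous] if previous is not None else []
        graph[paragraph_id] = set(deps)
        previous = paragraph_id
    return graph


def parallel_waves(graph):
    """Group paragraphs into waves; all members of a wave may run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the deps contain a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


tutorial = [("p1", None), ("p2", ["p1"]), ("p3", ["p1"])]
print(parallel_waves(build_graph(tutorial)))  # [['p1'], ['p2', 'p3']]
```

If no paragraph declares dependencies, every wave contains exactly one paragraph, which is plain sequential running.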




In reply to the message from afancy, sent Friday, September 29, 2017 at 5:35 PM.

Implementing run all paragraphs sequentially

Partridge, Lucas (GE Aviation)

Interesting idea. But by ‘p1’, ‘p2’, etc., did you mean those names literally, or were you using them as shorthand for the paragraph IDs?

If the former, what happens if someone inserts, deletes or reorders paragraphs? If the latter, the paragraph IDs wouldn’t be very easy to read, which would make the dependency relationships hard to follow…

 

In reply to the message from Jeff Zhang, sent 29 September 2017 11:58.

Re: Implementing run all paragraphs sequentially

Jeff Zhang

'p1' and 'p2' stand for the paragraphId. As for readability, we could allow users to set a paragraph name, but that is another story and could be a later improvement.
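One way the paragraph-name idea could work, sketched in Python under stated assumptions: users write readable names in `deps`, and the scheduler resolves them to paragraph IDs before building the graph. `resolve_deps`, the names and the IDs below are all hypothetical and only illustrate the lookup.

```python
def resolve_deps(deps, name_to_id):
    """Translate user-facing paragraph names into paragraph IDs."""
    missing = [name for name in deps if name not in name_to_id]
    if missing:
        raise KeyError(f"unknown paragraph name(s): {missing}")
    return [name_to_id[name] for name in deps]


# Hypothetical names and IDs, for illustration only.
name_to_id = {
    "load_bank_data": "20170929-000001_0000000001",
    "query_by_age": "20170929-000002_0000000002",
}

resolved = resolve_deps(["load_bank_data"], name_to_id)
print(resolved)  # ['20170929-000001_0000000001']
```

Because the lookup is by name rather than by position, inserting or reordering paragraphs would not change what a dependency points to.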



In reply to the message from Partridge, Lucas (GE Aviation), sent Friday, September 29, 2017 at 7:30 PM.

RE: Implementing run all paragraphs sequentially

Sotnichenko Sergey

To be honest, it would be very complicated to build a DAG with names like ‘20170929-143857_1744629322’. Imagine having 20 paragraphs with such names.

 


Sergey Sotnichenko

 

 

In reply to the message from Jeff Zhang, sent Friday, September 29, 2017 2:35 PM.

RE: Implementing run all paragraphs sequentially

Polyakov Valeriy

I suppose there is a fairly simple solution to the problem. We can put a flag on a paragraph that means “this paragraph should be run in parallel with the previous one”. Such logic would allow mixed sequential-parallel running. It does not provide full DAG capabilities, but it is easy to understand and to use.
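A sketch of the flag idea in Python, to make the resulting run order concrete. `group_into_stages` is a hypothetical helper, and the boolean stands for the proposed per-paragraph flag. Stages run one after another, and the paragraphs inside a stage run in parallel.

```python
def group_into_stages(paragraphs):
    """paragraphs: list of (paragraph_id, parallel_with_previous) in note order."""
    stages = []
    for paragraph_id, parallel_with_previous in paragraphs:
        if parallel_with_previous and stages:
            # Join the stage of the preceding paragraph.
            stages[-1].append(paragraph_id)
        else:
            # Start a new stage that waits for the previous stage to finish.
            stages.append([paragraph_id])
    return stages


stages = group_into_stages([("p1", False), ("p2", False), ("p3", True)])
print(stages)  # [['p1'], ['p2', 'p3']]
```

For the Spark tutorial example this gives the same order as the `deps=p1` declarations, with only one flag to set on the third paragraph.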

 


Valeriy Polyakov

 

In reply to the message from Sotnichenko Sergey, sent Friday, September 29, 2017 2:45 PM.

Re: Implementing run all paragraphs sequentially

Jeff Zhang
In reply to this post by Sotnichenko Sergey
Yes, that may look a little complicated, but I think that is due to how we name paragraphs, not to this approach. IMHO, without specifying the dependency relationships between paragraphs, it is almost impossible to schedule paragraphs correctly.





Re: Implementing run all paragraphs sequentially

Jeff Zhang
>>> I suppose there is a fairly simple solution to the problem. We can use a flag on a paragraph that means “this paragraph should be run in parallel with previous”. Such logic could help to create sequential-parallel running. It does not implement full DAG capabilities, but it’s easy to understand and to use.

This can cover some cases, but I don't think it can cover all of them.


Jeff Zhang <[hidden email]> wrote on Fri, Sep 29, 2017 at 7:52 PM:
Yes, it may look a little complicated, but I think that is due to how we name paragraphs, not to this approach. IMHO, without specifying the dependency relationships between paragraphs, it is almost impossible to schedule paragraphs correctly.




Sotnichenko Sergey <[hidden email]> wrote on Fri, Sep 29, 2017 at 7:45 PM:

To be honest, it would be very complicated to build a DAG with names like ‘20170929-143857_1744629322’. Imagine we have 20 paragraphs with such names.

 


Sergey Sotnichenko

 

 

From: Jeff Zhang [mailto:[hidden email]]
Sent: Friday, September 29, 2017 2:35 PM
To: [hidden email]
Subject: Re: Implementing run all paragraphs sequentially

 

 

'p1' and 'p2' are paragraph IDs. Regarding readability, we could allow the user to set a paragraph name, but that is another story and could be an improvement later.

 

 

 

Partridge, Lucas (GE Aviation) <[hidden email]> wrote on Fri, Sep 29, 2017 at 7:30 PM:

Interesting idea.  But by ‘p1’, ‘p2’, etc did you literally mean that; or were you using that as shorthand for the id of the paragraph?

If the former then what happens if someone inserts, deletes or reorders paragraphs? But if the latter then the paragraph ids wouldn’t be very easy for someone to read and follow the dependency relationships…

 

From: Jeff Zhang [mailto:[hidden email]]
Sent: 29 September 2017 11:58
To: [hidden email]
Subject: EXT: Re: Implementing run all paragraphs sequentially

 

 

I don't think two note settings (parallel/sequential) are sufficient for paragraph scheduling. Take the Spark tutorial note as an example: we should run the paragraph that loads the bank data first, and only then can all the SQL paragraphs run in parallel. So the key is how we define the dependency relationships between paragraphs. The paragraphs of a note could form a DAG (directed acyclic graph). Sequential running is just one special kind of DAG (a linked list).

 


RE: Implementing run all paragraphs sequentially

Polyakov Valeriy

This can cover most typical parallel use cases. Other cases could be transformed into this type at the cost of some increase in total running time.

Building full-fledged DAG dependencies would be much more complicated and looks like the functionality of a visual data transformation platform (e.g. industrial ETL tools), where you can see the connections between steps. It’s really hard to support this using just text references.

 


Valeriy Polyakov

 


RE: Implementing run all paragraphs sequentially

Sotnichenko Sergey

Colleagues!

How many paragraphs does a typical note have? 5? 10?

For 5-10 paragraphs, the “this paragraph should be run in parallel with previous” option solves 98% of the issues. It is simple to implement, and it is intuitive and simple to use.

In comparison, a fully linked DAG is not so intuitive and is sometimes even frustrating, especially when names like ‘20170929-143857_1744629322’ are involved.


Sergey Sotnichenko

 


Re: Implementing run all paragraphs sequentially

herval
In reply to this post by moon
At least in our case, the notebooks that we need to run sequentially are expected to *always* run sequentially, so it makes more sense for this to be a note option than a per-run mode.

H

_____________________________
From: moon soo Lee <[hidden email]>
Sent: Thursday, September 28, 2017 9:03 PM
Subject: Re: Implementing run all paragraphs sequentially
To: <[hidden email]>


This is going to be really useful!

Curious: why do you prefer a 'note option' instead of a 'run option'?
Could you compare their pros and cons?

Thanks,
moon


Re: Implementing run all paragraphs sequentially

Mohit Jaggi
What is the current behavior?


Re: Implementing run all paragraphs sequentially

moon
Administrator
The current behavior is to run as much in parallel as possible.
The run notebook button currently submits all paragraphs in a notebook to each interpreter's own scheduler (FIFO, Parallel) at once, and each interpreter's scheduler then runs its paragraphs.

I think we can provide a "sequential" run button for easier use, which submits one paragraph and waits for it to finish before submitting the next.

And I think a sequential run button doesn't stop us from having a more complex / flexible DAG in the future?
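
The difference between the two run modes can be sketched as follows. This is only an illustration under assumptions: a Python thread pool stands in for the interpreters' schedulers, and `run_all_at_once`, `run_sequentially`, `load_data` and `query_data` are hypothetical names, not Zeppelin APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_all_at_once(paragraphs, pool):
    """Roughly the current behavior: submit every paragraph, then wait."""
    futures = [pool.submit(run) for run in paragraphs]
    return [f.result() for f in futures]


def run_sequentially(paragraphs, pool):
    """The proposed button: submit one paragraph and wait before the next."""
    results = []
    for run in paragraphs:
        results.append(pool.submit(run).result())  # blocks until finished
    return results


log = []


def load_data():
    time.sleep(0.05)  # stands in for a slow %jdbc paragraph
    log.append("load")


def query_data():
    log.append("query")  # stands in for a %pyspark paragraph that needs the data


with ThreadPoolExecutor(max_workers=2) as pool:
    run_sequentially([load_data, query_data], pool)
print(log)  # ['load', 'query']
```

With `run_all_at_once` and two workers, `query_data` may well finish before `load_data`, which is the situation the `time.sleep()` workaround in the original post tries to paper over.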

Thanks,
moon


Re: Implementing run all paragraphs sequentially

Michael Segel
Sorry to jump in… 

If you want to run paragraphs in parallel, you are going to want to have some sort of dependency graph. Think of a common setup where you need to set up common functions and imports (setup of %spark.dep).

A good example is if your notebook is a bunch of unit tests and you need to build the common teardown / setup methods to be used by the other paragraphs.

If you’re going to do that, you’ll need to build out a metadata structure where you can set up your dependencies as well as add things like labels beyond the IDs (which only need to be unique within the given notebook).
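
One possible shape for such a metadata structure, as a rough sketch only: the `ParagraphMeta` class, its field names and `resolve_dependencies` are assumptions made for illustration, and the second paragraph id below is made up. Dependencies may be written as labels or ids, and names are checked for uniqueness within a single notebook:

```python
from dataclasses import dataclass, field


@dataclass
class ParagraphMeta:
    paragraph_id: str            # generated id, e.g. "20170929-143857_1744629322"
    label: str = ""              # optional human-readable name
    depends_on: list = field(default_factory=list)  # labels or ids


def resolve_dependencies(paragraphs):
    """Map every paragraph id to the list of paragraph ids it depends on.

    Labels and ids share one namespace per notebook, so a duplicate name
    inside `paragraphs` is rejected.
    """
    by_name = {}
    for p in paragraphs:
        for name in filter(None, (p.paragraph_id, p.label)):
            if name in by_name:
                raise ValueError(f"duplicate paragraph name: {name}")
            by_name[name] = p.paragraph_id

    resolved = {}
    for p in paragraphs:
        missing = [d for d in p.depends_on if d not in by_name]
        if missing:
            raise ValueError(f"{p.paragraph_id} has unknown dependencies: {missing}")
        resolved[p.paragraph_id] = [by_name[d] for d in p.depends_on]
    return resolved


# A unit-test style notebook: one setup paragraph, one test that needs it.
note = [
    ParagraphMeta("20170929-143857_1744629322", label="setup"),
    ParagraphMeta("20170929-143901_0000000001", label="test_a", depends_on=["setup"]),
]
print(resolve_dependencies(note))
```

Keeping the generated id as the stored key and treating the label as an alias means renaming a label never changes a paragraph's identity, only how other paragraphs refer to it.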

Just my $0.02 


RE: Implementing run all paragraphs sequentially

Polyakov Valeriy

Let me try to summarize the discussion. Evidently, the current behavior of running notes does not meet actual requirements. The most important thing we need is the ability to run sequentially. At the same time, however, we want to keep the ability to run in parallel. We discussed that the most suitable way to model paragraph dependencies is a DAG (directed acyclic graph). These dependencies should therefore be defined in the note, and the running order should not depend on how we launch it (button / scheduler / API). Our objectives, then, are to implement a “dependency definition engine” and to use it in the “run engine”. What are the options?

1) Explicit dependency definition.

We could take as a rule that each paragraph waits for ALL previous paragraphs to finish. Then we add a paragraph option “Wait for …” where we can choose the paragraph whose completion we wait for before starting execution. When the option is set, execution starts immediately after the selected paragraph finishes. This pattern allows us to implement a fully parallel DAG running order. What are the disadvantages? They all come down to the same thing: the dependency management process is not easy for users to understand (and the functionality is probably redundant, in my personal view). First, we would have to use the strange paragraph ID format, which is also hidden. We could come up with visible, readable paragraph ID aliases, but then duplicate control becomes necessary. Second, in some scenarios we have to change existing dependencies (e.g. if you add a new paragraph between a paragraph and its dependent group, you have to change the “Wait for …” option for each paragraph in the group).

2) Implicit dependency definition.

We could take as a rule that each paragraph waits for ALL previous paragraphs to finish. Then we add a paragraph option “Run in parallel with previous”, which allows us to create groups of paragraphs that run in parallel. The result is sequential running of paragraph groups: group by group, with the paragraphs inside each group running in parallel. This approach is much easier for users to understand, but its obvious drawback compared with the explicit definition is that the dependency graph is less expressive and the level of parallelism is lower.
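
Option (2) can be sketched in a few lines, assuming each paragraph carries a boolean “Run in parallel with previous” flag. The function name `parallel_groups` and the list-of-pairs input are hypothetical and only serve the illustration:

```python
def parallel_groups(paragraphs):
    """Split an ordered note into groups that run one after another.

    `paragraphs` is a list of (paragraph_id, parallel_with_previous) pairs.
    A paragraph whose flag is True joins the group of the paragraph before
    it; otherwise it starts a new group that waits for all earlier groups.
    """
    groups = []
    for pid, parallel_with_previous in paragraphs:
        if parallel_with_previous and groups:
            groups[-1].append(pid)
        else:
            groups.append([pid])
    return groups


# p2 and p3 form one parallel group; p4 waits for both of them.
note = [("p1", False), ("p2", False), ("p3", True), ("p4", False)]
print(parallel_groups(note))  # [['p1'], ['p2', 'p3'], ['p4']]
```

Inserting a new paragraph between `p1` and the group needs no changes to the other paragraphs here, which is the maintenance advantage over the “Wait for …” option. The cost is that `p4` cannot start until the whole group has finished, even if it only needs `p2`.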

I am not sure which option, (1) or (2), is the right one to implement at the moment. I hope to hear from the product visionaries which way to choose and to get approval to start the implementation.

Thank you!

 

 


Valeriy Polyakov

 

From: Michael Segel [mailto:[hidden email]]
Sent: Saturday, September 30, 2017 4:22 PM
To: [hidden email]
Subject: Re: Implementing run all paragraphs sequentially

 

Sorry to jump in… 

 

If you want to run paragraphs in parallel, you are going to want to have some sort of dependency graph.  Think of a common set up where you need to set up common functions and imports. (setup of %spark.dep) 

 

A good example is if your notebook is a bunch of unit tests and you need to build the common tear down / set up methods to be used by the other paragraphs. 

 

If you’re going to do that, you’ll need to build out a metadata structure where you can set up your dependencies  as well as add things like labels beyond the ids (which only need to be unique to the given notebook. ) 

 

Just my $0.02 

 

On Sep 29, 2017, at 1:30 PM, moon soo Lee <[hidden email]> wrote:

 

Current behavior is as parallel as possible.

Run notebook button currently submits all paragraphs in a notebook into each interpreter's own scheduler (FIFO, Parallel) at once. And each individual scheduler of interpreter runs the paragraphs.

 

I think we can provide "sequential" run button for easier use, which submits paragraph one and waits for finish before submit next paragraphs.

 

And I think sequential run button doesn't stop having more complex / flexible DAG in the future?

 

Thanks,

moon

 

On Fri, Sep 29, 2017 at 10:08 AM Mohit Jaggi <[hidden email]> wrote:

What is the current behavior?

 

On Fri, Sep 29, 2017 at 6:56 AM, Herval Freire <[hidden email]> wrote:

At least in our case, the notebooks that we need to run sequentially are expected to *always* run sequentially - thus it makes more sense to be a note option than a per-run mode

 

H

 

_____________________________
From: moon soo Lee <[hidden email]>
Sent: Thursday, September 28, 2017 9:03 PM
Subject: Re: Implementing run all paragraphs sequentially
To: <[hidden email]>

This is going to be really useful!

 

Curious why you prefer 'note option' instead of 'run option'?

Could you compare their pros and cons?

 

Thanks,

moon

 

On Thu, Sep 28, 2017 at 8:32 AM Herval Freire <[hidden email]> wrote:

+1, our internal users at Twitter also often request this

 


From: Belousov Maksim Eduardovich <[hidden email]>
Sent: Thursday, September 28, 2017 8:28:58 AM
To: [hidden email]
Subject: Implementing run all paragraphs sequentially

 

Hello, users!

 

At the moment our analysts often use mixes of interpreters in their notes.

For example, they prepare data using %jdbc and then use it in %pyspark. Besides, they often use scheduling for regular reporting, and they have to do something like `time.sleep()` to wait for the data from %jdbc. That doesn't guarantee the result and doesn't look good.

 

You can find early attempts to implement sequential running of all paragraphs in [1].

We are really interested in implementation of the issue [2] and are ready to solve it.

 

It seems a good idea to discuss any requirements.

My idea is to introduce a note setting that defines the type of run to use (parallel or sequential) and to keep "Run all" as the only button that runs all the cells in the note. This makes sequential or parallel running a `note option` rather than a `run option`.

The option would be controlled by a nearby button, as shown:

 

Image removed by sender. image002.jpg

 

 

 

For new notes the default state would be "Run sequential all"; for old notes, "Run parallel for interpreters".

 

We would be glad to hear any thoughts.

Thank you.

 

 

[1] https://issues.apache.org/jira/browse/ZEPPELIN-1165

[2] https://issues.apache.org/jira/browse/ZEPPELIN-2368

 

 


Maksim Belousov

 

 

 

 


Re: Implementing run all paragraphs sequentially

herval
Why do you need rules and graphs and any of that to support running everything sequentially or everything in parallel?

3) Add a “run mode” to the note. If it’s “sequential”, run the paragraphs one at a time, in the order they’re defined. If it’s “parallel”, run them using the current scheme (as many at the same time as the thread pool permits).

Simpler and covers all cases, imo
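As a rough sketch of how such a note-level setting could be resolved, assuming the defaults proposed at the start of the thread (sequential for new notes, the current parallel scheme for existing ones): the `runMode` key and both helper functions are hypothetical names, not existing Zeppelin configuration.

```python
from typing import Any, Dict

RUN_MODES = ("sequential", "parallel")


def new_note_config() -> Dict[str, Any]:
    """Config for a freshly created note: sequential by default (assumed default)."""
    return {"runMode": "sequential"}  # "runMode" is a made-up key for this sketch


def effective_run_mode(config: Dict[str, Any]) -> str:
    """Notes saved before the setting existed have no key and keep the parallel scheme."""
    mode = config.get("runMode", "parallel")
    if mode not in RUN_MODES:
        raise ValueError(f"unknown run mode: {mode!r}")
    return mode


old_note: Dict[str, Any] = {}  # an existing note without the setting
fresh_note = new_note_config()
```

Because the mode lives in the note, a button, the scheduler and the API would all read the same value before running.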


From: Polyakov Valeriy <[hidden email]>
Sent: Monday, October 2, 2017 8:24:35 AM
To: [hidden email]
Subject: RE: Implementing run all paragraphs sequentially
 

Let me try to summarize the discussion. Evidently, the current behavior when running notes does not meet actual requirements. The most important thing we need is the ability to run paragraphs sequentially. At the same time, we want to keep the parallel running functionality. We discussed that the most suitable way to model dependencies between paragraphs is a DAG (directed acyclic graph). Therefore, these dependencies should be defined in the note, and the running order should not depend on how we launch it (button / scheduler / API). So our objectives are to implement a “dependency definition engine” and to use it in a “run engine”. What are the options?

1)      Explicit dependency definition.

We could adopt the rule that each paragraph waits for ALL previous paragraphs to finish. Then we add a paragraph option, “Wait for …”, in which we choose the paragraph whose completion we wait for before starting. When the option is set, the paragraph starts immediately after the selected paragraph finishes. This pattern lets us implement a fully parallel DAG running order. What are the disadvantages? They all come down to the same thing: the dependency management process is hard for users to understand (and, in my personal view, the functionality is probably redundant). First, we would have to use the strange paragraph ID format, which is also hidden. We could come up with visible, readable paragraph ID aliases, but then we would need to guard against duplicates. Second, some scenarios require changing existing dependencies (e.g. to add a new paragraph between a paragraph and the group that depends on it, you have to change the “Wait for …” option of every paragraph in the group).
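The “Wait for …” rule can be sketched in a few lines of Python. This is only an illustration: the paragraph IDs are invented, and the restriction that “Wait for …” may point only at an earlier paragraph is an assumption added here to keep the graph acyclic.

```python
from typing import Dict, List, Optional, Tuple

# Each entry: (paragraph id, id chosen in "Wait for …" or None). Ids are made up.
Note = List[Tuple[str, Optional[str]]]


def resolve_dependencies(note: Note) -> Dict[str, List[str]]:
    """Default: wait for ALL previous paragraphs. With "Wait for …": wait only for that one."""
    deps: Dict[str, List[str]] = {}
    seen: List[str] = []
    for pid, wait_for in note:
        if wait_for is None:
            deps[pid] = list(seen)
        else:
            if wait_for not in seen:
                # Assumption of this sketch: only earlier paragraphs can be selected.
                raise ValueError(f"{pid} waits for unknown or later paragraph {wait_for}")
            deps[pid] = [wait_for]
        seen.append(pid)
    return deps


def start_levels(deps: Dict[str, List[str]]) -> Dict[str, int]:
    """Level 0 has no dependencies; otherwise one more than the deepest dependency."""
    levels: Dict[str, int] = {}
    for pid, wanted in deps.items():  # insertion order is note order, so deps come first
        levels[pid] = 1 + max((levels[d] for d in wanted), default=-1)
    return levels


note: Note = [
    ("prepare", None),
    ("load_a", "prepare"),
    ("load_b", "prepare"),
    ("report", None),
]
deps = resolve_dependencies(note)
levels = start_levels(deps)
```

In this note `load_a` and `load_b` share a level and can run in parallel once `prepare` finishes, while `report` falls back to the default and waits for everything before it.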

2)      Implicit dependency definition.

We could adopt the rule that each paragraph waits for ALL previous paragraphs to finish. Then we add a paragraph option, “Run in parallel with previous”, which lets us form groups of paragraphs that run in parallel. The result is sequential execution of paragraph groups: the groups run one after another, and the paragraphs inside a group run in parallel. This approach is much easier for users to understand, but its obvious drawback compared with “Explicit definition” is that the dependency graph is less expressive and the achievable level of parallelism is lower.

I am not sure which option, (1) or (2), is the right one to implement at the moment. I hope to hear from the product visionaries which way to choose, and to get approval to start the implementation.

Thank you!

 

 


Valeriy Polyakov

 
