notebook-authorization.json file makes Zeppelin not scalable

notebook-authorization.json file makes Zeppelin not scalable

Tan, Jialiang

We want to run a Zeppelin service that serves over 200 people in our company, so we plan to have around 10 to 15 Zeppelin instances behind an ELB. We use S3 as notebook storage, so all our Zeppelin instances refer to the same S3 location for notebooks. But one thing breaks the whole setup: Zeppelin stores the notebook authorization information in a LOCAL file called notebook-authorization.json. To work around this, we set up an NFS-like shared mount so that every Zeppelin instance refers to the same configuration location. This approach has the following problems:

1.       We cannot handle concurrent writes, where multiple Zeppelin instances edit the file at the same time. Unexpected behavior will result.

2.       I found that Zeppelin only reads the notebook-authorization.json file into memory on startup. After startup, it treats the in-memory authorization as the only source of truth. Zeppelin never reads the file again unless you restart it; it only writes to it, from memory. Therefore, even without the concurrency problem described in (1), an instance cannot get the correct authorization for notebooks after other Zeppelin instances change the authorization file.

I know the reasons for keeping authorizations separate from notebooks, but it actually causes more serious problems like this one. Any ideas on how to tackle this problem and make Zeppelin scalable?
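
To make problem (2) concrete, here is a minimal Python sketch of the behavior as I understand it. It is not Zeppelin code: the `Instance` class, the `set_owners` method and the simplified JSON layout are made up for illustration. Each "instance" loads the shared file once at startup and afterwards only overwrites it from its own in-memory copy.

```python
import json
import os
import tempfile


class Instance:
    """Hypothetical stand-in for one Zeppelin server's authorization cache."""

    def __init__(self, path):
        self.path = path
        # Read the shared file once, at startup only.
        with open(path) as f:
            self.auth = json.load(f)

    def set_owners(self, note_id, owners):
        # Update the in-memory copy, then overwrite the whole file from memory.
        self.auth[note_id] = {"owners": sorted(owners)}
        with open(self.path, "w") as f:
            json.dump(self.auth, f)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "notebook-authorization.json")
    with open(path, "w") as f:
        json.dump({}, f)

    a = Instance(path)
    b = Instance(path)
    a.set_owners("note1", ["alice"])
    # b never re-reads the file, so it does not know about note1
    # and its whole-file write silently drops that record.
    b.set_owners("note2", ["bob"])

    with open(path) as f:
        on_disk = json.load(f)
    return a.auth, b.auth, on_disk
```

After `demo()` runs, the file on disk only contains the record written by the second instance, and neither instance's in-memory view matches the other's. This happens even though the two writes never overlap in time.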

 

Re: notebook-authorization.json file makes Zeppelin not scalable

Jeff Zhang

There's a ticket for unifying Zeppelin's storage layer: https://issues.apache.org/jira/browse/ZEPPELIN-2742

But for your case of sharing notebook-authorization across multiple Zeppelin instances, I don't think this ticket is enough; it would require deeper integration with Shiro's authorization.

In reply to Tan, Jialiang <[hidden email]>, Tuesday, October 17, 2017 at 3:14 PM

Re: notebook-authorization.json file makes Zeppelin not scalable

Tan, Jialiang

Thanks for such a quick reply. Does the MongoDB notebook storage in Zeppelin 0.8.0-SNAPSHOT store authorizations in the database, or still in that JSON file?

 

From: Jeff Zhang <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Tuesday, October 17, 2017 at 12:28 AM
To: "[hidden email]" <[hidden email]>
Subject: Re: notebook-authorization.json file makes Zeppelin not scalable

 

 

Re: notebook-authorization.json file makes Zeppelin not scalable

Jeff Zhang

Still in file format. 


In reply to Tan, Jialiang <[hidden email]>, Tuesday, October 17, 2017 at 3:38 PM

Re: notebook-authorization.json file makes Zeppelin not scalable

Tan, Jialiang

I went through those tickets. Are there any plans for those improvements? Approximately when will the storage layer unification be done?

 

From: Jeff Zhang <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Tuesday, October 17, 2017 at 12:45 AM
To: "[hidden email]" <[hidden email]>
Subject: Re: notebook-authorization.json file makes Zeppelin not scalable

 

 

Re: notebook-authorization.json file makes Zeppelin not scalable

Jeff Zhang

Unified storage could be done in 0.8.0. But your scenario is not about storage; it's about how the storage is updated. For now, every change requires rewriting the whole file. Your scenario requires that each change update only one record.
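
A rough sketch of what "each change updates only one record" could look like, written in Python for brevity. SQLite from the standard library stands in here for whatever shared, record-oriented store would actually be used; the `note_auth` table, its columns and the helper names are all hypothetical, and none of this is Zeppelin's implementation.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    # One row per note, keyed by note id (hypothetical schema).
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS note_auth ("
        "note_id TEXT PRIMARY KEY, owners TEXT NOT NULL)"
    )
    return conn


def set_owners(conn, note_id, owners):
    # Touch only the row for this note; other notes' rows are left alone.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO note_auth (note_id, owners) VALUES (?, ?)",
            (note_id, json.dumps(sorted(owners))),
        )


def get_owners(conn, note_id):
    # Always read from the store, so changes made by other writers are visible.
    row = conn.execute(
        "SELECT owners FROM note_auth WHERE note_id = ?", (note_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []
```

With this shape, two writers that change different notes do not overwrite each other, and a reader that queries the store on each check sees the latest state instead of a copy loaded at startup.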
 

In reply to Tan, Jialiang <[hidden email]>, Tuesday, October 17, 2017 at 3:47 PM

Re: notebook-authorization.json file makes Zeppelin not scalable

Tan, Jialiang

That is correct. In fact, I think this scenario is very general: it applies whenever we want Zeppelin to be scalable. Unifying storage is not going to be that useful without support for updating a single record. Without that, running multiple Zeppelin instances in parallel would not be viable.

 

From: Jeff Zhang <[hidden email]>
Reply-To: "[hidden email]" <[hidden email]>
Date: Tuesday, October 17, 2017 at 1:07 AM
To: "[hidden email]" <[hidden email]>
Subject: Re: notebook-authorization.json file makes Zeppelin not scalable

 

 

Re: notebook-authorization.json file makes Zeppelin not scalable

Patrick Maroney
In reply to this post by Tan, Jialiang

For issue (1), you might want to try Amazon EFS. While EFS is designed for "big data", you can use it for other concurrency use cases. You need to pay very close attention to your storage size/utilization ratios, as EFS can completely choke off bandwidth. I would also look at using EFS for notebook storage, as this will help raise the storage size/utilization ratios and may improve performance. In any case, EFS could provide a solid concurrency solution. It costs little to test the concept.
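
For what it's worth, a shared file system only helps with issue (1) if every writer locks the file and re-reads it before writing. A minimal Python sketch of that pattern follows, assuming a Unix host and a mount that honors advisory locks (worth verifying on your setup). The `update_record` helper and the simplified JSON layout are hypothetical; Zeppelin does not do this today.

```python
import fcntl
import json
import os


def update_record(path, note_id, owners):
    # Open for read/write without truncating, creating the file if needed.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block until we hold the exclusive lock
        try:
            # Re-read the current contents while holding the lock.
            raw = f.read()
            auth = json.loads(raw) if raw.strip() else {}
            auth[note_id] = {"owners": sorted(owners)}
            # Rewrite the file in place with the merged result.
            f.seek(0)
            f.truncate()
            json.dump(auth, f)
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
        return auth
```

This still rewrites the whole file on every change, so it only addresses concurrent writers; it does nothing about instances that never re-read the file after startup.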

 

Note: we have a similar scenario to yours on our roadmap. Our approach relies on F5 (and SAML for authentication). We have not yet gotten to Zeppelin in our SSO/SAML integration roadmap.

 

Patrick Maroney

Principal Engineer – Data Sciences & Analytics

Wapack Labs

609-841-5104

[hidden email]

http://pgp.mit.edu/pks/lookup?op=get&search=0x7C810C9769BD29AF

http://www.wapacklabs.com

 

From: "Tan, Jialiang" <[hidden email]>
Reply-To: <[hidden email]>
Date: Tuesday, October 17, 2017 at 3:14 AM
To: "[hidden email]" <[hidden email]>
Subject: notebook-authorization.json file makes Zeppelin not scalable

 
