Can't download moderately large data or number of rows to csv

Can't download moderately large data or number of rows to csv

Paul Brenner
The "download to CSV" button seems to cap how much data it will export (around 1.5 MB? 3,500 rows?), which limits Zeppelin's usefulness for our BI teams. We hit this cap long before we run into any problems with displaying too many rows in Zeppelin.

Unfortunately (fortunately?) Hue is the other tool the BI team has been using, and there they have no problem downloading much larger datasets to CSV. I have never run into this requirement myself, since I would just use Spark to write the data out. However, the BI team is not allowed to run Spark jobs (they use Hive via JDBC), so that download button is pretty important to them.

Would it be possible to significantly increase the limit? Even better, would it be possible to download more data than is shown? I assume I would need to open a ticket for this, but I wanted to ask here first.

Paul Brenner
DATA SCIENTIST
(217) 390-3033  

PlaceIQ:Location Data Accuracy

Re: Can't download moderately large data or number of rows to csv

Ruslan Dautkhanov
It would be a good idea to give Zeppelin a way to download full datasets without actually visualizing them.

Not sure if this helps, but we taught our users to run
%sh hadoop fs -getmerge /hadoop/path/dir/ /some/nfs/mount/
for large files (they sometimes have to download datasets with millions of records). They run Zeppelin on edge nodes that have NFS mounts to a drop zone.

Not sure how well it scales.
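
For readers unfamiliar with getmerge: it concatenates the files under a source directory into a single local file. The sketch below is only a local analogue of that behavior in Python, not the Hadoop implementation. The function name merge_parts, the part-file names, and the name-sorted merge order are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def merge_parts(src_dir, dst_file):
    """Concatenate every regular file in src_dir into dst_file, in name order."""
    parts = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(dst_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                # Stream each part so no file is loaded fully into memory.
                shutil.copyfileobj(f, out)
    return len(parts)


# Hypothetical part files standing in for a job's output directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "parts"
    src.mkdir()
    (src / "part-00000").write_text("1,a\n2,b\n")
    (src / "part-00001").write_text("3,c\n")
    merged = Path(tmp) / "merged.csv"
    count = merge_parts(src, merged)
    print(count, "files merged")
    print(merged.read_text())
```

Because the parts are streamed to disk, the size of the merged file is bounded by storage rather than by what a browser can hold.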



--
Ruslan Dautkhanov

On Tue, May 2, 2017 at 10:41 AM, Paul Brenner <[hidden email]> wrote:
There are limits to how much data the download to csv button will download (1.5MB? 3500 rows?) which limit zeppelin’s usefulness for our BI teams. This limit comes up far before we run into issues with showing too many rows of data in zeppelin.

Unfortunately (fortunately?) Hue is the other tool the BI team has been using and there they have no problem downloading much larger datasets to csv. This is definitely not a requirement I’ve ever run into in the way I use zeppelin since I would just use spark to write the data out. However, the BI team is not allowed to run spark jobs (they use hive via jdbc) so that download to csv button is pretty important to them. 

Would it be possible to significantly increase the limit? Even better would it be possible to download more data than is shown? I assume this is the type of thing I would need to open a ticket for, but I wanted to ask here first.

Paul Brenner
DATA SCIENTIST
(217) 390-3033

PlaceIQ:Location Data Accuracy

Re: Can't download moderately large data or number of rows to csv

Kevin Niemann
We came across this issue as well. Zeppelin's CSV export uses the data URI scheme, which base64-encodes all the rows into a single string. Chrome seems to crash above a few thousand rows, but Firefox has handled over 100k for me. However, the Zeppelin notebook itself becomes slow at that point. I would also like better support for exporting a large set of rows; or perhaps another tool is better suited to this?
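
To get a feel for why a data URI scales poorly, here is a rough Python sketch that builds one for a CSV of 100,000 rows and compares sizes. It is not Zeppelin's code; the function name csv_data_uri and the sample rows are made up for illustration. The point it shows is general: base64 turns every 3 bytes into 4 characters, and the whole payload has to exist as one string.

```python
import base64
import csv
import io


def csv_data_uri(rows):
    """Serialize rows to CSV and wrap the result in a base64 data URI."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    raw = buf.getvalue().encode("utf-8")
    # The entire CSV is encoded at once into a single in-memory string.
    encoded = base64.b64encode(raw).decode("ascii")
    return raw, "data:text/csv;base64," + encoded


# Made-up sample data: 100,000 small rows.
rows = [(i, f"name_{i}", i * 1.5) for i in range(100_000)]
raw, uri = csv_data_uri(rows)
print(f"raw CSV bytes:   {len(raw):,}")
print(f"data URI length: {len(uri):,}")
print(f"inflation:       {len(uri) / len(raw):.2f}x")
```

The URI is roughly a third larger than the raw CSV, and the browser has to hold all of it as a single attribute value, which fits the behavior described above.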

On Tue, May 2, 2017 at 10:00 AM, Ruslan Dautkhanov <[hidden email]> wrote:
Good idea to introduce in Zeppelin a way to download full datasets without 
actually visualizing them.

Not sure if this helps, we taught our users to use %sh hadoop fs -getmerge /hadoop/path/dir/ /some/nfs/mount/
for large files (they sometimes have to download datasets with millions of records).
They run Zeppelin on edge nodes that have NFS mounts to a drop zone.

Not sure how much it scales up.



--
Ruslan Dautkhanov

Re: Can't download moderately large data or number of rows to csv

Rick Moritz
I think whether this is an issue or not depends a lot on how you use Zeppelin and what tools you need to integrate with. Sadly, Excel is still around as a data processing tool, and many people I introduce to Zeppelin are quite proficient with it, hence the desire to export to CSV in a trivial manner. Or perhaps the mere presence of the "download CSV" button leads them to expect it to work for reasonably sized data (i.e. up to around 10^6 rows).

I do prefer Ruslan's idea, but I think Zeppelin should include something similar out of the box. The key requirement should be that the data doesn't have to travel through the notebook interface, but is instead made available in a temporary folder and then served via a download link. The downside of this approach is that ideally you'd want this kind of operation to be interpreter-agnostic, in which case every interpreter would need to offer an interface for collecting the data into a temporary folder local to Zeppelin.

Nonetheless, to turn Zeppelin into the serve-it-all solution that it could be, I do believe that "fixing" the CSV export is important. I'd definitely vote for a Jira ticket advancing this issue.
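
As a rough illustration of the temporary-folder idea, the Python sketch below streams rows into a CSV file on disk and returns the path that a download link could then point to. This is not an existing Zeppelin interface; the function name export_to_temp_csv, its parameters, and the sample generator are hypothetical, and serving the file is left out.

```python
import csv
import os
import tempfile


def export_to_temp_csv(header, rows, export_dir=None):
    """Stream rows into a CSV file in a temporary folder and return its path.

    `rows` can be any iterable (for example a database cursor), so the full
    result never has to be held in memory or sent through the notebook UI.
    """
    fd, path = tempfile.mkstemp(suffix=".csv", prefix="export_", dir=export_dir)
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
    return path


# Hypothetical usage: a generator stands in for an interpreter's result set.
result = ((i, i * i) for i in range(1_000_000))
path = export_to_temp_csv(["n", "n_squared"], result)
print(path, os.path.getsize(path), "bytes")
os.remove(path)  # clean up the demo file
```

Writing row by row keeps memory use flat regardless of result size, which is the property the data-URI export lacks.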

On Tue, May 2, 2017 at 9:33 PM, Kevin Niemann <[hidden email]> wrote:
We came across this issue as well, Zeppelin csv export is using the data URI scheme which is base64 encoding all the rows into a single string, Chrome seems to crash with over a few thousand rows, but Firefox has been able to handle over 100k for me. However, the Zeppelin notebook itself becomes slow at that point. I would also like better support for the ability to export a large set of rows, perhaps another tool is more preferred?

Re: Can't download moderately large data or number of rows to csv

Paul Brenner
I’m not sure what the best solution is but I created a ticket here:



Paul Brenner
DATA SCIENTIST
(217) 390-3033  

PlaceIQ:Location Data Accuracy

On Wed, May 03, 2017 at 4:01 AM Rick Moritz <[hidden email]> wrote:
I think whether this is an issue or not, depends a lot on how you use Zeppelin, and what tools you need to integrate with. Sadly Excel is still around as a data processing tool, and many people who I introduce to Zeppelin are quite proficient with it, hence the desire to export to csv in a trivial manner --  or merely the presence of the "download CSV"-button incites them to expect it to work for reasonably sized data (i.e. up to around 10^6 rows).

I do prefer Ruslan's idea, but I think Zeppelin should include something similar out of the box. The key requirement should be that the data doesn't have to travel through the notebook interface, but rather is made available in a temporary folder and then served via a download link. The downside to this approach is, that ideally you'd want this kind of operation to be interpreter agnostic. In that case every interpreter would need to offer an interface which allows to collect the data to a local-to-zeppelin temporary folder.

Nonetheless, to turn Zeppelin into the serve-it-all solution that it could be, I do believe that "fixing" the csv-export is important. I'd definitely vote for a Jira advancing this issue.
