Help with loading a CSV using Spark-SQL & Spark-CSV


Help with loading a CSV using Spark-SQL & Spark-CSV

Ryan
Hi,

In a Zeppelin notebook, I am trying to load a CSV using the spark-csv package by Databricks. I am running Zeppelin on the Hortonworks sandbox. Unfortunately, the methods I have been trying have not been working.

My latest attempt is:
%dep
z.load("com.databricks:spark-csv_2.10:1.2.0")
%spark
sqlContext.load("hdfs://sandbox.hortonworks.com:8020/user/root/data/crime_incidents_2013_CSV.csv", Map("path" -> crimeData, "header" -> "true")).registerTempTable("crimes")

This is the error I receive: 
<console>:16: error: not found: value sqlContext
       sqlContext.load("hdfs://sandbox.hortonworks.com:8020/user/root/data/crime_incidents_2013_CSV.csv", Map("path" -> crimeData, "header" -> "true")).registerTempTable("crimes")
       ^
<console>:12: error: not found: value %
       %spark
       ^
Thank you for any help in advance,
Ryan

Re: Help with loading a CSV using Spark-SQL & Spark-CSV

Alexander Bezzubov-2
Hi,

Thank you for your interest in Zeppelin!

A couple of things I noticed: as you probably already know, the %dep and %spark parts should always be in separate paragraphs.

%spark already exposes the SQL context through the `sqlc` variable, so you should use sqlc.load("...") instead.

And of course, to be able to use the %spark interpreter in the notebook, you need to make sure it is bound (gear button at the top right).
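Putting those pieces together, a corrected notebook might look like the following sketch (two separate paragraphs; the HDFS path is the one from your post, and the `crimeData` value is introduced here for illustration since your snippet referenced it without defining it):

```
// Paragraph 1: load the dependency (must run before the Spark interpreter starts)
%dep
z.reset()
z.load("com.databricks:spark-csv_2.10:1.2.0")

// Paragraph 2: a SEPARATE notebook paragraph, starting with %spark
%spark
val crimeData = "hdfs://sandbox.hortonworks.com:8020/user/root/data/crime_incidents_2013_CSV.csv"
sqlc.load("com.databricks.spark.csv", Map("path" -> crimeData, "header" -> "true"))
    .registerTempTable("crimes")
```

Note that in the spark-csv 1.x API the first argument to load() is the data source name ("com.databricks.spark.csv"), while the file location goes in the "path" option.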

Hope this helps!

--
Kind regards,
Alex


On Mon, Sep 28, 2015 at 4:29 PM, Ryan <[hidden email]> wrote:


Re: Help with loading a CSV using Spark-SQL & Spark-CSV

Ryan
Hi Alex,

Thank you for getting back to me!

The tutorial code was a bit confusing and made it seem like sqlContext was the proper variable to use:
// Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)

I tried as you mentioned, but am still getting similar errors. Here is the code I tried:

%dep
z.reset()
z.load("com.databricks:spark-csv_2.10:1.2.0")

%spark
sqlc.load("com.databricks.spark.csv", Map("path" -> crimeData, "header" -> "true")).registerTempTable("crimes")

The %spark interpreter is bound in the settings. I clicked Save again to make sure, then reran the paragraph. I am getting this error:

<console>:17: error: not found: value sqlc
       sqlc.load("com.databricks.spark.csv", Map("path" -> crimeData, "header" -> "true")).registerTempTable("crimes")
       ^
<console>:13: error: not found: value %
       %spark
Could it be something to do with my Zeppelin installation? The tutorial code ran without any issues though.

Thanks!
Ryan



On Mon, Sep 28, 2015 at 5:07 PM, Alexander Bezzubov <[hidden email]> wrote:



Re: Help with loading a CSV using Spark-SQL & Spark-CSV

Ryan
Any updates on this? Or perhaps a tutorial that successfully integrates spark-csv into Zeppelin? If I can rule out the code as the problem, I can start looking into the installation to see what's going wrong.

Thanks,
Ryan

On Tue, Sep 29, 2015 at 9:09 AM, Ryan <[hidden email]> wrote:




Re: Help with loading a CSV using Spark-SQL & Spark-CSV

Alexander Bezzubov
Hi,

It's really hard to say more without looking at the logs of the Zeppelin server and the Spark interpreter in your case.

The way you do it seems right, and I have had no problems before, reading CSVs exactly the same way, except that I never used %spark explicitly. Instead, I always made sure that in the interpreter bindings Spark is the first one on the list (you can drag and drop bindings to reorder them), so it becomes the default and there is no need to type %spark.

Can you try that out? If it still does not work, it would be best to create an issue in JIRA with the logs attached. It might also be worth posting a link there to the particular notebook, e.g. by using something like https://www.zeppelinhub.com/viewer/, so the paragraph structure can be verified.
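For reference, once Spark is the first (default) interpreter binding, the read paragraph needs no %spark prefix at all. A sketch, reusing the path and options from the original post:

```
// Spark is the default interpreter here, so no %spark line is needed
val crimeData = "hdfs://sandbox.hortonworks.com:8020/user/root/data/crime_incidents_2013_CSV.csv"
sqlc.load("com.databricks.spark.csv", Map("path" -> crimeData, "header" -> "true"))
    .registerTempTable("crimes")
```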

Hope this helps!

On Thu, Oct 1, 2015 at 1:50 AM, Ryan <[hidden email]> wrote:
--
Kind regards,
Alexander.


Re: Help with loading a CSV using Spark-SQL & Spark-CSV

Felix Cheung
Do you have the %spark line in the middle of a notebook "box"? It should be only at the beginning of a paragraph.
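To illustrate the failure mode (a sketch, not your exact notebook): anything after the first line of a paragraph is handed to the Scala REPL as code, so a mid-paragraph %spark is evaluated as Scala and produces exactly the "not found: value %" error in your output:

```
// Broken: %dep code and %spark in ONE paragraph -- the %spark line below is
// compiled as Scala, yielding "error: not found: value %"
z.load("com.databricks:spark-csv_2.10:1.2.0")
%spark
sqlc.load("com.databricks.spark.csv", Map("header" -> "true"))

// Fixed: %spark as the very first line of its own, separate paragraph
%spark
sqlc.load("com.databricks.spark.csv", Map("header" -> "true"))
```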





On Tue, Oct 6, 2015 at 6:42 AM -0700, "Alexander Bezzubov" <[hidden email]> wrote:
