Re: Minor bug in dynamic form and question on limits of results


Re: Minor bug in dynamic form and question on limits of results

moon
Administrator
Hi, 

Yes, the %sql function works only on tables that have been registered.
Using a DataFrame is basically similar to what you're currently doing; it still needs registerTempTable. 
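
For example, a minimal sketch (assuming Spark 1.3 and the sqlContext Zeppelin provides; the file path and case class are only illustrative):

import sqlContext.implicits._
case class Person(name: String, age: Int)
val people = sc.textFile("hdfs:///tmp/people.csv")   // illustrative path
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()                                            // RDD of case classes -> DataFrame
people.registerTempTable("people")                   // now %sql can see it

%sql select * from people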

Could you share a little bit about the problem you hit when registering tables?

And I really appreciate you reporting the bug!

Thanks,
moon

On Wed, Jun 24, 2015 at 11:28 PM Corneau Damien <[hidden email]> wrote:
Yes, you can change the number of records. The default value is 1000.

On Thu, Jun 25, 2015 at 2:32 PM, Nihal Bhagchandani <[hidden email]> wrote:
Hi Su,

As per my understanding, you can change the 1000-record limit from the interpreter section by setting the value of the variable "zeppelin.spark.maxResult". Moon, could you please confirm my understanding?
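
For reference, it is a plain interpreter property, e.g. (just a sketch; pick whatever limit you need):

zeppelin.spark.maxResult = 10000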

Regards
Nihal



On Thursday, 25 June 2015 10:00 AM, Su She <[hidden email]> wrote:


Hello Everyone,

Excited to be making progress, and thanks to the community for providing help along the way. This stuff is all really cool.


Questions:

1) I noticed that the limit for the visual representation is 1000 results. Are there any short-term plans to expand the limit? It seemed a little on the low side, as one of the main reasons for working with Spark/Hadoop is to work with large datasets.

2) When can I use the %sql function? Is it only on tables that have been registered? I have been having trouble registering tables unless I do:

// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)

// Register the SchemaRDD as a table.
peopleSchemaRDD.registerTempTable("people")

I am having lots of trouble registering tables through HiveContext, or even duplicating the Zeppelin tutorial. Is this issue mitigated by using DataFrames (I am planning to move to 1.3 very soon)?
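
(From what I can tell of the 1.3 API, the DataFrame equivalent of the snippet above would be roughly this, assuming the same rowRDD and schema:

val peopleDF = sqlContext.createDataFrame(rowRDD, schema)
peopleDF.registerTempTable("people")

but please correct me if that's wrong.)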


Bug:

When I do this:
z.show(sqlContext.sql("select * from sensortable limit 100"))

I get the table, but I also get text results at the bottom; please see the attached image. In case the image doesn't go through: I basically get the table and everything works well, but the select statement also returns text (regardless of whether it's 100 results or all of them).

Thank you!

Best,

Su




Re: Minor bug in dynamic form and question on limits of results

Su She
Hey Moon/All,

Sorry for the late reply.

This is the problem I'm encountering when trying to register a Hive table as a temp table. It seems that it cannot find the table; I've bolded the relevant part in the error message I've copy/pasted below. Please let me know if this is the best way of doing this. My end goal is to execute:

z.show(hc.sql("select * from test1"))

Thank you for the help!

//Code:
import sys.process._
import org.apache.spark.sql.hive._
val hc = new HiveContext(sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

hc.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test1(x string, y string, time string, z int, v int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://.us-west-1.compute.internal:8020/user/flume/'").registerTempTable("test2")

val results = hc.sql("select * from test2 limit 100") //have also tried test1
 
// everything works fine up to here, but due to lazy evaluation, I guess that doesn't mean much
results.map(t => "Name: " + t(0)).collect().foreach(println) 

results: org.apache.spark.sql.SchemaRDD =
SchemaRDD[41] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Limit 100
 !Project [result#105]
  NativeCommand CREATE EXTERNAL TABLE IF NOT EXISTS test1(date int, date_time string, time string, sensor int, value int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://ip-10-0-2-216.us-west-1.compute.internal:8020/user/flume/', [result#112]

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 4, localhost): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: result#105
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46)
  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:54)
  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:54)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:54)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:105)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:105)
  at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:44)
  at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:43)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Couldn't find result#105 in [result#112]
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:53)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:47)
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
  ... 33 more

Thank you!


Re: Minor bug in dynamic form and question on limits of results

moon
Administrator
Hi,

Please try not to create hc or sqlContext manually; use the sqlContext that Zeppelin creates. After running

sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test1(x string, y string, time string, z int, v int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://.us-west-1.compute.internal:8020/user/flume/'")

you should be able to access the table 'test1' in your query.

You can also use registerTempTable("test2") to access a table 'test2', but it has to be called on a valid DataFrame. So not

sqlContext.sql("CREATE EXTERNAL TABLE .... ").registerTempTable("test2")

but like this:

sqlContext.sql("select * from test1").registerTempTable("test2")

Tell me if it helps.

Best,
moon



Re: Minor bug in dynamic form and question on limits of results

Su She
Thanks for the suggestion, Moon. Unfortunately, I got an InvocationTargetException:

sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test3(b int, w string, a string, x int, y int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://ip:8020/user/flume/'")

sqlContext.sql("select * from test3").registerTempTable("test1")

%sql select * from test1

java.lang.reflect.InvocationTargetException

To clarify, my data is in CSV format within the directory /user/flume/

so:

user/flume/csv1.csv
user/flume/csv2.csv

The reason I created a HiveContext is that when I do:

hc.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test3(date int, date_time string, time string, sensor int, value int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://ip:8020/user/flume/'")

val results = hc.sql("select * from test3")

results.take(10) // I am able to get the results, but when I replace hc with sqlContext, I can't do results.take()

This last line returns an Array of Rows.

Please let me know if I am doing anything wrong, thanks!





Re: Minor bug in dynamic form and question on limits of results

Su She
Okay, I'm really confused, haha; some things are working, but I'm not sure why...


So this works:

hc.sql("CREATE EXTERNAL TABLE IF NOT EXISTS test3(date int, date_time string, time string, sensor int, value int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://ip-10-0-2-216.us-west-1.compute.internal:8020/user/flume/'")
 
val results = hc.sql("select * from test3")

val schemaString = "x, y, z, a, value"

import org.apache.spark.sql._

val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

val wikiSchemaRDD = sqlContext.applySchema(results, schema)

wikiSchemaRDD.registerTempTable("test4")


z.show(sqlContext.sql("select * from test4 limit 1001"))

Why does this last line work, but %sql select * from test4 does not? Zeppelin is able to recognize test4, which was only registered as a temp table, so why can I not use %sql on it?

Also, why can't I do z.show(sqlContext.sql("select * from results limit 1001"))?

The only difference I see between test4 and results is that test4 had a schema applied (though results is of type org.apache.spark.sql.Row, not catalyst.Row).

Sorry for all the questions, thanks for the answers!



Re: Minor bug in dynamic form and question on limits of results

moon
Administrator
Hi, 

Thanks for asking questions.

Once you do

yourRDD.registerTempTable("table")

both of these should work:

  z.show(sqlContext.sql("select * from table"))
  %sql select * from table

The results should be identical.

Could you double-check that you haven't created sqlContext manually?
If you have created sqlContext manually, then %sql will not recognize tables created with registerTempTable().
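
In other words, roughly:

// avoid: this shadows the sqlContext Zeppelin injected, so tables
// registered through it are invisible to %sql
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// instead: build your SchemaRDD/DataFrame from Zeppelin's own sqlContext, then
yourRDD.registerTempTable("table")
// %sql select * from table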

Thanks,
moon


Re: Minor bug in dynamic form and question on limits of results

Su She
Ahh okay, so when I don't use or create a HiveContext, it now works. It seems, though, that I have to apply a schema to val results = sqlContext.sql("select * from table") before registering it in a way that sqlContext can see the table.
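
For anyone who hits this later, the shape that works for me looks roughly like this (table and field names are just placeholders):

import org.apache.spark.sql._
val results = sqlContext.sql("select * from sometable")
val schema = StructType(List(StructField("value", StringType, true)))   // substitute your real fields
val withSchema = sqlContext.applySchema(results, schema)
withSchema.registerTempTable("sometable2")

%sql select * from sometable2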

Thanks for the answers, Moon!
