Pig Interpreter

8 messages
Pig Interpreter

Michael Parco
Is there any current work or plans for a Pig interpreter in Zeppelin?

Re: Pig Interpreter

moon
Administrator
Hi,

As far as I know, there's no ongoing work on a Pig interpreter, but there's no reason not to have one. How about filing an issue for it?

Thanks,
moon

Re: Pig Interpreter

Nihal Bhagchandani
Is there any extra advantage to having a Pig interpreter when Zeppelin already supports Spark SQL?

Nihal

Sent from my iPhone


Re: Pig Interpreter

moon
Administrator
I don't know Pig very well, but it's a little difficult to see how Spark SQL can help Pig users. Can you explain more?

Thanks,
moon

Re: Pig Interpreter

Nihal Bhagchandani
Hi,
So, as per my understanding:

Pig uses a scripting language called Pig Latin, which is more workflow driven. It is an abstraction layer on top of MapReduce. Pig uses batch-oriented frameworks, which means your analytic jobs will run for minutes or maybe hours depending on the volume of data. Think of Pig as step-by-step SQL execution.

Spark SQL: allows us to run SQL-like operations on HDFS or the file system with up to 100x better performance than MapReduce when the SQL runs in memory; on disk it is about ten times faster.

Pig is a SQL-like language that gracefully tolerates inconsistent schemas and runs on Hadoop.

The basic concepts in SQL map pretty well onto Pig. There are analogues for the major SQL keywords, and as a result you can write a query in your head as SQL and then translate it into Pig Latin without undue mental gymnastics.
WHERE → FILTER
The syntax is different, but conceptually this is still putting your data into a funnel to create a smaller dataset.
HAVING → FILTER
Because a FILTER is done in a separate step from a GROUP or an aggregation, the distinction between HAVING and WHERE doesn’t exist in Pig.
ORDER BY → ORDER
This keyword behaves pretty much the same in Pig as in SQL.
JOIN
In Pig, joins can have their execution specified, and they look a little different, but in essence these are the same joins you know from SQL, and you can think about them in the same way. There are INNER and OUTER joins, RIGHT and LEFT specifications, and even CROSS for those rare moments that you actually want a Cartesian product.
Because Pig is most appropriately used for data pipelines, there are often fewer distinct relations or tables than you would expect to see in a traditional normalized relational database.
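The keyword mapping above can be sketched end to end in Pig Latin. This is a minimal illustration with hypothetical input files (users.tsv, orders.tsv) and field names, not code from the thread:

```pig
-- Hypothetical inputs; schemas are declared at load time
users  = LOAD 'users.tsv'  USING PigStorage('\t') AS (id:int, country:chararray);
orders = LOAD 'orders.tsv' USING PigStorage('\t') AS (user_id:int, total:double);

-- WHERE -> FILTER (and it can run before the join, unlike in SQL)
us_users = FILTER users BY country == 'US';

-- JOIN: the familiar inner join
joined = JOIN us_users BY id, orders BY user_id;

-- GROUP plus an aggregate, then HAVING -> just another FILTER step
by_user = GROUP joined BY us_users::id;
totals  = FOREACH by_user GENERATE group AS id, SUM(joined.orders::total) AS spend;
big     = FILTER totals BY spend > 100.0;

-- ORDER BY -> ORDER
ranked = ORDER big BY spend DESC;
DUMP ranked;
```

Note how HAVING disappears: because the aggregation is its own step, the post-aggregation FILTER on `spend` is syntactically identical to the pre-join FILTER on `country`.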

Control over Execution

SQL performance tuning generally involves some fiddling with indexes, punctuated by the occasional yelling at an explain plan that has inexplicably decided to join the two largest tables first. It can mean getting a different plan the second time you run a query, or having the plan suddenly change after several weeks of use because the statistics have evolved, throwing your query’s performance into the proverbial toilet.
Various SQL implementations offer hints to combat this problem—you can use a hint to tell your SQL optimizer that it should use an index, or to force a given table to be first in the join order. Unfortunately, because hints are dependent on the particular SQL implementation, what you actually have at your disposal varies by platform.
Pig offers a few different ways to control the execution plan. The first is just the explicit ordering of operations. You can write your FILTER before your JOIN (the reverse of SQL’s order) and be clever about eliminating unused fields along the way, and have confidence that the executed order will not be worse.
Secondly, the philosophy of Pig is to allow users to choose implementations where multiple ones are possible. As a result, there are three specialized joins that can be used when the features of the data are known and a regular join is less appropriate. For regular joins, the order of the arguments dictates execution—the larger data set should appear last in this type of join.
As with SQL, in Pig you can pretty much ignore the performance tweaks until you can’t. Because of the explicit control of ordering, it can be useful to have a general sense of the “good” order to do things in, though Pig’s optimizer will also try to push up FILTERs and LIMITs, taking some of the pressure off.
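As a sketch of that explicit control (the file names and the timestamp cutoff are hypothetical), filter-first ordering, early field pruning, and one of the specialized joins look like this:

```pig
-- Order of operations is explicit: filter and trim first, then join
big   = LOAD 'events.tsv' AS (user_id:int, ts:long, payload:chararray);
small = LOAD 'lookup.tsv' AS (user_id:int, segment:chararray);

recent = FILTER big BY ts > 1443657600L;       -- filter before the join
slim   = FOREACH recent GENERATE user_id, ts;  -- drop unused fields early

-- Specialized join: 'replicated' ships the small relation (listed last)
-- to every map task, avoiding a reduce-side join entirely
j = JOIN slim BY user_id, small BY user_id USING 'replicated';
```

This is the SQL-hint problem solved differently: instead of nudging an optimizer, you state the plan you want and Pig will not reorder it for the worse.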

Here is Denny Lee's post comparing Spark and Pig: http://dennyglee.com/2013/08/19/why-all-this-interest-in-spark/

Most of the tasks/processing possible through Pig can easily be achieved with Spark, in much less and easier-to-understand code, and since Spark works in memory it can be up to 100x faster than Hadoop MapReduce tasks.

Regards
Nihal



Re: Pig Interpreter

IT CTO
The syntax might be similar, but the Spark context cannot execute a Pig script, so you would need a Pig interpreter to do that.
Eran



--
Eran | "You don't need eyes to see, you need vision" (Faithless)

Re: Pig Interpreter

Michael Parco
The syntax of Pig and Spark SQL (SQL in general) does share similar features, but in general Pig is a script-based flow as opposed to ad hoc querying.

"In comparison to SQL, Pig

  1. uses lazy evaluation,
  2. uses extract, transform, load (ETL),
  3. is able to store data at any point during a pipeline,
  4. declares execution plans,
  5. supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines."
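Points 3 and 5 in the quoted list can be illustrated with Pig's SPLIT operator, which forks one relation into several branches of a DAG; the field names and output paths here are hypothetical:

```pig
logs = LOAD 'logs.tsv' AS (level:chararray, msg:chararray);

-- SPLIT forks the pipeline into a DAG rather than a strict sequence
SPLIT logs INTO errors IF level == 'ERROR', rest IF level != 'ERROR';

-- Each branch can be materialized at any point mid-pipeline
STORE errors INTO 'out/errors';
STORE rest   INTO 'out/rest';
```

A SQL statement produces one result set; a Pig script like this produces a workflow with multiple sinks, which is the "scripted flow" distinction being drawn above.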

Regardless of the similarities to Spark SQL, a Pig interpreter is enticing for a few reasons. Many organizations still run Pig jobs in production today, and Pig continues to advance. Pig's support for custom UDFs has made it a language for ETL as well as some machine learning over that same data. There is also a lot of work to use Spark as an execution engine for Pig: the Spork project from Sigmoid Analytics came about last year and is now a development branch within Pig itself. With Pig executing on Spark (there is also work for Pig to execute on Flink, Storm, and Apex), it would be an enhancement to the suite of tools within Zeppelin.




Re: Pig Interpreter

moon
Administrator
In reply to this post by Nihal Bhagchandani
Thanks, Nihal, for the explanation.

Most of the tasks/processing possible through Pig can easily be achieved with Spark, in much less and easier-to-understand code, and since Spark works in memory it can be up to 100x faster than Hadoop MapReduce tasks.

But I don't think this can be the reason not to have a Pig interpreter. Pig's syntax and execution engine are different, and that's enough to justify an interpreter, I think.

Thanks,
moon
