We are really enjoying the workflow of interacting with our data via Zeppelin, but we are not sold on the built-in cron scheduling capability. We would like to be able to create more complex DAGs, which are better suited to something like Airflow. I was curious whether anyone has done an integration of Zeppelin with Airflow.
We also use both Zeppelin and Airflow.
I'm interested in hearing what others are doing here too.
Although, honestly, there might be some challenges:
- Airflow expects a DAG structure, while a notebook has a pretty linear structure;
- Airflow is Python-based; Zeppelin is all Java (the REST API might be of help?).
Jupyter + Airflow might be a more natural fit to integrate?
On top of that, the way we use Zeppelin is mostly ad-hoc queries,
while Airflow is for more finalized workflows, I guess?
Thanks for bringing this up.
On Fri, May 19, 2017 at 2:20 PM, Ben Vogan <[hidden email]> wrote:
I do not expect the relationship between DAGs to be described in Zeppelin - that would be done in Airflow. It just seems that Zeppelin is such a great tool for a data scientist's workflow that it would be nice if, once they are done with the work, the note could be productionized directly. I could envision a couple of scenarios:
1. Using a zeppelin instance to run the note via the REST API. The instance could be containerized and spun up specifically for a DAG or it could be a permanently available one.
2. A note could be pulled from git and some part of the Zeppelin engine could execute the note without the web UI at all.
I would expect on the airflow side there to be some special operators for executing these.
If the scheduler is pluggable, then it should be possible to create a plug-in that talks to the Airflow REST API.
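Scenario 1 above can be illustrated with a small sketch. Zeppelin's REST API exposes a run-all-paragraphs endpoint for a note; this sketch uses only the standard library, and the host and note id are hypothetical placeholders:

```python
import json
import urllib.request

def run_note_url(base_url: str, note_id: str) -> str:
    """Build Zeppelin's REST endpoint that runs every paragraph of a note."""
    return f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"

def trigger_note(base_url: str, note_id: str) -> dict:
    """POST to the run-note endpoint and return Zeppelin's JSON response."""
    req = urllib.request.Request(run_note_url(base_url, note_id), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical host and note id:
# trigger_note("http://zeppelin:8080", "2A94M5J1Z")
```

An Airflow operator for this would essentially wrap `trigger_note` and then poll the job status until the note finishes.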
I happen to prefer Zeppelin to Jupyter - although I get your point about both being Python. I don't really view that as a problem - most of the big data platforms I'm talking to are implemented on the JVM, after all. The Python part of Airflow is really just describing what gets run, and it isn't hard to run something that isn't written in Python.
On Fri, May 19, 2017 at 2:52 PM, Ruslan Dautkhanov <[hidden email]> wrote:
Thanks for sharing this Ben.
I agree Zeppelin is a better fit, with its tighter Spark integration and built-in visualizations.
We have pretty much standardized on PySpark, so here's one of the scripts we use internally
to extract %pyspark, %sql and %md paragraphs into a standalone script (which can then be scheduled in Airflow, for example).
Hope this helps.
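The script itself wasn't preserved in this archive, but the idea can be sketched. This is a minimal illustration, not Ruslan's actual script, and it assumes Zeppelin's exported note.json layout (a "paragraphs" list whose entries carry a "text" field starting with an interpreter magic):

```python
import json

# Interpreter magics to keep; %md paragraphs become comments.
KEEP = ("%pyspark", "%sql", "%md")

def extract_script(note: dict) -> str:
    """Turn a parsed Zeppelin note.json into a standalone PySpark script."""
    chunks = []
    for para in note.get("paragraphs", []):
        text = (para.get("text") or "").strip()
        magic = text.split(None, 1)[0] if text else ""
        if magic not in KEEP:
            continue
        body = text[len(magic):].strip()
        if magic == "%md":
            # Markdown survives as comments.
            chunks.append("\n".join("# " + line for line in body.splitlines()))
        elif magic == "%sql":
            # Hypothetical translation: run the SQL through the session.
            chunks.append(f'spark.sql("""{body}""").show()')
        else:  # %pyspark
            chunks.append(body)
    return "\n\n".join(chunks)

# Usage:
# with open("note.json") as f:
#     print(extract_script(json.load(f)))
```

Paragraphs with no explicit magic (i.e. using the note's default interpreter) are skipped here; a real script would need a rule for those too.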
ps. In my opinion, adding dependencies between paragraphs wouldn't be that hard for simple cases,
and could be a first step toward defining a DAG in Zeppelin directly. It would be really awesome to see this type of
integration in the future.
Otherwise I don't see much value if a whole note (i.e. a whole workflow) runs as a single task in Airflow.
In my opinion, each paragraph has to be a task - then it'll be very useful.
On Fri, May 19, 2017 at 4:55 PM, Ben Vogan <[hidden email]> wrote:
Thanks for sharing this Ruslan - I will take a look.
I agree that paragraphs can form tasks within a DAG. My point was that ideally a DAG could encompass multiple notes - i.e., the completion of one note triggers another, and so on, to complete an entire chain of dependent tasks.
For example team A has a note that generates data set A*. Teams B & C each have notes that depend on A* to generate B* & C* for their specific purposes. It doesn't make sense for all of that to have to live in one note, but they are all part of a single workflow.
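This cross-note workflow amounts to deriving edges from each note's declared inputs and outputs. A minimal sketch, assuming the note names and dataset names are just strings:

```python
def build_edges(notes: dict) -> list:
    """Derive note-to-note dependency edges from declared inputs/outputs.

    `notes` maps note name -> {"inputs": [...], "outputs": [...]}.
    Returns sorted (upstream, downstream) pairs.
    """
    producer = {}
    for name, spec in notes.items():
        for out in spec.get("outputs", []):
            producer[out] = name
    edges = []
    for name, spec in notes.items():
        for inp in spec.get("inputs", []):
            if inp in producer:
                edges.append((producer[inp], name))
    return sorted(edges)

# Ben's example: team A's note produces A*; teams B and C consume it.
notes = {
    "note_a": {"inputs": [], "outputs": ["A*"]},
    "note_b": {"inputs": ["A*"], "outputs": ["B*"]},
    "note_c": {"inputs": ["A*"], "outputs": ["C*"]},
}
# build_edges(notes) -> [("note_a", "note_b"), ("note_a", "note_c")]
```

Each edge would then become an Airflow dependency between the operators that run the two notes.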
On Fri, May 19, 2017 at 9:02 PM, Ruslan Dautkhanov <[hidden email]> wrote:
We have begun experimenting with an Airflow/Zeppelin integration. We use the first paragraph of a note to define its dependencies, outputs, name, owner, and schedule.

There are utility functions (in Scala) that provide a data catalog for retrieving data sources. These functions return a dataframe and record that note's dependency on a particular data source, so that a DAG can be constructed between the task that creates the input data and the Zeppelin note. There is also a function that provides a dataframe writer to capture the outputs a note produces. This registers the note as the source for that data, which is then available in the data catalog for other notebooks to use. This allows one notebook to depend on data created by another notebook.
An Airflow DAG generator (Python code) queries the Zeppelin notebook server, looking for the results of the first paragraph of each note. It uses these outputs to construct the DAG between notebooks, and generates a ZeppelinNoteOperator for each note, which uses the Zeppelin REST API to execute the notebook when the scheduler schedules that task.
We've just started to use this so we don't have a lot of experience with it yet. The biggest caveats to start are:
* There is no mechanism for test cases of note code.
* We have to call the notebook server on every iteration of the scheduler / whenever a DAG is initialized - we use cached results when available, but it still requires a round trip to the Zeppelin notebook server.
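The second caveat can be softened with a TTL cache around the metadata fetch, so most scheduler iterations skip the round trip entirely. A minimal sketch; `fetch` stands in for whatever function actually calls the Zeppelin REST API:

```python
import time

class TTLCache:
    """Cache note metadata so a DAG generator doesn't hit the
    Zeppelin server on every scheduler iteration."""

    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch          # callable: note_id -> metadata dict
        self._ttl = ttl_seconds
        self._store = {}             # note_id -> (expires_at, value)

    def get(self, note_id):
        now = time.monotonic()
        hit = self._store.get(note_id)
        if hit and hit[0] > now:
            return hit[1]            # fresh cached value: no round trip
        value = self._fetch(note_id)
        self._store[note_id] = (now + self._ttl, value)
        return value

# Usage, wrapping a hypothetical REST call:
# cache = TTLCache(fetch_first_paragraph_result, ttl_seconds=600)
# meta = cache.get("2A94M5J1Z")
```

The trade-off is staleness: a note whose first-paragraph metadata changes won't be re-read until the TTL expires.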
On Fri, May 19, 2017 at 10:15 PM Ben Vogan <[hidden email]> wrote: