Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark - Business Platform Team

This resource is based on an article originally published here.

In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we discussed how we can do Cassandra ETL processes using Airflow and Spark. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we show you how to set up a basic Cassandra ETL process with Airflow and Spark. If you want to hear why we used the BashOperator rather than the SparkSubmitOperator (as in Data Engineer’s Lunch #25: Airflow and Spark), be sure to check out the live recording of Cassandra Lunch #53 below!
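
To make the BashOperator approach concrete, here is a minimal sketch of what a DAG along these lines might look like. This is not the exact spark_dag.py from the repo: the Airflow 2.x import path, file paths, and task names are illustrative, and the Spark master URL is a hardcoded placeholder (in the walkthrough, the spark_default connection holds it instead; see step 7).

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

# Placeholder values -- substitute your own master URL and job locations.
SPARK_MASTER = "spark://<master-host>:7077"
PACKAGES = "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0"

with DAG(
    "example_cassandra_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # trigger manually; set a cron expression to run on a schedule
    catchup=False,
) as dag:
    # Each task simply shells out to spark-submit -- the essence of the BashOperator approach.
    load_csv_to_cassandra = BashOperator(
        task_id="load_csv_to_cassandra",
        bash_command=f"spark-submit --master {SPARK_MASTER} --packages {PACKAGES} /path/to/first_job.py",
    )
    transform_within_cassandra = BashOperator(
        task_id="transform_within_cassandra",
        bash_command=f"spark-submit --master {SPARK_MASTER} --packages {PACKAGES} /path/to/second_job.py",
    )
    # Run the CSV load first, then the table-to-table transformation.
    load_csv_to_cassandra >> transform_within_cassandra

The SparkSubmitOperator alternative would replace each BashOperator with a SparkSubmitOperator wired to the spark_default connection; the bash route just invokes the same spark-submit binary directly.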

In this walkthrough, we will cover how we can use Airflow to trigger Spark ETL jobs that move data into and within Cassandra. This demo will be relatively simple; however, it can be expanded upon with the addition of other technologies like Kafka, by scheduling the Spark jobs to run on a recurring basis, or in general by building more complex Cassandra ETL pipelines. We will focus on showing you how to connect Airflow, Spark, and Cassandra, and in our case today, specifically DataStax Astra. The reason we are using DataStax Astra is that we want everyone to be able to do this demo without having to worry about OS incompatibilities and the like. For that reason, we will also be using Gitpod, so the entire walkthrough can be done within your browser!

For this walkthrough, we will use 2 Spark jobs. The first Spark job will load 100k rows from a CSV and then write them into a Cassandra table. The second Spark job will read the data from that Cassandra table, do some transformations, and then write the transformed data into a different Cassandra table. We also used PySpark to reduce the number of steps needed to get this working; if we used Scala, we would have to build the JARs, which would take more time. If you are interested in seeing how to use the Airflow SparkSubmitOperator and run Scala Spark jobs, check out this walkthrough!
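
To give a rough shape to those two jobs, here is a condensed PySpark sketch, assuming the spark-cassandra-connector and an Astra secure connect bundle. The CSV path, the first_day/last_day column names, and the days-worked calculation are illustrative assumptions, not the exact code from the repo.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("cassandra-etl")
    # Ship the secure bundle with the job and point the connector at it,
    # authenticating with the Client ID / Client Secret from your Astra token.
    .config("spark.files", "/path/to/secure-connect-<your-db>.zip")
    .config("spark.cassandra.connection.config.cloud.path", "secure-connect-<your-db>.zip")
    .config("spark.cassandra.auth.username", "<client-id>")
    .config("spark.cassandra.auth.password", "<client-secret>")
    .getOrCreate()
)

# Job 1: load the 100k-row CSV and append it to the first Cassandra table.
employees = spark.read.csv("previous_employees.csv", header=True, inferSchema=True)
(employees.write.format("org.apache.spark.sql.cassandra")
    .options(table="previous_employees_by_job_title", keyspace="<your-keyspace>")
    .mode("append")
    .save())

# Job 2: read that table back, derive a days-worked column (hypothetical
# first_day/last_day columns), and write the result to the second table.
# In practice you would select exactly the columns of the target table.
source = (spark.read.format("org.apache.spark.sql.cassandra")
    .options(table="previous_employees_by_job_title", keyspace="<your-keyspace>")
    .load())
days_worked = source.withColumn("days_worked", F.datediff("last_day", "first_day"))
(days_worked.write.format("org.apache.spark.sql.cassandra")
    .options(table="days_worked_by_previous_employees_by_job_title", keyspace="<your-keyspace>")
    .mode("append")
    .save())

In the demo these run as two separate scripts driven by properties.conf; the point here is just the read/transform/write shape and where the Astra credentials plug in.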

You can also do the walkthrough using this GitHub repo! As mentioned above, the live recording is embedded below if you want to watch the walkthrough live.

1. Set up DataStax Astra

1.1 – Sign up for a free DataStax Astra account if you do not have one already

1.2 – Hit the Create Database button on the dashboard

1.3 – Hit the Get Started button on the dashboard

This will be a pay-as-you-go method, but they won’t ask for a payment method until you exceed $25 worth of operations on your account. We won’t be using nearly that amount, so it’s essentially a free Cassandra database in the cloud.

1.4 – Define your database

  • Database name: whatever you want
  • Keyspace name: whatever you want
  • Cloud: whichever GCP region applies to you.
  • Hit Create Database and wait a couple of minutes for it to spin up and become active.

1.5 – Generate application token

  • Once your database is active, connect to it.
  • Once on dashboard/<your-db-name>, click the Settings menu tab.
  • Select Admin User for role and hit generate token.
  • COPY DOWN YOUR CLIENT ID AND CLIENT SECRET, as they will be used by Spark.

1.6 – Download Secure Bundle

  • Hit the Connect tab in the menu
  • Click on Node.js (it doesn’t matter which option you choose under Connect using a driver)
  • Download Secure Bundle
  • Drag-and-Drop the Secure Bundle into the running Gitpod container.

1.7 – Copy and paste the contents of setup.cql into the CQLSH terminal
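
If you want to sanity-check the bundle and token before wiring up Spark, a quick connection test with the Python cassandra-driver looks roughly like this (the bundle path and credentials are placeholders for your own):

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Point at the secure bundle you dropped into Gitpod, plus the token credentials.
cluster = Cluster(
    cloud={"secure_connect_bundle": "/workspace/secure-connect-<your-db>.zip"},
    auth_provider=PlainTextAuthProvider("<client-id>", "<client-secret>"),
)
session = cluster.connect()
# A trivial query: prints the Cassandra version Astra is running.
print(session.execute("SELECT release_version FROM system.local").one())
cluster.shutdown()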

2. Set up Airflow

We will be using the quick start script that Airflow provides here.

bash setup.sh

3. Start Spark in standalone mode

3.1 – Start master

./spark-3.0.1-bin-hadoop2.7/sbin/start-master.sh

3.2 – Start worker

Open port 8081 in the browser, copy the master URL, and paste it into the designated spot below.

./spark-3.0.1-bin-hadoop2.7/sbin/start-slave.sh <master-URL>

4. Move spark_dag.py to ~/airflow/dags

4.1 – Create ~/airflow/dags

mkdir ~/airflow/dags

4.2 – Move spark_dag.py

mv spark_dag.py ~/airflow/dags

5. Update the TODOs in properties.conf with your specific parameters

vim properties.conf

6. Open port 8080 to see the Airflow UI and check that example_cassandra_etl exists.

If it does not exist yet, give it a few seconds to refresh.

7. Update the Spark connection, unpause example_cassandra_etl, and drill down by clicking on example_cassandra_etl.

7.1 – Under the Admin section of the menu, select Connections, click spark_default, and update the host to the Spark master URL. Save once done.

7.2 – Select the DAGs menu item and return to the dashboard. Unpause example_cassandra_etl, and then click on the example_cassandra_etl link.

8. Trigger the DAG from the tree view and click on the graph view afterwards

9. Confirm data in Astra

9.1 – Check previous_employees_by_job_title

select * from <your-keyspace>.previous_employees_by_job_title where job_title='Dentist';

9.2 – Check days_worked_by_previous_employees_by_job_title

select * from <your-keyspace>.days_worked_by_previous_employees_by_job_title where job_title='Dentist';

And that will wrap up our walkthrough. Again, this is an introduction to setting up a basic Cassandra ETL process run by Airflow and Spark. As mentioned above, these baby steps can be expanded into more complex and scheduled/repeated Cassandra ETL processes run by Airflow and Spark. The live recording of Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark is embedded below, so if you want to watch the walkthrough live, be sure to check it out!

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
