In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we discussed how we can do Cassandra ETL processes using Airflow and Spark. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we show you how you can set up a basic Cassandra ETL process with Airflow and Spark. If you want to hear why we used the bash operator vs the Spark submit operator like in Data Engineer’s Lunch #25: Airflow and Spark, be sure to check out the live recording of Cassandra Lunch #53 below!
In this walkthrough, we will cover how we can use Airflow to trigger Spark ETL jobs that move data into and within Cassandra. This demo will be relatively simple; however, it can be expanded upon by adding other technologies like Kafka, scheduling the Spark jobs so the process runs on a recurring basis, or in general building more complex Cassandra ETL pipelines. We will focus on showing you how to connect Airflow, Spark, and Cassandra, and in our case today, specifically DataStax Astra. We are using DataStax Astra so that everyone can follow along with this demo without having to worry about OS incompatibilities and the like. For that reason, we will also be using Gitpod, so the entire walkthrough can be done within your browser!
For this walkthrough, we will use two Spark jobs. The first Spark job loads 100k rows from a CSV file and writes them into a Cassandra table. The second Spark job reads the data from that Cassandra table, applies some transformations, and writes the transformed data into a different Cassandra table. We also used PySpark to reduce the number of steps needed to get this working; if we used Scala, we would have to build JARs, which would take more time. If you are interested in seeing how to use the Airflow Spark Submit Operator and run Scala Spark jobs, check out this walkthrough!
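To make the two jobs more concrete, below is a minimal PySpark sketch of both steps. This is not the exact code from the repo: the CSV path, the first_day/last_day column names, and the credential placeholders are assumptions for illustration, while the table names match the ones we query in step 9 and the connection settings are the standard spark-cassandra-connector options for the Astra secure connect bundle.

# Minimal PySpark sketch of the two ETL jobs (CSV path, column names, and credentials are placeholders)
from pyspark.sql import SparkSession
from pyspark.sql.functions import datediff, col

spark = (
    SparkSession.builder
    .appName("cassandra-etl")
    # Connect to Astra through the secure connect bundle (spark-cassandra-connector 2.5+)
    .config("spark.cassandra.connection.config.cloud.path", "secure-connect-bundle.zip")
    .config("spark.cassandra.auth.username", "<client-id>")
    .config("spark.cassandra.auth.password", "<client-secret>")
    .getOrCreate()
)

# Job 1: load the CSV and append it to the first Cassandra table
employees = spark.read.csv("previous_employees.csv", header=True, inferSchema=True)
(employees.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="previous_employees_by_job_title", keyspace="<your-keyspace>")
    .mode("append")
    .save())

# Job 2: read the table back, derive days worked, and write it to the second table
source = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="previous_employees_by_job_title", keyspace="<your-keyspace>")
    .load())
transformed = source.withColumn("days_worked", datediff(col("last_day"), col("first_day")))
(transformed.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="days_worked_by_previous_employees_by_job_title", keyspace="<your-keyspace>")
    .mode("append")
    .save())

When jobs like these are run with spark-submit, the Spark Cassandra Connector package (for example, --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0) has to be included so that the org.apache.spark.sql.cassandra format is available.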
You can also do the walkthrough using this GitHub repo! As mentioned above, the live recording is embedded below if you want to watch the walkthrough live.
1. Set up DataStax Astra
1.1 – Sign up for a free DataStax Astra account if you do not have one already
1.2 – Hit the Create Database button on the dashboard
1.3 – Hit the Get Started button on the dashboard
This will be a pay-as-you-go method, but they won’t ask for a payment method until you exceed $25 worth of operations on your account. We won’t be using nearly that amount, so it’s essentially a free Cassandra database in the cloud.
1.4 – Define your database
- Database name: whatever you want
- Keyspace name: whatever you want
- Cloud: whichever GCP region applies to you.
- Hit Create Database and wait a couple of minutes for it to spin up and become active.
1.5 – Generate application token
- Once your database is active, connect to it.
- Once on dashboard/<your-db-name>, click the Settings menu tab.
- Select Admin User for the role and hit Generate Token.
- COPY DOWN YOUR CLIENT ID AND CLIENT SECRET as they will be used by Spark.
1.6 – Download Secure Bundle
- Hit the Connect tab in the menu.
- Click on Node.js (it doesn't matter which option under Connect using a driver).
- Download the Secure Bundle.
- Drag-and-drop the Secure Bundle into the running Gitpod container.
1.7 – Copy and paste the contents of setup.cql into the CQLSH terminal
2. Set up Airflow
We will be using the quick start script that Airflow provides here.
bash setup.sh
3. Start Spark in standalone mode
3.1 – Start master
./spark-3.0.1-bin-hadoop2.7/sbin/start-master.sh
3.2 – Start worker
Open port 8081 in the browser, copy the master URL, and paste it into the designated spot below
./spark-3.0.1-bin-hadoop2.7/sbin/start-slave.sh <master-URL>
4. Move spark_dag.py to ~/airflow/dags
4.1 – Create ~/airflow/dags
mkdir ~/airflow/dags
4.2 – Move spark_dag.py
mv spark_dag.py ~/airflow/dags
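For reference, here is a minimal sketch of what a DAG like spark_dag.py can look like when the bash operator wraps spark-submit, which is the approach discussed in the live recording. This is not the exact file from the repo: Airflow 2.x import paths are assumed, and the master URL, script paths, and dates are placeholders (the real DAG pulls its parameters from properties.conf).

# Minimal sketch of an Airflow DAG that triggers the two PySpark jobs via the bash operator
# (Airflow 2.x import paths assumed; master URL and script paths are placeholders)
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_cassandra_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    # Job 1: load the CSV into previous_employees_by_job_title
    cassandra_load = BashOperator(
        task_id="cassandra_load",
        bash_command="spark-submit --master <master-URL> "
                     "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 "
                     "/path/to/load_job.py",
    )

    # Job 2: transform the data and write it to days_worked_by_previous_employees_by_job_title
    cassandra_transform = BashOperator(
        task_id="cassandra_transform",
        bash_command="spark-submit --master <master-URL> "
                     "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 "
                     "/path/to/transform_job.py",
    )

    # Make sure the load finishes before the transform starts
    cassandra_load >> cassandra_transform

In the walkthrough itself, the Spark master URL is tied to the spark_default connection that we update in step 7.1 rather than hard-coded, so treat the placeholder above purely as illustration.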
5. Update the TODOs in properties.conf with your specific parameters
vim properties.conf
6. Open port 8080 to see the Airflow UI and check if example_cassandra_etl exists. If it does not exist yet, give it a few seconds to refresh.
7. Update the Spark connection, unpause example_cassandra_etl, and drill down by clicking on example_cassandra_etl.
7.1 – Under the Admin section of the menu, open Connections, select spark_default, and update the host to the Spark master URL. Save once done.
7.2 – Select the DAG menu item and return to the dashboard. Unpause example_cassandra_etl, and then click on the example_cassandra_etl link.
8. Trigger the DAG from the tree view and click on the graph view afterwards
9. Confirm data in Astra
9.1 – Check previous_employees_by_job_title
select * from <your-keyspace>.previous_employees_by_job_title where job_title='Dentist';
9.2 – Check days_worked_by_previous_employees_by_job_title
select * from <your-keyspace>.days_worked_by_previous_employees_by_job_title where job_title='Dentist';
And that will wrap up our walkthrough. Again, this is an introduction to setting up a basic Cassandra ETL process run by Airflow and Spark. As mentioned above, these first steps can be expanded upon to create more complex, scheduled, and repeated Cassandra ETL processes run by Airflow and Spark. The live recording of Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark is embedded below, so if you want to watch the walkthrough live, be sure to check it out!
Cassandra.Link
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!