Apache Cassandra Lunch #43: DSBulk with sed and awk

6/10/2022

This resource is based on an article originally published here.

In Apache Cassandra Lunch #43: DSBulk with sed and awk, we discuss how we can use the DataStax Bulk Loader with sed and awk for Cassandra data operations. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

This post includes a walkthrough that shows how you can do ETL between different Cassandra instances using DSBulk (E + L) with sed and awk (T). The live recording embedded below contains a live demo too, so be sure to watch it!

Also, check out this blog for a series that outlines general approaches to data operations in business-critical environments that use Cassandra and must maintain high availability in an agile development and continuous delivery environment.

If you are not familiar with the DataStax Bulk Loader, it is an open-source tool that can be used to load and unload CSV or JSON data in and out of supported databases. The tool can be used with DataStax Astra, DSE, and open-source Apache Cassandra. More on the tool can be found here, and you can also check out the GitHub repository for additional information.

We can now move on to the walkthrough portion of the blog. First, you will need to go to this GitHub repository. You can follow the instructions in the repository or the instructions in this blog; they are the same. In this walkthrough, we will use dsbulk to unload data from a DataStax Astra instance, transform the data using awk and sed, and then load it into a Dockerized Apache Cassandra instance. You can do this walkthrough with any combination of 2 of the Cassandra distributions mentioned above. If you are working with deployed instances, you can use their contact points; more on that can be found here.

The scenario for this example is as follows: someone on our team wants us to take the previous_employees_by_title table and create a new table that stores the time worked in days instead of a start time and an end time. This can be done within the same instance, but to show more of what dsbulk can do, we opted to move data between DataStax Astra and a Dockerized instance of Apache Cassandra.

Prerequisites

DataStax Astra
DataStax Bulk Loader (make sure to add to PATH)
Docker
GNU awk
GNU sed
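
Before starting, it can help to confirm that the command-line tools from the list above are actually on your PATH. The following is a minimal sketch; the helper name missing_tools is ours and is not part of the repository.

```shell
#!/usr/bin/env bash
# Print the name of every given tool that cannot be found on PATH.
# (missing_tools is a hypothetical helper, not part of the walkthrough repo.)
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools this walkthrough calls from the command line.
missing=$(missing_tools dsbulk docker gawk sed)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  echo "All prerequisites found."
fi
```

If dsbulk shows up as missing, the most likely cause is that its bin directory has not been added to PATH yet.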

1. Clone this repo and `cd` into it

git clone https://github.com/Anant/example-cassandra-dsbulk-with-sed-and-awk.git

cd example-cassandra-dsbulk-with-sed-and-awk

2. Set up DataStax Astra

2.1 – Sign up for a free DataStax Astra account if you do not have one already

2.2 – Hit the `Create Database` button on the dashboard

2.3 – Hit the `Get Started` button on the dashboard

This will be a pay-as-you-go method, but they won’t ask for a payment method until you exceed $25 worth of operations on your account. We won’t be using nearly that amount, so it’s essentially a free Cassandra database in the cloud.

2.4 – Define your database

Database name: whatever you want
Keyspace name: test
Cloud: whichever GCP region applies to you.
Hit create database and wait a couple minutes for it to spin up and become active.

2.5 – Generate application token

Once your database is active, connect to it.
Once on dashboard/<your-db-name>, click the Settings menu tab.
Select Admin User for role and hit generate token.
COPY DOWN YOUR CLIENT ID AND CLIENT SECRET as they will be used by dsbulk

2.6 – Download `Secure Bundle`

Hit the Connect tab in the menu
Click on Node.js (doesn’t matter which option under Connect using a driver)
Download Secure Bundle
Move Secure Bundle into the cloned directory.

2.7 – Load CSV data

Hit the Upload Data button
Drag-n-drop the previous_employees_by_title.csv file into the section.
Once uploaded successfully, hit the Next button
Make sure the table name is called previous_employees_by_title
Change employee_id from text to uuid
Change first_day and last_day to timestamp
Select job_title as the Partition Key
Select employee_id as the Clustering Column
Hit the Next button
Select test as the target keyspace
Hit the Next button to begin loading the csv.
If the upload fails, just try it again and it should work by the 2nd try.

3. Set up Dockerized Apache Cassandra

3.1 – Start the Docker container, expose port 9042, and mount this directory at the root of the container.

docker run --name cassandra -p 9042:9042 -d -v "$(pwd)":/example-cassandra-dsbulk-with-sed-and-awk cassandra:latest
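
Cassandra usually needs some time after the container starts before it accepts CQL connections, so running cqlsh right away can fail with a connection error. A minimal sketch of a retry helper follows; the name wait_for, the number of attempts, and the delay are our own choices, not something the repository provides.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds, up to max_tries attempts,
# sleeping delay seconds between attempts. (Hypothetical helper.)
wait_for() {
  local max_tries="$1" delay="$2"
  shift 2
  local i=1
  while ! "$@" >/dev/null 2>&1; do
    [ "$i" -ge "$max_tries" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
  return 0
}

# Usage, assuming the container name "cassandra" from the docker run command:
# wait_for 30 5 docker exec cassandra cqlsh -e 'DESCRIBE KEYSPACES' && echo "Cassandra is ready"
```

The usage line is left commented out so that nothing touches Docker until you decide to run it.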

3.2 – Create Destination Table

docker exec -it cassandra cqlsh

source '/example-cassandra-dsbulk-with-sed-and-awk/days_worked_by_previous_employees_by_job_title.cql'
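
The contents of days_worked_by_previous_employees_by_job_title.cql are not reproduced in this post. As a rough sketch of what a destination table for this scenario could look like, the snippet below writes an illustrative CQL file. The column names come from the walkthrough, but the column types, the replication settings, the key layout (copied from the source table), and the file name destination_table_sketch.cql are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical path; the walkthrough itself uses the .cql file shipped in the repo.
cql_file="${TMPDIR:-/tmp}/destination_table_sketch.cql"

cat > "$cql_file" <<'CQL'
-- Assumed schema: column types and replication settings are guesses for illustration.
CREATE KEYSPACE IF NOT EXISTS test
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS test.days_worked_by_previous_employees_by_job_title (
  job_title text,
  employee_id uuid,
  employee_name text,
  number_of_days_worked int,
  PRIMARY KEY ((job_title), employee_id)
);
CQL

echo "Wrote $cql_file"
# To apply it to the container started above (left commented on purpose):
# docker exec -i cassandra cqlsh < "$cql_file"
```

Whatever the real file contains, the column names in the table need to match the header of the CSV produced later, because dsbulk maps CSV columns to table columns by name when a header is present.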

4. Extract, Transform, and Load from DataStax Astra to Dockerized Apache Cassandra with dsbulk, sed, and awk.

We will cover 2 methods for loading data into Dockerized Apache Cassandra after unloading it from DataStax Astra. The first method splits the unloading and loading into 2 separate steps, with a CSV file in between. The second method does everything in one pipe.

4.1 – Unload to CSV file after extracting and transforming.

In this command, we unload from DataStax Astra, run an awk script, apply a sed transformation, and write the output to a CSV file. The awk script cleans up the timestamp format of the first_day and last_day columns so that the mktime() function can convert each one to a number of seconds. The script then prints the first 3 columns followed by the calculated duration of time worked, in days, for each row. It also includes a conditional that makes the result positive whenever it comes out negative. Negative values occur because we used a CSV generator that produced random datetimes for those 2 columns across ~100,000 rows, so some rows have a last_day earlier than their first_day. Finally, we use sed to replace the header that awk spits out with the column names we want for the destination table, and write the output to a new CSV called days_worked_by_previous_employees_by_job_title.csv.

NOTE: Input your specific variables for the placeholders in the command

dsbulk unload -k test -t previous_employees_by_title -b "secure-connect-<db>.zip" -u <Client ID> -p <Client Secret> | gawk -F, -f duration_calc.awk | sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/' > days_worked_by_previous_employees_by_job_title.csv
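
The duration_calc.awk script itself lives in the repository and is not shown here. To make the transformation concrete, below is a minimal sketch of the logic described above, run against two made-up rows instead of a real unload. It assumes that the unloaded CSV has the header job_title,employee_id,employee_name,first_day,last_day, that timestamps look like 2020-01-01T00:00:00Z with no fractional seconds, and that no field contains a comma. The awk program is a hypothetical stand-in, not the repository's script.

```shell
#!/usr/bin/env bash
# Made-up sample rows in the shape dsbulk unload is assumed to print.
sample_csv='job_title,employee_id,employee_name,first_day,last_day
Engineer,11111111-1111-1111-1111-111111111111,Ada Example,2020-01-01T00:00:00Z,2020-01-11T00:00:00Z
Analyst,22222222-2222-2222-2222-222222222222,Bob Example,2021-03-10T12:00:00Z,2021-03-01T12:00:00Z'

# Prefer gawk (the walkthrough's choice); fall back to any awk that provides mktime().
AWK=$(command -v gawk || command -v awk)

# TZ=UTC keeps mktime() from applying local daylight-saving shifts.
transformed=$(printf '%s\n' "$sample_csv" | TZ=UTC "$AWK" -F, '
  # Hypothetical stand-in for duration_calc.awk.
  function to_epoch(ts) {
    sub(/Z$/, "", ts)        # drop the trailing UTC marker
    gsub(/[-:T]/, " ", ts)   # turn ISO-8601 separators into the spaces mktime() expects
    return mktime(ts)        # seconds since the epoch, or -1 if unparseable
  }
  BEGIN { OFS = "," }
  {
    seconds = to_epoch($5) - to_epoch($4)
    if (seconds < 0) seconds = -seconds   # generated data can end before it starts
    print $1, $2, $3, int(seconds / 86400)
  }
' | sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/')

printf '%s\n' "$transformed"
```

On the header line, neither first_day nor last_day parses as a date, so both calls return the same value and the computed duration is 0. That is why the sed expression looks for a header ending in ,0. The two sample rows come out as 10 and 9 days respectively, the second one after the sign flip.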

4.2 – Load via CSV file to Dockerized Apache Cassandra Instance

dsbulk load -url days_worked_by_previous_employees_by_job_title.csv -k test -t days_worked_by_previous_employees_by_job_title

4.3 – Confirm via CQLSH

select count(*) from test.days_worked_by_previous_employees_by_job_title ;
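
To judge whether the count returned by CQLSH is plausible, you can compare it with the number of data rows in the CSV written in step 4.1. A small sketch follows; the helper name csv_data_rows is ours.

```shell
#!/usr/bin/env bash
# Count the data rows in a CSV file, excluding the header line.
# (csv_data_rows is a hypothetical helper; it counts lines, so it assumes
# no field contains an embedded newline.)
csv_data_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

csv="days_worked_by_previous_employees_by_job_title.csv"
if [ -f "$csv" ]; then
  echo "Data rows in $csv: $(csv_data_rows "$csv")"
fi
```

The two numbers can legitimately differ: rows that share the same primary key overwrite each other in Cassandra, so the table count may be lower than the CSV row count.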

4.4 – Truncate table

truncate table test.days_worked_by_previous_employees_by_job_title ;

4.5 – Do everything in one command

dsbulk unload -k test -t previous_employees_by_title -b "secure-connect-<db>.zip" -u <Client ID> -p <Client Secret> | gawk -F, -f duration_calc.awk | sed 's/job_title,employee_id,employee_name,0/job_title,employee_id,employee_name,number_of_days_worked/' | dsbulk load -k test -t days_worked_by_previous_employees_by_job_title

4.6 – Confirm via CQLSH

select count(*) from test.days_worked_by_previous_employees_by_job_title ;
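
One thing to keep in mind with the single-pipe approach: by default, the shell reports only the exit status of the last command in a pipeline, so a failure in the unload or awk stage can go unnoticed as long as the final dsbulk load exits cleanly. In bash, the pipefail option changes that. The sketch below demonstrates the difference with stand-in commands (false and cat) rather than dsbulk itself.

```shell
#!/usr/bin/env bash
# Without pipefail: the pipeline's status is that of its last command (cat).
default_status=0
false | cat || default_status=$?

# With pipefail: any failing stage makes the whole pipeline fail.
set -o pipefail
pipefail_status=0
false | cat || pipefail_status=$?
set +o pipefail

echo "default: $default_status, pipefail: $pipefail_status"
```

If you put the one-command pipeline in a script, adding set -o pipefail near the top is one way to have the script's exit status reflect a failure in any stage.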

And that wraps up how we can quickly do some Cassandra data operations using dsbulk with sed and awk. Again, if you want to watch this walkthrough live, we have a demonstration in the live recording of Apache Cassandra Lunch #43: DSBulk with sed and awk embedded below.

If you missed last week’s Apache Cassandra Lunch #42: SSTable Files with SSTableloader, be sure to check that out as well! If you want to attend Cassandra Lunch live every Wednesday at 12 PM EST, then you can register here now! Additionally, the playlist with all the previously recorded Cassandra Lunches is available here.

Additional Resources

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!

