Apache Cassandra Lunch #94: StreamSets and Cassandra

5/31/2022

This resource is based on an article originally published here.

In Cassandra Lunch #94, we discuss how to connect StreamSets and Cassandra! The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. Subscribe to our YouTube Channel to keep up to date and watch Cassandra Lunches live at 12 PM EST on Thursdays!

Be sure to watch the demo in the live recording of Cassandra Lunch embedded below! Additionally, check out our Data Engineer’s Lunch, where we covered how to use StreamSets for data engineering, including a walkthrough of the Transformer Engine with Spark. In the walkthrough embedded below, we focus only on the Data Collector and the Cassandra use case.

StreamSets is a data integration platform built for DataOps. With StreamSets, you can build streaming, batch, CDC, ETL, and ML pipelines from a single UI and deploy data and workloads to any cloud. The DataOps Platform has a free tier (no credit card required) that includes the Data Collector Engine, Transformer Engine, and Control Hub. As shown in the demo below, it allows for self-managed deployments via Docker. The free tier allows 2 active jobs, 2 active users, and 10 published pipelines, so depending on your workload, the free tier alone may be enough for self-managed deployments.

Figure: StreamSets DataOps Platform UI

As mentioned above, the DataOps Platform includes the Control Hub, the Data Collector Engine, the Transformer Engine, and pre-built connectors and native integrations. The Data Collector Engine is open source; its code is available at https://github.com/streamsets/datacollector-oss. The Transformer Engine can execute natively on Apache Spark, Snowflake, AWS EMR, Google Cloud Dataproc, and Databricks. The pre-built connectors and native integrations connect to applications, big data systems, SQL/NoSQL databases, storage and warehouses, and streaming tools. Check out all the connectors here: https://streamsets.com/support/connectors/.

Figure: StreamSets Basic Architecture

To connect StreamSets and Cassandra, we need the Cassandra connector, which supports Apache Cassandra 1.2, 2.x, and 3.x. For additional security, you can enable DSE authentication (if you are using DSE), Kerberos, and SSL/TLS.

The Cassandra connector supports two batch write methods: logged and unlogged. Logged batches go through Cassandra's distributed batch log and are atomic, so either every statement in the batch is eventually applied or none is. Unlogged batches skip the batch log, so a failure can leave only part of the batch written to Cassandra. Keep this in mind when configuring your Cassandra connector, as the right write method depends on your use case.
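To make the difference between the two batch types concrete, here is a minimal sketch of the CQL each one corresponds to. It only builds the statement text and does not talk to Cassandra or StreamSets. The `build_batch` helper and the `demo.users` table are hypothetical names chosen for illustration; they are not part of the connector.

```python
from typing import Iterable


def build_batch(statements: Iterable[str], logged: bool = True) -> str:
    """Wrap CQL statements in a logged or unlogged BATCH block.

    Hypothetical helper for illustration only; it returns CQL text
    and does not execute anything.
    """
    # A plain BEGIN BATCH is logged by default; UNLOGGED skips the batch log.
    header = "BEGIN BATCH" if logged else "BEGIN UNLOGGED BATCH"
    body = "\n".join(f"  {s.strip().rstrip(';')};" for s in statements)
    return f"{header}\n{body}\nAPPLY BATCH;"


# Assumed example table: demo.users (id int PRIMARY KEY, name text)
inserts = [
    "INSERT INTO demo.users (id, name) VALUES (1, 'ada')",
    "INSERT INTO demo.users (id, name) VALUES (2, 'lin')",
]

logged_batch = build_batch(inserts, logged=True)
unlogged_batch = build_batch(inserts, logged=False)

print(logged_batch)
print()
print(unlogged_batch)
```

The only textual difference is the `UNLOGGED` keyword, but the trade-off is real: logged batches pay extra coordination cost for atomicity, while unlogged batches are cheaper and can be partially applied.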

As mentioned above, we have a demo of how we can connect StreamSets and Cassandra! Don’t forget to like and subscribe! In the demo, we go through the following items:

Spin up Data Collector Deployment from Control Hub + Docker
Get CSV data from GitHub
Make data transformations + arithmetic
Spin up Cassandra on Docker + connect in Control Hub
Write to Cassandra
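The demo builds these steps in the StreamSets UI, not in code. As a rough stand-in for what the pipeline does to each record, the sketch below parses CSV text, adds a computed field, and prepares parameterized insert statements. The sample data, the column names, and the `demo.orders` table are all assumptions made up for illustration; they are not the dataset or schema used in the demo.

```python
import csv
import io

# Hypothetical sample standing in for the CSV pulled from GitHub in the demo.
SAMPLE_CSV = """id,item,quantity,unit_price
1,widget,3,2.50
2,gadget,2,10.00
"""

# Assumed target table; %s placeholders are filled in by the driver at execution.
INSERT_CQL = (
    "INSERT INTO demo.orders (id, item, quantity, unit_price, total) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def transform(row):
    """Convert field types and add a computed 'total' (the arithmetic step)."""
    quantity = int(row["quantity"])
    unit_price = float(row["unit_price"])
    total = round(quantity * unit_price, 2)
    return (int(row["id"]), row["item"], quantity, unit_price, total)


def rows_to_inserts(csv_text):
    """Parse CSV text and pair each transformed row with the insert statement."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(INSERT_CQL, transform(row)) for row in reader]


statements = rows_to_inserts(SAMPLE_CSV)
for cql, params in statements:
    print(params)
```

Each `(cql, params)` pair is shaped so it could be passed to a driver call such as `session.execute(cql, params)` in the DataStax Python driver; the actual write is left out here so the sketch runs without a Cassandra instance.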

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
