
5/31/2022


Apache Cassandra Lunch #94: StreamSets and Cassandra - Business Platform Team


This resource is based on an article originally published here.

In Cassandra Lunch #94, we discuss how to connect StreamSets and Cassandra! The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. Subscribe to our YouTube Channel to keep up to date and watch Cassandra Lunches live at 12 PM EST on Thursdays!

Additionally, check out our Data Engineer’s Lunch, where we covered how to use StreamSets for data engineering, including a walkthrough of the Transformer Engine with Spark. In the walkthrough embedded below, we focus only on the Data Collector, with Cassandra as the use case.

StreamSets is a data integration platform built for DataOps. With StreamSets, you can build streaming, batch, CDC, ETL, and ML pipelines from a single UI and deploy data and workloads to any cloud. The DataOps Platform has a free tier (no credit card required) that includes the Data Collector Engine, the Transformer Engine, and Control Hub. As shown in the demo below, it allows for self-managed deployments via Docker. The free tier allows 2 active jobs, 2 active users, and 10 published pipelines, so depending on your workload, the free tier alone may be enough for a self-managed deployment.

StreamSets DataOps Platform UI

As mentioned above, the DataOps Platform gives us Control Hub, the Data Collector Engine, the Transformer Engine, and pre-built connectors and native integrations. The Data Collector Engine is open source and can be found here: https://github.com/streamsets/datacollector-oss. The Transformer Engine can execute natively on Apache Spark, Snowflake, AWS EMR, Google Cloud Dataproc, and Databricks. The pre-built connectors and native integrations connect to applications, big data systems, SQL/NoSQL databases, storage and warehouses, and streaming tools. Check out all the connectors here: https://streamsets.com/support/connectors/.

StreamSets Basic Architecture

To connect StreamSets and Cassandra, we need the Cassandra connector. The Cassandra connector supports Apache Cassandra 1.2, 2.x, and 3.x. For additional security, you can enable DSE authentication (if you are using DataStax Enterprise), Kerberos, and SSL/TLS.

The Cassandra connector supports two methods of batch writes: logged and unlogged. Logged batches go through Cassandra's distributed batch log and are atomic, meaning either every write in the batch is eventually applied or none is. Unlogged batches skip the batch log, so a failure can leave only part of a batch written to Cassandra. Keep that in mind when configuring your Cassandra connector, as the right write method depends on your use case.
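To make the difference concrete, here is a minimal Python sketch that builds the CQL text for each kind of batch. The `build_batch` helper and the `demo.readings` table are hypothetical and exist only for illustration; the connector generates its own statements, and this is not its implementation.

```python
def build_batch(statements, logged=True):
    """Wrap CQL statements in a logged or unlogged batch.

    Hypothetical helper for illustration; not part of StreamSets or any driver.
    """
    if not statements:
        raise ValueError("a batch needs at least one statement")
    # A plain BEGIN BATCH is logged (atomic); the UNLOGGED keyword skips the batch log.
    header = "BEGIN BATCH" if logged else "BEGIN UNLOGGED BATCH"
    body = "\n".join("  " + s.rstrip(";") + ";" for s in statements)
    return f"{header}\n{body}\nAPPLY BATCH;"


# demo.readings is an assumed table, not one from the demo.
inserts = [
    "INSERT INTO demo.readings (id, value) VALUES (1, 10.5)",
    "INSERT INTO demo.readings (id, value) VALUES (2, 11.0)",
]

logged_cql = build_batch(inserts, logged=True)
unlogged_cql = build_batch(inserts, logged=False)
print(logged_cql)
print(unlogged_cql)
```

Logged batches carry the extra cost of writing to the batch log, so they are usually reserved for writes that need atomicity across partitions.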

As mentioned above, we have a demo of how we can connect StreamSets and Cassandra! Don’t forget to like and subscribe! In the demo, we go through the following items:

  • Spin up Data Collector Deployment from Control Hub + Docker
  • Get CSV data from GitHub
  • Make data transformations + arithmetic
  • Spin Up Cassandra on Docker + Connect in Control Hub
  • Write to Cassandra
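
Outside of StreamSets, the data flow in those steps can be sketched in plain Python: parse CSV, derive a field with arithmetic, and produce parameterized INSERT statements. The sample rows, the column names, the `demo.orders` table, and both helper functions are assumptions made for illustration; the demo's actual dataset and pipeline stages are shown in the recording.

```python
import csv
import io

# Inline sample standing in for the CSV fetched from GitHub (assumed data).
SAMPLE_CSV = """id,item,quantity,unit_price
1,widget,3,2.50
2,gadget,2,10.00
"""


def transform(row):
    """Cast string fields to numbers and add a derived total (the arithmetic step)."""
    quantity = int(row["quantity"])
    unit_price = float(row["unit_price"])
    return {
        "id": int(row["id"]),
        "item": row["item"],
        "quantity": quantity,
        "unit_price": unit_price,
        "total": round(quantity * unit_price, 2),
    }


def to_insert(record, table="demo.orders"):
    """Return a CQL INSERT with %s placeholders plus its parameter tuple."""
    columns = list(record)
    placeholders = ", ".join(["%s"] * len(columns))
    cql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return cql, tuple(record[c] for c in columns)


records = [transform(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
statements = [to_insert(r) for r in records]

for cql, params in statements:
    print(cql, params)
```

The `%s` placeholder style is the one the DataStax Python driver accepts for simple statements, so each pair could be passed to a session's `execute` call against a Dockerized Cassandra, provided a matching table has been created first.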

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
