Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker

6/29/2022


Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker - Business Platform Team

This resource is based on an article originally published here.

In Apache Cassandra Lunch #75: Getting Started with DataStax Enterprise (DSE) on Docker, we discussed some of the applications that make up the DataStax ecosystem. In the process, we pulled Docker images of the applications we were interested in and worked with DSE Search, DSE Analytics with Spark, and DSE Graph on Docker Desktop, learning about these tools and their strengths along the way. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Getting Started with DataStax Enterprise (DSE) on Docker

In this article, we look at getting started with DataStax Enterprise on Docker. Recently, there has been a clear shift toward solving data problems with NoSQL databases, because some of these problems are very difficult to handle with relational databases.

One of the appealing features of NoSQL databases is the ability to scale when a heavy workload would otherwise push an application offline. Customers are not good at waiting a long time for a response, so your application must be available at all times, with no downtime.

Because Cassandra is known to be scalable and highly available, many companies have adopted Apache Cassandra as their application database. DataStax has done a great deal of work building transformational data architectures for applications, microservices, and experiences that require data sovereignty, availability, scale, agility, and accessibility by any user. It has built enterprise products on top of Apache Cassandra that make application deployment much easier.

One of these products is DataStax Enterprise (DSE). It is built on Apache Cassandra, which is well known for high availability and low latency, and it can handle massive amounts of data at planetary scale.

A diagram showing DSE Analytics, DSE Search and DSE Graph feature enabled in DSE

Several packages and capabilities have been introduced into the DataStax Enterprise ecosystem, and we are going to look at some of them. We will provision this software in Docker containers and work with DSE Search, DSE Analytics (Spark), and DSE Graph to demonstrate handling data workloads.

Diagram showing some of the packages that have been introduced into the DataStax ecosystem

DataStax Enterprise (DSE)

DSE offers different capabilities that you can leverage to handle different data problems. We are going to look at DataStax Studio and the DataStax Enterprise server, which comes with DSE Search, DSE Analytics with Spark, and DSE Graph.

DataStax Enterprise Search

DSE Search allows you to quickly find data and provide a modern search experience for your users, helping you create features like product catalogs, document repositories, ad-hoc reporting engines, and more. One of the restrictions of Apache Cassandra is that it is hard to query a table in ways it was not designed to support up front. Cassandra offers materialized views and secondary indexes as solutions, but they are not very flexible, because managing these views and indexes requires some tricks, and indexing data types like tuples and user-defined types is a lot of work. Let’s get straight to it.

Install DataStax Enterprise server and enable Search capability

Before you start installing the packages using Docker, you must download Docker Desktop from here.

After installing Docker Desktop, you can pull all the DataStax Docker images we will be working with from Docker Hub. Make sure Docker Desktop is running, then open PowerShell and type the following commands to pull the DataStax images.

$ docker pull datastax/ddac:5.1.17
$ docker pull datastax/dse-server:6.8.16
$ docker pull datastax/dse-opscenter:6.8.15
$ docker pull datastax/dse-studio:6.8.15

With all the images pulled, we can create and start a container from each of them using the commands below.

$ docker run -e DS_LICENSE=accept --name my-ddac -d datastax/ddac:5.1.17
$ docker run -e DS_LICENSE=accept -p 7080:7080 -p 7081:7081 --name datastax_server -d datastax/dse-server:6.8.16 -k -s -g
$ docker run -e DS_LICENSE=accept -p 8888:8888 --name my-opscenter -d datastax/dse-opscenter:6.8.15
$ docker run -e DS_LICENSE=accept --name my-studio -p 9091:9091 --link datastax_server -d datastax/dse-studio:6.8.15
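
Docker treats everything after the image name as arguments for the container itself, so Docker options such as --link have to appear before the image, while DSE flags like -k -s -g go after it. The Python sketch below assembles the argument lists for the server and Studio containers to make that ordering explicit. The helper name docker_run_args is hypothetical, and the image tags are assumed to match the versions pulled earlier; the script only prints the commands and does not start anything.

```python
import shlex


def docker_run_args(image, name, ports=(), link=None, container_args=()):
    """Return the argv list for a detached `docker run` of a DataStax image."""
    args = ["docker", "run", "-e", "DS_LICENSE=accept", "--name", name]
    for port in ports:
        # Publish the same port number on the host and in the container.
        args += ["-p", f"{port}:{port}"]
    if link:
        # Docker options such as --link must come before the image name.
        args += ["--link", link]
    args += ["-d", image]
    # Anything after the image is handed to the container's entrypoint.
    args += list(container_args)
    return args


server = docker_run_args(
    "datastax/dse-server:6.8.16",  # assumed tag, matching the pull step
    "datastax_server",
    ports=(7080, 7081),
    container_args=("-k", "-s", "-g"),
)
studio = docker_run_args(
    "datastax/dse-studio:6.8.15",  # assumed tag, matching the pull step
    "my-studio",
    ports=(9091,),
    link="datastax_server",
)

print(shlex.join(server))
print(shlex.join(studio))
```

The resulting lists could be passed to subprocess.run if you prefer to script the setup rather than type each command.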

Use the command below to get into the running containers.

$ docker exec -it <container name> /bin/bash

Use the command below inside the container to check the status of the node and see whether it is running.

$ dsetool status
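
A freshly started DSE container can take a little while before the node reports as up. As a minimal sketch, the Python helper below polls a check until it succeeds. It assumes the container is named datastax_server and that dsetool status exits with a non-zero code while the node is unreachable; node_is_up and wait_until are hypothetical names, not part of any DataStax tool.

```python
import subprocess
import time


def node_is_up(container="datastax_server"):
    """Run `dsetool status` inside the container; True if it exits with 0.

    Assumption: a non-zero exit code means the node is not reachable yet.
    """
    result = subprocess.run(
        ["docker", "exec", container, "dsetool", "status"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def wait_until(check, attempts=30, delay=10.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out.

    Returns the number of the attempt that succeeded, or None on failure.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Typical use once the container has been started:
#     wait_until(node_is_up)
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real delays.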

In my case, I am going to start by working with the DSE server. Notice that the -k -s -g flags at the end of the line below enable the Analytics with Spark, Search, and Graph capabilities.

$ docker run -e DS_LICENSE=accept -p 7080:7080 -p 7081:7081 --name datastax_server -d datastax/dse-server:6.8.16 -k -s -g
$ docker exec -it <container name> /bin/bash
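
For reference, each of the three flags switches on one workload when the DSE node starts: -k enables Analytics (Spark), -s enables Search, and -g enables Graph. The small Python sketch below maps workload names to flags; the function name dse_start_flags is made up for illustration.

```python
# Start-up flags understood by the DSE server, keyed by workload name.
WORKLOAD_FLAGS = {"analytics": "-k", "search": "-s", "graph": "-g"}


def dse_start_flags(workloads):
    """Return the DSE start-up flags for the requested workloads."""
    requested = set(workloads)
    unknown = requested - WORKLOAD_FLAGS.keys()
    if unknown:
        raise ValueError(f"unknown workloads: {sorted(unknown)}")
    # Keep a stable flag order regardless of the order of the input.
    return [flag for name, flag in WORKLOAD_FLAGS.items() if name in requested]


print(" ".join(dse_start_flags(["search", "graph", "analytics"])))  # -k -s -g
```

Leaving a workload out of the list simply drops its flag, which is how you would start a node with, say, Search only.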

Use the cqlsh command to get into the CQL shell, then use CQL to create the search index. By default, the index covers all columns in the table and is created on all the Search nodes in the datacenter.

$ cqlsh
cqlsh> CREATE SEARCH INDEX ON keyspace.table;
cqlsh> CREATE SEARCH INDEX ON voting_system.voters;
cqlsh> CREATE SEARCH INDEX ON voting_system.voters WITH COLUMNS column1, column2, column3, ...;
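
If you generate these statements from a script, a small builder keeps the two forms (all columns versus selected columns) consistent. The Python sketch below is one way to do it; create_search_index_cql is a hypothetical helper, the IF NOT EXISTS clause is added here so the statement can be re-run safely, and the column names in the example are taken from the voters table queried later in this article.

```python
import re

# Plain (unquoted) CQL identifiers: a letter followed by letters, digits or "_".
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def create_search_index_cql(keyspace, table, columns=()):
    """Build a CREATE SEARCH INDEX statement, optionally limited to columns."""
    for name in (keyspace, table, *columns):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain CQL identifier: {name!r}")
    statement = f"CREATE SEARCH INDEX IF NOT EXISTS ON {keyspace}.{table}"
    if columns:
        statement += " WITH COLUMNS " + ", ".join(columns)
    return statement + ";"


print(create_search_index_cql("voting_system", "voters"))
print(create_search_index_cql("voting_system", "voters", ["city", "state", "supporting"]))
```

The identifier check is deliberately strict: it rejects quoted or unusual names instead of trying to escape them.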

DataStax Enterprise with Analytics using Spark

DataStax Enterprise (DSE) integrates real-time and batch operational analytics capabilities with an enhanced version of Apache Spark. With DSE Analytics you can easily generate ad-hoc reports, target customers with personalization, and process real-time streams of data. The analytics toolset lets you write code once and then use it for both real-time and batch workloads.

Diagram showing the combination of DSE and Spark in a node

Use DSE Analytics to analyze huge databases. DSE Analytics includes integration with Apache Spark, the framework that supports our analytics applications. Spark is a distributed computation engine designed for big data and in-memory processing. It supports both interactive and batch analytics and, according to the Apache Spark project, can be up to 100 times faster than Hadoop while requiring 5 to 10 times less code. Spark is also efficient, scalable, and fault tolerant.

This is a diagram showing a two-datacenter Cassandra deployment with an external Spark cluster setup.

Some of the advantages of using DSE Analytics include the following:

No single point of failure
Spark Master management
Ability to perform analytics without ETL
No need to configure a separate file system such as S3 or Azure Blob Storage, because the DataStax Enterprise file system (DSEFS) is readily available
DSE Analytics Solo
Integrated security
AlwaysOn SQL

Use the commands below to get to the Spark shell.

$ docker exec -it <container name> /bin/bash
$ dse spark

Use the Spark Scala commands below to read and filter the Cassandra table. Note that show() only prints the rows and returns nothing, so it is called separately from the assignment.

scala> val table = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "voting_system", "table" -> "voters")).load()
scala> val voters_table = table.select("voterid", "city", "state", "supporting").where("solr_query='supporting:A'")
scala> voters_table.show()
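
The same read can be sketched in Python for readers who prefer PySpark (DSE also ships a dse pyspark shell). This is an illustrative counterpart to the Scala lines above, not part of the original demo: the helper names are hypothetical, the function expects an existing SparkSession to be passed in, and the predicate builder does no escaping of the search value.

```python
def solr_predicate(field, value):
    """Build the solr_query filter string for a single field:value term."""
    return f"solr_query='{field}:{value}'"


def voters_supporting(spark, candidate):
    """Return a DataFrame of voters whose `supporting` column matches candidate.

    `spark` is assumed to be a SparkSession connected to the DSE cluster.
    """
    table = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace="voting_system", table="voters")
        .load()
    )
    return table.select("voterid", "city", "state", "supporting").where(
        solr_predicate("supporting", candidate)
    )


print(solr_predicate("supporting", "A"))  # solr_query='supporting:A'
```

Inside a PySpark session, voters_supporting(spark, "A").show() would print the matching rows.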

DataStax Enterprise Graph

DSE Graph is a distributed graph database that is optimized for fast data storage and traversals. It is designed for zero downtime and for real-time analysis of complex, disparate, and related datasets. DSE Graph can scale to massive datasets and execute both transactional and analytical workloads (OLTP and OLAP). It incorporates all of the enterprise-class functionality found in DataStax Enterprise, including advanced security protection, built-in DSE Analytics, DSE Search functionality, visual management and monitoring, and development tools including DataStax Studio.

A diagram showing the DataStax Enterprise (DSE) Graph database setup.

DSE Graph is built on top of Apache TinkerPop, Apache Cassandra, Apache Solr, and Apache Spark. It uses Apache TinkerPop standards for data and traversal, Apache Cassandra for scalable storage and retrieval, Apache Solr for search and indexing capabilities, and Apache Spark for fast analytic traversal. All these components are integrated into DSE Graph to form a real-time graph database management system.

DSE Graph supports both transactional and analytic workloads, using two different engines. The analytic engine relies solely on Spark, which comes as part of the DSE product.

Get into the Gremlin console using the below command.

$ docker exec -it <container name> /bin/bash
$ dse gremlin-console

Check if a graph called voting_system exists in the system.

gremlin> system.graph("voting_system").exists()

Create a new graph using the below command.

gremlin> system.graph("graph_testing").create()

Get a list of graphs available in the system.

gremlin> system.graphs()

I find it very easy to work with graph and CQL commands in DataStax Studio, because the interactive notebook makes things a lot easier. We are going to link DataStax Studio with the DSE server and start working on the graph database from Studio. Open the command line interface and use the command below to link the Studio container to the DSE server container, then access Studio at http://localhost:9091/.

$ docker run -e DS_LICENSE=accept --name <my-studio-name> -p 9091:9091 --link <datastax_server_container_name> -d datastax/dse-studio:6.8.15
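
Before opening the browser, you may want to confirm that Studio is answering on the published port. The Python sketch below is one simple way to do that with the standard library; is_reachable is a hypothetical helper, and the URL is the one given above.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET to url succeeds, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error responses.
        return False


if __name__ == "__main__":
    studio_url = "http://localhost:9091/"
    state = "reachable" if is_reachable(studio_url) else "not reachable yet"
    print(f"DataStax Studio at {studio_url} is {state}")
```

A False result shortly after docker run usually just means the container is still starting, so it is worth retrying after a few seconds.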

With DSE Studio and the DSE server connected, we can now start defining our graph vertices, edges, and properties. To recap what we covered in this article: we looked into various packages and capabilities of DSE, created search indexes on our Cassandra table, queried and worked with the table in Spark, and then proceeded to create a graph database and connect DataStax Studio to the DSE server.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
