5/18/2023

Reading time:9

Cassandra with Python: Simple to Complex

This resource is based on an article originally published here.

Pt. 6 of a series on the KillrVideo Python project

Jeff Carpenter

·

Follow

7 min readMar 1, 2019

--

This is the latest installment of a series about building a Python application with Apache Cassandra — specifically a Python implementation of the KillrVideo microservice tier. In previous posts I shared what motivated this project, how I started with infrastructure including GRPC and Etcd, the testing approach, and most recently, how I began implementing data access using Cassandra.

In this post we’ll look at some additional examples of data access using the DataStax Python Driver, ranging from the simple to the complex. (I’ll be making reference some driver / mapper concepts from the previous post so you may want to review that if you haven’t already.)

Keeping it simple

In the last post I shared how easy it was to implement the data access for the User Management Service using Apache Cassandra and the DataStax Python Driver. Using the cqlengine mapper was a big factor in my productivity. Because of that productivity boost, when implementing the other services I started with the mapper and only used other approaches when the complexity of the data access required it.

In fact, a couple additional services were implemented entirely using the mapper: the Statistics Service and the Ratings Service. These demonstrate Cassandra features including counters and batches.

Simple example 1: Counters in the Statistics Service

The Statistics Service (statistics_service.py) stores counts of how many time each video has been viewed. Counting statistics is one of the relatively few use cases for which the Cassandra counter type is a good fit, because and therefore a good example of how to manipulate counters with the mapper. The model class we used to track statistics is shown here:

class VideoPlaybackStatsModel(Model):
"""Model class that maps to the video_playback_stats table"""
__table_name__ = 'video_playback_stats'
video_id = columns.UUID(primary_key=True, db_field='videoid')
    views = columns.Counter()

Note the use of the columns.Counter() type for the views column.

Incrementing the count of views in the record_playback_started() operation is very simple:

VideoPlaybackStatsModel(video_id=video_id).update(views=1)

Note that the value views=1 represents a single view of the video.

Simple example 2: Batch Writes in the Ratings Service

The Ratings Service (ratings_service.py) stores user ratings of videos and allows retrieval of the ratings either by user or by video. One interesting aspect of this service is the use of the mapper as part of a Cassandra batch when storing ratings in order to support writes to two denormalized tables supporting the “by user” and “by video” queries mentioned above. The code block below shows the data access code from the rate_video() operation:

now = datetime.utcnow()# create and execute batch statement to insert into multiple tables
batch_query = BatchQuery(timestamp=now)VideoRatingsByUserModel.batch(batch_query)\
    .create(video_id=video_id, user_id=user_id, rating=rating)# updating counter columns rating_counter and rating_total 
# values are interpreted as amount to increment
VideoRatingsModel(video_id=video_id).\
    update(rating_counter=1, rating_total=rating).batch(batch_query)batch_query.execute()

In this example, we use two different mapper classes to write to two different tables. Note the use of each mapper’s batch() operation to add a statement to the batch_query, which we can then execute. The first statement created is a CQLINSERT into the killrvideo.video_ratings_by_user via the VideoRatingsByUserModel.

If you’re looking closely may you have also noticed something about that second write — we’re using the VideoRatingsModel, which goes to the killrvideo.video_ratings table to do a CQLUPDATE. This table is another interesting use of counters. In this case, two counter columns are used. The rating_counter tracks the number of times a video has been rated, while the rating_total tracks the sum of all ratings for the video. The average rating can then be calculated by the client via simple division (rating_total/rating_counter).

When simple doesn’t cut it

There were a few other services that involved more complex data access where the mapper couldn’t fully address my needs. These included the Video Catalog Service, the Comments Service, and the Search Service.

Complex example 1: Paging in the Comments Service

Although it’s not “YouTube scale”, the KillrVideo application is designed to support a very large number of videos, users, ratings, and other data types, in order to demonstrate best practices for data modeling and driver usage found in real-world applications.

With this in mind, imagine a user being presented with screens of videos and comments in a web browser on a client device. It would be infeasible and unnecessary to return the entire video catalog or the entire comment history for a popular video to the client at once, so of course some form of paging is required. However, paging is a feature that isn’t supported by the cqlengine mapper provided with the DataStax Python Driver. So, we need to fall back to other methods.

Paging in the KillrVideo Comments Service is used by the web application

A good example of the approach is found in the Comments Service (comments_service.py). Since paging only comes into effect on reads, we can use the mapper for writes, and then use regular CQL statements on the reads to get control over paging.

As an example, to access comments by user, we first create prepared statements in the constructor for the class:

# Prepared statements for get_user_comments()
self.userComments_startingPointPrepared = \
    session.prepare('SELECT * FROM comments_by_user WHERE userid = ? AND (commentid) <= (?)')
self.userComments_noStartingPointPrepared = \
    session.prepare('SELECT * FROM comments_by_user WHERE userid = ?')

Then, in the get_user_comments() operation, we use one of the prepared statements to create a bound statement which we can then execute:

bound_statement = Noneif starting_comment_id:
    bound_statement = self.userComments_startingPointPrepared
        .bind([user_id, starting_comment_id])
else:
    bound_statement = self.userComments_noStartingPointPrepared
        .bind([user_id])bound_statement.fetch_size = page_size
result_set = Noneif paging_state:
# see below where we encode paging state to hex before returning
result_set = self.session.execute(bound_statement, 
        paging_state=paging_state.decode('hex'))
else:
    result_set = self.session.execute(bound_statement)

Note where the fetch_size of the batch statement is set to the page_size requested by the client. The code after this iterates over the rows in the result_set to build up a list of comments to return to the client. If page_size rows were returned, then the Cassandra paging state is extracted from the result set and returned to the client:

if len(results) == page_size:
# Use hex encoding since paging state is raw bytes that won't encode to UTF-8
next_page_state = result_set.paging_state.encode('hex')

This allows the client to pass back the paging state on a subsequent call to get_user_comments() to retrieve the next page. Note that we encode/decode the paging state to a hex string, which works well over the Protobuf message format we’re using on our service interfaces.

The Video Catalog Service follows a similar paging approach, with an interesting twist. To read about this and some other nuances of paging in KillrVideo, check out my previous post on this topic. Although that post uses examples from the Java implementation of the microservice tier, the basic algorithms described in that post are also used in the Python implementation.

Complex example 2: Full-text Search in the Search Service

The Search Service presents a different sort of problem. If you’re familiar with Cassandra data modeling practices, you’ll be aware that Cassandra doesn’t support arbitrary searches, and the secondary index implementation that comes with Cassandra is known to perform poorly over large data sets.

Instead, the best practice is to design a table per query, with primary keys based on the attributes you will specify in each query. However, your options when you don’t know the exact key you are looking for, with limited support for range queries and no support for text search features like prefix/postfix or fuzzy matching.

However, we do have a couple of cases where we need text search features in the Search Service, in both the get_query_suggestions() operation, which performs a typeahead search for common search terms, and the search_videos() operation, which is used to search for videos containing a search term in the title or description.

Following the pattern set in the other language implementations of the KillrVideo services, the Python implementation of the Search Service uses DataStax Enterprise Search. First, we need a search index on the videos table (extracted from the CQL for the search schema):

CREATE SEARCH INDEX IF NOT EXISTS on videos;

Then, we create a prepared statement in the constructor of the Search Service:

self.search_videos_prepared = \
    session.prepare('SELECT * FROM videos WHERE solr_query = ?')

Finally, in the search_videos() operation, we use the solr_query syntax supported by DSE Search to create a bound statement containing our desired CQL query:

solr_query = '{"q":"name:(' + query + ')^4 OR tags:(' + query + \
')^2 OR description:(' + query + ')", "paging":"driver"}'
bound_statement = self.search_videos_prepared.bind([solr_query])

We can then execute the bound statement and iterate over the results (not shown).

The solr_query defined above searches for the provided search term (query) in the name, tags, or description columns. The solr_query places additional weight on the name and description columns so that movies with the query appearing in the name column will appear the highest in search results, followed by those that have the search term in the description column.

And now for something completely different

In the past two posts I’ve taken you on a guided tour of the data access code for most of the KillrVideo Python services. The remaining service we haven’t discussed yet is the Suggested Videos Service. As you may have guessed, this involves building a recommender, which we’re doing using DataStax Enterprise Graph since that is built on top of Cassandra.

In upcoming posts I’ll share about this recommender, starting with how we shared data between services using Kafka in order to populate data into a graph used for recommendations.

Next post in this series: Kafka + Cassandra — like peanut butter and chocolate?

Jeff Carpenter

Data Model Meets World

Simplifying assumptions won’t fly in a real application

7 min read·Mar 18, 2017

--

Jeff Carpenter

Python App Dev with ProtoBuf and GRPC

Pt. 2 of a series on the KillrVideo Python project

8 min read·Jan 31, 2019

--

Jeff Carpenter

Data Model Meets World, Part VI: Design for Scalability

You’ve created an awesome data model, but can it handle success?

8 min read·Apr 24, 2017

--

Jeff Carpenter

Advertising Python services via etcd

Pt. 3 of a series on the KillrVideo Python project

6 min read·Feb 8, 2019

--

Alexander Nguyen
in
Level Up Coding

Why I Keep Failing Candidates During Google Interviews…

They don’t meet the bar.

·4 min read·Apr 13

--

Anthony Morast
in
Level Up Coding

Protocol Buffers in Python (with Kafka)

Protocol Buffers (or protobufs for short), developed by Google, are a mechanism for defining messages in the proto format which allow for…

·5 min read·Jan 16

--

Andrew Courter
in
Level Up Coding

Is Star Schema Dead?

Star schema has been used in data warehouses and data marts for decades now. With storage getting cheaper is it time to move on?

·3 min read·Jan 2

--

The PyCoach
in
Artificial Corner

You’re Using ChatGPT Wrong! Here’s How to Be Ahead of 99% of ChatGPT Users

Master ChatGPT by learning prompt engineering.

·7 min read·Mar 17

--

Dmytro Nikolaiev (Dimid)
in
Towards Data Science

Graphs with Python: Overview and Best Libraries

Graph analysis, interactive visualizations, and graph machine learning

·11 min read·Dec 2, 2022

--

Aleid ter Weel
in
Better Advice

10 Things To Do In The Evening Instead Of Watching Netflix

Device-free habits to increase your productivity and happiness.

·5 min read·Feb 15, 2022

--

Related Articles

rest

python

flask

GitHub - rohitsakala/CassandraRestfulAPI: CassandraRestfulAPI project exposes the cassandra data tables with the help of Restful API's. The project follows the standard Restful API rules. This project is developed as Major project of the Cloud Computing course by Team 15. The project is developed using Python Driver provided by Datastax using Flask framework. #IIITHyderabad #CloudComputing #CSE565 #Monsoon16 #SIEL #Cassandra #Flask #RestAPI

12/9/2023

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

cassandra

python

kafka

GitHub - princebhatt9588/Stock_Market_Real_Time_Data_Pipeline_Project_with-Apache-Kafka-and-Cassandra: This app utilizes Python, Apache Kafka, and Cassandra to fetch and process real-time stock market data, providing valuable insights for investors and traders.

12/2/2023

cassandra

python

Getting started with Apache Cassandra and Python

5/19/2023

cassandra

python

celery

celery.backends.cassandra — Celery 5.2.7 documentation

5/16/2023

etl

cassandra

Using Oracle GoldenGate for Big Data

12/9/2023

node

hybrid.cloud

datastax

GitHub - IBM/datastax-cassandra-clickstream: Use DataStax Enterprise built on Apache Cassandra as a clickstream database

12/8/2023

windows

cassandra

tutorial

Install Cassandra on Windows 10: Tutorial With Simple Steps

12/2/2023

cassandra

cassandra.repair

GitHub - Ericsson/ecchronos: Ericsson distributed repair scheduler for Apache Cassandra

12/2/2023

Explore Further

cassandra

rest

python

flask

GitHub - rohitsakala/CassandraRestfulAPI: CassandraRestfulAPI project exposes the cassandra data tables with the help of Restful API's. The project follows the standard Restful API rules. This project is developed as Major project of the Cloud Computing course by Team 15. The project is developed using Python Driver provided by Datastax using Flask framework. #IIITHyderabad #CloudComputing #CSE565 #Monsoon16 #SIEL #Cassandra #Flask #RestAPI

12/9/2023

etl

cassandra

Using Oracle GoldenGate for Big Data

12/9/2023

node

hybrid.cloud

datastax

GitHub - IBM/datastax-cassandra-clickstream: Use DataStax Enterprise built on Apache Cassandra as a clickstream database

12/8/2023

windows

cassandra

tutorial

Install Cassandra on Windows 10: Tutorial With Simple Steps

12/2/2023

python

rest

python

flask

GitHub - rohitsakala/CassandraRestfulAPI: CassandraRestfulAPI project exposes the cassandra data tables with the help of Restful API's. The project follows the standard Restful API rules. This project is developed as Major project of the Cloud Computing course by Team 15. The project is developed using Python Driver provided by Datastax using Flask framework. #IIITHyderabad #CloudComputing #CSE565 #Monsoon16 #SIEL #Cassandra #Flask #RestAPI

12/9/2023

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

cassandra

python

kafka

GitHub - princebhatt9588/Stock_Market_Real_Time_Data_Pipeline_Project_with-Apache-Kafka-and-Cassandra: This app utilizes Python, Apache Kafka, and Cassandra to fetch and process real-time stock market data, providing valuable insights for investors and traders.

12/2/2023

spark

python

cassandra

spark

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

12/2/2023

python

cassandra

spark

GitHub - airscholar/e2e-data-engineering: An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

12/2/2023

flink

beam

dataflow

• Google Dataflow - Awesome-Astra

5/10/2023

data.modeling

cassandra

spark

Dealing with Large Spark Partitions

2/17/2023

Pt. 6 of a series on the KillrVideo Python project

Keeping it simple

Simple example 1: Counters in the Statistics Service

Simple example 2: Batch Writes in the Ratings Service

When simple doesn’t cut it

Complex example 1: Paging in the Comments Service

Complex example 2: Full-text Search in the Search Service

And now for something completely different

Data Model Meets World

Simplifying assumptions won’t fly in a real application

Python App Dev with ProtoBuf and GRPC

Pt. 2 of a series on the KillrVideo Python project

Data Model Meets World, Part VI: Design for Scalability

You’ve created an awesome data model, but can it handle success?

Advertising Python services via etcd

Pt. 3 of a series on the KillrVideo Python project

Why I Keep Failing Candidates During Google Interviews…

They don’t meet the bar.

Protocol Buffers in Python (with Kafka)

Protocol Buffers (or protobufs for short), developed by Google, are a mechanism for defining messages in the proto format which allow for…

Is Star Schema Dead?

Star schema has been used in data warehouses and data marts for decades now. With storage getting cheaper is it time to move on?

You’re Using ChatGPT Wrong! Here’s How to Be Ahead of 99% of ChatGPT Users

Master ChatGPT by learning prompt engineering.

Graphs with Python: Overview and Best Libraries

Graph analysis, interactive visualizations, and graph machine learning

10 Things To Do In The Evening Instead Of Watching Netflix

Device-free habits to increase your productivity and happiness.

Become part of our

growing community!

Planet Cassandra is a service for the Apache Cassandra® user community to share with each other. From tutorials and guides, to discussions and updates, we're here to help you get the most out of Cassandra. Connect with us and become part of our growing community today.

© 2009-2023 The Apache Software Foundation under the terms of the Apache License 2.0. Apache, the Apache feather logo, Apache Cassandra, Cassandra, and the Cassandra logo, are either registered trademarks or trademarks of The Apache Software Foundation.

Get Involved with Planet Cassandra!

We believe that the power of the Planet Cassandra community lies in the contributions of its members. Do you have content, articles, videos, or use cases you want to share with the world?