In Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra, we will discuss how you can use Apache Spark and Apache Cassandra to perform basic Machine Learning tasks. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #50, we discuss how to use Apache Spark and Apache Cassandra for basic machine learning tasks such as the K-Means clustering algorithm. We cover basic concepts of machine learning as a whole, the technologies used in the presentation, and a GitHub repository with a demo project which uses these technologies and can be run without any issues. The live recording embedded below also contains a live demo, so be sure to watch it!
1. Basic Concepts of Machine Learning
1.1 What is Machine Learning?
Machine Learning is “the study of computer algorithms that improve automatically through experience and by the use of data”. It is seen as a part of artificial intelligence. Machine learning is used in a large number of applications across many different fields, e.g. medicine (drug discovery), image recognition, email filtering, and more. Machine learning relies on complicated mathematical concepts (with a heavy focus on linear algebra), which makes it appear intimidating to many beginners. However, many open-source and easy-to-access tools exist for working with machine learning in the real world. Some examples include:
- TensorFlow: https://www.tensorflow.org/
- Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html
Additionally, many well-constructed resources exist online for learning machine learning:
- Basic ML courses on websites like Coursera
- More advanced, theory-oriented looks at ML, e.g. the Caltech CS 156 lectures: https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
1.2 Neat Examples of Machine Learning
Below we list a couple of well-known examples of machine learning:
- Image Classification – training a system to classify pictures as belonging to some category
  - A popular example in many beginner machine learning courses – classifying an image as a picture of a dog or a picture of a cat
    - Well-known competition on Kaggle with a very large dataset to use if you are interested in trying: https://www.kaggle.com/c/dogs-vs-cats
  - Another popular example – determining which digit (0-9) is written in a picture
    - Popular dataset: the MNIST dataset, which includes 60,000 examples in a training set and 10,000 examples in a testing set: http://yann.lecun.com/exdb/mnist/
- Sentiment Analysis – attempting to determine whether a given text (e.g. a tweet from Twitter) is generally “positive” or “negative”
  - A well-known dataset is “sentiment140”, containing 1.6 million tweets labeled on a 0-4 scale (negative to positive): https://www.kaggle.com/kazanova/sentiment140
1.3 Machine Learning: 4 Stages of Processing Data
Generally, there are four stages to a machine learning task:
- Preparation of data
  - Converting raw data into a form that can be labeled and has its features/components separated out
- Splitting of data
  - One portion of your data is used for training your model, another for validating during training that the model is on the right track, and a final portion for testing the trained model
- Training your model
- Obtaining predictions using the trained model
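The four stages above can be sketched in plain Python. This is only a minimal illustration, not the demo's code: the raw rows, the 60/20/20 split, and the nearest-centroid "model" are all assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import random

# Hypothetical raw data in "likes,comments,label" form (made up for illustration).
RAW_ROWS = [
    "5,1,status", "8,2,status", "6,0,status", "7,3,status", "4,1,status", "9,2,status",
    "90,30,video", "85,25,video", "95,35,video", "88,28,video", "92,31,video", "80,27,video",
]

def prepare(rows):
    """Stage 1: turn raw strings into ((likes, comments), label) pairs."""
    data = []
    for row in rows:
        likes, comments, label = row.split(",")
        data.append(((float(likes), float(comments)), label))
    return data

def split(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Stage 2: shuffle, then cut into training, validation, and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def train(train_set):
    """Stage 3: fit a nearest-centroid model (one mean point per label)."""
    sums = {}
    for (x, y), label in train_set:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(model, features):
    """Stage 4: return the label whose centroid is closest to the features."""
    x, y = features
    return min(model, key=lambda lb: (model[lb][0] - x) ** 2 + (model[lb][1] - y) ** 2)

def accuracy(model, dataset):
    """Fraction of examples in dataset that the model labels correctly."""
    correct = sum(1 for features, label in dataset if predict(model, features) == label)
    return correct / len(dataset)

data = prepare(RAW_ROWS)
train_set, val_set, test_set = split(data)
model = train(train_set)
print("validation accuracy:", accuracy(model, val_set))
print("test accuracy:", accuracy(model, test_set))
```

The validation set is checked while you are still adjusting the model, and the test set is held back until the end so that the final score is measured on data the model has never influenced.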
1.4 Machine Learning: Supervised vs Unsupervised
Machine learning model training can be roughly divided into two categories, supervised and unsupervised learning (various combinations of the two exist as well):
- Supervised Learning: training a machine learning model with a dataset that has labels attached to it
  - E.g. image classification of dogs vs cats using labeled pictures of dogs and cats
- Unsupervised Learning: a machine learning model is given a dataset without explicit labels, and thus without explicit instructions on what exactly to look for
  - Used to find unknown trends or groupings in data
  - An example of what unsupervised learning can look for: patterns in images. See the picture below for some examples from Nvidia.
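To make the unsupervised case concrete, here is a minimal sketch of K-Means clustering (Lloyd's algorithm) in plain Python. It receives points with no labels and groups them purely by distance. The sample points and the choice of starting centroids are assumptions for illustration; a real project would more likely use a library implementation such as the one in Spark MLlib.

```python
import math

def kmeans(points, initial_centroids, max_iters=100):
    """Alternate between assigning points and updating centroids until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabeled points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

Note that nothing tells the algorithm what the groups mean. Interpreting a cluster (for example, as "videos") is left to the person reading the result.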
2. Technologies used in Demo
The following three technologies are the primary drivers of the demo project associated with this blog:
- Apache Cassandra
  - Open-source distributed NoSQL database designed to handle large volumes of data across many servers
  - Cassandra clusters can be upgraded by either improving hardware on current nodes (vertical scalability) or adding more nodes (horizontal scalability)
    - Horizontal scalability is part of why Cassandra is so powerful – cheap machines can be added to a cluster to significantly improve its performance
  - Note: the demo runs code from DataStax, which uses the DSE (DataStax Enterprise) version of Cassandra (not open source)
    - Spark also runs together with Cassandra in DSE
- Apache Spark
  - Open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
  - In particular, the demo project uses the Spark Machine Learning Library (MLlib)
    - Features many common ML algorithms, tools for modifying data and pipelines, loading/saving algorithms, various utilities, and more
    - The primary API used to be RDD (Resilient Distributed Dataset) based, but has been DataFrame-based since Spark 2.0
- Project Jupyter
  - Jupyter is an open-source web application for creating and sharing documents containing code, equations, markdown text, and more
  - The GitHub repo demoed in this presentation contains a few different Jupyter notebooks with sections of Python code along with explanations/narration of the general idea of the code
  - Jupyter notebooks can be used with programming languages other than Python, including R, Julia, Scala, and many others
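As a rough illustration of why adding nodes scales a Cassandra-style cluster smoothly, the sketch below places nodes on a hash ring and assigns each key to the next node around the ring. This is a simplified model and not Cassandra's actual implementation: it uses MD5 for convenience (Cassandra's default partitioner is Murmur3-based), gives each node a single token, and the `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    """A toy hash ring where each node owns the keys that hash just before it."""

    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def add_node(self, node):
        """Insert a new node at its token position, keeping the ring sorted."""
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key):
        """Return the first node at or after the key's token, wrapping around."""
        idx = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[idx % len(self._tokens)][1]

ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")  # horizontal scaling: add one more machine
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

The point of the design is that adding a node only relocates the keys the new node takes over, while every other key stays where it was, so the cluster can grow without reshuffling all of its data.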
3. Demo project
GitHub Repo link for the demo: https://github.com/HadesArchitect/CaSpark
The demo we go over in this week’s Cassandra Lunch was written by HadesArchitect at DataStax and contains a Docker Compose file which runs three Docker images:
- DSE – DataStax Enterprise version of Cassandra with analytics (Spark) enabled
- Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook
- DataStax Studio
The GitHub repo contains many example Jupyter notebooks covering different machine learning tasks/models, but we will primarily be looking at KMeans.ipynb. This notebook runs the K-Means clustering algorithm on some social media data to determine whether a post is a status or a video based on the number of likes and the number of comments it has.
In order to run the demo project, one change needs to be made to the docker-compose.yml file in the main folder of the project. The line which starts with “PYSPARK_SUBMIT_ARGS” needs to be replaced with the following:
PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
This change uses a more up-to-date version of the DataStax Spark Cassandra Connector and allows PySpark in the example Jupyter notebooks to connect successfully to Spark running in the DSE Docker container. Apart from this change, instructions for running the demo are included in the readme of the repository.
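For readers who want to see how that value is put together, the sketch below assembles the same string from its parts in Python. The helper name `build_submit_args` and its parameters are hypothetical, and setting the environment variable from Python rather than from docker-compose.yml is an alternative we assume works only if it happens before PySpark starts its JVM. The repository itself sets the value in the compose file.

```python
import os

def build_submit_args(connector_version="3.0.1", scala_version="2.12", host="dse"):
    """Assemble the PYSPARK_SUBMIT_ARGS value from its individual pieces."""
    # Maven coordinates of the DataStax Spark Cassandra Connector.
    package = (
        f"com.datastax.spark:spark-cassandra-connector_{scala_version}:{connector_version}"
    )
    # The host is the name of the Cassandra (DSE) service that Spark should contact.
    return (
        f"--packages {package} "
        f"--conf spark.cassandra.connection.host={host} "
        "pyspark-shell"
    )

# Assumption: this must run before any SparkSession or SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

The `_2.12` suffix is the Scala version the connector was built for, and `dse` is the host name used for the DSE container in the line above.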
That will wrap up this blog on using Cassandra and Spark for basic machine learning tasks. As mentioned above, the live recording which includes a live walkthrough of the demo is embedded below, so be sure to check it out and subscribe. Also, if you missed last week’s Apache Cassandra Lunch #49: Spark SQL for Cassandra Data Operations, be sure to check that out as well!
Resources
- https://www.tensorflow.org/
- http://spark.apache.org/docs/latest/ml-guide.html
- https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
- https://www.kaggle.com/c/dogs-vs-cats
- http://yann.lecun.com/exdb/mnist/
- https://www.kaggle.com/kazanova/sentiment140
- https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
- https://jupyter.org/
- https://github.com/HadesArchitect/CaSpark
- https://hub.docker.com/r/jupyter/pyspark-notebook