In Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra, we will discuss how you can use Apache Spark and Apache Cassandra to perform additional basic Machine Learning tasks. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #54, we will discuss how you can use Apache Spark and Apache Cassandra to perform additional basic Machine Learning tasks such as Naive Bayes classification and Random Forest classification. We discuss basic concepts of machine learning as a whole and the technologies used in the presentation, and we present a GitHub repository with a demo project that uses these technologies and can be run without any issues. The live recording embedded below also contains a live demo, so be sure to watch it!
Basic Concepts of Machine Learning
Machine Learning is “the study of computer algorithms that improve automatically through experience and by the use of data”. It is seen as a part of artificial intelligence. Machine Learning is used in a large number of applications across many different fields, e.g. medicine (drug discovery), image recognition, email filtering, and more. It relies on complicated mathematical concepts and systems (with a heavy focus on linear algebra), which makes it appear intimidating to many beginners. However, many open-source, easy-to-access tools exist for using machine learning in the real world. Some examples include:
- TensorFlow: https://www.tensorflow.org/
- Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html
Additionally, many well-constructed resources exist for learning machine learning online:
- Basic ML courses on websites like Coursera
- More advanced, theory-oriented looks at ML, e.g. the Caltech CS 156 lectures: https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
Machine Learning – Why Cassandra?
There exist a large number of reasons why Cassandra is a great database to use for machine learning applications. Below we list a couple of them:
- Speed/Reliability:
- Machine learning projects tend to involve processing very large amounts of data in order to properly train complicated models (millions of points of data are necessary for high accuracy in some applications)
- Cassandra is very good at dealing with large amounts of incoming data.
- Additionally, Cassandra is built upon a masterless architecture – there are no “master” nodes, and thus there is no single point of failure.
- Data in Cassandra is replicated, with copies stored on more than one node, which keeps the data available even when nodes drop from the cluster or shut down unexpectedly.
- Scalability/Upgradability:
- Cassandra can be scaled both horizontally and vertically:
- Horizontal scalability allows the addition of many weaker machines instead of upgrading a few machines to more powerful hardware, so performance can be scaled up at relatively little cost.
Machine Learning – Random Forest
In the following section, we briefly discuss the Random Forest, which is an ensemble learning method for classification, regression, and similar tasks, and involves building a set of decision trees with incoming data. The building block of a Random Forest is a concept known as a decision tree:
- Decision Tree Learning is a predictive modelling approach used in statistics, data mining and machine learning
- Given a set of data and labels for that data, we can build a decision tree that separates the data into classes by testing on various attributes of the data.
- The following diagram shows an example set of data off of which a decision tree is created based on selected attributes
- Each node represents an attribute we test on.
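As a rough illustration of what "testing on an attribute" means, the sketch below scores candidate threshold tests on a single numeric attribute using Gini impurity. Gini impurity is one common splitting criterion and is an assumption here; the post does not say which criterion the demo uses. The function names and the toy alcohol values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' test.

    Returns (None, impurity_of_all_labels) if no test improves on not splitting.
    """
    best = (None, gini(labels))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy data (made up): one attribute and one label per wine.
alcohol = [9.0, 9.5, 10.0, 12.0, 12.5, 13.0]
quality = ["low", "low", "low", "high", "high", "high"]
print(best_threshold(alcohol, quality))  # (10.0, 0.0)
```

A full decision tree repeats this search over every attribute, picks the best test for the current node, and then recurses on the two resulting subsets.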
A Random Forest, as the name may suggest, consists of a large group of decision trees built from different samples of the data. Suppose we have N pieces of data with M different features. We choose some number n of decision trees to build as part of the random forest.
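- For each decision tree, we build its training set by drawing a random sample of size N from the data with replacement, so some pieces of data appear more than once and others are left out.
- This is known as bootstrap sampling. Training each tree on a slightly different sample makes the trees less alike, which lowers the variance of the combined prediction.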
- Each decision tree is also restricted to a randomly selected subset of only m < M features of the data (many implementations draw this subset anew at each split rather than once per tree).
- This is known as Feature Randomness. It forces more variation amongst the trees, thus ultimately lowering correlation across trees and producing more diversification.
- A given piece of data is then run through/classified by all of the decision trees and whichever class is predicted the most is the result of the prediction for the Random Forest model.
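The three ingredients listed above (sampling with replacement, feature randomness, and majority voting) can be sketched in a few lines of plain Python. This is a minimal illustration and not the demo's code: the helper names are hypothetical, and the three "trees" are hand-written stand-ins rather than trees learned from data.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so duplicates are expected."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(num_features, m, rng):
    """Pick m distinct feature indices out of num_features (m < M in the text)."""
    return sorted(rng.sample(range(num_features), m))


def forest_predict(trees, row):
    """Ask every tree for a class and return the most frequent answer."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hand-written stand-in "trees"; a real forest would learn each one
# from its own bootstrap sample and feature subset.
trees = [
    lambda row: "good" if row["alcohol"] > 11.0 else "bad",
    lambda row: "good" if row["acidity"] < 0.5 else "bad",
    lambda row: "good" if row["sulphates"] > 0.6 else "bad",
]
wine = {"alcohol": 12.0, "acidity": 0.7, "sulphates": 0.8}
print(forest_predict(trees, wine))  # two of three trees vote "good"
```

Passing an explicit `random.Random` instance as `rng` keeps the sampling reproducible, which is convenient when comparing runs.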
The following diagram visually shows the process a single piece of data goes through in order to be classified by a Random Forest classifier:
Machine Learning – Naive Bayes Classifier
Naive Bayes classifiers are a family of simple probabilistic classification algorithms based on Bayes’ Theorem:
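For reference, the theorem in its usual form is shown below. The event names A and B are generic placeholders, not symbols taken from the original post.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```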
In a classification problem, however, we are trying to obtain the probability that some object is a member of a particular class given a large number of known values. If the input data can take many possible values, applying Bayes’ theorem directly is not very practical when the results are based on probability tables. We write out Bayes’ theorem in terms of the probability of some piece of data being part of a class Ck given some known variables x about the data:
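In the notation of the paragraph above, with C_k the class and x the vector of known variables, the theorem takes the following standard form:

```latex
p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)}
```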
If we have a complicated real-life problem in which x has many components, then we may not have a single data point for a given class Ck with exactly the values contained in x. Thus, we cannot properly estimate p(x | Ck) from a probability table for such a problem. To resolve this issue, we make the “naive” assumption of conditional independence: assume that all features in x are mutually independent, conditional on the category Ck:
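Writing x = (x_1, ..., x_n), where n as the number of features is notation assumed here, the independence assumption and its consequence for p(x | C_k) can be stated as:

```latex
p(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n, C_k) = p(x_i \mid C_k)
\quad\Longrightarrow\quad
p(x \mid C_k) = \prod_{i=1}^{n} p(x_i \mid C_k)
```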
Although different features of most data sets are not conditionally independent, it turns out that making this assumption and building a classifier based on it gives pretty good results on real-life data sets. After going through some math, we come to the following conclusion given this set of assumptions:
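The math in question is short. Replacing p(x | C_k) in Bayes' theorem with the product of the per-feature probabilities, and noting that p(x) does not depend on the class, gives the standard result below (n denotes the number of features, an assumed symbol). The second expression is the usual decision rule: predict the class with the largest score.

```latex
p(C_k \mid x_1, \dots, x_n) \;\propto\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),
\qquad
\hat{y} = \operatorname*{arg\,max}_{k} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)
```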
From here, we can obtain values of p(xi | Ck) from a probability table and build a classifier based on this equation, which is the Naive Bayes classifier.
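To make the probability-table idea concrete, here is a minimal sketch of a Naive Bayes classifier for categorical features, written in plain Python. It is an illustration under simplifying assumptions and not the demo's implementation: the function names and the toy weather data are hypothetical, and no smoothing is applied, so a value never seen with a class gets probability zero (practical implementations usually add smoothing to avoid this).

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Build the probability tables: class counts and per-feature value counts."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, row):
    """Return the class with the largest p(C_k) * product of p(x_i | C_k)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior p(C_k)
        for i, value in enumerate(row):
            score *= value_counts[(i, label)][value] / count  # p(x_i | C_k)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (made up): (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "hot")))   # no
print(predict(model, ("rainy", "mild")))  # yes
```

Each lookup in `value_counts` is one entry of a probability table for a single feature, which is why the classifier never needs a data point matching the whole of x.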
Technologies used in Demo
The following three technologies are the primary drivers of the demo project associated with this blog:
- Apache Cassandra
- Open-source distributed NoSQL database designed to handle large volumes of data across many servers.
- Cassandra clusters can be upgraded by either improving hardware on current nodes (vertical scalability) or adding more nodes (horizontal scalability).
- Horizontal scalability is part of why Cassandra is so powerful – cheap machines can be added to a cluster to improve its performance in a significant manner.
- Note: Demo runs code from DataStax, and they use the DSE (DataStax Enterprise) version of Cassandra (not open source).
- Spark is also run together with Cassandra in DSE.
- Apache Spark
- Open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
- In particular, the demo project uses the Spark Machine Learning Library (MLlib).
- Spark’s machine learning library
- Features many common ML algorithms, tools for modifying data and pipelines, loading/saving algorithms, various utilities, and more.
- Primary API used to be RDD (Resilient Distributed Datasets) based, but has switched over to being DataFrame based since Spark 2.0.
- Project Jupyter
- Jupyter is an open-source web application for creating and sharing documents containing code, equations, markdown text, and more.
- The GitHub repo demoed in this presentation contains a few different Jupyter notebooks, each with sections of Python code along with explanations/narration of the general idea of the code.
- Jupyter notebooks can be used with programming languages other than Python, including R, Julia, Scala, and many others.
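To give a feel for the DataFrame-based MLlib API mentioned in the Spark bullets above, the sketch below assembles feature columns into a vector and trains a Random Forest classifier inside a pipeline. It is a hedged outline, not the notebook's code: the function names, the 80/20 split, and the default of 20 trees are assumptions, and the label column is assumed to already be numeric.

```python
def build_random_forest_pipeline(feature_cols, label_col, num_trees=20):
    """Return an unfitted MLlib Pipeline: assemble features, then classify."""
    # Imported inside the function so this file can be loaded without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler

    # MLlib estimators expect all features packed into one vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    forest = RandomForestClassifier(
        labelCol=label_col, featuresCol="features", numTrees=num_trees
    )
    return Pipeline(stages=[assembler, forest])


def train_and_score(df, feature_cols, label_col, seed=42):
    """Fit on roughly 80% of the DataFrame and return accuracy on the rest."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    train_df, test_df = df.randomSplit([0.8, 0.2], seed=seed)
    model = build_random_forest_pipeline(feature_cols, label_col).fit(train_df)
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(model.transform(test_df))
```

Both functions need an active Spark session when called, for example the one created inside the PySpark Jupyter notebook.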
Demo Project
GitHub Repo link for the demo: https://github.com/HadesArchitect/CaSpark
The demo we go over in this week’s Cassandra Lunch was written by HadesArchitect at DataStax and contains a Docker compose file which will run three docker images:
- DSE – DataStax Enterprise version of Cassandra with analytics (Spark) enabled.
- Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook
- DataStax Studio
The GitHub repo contains many example Jupyter notebooks covering different machine learning tasks and models, but we will primarily look at Random Forest.ipynb and Naivebayes.ipynb. The notebook Random Forest.ipynb runs the Random Forest classifier to classify a wine into a score category based on its physical and chemical properties. The notebook Naivebayes.ipynb performs a similar classification task using the Naive Bayes classifier instead of Random Forest. In order to run the demo project, one change needs to be made to the docker-compose.yml file in the main folder of the project. The line which starts with “PYSPARK_SUBMIT_ARGS” needs to be replaced with the following:
PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
This change uses a more up-to-date version of the DataStax Spark Cassandra Connector and will allow PySpark in the example Jupyter Notebooks to run (successfully connect to Spark running in the DSE Cassandra docker container). Instructions for running the demo besides this change are included in the readme of the repository.
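For readers who want to see how the pieces of that setting fit together, the sketch below rebuilds the same string from its parts and shows how a notebook could then load a Cassandra table through the connector. The helper names are hypothetical, and any keyspace and table names passed in are placeholders; the `org.apache.spark.sql.cassandra` format string and the `keyspace`/`table` options belong to the Spark Cassandra Connector's DataFrame API.

```python
def pyspark_submit_args(connector_version="3.0.1", scala_version="2.12", host="dse"):
    """Build the PYSPARK_SUBMIT_ARGS value used in docker-compose.yml."""
    # Maven coordinates: group:artifact_scalaVersion:connectorVersion
    package = (
        "com.datastax.spark:"
        f"spark-cassandra-connector_{scala_version}:{connector_version}"
    )
    return (
        f"--packages {package} "
        f"--conf spark.cassandra.connection.host={host} pyspark-shell"
    )


def read_cassandra_table(spark, keyspace, table):
    """Load a Cassandra table as a Spark DataFrame through the connector.

    `spark` is an existing SparkSession started with the arguments above.
    """
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )


print(pyspark_submit_args())
```

The Scala suffix in the package name (2.12 here) has to match the Scala version of the Spark build in the notebook image, which is one reason the connector version matters.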
That will wrap up this blog on using Cassandra and Spark to perform additional basic Machine Learning tasks. As mentioned above, the live recording which includes a live walkthrough of the demo is embedded below, so be sure to check it out and subscribe. Also, if you missed last week’s Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, be sure to check that out as well!
Resources
- https://www.analyticsvidhya.com/blog/2020/12/lets-open-the-black-box-of-random-forests/
- https://blog.anant.us/apache-cassandra-lunch-50-machine-learning-with-spark-cassandra/
- https://towardsdatascience.com/understanding-random-forest-58381e0602d2
- https://stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a-random-forest-depend-on-the-number-of-pred/36183
- https://pandio.com/blog/five-ways-apache-cassandra-is-designed-to-support-machine-learning-use-cases/
- https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead-of-without-replacement
- https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- https://www.geeksforgeeks.org/naive-bayes-classifiers/
- http://users.sussex.ac.uk/~christ/crs/ml/lec02b.html
- https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/