Build a Website Scraper with Astra DB + Python Examples

12/2/2023

This resource is based on an article originally published here.

Learn how to scrape websites with Astra DB, Python, Selenium, Requests HTML, Celery, & FastAPI.

180 minutes • Advanced

Updated October 19, 2021

You can follow along with CodingEntrepreneurs' video tutorial, linked under "How this works".

Quick Start

Sign up for DataStax Astra, or log in to your existing account.
Create an Astra DB database if you don't already have one.
Create an Astra DB keyspace called sag_python_scraper in your database.
Generate an Application Token with the role of Database Administrator for the Organization that your Astra DB database is in.
Get your secure connect bundle from the connect page of your database and save it to your project folder. Rename it to bundle.zip.
Set up your system: below is a preflight checklist to ensure your system is fully set up to work with this course. All guides can be found in the setup directory of the course repo.
- Install Selenium & Chromedriver (setup guide)
- Install Redis (setup guide)
- Create a virtual environment & install dependencies
- Set up an account with DataStax
- Create your first Astra DB database and get API credentials
- Use cassandra-driver to verify your connection to Astra DB
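
The last checklist item could look roughly like the sketch below. It assumes the secure connect bundle was saved as bundle.zip (as in the steps above) and that the application token is exported in an environment variable named ASTRA_DB_APPLICATION_TOKEN. That variable name and both function names are assumptions made for illustration; they are not taken from the course code.

```python
import os


def build_cloud_config(bundle_path="bundle.zip"):
    """Return the `cloud` argument cassandra-driver expects for Astra DB."""
    return {"secure_connect_bundle": bundle_path}


def verify_connection(bundle_path="bundle.zip", token=None):
    """Connect to Astra DB and return the Cassandra release version string."""
    # Imported lazily so build_cloud_config() works without the driver installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    # Hypothetical variable name; keep the token wherever suits your project.
    token = token or os.environ["ASTRA_DB_APPLICATION_TOKEN"]
    # With an application token, the username is the literal string "token".
    auth_provider = PlainTextAuthProvider("token", token)
    cluster = Cluster(
        cloud=build_cloud_config(bundle_path),
        auth_provider=auth_provider,
    )
    try:
        session = cluster.connect()
        row = session.execute("SELECT release_version FROM system.local").one()
        return row.release_version
    finally:
        # Always release driver resources, even if the query fails.
        cluster.shutdown()
```

Calling verify_connection() from a Python shell in the project folder should return a version string when the bundle and token are valid, and raise an exception otherwise.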

How this works

Follow along in this video tutorial: https://youtu.be/NyDT3KkscSk.

This series is broken up into 4 parts:

Scraping: how to scrape and parse data from nearly any website with Selenium & Requests HTML.
Data models: how to store and validate data with cassandra-driver, pydantic, and Astra DB.
Worker & scheduling: how to schedule periodic tasks (such as scraping) integrated with Redis & Astra DB.
Presentation: how to combine the above steps into a robust web application service.
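
To make the worker and scheduling part more concrete, here is a minimal sketch of a Celery app with a periodic task, using Redis as the broker. The task name worker.scrape_url, the five-minute interval, the placeholder URL, and the local Redis address are all assumptions for illustration; the video series may structure its worker differently.

```python
def build_beat_schedule(interval_seconds=300.0):
    """Return a Celery beat schedule that runs one scraping task periodically."""
    return {
        "scrape-periodically": {
            "task": "worker.scrape_url",          # hypothetical task name
            "schedule": float(interval_seconds),  # seconds between runs
            "args": ("https://example.com",),     # placeholder URL
        }
    }


def make_celery_app(redis_url="redis://localhost:6379/0"):
    """Create a Celery app that uses Redis as broker and result backend."""
    # Imported lazily so the schedule helper works without Celery installed.
    from celery import Celery

    app = Celery("worker", broker=redis_url, backend=redis_url)
    app.conf.beat_schedule = build_beat_schedule()

    @app.task(name="worker.scrape_url")
    def scrape_url(url):
        # A real task would run the scraper here and store the result in Astra DB.
        return url

    return app
```

In a real project the app would typically be created once at module level (for example, app = make_celery_app() in a worker.py file) so that the Celery command line can find it.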

Here's what each tool is used for:

Python 3.9 (download) - programming the logic.
Astra DB (sign up) - a highly performant and scalable database service by DataStax. Astra DB is a managed NoSQL database built on Apache Cassandra. Cassandra is used by Netflix, Discord, Apple, and many others to handle astounding amounts of data.
Selenium (docs) - an automated web browsing experience that lets you:
- Run all web-browser actions through code
- Load JavaScript-heavy websites
- Perform standard user interactions like clicks, form submits, logins, etc.
Requests HTML (docs) - we're going to use this to parse an HTML document extracted from Selenium.
Celery (docs) - Celery provides worker processes that will allow us to schedule when we need to scrape websites. We'll be using Redis as the broker for our task queue.
FastAPI (docs) - a web application framework to display and monitor web scraping results from anywhere.
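
The hand-off between Selenium and Requests HTML described above can be sketched as follows. This is a minimal illustration, assuming Chrome and Chromedriver are installed as in the preflight checklist. The function names and the choice of Chrome flags are assumptions, and extracting the page title is only a stand-in for whatever data you actually want to parse.

```python
def chrome_arguments(headless=True):
    """Return command-line flags for Chrome; headless suits servers and workers."""
    args = ["--no-sandbox"]
    if headless:
        args.append("--headless")
    return args


def fetch_rendered_html(url):
    """Load a page in Chrome via Selenium and return the rendered HTML."""
    # Imported lazily so chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments():
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM after JavaScript has run.
        return driver.page_source
    finally:
        driver.quit()


def extract_title(html_str):
    """Parse an HTML string with Requests HTML and return the <title> text."""
    from requests_html import HTML

    doc = HTML(html=html_str)
    title = doc.find("title", first=True)
    return title.text if title is not None else None
```

A typical call chain would be extract_title(fetch_rendered_html(url)): Selenium handles rendering and interaction, and Requests HTML handles the parsing.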
