Stratio’s Lucene-based index for Cassandra, now a plugin

7/29/2022

This resource is based on an article originally published here.

Thanks to the changes proposed in CASSANDRA-8717, CASSANDRA-7575 and CASSANDRA-6480, Stratio is glad to present its Lucene-based implementation of Cassandra secondary indexes as a plugin that can be attached to the Apache distribution. Before these changes, the Lucene index was distributed inside a fork of Apache Cassandra, with all the difficulties that maintaining a fork implies. The fork is now discontinued, and new users should use the recently created plugin, which keeps all the features of Stratio Cassandra.

Stratio’s Lucene index extends Cassandra’s functionality to provide near real-time distributed search engine capabilities similar to those of ElasticSearch or Solr, including full-text search, free multivariable search, relevance queries and field-based sorting. Each node indexes its own data, so high availability and scalability are preserved.

With the new distribution as a plugin, you can easily patch an existing installation of Cassandra with the following command:

mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_INSTALL_DIR>

Now, given an existing table with the following schema

CREATE TABLE tweets (
   id        bigint,
   createdAt timestamp,
   message   text,
   userid    bigint,
   username  text,
   PRIMARY KEY (userid, createdAt, id) );

you can create a Lucene index with this query:

ALTER TABLE tweets ADD lucene TEXT;
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds' : '60',
   'schema' : '{
      default_analyzer : "English",
      fields : {
         createdat : {type : "date", pattern : "yyyy-MM-dd"},
         message   : {type : "text", analyzer : "English"},
         userid    : {type : "string"},
         username  : {type : "string"}
      }
   }'
};
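
Because the schema is a JSON document nested inside a CQL string literal, the quoting is easy to get wrong when writing it by hand. The following Python sketch shows one way to render the same statement from a dictionary. The helper name `create_lucene_index` is hypothetical and not part of the plugin, and the sketch assumes that the plugin's parser accepts strictly quoted JSON keys as well as the unquoted keys used above, which the article does not confirm.

```python
import json


def create_lucene_index(name, table, column, refresh_seconds, schema):
    """Render a CREATE CUSTOM INDEX statement (hypothetical helper)."""
    options = {
        "refresh_seconds": str(refresh_seconds),
        # The schema travels as a JSON document inside a CQL string literal.
        "schema": json.dumps(schema),
    }
    # CQL escapes a single quote inside a string literal by doubling it.
    rendered = ",\n".join(
        "   '{}' : '{}'".format(key, value.replace("'", "''"))
        for key, value in options.items()
    )
    return (
        f"CREATE CUSTOM INDEX {name} ON {table} ({column})\n"
        "USING 'com.stratio.cassandra.lucene.Index'\n"
        f"WITH OPTIONS = {{\n{rendered}\n}};"
    )


# Same field mapping as the index definition in the article.
schema = {
    "default_analyzer": "English",
    "fields": {
        "createdat": {"type": "date", "pattern": "yyyy-MM-dd"},
        "message": {"type": "text", "analyzer": "English"},
        "userid": {"type": "string"},
        "username": {"type": "string"},
    },
}

statement = create_lucene_index("tweets_idx", "twitter.tweets", "lucene", 60, schema)
print(statement)
```

The printed statement can then be pasted into cqlsh or passed to a driver session. Keeping the schema as a dictionary until the last step means the JSON is always well formed.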

Once the index has been created, we can start issuing queries. For instance, to retrieve 10 tweets containing the word Cassandra in the message body:

SELECT * FROM tweets WHERE lucene = '{
   filter : {type  : "match",
             field : "message",
             value : "cassandra"}
}' LIMIT 10;
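
When such searches are generated by an application, it can help to build the JSON programmatically instead of concatenating strings. The sketch below is a minimal illustration in Python. The function `lucene_select` is a hypothetical helper, not part of the plugin, and the sketch assumes the plugin accepts standard JSON with quoted keys.

```python
import json


def lucene_select(table, search, limit):
    """Build a CQL SELECT whose `lucene` column carries a JSON search."""
    payload = json.dumps(search)
    # CQL escapes a single quote inside a string literal by doubling it.
    escaped = payload.replace("'", "''")
    return f"SELECT * FROM {table} WHERE lucene = '{escaped}' LIMIT {limit};"


# Same match filter as in the article: tweets whose message contains "cassandra".
match_filter = {
    "filter": {"type": "match", "field": "message", "value": "cassandra"}
}

statement = lucene_select("tweets", match_filter, 10)
print(statement)
```

The helper only produces the statement text. Executing it is left to whichever client is in use.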

We can also find the 5 best-matching tweets for the same search by using a query instead of a filter, which brings Lucene's relevance features into play:

SELECT * FROM tweets WHERE lucene = '{
   query : {type  : "match",
            field : "message",
            value : "cassandra"}
}' LIMIT 5;

Queries, filters and boolean predicates can be combined into more complex searches. For example, the following request returns the tweets inside a time range that starts at a given date and whose user name starts with ‘a’, contains ‘b’ or is exactly ‘fast’, sorted by descending creation time and then by ascending user name:

SELECT * FROM tweets WHERE lucene = '{
   filter : 
   {
    type : "boolean", must : 
    [
       {type : "range", field : "createdat", lower : "2014-04-25"},
       {type : "boolean", should : 
        [
           {type : "prefix", field : "username", value : "a"} ,
           {type : "wildcard", field : "username", value : "*b*"} ,
           {type : "match", field : "username", value : "fast"} 
        ]
       }
    ]
   },
   sort : 
   {
     fields: [ {field :"createdat", reverse : true},
               {field : "username", reverse : false} ] 
   }
}' LIMIT 10000;
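
Nested boolean searches like this one are easier to read and to change when they are assembled from small pieces. The Python sketch below shows one possible approach. All helper names (`match`, `prefix`, `wildcard`, `date_from`, `boolean`, `sort_by`) are hypothetical and only mirror the condition types used in the article. The user condition is applied to the `username` field, following the prose description, and the sketch assumes the plugin accepts JSON with quoted keys.

```python
import json


def match(field, value):
    return {"type": "match", "field": field, "value": value}


def prefix(field, value):
    return {"type": "prefix", "field": field, "value": value}


def wildcard(field, value):
    return {"type": "wildcard", "field": field, "value": value}


def date_from(field, lower):
    # Range condition with only a lower bound, as in the article.
    return {"type": "range", "field": field, "lower": lower}


def boolean(must=None, should=None):
    condition = {"type": "boolean"}
    if must:
        condition["must"] = list(must)
    if should:
        condition["should"] = list(should)
    return condition


def sort_by(*fields):
    # Each entry is a (field name, reverse) pair.
    return {"fields": [{"field": name, "reverse": reverse} for name, reverse in fields]}


search = {
    "filter": boolean(must=[
        date_from("createdat", "2014-04-25"),
        boolean(should=[
            prefix("username", "a"),
            wildcard("username", "*b*"),
            match("username", "fast"),
        ]),
    ]),
    "sort": sort_by(("createdat", True), ("username", False)),
}

print(json.dumps(search, indent=2))
```

The resulting JSON text can be placed in the `lucene` column of the WHERE clause, with any single quotes doubled as CQL requires.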

You can learn more about Stratio’s Lucene index for Cassandra by reading the comprehensive documentation or by watching the video of its presentation at Cassandra Summit Europe 2014.

We will be back very soon with exciting new features such as geospatial search and bitemporal indexing.
