A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. Each vector has a certain number of dimensions, ranging from tens to thousands, depending on the complexity and granularity of the data.
Vector databases are specialized storage systems designed for efficient management of dense vectors and support advanced similarity search, while vector libraries are integrated into existing DBMS or search engines to enable similarity search within a broader database context. The choice between the two depends on the specific requirements and scale of the application.
- Elasticsearch (64.9k ⭐) — A distributed search and analytics engine that supports various types of data. One of the data types that Elasticsearch supports is vector fields, which store dense vectors of numeric values. In version 7.10, Elasticsearch added support for indexing vectors into a specialized data structure to support fast kNN retrieval through the kNN search API. In version 8.0, Elasticsearch added support for native natural language processing (NLP) with vector fields.
- Faiss (24.1k ⭐) — A library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. It is developed primarily at Meta’s Fundamental AI Research group. A minimal usage sketch appears after this list.
- Milvus (22.4k ⭐) — An open-source vector database that can manage trillion-scale vector datasets and supports multiple vector search indexes and built-in filtering.
- Qdrant (12.5k ⭐) — A vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant offers extended filtering support, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.
- Chroma (8.2k ⭐) — An AI-native open-source embedding database. It is simple, feature-rich, and integrable with various tools and platforms for working with embeddings. It also provides a JavaScript client and a Python API for interacting with the database.
- OpenSearch (7.4k ⭐) — A community-driven, open-source fork of Elasticsearch and Kibana created after the license change in early 2021. It includes vector database functionality that allows you to store and index vectors and metadata, and perform vector similarity search using k-NN indexes.
- Weaviate (7.3k ⭐) — An open-source vector database that allows you to store data objects and vector embeddings from your favorite ML models, and scale seamlessly into billions of data objects.
- Vespa (4.6k ⭐) — A fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query. Integrated machine-learned model inference allows you to apply AI to make sense of your data in real time.
- pgvector (5.3k ⭐) — An open-source extension for PostgreSQL that allows you to store and query vector embeddings within your database. It adds a vector column type and supports both exact and approximate nearest neighbor search directly in SQL. pgvector is easy to use and can be installed with a single command.
- Vald (1.3k ⭐) — A highly scalable, distributed, fast approximate nearest neighbor dense vector search engine. Vald is designed and implemented based on a cloud-native architecture. It uses NGT, one of the fastest ANN algorithms, to search for neighbors. Vald provides automatic vector indexing, index backup, and horizontal scaling, which make it suitable for searching across billions of feature vectors.
- Apache Cassandra (8.1k ⭐) — An open-source NoSQL distributed database trusted by thousands of companies. Vector search is coming to Apache Cassandra in its 5.0 release, which is expected to be available in late 2023 or early 2024. This feature is based on a collaboration between DataStax and Google, who are working on integrating Apache Cassandra with Google’s open-source vector similarity search library, ScaNN.
- ScaNN (Scalable Nearest Neighbors, Google Research) — A library for efficient vector similarity search, which finds the k nearest vectors to a query vector, as measured by a similarity metric. Vector similarity search is useful for applications such as image search, natural language processing, recommender systems, and anomaly detection.
- Pinecone — A fully managed, closed-source vector database designed for machine learning applications. It is fast, scalable, and supports metadata filtering alongside vector similarity search, all accessed through a hosted API.
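To make the Faiss entry above more concrete, here is a minimal sketch of similarity search with the library (the faiss-cpu package). The random data, dimensionality, and index parameters are illustrative assumptions rather than recommendations; the sketch contrasts an exact flat index with a compressed, approximate IVF+PQ index.

```python
# A minimal sketch of similarity search with Faiss (pip install faiss-cpu).
# The data here is random; in practice the vectors would come from an embedding model.
import numpy as np
import faiss

d = 128                                                  # vector dimensionality
xb = np.random.random((10_000, d)).astype("float32")     # database vectors
xq = np.random.random((5, d)).astype("float32")          # query vectors

# Exact (brute-force) search with L2 distance: perfect recall, slower at scale.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D, I = flat.search(xq, 5)                # distances and ids of the 5 nearest vectors

# Approximate search with product quantization (IndexIVFPQ): compressed storage,
# faster queries, some recall loss. nlist / m / nbits are illustrative tuning knobs.
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 100, 8, 8)   # 100 clusters, 8 sub-quantizers, 8 bits each
ivfpq.train(xb)                          # IVF/PQ indexes must be trained before adding vectors
ivfpq.add(xb)
ivfpq.nprobe = 10                        # number of clusters to visit per query
D_approx, I_approx = ivfpq.search(xq, 5)
```

The databases in the list above essentially wrap this kind of index behind a server API, adding persistence, updates, metadata filtering, and scaling on top of it.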
Common features
Vector databases and vector libraries are both technologies that enable vector similarity search, but they differ in functionality and usability:
- Vector databases can store and update data, handle various types of data sources, perform queries during data import, and provide user-friendly and enterprise-ready features.
- Vector libraries can store but not update data, handle vectors only, require importing all the data before the index is built, and require more technical expertise and manual configuration.
Some vector databases are built on top of existing libraries, such as Faiss. This allows them to take advantage of the existing code and features of the library, which can save time and effort in development.
These vector databases and libraries are used in artificial intelligence (AI) applications such as machine learning, natural language processing, and image recognition. They share some common features:
- They support vector similarity search, which finds the k nearest vectors to a query vector, as measured by a similarity metric. Vector similarity search is useful for applications such as image search, natural language processing, recommender systems, and anomaly detection.
- They use vector compression techniques to reduce storage space and improve query performance. Vector compression methods include scalar quantization, product quantization, and anisotropic vector quantization.
- They can perform exact or approximate nearest neighbor search, depending on the trade-off between accuracy and speed. Exact nearest neighbor search provides perfect recall but may be slow for large datasets. Approximate nearest neighbor search uses specialized data structures and algorithms to speed up the search but may sacrifice some recall.
- They support different types of similarity metrics, such as L2 distance, inner product, and cosine distance. Different similarity metrics suit different use cases and data types (see the sketch after this list).
- They can handle various types of data sources, such as text, images, audio, video, and more. Data sources can be transformed into vector embeddings using machine learning models, such as word embeddings, sentence embeddings, and image embeddings.
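The metrics and the exact-versus-approximate trade-off above can be illustrated with a small, self-contained sketch. It performs brute-force (exact) nearest neighbor search over random stand-in vectors under the three common metrics; the data and sizes are purely illustrative.

```python
# A toy sketch of exact (brute-force) nearest neighbor search under three common metrics.
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1_000, 64)).astype("float32")   # database of 1,000 64-d vectors
q = rng.normal(size=(64,)).astype("float32")          # a single query vector

def top_k(scores, k=5, largest=True):
    """Return indices of the k best scores (largest or smallest)."""
    order = np.argsort(scores)
    return order[::-1][:k] if largest else order[:k]

# L2 (Euclidean) distance: smaller means closer.
l2 = np.linalg.norm(db - q, axis=1)
print("L2:", top_k(l2, largest=False))

# Inner product: larger means closer (often used with normalized embeddings).
ip = db @ q
print("Inner product:", top_k(ip))

# Cosine similarity: inner product of unit-normalized vectors.
cos = (db @ q) / (np.linalg.norm(db, axis=1) * np.linalg.norm(q))
print("Cosine:", top_k(cos))
```

For unit-length vectors the inner product and cosine similarity give the same ranking, and L2 distance is a monotone function of them, so the choice of metric matters most when vectors are not normalized. Approximate indexes (such as the IVF+PQ example earlier) trade a little recall against this exact baseline for much faster queries on large datasets.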
When choosing a vector database, it is important to consider your specific needs and requirements, such as data volume, update frequency, filtering, and operational constraints.
What are vector embeddings?
Vector embeddings, also known as vector representations (or word embeddings in the case of text), are numerical representations of words, phrases, or documents in a high-dimensional vector space. They capture semantic and syntactic relationships between words and allow machines to understand and process natural language more effectively.
{image} --> {image model} --> [1.3,0.6,1.2,-0.4,...]
{text} --> {text model} --> [1.2,0.4,1.5,-0.8,...]
{audio} --> {audio model} --> [1.1,1.6,-1.1,0.4,...]
Vector embeddings are typically generated using machine learning techniques, such as neural networks, that learn to map words or textual inputs to dense vectors. The underlying idea is to represent words with similar meanings or contexts as vectors that are close together in the vector space.
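As a concrete example of the pipeline sketched above, the snippet below turns a few sentences into embedding vectors. The sentence-transformers package and the all-MiniLM-L6-v2 model name are assumptions chosen for illustration; any embedding model would serve the same role.

```python
# A minimal sketch of turning text into vector embeddings
# with the sentence-transformers package (pip install sentence-transformers).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "A vector database stores embeddings.",
    "Embeddings are dense numeric representations of data.",
    "The weather is nice today.",
]
embeddings = model.encode(sentences)      # one dense vector per sentence (384-d for this model)
print(embeddings.shape)

# Sentences with related meanings end up close together in the vector space.
norms = np.linalg.norm(embeddings, axis=1)
sim = (embeddings @ embeddings.T) / (norms[:, None] * norms[None, :])
print(sim.round(2))   # the first two sentences should score higher with each other
```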
One popular method for generating vector embeddings is Word2vec, which learns representations based on the distributional properties of words in a large corpus of text. It can be trained in two ways: the continuous bag-of-words (CBOW) model or the skip-gram model. CBOW predicts a target word based on its context words, while skip-gram predicts context words given a target word. Both models learn to map words to vector representations that encode their semantic relationships.
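The CBOW versus skip-gram switch can be shown with a toy sketch using gensim's Word2Vec implementation (an assumed dependency; the corpus here is far too small to learn meaningful vectors and only demonstrates the API).

```python
# A toy sketch of training Word2vec with gensim (pip install gensim).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 selects skip-gram (predict the context from a word).
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["king"].shape)          # a 50-dimensional vector
print(skipgram.wv.most_similar("king"))   # nearest neighbors in the learned space
```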
Another widely used technique is GloVe (Global Vectors for Word Representation), which leverages co-occurrence statistics to generate word embeddings. GloVe constructs a word co-occurrence matrix based on the frequencies of words appearing together in a corpus and then applies matrix factorization to obtain the embeddings.
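The co-occurrence idea behind GloVe can be illustrated with a simplified sketch: build a windowed co-occurrence matrix and factorize it into dense vectors. Real GloVe fits a weighted least-squares objective on log co-occurrence counts rather than the plain truncated SVD used here, so this is only a conceptual stand-in.

```python
# A simplified illustration of co-occurrence counting plus matrix factorization.
import numpy as np

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of size 2.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[idx[w], idx[sent[j]]] += 1

# Factorize the (log-scaled) counts; each row of U * S becomes a word vector.
U, S, _ = np.linalg.svd(np.log1p(cooc))
dim = 2
vectors = U[:, :dim] * S[:dim]
print({w: vectors[idx[w]].round(2) for w in vocab})
```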
Vector embeddings have various applications in natural language processing (NLP) tasks, such as language modeling, machine translation, sentiment analysis, and document classification.
By representing words as dense vectors, models can perform mathematical operations on these vectors to capture semantic relationships, such as word analogies (e.g., “king” - “man” + “woman” ≈ “queen”). Vector embeddings enable machines to capture the contextual meaning of words and enhance their ability to process and understand human language.
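The analogy arithmetic can be reproduced with pretrained vectors, for example via gensim's downloader. The glove-wiki-gigaword-50 dataset name is one of the sets the downloader ships with, chosen here for illustration; the first call fetches the vectors over the network.

```python
# A sketch of the classic word-analogy arithmetic using pretrained GloVe vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # returns a KeyedVectors object

# vector("king") - vector("man") + vector("woman") should land near "queen".
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # "queen" is expected among the top results
```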