pinecone_text.sparse

Sparse Vectorizers

Sparse vectorizers convert a document into a sparse vector representation, which is useful for indexing and searching large collections of documents. The representation consists of two parallel lists, indices and values: the indices are token ids and the values are the corresponding token weights. The token weights are calculated using either the BM25 or the SPLADE algorithm.
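
As an illustration of the indices/values layout described above, here is a minimal sketch in plain Python. The `SparseVector` alias mirrors the one defined in this module; the dictionary keys `"indices"` and `"values"` and the helper names `to_sparse_vector` and `sparse_dot` are assumptions made for this example, not part of the documented API.

```python
from typing import Dict, List, Union

# Same shape as the SparseVector alias defined in this module.
SparseVector = Dict[str, Union[List[int], List[float]]]


def to_sparse_vector(weights: Dict[int, float]) -> SparseVector:
    """Pack a {token_id: weight} mapping into two parallel lists.

    The key names "indices" and "values" are assumed for illustration.
    """
    indices = sorted(weights)
    return {"indices": indices, "values": [weights[i] for i in indices]}


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product over the token ids that both vectors share."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


# Hypothetical token ids and weights.
doc = to_sparse_vector({42: 0.5, 7: 0.25})
query = to_sparse_vector({42: 1.0, 99: 1.0})
score = sparse_dot(doc, query)  # only token id 42 is shared
```

Only the non-zero entries are stored, which is why this layout stays compact even when the vocabulary is very large.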

BM25

Okapi BM25 is a probabilistic ranking function used to rank documents against a query. BM25 is a bag-of-words model: it does not take the order of the words in the document into account. In this library, BM25 is used to calculate the token weights in the sparse vector representation.

Important note: Our BM25 implementation is not the same as the one in the original paper. We use a different TF-IDF representation that is more suitable for vector representations. The BM25 implementation in this library is based on work done by Pinecone. For more information, see the Pinecone documentation.
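
For orientation, the textbook Okapi BM25 term weight can be sketched in a few lines. As the note above says, this library's implementation differs from it, so the following shows the classic formulation only and is not this library's code. The function name `bm25_weights`, the defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_weights(
    doc_tokens: List[str],
    corpus: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> Dict[str, float]:
    """Textbook BM25 weight for each distinct term of one document.

    Assumption: this is the classic formula, NOT this library's variant.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    # Bag of words: only term counts matter, word order is discarded.
    term_freqs = Counter(doc_tokens)
    length_norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)

    weights = {}
    for term, freq in term_freqs.items():
        doc_freq = sum(1 for d in corpus if term in d)
        # IDF with +1 inside the log so the weight is never negative.
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        weights[term] = idf * freq * (k1 + 1) / (freq + length_norm)
    return weights


# Hypothetical, pre-tokenized toy corpus.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "fast"],
]
weights = bm25_weights(corpus[0], corpus)
shuffled = bm25_weights(["sat", "the", "cat"], corpus)
```

Because the model is a bag of words, `weights` and `shuffled` contain the same values, and the term that appears in every document ("the") receives a lower weight than rarer terms such as "cat".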

SPLADE

SPLADE is a Transformer-based encoder that uses learned term expansion to encode documents and queries in a sparse representation. This allows a semantic search to be performed on the sparse vectors. The SPLADE encoder is based on work done by the research team at Naver Labs Europe. For more information, see the SPLADE paper. The SPLADE encoder is currently available for inference only.
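
To make the idea of expansion concrete, here is a minimal sketch of the pooling step as we understand it from the SPLADE paper referenced above: each vocabulary entry gets the weight `max_i log(1 + ReLU(logit_ij))` over the input positions `i`. No Transformer is run here. The function name `splade_pool` and the toy logits are hypothetical, and this is not necessarily how the encoder in this library is implemented.

```python
import math
from typing import Dict, List


def splade_pool(logits: List[List[float]]) -> Dict[int, float]:
    """Pool per-position vocabulary logits into sparse term weights.

    logits[i][j] is the (hypothetical) model logit for vocabulary
    entry j at input position i.
    """
    vocab_size = len(logits[0])
    weights = {}
    for j in range(vocab_size):
        # log(1 + ReLU(x)), then max over input positions.
        w = max(math.log1p(max(row[j], 0.0)) for row in logits)
        if w > 0.0:  # keep only non-zero entries, so the result is sparse
            weights[j] = w
    return weights


# Two input positions, a toy vocabulary of four entries.
toy_logits = [
    [2.0, -1.0, 0.5, -3.0],
    [-0.5, -2.0, 1.5, -1.0],
]
weights = splade_pool(toy_logits)
```

Any vocabulary entry with a positive logit at some position receives a weight, whether or not that entry appears in the input text. That is what lets the sparse vector carry related terms.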

"""
# Sparse Vectorizers

Sparse vectorizers convert a document into a sparse vector representation, which is useful for indexing and
searching large collections of documents. The representation consists of two parallel lists, indices and values:
the indices are token ids and the values are the corresponding token weights. The token weights are calculated
using either the BM25 or the SPLADE algorithm.

## BM25
Okapi BM25 is a probabilistic ranking function used to rank documents against a query. BM25 is a bag-of-words
model: it does not take the order of the words in the document into account. In this library, BM25 is used to
calculate the token weights in the sparse vector representation.

Important note:
Our BM25 implementation is not the same as the one in the original paper. We use a different TF-IDF representation
that is more suitable for vector representations. The BM25 implementation in this library is based on work done by
Pinecone. For more information, see the [Pinecone documentation](https://docs.pinecone.io/docs/hybrid-search).

## SPLADE
SPLADE is a Transformer-based encoder that uses learned term expansion to encode documents and queries in a sparse
representation. This allows a semantic search to be performed on the sparse vectors. The SPLADE encoder is based on
work done by the research team at Naver Labs Europe. For more information, see the
[SPLADE paper](https://arxiv.org/abs/2109.10086). The SPLADE encoder is currently available for inference only.
"""


from typing import Union, Dict, List

# A sparse vector: string keys mapped to a list of token ids (ints) or a list of token weights (floats).
SparseVector = Dict[str, Union[List[int], List[float]]]

from .bm25_encoder import BM25Encoder  # noqa: F401
from .splade_encoder import SpladeEncoder  # noqa: F401