Semantic Search

What is semantic search?

Semantic search is an information retrieval technique in which the search engine interprets the meaning of queries and search objects (such as text, images, videos, and audio data) much as a human would.

Semantic search is realized by representing the meaning of search objects and queries as vectors. Searching by capturing the meaning of both entities in vectors brings the following benefits:

  • Semantic search is not limited to text. Any object whose meaning can be converted into a vector (images, videos, audio data, and so on) becomes searchable, even across different object types (multimodal search).
  • For the same reason, documents written in another language can be searched with queries in your own language. For example, you can find an English manual with a Japanese query, without translation.
  • In text search, semantically similar documents can be retrieved even when keywords do not match exactly, so there is no synonym dictionary to maintain.
  • Because exact keyword matches are not required, searches still work when there is a vocabulary gap between document authors (domain experts) and searchers who are unfamiliar with the domain.
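
The idea of searching by vectors can be illustrated with a minimal sketch. The 3-dimensional vectors and document ids below are made up for illustration (real embedding models produce hundreds of dimensions), and cosine similarity is only one of several possible similarity measures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, docs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical embeddings: a text document and an image share one vector space.
docs = {
    "en-manual": [0.9, 0.1, 0.0],
    "cat-photo": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of a query about the manual

print(rank(query, docs))  # prints ['en-manual', 'cat-photo']
```

No keyword is compared anywhere in this sketch. Only the vectors are, which is why the language or media type of the original object does not matter.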

Semantic search is sometimes referred to by the following terms, but they all refer to the same technology. In this book, we will use the term 'semantic search' for consistency.

  • Vector search
  • Dense vector search
  • Neural search

Apache Solr as provided by KandaSearch supports semantic search from version 9.0 onward.

Semantic search feature of Apache Solr

We will explain the necessary configurations for using semantic search with Apache Solr.

Schema configuration

You need a field in managed-schema.xml to store vector values. The following example defines a 4-dimensional field named 'vector' that uses cosine similarity:

<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>

  • Specify the number of dimensions with vectorDimension (required).
  • Specify the function used for relevance calculation with similarityFunction (optional; choose from euclidean, dot_product, or cosine; Solr uses euclidean when it is omitted).
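
To get a feel for the measures behind the three similarityFunction options, the following sketch computes them in plain Python for two 4-dimensional vectors. It shows the raw mathematical quantities only. Solr converts them into relevance scores internally, and that conversion is not reproduced here. The helper names are illustrative and not part of any Solr API.

```python
import math

VECTOR_DIMENSION = 4  # must match vectorDimension in the schema

def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot_product(v, v))
    return [x / length for x in v]

a = [1.0, 2.5, 3.7, 4.1]
b = [1.5, 5.5, 6.7, 65.1]
assert len(a) == len(b) == VECTOR_DIMENSION

print("euclidean distance:", euclidean(a, b))
print("dot product:", dot_product(a, b))
# On unit-length vectors the dot product equals the cosine of the angle.
print("cosine:", dot_product(normalize(a), normalize(b)))
```

One practical consequence: if vectors are normalized to unit length before indexing, dot_product ranks documents the same way as cosine.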

Indexing

Using an embedding model, prepare document data like the following. Because the field that stores vector values was named 'vector' above, each JSON document carries a 4-dimensional vector in its 'vector' field.

[
  { "id": "1", "vector": [1.0, 2.5, 3.7, 4.1] },
  { "id": "2", "vector": [1.5, 5.5, 6.7, 65.1] }
]

Index the prepared JSON data (let's call it data.json) into Apache Solr as follows.

curl -X POST 'https://<subdomain>.c.kandasearch.com/solr/<collection>/update?commit=true' --data-binary @data.json -H 'Content-Type: text/json'
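
The same step can be scripted. The sketch below builds the documents in Python, checks each vector against the schema's dimension before sending, and posts them with the standard library. The function names are illustrative, the URL placeholders are the ones used in the curl command, and the actual POST is left commented out so the example runs without a live instance.

```python
import json
import urllib.request

VECTOR_DIMENSION = 4  # must equal vectorDimension in the schema

def build_docs(vectors):
    """Turn {id: vector} into Solr JSON documents, rejecting wrong-sized vectors."""
    docs = []
    for doc_id, vector in vectors.items():
        if len(vector) != VECTOR_DIMENSION:
            raise ValueError(
                f"{doc_id}: expected {VECTOR_DIMENSION} values, got {len(vector)}"
            )
        docs.append({"id": doc_id, "vector": [float(v) for v in vector]})
    return docs

def post_docs(update_url, docs):
    """POST the documents to a Solr /update endpoint and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

docs = build_docs({"1": [1.0, 2.5, 3.7, 4.1], "2": [1.5, 5.5, 6.7, 65.1]})
print(json.dumps(docs))

# Uncomment and fill in the placeholders to send the data:
# post_docs("https://<subdomain>.c.kandasearch.com/solr/<collection>/update?commit=true", docs)
```

Checking the dimension on the client side gives a clearer error message than a rejected update request.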

Searching

Prepare a query vector (in this case, 4-dimensional) using the same model used during indexing, and execute semantic search as follows.

curl -G 'https://<subdomain>.c.kandasearch.com/solr/<collection>/select' --data-urlencode 'q={!knn f=vector topK=10}[1.0, 2.5, 3.7, 4.1]' --data-urlencode 'fl=id,score'

  • Specify the vector field name with f (required).
  • Specify the maximum number of hits to retrieve with topK (default is 10).
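
Because the knn query contains braces, brackets, and spaces, it has to be URL-encoded before it is sent. The following sketch builds such a request URL in Python. The helper name knn_query_url is illustrative, and the base URL uses the same placeholders as the curl command.

```python
from urllib.parse import urlencode

def knn_query_url(select_url, field, vector, top_k=10):
    """Build a URL-encoded Solr select URL for a {!knn} query."""
    vector_text = "[" + ", ".join(str(float(v)) for v in vector) + "]"
    params = {
        # Doubled braces produce literal { and } in an f-string.
        "q": f"{{!knn f={field} topK={top_k}}}{vector_text}",
        "fl": "id,score",
    }
    return select_url + "?" + urlencode(params)

url = knn_query_url(
    "https://<subdomain>.c.kandasearch.com/solr/<collection>/select",
    field="vector",
    vector=[1.0, 2.5, 3.7, 4.1],
)
print(url)
```

The query vector must have the same number of dimensions as the field, and should come from the same model that produced the indexed vectors.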

Semantic search feature of KandaSearch

For text search, KandaSearch can perform the vector calculations itself, both at query time and, for small-scale data, at indexing time.

Plugin for Semantic Search in KandaSearch

KandaSearch comes with pre-installed Apache Solr plugins for semantic search, so there are no additional extensions or plugins to download. This section explains how to use these built-in plugins.

EmbeddingsProcessorFactory

EmbeddingsProcessorFactory is an Apache Solr UpdateRequestProcessor suitable for performing small-scale vector calculations during indexing. To use it, configure the following settings in solrconfig.xml.

<updateRequestProcessorChain name="get-embeddings">
  <processor class="com.rondhuit.solr.update.dense.EmbeddingsProcessorFactory">
      <str name="clientType">COMMUNITY_JA</str>
      <str name="sourceField">body</str>
      <str name="targetField">body_vector</str>
  </processor>

  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The parameters are as follows. All of them are required:

  • clientType: Specify one of the following:
      • COMMUNITY_JA: Japanese model
      • COMMUNITY_EN: English model
      • COMMUNITY_MULT: Solr manual model (fine-tuned from a multilingual model)
      • SENTENCE_EMBEDDINGS: Refer to 'Using models publicly available on Hugging Face' below
  • sourceField: Name of the field whose content is vectorized
  • targetField: Name of the DenseVectorField-type field that stores the resulting vector

The UpdateRequestProcessorChain configured above, named get-embeddings, is invoked at indexing time by passing its name in the update.chain parameter:

curl -X POST 'https://<subdomain>.c.kandasearch.com/solr/<collection>/update?update.chain=get-embeddings&commit=true' --data-binary @yourfile.json -H 'Content-type: application/json'

DenseVectorSearchComponent

The DenseVectorSearchComponent is an Apache Solr SearchComponent. With it, you can run semantic search by sending plain query text to Solr, just as in an ordinary keyword search. To use it, configure the following settings in solrconfig.xml.

<searchComponent name="dense-search" class="com.rondhuit.solr.handler.component.DenseVectorSearchComponent">
  <str name="clientType">COMMUNITY_JA</str>
</searchComponent>

<requestHandler name="/semantic" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <bool name="indent">true</bool>
    <bool name="_ks.dense">true</bool>
    <str name="_ks.dense.field">body_vector</str>
    <int name="_ks.dense.k">100</int>
  </lst>
  <arr name="first-components">
    <str>dense-search</str>
  </arr>
</requestHandler>

The parameters are as follows:

  • clientType (required): Specify one of the following:
      • COMMUNITY_JA: Japanese model
      • COMMUNITY_EN: English model
      • COMMUNITY_MULT: Solr manual model (fine-tuned from a multilingual model)
      • SENTENCE_EMBEDDINGS: Refer to 'Using models publicly available on Hugging Face' below
  • _ks.dense: Specify true (default) or false to turn DenseVectorSearchComponent on or off
  • _ks.dense.field: Name of the field targeted by vector search (required)
  • _ks.dense.k: Number of documents to retrieve as search results (default: 100)

Using models publicly available on Hugging Face

In addition to the models listed above (COMMUNITY_JA, COMMUNITY_EN, COMMUNITY_MULT), EmbeddingsProcessorFactory and DenseVectorSearchComponent can use endpoints hosted on Hugging Face. The following parameters are common to both:

  • clientType: Specify SENTENCE_EMBEDDINGS to declare the use of a Hugging Face endpoint
  • hostName: Hugging Face endpoint
  • apiKey: Hugging Face API key

For example, DenseVectorSearchComponent can be configured as follows.

<searchComponent name="dense-search" class="com.rondhuit.solr.handler.component.DenseVectorSearchComponent">
  <str name="hostName">{Your Hugging Face Endpoint}</str>
  <str name="clientType">SENTENCE_EMBEDDINGS</str>
  <str name="apiKey">{Your API Key}</str>
</searchComponent>

To set up the endpoint in your Hugging Face project, implement a custom handler. The handler should return JSON in the following format, with a field named embeddings that contains an array of floats.

{
  "embeddings":[
    -0.045720744878053665,-0.007091674953699112, ...
  ]
}
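
A custom handler for Hugging Face Inference Endpoints is a handler.py file that defines a class named EndpointHandler. The sketch below shows one way such a handler could return the format above. It rests on several assumptions that the text does not confirm: that the request text arrives under the "inputs" key (the usual Hugging Face convention), and that the model is loaded with the sentence-transformers library. The encoder argument is a hypothetical addition that lets the handler be tried without downloading a model.

```python
class EndpointHandler:
    """Sketch of a handler.py for a Hugging Face Inference Endpoint."""

    def __init__(self, path="", encoder=None):
        if encoder is None:
            # Assumption: the repository at `path` holds a sentence-transformers model.
            from sentence_transformers import SentenceTransformer
            encoder = SentenceTransformer(path).encode
        self.encoder = encoder

    def __call__(self, data):
        # Assumption: the text to embed is sent under the "inputs" key.
        text = data["inputs"]
        vector = self.encoder(text)
        # Convert to plain floats so the result is JSON-serializable.
        return {"embeddings": [float(x) for x in vector]}

# Local check with a stub encoder instead of a real model.
handler = EndpointHandler(encoder=lambda text: [0.1, 0.2])
print(handler({"inputs": "hello"}))  # prints {'embeddings': [0.1, 0.2]}
```

The length of the returned array must match the vectorDimension of the target DenseVectorField.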

Example of Semantic Search using Livedoor News Corpus (embeddings)

Next, let's walk through semantic search with Apache Solr in KandaSearch, using the Livedoor News Corpus (embeddings) data provided in the KandaSearch extension library.

1. Creating a Collection

Create a collection using the configuration provided in the KandaSearch extension library.
The procedure is as follows:

  1. Add 'Livedoorニュース(embeddings) configuration(Solr 9)' from the extension library to your project. If there is no project, create one.
  2. Download the configuration added in step 1 to your local PC from the project view's 'Extensions'.
  3. From the project overview, navigate to the instance where you want to add the collection. If there is no instance, create one. The Community plan allows you to create instances for free.
  4. Select 'Collections' from the left-side menu in the instance view.
  5. Click '+ ADD A COLLECTION' on the collection screen, specify the file downloaded in step 2 under the 'IMPORT' tab, and create the collection.

2. Registering the Livedoor News Corpus (embeddings) data

Register (index) the data file with vector data provided in the extension library into the collection.
The procedure is as follows:

  1. Add the 'Livedoor News Corpus (embeddings)' data to your project from the extension library.
  2. From the project view, download the data added in step 1 from 'Extensions' and unzip it on your local PC. The JSON file is approximately 270MB.
  3. Because the data file exceeds the size that can be registered from the instance overview, register it with a cURL command from a terminal such as Mac Terminal or Windows WSL.

curl -X POST 'https://<subdomain>.c.kandasearch.com/solr/<collection>/update?commit=true&indent=true' --data-binary @livedoor_embeddings.json -H 'Content-Type: text/json'

3. Executing semantic search

Once the data registration is complete, execute a semantic search from the search UI.
The procedure is as follows:

(1) Select 'Search' from the left-side menu in the instance view.

(2) Specify the search criteria as follows (example settings) and click the 'Search' button.

  • Keyword: The keyword or text you want to search for. For example: 'スマホ' (smartphone), '男性がもらって喜ぶもの' (gifts for men), etc.
  • Collection: Select the collection created in the steps above.
  • Request Handler: Specify /semantic.
  • Unique Key: id is selected automatically.
  • Title: Specify title.
  • Body: Specify body.
  • Enable Semantic Search: Check the box to enable semantic search.
  • Dense Field: The field where vector values are stored. Specify body_vector.
  • Dense Top K: The number of search results to return. Specify 10.
  • Default Field: Specify body.

(3) Search results will be displayed.
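
The same search can presumably also be issued over HTTP instead of through the search UI. The sketch below assumes that the /semantic handler accepts the parameter names used in the solrconfig.xml example (_ks.dense, _ks.dense.field, _ks.dense.k) as request parameters, following the usual Solr rule that request parameters override handler defaults. That mapping is an assumption, not something the UI description confirms. The function names and the fl value are illustrative.

```python
import json
import urllib.request
from urllib.parse import urlencode

def semantic_query_url(base_url, text, dense_field="body_vector", top_k=10):
    """Build a request URL for the /semantic handler from plain query text."""
    params = {
        "q": text,
        "_ks.dense": "true",
        "_ks.dense.field": dense_field,
        "_ks.dense.k": top_k,
        "fl": "id,title,score",
    }
    return f"{base_url}/semantic?{urlencode(params)}"

def search(url):
    """Run the request and return the list of matching documents."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)["response"]["docs"]

url = semantic_query_url(
    "https://<subdomain>.c.kandasearch.com/solr/<collection>", "スマホ"
)
print(url)

# Uncomment after filling in the placeholders:
# for doc in search(url):
#     print(doc["id"], doc.get("title"), doc["score"])
```

Note that the query is ordinary text. The vector is computed on the server side by DenseVectorSearchComponent.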

4. Adding or updating data

By using the previously mentioned EmbeddingsProcessorFactory, you can perform embedding when adding or updating small amounts of data. Let's try registering the following JSON file, sample.json, into KandaSearch's Apache Solr.

$ cat sample.json
[
  {
    "id": "otani-san.txt",
    "url": "http://example.com/otani-san.html",
    "category": "sports-watch",
    "date": "2024-02-28T00:00:00Z",
    "title": "ドジャースの大谷翔平選手",
    "body": [
      "大谷翔平がドジャースのグラウンドに姿を現すと、出待ちしていたファンから大声援が起きた。"
    ]
  }
]

Execute the following curl command to register the data.

curl -X POST 'https://<subdomain>.c.kandasearch.com/solr/<collection>/update?update.chain=get-embeddings&commit=true' --data-binary @sample.json -H 'Content-type: application/json'
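
When scripting this step, it can help to check the documents before sending them, because the processor can only embed what is in the source field. The following sketch builds the update URL with the update.chain parameter and lists documents that lack the source field. The helper names are illustrative, and 'body' is the sourceField from the get-embeddings example configuration.

```python
from urllib.parse import urlencode

SOURCE_FIELD = "body"  # sourceField in the get-embeddings chain

def chain_update_url(base_url, chain="get-embeddings"):
    """Build an /update URL that routes documents through the given chain."""
    params = {"update.chain": chain, "commit": "true"}
    return f"{base_url}/update?{urlencode(params)}"

def missing_source(docs):
    """Return ids of documents that have nothing to embed in the source field."""
    return [doc.get("id") for doc in docs if not doc.get(SOURCE_FIELD)]

docs = [
    {"id": "otani-san.txt", "body": ["大谷翔平がドジャースのグラウンドに姿を現すと、..."]},
    {"id": "no-body.txt", "title": "title only"},
]
print(chain_update_url("https://<subdomain>.c.kandasearch.com/solr/<collection>"))
print(missing_source(docs))  # prints ['no-body.txt']
```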

Then, execute a semantic search with the same settings as before. This time, set the query to something like 'ドジャースの大谷選手' (Ohtani of the Dodgers). You will see that the added data is retrieved.

Vectorization of Large Data

Vectorization of large data in KandaSearch is done through the installation of GPS. For more details, please contact the KandaSearch team.

You can see various examples of semantic search in action on Rondhuit's Semantic Search Demo Site.