
Evaluating Combined Query and RRF Locally

Author: Elpidio Gonzalez Valbuena

Posted: November 27, 2025

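In the previous articles we looked at the goals and design of the Combined Query feature and at where RRF fits in Solr's overall hybrid search story. For most practitioners, though, what matters is not only how the feature is specified but how it actually behaves on real queries and data. Fortunately, Sonu Sharma's implementation is public, so you can already try the approach and see for yourself how Combined Query and RRF work together.

The steps below walk through fetching the branch, building it, and starting a local Solr instance, so that you can reproduce the experiments in this article or adapt them to your own use case.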

1. Clone the Combined Query branch

First, clone the contributor's branch directly. The PR is based on the feat_combined_query branch of the ercsonusharma/solr fork.

git clone --branch feat_combined_query --single-branch https://github.com/ercsonusharma/solr.git
cd solr

At this point the working tree contains the Combined Query implementation (CombinedQueryComponent, CombinedQuerySearchHandler, the RRF combiner, and so on), ready to be built and tested on a local Solr instance.

2. Build the dev distribution

From the Solr root directory:

./gradlew dev

This produces a self-contained "dev" Solr instance under solr/packaging/build/dev (the same layout as a regular binary distribution, with bin/solr, server/, configsets/, and so on).

3. Sanity-check with the tests bundled in the PR (optional)

The PR ships several tests for Combined Query (CombinedQueryComponentTest, CombinedQuerySearchHandlerTest, CombinedQuerySolrCloudTest, and others).

From the repository root:

./gradlew :solr:core:test --tests '*CombinedQuery*'

If these pass, you can assume your environment is roughly aligned with the author's.

4. Start Solr from the dev distribution

From the root of the dev distribution:

cd solr/packaging/build/dev

bin/solr start

To confirm that Solr is up and running in SolrCloud mode:

bin/solr status

The output looks roughly like this:

Solr process 14714 running on port 8983
{
  "solr_home":"~/solr/solr/packaging/build/dev/server/solr",
  "version":"10.0.0-SNAPSHOT aa0c23ab0cae7f1b498a9864f8324a81cefafe64 [snapshot build, details omitted]",
  "startTime":"Wed Nov 19 10:42:15 JST 2025",
  "uptime":"0 days, 0 hours, 0 minutes, 13 seconds",
  "memory":"227.6 MB (%44.4) of 512 MB",
  "cloud":{
    "ZooKeeper":"127.0.0.1:9983",
    "liveNodes":"1",
    "collections":"0"
  }
}

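The important part is the cloud section: it shows that Solr is running in SolrCloud mode with a local ZooKeeper and a single node. The Combined Query / RRF feature is designed with distributed search in mind, so using SolrCloud here matches the intended deployment model.

5. Prepare a configset that uses the Combined Query handler

Before you can send combined RRF queries to Solr, you need a collection configured with the new handler and component: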

CombinedQuerySearchHandler
CombinedQueryComponent

The PR already includes a test solrconfig and a test schema that are wired up for this feature:

solr/core/src/test-files/solr/collection1/conf/solrconfig-combined-query.xml
solr/core/src/test-files/solr/collection1/conf/schema-vector-catchall.xml

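Solr's JUnit tests start a mini Solr with these files to verify Combined Query, so reusing them is the quickest approach: you get a local setup with almost exactly the same handler, component chain, and vector field definitions as the tests.

5.1 Create a new configset

On the dev distribution side, start by copying the _default configset so that you inherit the standard logging configuration and basic defaults.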

cd solr/packaging/build/dev/server/solr

cp -r configsets/_default configsets/combined-query

At this point configsets/combined-query is just a copy of _default.

Next, overwrite solrconfig.xml with the PR's test solrconfig and prepare a clean schema.xml.

5.2 Reuse and clean up the PR's test solrconfig

In the source tree where you checked out the PR branch, locate this file:

solr/core/src/test-files/solr/collection1/conf/solrconfig-combined-query.xml

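Then, on the dev distribution side, copy it into the new configset.

# Assumes you are still in solr/packaging/build/dev/server/solr
# and that the repository was cloned to ~/solr (adjust the source path otherwise)

cp ~/solr/solr/core/src/test-files/solr/collection1/conf/solrconfig-combined-query.xml \
   configsets/combined-query/conf/solrconfig.xml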

The test solrconfig is a good starting point, but it contains a few test-only components that do not exist on a regular Solr server:

org.apache.solr.handler.component.ForcedDistributedComponent (registered as fd and used from /tfd)
org.apache.solr.handler.component.combine.TestCombiner (configured inside the combined_query search component)

If you use the file as-is, core creation fails with a ClassNotFoundException for these classes. Two changes fix that.

Use `/combined` and remove the test-only distributed component

In solrconfig.xml, rewrite the /search handler as follows:

<requestHandler name="/combined"
                class="solr.CombinedQuerySearchHandler">
</requestHandler>

Next, update initParams to target /select and /combined, and drop /tfd entirely:

<initParams path="/select,/combined">
  <lst name="defaults">
    <str name="df">text</str>
  </lst>
</initParams>

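You can also remove ForcedDistributedComponent and its dedicated test handler /tfd from solrconfig.xml. The test suite uses them only to force a particular distributed behavior; they are not needed on a production server or for this tutorial.

Remove the test combiner and use the real RRF combiner

In the original test configuration, the Combined Query search component looks roughly like this: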

<searchComponent class="solr.CombinedQueryComponent" name="combined_query">
  <int name="maxCombinerQueries">2</int>
  <lst name="combiners">
    <lst name="test">
      <str name="class">org.apache.solr.handler.component.combine.TestCombiner</str>
      <int name="var1">30</int>
      <str name="var2">test</str>
    </lst>
  </lst>
</searchComponent>

For this setup, delete the whole <lst name="combiners">…</lst> block and simplify the component to:

<searchComponent class="solr.CombinedQueryComponent" name="combined_query">
  <int name="maxCombinerQueries">2</int>
</searchComponent>

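This is all RRF needs. The actual Reciprocal Rank Fusion implementation lives in the server-side code and is enabled through request parameters such as combiner=true, combiner.algorithm=rrf, combiner.upTo, and combiner.rrf.k. So if all you want is to run standard hybrid RRF queries, there is no need to configure a custom combiner class in solrconfig.xml.

The TestCombiner in the test configuration exists mainly to:

Demonstrate that an arbitrary custom combiner can be plugged in through configuration.
Give the test suite a simple, deterministic combiner whose behavior is easy to assert on.

For a practical RRF setup, keep CombinedQueryComponent itself and remove the test-only settings, including TestCombiner. With these small changes, the combined-query configset runs cleanly on a regular dev SolrCloud node and is ready to accept combined keyword + vector RRF queries.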

5.3 Prepare a clean schema.xml and add a vector field

Next, prepare the schema, pulling in only the vector-related definitions from the test schema.

First, create schema.xml from the _default managed schema:

cd solr/packaging/build/dev/server/solr

cp configsets/_default/conf/managed-schema.xml configsets/combined-query/conf/schema.xml

This schema.xml defines the usual set of general-purpose text and numeric fields from _default.

Then open schema-vector-catchall.xml from the PR in an editor and copy only the parts you need:

solr/core/src/test-files/solr/collection1/conf/schema-vector-catchall.xml

From this file, add the dense vector fieldType and a single vector field definition to configsets/combined-query/conf/schema.xml. Concretely, add the following definition to the <fieldType> section:

<fieldType name="knn_vector_cosine"
           class="solr.DenseVectorField"
           vectorDimension="384"
           similarityFunction="cosine"/>

Then add the following to the <fields> section:

<field name="vector"
       type="knn_vector_cosine"
       indexed="true"
       stored="true"/>

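vectorDimension must match the dimensionality of the embeddings you use (384 or 768 are typical in real systems). The tests use 4, which is enough for a bare proof of concept; the 384 shown here matches the embedding model used later in this article.
similarityFunction selects how vector similarity is computed (cosine, dot product, euclidean, and so on).
The vector field stores each document's embedding so that it can later be searched with the knn or knn_text_to_vector query parsers.

In addition, this schema has no text field that is convenient for the examples, so add one: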

<field name="text"
       type="text_general"
       indexed="true"
       stored="true"
       multiValued="true"/>

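The original _default schema has a _text_ field for full-text indexing (stored="false"), but no plain text field that is straightforward to use in examples like these; this field fills that gap.

The schema now has the two things we need:

A free-text field for keyword search (text)
A dense vector field for k-NN search (vector)

These two are enough to demonstrate a hybrid query that fuses one keyword sub-query and one k-NN sub-query with RRF.

6. Create a collection from this configset

The next step is to create an actual collection from the configset. This is where we will index documents and run queries.

6.1 Create the `hybrid` collection

Create a collection named hybrid that uses the combined-query configset: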

cd solr/packaging/build/dev

bin/solr create -c hybrid -d server/solr/configsets/combined-query/conf

This command does the following:

Registers a collection named hybrid in SolrCloud.
Uses the combined-query config directory prepared above (the PR's solrconfig plus the vector-enabled schema) as the collection's configuration.
By default, creates it with 1 shard and 1 replica on the local node.

If the command succeeds, you now have:

A running hybrid collection whose solrconfig.xml includes CombinedQuerySearchHandler and CombinedQueryComponent.
A schema with a vector field, ready to accept hybrid k-NN + keyword queries.

7. Index some simple documents

To begin with, focus on keyword search only and ignore vectors for now. Index a handful of simple documents into the hybrid collection, using only the text field:

curl -X POST "http://localhost:8983/solr/hybrid/update?commit=true" \
  -H "Content-Type: application/json" -d '[
    {"id":"1", "text":"solr hybrid search with keyword matching"},
    {"id":"2", "text":"vector search with knn and dense vectors"},
    {"id":"3", "text":"bm25 keyword search for solr"},
    {"id":"4", "text":"neural semantic search in apache solr"},
    {"id":"5", "text":"hybrid retrieval: combining vector and keyword search"}
  ]'

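At this stage the text field is all that matters. It is already enough to check that the Combined Query handler works correctly with ordinary keyword (BM25) search. The vector field and true hybrid (keyword + k-NN) queries are introduced in later sections.

8. Run a simple multi-keyword Combined Query (no vectors yet)

With a few documents in the text field, the next step is to pass two keyword sub-queries to the Combined Query handler and merge their results with Reciprocal Rank Fusion (RRF).

The Combined Query JSON has three main parts:

A top-level queries map that defines the sub-queries (each entry is a JSON DSL query such as lucene, edismax, or knn)
A params block that enables the combiner and controls the RRF behavior
Parameters that specify which sub-queries to fuse

Against the hybrid collection created above, send two Lucene queries on the text field (solr and vector) and fetch the JSON response: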

curl -X POST "http://localhost:8983/solr/hybrid/combined?wt=json" \
  -H "Content-Type: application/json" -d '{
    "queries": {
      "q1": { "lucene": { "df": "text", "query": "solr" } },
      "q2": { "lucene": { "df": "text", "query": "vector" } }
    },
    "limit": 10,
    "fields": ["id","score", "text"],
    "params": {
      "combiner": "true",
      "combiner.algorithm": "rrf",
      "combiner.upTo": "100",
      "combiner.rrf.k": "60",

      "combiner.query": ["q1","q2"],
      "combiner.resultKey": ["q1","q2"]
    }
  }'

queries.q1 and queries.q2 are two ordinary Lucene queries that share the default field text and differ only in the search term (solr and vector respectively).
combiner=true makes the request run in Combined Query mode.
combiner.algorithm=rrf, combiner.upTo, and combiner.rrf.k control the RRF behavior (how deep into each ranking to fuse, and the value of k).
combiner.query=["q1","q2"] and combiner.resultKey=["q1","q2"] state explicitly which JSON sub-queries are executed and fused.
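
As a rough sketch, the same request body can also be assembled programmatically. The helper below is illustrative only: the function name combined_request and its default values are assumptions made for this example, while the JSON keys are the ones used in the curl request above.

```python
import json


def combined_request(queries, k=60, up_to=100, limit=10, fields=("id", "score", "text")):
    """Build a /combined request body that fuses every named sub-query with RRF.

    `queries` maps a sub-query name to a JSON DSL query, as in the curl example.
    """
    names = list(queries)
    return {
        "queries": queries,
        "limit": limit,
        "fields": list(fields),
        "params": {
            "combiner": "true",
            "combiner.algorithm": "rrf",
            "combiner.upTo": str(up_to),
            "combiner.rrf.k": str(k),
            # Every sub-query defined above is both executed and fused.
            "combiner.query": names,
            "combiner.resultKey": names,
        },
    }


body = combined_request({
    "q1": {"lucene": {"df": "text", "query": "solr"}},
    "q2": {"lucene": {"df": "text", "query": "vector"}},
})
print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and posted with curl -d @file, which keeps long sub-queries (such as embeddings) out of the shell command line.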

The response looks like this:

{
  ...
  "response":{
    "numFound":3,
    "start":0,
    "maxScore":0.016393442,
    "numFoundExact":false,
    "docs":[{
      "id":"3",
      "text":["bm25 keyword search for solr"],
      "score":0.016393442
    },{
      "id":"2",
      "text":["vector search with knn and dense vectors"],
      "score":0.016393442
    },{
      "id":"5",
      "text":["hybrid retrieval: combining vector and keyword search"],
      "score":0.016129032
    },{
      "id":"1",
      "text":["solr hybrid search with keyword matching"],
      "score":0.016129032
    },{
      "id":"4",
      "text":["neural semantic search in apache solr"],
      "score":0.015873017
    }]
  }
}

To understand what the combiner is actually doing, compare this fused result with the rankings you get when q1 and q2 are each sent to /select on their own.

# Baseline for q1: search text for "solr"
curl "http://localhost:8983/solr/hybrid/select?wt=json" \
  --get \
  --data-urlencode "q=solr" \
  --data-urlencode "df=text" \
  --data-urlencode "rows=10"

Response:

{
  ...
  "response":{
    "numFound":3,
    "start":0,
    "numFoundExact":true,
    "docs":[{
      "id":"3",
      "text":["bm25 keyword search for solr"]
    },{
      "id":"1",
      "text":["solr hybrid search with keyword matching"]
    },{
      "id":"4",
      "text":["neural semantic search in apache solr"]
    }]
  }
}

# Baseline for q2: search text for "vector"
curl "http://localhost:8983/solr/hybrid/select?wt=json" \
  --get \
  --data-urlencode "q=vector" \
  --data-urlencode "df=text" \
  --data-urlencode "rows=10"

Response:

{
  ...
  "response":{
    "numFound":2,
    "start":0,
    "numFoundExact":true,
    "docs":[{
      "id":"2",
      "text":["vector search with knn and dense vectors"]
    },{
      "id":"5",
      "text":["hybrid retrieval: combining vector and keyword search"]
    }]
  }
}

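Comparing these individual /select responses with the /combined response shows how the RRF combiner reorders documents when it merges the two keyword result sets. It also confirms that the Combined Query handler really is fusing the results of multiple sub-queries rather than returning the results of just one of them.

Looking at the three responses side by side, you can also see that the fused scores follow the standard RRF formula:

score(d) = Σ over queries q of 1 / (k + rank_q(d))

Here k = 60.

In the q=solr baseline, document id=3 is ranked first, so it gets 1 / (60 + 1) ≈ 0.01639, which matches the fused score of 0.016393442. id=1 and id=4 are second and third for q=solr, so they receive 1 / 62 ≈ 0.01613 and 1 / 63 ≈ 0.01587, matching 0.016129032 and 0.015873017 in the fused result. On the q=vector side, id=2 and id=5 are first and second, which gives 1 / 61 and 1 / 62. Because the two result sets are disjoint, each document's fused score consists of a single term.

The final fused order is therefore: the top document from each query first (id=3 and id=2), then the second-ranked documents (id=5 and id=1), and finally the third-ranked document (id=4), which is what you would expect from applying RRF to these two rankings.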
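
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the textbook RRF formula, not the PR's actual implementation; the function name rrf_fuse is an assumption for this example, and the two input lists are the document orders from the /select baselines.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists with Reciprocal Rank Fusion: sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; how ties are ordered is not specified by RRF itself.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


q1_ranking = ["3", "1", "4"]  # baseline order for q=solr
q2_ranking = ["2", "5"]       # baseline order for q=vector

fused = rrf_fuse([q1_ranking, q2_ranking])
for doc_id, score in fused:
    print(doc_id, f"{score:.9f}")
```

Documents with equal fused scores (id=3 and id=2, or id=1 and id=5) may come out in a different relative order than in Solr's response, since the tie-breaking rule is left open in this sketch.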

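9. Run a real hybrid query (keyword + vector)

From here we move on to an example where the benefits of hybrid search are more visible.

This step goes one step beyond the minimal example and assumes that a small, structured demo corpus has been indexed into the hybrid collection.

The corpus consists of short, single-topic paragraphs that fall naturally into the following clusters:

Cheap smartphones / mobile phones: documents 1, 2, 3, 4, 6, and 15 (5 is close but slightly off-topic)
- Example: a guide to "cheap phones under 200 dollars" for cost-conscious users.
Travel to Japan: documents 7 and 8
- Example: finding cheap flights to Tokyo with alerts and flexible dates.
Remote work and distributed teams: documents 9 and 10
- Work-from-home policies, asynchronous communication, managing across time zones, and so on.
Search / ranking / IR concepts: documents 11 to 14
- Examples: BM25 ranking (term frequency, document length, and so on), neural retrieval, hybrid search, and why keyword-based ranking is still a strong baseline.

The collection is deliberately small, but it includes semantic overlap and traps such as "cheap flights vs cheap phones", which makes it easy to observe how hybrid search behaves.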

For vector search, each document's text field was embedded with the following SentenceTransformer model:

Model: sentence-transformers/all-MiniLM-L6-v2
Type: SBERT-style SentenceTransformers model
Language: English
Embedding dimension: 384
Typical uses: semantic search, clustering, similarity scoring

All texts were encoded with normalize_embeddings=True (so the vectors lie on the unit sphere, which pairs well with cosine similarity), and the resulting 384-dimensional vectors were stored as-is in Solr's dense vector field vector.

As a result, BM25 responds mainly to literal phrases such as "cheap phones" or "iPhone for sale", while the vector field captures semantic neighbors such as "budget smartphones", "affordable mid-range devices", and "used phone offers", and the Combined Query handler can fuse the two with RRF.
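
The embedding and indexing step might look roughly like the sketch below. It assumes the sentence-transformers package is installed and that the hybrid collection from step 6 is running on localhost:8983. The helper names (embed, build_docs, index) and the sequential string ids are assumptions for this example, not the script actually used for the article.

```python
import json
import urllib.request

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
UPDATE_URL = "http://localhost:8983/solr/hybrid/update?commit=true"


def embed(texts):
    """Encode texts into unit-length 384-dimensional vectors."""
    # Imported lazily so the other helpers work without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    return model.encode(texts, normalize_embeddings=True).tolist()


def build_docs(texts, vectors):
    """Pair each text with its vector; sequential string ids are an assumption."""
    return [
        {"id": str(i), "text": text, "vector": [float(x) for x in vector]}
        for i, (text, vector) in enumerate(zip(texts, vectors), start=1)
    ]


def index(docs, url=UPDATE_URL):
    """POST the documents to Solr's JSON update endpoint and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Typical use (needs the model and a running Solr, so it is left as a comment):
#   texts = ["This guide compares cheap phones under 200 dollars. ..."]
#   index(build_docs(texts, embed(texts)))
```

Keeping build_docs separate from embed makes it easy to check the document shape against the schema (id, text, vector) before anything is sent to Solr.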

9.1 Keyword-only view: `cheap phones`

Start with a plain BM25 query:

curl "http://localhost:8983/solr/hybrid/select?wt=json" \
  --get \
  --data-urlencode "q=cheap phones" \
  --data-urlencode "df=text" \
  --data-urlencode "rows=10" \
  --data-urlencode "fl=id,score,text"

In my environment, Solr returned the following:

    "docs":[{
      "id":"5",
      "text":"This report analyzes enterprise mobile strategies. Large companies rarely buy cheap phones; instead they standardize on a small set of secure, well supported devices. Cost matters, but reliability, security patches, and fleet management tools often matter more.",
      "score":1.2165878
    },{
      "id":"1",
      "text":"This guide compares cheap phones under 200 dollars. We focus on low cost Android devices that still feel fast for everyday tasks like messaging, web browsing, and social media. If you want the cheapest phone that does not feel completely sluggish, start here.",
      "score":1.1558574
    },{
      "id":"7",
      "text":"Looking for cheap flights to Tokyo can be frustrating. Prices fluctuate every day, so this tutorial shows how to track airfare with alerts, flexible dates, and alternative airports to find low cost tickets to Japan without spending hours refreshing search results.",
      "score":0.69270647
    },{
      "id":"6",
      "text":"We compare mid range Android phones to top tier flagships. The mid range devices are not the cheapest phones on the market, but they strike a balance between price and performance and are often recommended for people who do not need cutting edge cameras.",
      "score":0.6504394
    },{
      "id":"2",
      "text":"Students often look for budget smartphones that can handle online classes, video calls, and note taking apps. This article reviews affordable mid range Android phones that cost less than a typical flagship but still offer decent cameras and battery life.",
      "score":0.4867006
    },{
      "id":"3",
      "text":"Flagship phones such as the latest iPhone are incredibly powerful, but they are not exactly inexpensive. We explain why premium devices command a higher price and when it might be worth paying extra for features like advanced cameras and long term software support.",
      "score":0.47698247
    }]

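BM25 behaves as expected: it relies heavily on the co-occurrence of the terms cheap and phones and has no notion that "budget smartphones" or "affordable devices" express essentially the same concept.

9.2 Vector-only view: semantic "cheap phones"

Next, encode the phrase "cheap phones" and send its embedding as a kNN query against the vector field through the {!knn} parser.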

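# Pseudocode: in practice, use the embedding array computed locally
EMBED='[0.12,0.34,0.56, ... ]'  # same dimensionality as the vector field

# "$EMBED" sits outside the single quotes so that the shell expands it
curl "http://localhost:8983/solr/hybrid/select?wt=json" \
  --data-urlencode 'q={!knn f=vector topK=10}'"$EMBED" \
  --data-urlencode 'fl=id,score,text' \
  --data-urlencode 'rows=10'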

In my environment, the top neighbors were:

    "docs":[{
      "id":"1",
      "text":"This guide compares cheap phones under 200 dollars. We focus on low cost Android devices that still feel fast for everyday tasks like messaging, web browsing, and social media. If you want the cheapest phone that does not feel completely sluggish, start here.",
      "score":0.8472254
    },{
      "id":"2",
      "text":"Students often look for budget smartphones that can handle online classes, video calls, and note taking apps. This article reviews affordable mid range Android phones that cost less than a typical flagship but still offer decent cameras and battery life.",
      "score":0.81867886
    },{
      "id":"3",
      "text":"Flagship phones such as the latest iPhone are incredibly powerful, but they are not exactly inexpensive. We explain why premium devices command a higher price and when it might be worth paying extra for features like advanced cameras and long term software support.",
      "score":0.8162396
    },{
      "id":"6",
      "text":"We compare mid range Android phones to top tier flagships. The mid range devices are not the cheapest phones on the market, but they strike a balance between price and performance and are often recommended for people who do not need cutting edge cameras.",
      "score":0.77509284
    },{
      "id":"15",
      "text":"Online marketplaces often advertise massive discounts on flagship smartphones, but many of the cheapest listings come from unknown sellers. This article explains how to evaluate used phone offers, avoid scams, and decide when a deal is truly good value.",
      "score":0.7708108
    },{
      "id":"5",
      "text":"This report analyzes enterprise mobile strategies. Large companies rarely buy cheap phones; instead they standardize on a small set of secure, well supported devices. Cost matters, but reliability, security patches, and fleet management tools often matter more.",
      "score":0.76603323
    },{
      "id":"4",
      "text":"If you want an iPhone for sale at a reasonable price, buying last year s model or a refurbished unit can save a lot of money. This guide lists older iPhones and certified pre owned devices that feel modern but are much more affordable than brand new flagships.",
      "score":0.7364917
    },{
      "id":"7",
      "text":"Looking for cheap flights to Tokyo can be frustrating. Prices fluctuate every day, so this tutorial shows how to track airfare with alerts, flexible dates, and alternative airports to find low cost tickets to Japan without spending hours refreshing search results.",
      "score":0.6251746
    },{
      "id":"8",
      "text":"Travelers who want affordable airfare deals to Japan should consider flying into Osaka or Nagoya instead of Tokyo. By being flexible with dates and airports, many people manage to book tickets that are hundreds of dollars cheaper than the obvious direct routes.",
      "score":0.56306547
    },{
      "id":"9",
      "text":"Many companies are updating their remote work policy. Some now support hybrid work, where employees split their time between home and the office. Others adopt fully distributed teams and rely on detailed work from home guidelines to keep communication smooth.",
      "score":0.54597485
    }]

Here id=1 is the clear top hit, and documents that never contain the literal phrase "cheap phones" still score highly because they talk about inexpensive, budget-conscious smartphones.
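
Typing a 384-value array into a shell variable is error-prone, so the {!knn} query string used in this step could instead be generated from the embedding. The following is a small sketch under that assumption; knn_query and select_params are hypothetical helper names, and the f and topK local parameters are the ones used in the curl example.

```python
from urllib.parse import urlencode


def knn_query(vector, field="vector", top_k=10):
    """Format a Solr {!knn} query string for the given embedding."""
    values = ",".join(repr(float(x)) for x in vector)
    return "{!knn f=%s topK=%d}[%s]" % (field, top_k, values)


def select_params(vector, rows=10):
    """URL-encode the parameters for a vector-only /select request."""
    return urlencode({
        "q": knn_query(vector, top_k=rows),
        "fl": "id,score,text",
        "rows": rows,
    })


# A toy 3-value vector; a real query needs as many values as vectorDimension.
print(knn_query([0.12, 0.34, 0.56]))
```

The encoded string returned by select_params can be sent as the body of a form-encoded POST to /select, which avoids URL length limits with long vectors.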

9.3 Hybrid view: fusing keyword + vector with Combined Query

Finally, run the two queries together as sub-queries and merge their ranked lists with RRF.

The JSON request looks like this:

curl -X POST "http://localhost:8983/solr/hybrid/combined?wt=json" \
  -H "Content-Type: application/json" -d '{
    "queries": {
      "keyword": {
        "edismax": {
          "qf": "text",
          "query": "cheap phones"
        }
      },
      "knn": {
        "knn": {
          "f": "vector",
          "vector": [0.12, 0.34, 0.56, 0.78],  // placeholder: put the actual 384-dimensional embedding here
          "k": 10
        }
      }
    },
    "limit": 10,
    "fields": ["id", "score", "text"],
    "params": {
      "combiner": "true",
      "combiner.algorithm": "rrf",
      "combiner.upTo": "10",
      "combiner.rrf.k": "60",

      "combiner.query": ["keyword", "knn"],
      "combiner.resultKey": ["keyword", "knn"]
    }
  }'

On this small dataset, the fused ranking is:

"docs":[{
      "id":"1",
      "text":"This guide compares cheap phones under 200 dollars. We focus on low cost Android devices that still feel fast for everyday tasks like messaging, web browsing, and social media. If you want the cheapest phone that does not feel completely sluggish, start here.",
      "score":0.032522473
    },{
      "id":"5",
      "text":"This report analyzes enterprise mobile strategies. Large companies rarely buy cheap phones; instead they standardize on a small set of secure, well supported devices. Cost matters, but reliability, security patches, and fleet management tools often matter more.",
      "score":0.031544957
    },{
      "id":"2",
      "text":"Students often look for budget smartphones that can handle online classes, video calls, and note taking apps. This article reviews affordable mid range Android phones that cost less than a typical flagship but still offer decent cameras and battery life.",
      "score":0.031513646
    },{
      "id":"6",
      "text":"We compare mid range Android phones to top tier flagships. The mid range devices are not the cheapest phones on the market, but they strike a balance between price and performance and are often recommended for people who do not need cutting edge cameras.",
      "score":0.03125
    },{
      "id":"3",
      "text":"Flagship phones such as the latest iPhone are incredibly powerful, but they are not exactly inexpensive. We explain why premium devices command a higher price and when it might be worth paying extra for features like advanced cameras and long term software support.",
      "score":0.031024532
    },{
      "id":"7",
      "text":"Looking for cheap flights to Tokyo can be frustrating. Prices fluctuate every day, so this tutorial shows how to track airfare with alerts, flexible dates, and alternative airports to find low cost tickets to Japan without spending hours refreshing search results.",
      "score":0.0305789
    },{
      "id":"15",
      "text":"Online marketplaces often advertise massive discounts on flagship smartphones, but many of the cheapest listings come from unknown sellers. This article explains how to evaluate used phone offers, avoid scams, and decide when a deal is truly good value.",
      "score":0.015384615
    },{
      "id":"4",
      "text":"If you want an iPhone for sale at a reasonable price, buying last year s model or a refurbished unit can save a lot of money. This guide lists older iPhones and certified pre owned devices that feel modern but are much more affordable than brand new flagships.",
      "score":0.014925373
    },{
      "id":"8",
      "text":"Travelers who want affordable airfare deals to Japan should consider flying into Osaka or Nagoya instead of Tokyo. By being flexible with dates and airports, many people manage to book tickets that are hundreds of dollars cheaper than the obvious direct routes.",
      "score":0.014492754
    },{
      "id":"9",
      "text":"Many companies are updating their remote work policy. Some now support hybrid work, where employees split their time between home and the office. Others adopt fully distributed teams and rely on detailed work from home guidelines to keep communication smooth.",
      "score":0.014285714
    }]

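What matters here is how the top few results behave.

id=1 is 2nd for the keyword query and 1st for the vector query, so RRF gives it the highest fused score.
id=5 is 1st for keyword but only 6th for vector; its score of 1/61 + 1/66 ≈ 0.03154 puts it in 2nd place.
id=2 is 5th for keyword but 2nd for vector; its score of 1/65 + 1/62 ≈ 0.03151 makes it a very close 3rd.

RRF looks at ranks, not raw scores. A document like id=1 that sits consistently high in both rankings is therefore preferred over documents that are strong in only one of them (such as id=2 or id=5). This is exactly the behavior you want from a hybrid system: the final ranking blends keyword precision with semantic recall instead of being dominated by either one.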

10. Inspect the debug output

Once the hybrid query works, it is worth checking how the RRF combiner computes its scores. Adding the usual debug parameters to the same /combined request is enough:

debugQuery=on
debug=results

The JSON response gains a debug section. For Combined Query, it contains a map named combinerExplanations that spells out how each document's RRF score was computed from its per-sub-query ranks.

Below is an excerpt from the "cheap phones" hybrid query (keyword + kNN):

"combinerExplanations":{
  "1":"org.apache.lucene.search.Explanation:0.032522473 = 1/(60+1) + 1/(60+2) because its ranks were: 1 for query(keyword), 2 for query(knn)\n",
  "5":"org.apache.lucene.search.Explanation:0.031544957 = 1/(60+6) + 1/(60+1) because its ranks were: 6 for query(keyword), 1 for query(knn)\n",
  "2":"org.apache.lucene.search.Explanation:0.031513646 = 1/(60+2) + 1/(60+5) because its ranks were: 2 for query(keyword), 5 for query(knn)\n",
  "6":"org.apache.lucene.search.Explanation:0.03125 = 1/(60+4) + 1/(60+4) because its ranks were: 4 for query(keyword), 4 for query(knn)\n",
  "3":"org.apache.lucene.search.Explanation:0.031024532 = 1/(60+3) + 1/(60+6) because its ranks were: 3 for query(keyword), 6 for query(knn)\n",
  "7":"org.apache.lucene.search.Explanation:0.0305789 = 1/(60+8) + 1/(60+3) because its ranks were: 8 for query(keyword), 3 for query(knn)\n",
  "15":"org.apache.lucene.search.Explanation:0.015384615 = 1/(60+5) because its ranks were: 5 for query(keyword), not in the results for query(knn)\n",
  "4":"org.apache.lucene.search.Explanation:0.014925373 = 1/(60+7) because its ranks were: 7 for query(keyword), not in the results for query(knn)\n",
  "8":"org.apache.lucene.search.Explanation:0.014492754 = 1/(60+9) because its ranks were: 9 for query(keyword), not in the results for query(knn)\n",
  "9":"org.apache.lucene.search.Explanation:0.014285714 = 1/(60+10) because its ranks were: 10 for query(keyword), not in the results for query(knn)\n"
}

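Points worth noting:

The keys ("1", "5", "2", …) are the document IDs.
Each value is a Lucene Explanation string that spells out the RRF formula used internally.
- For example, with k = 60, a document ranked 1st in one sub-query and 2nd in the other gets score = 1/(60+1) + 1/(60+2) ≈ 0.03252.
- If a document appears in only one of the rankings, its score is a single term such as 1/(60+5).
The trailing text, such as "because its ranks were: 1 for query(keyword), 2 for query(knn)", states the document's position in each sub-query's result list. Note that in this excerpt the keyword and knn labels appear to be swapped relative to the standalone rankings in 9.1 and 9.2: document 15, for instance, ranks 5th in the kNN results and is absent from the keyword results. The sums are unaffected, since RRF adds the per-query terms symmetrically.

Reading this section lets you check:

Which sub-queries contribute to each document's score.
How much weight each rank position carries under RRF.
Why some "bridge" documents (reasonably strong in both rankings) rise to the top of the final hybrid ranking.

For debugging and for explaining the Combined Query feature, the combinerExplanations block is the clearest evidence that the PR implements genuine RRF over per-sub-query rankings rather than a mere re-sort of scores.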
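
One way to double-check the explanations is to recompute each sum from the string itself. The sketch below assumes the explanation format shown in the excerpt above ("…:<score> = 1/(k+rank) + … because …"); the function name recompute is hypothetical, and the format may change in later revisions of the PR.

```python
import re

# Matches each "1/(k+rank)" term in an explanation string.
TERM = re.compile(r"1/\((\d+)\+(\d+)\)")


def recompute(explanation):
    """Return (reported_score, recomputed_score) for one combinerExplanations value."""
    head, _, _ = explanation.partition(" because ")
    # The reported score sits between the class-name prefix and the "=" sign.
    reported = float(head.split(":", 1)[1].split("=", 1)[0])
    recomputed = sum(1.0 / (int(k) + int(rank)) for k, rank in TERM.findall(head))
    return reported, recomputed


example = (
    "org.apache.lucene.search.Explanation:0.032522473 = 1/(60+1) + 1/(60+2) "
    "because its ranks were: 1 for query(keyword), 2 for query(knn)\n"
)
reported, recomputed = recompute(example)
print(reported, recomputed, abs(reported - recomputed) < 1e-6)
```

Running this over every entry in combinerExplanations gives a quick regression check that the reported scores still match the RRF formula.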

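These experiments show that Sonu Sharma's Combined Query + RRF implementation is more than an idea on paper: it lays out a concrete path to practical hybrid search in Solr. We cloned the branch, assembled a clean configset and schema, confirmed that the new handler accepts JSON queries, and walked from simple multi-keyword fusion to a full hybrid scenario in which BM25 and dense vectors each play to their strengths. Along the way we saw the RRF combiner balance the two in a transparent, explainable way.

The combinerExplanations output in particular makes it clear that the implementation performs true rank fusion over multiple result lists, not a plausible-looking reshuffle of scores. The "cheap phones" corpus, meanwhile, is a handy teaching aid for building intuition about how keyword precision and semantic recall interact. My impression so far is that this implementation is a promising foundation for production-grade hybrid search in Solr, and that it is well worth Solr users and contributors pulling the branch, trying it on their own workloads, and helping push the feature toward inclusion in a future Solr release.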
