TUTORIAL

04. Indexing

So far, we have seen that a search engine uses an index to perform keyword searches quickly.
Now let's register the documents to be searched in the search engine (a step we'll call 'indexing') and confirm that we can actually search them.

You can prepare your own documents and configuration files for indexing, but here we will use the data and configuration for the 'livedoor News Corpus' provided in KandaSearch's extension library.

livedoor News Corpus

The 'livedoor News Corpus' is a corpus created from Livedoor news articles by NHN Japan (now LINE Yahoo Japan), with HTML tags removed. A corpus refers to a large collection of language resources such as text.

The field configuration is as follows:

  • ID (id)
  • Category (category)
  • Date (date)
  • Title (title)
  • Body (body)
  • Article URL (url)

In the original livedoor News Corpus, the article URL serves as the ID. The version distributed by KandaSearch is a JSON file in which each article's file name serves as the ID.

The articles were collected in early September 2012 and cover livedoor news distributed from August 2009 to May 2012.

The data for the 'livedoor News Corpus' in the extension library consists of the following two types:

  • Livedoorニュースコーパス
  • Livedoorニュースコーパス(mini)

The 'Livedoorニュースコーパス(mini)' is a subset extracted from the 'Livedoorニュースコーパス' and kept under 2 megabytes in size, so that it can be registered from the KandaSearch admin panel.

The 'Livedoorニュースコーパス' (the larger-sized one) cannot be registered from the KandaSearch admin panel. The method for indexing will be explained in the 'Indexing API' section of the 'WebApp Development' chapter.

Downloading Corpus Data and Checking Content

The 'Livedoorニュースコーパス(mini)' data (in JSON format) is distributed as a KandaSearch extension. Let's walk through how to download it to your computer.

You can browse the extension library without logging in to your KandaSearch account.
To use an extension, log in to KandaSearch and add it to your project.

Downloading data from the extension to your computer

When logged into KandaSearch, click on the jigsaw puzzle icon at the top of the screen.

You will see a list of extensions in the library, so look for 'Livedoorニュースコーパス(mini)'.

Note: Extensions are grouped into several categories.
'Livedoorニュースコーパス(mini)' belongs to the 'DATA' category. Click the 'DATA' button at the top of the page to show only the 'DATA' extensions, which makes it easier to find.
To clear the filter, click '× CLEAR' or click the colored 'DATA' button again.

Next, click on 'DETAILS' inside 'Livedoorニュースコーパス(mini)'.
After clicking on 'ADD TO PROJECT' on the details page, click on the name of the project you want to add it to.
If you see 'Get this extension' on the details page, click on it, then log in to KandaSearch and add it to your project.

You can view the added extension under 'Extensions' on the left side of the project view for the project you selected.
In the list of extensions, find 'Livedoorニュースコーパス(mini)' and click 'DOWNLOAD' in its block.

If the selected extension has multiple versions, a list of them is displayed. Click the download icon at the right end of the row for the version you want (usually the latest) to download it to your computer.

Checking the contents of the JSON file

Before registering the documents, let's take a look at the contents of the JSON file. The JSON file contains multiple documents (livedoor news articles) stored in an array format.
When you extract the first document, it looks like this:

{
  "id":"dokujo-tsushin-4778030.txt",
  "url":"http://news.livedoor.com/article/detail/4778030/",
  "category":"dokujo-tsushin",
  "date":"2010-05-22T14:30:00Z",
  "title":"友人代表のスピーチ、独女はどうこなしている?",
  "body":[
    " もうすぐジューン・ブライドと呼ばれる6月。独女の中には自分の式はまだなのに呼ばれてばかり……という「お祝い貧乏」状態の人も多いのではないだろうか? さらに出席回数を重ねていくと、こんなお願いごとをされることも少なくない。",
    "",
    " 「お願いがあるんだけど……友人代表のスピーチ、やってくれないかな?」",
    "",
    " さてそんなとき、独女はどう対応したらいいか?",
    "",
    " 最近だとインターネット等で検索すれば友人代表スピーチ用の例文サイトがたくさん出てくるので、それらを参考にすれば、無難なものは誰でも作成できる。しかし由利さん(33歳)はネットを参考にして作成したものの「これで本当にいいのか不安でした。一人暮らしなので聞かせて感想をいってくれる人もいないし、かといって他の友人にわざわざ聞かせるのもどうかと思うし……」ということで活用したのが、なんとインターネットの悩み相談サイトに。そこに作成したスピーチ文を掲載し「これで大丈夫か添削してください」とメッセージを送ったというのである。",
    "",
    " 「一晩で3人位の人が添削してくれましたよ。ちなみに自分以外にもそういう人はたくさんいて、その相談サイトには同じように添削をお願いする投稿がいっぱいありました」(由利さん)。ためしに教えてもらったそのサイトをみてみると、確かに「結婚式のスピーチの添削お願いします」という投稿が1000件を超えるくらいあった。めでたい結婚式の影でこんなネットコミュニティがあったとは知らなかった。",
    "",
    " しかし「事前にお願いされるスピーチなら準備ができるしまだいいですよ。一番嫌なのは何といってもサプライズスピーチ!」と語るのは昨年だけで10万以上お祝いにかかったというお祝い貧乏独女の薫さん(35歳)",
    "",
    " 「私は基本的に人前で話すのが苦手なんですよ。だからいきなり指名されるとしどろもどろになって何もいえなくなる。そうすると自己嫌悪に陥って終わった後でもまったく楽しめなくなりますね」",
    " ",
    " サプライズスピーチのメリットとしては、準備していない状態なので、フランクな本音をしゃべってもらえるという楽しさがあるようだ。しかしそれも上手に対応できる人ならいいが、苦手な人の場合だと「フランク」ではなく「しどろもどろ」になる危険性大。ちなみにプロの司会者の場合、本当のサプライズではなく式の最中に「のちほどサプライズスピーチとしてご指名させていただきます」という一言があることも多いようだが、薫さん曰く「そんな何分前に言われても無理!」らしい。要は「サプライズを楽しめる」というタイプの人選が大切ということか。",
    "",
    " 一方「ありきたりじゃつまらないし、ネットで例文を検索している際に『こんな方法もあるのか!』って思って取り入れました」という幸恵さん(30歳)が行ったスピーチは「手紙形式のスピーチ」というもの。",
    "",
    " 「○○ちゃんへ みたいな感じで新婦の友人にお手紙を書いて読み上げるやり方です。これなら多少フランクな書き方でも大丈夫だし、何より暗記しないで堂々と読み上げることができますよね。読んだものはそのまま友人にあげれば一応記念にもなります」(幸恵さん)",
    "なるほど、確かにこれなら読みあげればいいだけなので、人前で話すのが苦手な人でも失敗しないかもしれない。",
    "",
    " 主役はあくまで新郎新婦ながらも、いざとなると緊張し、内容もあれこれ考えて、こっそりリハーサル……そんな人知れず頑張るスピーチ担当独女たちにも幸あれ(高山惠)"
  ]
}

We will use this data later.
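
Before registering anything, it can help to inspect the structure of the file with a short script. The following is a minimal sketch in Python, not part of KandaSearch. The sample string stands in for the downloaded file (its body lines are shortened placeholders, not the real article text), and the function name is made up for illustration.

```python
import json

# Stand-in for the downloaded file: a JSON array of documents.
# The body lines are shortened placeholders, not the real article text.
SAMPLE_JSON = """
[
  {
    "id": "dokujo-tsushin-4778030.txt",
    "url": "http://news.livedoor.com/article/detail/4778030/",
    "category": "dokujo-tsushin",
    "date": "2010-05-22T14:30:00Z",
    "title": "友人代表のスピーチ、独女はどうこなしている?",
    "body": ["first paragraph (placeholder)", "", "second paragraph (placeholder)"]
  }
]
"""


def summarize_documents(docs):
    """Return the document count and the shape of the first document."""
    if not isinstance(docs, list) or not docs:
        raise ValueError("expected a non-empty JSON array of documents")
    first = docs[0]
    return {
        "count": len(docs),
        "fields": sorted(first),  # field names of the first document
        "body_is_list": isinstance(first.get("body"), list),
    }


docs = json.loads(SAMPLE_JSON)
summary = summarize_documents(docs)
print(summary["count"], summary["fields"], summary["body_is_list"])
```

To inspect the real file, open it with encoding="utf-8" and pass the result of json.load to the same function. Note that 'body' is an array of strings, one per line of the article, which matters when we look at the schema.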

Adding a collection

A collection is a grouping of schema and index data. By adding a collection, it becomes possible to register and search documents.
In KandaSearch, you can add multiple collections to one instance.

Let's now proceed with adding a collection for the livedoor News Corpus.

Downloading the configuration from the extension to your computer

The configuration to be used when adding a collection is obtained from the extension library.

Click on the jigsaw puzzle icon at the top of the screen.
In the extension library, find 'Livedoorニュース configuration(Solr 9)' and click 'DETAILS' in its block.
Clicking 'SOLR COLLECTION CONFIG' among the category filter buttons displays only the configurations, which makes it easier to find.

Configurations are provided for each version of Apache Solr. Please select the configuration that matches the version of Apache Solr used in your instance.

Next, click on 'ADD TO PROJECT' on the details page, then select the project name you want to add it to.

Similar to when downloading the data, select 'Extensions' from the left side of the project view. Click on 'DOWNLOAD' in the 'Livedoorニュース configuration(Solr 9)' block that appears.

If there are multiple versions available, click the download icon on the right side of the appropriate version's row (usually the latest version is preferable) to download it to your computer.

Reviewing the contents of the configuration file

Before adding the collection, let's review the schema definition in the downloaded configuration.

Extract the downloaded ZIP file on your computer. This is only for checking the contents; when you add the collection, upload the original ZIP file as is.

When you open the extracted 'managed-schema' file, you'll see the following fields defined. (The definitions may vary depending on the version, and only an excerpt is shown here.)

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
<field name="date" type="date" indexed="true" stored="false"/>
<field name="url" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
<field name="title" type="text_ja" indexed="true" stored="true"/>
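
To see how these definitions relate to the JSON documents, the excerpt can be read with a script and compared against a document. This is a rough sketch under stated assumptions, not how Solr validates input: the helper names are hypothetical, the excerpt is wrapped in a root element only so that it parses, and a field with no multiValued attribute is treated as single-valued (in a real schema the default can come from the field type).

```python
import xml.etree.ElementTree as ET

# The field definitions from the excerpt above.
SCHEMA_EXCERPT = """
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
<field name="date" type="date" indexed="true" stored="false"/>
<field name="url" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
<field name="title" type="text_ja" indexed="true" stored="true"/>
"""


def load_fields(fragment):
    """Map each field name to its attributes."""
    # The excerpt has no single root element, so wrap it before parsing.
    root = ET.fromstring("<schema>" + fragment + "</schema>")
    return {field.get("name"): dict(field.attrib) for field in root.iter("field")}


def check_document(doc, fields):
    """Return a list of mismatches between a document and the field definitions."""
    problems = []
    for name, attrs in fields.items():
        if attrs.get("required") == "true" and name not in doc:
            problems.append(f"missing required field: {name}")
    for name, value in doc.items():
        if name not in fields:
            problems.append(f"field not in schema excerpt: {name}")
        # Assumption: no multiValued attribute means single-valued.
        elif isinstance(value, list) and fields[name].get("multiValued") != "true":
            problems.append(f"list value in single-valued field: {name}")
    return problems


fields = load_fields(SCHEMA_EXCERPT)
doc = {"id": "dokujo-tsushin-4778030.txt", "title": "sample title", "body": ["line 1", ""]}
print(check_document(doc, fields))
```

The array-valued 'body' passes because the field is declared with multiValued="true", while a document without 'id' would be reported because that field is marked required.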

In this schema, 'id' is the unique key field. It is configured in the same file as follows:

<uniqueKey>id</uniqueKey>

Like a primary key in a database, the unique key holds a string value that uniquely identifies a document.
The search engine itself does not require a unique key, but omitting it brings various limitations, such as being unable to update documents or use the highlighting feature, as mentioned later.
In KandaSearch, setting a unique key is mandatory.
In business applications a unique key is an implicit requirement, so it is best to assume that one must always be set.
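
The link between the unique key and document updates can be pictured with a toy model. The sketch below is only an illustration in Python, not Solr's implementation: the index is modeled as a dictionary keyed by the unique key, so registering a document whose key already exists replaces the earlier one. The function name and sample documents are hypothetical.

```python
def register_documents(index, docs, unique_key="id"):
    """Toy model: store each document under its unique key, replacing any earlier one."""
    for doc in docs:
        if unique_key not in doc:
            raise ValueError(f"document has no '{unique_key}' field: {doc!r}")
        index[doc[unique_key]] = doc  # same key: the newer document wins
    return index


index = {}
register_documents(index, [
    {"id": "a.txt", "title": "first version"},
    {"id": "b.txt", "title": "another article"},
])
# Registering "a.txt" again updates it instead of adding a second copy.
register_documents(index, [{"id": "a.txt", "title": "second version"}])

print(len(index), index["a.txt"]["title"])
```

Without a key there would be nothing to tell the model which stored document a new one is meant to replace, which is the intuition behind the update limitation described above.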

Adding a Collection

Once the preparation is complete, it's time to add the collection.
In the project overview, click on the instance name where you want to add the collection.

When the instance view appears, select 'Collections' from the left-side menu.
Click on '+ ADD A COLLECTION' in the collections screen.

Click on the 'IMPORT' tab in the 'Add a Collection' screen.
In the dotted outline box labeled 'Add a Collection,' drag and drop the downloaded config ZIP file, or click 'CHOOSE A File' to specify the file.
Specify a collection name (for this tutorial, let's use 'livedoornews') and click 'SAVE'.

The collection has been added and will be displayed in the list.

Document registration

Now let's register documents using the 'Livedoorニュースコーパス(mini)' data downloaded in the 'Downloading Corpus Data and Checking Content' section of this chapter.

Note
The file specified for document registration must be uncompressed, but downloaded data may be compressed. In that case, extract it on your computer beforehand. The 'Livedoorニュースコーパス(mini)' extension is downloaded as an uncompressed JSON file, so no extraction is needed.

Move to the Instance Overview.
Click on 'REGISTER DOCUMENTS' on the screen.

On the 'Upload a file' screen, select the collection 'livedoornews'.
Next, select the downloaded 'Livedoorニュースコーパス(mini)' JSON file on your computer, then click on 'Index Documents'.

When 'Documents successfully indexed.' is displayed, the document registration (indexing) is complete.

The maximum file size limit for documents that can be registered through KandaSearch's UI is 2 megabytes.
If you need to register documents with sizes exceeding this limit, you will need to use the API.
We will explain this method in the 'WebApp Development' chapter.
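
For your own data, it can be useful to know in advance how a document set relates to this limit. The following Python sketch groups documents into batches whose serialized JSON stays under a byte budget. It rests on assumptions: "2 megabytes" is taken to mean 2 * 1024 * 1024 bytes, the function name is hypothetical, and whether several smaller files are a practical alternative to the API for your data is something to verify yourself.

```python
import json

# Assumption: the UI limit of "2 megabytes" is interpreted as 2 MiB.
MAX_UPLOAD_BYTES = 2 * 1024 * 1024


def encode(obj):
    """Serialize compactly as UTF-8 so that sizes are predictable."""
    return json.dumps(obj, ensure_ascii=False, separators=(",", ":")).encode("utf-8")


def split_into_batches(docs, max_bytes=MAX_UPLOAD_BYTES):
    """Group documents so that each batch, encoded as a JSON array, fits in max_bytes."""
    batches, current = [], []
    current_size = 2  # the enclosing "[" and "]"
    for doc in docs:
        doc_size = len(encode(doc)) + 1  # +1 for the separating comma (upper bound)
        if doc_size + 2 > max_bytes:
            raise ValueError("a single document exceeds the size budget")
        if current and current_size + doc_size > max_bytes:
            batches.append(current)
            current, current_size = [], 2
        current.append(doc)
        current_size += doc_size
    if current:
        batches.append(current)
    return batches


# Ten small placeholder documents and a deliberately tiny budget for the demo.
docs = [{"id": str(i), "body": "x" * 100} for i in range(10)]
batches = split_into_batches(docs, max_bytes=300)
print(len(batches), [len(batch) for batch in batches])
```

Each batch can then be written to its own file with encode(batch). The size accounting slightly overestimates, so a batch never exceeds the budget.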

When an error occurs

If an error occurs, investigate the cause using the following methods:

  • Check the operating status and instance storage usage in the Instance Overview.
  • Review various log files from "File Manager" in the Instance View.
  • Access Solr Admin to check the instance status on the Dashboard or review Logging.
