Web Page Crawler

Here, we will explain how to create (add) and manage a web crawler.

What is a Web Page Crawler?

To build a website search system, an index of the content within the website, such as text, is necessary.
A 'Web Page Crawler' is an application that collects data such as text and images from a website, applies appropriate processing such as removing HTML tags, and then registers (indexes) the data into a collection.

KandaSearch provides functionality to create, manage, and execute web crawlers with just a few clicks from the interface. This is what KandaSearch calls the 'Web Crawler'.
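
For illustration only, here is a minimal Python sketch of the three steps described above (collect, process, index). The URLs, field names, and update endpoint are hypothetical examples and do not represent how KandaSearch's web crawler is actually implemented.

import re
import requests

def fetch_and_index(url, solr_update_url):
    # 1. Collect: download the page.
    html = requests.get(url, timeout=10).text

    # 2. Process: drop script/style blocks, strip the remaining HTML tags,
    #    and collapse whitespace.
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    # 3. Register (index): send the document to the collection's update handler.
    doc = {"id": url, "url": url, "body": text}
    requests.post(solr_update_url, json=[doc], params={"commit": "true"}, timeout=10)

# Hypothetical Solr endpoint for a collection prepared in advance.
fetch_and_index("https://example.com/", "http://localhost:8983/solr/my-collection/update")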

Terms of Use for Crawlers

Here are the terms and conditions for using KandaSearch's 'Web Crawler':

  • Prepare a collection in advance for registering (indexing) the crawled data. Sample configurations for collection creation are available in the Extension Library.
  • Supported file types: HTML, PDF, Word, Excel, and PowerPoint.
  • With the Standard plan or higher, you can create one crawler job for free.
  • The number of seeds that can be set varies depending on your plan and whether the web crawler is free or paid. (Seeds refer to the URLs that serve as the starting point for crawling.)
  • Additional jobs beyond the first one require the purchase of paid crawl jobs.
  • Daily scheduling is available for paid crawl jobs.
  • Regardless of whether the job is free or paid, you cannot create crawl jobs that specify a collection on the Community plan.
  • The web crawler consults the robots.txt of the target website (see the example after this list).
  • If IP address information for the web crawler is required, please contact us.
  • Free web crawlers are executed on the shared cluster within KandaSearch.
  • Paid web crawlers are executed on a dedicated instance for the user.
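
As a rough illustration of what consulting robots.txt means in practice, the following Python sketch uses the standard library's robot parser to decide whether a URL may be fetched. The user agent name and URLs are hypothetical; they do not reflect the actual crawler's behavior.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# 'my-crawler' is a hypothetical user agent name.
for url in ["https://example.com/blog/article-1.html", "https://example.com/admin/"]:
    allowed = rp.can_fetch("my-crawler", url)
    print(url, "-> crawl" if allowed else "-> skipped (disallowed by robots.txt)")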

Creating a Crawler Job

Web crawlers are created and managed in units called crawler jobs.
Here's how to create a crawler job:

  1. Log in to KandaSearch.
  2. From the project list screen, click on the name of the project you want to work on. (To display the list of projects, click on the KandaSearch logo in the upper left corner of the screen.)
  3. From the left-side menu in the project view, click on 'Web Crawler'.
  4. Once the crawler job screen appears, click on '+ ADD CRAWLER JOB'.

The remaining steps are explained below, one screen at a time.

1. Type Settings

Select the crawl job type from the 'Crawl Job Type' combo box. Paid job types are also available for selection.

2. Job Settings

Set the following items for the job:

  • Job Name: Specify a name to identify the job. Names can be in Japanese. This field is required.
  • User: Specify the managing user for the job. This field is required.
  • Job Description: Enter a description for the job.
  • Seeds: Specify the seed (the URL that serves as the starting point for crawling). This field is required.
  • Additional Seeds: Specify URLs that cannot be reached from the seed. They must start with the domain string specified in the seed. For example, if the seed is set to https://example.com/blog, you can specify https://example.com/doc. Specify one seed per line.
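
To sanity-check the rule that additional seeds must start with the domain string specified in the seed, a small sketch like the following can help. It assumes the rule amounts to sharing the seed's scheme and domain, and reuses the example URLs above.

from urllib.parse import urlparse

def same_origin(seed, candidate):
    s, c = urlparse(seed), urlparse(candidate)
    return (s.scheme, s.netloc) == (c.scheme, c.netloc)

seed = "https://example.com/blog"
additional_seeds = ["https://example.com/doc", "https://other.example.org/doc"]

for url in additional_seeds:
    print(url, "OK" if same_origin(seed, url) else "NG: does not match the seed's domain")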

3. Additional Settings

Set the 'Include Items' and 'Exclude Items' for crawling and indexing.

Include in Crawl - Specify the files to include in the crawl using regular expressions. This field is required.

Example Settings

.*

Include in Index - Specify the files to include in the index using regular expressions. This field is required.

Example Settings

\.html$
\.htm$
\.pdf$
\.PDF$

Exclude From Crawl - Specify the files to exclude from the crawl using regular expressions.

Example Settings

\.tar\.gz$
\.png$
\.jpeg$
\.jpg$
\.zip$
\.gif$
\.css$
\.php$

Exclude From Index - Specify the files to exclude from the index using regular expressions.

Example Settings

https://example.com/category/.*

Exclude Content From Index - Specify the content to exclude from the index using regular expressions. Content that matches the regular expressions specified here will not be indexed.
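
To see how these patterns work together, the following Python sketch applies include/exclude expressions modeled on the examples above to a few hypothetical URLs. The exact matching behavior of the crawler may differ in detail; this only illustrates the idea.

import re

include_in_crawl   = [r".*"]
include_in_index   = [r"\.html$", r"\.htm$", r"\.pdf$", r"\.PDF$"]
exclude_from_crawl = [r"\.png$", r"\.jpg$", r"\.zip$", r"\.css$"]
exclude_from_index = [r"https://example\.com/category/.*"]

def matches_any(patterns, url):
    return any(re.search(p, url) for p in patterns)

for url in ["https://example.com/blog/post-1.html",
            "https://example.com/images/logo.png",
            "https://example.com/category/news.html",
            "https://example.com/manual.pdf"]:
    crawl = matches_any(include_in_crawl, url) and not matches_any(exclude_from_crawl, url)
    index = crawl and matches_any(include_in_index, url) and not matches_any(exclude_from_index, url)
    print(f"{url}  crawl={crawl}  index={index}")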

4. Solr Settings

Here are the settings related to Solr:

  • Instance Name: Select the instance for indexing. This field is required.
  • Collection Name: Choose the collection for indexing. This field is required.
  • Unique Key: The unique key is automatically specified. This field is required.
  • Update Handler: Specify the update handler. This field is required.
  • Remove Handler: Specify the handler for removing pages from the index if they were present in previous crawls but are not in the current one. This field is required.
  • Status Handler: Specify the handler to be used for status checks. This field is required.
  • Commit Settings: Set the document commit time (ms), and specify whether to commit after every job.
  • Argument Settings: You can specify multiple sets of argument names and values.

Setting Example for Argument Settings
By setting 'literal.field_name', you can store a fixed value in the named field. This can be used for faceting and filtering during searches.

literal.category=homepage
literal.author=<company_name etc.>
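
As an illustration of how such fixed values can be used for filtering and faceting at search time, the sketch below issues a Solr select request. The Solr URL and collection name are assumptions; the 'category' and 'author' field names follow the example above.

import requests

params = {
    "q": "*:*",
    "fq": "category:homepage",   # filter on the fixed value set via literal.category
    "facet": "true",
    "facet.field": "author",     # facet on the fixed value set via literal.author
    "wt": "json",
}
# Hypothetical Solr URL and collection name.
resp = requests.get("http://localhost:8983/solr/my-collection/select", params=params, timeout=10)
print(resp.json()["response"]["numFound"])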

5. Schedule Settings

Specify the execution schedule of the job by choosing from 'Manual', 'Daily' (paid jobs only), 'Weekly', or 'Monthly', and set the corresponding conditions.

6. Confirmation

Here you'll see the settings you've made so far. Confirm that everything is correct, then click 'CREATE CRAWLER JOB'.
Upon successful creation, the crawler job will be added to the list of crawler jobs.

Management of Crawler Jobs

In the list of crawler jobs, you can manage each job as follows:

  • Job Information: Displays job name, creator, seed, instance logo, collection name, and status.
  • Run job (PLAY icon): Clicking the job execution icon will immediately run the crawler job. While running, the status will be 'Job Processing'. During job execution, this icon will change to 'Stop' (STOP icon), and clicking it will immediately stop the job.
  • View job details: Displays detailed information about the crawler job.
  • View job status: Displays the status of the crawler job, allowing you to check its progress. Clicking 'Reload' within the 'Crawler Job Status' screen will update to the latest status.
  • Edit job: Allows editing of the crawler job.
  • Reset job: Performs a reset so that the index can be recreated regardless of whether data at the crawl destination has been updated. The actual reindexing occurs during the next crawler job execution.
  • Delete job: Allows deletion of the crawler job.

Notes

  • When deleting a paid crawler job, no refunds are provided. To create a new job, you'll need to purchase a new one. If you wish to run it as another job without purchasing a new one, you can edit and use an existing crawler job.
  • Crawler jobs store the addresses (URLs) of crawled pages in a database. Once a page has been successfully indexed, it won't be reindexed unless the page is updated. You can clear the information stored in the database by 'Reset job'. This allows you to forcibly reindex pages, even if they haven't been updated, during the next job run.
  • If you're unable to delete a job due to an error, try rerunning and stopping the job again. This might temporarily resolve the error and allow you to delete the job.

When a Crawler Job Encounters an Error

Here are some troubleshooting tips for when a crawler job encounters an error.

If the crawl is failing because of a problem on the target side (the system being crawled), the issue may be temporary, so waiting a while and then retrying may resolve the error.

If the issue is not on the target side, there might be errors occurring on the Solr side causing the crawl to fail. In such cases, download the Solr logs from 'File Manager' and investigate.
Example of Solr logs:
Couldn't get multipart parts in order to delete them => java.lang.IllegalStateException: No multipart config for servlet
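
When investigating a downloaded log, a simple scan for error lines is often enough to locate the cause. A minimal sketch, assuming the log has been saved locally as 'solr.log' (the file name is an assumption; use the file downloaded via 'File Manager'):

with open("solr.log", encoding="utf-8", errors="replace") as f:
    for line_no, line in enumerate(f, start=1):
        if "ERROR" in line or "Exception" in line:
            print(f"{line_no}: {line.rstrip()}")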

The standard processing of the web crawler, or custom scraping plugins, may attempt to index file types that are not supported. By first limiting the indexing targets to specific types with 'Include in Index' patterns such as \.html$ and \.pdf$, and gradually expanding from there, you can manage the indexing process more effectively.

Utilizing Plugins for Crawlers

Processing that cannot be handled by the web crawler's standard scraping, such as removing JavaScript embedded directly in the HTML, can be achieved by creating a plugin and specifying it in the update handler.
Additionally, utilizing plugins allows for fine-tuning, such as weighting the text within <h1/> tags.
Sample plugins and configurations for experiencing these functionalities are available in the Extension Library.
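
The plugin API itself is not covered here, but as a rough sketch of the kind of processing such a plugin might perform (removing embedded JavaScript and extracting <h1/> text so it can be given a higher weight), without using the actual plugin interface:

import re

def preprocess(html):
    # Remove JavaScript embedded directly in the HTML.
    html = re.sub(r"(?is)<script.*?</script>", " ", html)
    # Collect <h1> text separately so it could be indexed into a field with a higher weight.
    h1_texts = [t.strip() for t in re.findall(r"(?is)<h1[^>]*>(.*?)</h1>", html)]
    # Strip the remaining tags for the body text.
    body = re.sub(r"(?s)<[^>]+>", " ", html)
    body = re.sub(r"\s+", " ", body).strip()
    return {"h1": h1_texts, "body": body}

print(preprocess("<h1>Title</h1><script>alert('x')</script><p>Body text.</p>"))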