Fine-tuning a multi-language model for information retrieval using the Solr Manual

Author: Elpidio Gonzalez Valbuena

Posted: April 24, 2023 / Updated: August 5, 2023

Category: Semantic

Summary

We fine-tuned a pre-trained multilingual SBERT model on the Solr Manual to test the feasibility of a semantic search engine capable of retrieving documents written in a particular language, in this case English, using non-English queries.

To experience the demo, please visit https://demo.rondhuit.com/en/solr-manual

1. Semantic Search

Semantic search is a data searching technique that aims to understand the meaning and context of words and phrases in a search query, as opposed to just matching keywords. This means that it can provide more relevant and accurate results to a user's query, even if the search terms used aren't an exact match to the content being searched.

Traditional search engines use keyword-based search algorithms, where the search engine looks for the exact words and phrases that the user has entered in their search query. This approach is limited because it doesn't take into account the context or intent behind the search query, leading to inaccurate results or irrelevant content being displayed.

Semantic search, on the other hand, leverages document representation as text embeddings, as opposed to conventional inverted indexes in traditional search engines. Text embeddings are numerical representations of text data. They are created using natural language processing (NLP) techniques, and are designed to capture the meaning and context of words and phrases in text. This way, semantic search can analyze the relationships between words and identify synonyms, related terms, and concepts related to the search query.

At search time, a user query is embedded into the same vector space as the documents, and the closest embeddings from the target corpus are returned. These entries should have a high semantic overlap with the query.
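As a rough illustration of this retrieval flow (not the exact code behind the demo), the snippet below embeds a tiny placeholder corpus and a Japanese query with the SentenceTransformers library and returns the closest documents by cosine similarity:

```python
# Minimal sketch of embedding-based retrieval; the corpus and query strings
# are placeholders, not actual Solr manual content.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

corpus = [
    "The Extended DisMax query parser supports a simplified syntax for user queries.",
    "SolrCloud distributes indexes and queries across a cluster of nodes.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query into the same vector space and return the closest entries.
query = "dismaxクエリパーサーの使い方"  # a Japanese query against English documents
query_embedding = model.encode(query, convert_to_tensor=True)
for hit in util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]:
    print(round(hit["score"], 3), corpus[hit["corpus_id"]])
```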

2. Pretrained Models

Pretrained text models are deep learning models that have been pre-trained on large amounts of text data to learn the underlying patterns and relationships between words and phrases. These models are trained using techniques such as neural networks, and are designed to capture the meaning and context of text data. As such, they are the perfect tool to calculate a document's text embeddings. An example of such models is BERT.

A modification of BERT that has been designed specifically for sentence embeddings is called SBERT. It is trained to create high-quality sentence embeddings that can be used for a variety of NLP tasks, such as similarity search, clustering, and question answering.

SBERT uses a technique called siamese and triplet networks to learn to create sentence embeddings. These techniques allow SBERT to compare the similarity between two or more sentences and produce embeddings that capture their semantic meaning. This sounds a lot like what we do in semantic search: compare the similarity of a query intent and a document's meaning.

There are several reasons why people use pretrained models instead of creating their own models from scratch:

  • Time and Resource Savings: Pretrained models have already been trained on large amounts of data and fine-tuned. This saves a significant amount of time and resources, as creating a new model from scratch requires extensive data preparation, hyperparameter tuning, and training time.

  • High Performance: Pretrained models are typically state-of-the-art models that have been extensively evaluated and tested. This means that they have high performance and accuracy for a variety of tasks, and can be used as a baseline for comparison when developing new models.

  • Transfer Learning: Pretrained models can be fine-tuned on new data for specific NLP tasks, allowing for transfer learning. This means that the model can learn to perform a new task with less data and training time than if it were trained from scratch.

  • Accessible and Open Source: Many pretrained models are publicly available and open source, making them easily accessible to researchers and developers. This promotes collaboration and innovation in the NLP community, and allows for the development of new applications and use cases.

2.1 Multilingual pre-trained model

As we mentioned before, pre-trained models offer significant time and resource savings, high performance, and the ability to fine-tune for specific tasks with transfer learning. This is why, for our particular task of searching the Solr manual in several languages (or at the very least Japanese and English), we use sentence-transformers/paraphrase-multilingual-mpnet-base-v2, a pre-trained multilingual model from Hugging Face.

This particular model was trained using over a dozen datasets of parallel data (multiple languages), notably WikiMatrix, a dataset of 135M Parallel Sentences in 1620 Language Pairs from Wikipedia. The authors only used pairs above a certain score, as pairs below that threshold were often of bad quality.

This model maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The model's card is presented below:

Feature                   Value
Max Sequence Length       128
Dimensions                768
Normalized Embeddings     false
Suitable Score Functions  cosine-similarity (util.cos_sim)
Size                      970 MB
Pooling                   Mean Pooling
Training Data             Multilingual model of paraphrase-mpnet-base-v2, extended to 50+ languages
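A quick sanity check of the properties listed above can be done directly with the library (illustrative snippet, not part of the demo code):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
print(model.get_max_seq_length())                        # 128
print(model.encode("Solr is a search platform.").shape)  # (768,)
```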

While this model was trained for symmetric semantic search, meaning the query and the entries in the corpus are of roughly the same length, we will use it for asymmetric semantic search, in which the embeddings of short sentences are expected to match those of whole documents.

3. Solr Manual

The Solr manual is a comprehensive guide that provides detailed information on how to install, configure, use, and maintain the Solr search platform. It is the official Solr documentation, written and published by Solr committers, presented in adoc format.

The AsciiDoc (adoc) format is a lightweight markup language that is used to write technical documentation, such as software documentation, manuals, and books. It is similar to other markup languages like Markdown and reStructuredText, but it offers more features and flexibility.

Some of the features of the AsciiDoc format include:

  • Headings and subheadings
  • Lists and tables
  • Links and cross-references
  • Images and diagrams
  • Source code blocks with syntax highlighting

3.1 Cleaning the documents

Since the Solr Manual heavily utilizes the above features, it was necessary to clean up the syntax overhead; this way we can produce a better embedding representation of the manual's sections. We used regular expressions to remove code snippets, hyperlinks and images. Also, most of the documents include a license notice that needed to be removed to reduce redundancy. Tables provide some useful information, and the noise they introduce is manageable, so they were kept as they are.
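The exact patterns are not reproduced here, but the clean-up step can be sketched along these lines; the regular expressions below are assumptions that approximate the rules described above, and would need tuning against the real adoc sources:

```python
import re

# Approximate clean-up rules; the real patterns used for the demo may differ.
COMMENT_BLOCK = re.compile(r"(?:^//.*\n)+", re.MULTILINE)                       # license header / comments
SOURCE_BLOCK  = re.compile(r"\[source[^\]]*\]\s*\n----\n.*?\n----", re.DOTALL)  # code snippets
IMAGE_MACRO   = re.compile(r"image::?\S+\[[^\]]*\]")                            # images
XREF          = re.compile(r"<<([^,>]+)(?:,([^>]*))?>>")                        # cross-references
LINK_MACRO    = re.compile(r"(?:link:|https?://)\S+?\[([^\]]*)\]")              # hyperlinks, keep the label

def clean_adoc(text: str) -> str:
    text = COMMENT_BLOCK.sub("", text)
    text = SOURCE_BLOCK.sub("", text)
    text = IMAGE_MACRO.sub("", text)
    text = XREF.sub(lambda m: (m.group(2) or m.group(1)).strip(), text)
    text = LINK_MACRO.sub(r"\1", text)
    return text
```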

The headings and subheadings were used to create sub-documents: subheadings up to depth 3 were considered independent documents. Any subheadings below that were considered part of the closest subheading. For example, the root document 'Field Type Definitions and Properties' would be split into:

  • Field Type Definitions and Properties / Field Type Definitions in the Schema
  • Field Type Definitions and Properties / Field Type Properties / General Properties

...

  • Field Type Definitions and Properties / Choosing Appropriate Numeric Types

...

Notice that General Properties is considered a depth 3 subheading and any subheadings under it will be considered part of that document instead of individual ones. This data processing resulted in a total of 2,972 documents.
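A sketch of this splitting step, assuming AsciiDoc headings of the form "== Title", "=== Title", and so on; the depth bookkeeping below is an approximation of what we described, not the code used for the demo:

```python
import re

HEADING = re.compile(r"^(={2,})\s+(.*)$")   # "== Title", "=== Title", ...
MAX_DEPTH = 3                               # deeper headings stay in the current sub-document

def split_document(adoc_text: str, root_title: str):
    docs = []                  # (title path, body) tuples
    titles = {0: root_title}   # current heading title at each depth
    body = []

    def flush():
        if body:
            path = " / ".join(titles[d] for d in sorted(titles))
            docs.append((path, "\n".join(body).strip()))
            body.clear()

    for line in adoc_text.splitlines():
        m = HEADING.match(line)
        depth = len(m.group(1)) - 1 if m else None   # "==" -> depth 1
        if m and depth <= MAX_DEPTH:
            flush()                                  # close the previous sub-document
            titles = {d: t for d, t in titles.items() if d < depth}
            titles[depth] = m.group(2).strip()
        else:
            body.append(line)                        # deeper headings and text stay inside
    flush()
    return docs
```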

3.2 Faceting

To provide a good search experience for the user, allowing them to refine and get an overview of their search results, we wanted to introduce facets. We used the known words extractor included in Rondhuit's Solr plugin to fill the solr_class facet: we looked for class names, extracted from the Solr JAR files, inside all the manual documents.
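The snippet below is not the Rondhuit known words extractor itself; it just illustrates the idea of collecting class names from the Solr JAR files and tagging each manual document with the names it mentions, which can then be indexed into the solr_class facet.

```python
import zipfile
from pathlib import Path

def solr_class_names(jar_dir: str) -> set:
    """Collect top-level class names from all JARs in a directory."""
    names = set()
    for jar in Path(jar_dir).glob("*.jar"):
        with zipfile.ZipFile(jar) as zf:
            for entry in zf.namelist():
                if entry.endswith(".class") and "$" not in entry:
                    names.add(Path(entry).stem)   # e.g. "ExtendedDismaxQParser"
    return names

def solr_class_facets(document_text: str, class_names: set) -> list:
    """Return the class names mentioned in a manual document."""
    return sorted(name for name in class_names if name in document_text)
```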

4. Fine Tuning

Most of the time using a pre-trained model out of the box provides good enough performance. Depending on the purpose of our application we can use the model as it is. However, a fine-tuned model more often than not yields better results for a specific task, since it becomes customized for a distinct context.

A sentence-transformer model can easily be fine-tuned for the task of semantic search on a specific corpus using a dataset consisting of (query, relevant passage) pairs. Enterprises very often use this approach, since they have access to query logs and user interactions with the search results. However, this information is not available when building something from scratch, as in our case. What we do have are the contents of the Solr manual itself, so we need to devise an unsupervised approach to fine-tune our model on our dataset.

4.1 Query Generation

An interesting approach is presented in BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models: the authors propose using synthetic queries as an unsupervised domain-adaptation approach for dense retrieval models. First, they fine-tuned a T5-base model to generate queries given a passage. Then, for retrieval, they fine-tuned the SBERT model on the synthetic query and document pairs.

We follow a similar approach to achieve our goal. We take the contents of a page from the Solr manual and use a model to create possible user queries that would match that content.

If we assume that queries are similar to the actual document titles, that is, users might search for 'Edismax parameters' when expecting to see that particular manual page, then we can use a summarization or title-generation model on our content to produce synthetic queries.

For the task of generating the synthetic queries we used snrspeaks/t5-one-line-summary, a T5 model trained on 370,000 research papers that generates a one-line summary from a paper's description/abstract. We generated 4 queries per document and included the original document's title as the fifth one.
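Roughly, the generation step looks like the sketch below; the "summarize:" prompt prefix and the sampling settings are assumptions, so check the model card of snrspeaks/t5-one-line-summary for the recommended usage:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("snrspeaks/t5-one-line-summary")
model = AutoModelForSeq2SeqLM.from_pretrained("snrspeaks/t5-one-line-summary")

def generate_queries(passage: str, title: str, n: int = 4) -> list:
    # Generate n candidate one-line summaries of the passage to use as queries.
    inputs = tokenizer("summarize: " + passage, return_tensors="pt",
                       truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_length=64,
                             do_sample=True, top_p=0.95, num_return_sequences=n)
    queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return queries + [title]   # the document title becomes the fifth query
```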

What we are trying to achieve is to have the information present in the paragraphs synthesized and represented as titles/queries, and then use these knowledge tuples to fine-tune an SBERT model that will capture the semantic and syntactic mapping between them.

4.2 Translation

The original pre-trained model is a multilingual model. We generated English titles from English passages, so we still needed to create knowledge tuples in our target languages. One possibility would be to create combinations of titles and passages in multiple languages; however, translation is a costly operation, and for the purpose of this demo we decided to keep things simple: we set our target languages to English and Japanese, and the target passages are in English only.

We translated each of the synthetic queries to Japanese and matched them with their corresponding manual page. We also kept the original English queries, ending up with 10 queries per document, for a grand total of 29,720 training examples.

To translate the queries we first tried the Google Translate API; however, REST API transactions introduce overhead, and the operation proved to be costly even for the relatively small number of short titles. Moreover, an error message occurred after several calls to the API. Instead, we used yet another pre-trained model to do the machine translation: facebook/nllb-200-distilled-600M. The results were not as accurate as Google's API, but it was a trade-off we needed to make.
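A sketch of the translation step (not the exact script we ran) using the transformers translation pipeline; the NLLB language codes (eng_Latn, jpn_Jpan) are the main thing to get right:

```python
from transformers import pipeline

# English-to-Japanese translation with the distilled NLLB-200 model.
translator = pipeline("translation",
                      model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn", tgt_lang="jpn_Jpan")

english_queries = ["Edismax parameters", "Defining field types in the schema"]
japanese_queries = [out["translation_text"]
                    for out in translator(english_queries, max_length=64)]
print(japanese_queries)
```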

4.3 Fine-tuning a Bi-encoder model

SBERT is a siamese bi-encoder using mean pooling for encoding and cosine-similarity for retrieval. One advantage of the bi-encoder architecture is that it allows for efficient retrieval of similar text inputs. Instead of comparing each input to every other input, a bi-encoder can encode all inputs into vector representations and then compare them in an efficient manner.

Throughout our whole experiment we used the SentenceTransformers library. This library was designed in such a way that fine-tuning your own text embeddings models is very easy. Please take a look at How to train or fine-tune a Sentence Transformer model to read about the process in detail. Code samples and companion notebooks are provided as well.

4.3.1 Loss Function: Multiple Negatives Ranking Loss

The loss function plays a critical role when fine-tuning a model. It determines how well our embedding model will work for the specific downstream task. Sadly there is no “one size fits all” loss function. Which loss function is suitable depends on the available training data and on the target task.

Multiple Negatives Ranking Loss, as defined by the SentenceTransformers library, is a great loss function if you only have positive pairs, for example, only pairs of similar texts like pairs of paraphrases, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language).

This loss expects as input a batch consisting of sentence pairs (a1, p1), (a2, p2), …, (an, pn), where we assume that (ai, pi) is a positive pair and (ai, pj) for i != j is a negative pair. For each ai, it uses all other pj as negative samples, i.e., for ai we have 1 positive example (pi) and n-1 negative examples (pj). It then minimizes the negative log-likelihood of the softmax-normalized scores.

This loss function works great for training embeddings for retrieval setups where you have positive pairs (e.g. (query, relevant_doc)), as it will sample n-1 negative docs randomly in each batch. Performance usually increases with larger batch sizes. You can also provide one or multiple hard negatives per anchor-positive pair by structuring the data like this: (a1, p1, n1), (a2, p2, n2). Here, n1 is a hard negative for (a1, p1). For the pair (ai, pi), the loss will use all pj (j != i) and all nj as negatives.
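Putting it together, fine-tuning with this loss in the SentenceTransformers library looks roughly like the sketch below; the (query, passage) pairs stand in for the synthetic-query/manual-section tuples built earlier, and the batch size and warm-up values are illustrative (the actual run used 20 epochs on Google Colab):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# One InputExample per (synthetic query, manual section) positive pair.
train_examples = [
    InputExample(texts=["Edismax parameters", "The Extended DisMax query parser ..."]),
    InputExample(texts=["スキーマのフィールドタイプ定義", "Field Type Definitions and Properties ..."]),
]

# Larger batches mean more in-batch negatives, which usually improves results.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=20, warmup_steps=100,
          output_path="solr-manual-sbert")
```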

We'd like to note that we don't use CosineSimilarityLoss: although the easiest way to fine-tune a model is with knowledge tuples annotated with a score indicating their similarity on a scale of 0 to 1, we don't have labeled scores. MultipleNegativesRankingLoss is much more intuitive and produces significantly better sentence representations.

Overall, fine-tuning a bi-encoder with multiple negatives ranking loss involves defining a suitable objective function that encourages the bi-encoder to rank relevant examples higher than irrelevant ones.

5. Evaluation

To evaluate our model we use the InformationRetrievalEvaluator in the SentenceTransformers library. It is a class that provides a way to evaluate the performance of a model on an information retrieval task. It computes standard IR metrics such as recall, precision, and mean average precision (MAP). Here's how the InformationRetrievalEvaluator works:

  • Input Data: The input to the InformationRetrievalEvaluator is a set of queries, a corpus of documents, and the relevant documents for each query. Each document and each query is represented by a text string.

  • Model Inference: The model is used to encode the queries and documents into fixed-length vector representations.

  • Similarity Scoring: The vector representations of the queries and documents are compared using a similarity function such as cosine similarity or dot product. The similarity scores are used to rank the documents for each query in descending order of relevance.

  • Metrics Calculation: The ranked list of documents for each query is compared to the ground truth relevance labels to calculate metrics such as recall, precision, and MAP.

  • Output: The output of the InformationRetrievalEvaluator is a dictionary containing the computed metrics, including recall, precision, and MAP.
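In code, running the evaluator looks roughly like this; the IDs and texts are placeholders, and the fine-tuned model path is an assumption:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("solr-manual-sbert")   # path to the fine-tuned model

queries = {"q1": "Edismax parameters", "q2": "Edismaxのパラメーター"}
corpus = {"d1": "The Extended DisMax (eDisMax) query parser ...",
          "d2": "Field Type Definitions and Properties ..."}
relevant_docs = {"q1": {"d1"}, "q2": {"d1"}}       # ground-truth relevance labels

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="solr-manual")
print(evaluator(model))                            # reports recall, precision, MRR, NDCG, MAP
```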

The results obtained by our fine-tuned model are as follows:

Metric          Value     Metric        Value
Accuracy @ 1    0.890     Recall @ 1    0.890
Accuracy @ 3    0.993     Recall @ 3    0.993
Accuracy @ 5    0.999     Recall @ 5    0.999
Accuracy @ 10   0.999     Recall @ 10   0.999
Precision @ 1   0.890     NDCG @ 10     0.956
Precision @ 3   0.331     MRR @ 10      0.941
Precision @ 5   0.199     MAP @ 100     0.941
Precision @ 10  0.099

The results above look very good; however, it is worth mentioning that the evaluation was run on the entire corpus, that is, we didn't work with a train/test split. There are a couple of reasons we decided to proceed this way:

  1. The manual will not change drastically over time. The existing documents might undergo small amendments, but we can consider them pretty much static. New documents will definitely be added over time, but...

  2. Re-training the model using new data is not a computationally expensive operation. We can re-use the generated queries and generate new ones only for any new additions to the corpus. Given that training can be finished in under 5 hours (20 epochs on Google Colab), we can update the model every time a new Solr version comes out.

Basically, we don't care about overfitting the model (high variance, low bias), because the model will never be asked to handle unseen data.

6. Improvements

Some things we can consider to improve our model:

  1. Symmetric vs. Asymmetric Search Models: Carefully choosing a pre-trained model for our task. Models tuned for cosine-similarity will prefer the retrieval of short documents, while models tuned for dot-product will prefer the retrieval of longer documents. Depending on the task, models of one or the other type are preferable. We didn't give much thought to this step of the process, as our application was just intended as a demo.

  2. Better Synthetic Queries: Finding a more suitable model, or fine-tuning our own, in order to produce queries more similar to the real user input we expect. The same applies to the translation of the existing generated queries: better translations will most likely yield better search results in the end.

  3. Real User Queries: Activity logs are paramount for smart information retrieval systems. By logging real user queries, impressions (search results) and interactions (clicks), we can infer query-document knowledge pairs that would allow us to scrap synthetic queries altogether and create a real-life dataset to perform supervised training of our model.

References

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks https://arxiv.org/abs/1908.10084

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation https://arxiv.org/abs/2004.09813

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia https://arxiv.org/abs/1907.05791

Synthetic Query Generation https://arxiv.org/abs/2104.08663

Multiple Negatives Ranking Loss https://arxiv.org/pdf/1705.00652.pdf

How to use BERT for finding similar sentences or similar news? https://github.com/huggingface/transformers/issues/876
