Recoll semantic searches


The Recoll Python API gives easy access to the text contents of all the documents in the index,
which makes it well suited as input for a language model. There are two main directions for using
this: Retrieval Augmented Generation (RAG), where a generative LLM's output is informed by the
result documents from another (probably keyword-based) search, and semantic queries, where we look
for documents matching the concepts in the query, not necessarily its exact terms.

RAG operations are centered on the LLM interface and probably need little from Recoll apart from
getting data from the index.
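
As a minimal sketch of that access (assuming the default configuration directory, and using the
documented recoll and rclextract Python modules):

    from recoll import recoll, rclextract

    # Open the index for the default configuration (~/.recoll).
    db = recoll.connect()
    q = db.query()
    q.execute("mime:*")   # match every indexed document
    for _ in range(q.rowcount):
        doc = q.fetchone()
        # Run the appropriate input handler to recover the document text.
        text = rclextract.Extractor(doc).textextract(doc.ipath).text
        print(doc.url, len(text))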

The Recoll source has been slightly modified to allow interfacing with a language model for the
purpose of running semantic searches.

In practice, all the language-model-specific work is performed by Python programs outside of the
main Recoll source. One of the scripts is a worker for the Recoll GUI, receiving questions and
providing results. As a consequence, the main Recoll code needed only minimal modifications, mostly
for handling the additional search type in the GUI.

ollama is used to run the models, and chromadb to store the embeddings.

The general process is as follows:

  • The index is created or updated by recollindex, as always.

  • You then run the rclsem_embed.py script. This extracts the text of new documents, splits it
    into segments, and asks a language model to generate embeddings. The default model is the
    relatively small nomic-embed-text (137M parameters). Even so, running this on a CPU is orders
    of magnitude slower than the original indexing. The embedding vectors are stored in chromadb,
    along with the Recoll document identifier. See the sem_rclquery configuration variable below
    for limiting the part of the index which will actually be processed. (A simplified sketch of
    this step and the query step follows the list.)

  • At query time, an embedding is generated for the question, and chromadb is asked for
    neighbours. The Recoll identifiers are used to retrieve the document and the relevant text
    context.

  • An rclsem_query.py script allows running queries on the command line, and an
    equivalent rclsem_talk.py script is used to communicate with the GUI, when it executes a simple
    search in the new Semantic mode.
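
To make the process concrete, here is a heavily simplified sketch of the embed and query steps.
The actual rclsem_embed.py and rclsem_talk.py scripts handle segmentation, batching and error
cases properly; the collection name, segmenting logic and document identifier below are
illustrative only:

    import ollama
    import chromadb
    from recoll import recoll, rclextract

    MODEL = "nomic-embed-text"   # sem_embedmodel
    SEGSIZE = 1000               # sem_embedsegsize (illustrative value)

    client = chromadb.PersistentClient(path="/home/me/.recoll/chromadb")
    coll = client.get_or_create_collection("recoll")

    # Embedding pass: extract the text, segment it, embed each segment, and
    # store the vectors along with an identifier for the Recoll document.
    db = recoll.connect()
    q = db.query()
    q.execute("mime:*")   # sem_rclquery
    for _ in range(q.rowcount):
        doc = q.fetchone()
        text = rclextract.Extractor(doc).textextract(doc.ipath).text
        segments = [text[i:i + SEGSIZE] for i in range(0, len(text), SEGSIZE)]
        for n, seg in enumerate(segments):
            # The real scripts use the internal Recoll document identifier;
            # url+ipath is a stand-in for this sketch.
            segid = "%s|%s#%d" % (doc.url, doc.ipath, n)
            if coll.get(ids=[segid])["ids"]:
                continue   # already embedded, skip
            emb = ollama.embeddings(model=MODEL, prompt=seg)["embedding"]
            coll.add(ids=[segid], embeddings=[emb], documents=[seg])

    # Query pass: embed the question and ask chromadb for nearest neighbours.
    question = "characters who enjoy riding horses"
    qemb = ollama.embeddings(model=MODEL, prompt=question)["embedding"]
    hits = coll.query(query_embeddings=[qemb], n_results=5)
    for segid, seg in zip(hits["ids"][0], hits["documents"][0]):
        print(segid, seg[:80])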

The Python part runs in a virtual environment. A simple shell script allows building it easily.

I only have a modest PC to run this, with no GPU. Running a reranking model on this has proven
impossible (see comments in rclsem_query.py), so that the current implementation is rather
primitive and mostly provides the scaffolding for experimentation.

The modification of the main Recoll GUI code is minimal, so it has been merged into the master
branch (the code used to be on a semantic branch and is now in master).
In practice, if you want to try this:

  • Clone the default branch of the source tree:

    git clone https://framagit.org/medoc92/recoll.git
  • Then build and install as per the usual method. You need to add a -Dsemantic=true option to
    the meson setup command to enable the semantic query option code (see the build example
    below).

  • Go to the src/semantic directory and run the initsemenv.sh script:

    sh initsemenv.sh /path/to/where/I/want/the/scripdir

This will install ollama if it is not already there, pull the nomic-embed-text model, create the
Python virtual environment, install the ollama and chromadb Python packages in it, and copy the
Recoll scripts and Python module.
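
For reference, the build step might look something like this (the build directory name is
arbitrary and the exact steps may differ; the Recoll build documentation is authoritative):

    cd recoll/src        # directory containing meson.build
    meson setup builddir -Dsemantic=true
    ninja -C builddir
    sudo ninja -C builddir install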

There are a number of configuration variables which can be set in the main configuration file for
the index you will be using (an example snippet follows the list). Only the first one is mandatory:

  • sem_venv: the location of the Python virtualenv directory. This one is used by the main Recoll
    code and is mandatory.

  • sem_rclquery: the query which will select the documents to process. By default this is “mime:*”,
    which selects all documents, but you may want to restrict it (e.g. with dir: clauses), because
    the embedding operation is very slow on a CPU. You really want to begin experimenting with a
    small set. There is no problem with running the embedding step multiple times on different
    sets: the script tests whether a document is already present before doing any actual work.

  • sem_chromadbdir: where the chromadb data will be stored. By default, this will be the chromadb
    directory inside the Recoll configuration directory.

  • sem_embedmodel: nomic-embed-text by default. You may want to experiment if your configuration
    allows it. Of course, you need to delete the chromadb directory when changing models.

  • sem_embedsegsize: the target character size of the segments created for embedding.
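
For illustration, here is what these settings could look like in the configuration file (all
paths and values are examples only):

    sem_venv = /home/me/recoll-venv
    sem_rclquery = dir:/home/me/books
    sem_chromadbdir = /home/me/.recoll/chromadb
    sem_embedmodel = nomic-embed-text
    sem_embedsegsize = 1000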

Voila. If you have a GPU, and can code a little Python, I think that the most interesting direction
would be experimenting with reranking.
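
For anyone who does pick this up, one possible shape for a reranking pass over the chromadb
candidates, sketched here with the sentence-transformers cross-encoder API (this is not part of
the current scripts, and the model name is just a common public example):

    from sentence_transformers import CrossEncoder

    question = "characters who enjoy riding horses"
    candidates = ["segment one ...", "segment two ..."]   # e.g. the chromadb hits

    # Score each (question, segment) pair jointly, then sort by relevance.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(question, seg) for seg in candidates])
    for score, seg in sorted(zip(scores, candidates), reverse=True):
        print("%.3f  %s" % (score, seg[:60]))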

As a vaguely interesting aside, from my initial testing on the works of Jane Austen, it appears
that horseriding is a much more frequent subject in Mansfield Park than in the other books… It is
all the more interesting because horseriding is of course not an English word, and a normal Recoll
search for it would yield no results at all. It remains to be seen if the feature can be useful
for anything beyond literary ‘research’.
