New implementation of mass indexer using JSR 352

Hi, I’m Mincong, an engineering student from France. I’m glad to present my Google Summer of Code 2016 project, which provides an alternative to the current mass indexer implementation of Hibernate Search, using the Java Batch architecture (JSR 352). I’ve been working on this project for 4 months. Before getting started, I want to thank Google for sponsoring the project, the Hibernate team for accepting my proposal and my mentors Gunnar and Emmanuel for their help during this period. Now, let’s begin!

What is it about?

Hibernate Search brings full-text search capabilities to your Hibernate/JPA applications by synchronizing the state of your entities with a search index maintained by Lucene (or Elasticsearch as of Hibernate Search 5.6!). Index synchronization usually happens on the fly as entities are modified, but there may be cases where an entire index needs to be re-built, e.g. when enabling indexing for an existing entity type or after changes have been applied directly to the database, bypassing Hibernate (Search).

Hibernate Search provides the mass indexer for this purpose. It was the goal of my GSoC project to develop an alternative using the API for Java batch applications standardized by JSR 352.

What do we gain from JSR 352?

Implementing the mass indexing functionality using the standardized batching API allows you to use the existing tools of your runtime environment for starting/stopping and monitoring the status of the indexing process. E.g. in WildFly you can use the CLI to do so.

Also JSR 352 provides a way to restart specific job runs. This is very useful if re-indexing of an entity type failed mid-way, for instance due to connectivity issues with the database. Once the problem is solved, the batch job will continue where it left off, not processing again those items already processed successfully.

As JSR 352 defines common concepts of batch-oriented applications such as item readers, processors and writers, the job architecture and workflow is very easy to follow. In JSR 352, the workflow is written in an XML file (the "job XML"), which is used to specify a job, its steps and directs their execution. So you can understand the process without jumping into the code.

<job id="massIndex">
    <step id="beforeChunk" next="produceLuceneDoc">
        <batchlet ref="beforeChunkBatchlet"/>
    </step>

    <step id="produceLuceneDoc" next="afterChunk">
        <chunk checkpoint-policy="custom">
            <reader ref="entityReader"/>
            <processor ref="luceneDocProducer"/>
            <writer ref="luceneDocWriter"/>
            <checkpoint-algorithm ref="checkpointAlgorithm"/>
        </chunk>
        ...
    </step>

    <step id="afterChunk">
        <batchlet ref="afterChunkBatchlet"/>
    </step>
</job>

As you see, it brings a pervasive batch-processing workload. Anyone who has experience in ETL processes should have no difficulty to understand our new implementation.

Example usages

Here are the example usages of the new mass indexer under a draft version. It allows you to add one or multiple class types. If you have more than one root entity to index, then you can use the addRootEntities(Class<?>…) method.

How to use the new MassIndexer

long executionId = new MassIndexer()
        .addRootEntity( Company.class )
        .start();

Another example with a more customized configuration:

long executionId = new MassIndexer()
        .addRootEntity( Company.class, Employee.class )
        .cacheable( false )
        .checkpointFreq( 1000 )
        .rowsPerPartition( 100000 )
        .maxThreads( 10 )
        .purgeAtStart( true )
        .optimizeAfterPurge( true )
        .optimizeAtEnd( true )
        .start();

Parallelism

In order to maximize the performance, we highly recommend you to speed up the mass indexer using parallelism. Parallelism is activated by default. Under the JSR 352 standard, the exact word is "partitioning". The indexing step may run as multiple partitions, one per thread. Each partition has its own partition ID and parameters. If there are more partitions than threads, partitions are considered as a queue to consume: each thread can only run one partition at a time and won’t consume the next partition until the previous one is finished.

massIndexer = massIndexer.rowsPerPartition( 500 );

Checkpoints

Mass indexer supports checkpoint algorithm. If the job is interrupted for any reason, mass indexer can be restarted from the last checkpoint, stored by the batch runtime. And the entities already indexed won’t be lost, because they are already flushed to the directory provider. Assume that N is the value of checkpoint frequency, then a partition will reach at checkpoint every N items processed inside the partition. You can overwrite it to adapt your business requirements.

massIndexer = massIndexer.checkpointFreq( 1000 );

Run

For further usage, please check my GitHub repo gsoc-hsearch. If you want to play with it, you can download the code and build it with Maven:

$ git clone -b 1.0 git://github.com/mincong-h/gsoc-hsearch.git
$ cd gsoc-hsearch
$ mvn install

Current status and next steps

Currently, the new implementation accepts different types of entity as entry, provides high level of customization of the job properties and parallel indexation. The job periodically saves its current progress to enable restart from the last point of consistency. Load balancing has been considered to avoid overload of any single thread. This indexing batch job is available under Java SE and Java EE.

There are still many things to do, e.g. related to performance improvements, integration into WildFly, monitoring, more fine-grained selection of entities to be re-indexed etc. Here are some of the ideas:

Core: partition mapping for composite ID
Integration: package the batch job as a WildFly module
Integration: start the indexing batch job from FullTextSession and FullTextEntityManager
Integration: embed this project into Hibernate Search
Monitoring: enhance the basic monitoring, e.g. progress status for restarted job
Performance: Ensure a great performance of this implementation

These tasks are tracked as GitHub issues, you can check the complete TODO list here.

Feedback

If you are using Hibernate Search and ever wished for a more standardized approach to mass indexing, this project clearly is for you.

We still need to apply some improvements and polishing before integrating it as a module into the Hibernate Search core code base, but any bug reports or comments on the project will be very helpful. So please give it a try and let us know about your feedback. Just drop a comment below or raise an issue on GitHub.

Looking forward to hearing from you!

In Relation To