Hibernate Search 3.2: fast index rebuild

One of the selling points of Hibernate Search is that your valued business data is stored in the database: a reliable, transactional and relational store. So while Hibernate Search keeps the index in sync with the database during regular use, on several occasions you'll need to rebuild the indexes from scratch:

Data existed before you introduced Hibernate Search
New features are developed, index mapping changed
A database backup is restored
Batch changes are applied on the database
...you get the point, this list could be very long

Evolving, user driven

An API to perform these operations has always existed in previous versions of Hibernate Search, but questions about how to make it faster weren't unusual on the forums. Keep in mind that rebuilding the whole index basically means that you have to load all indexed entities from the database into the Java world to feed Lucene with data to index. I'm a user myself, and the code for the new MassIndexer API was tuned after field experience on several applications and with much involvement of the community.

QuickStart: MassIndexer API

Since version 3.0 the documentation has provided a recommended re-indexing routine; this method is still available, but a new API providing better performance was added in version 3.2. No configuration changes are required, just start it:

FullTextSession fullTextSession = ...
MassIndexer massIndexer = fullTextSession.createIndexer();
massIndexer.startAndWait();

The above code will block until all entities are reindexed. If you don't need to wait for it, use the asynchronous method:

fullTextSession.createIndexer().start();

Selecting the entities to rebuild the index for

You don't need to rebuild the index for all indexed entities; let's say you want to re-index only the DVDs:

fullTextSession.createIndexer( Dvd.class ).startAndWait();

This will include all of Dvd's subclasses, as all of Hibernate Search's APIs are polymorphic.

Index optimization and clearing

Since in Lucene's world an update is implemented as a remove followed by an add, we need to remove all entities from the index before adding them again. This operation is known as purgeAll in Hibernate Search. By default the index is purged and optimized at start, and optimized again at the end; you might opt for a different strategy, but keep in mind that by disabling the purge operation you could later find duplicates. The optimization after purge is applied to save some disk space, as recommended in Hibernate Search in Action.

fullTextSession.createIndexer()
   .purgeAllOnStart( true ) // true by default, highly recommended
   .optimizeAfterPurge( true ) // true by default, saves some disk space
   .optimizeOnFinish( true ) // true by default
   .start();

Faster, Faster!

A MassIndexer is very sensitive to tuning; some settings can make it orders of magnitude faster when tuned properly, and the good values depend on your specific mapping, environment, database, and even your content. To find out which settings you need to tune you should be aware of some implementation details.

The MassIndexer uses a pipeline with different specialized threads working on it, so most processing is done concurrently. The following explains the process for a single entity type, but this is actually done in parallel jobs for each different entity when you have more than one indexed type:

A single thread named identifier-loader scrolls over the primary keys of the type. The number of threads for this stage is always one so that a transaction can define the set of keys to consider. So a first hint is to use simple keys and avoid complex types, as the loading of keys is always serialized.
The loaded primary keys are pushed to an id queue; there's one such queue for each root type to be indexed.
At the second stage a thread pool called entity-loader loads batches of entities using the provided primary keys. You can tune the number of threads working on this task concurrently (threadsToLoadObjects(int)) and the number of entities fetched per iteration (batchSizeToLoadObjects(int)). While idle threads might be considered a waste, the cost is minor and it's better to have some spare threads doing nothing than the opposite. Make sure you don't make it too hard for the database by requesting too much data: too large a batch size or way too many threads will also hurt, so you will have to find the sweet spot. The queues work as buffers to mitigate the effect of performance highs and lows due to different data, so finding the sweet spot is not a quest for the exact value but for a reasonable one.
The entity queue contains a stream of entities needing conversion into Lucene Documents; it's fed by the entity-loader threads and consumed by the document-creator threads.
The document-creator threads convert the entities into Lucene Documents (apply your Search mapping and custom bridges, transform data into text). It is important to understand that some lazy objects might still be loaded from the database during conversion (step 7 in the picture). So this step could be cheap or expensive: depending on your domain model and how you mapped it, there could be more round trips to the database or none at all. Second level cache interactions might help or hurt in this phase.
The document queue should be a constant stream of Documents to be added to the index. If this queue is mostly near-empty, it means you're producing the data more slowly than Lucene is able to analyse it and write it to the index. If this queue is mostly full, it means you're producing the Documents faster than Lucene is able to write them to the index. Always consider that Lucene is analysing the text during the write operation, so if this is slow it's not necessarily related to I/O limits: you could have expensive analysis. To find out you'll need a profiler.
The number of document indexer threads is also configurable (threadsForIndexWriter(int)), so in case of expensive analysis you can have more CPUs working on it.
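
The stages above can be pictured with plain java.util.concurrent classes. What follows is only a toy sketch of the idea, not Hibernate Search's actual implementation: the class name PipelineSketch, the queue capacities and the string "entities" are all made up for illustration, there is a single document creator, and the index writing stage is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the pipeline; every name here is hypothetical.
public class PipelineSketch {

    // End-of-stream markers, recognised by identity (==), never by content.
    private static final List<Long> NO_MORE_IDS = new ArrayList<>();
    private static final String NO_MORE_ENTITIES = new String("end");

    /** Pushes totalIds fake primary keys through the stages; returns the number of "documents" built. */
    public static int run(final int totalIds, final int batchSize, final int loaderThreads)
            throws InterruptedException {
        if (batchSize < 1 || loaderThreads < 1) {
            throw new IllegalArgumentException("batchSize and loaderThreads must be >= 1");
        }
        final BlockingQueue<List<Long>> idQueue = new ArrayBlockingQueue<>(4);
        final BlockingQueue<String> entityQueue = new ArrayBlockingQueue<>(16);
        final AtomicInteger documents = new AtomicInteger();
        // One thread per task, so every stage runs concurrently.
        ExecutorService pool = Executors.newFixedThreadPool(loaderThreads + 2);

        // Stage 1: a single thread "scrolls" the keys and queues them in batches.
        pool.submit((Callable<Void>) () -> {
            List<Long> batch = new ArrayList<>();
            for (long id = 1; id <= totalIds; id++) {
                batch.add(id);
                if (batch.size() == batchSize) {
                    idQueue.put(batch);
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                idQueue.put(batch);
            }
            for (int i = 0; i < loaderThreads; i++) {
                idQueue.put(NO_MORE_IDS); // one marker per loader thread
            }
            return null;
        });

        // Stage 2: several threads turn each batch of keys into "entities".
        for (int t = 0; t < loaderThreads; t++) {
            pool.submit((Callable<Void>) () -> {
                for (List<Long> batch = idQueue.take(); batch != NO_MORE_IDS; batch = idQueue.take()) {
                    for (Long id : batch) {
                        entityQueue.put("entity-" + id);
                    }
                }
                entityQueue.put(NO_MORE_ENTITIES);
                return null;
            });
        }

        // Stage 3: one thread converts entities into "documents" (here it only counts them).
        pool.submit((Callable<Void>) () -> {
            int finishedLoaders = 0;
            while (finishedLoaders < loaderThreads) {
                String entity = entityQueue.take();
                if (entity == NO_MORE_ENTITIES) {
                    finishedLoaders++;
                } else {
                    documents.incrementAndGet();
                }
            }
            return null;
        });

        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            pool.shutdownNow();
            throw new IllegalStateException("pipeline did not finish in time");
        }
        return documents.get();
    }
}
```

Calling PipelineSketch.run(100, 30, 4) sends 100 keys through in batches of 30 with four loader threads and returns 100. In the sketch, as in the description above, key loading stays on one thread while the later stage is the one you scale.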

The queues are blocking and bounded, so there's no danger in setting too many producer threads for any stage: if a queue fills up the producers will be set on hold until space is available. All thread pools have names assigned, so if you connect with a profiler or debugger the different stages can be promptly identified.
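
To see what "blocking and bounded" means in practice, here is a minimal sketch using the JDK's ArrayBlockingQueue. It is an illustration under assumptions: the class BoundedQueueDemo and the capacity of 2 are invented for the example, and I'm not claiming this is the queue class Hibernate Search uses internally.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Hibernate Search.
public class BoundedQueueDemo {

    /** Returns { accepted while the queue was full, accepted after a consumer took one element }. */
    public static boolean[] demo() throws InterruptedException {
        BlockingQueue<String> documentQueue = new ArrayBlockingQueue<>(2);
        documentQueue.put("doc-1");
        documentQueue.put("doc-2");

        // The queue is full: put() would block the producer here.
        // offer() with a timeout waits a little and then gives up instead.
        boolean acceptedWhileFull = documentQueue.offer("doc-3", 50, TimeUnit.MILLISECONDS);

        // A consumer frees one slot, so the producer can continue.
        documentQueue.take();
        boolean acceptedAfterTake = documentQueue.offer("doc-3", 50, TimeUnit.MILLISECONDS);

        return new boolean[] { acceptedWhileFull, acceptedAfterTake };
    }
}
```

The first offer is rejected and the second one succeeds. A producer that is too fast simply waits, so memory use stays bounded no matter how many producer threads you configure.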

API for data load tuning

The following settings rebuild my personal reference database in 3 minutes, down from 6 hours before enabling these settings, and from 2 months before looking into any kind of Hibernate or Lucene tuning.

fullTextSession.createIndexer()
   .batchSizeToLoadObjects( 30 )
   .threadsForSubsequentFetching( 8 )
   .threadsToLoadObjects( 4 )
   .threadsForIndexWriter( 3 )
   .cacheMode(CacheMode.NORMAL) // defaults to CacheMode.IGNORE
   .startAndWait();

Caching

When some information is embedded in the index from entities having a low cardinality (a high cache hit ratio), for example when there's a ManyToOne relation to gender or countrycode, it might make sense to enable the cache, which is ignored by default. In most cases ignoring the cache will result in the best performance, especially if you're using a distributed cache, which would introduce unneeded network events.

Offline job

While normally all changes done by Hibernate Search are coupled to a transaction, the MassIndexer uses several transactions, and consistency is not guaranteed if you make changes to data while it's running. The index will only contain the entities that existed in the database when the job started, and any update made to the index in this timeframe by other means might be lost. Nothing wrong will happen to the data in the database, but the index might be inconsistent if changes are made while the job is busy.

Performance checklist

After having parallelized the indexing code, there are some other bottlenecks to avoid:

Check your database behavior: almost all databases provide a profiling tool which can provide valuable information when run during a mass indexing
Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend for database connections
EAGER loading set on properties not needed by Hibernate Search will have them loaded anyway; avoid it
Check for network latency
The effect of tuning settings doesn't depend only on static information like schema, mapping options and Lucene settings but also on the contents of the data: don't assume five minutes of testing will reveal your normal behavior; collect metrics from real-world scenarios. The queues are going to help as buffers for non-constant performance in the various stages.
Don't forget to tune the IndexWriter. See the reference documentation: nothing changed in this area.
If the indexing CPUs are not at 100% usage and the DB is not hitting its limits, you know you can improve the settings

Progress Monitoring

An API to plug in your own monitoring implementation is on the way; currently beta1 uses a logger to periodically print status, so you can enable or disable your loggers to control it. Let us know what kind of monitoring you would like to have, or contribute one!
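
To give an idea of what logger-based monitoring can look like, here is a purely hypothetical sketch. The class LoggingProgressSketch and its methods are my own invention for illustration and are not the Hibernate Search monitoring API, which was not final at the time of writing; it simply counts documents in a thread-safe way and logs each time a threshold is crossed.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical monitor, NOT the Hibernate Search API.
public class LoggingProgressSketch {

    private static final Logger LOG = Logger.getLogger(LoggingProgressSketch.class.getName());

    private final AtomicLong documentsDone = new AtomicLong();
    private final long logEvery;

    public LoggingProgressSketch(long logEvery) {
        if (logEvery < 1) {
            throw new IllegalArgumentException("logEvery must be >= 1");
        }
        this.logEvery = logEvery;
    }

    /**
     * Can be called concurrently by several indexing threads.
     * Returns true when this call crossed a multiple of logEvery and a status line was logged.
     */
    public boolean documentsAdded(long increment) {
        long current = documentsDone.addAndGet(increment);
        long previous = current - increment;
        boolean crossed = current / logEvery > previous / logEvery;
        if (crossed) {
            LOG.info(current + " documents indexed so far");
        }
        return crossed;
    }

    public long total() {
        return documentsDone.get();
    }
}
```

Because the output goes through a logger, switching it on or off is a matter of logger configuration, which is the same control you have with the current beta.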
