Red Hat



In Relation To Sanne Grinovero

Hibernate Search reached 3.4.0.CR1

Tagged as Hibernate Search

One week after 3.4.0.Beta1, two weeks after 3.4.0.Alpha1, we're on a run for 3.4.0.Final!

So this is your last chance to report issues on JIRA before the final release, which is of course planned for next week.

Download it from Sourceforge or via Maven from the JBoss repository, and discuss it on the forums.

Changes

As every good candidate release should, this one brings no big changes since the previous beta: we just made it a bit easier to embed the Hibernate Search engine into other frameworks, as Infinispan does with its Query API.

Infinispan Query

Since its first releases Infinispan has provided a Query module, which reuses the Hibernate Search engine. Infinispan 4 shipped it as a technology preview; for version 5 we're polishing the API and making sure it's simpler to set up and use. Expect some API updates in the next Alpha of Infinispan 5, and be aware that you can reuse all Hibernate Search annotations and extensions!

We did a big refactoring of the query engine, hence the alpha tag. If you could focus your tests on this area and see if there are any issues, we would be forever grateful.

This release comes with the usual mix of bug fixes, optimizations and new features.

Query engine refactoring

This was a biggie and the objective is that it does not affect you :) We have a medium term goal to make Hibernate Search not depend on Hibernate Core. Extracting the query engine core into well defined SPIs (Service Provider Interfaces) is a milestone for this. The first beneficiary is the Infinispan query module which will become much more resilient to Hibernate Search version changes.

As said earlier, test your application's queries. They should still run without any API change. Let us know otherwise.

Update index only when an indexed property changes

In previous versions, when an @Indexed entity changed in Hibernate Core, it was reindexed by Hibernate Search even if none of the indexed properties had actually changed (after all, not all properties are indexed). Hibernate Search 3.4 is smarter in this area and tries not to reindex entities whose indexed properties are unchanged. In some situations, such as dynamic boosting or class-level bridges, Hibernate Search cannot be certain and always reindexes to be safe.

This optimization should speed things up quite significantly for some applications. Check the documentation for more information.
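The idea behind this check can be sketched in plain Java. This is only a conceptual model, not the Hibernate Search implementation: the class DirtyCheckSketch, the method needsReindex and the map-based entity state are all hypothetical names made up for the illustration.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Conceptual model only: none of these names belong to Hibernate Search.
final class DirtyCheckSketch {

    /**
     * Returns true when at least one indexed property differs between the
     * old and the new state, that is, when the index entry is out of date.
     */
    static boolean needsReindex(Set<String> indexedProperties,
                                Map<String, Object> oldState,
                                Map<String, Object> newState) {
        for (String property : indexedProperties) {
            if (!Objects.equals(oldState.get(property), newState.get(property))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> indexed = Set.of("title");
        Map<String, Object> before = Map.of("title", "Alien", "stock", 3);
        Map<String, Object> after = Map.of("title", "Alien", "stock", 2);
        // Only the non-indexed "stock" property changed, so this prints false.
        System.out.println(needsReindex(indexed, before, after));
    }
}
```

A check like this cannot see what a class-level bridge or a dynamic boost reads from the entity, which is why those cases fall back to always reindexing.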

Look for entities in the second level cache

In some environments, most if not all of the searched entities are in the second level cache. This is especially true when you use a distributed second level cache like Infinispan. You can ask Hibernate Search to look in the second level cache before trying to fetch data from the database.

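// "session" is assumed to be an org.hibernate.search.FullTextSession
FullTextQuery query = session.createFullTextQuery(luceneQuery, User.class);
// Look in the second level cache first, then load the misses with a query
query.initializeObjectWith(
    ObjectLookupMethod.SECOND_LEVEL_CACHE,
    DatabaseRetrievalMethod.QUERY
);
// list() on the Hibernate API; the JPA FullTextQuery offers getResultList()
List results = query.list();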

Of course your entities must be cacheable :) If your entities are most likely in the persistence context (the Hibernate session), you can use ObjectLookupMethod.PERSISTENCE_CONTEXT instead.
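To see why this helps, here is a minimal plain-Java model of "cache first, database for the rest". It does not use the Hibernate Search API: CacheFirstLookupSketch, resolve and the batchLoader function are hypothetical stand-ins, and the cache is just a Map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Conceptual model only: the cache is a Map, the database is a function.
final class CacheFirstLookupSketch {

    /**
     * Resolves the matched ids from the cache first and loads every miss
     * with a single call to batchLoader, keeping the order of the hits.
     */
    static Map<Long, String> resolve(List<Long> matchedIds,
                                     Map<Long, String> cache,
                                     Function<List<Long>, Map<Long, String>> batchLoader) {
        List<Long> misses = new ArrayList<>();
        for (Long id : matchedIds) {
            if (!cache.containsKey(id)) {
                misses.add(id);
            }
        }
        Map<Long, String> loaded = new HashMap<>();
        if (!misses.isEmpty()) {
            loaded.putAll(batchLoader.apply(misses)); // one round trip for all misses
        }
        Map<Long, String> results = new LinkedHashMap<>();
        for (Long id : matchedIds) {
            results.put(id, cache.containsKey(id) ? cache.get(id) : loaded.get(id));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Long, String> cache = Map.of(1L, "cached user 1");
        Map<Long, String> users = resolve(List.of(1L, 2L), cache, misses -> {
            Map<Long, String> fromDatabase = new HashMap<>();
            for (Long id : misses) {
                fromDatabase.put(id, "loaded user " + id);
            }
            return fromDatabase;
        });
        System.out.println(users);
    }
}
```

If every hit is already cached, the loader is never called, which is the case this option is meant for.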

MassIndexer improvements

The mass indexer module has had a few improvements:

  • multithreading of the text analysis phase
  • better monitoring of the mass indexer state and progress, by letting you plug in a custom implementation (see MassIndexerProgressMonitor)

Faceting

A much-requested and useful feature: you can now play with the first alpha preview of the new Faceting engine API.

Field Caching

It's now possible to use Lucene's FieldCache to provide an extra boost to query performance: see the reference documentation.

Other performance improvements

We have found a couple of tricks to improve overall performance. We cache some metadata more aggressively to lower the reflection overhead, and we push some additional buttons in Lucene for you to reduce query time.

Give us feedback

Please give us feedback: we have an aggressive release schedule ahead, as we are planning the GA version for the end of this month.

Check out the new release on JBoss.org's Maven repository or download the distribution. You can also read the documentation.

If you find an issue, shoot.

Hibernate Search 3.2: fast index rebuild

Tagged as Hibernate Search

One of the points of using Hibernate Search is that your valued business data is stored in the database: a reliable transactional and relational store. So while Hibernate Search keeps the index in sync with the database during regular use, on several occasions you'll need to be able to rebuild the indexes from scratch:

  • Data existed before you introduced Hibernate Search
  • New features are developed, index mapping changed
  • A database backup is restored
  • Batch changes are applied on the database
  • ...you get the point, this list could be very long

Evolving, user driven

An API to perform these operations has always existed in previous versions of Hibernate Search, but questions about how to make it faster weren't unusual on the forums. Keep in mind that rebuilding the whole index basically means that you have to load all indexed entities from the database into the Java world to feed Lucene with data to index. I'm a user myself, and the code for the new MassIndexer API was tuned after field experience on several applications and with much involvement of the community.

QuickStart: MassIndexer API

Since version 3.0 the documentation provided a recommended re-indexing routine; this method is still available but a new API providing better performance was added in version 3.2. No configuration changes are required, just start it:

FullTextSession fullTextSession = ...
MassIndexer massIndexer = fullTextSession.createIndexer();
massIndexer.startAndWait();

The above code will block until all entities are reindexed. If you don't need to wait for it use the asynchronous method:

fullTextSession.createIndexer().start();
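The difference between the two calls is the usual one between a blocking call and one that returns a handle immediately. A plain-Java analogy, not the MassIndexer itself: AsyncStartSketch and its two methods are hypothetical stand-ins, and the "job" only counts.

```java
import java.util.concurrent.CompletableFuture;

// Analogy only: a stand-in job, not the real MassIndexer.
final class AsyncStartSketch {

    /** Stand-in for the asynchronous start: returns at once with a handle. */
    static CompletableFuture<Integer> start(int entities) {
        return CompletableFuture.supplyAsync(() -> {
            int indexed = 0;
            for (int i = 0; i < entities; i++) {
                indexed++; // pretend to index one entity
            }
            return indexed;
        });
    }

    /** Stand-in for the blocking variant: waits until the job is done. */
    static int startAndWait(int entities) {
        return start(entities).join();
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> job = start(1_000); // caller is free to continue
        System.out.println("blocking run indexed " + startAndWait(10));
        System.out.println("async run indexed " + job.join());
    }
}
```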

Selecting the entities to rebuild the index for

You don't need to rebuild the index for all indexed entities; let's say you want to re-index only the DVDs:

fullTextSession.createIndexer( Dvd.class ).startAndWait();

This will include all of Dvd's subclasses, as all Hibernate Search's APIs are polymorphic.

Index optimization and clearing

In Lucene's world an update is implemented as a remove followed by an add, so before adding all entities to the index we need to remove them all from it. This operation is known as purgeAll in Hibernate Search. By default the index is purged and optimized at start, and optimized again at the end; you might opt for a different strategy, but keep in mind that by disabling the purge operation you could later find duplicates. The optimization after purge is applied to save some disk space, as recommended in Hibernate Search in Action.

fullTextSession.createIndexer()
   .purgeAllOnStart( true ) // true by default, highly recommended
   .optimizeAfterPurge( true ) // true is default, saves some disk space
   .optimizeOnFinish( true ) // true by default
   .start();
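Why skipping the purge can leave duplicates is easy to see on a toy model. The sketch below treats the index as an append-only list; PurgeSketch and rebuild are hypothetical names, and nothing here touches Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: the "index" is an append-only list of documents.
final class PurgeSketch {

    /** Rebuilds the toy index and returns how many documents it holds. */
    static int rebuild(List<String> index, List<String> entities, boolean purgeAllOnStart) {
        if (purgeAllOnStart) {
            index.clear(); // drop everything indexed before
        }
        index.addAll(entities); // a rebuild only ever adds
        return index.size();
    }

    public static void main(String[] args) {
        List<String> entities = List.of("dvd-1", "dvd-2");

        List<String> purged = new ArrayList<>(entities); // already indexed once
        System.out.println(rebuild(purged, entities, true)); // 2

        List<String> notPurged = new ArrayList<>(entities); // already indexed once
        System.out.println(rebuild(notPurged, entities, false)); // 4, every document twice
    }
}
```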

Faster, Faster!

A MassIndexer is very sensitive to tuning; some settings can make it orders of magnitude faster when tuned properly, and the good values depend on your specific mapping, environment, database, and even your content. To find out which settings you need to tune you should be aware of some implementation details.

The MassIndexer uses a pipeline with different specialized threads working on it, so most processing is done concurrently. The following explains the process for a single entity type, but this is actually done in parallel jobs for each different entity when you have more than one indexed type:

  1. A single thread named identifier-loader scrolls over the primary keys of the type. The number of threads for this stage is always one, so that a transaction can define the set of keys to consider. So a first hint is to use simple keys and avoid complex types, as the loading of keys will always be serialized.
  2. The loaded primary keys are pushed to an id queue; there's one such queue for each root type to be indexed.
  3. At the second stage a threadpool called entity-loader loads batches of entities using the provided primary keys. You can tune the number of threads working on this task concurrently (threadsToLoadObjects(int)) and the number of entities fetched per iteration (batchSizeToLoadObjects(int)). While idle threads might be considered a waste, this is minor, and it's better to have some spare threads doing nothing than the opposite. Make sure you don't make it too hard for the database by requesting too much data: too large a batch size or way too many threads will also hurt, so you will have to find the sweet spot. The queues work as buffers to mitigate the effect of performance highs and lows due to different data, so finding the sweet spot is not a quest for the exact value but for a reasonable one.
  4. The entity queue contains a stream of entities needing conversion into Lucene Documents; it's fed by the entity-loader threads and consumed by the document-creator threads.
  5. The document-creator threads convert the entities into Lucene Documents (apply your Search mapping and custom bridges, transform data into text). It is important to understand that during conversion some lazy objects might still be loaded from the database (step 7 in the picture). So this step could be cheap or expensive: depending on your domain model and how you mapped it, there could be more round trips to the database or none at all. Second level cache interactions might help or hurt in this phase.
  6. The document queue should be a constant stream of Documents to be added to the index. If this queue is mostly near-empty, you are producing the data more slowly than Lucene is able to analyse it and write it to the index. If this queue is mostly full, you are producing the Documents faster than Lucene is able to write them to the index. Always consider that Lucene is analysing the text during the write operation, so if this is slow it's not necessarily related to I/O limits: you could have expensive analysis. To find out you'll need a profiler.
  7. The document indexer thread number is also configurable, so in case of expensive analysis you can have more CPUs working on it.

The queues are blocking and bounded, so there's no danger in setting too many producer threads for any stage: if a queue fills up the producers will be set on hold until space is available. All thread pools have names assigned, so if you connect with a profiler or debugger the different stages can be promptly identified.
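The stages and queues described above can be modelled with the standard java.util.concurrent classes. This is a simplified sketch under stated assumptions, not the MassIndexer source: it has only two queues, the "entities" are just ids turned into strings, and PipelineSketch, run and the end markers are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified model: one id producer, N loader threads, one indexer thread.
final class PipelineSketch {

    private static final long END_OF_IDS = -1L;         // marker: no more ids
    private static final String END_OF_DOCS = "<end>";  // marker: a loader finished

    /** Pushes entityCount ids through the pipeline and returns the indexed count. */
    static int run(int entityCount, int loaderThreads, int queueCapacity) {
        if (loaderThreads < 1 || queueCapacity < 1) {
            throw new IllegalArgumentException("need at least one loader and one slot");
        }
        BlockingQueue<Long> idQueue = new ArrayBlockingQueue<>(queueCapacity);
        BlockingQueue<String> documentQueue = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService pool = Executors.newFixedThreadPool(loaderThreads + 2);
        try {
            // Stage 1: a single thread produces the ids; put() blocks when the queue is full.
            pool.submit(() -> {
                for (long id = 0; id < entityCount; id++) {
                    idQueue.put(id);
                }
                for (int i = 0; i < loaderThreads; i++) {
                    idQueue.put(END_OF_IDS); // one marker per loader
                }
                return null;
            });
            // Stage 2: loaders turn ids into "documents".
            for (int i = 0; i < loaderThreads; i++) {
                pool.submit(() -> {
                    while (true) {
                        long id = idQueue.take();
                        if (id == END_OF_IDS) {
                            break;
                        }
                        documentQueue.put("doc-" + id);
                    }
                    documentQueue.put(END_OF_DOCS);
                    return null;
                });
            }
            // Stage 3: the indexer counts documents until every loader has finished.
            Future<Integer> indexed = pool.submit(() -> {
                int count = 0;
                int finishedLoaders = 0;
                while (finishedLoaders < loaderThreads) {
                    String document = documentQueue.take();
                    if (END_OF_DOCS.equals(document)) {
                        finishedLoaders++;
                    } else {
                        count++;
                    }
                }
                return count;
            });
            return indexed.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pipeline failed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1_000, 4, 8)); // 1000
    }
}
```

Even with a queue capacity as small as 8, nothing is lost: a full queue simply puts its producers on hold, which is the behaviour the bounded queues are there to provide.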

API for data load tuning

The following settings rebuild my personal reference database in 3 minutes, while I started at 6 hours before enabling these settings, or at 2 months before looking into any kind of Hibernate or Lucene tuning.

fullTextSession.createIndexer()
   .batchSizeToLoadObjects( 30 )
   .threadsForSubsequentFetching( 8 )
   .threadsToLoadObjects( 4 )
   .threadsForIndexWriter( 3 )
   .cacheMode(CacheMode.NORMAL) // defaults to CacheMode.IGNORE
   .startAndWait();

Caching

When some information is embedded in the index from entities having a low cardinality (and so a high cache hit ratio), for example when there's a ManyToOne relation to gender or countrycode, it might make sense to enable the cache, which is ignored by default. In most cases ignoring the cache will result in the best performance, especially if you're using a distributed cache, which would introduce unneeded network events.
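A small counting exercise shows why low cardinality matters. In this sketch, which is a conceptual model with hypothetical names (LowCardinalityCacheSketch, countLoads), every cache miss stands for one round trip to the database.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual model: counts simulated database round trips.
final class LowCardinalityCacheSketch {

    /** Returns how many loads are needed to resolve the country of each entity. */
    static int countLoads(List<String> countryCodes, boolean useCache) {
        Map<String, String> cache = new HashMap<>();
        int loads = 0;
        for (String code : countryCodes) {
            if (useCache && cache.containsKey(code)) {
                continue; // cache hit, no round trip
            }
            loads++; // simulated round trip to the database
            if (useCache) {
                cache.put(code, "Country " + code);
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        List<String> codes = List.of("IT", "FR", "IT", "IT", "FR");
        System.out.println(countLoads(codes, false)); // 5
        System.out.println(countLoads(codes, true));  // 2
    }
}
```

When nearly every value is distinct the cache saves nothing and only adds overhead, which matches the default of ignoring it.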

Offline job

While normally all changes done by Hibernate Search are coupled to a transaction, the MassIndexer uses several transactions, and consistency is not guaranteed if you make changes to data while it's running. The index will only contain the entities that existed in the database when the job started, and any update made to the index in this timeframe by other means might be lost. While nothing wrong would happen to the data in the database, the index might be inconsistent if changes are made while the job is busy.

Performance checklist

After having parallelized the indexing code, there are some other bottlenecks to avoid:

  1. Check your database behavior, almost all databases provide a profiling tool which can provide valuable information when run during a mass indexing
  2. Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend for database connections
  3. EAGER loading set on properties not needed by Hibernate Search will still load them; avoid it
  4. Check for network latency
  5. The effect of tuning settings doesn't depend only on static information like schema, mapping options and Lucene settings but also on the contents of data: don't assume five minutes of testing will highlight your normal behavior and collect metrics from real world scenarios. The queues are going to help as buffers for non-constant performance in the various stages.
  6. Don't forget to tune the IndexWriter. See the reference documentation: nothing changed in this area.
  7. If the indexing CPUs are not at 100% usage and the DB is not hitting its limits, you know you can improve the settings
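The connection pool point in the checklist can be illustrated with a Semaphore standing in for the pool. This is a toy model with hypothetical names (ConnectionContentionSketch, peakConnectionsInUse), not a real pool: it just records how many "connections" are ever in use at the same time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model: a Semaphore plays the role of the connection pool.
final class ConnectionContentionSketch {

    /** Returns the highest number of connections in use at the same time. */
    static int peakConnectionsInUse(int workerThreads, int poolSize, int tasksPerThread) {
        Semaphore connections = new Semaphore(poolSize);
        AtomicInteger inUse = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        try {
            List<Future<Object>> jobs = new ArrayList<>();
            for (int i = 0; i < workerThreads; i++) {
                jobs.add(workers.submit(() -> {
                    for (int task = 0; task < tasksPerThread; task++) {
                        connections.acquire(); // waits while the pool is exhausted
                        try {
                            peak.accumulateAndGet(inUse.incrementAndGet(), Math::max);
                            Thread.sleep(1); // pretend to run a query
                        } finally {
                            inUse.decrementAndGet();
                            connections.release();
                        }
                    }
                    return null;
                }));
            }
            for (Future<Object> job : jobs) {
                job.get();
            }
            return peak.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("workers failed", e);
        } finally {
            workers.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Eight loader threads, but never more than two queries in flight.
        System.out.println(peakConnectionsInUse(8, 2, 5));
    }
}
```

However many threads are configured, the effective parallelism against the database is capped by the pool size, so the extra threads only wait.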

Progress Monitoring

An API to plug in your own monitoring implementation is on the way; currently beta1 uses a logger to periodically print status, so you can enable or disable your loggers to control it. Let us know what kind of monitoring you would like to have, or contribute one!
