In Relation To Hibernate Search

Hibernate Search monitoring and statistics

Posted by    |       |    Tagged as Hibernate Search

Emmanuel mentioned the new Statistics interface in his previous Search post; it was introduced in Hibernate Search 3.3 (latest version 3.3.0.Beta1). I thought it was time to write a bit more about it. The API is actually self-explanatory:

public interface Statistics {

    /** Reset all statistics. */
    void clear();

    /**
     * Get the global number of executed search queries.
     * @return search query execution count
     */
    long getSearchQueryExecutionCount();

    /** Get the total search time in nanoseconds. */
    long getSearchQueryTotalTime();

    /** Get the time in nanoseconds of the slowest search. */
    long getSearchQueryExecutionMaxTime();

    /** Get the average search time in nanoseconds. */
    long getSearchQueryExecutionAvgTime();

    /** Get the query string for the slowest query. */
    String getSearchQueryExecutionMaxTimeQueryString();

    /** Get the total object loading time in nanoseconds. */
    long getObjectLoadingTotalTime();

    /** Get the time in nanoseconds for the slowest object load. */
    long getObjectLoadingExecutionMaxTime();

    /** Get the average object loading time in nanoseconds. */
    long getObjectLoadingExecutionAvgTime();

    /** Get the total number of objects loaded. */
    long getObjectsLoadedCount();

    /** Are statistics logged? */
    boolean isStatisticsEnabled();

    /** Enable statistics logging (this is a dynamic parameter). */
    void setStatisticsEnabled(boolean b);

    /**
     * Returns the Hibernate Search version.
     * @return the Hibernate Search version
     */
    String getSearchVersion();

    /**
     * Returns a list of all indexed classes.
     * @return list of all indexed classes
     */
    Set<String> getIndexedClassNames();

    /**
     * Returns the number of documents for the given entity.
     * @param entity the fully qualified class name of the entity
     * @return number of documents for the specified entity name
     * @throws IllegalArgumentException in case the entity name is not valid
     */
    int getNumberOfIndexedEntities(String entity);

    /**
     * Returns a map of all indexed entities and their document count in the index.
     * @return a map of all indexed entities and their document count. The map key is the
     *         fully qualified class name of the entity and the map value is the document count.
     */
    Map<String, Integer> indexedEntitiesCount();
}

Access to the statistics is via SearchFactory.getStatistics(). The information about which classes are indexed and how many entities are in the index is always available. The query and object loading timings, however, are only collected if the corresponding property is enabled in your configuration. I am thinking about introducing an additional interface to make this separation more obvious. WDYT?
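To make the timing getters above concrete, here is a toy accumulator showing the arithmetic the interface implies: count, total and max are recorded per query, and the average is derived as total divided by count. This is not Hibernate Search's implementation, just an illustration of the contract.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy illustration of the arithmetic behind the Statistics timing getters.
public class QueryStats {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong maxNanos = new AtomicLong();

    /** Record one executed query and its duration in nanoseconds. */
    public void recordQuery(long durationNanos) {
        count.incrementAndGet();
        totalNanos.addAndGet(durationNanos);
        maxNanos.accumulateAndGet(durationNanos, Math::max);
    }

    public long getSearchQueryExecutionCount()   { return count.get(); }
    public long getSearchQueryTotalTime()        { return totalNanos.get(); }
    public long getSearchQueryExecutionMaxTime() { return maxNanos.get(); }

    /** Average is derived, not stored: total / count. */
    public long getSearchQueryExecutionAvgTime() {
        long c = count.get();
        return c == 0 ? 0 : totalNanos.get() / c;
    }
}
```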

The new statistics and monitoring functionality does not end here. You can also expose the statistics via JMX. Setting the corresponding property automatically registers the StatisticsInfoMBean with the MBeanServer. On top of this MBean there are two more MBeans - IndexControlMBean and IndexingProgressMonitorMBean - which may or may not be available depending on your configuration and the state of the application.
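As a sketch of the JMX side, the following self-contained example mirrors the pattern: an MBean interface exposing a statistics attribute is registered with the platform MBeanServer and read back by attribute name. The interface, object name and value here are illustrative stand-ins, not Hibernate Search's actual StatisticsInfoMBean.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

// Illustrative stand-in showing how a statistics bean is registered with
// the MBeanServer and how a JMX client reads one of its attributes.
public class JmxStatsSketch {

    // Hypothetical management interface mirroring one Statistics getter
    public interface SearchStatsMBean {
        long getSearchQueryExecutionCount();
    }

    public static class SearchStats implements SearchStatsMBean {
        @Override
        public long getSearchQueryExecutionCount() { return 42L; } // dummy value
    }

    public static long readCountViaJmx() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // hypothetical object name; Hibernate Search registers its own
        ObjectName name = new ObjectName("com.example:type=SearchStats");
        server.registerMBean(
                new StandardMBean(new SearchStats(), SearchStatsMBean.class), name);
        // a JMX console (or this programmatic client) reads attributes by name
        return (Long) server.getAttribute(name, "SearchQueryExecutionCount");
    }
}
```

A monitoring console such as JConsole would browse to the same object name and read the attribute interactively.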

The IndexControlMBean allows you to build, optimize and purge the index for a given entity. Indexing occurs via the mass indexing API. A requirement for this bean to be registered in JMX is that the Hibernate SessionFactory is bound to JNDI via the hibernate.session_factory_name property. Refer to the Hibernate Core manual for more information on how to configure JNDI. The IndexControlMBean API is for now experimental.

Last but not least, there is the IndexingProgressMonitorMBean. This MBean is an implementation of the MassIndexerProgressMonitor interface. If the corresponding property is set and the mass indexer API is used, the indexing progress can be followed via this MBean. The bean is only bound to JMX while indexing is in progress; once indexing is completed the MBean is no longer available. Again, this API is for now experimental.

Do you think this new monitoring and statistics API is valuable? Are you missing any functionality? Let us know via the Search forum or Jira to suggest new features or to report a bug.


First Beta of Hibernate Search 3.3: query DSL and more

Posted by    |       |    Tagged as Hibernate Search

The first beta of Hibernate Search 3.3 is out. We had several goals in mind.

One of them is to morph the project into a more independent piece of software for Infinispan. We have a lot of exciting developments around Infinispan, search and persistence. But that's the subject of another post. On to the meat now.

Hibernate Search query DSL

Probably the most visible feature is the new Hibernate Search query DSL.

Writing Lucene queries is not easy. Either you use the query parser, limiting yourself to fairly simple queries and manipulating strings, or you use the Lucene programmatic query API, which is quite verbose and contains a myriad of settings and alternatives.

On top of that, you need to make sure you apply the same magic at query time as at indexing time: if you don't, the index key you look for will not match and you will get no results. This is particularly true in an object world where two transformations occur:

  • the object is transformed into a string via the Hibernate Search FieldBridge
  • the string is transformed into terms via the analyzer

Each property can have different combinations of field bridge and analyzer.

Hibernate Search solves these problems by transparently applying the appropriate FieldBridge and analyzer to a given queried property. It's also built around a fluent API to make queries very easy to write and, more importantly, easier to read.

QueryBuilder mythQB = searchFactory.buildQueryBuilder().forEntity( Myth.class ).get();

//look for popular modern myths that are not urban
Date twentiethCentury = ...;
Query luceneQuery = mythQB
    .bool()
      .must( mythQB.keyword().onField("description_stem").matching("urban").createQuery() )
        .not() // we want myths that are NOT urban
      .must( mythQB.range().onField("starred").above(4).createQuery() ) // "starred" and "creationDate" are illustrative field names
      .must( mythQB.range().onField("creationDate").above(twentiethCentury).createQuery() )
    .createQuery();

This example shows many things:

  • the fluent API in action (you've got to admit that it's more readable than a raw Lucene query)
  • you pass objects and not string representations (Date and number in this case)
  • description_stem uses a stemming analyzer (e.g. transforming loving into its root word love): no need to apply it yourself before passing the matching string; the query DSL does that for you.

I will blog in more detail about the Hibernate Search query DSL shortly.

Hibernate Core 3.6

This release is compatible with Hibernate Core 3.6 (in Beta3 at the time of writing). A side effect is that manual configuration of the event listeners is no longer necessary even when only using hbm.xml files.


Statistics

Hardy has been busy designing a statistics API (available from the SearchFactory). It gives you a lot of information about Hibernate Search:

  • average and max time for a Lucene query execution
  • average and max time for the object loading process following a Lucene query execution
  • slowest query
  • number of entities indexed of a given type
  • and many more

Again, a more detailed blog post should come soon.

Integration tests for JTA and Spring

We have added integration tests for both Spring Framework and JTA. On the JTA side, we are testing against Bitronix and JBoss Transactions standalone.

Mutable SearchFactory

While not a public feature yet, Hibernate Search now has the ability to add new entity types on the fly.

From the ground up, we have made sure that Hibernate Search is extremely fast and efficient at runtime. To achieve that we have been using an immutable design for the SearchFactory: we pre-compute and store metadata to make indexing and querying efficient. That forced us to know the list of indexed entity types ahead of time.

Infinispan, however, does not necessarily know the list of entities ahead of time. The new design uses a copy-on-change approach to keep the benefits of the immutable model while offering the ability to add new entities. As a user, you won't notice it, but as a framework using Hibernate Search, you will :)

Bug fixes

Of course we also fix bugs :)

Check out the new release in the JBoss Maven repository or download the distribution. You can also read the documentation here. Be aware that this version breaks a couple of SPIs; make sure to check the migration guide.

Hibernate Search 3.2.1 is out: please upgrade

Posted by    |       |    Tagged as Hibernate Search

While working on Hibernate Search 3.3, we have discovered a critical issue in Hibernate Search 3.2. If you use Hibernate Core 3.5 in a JTA environment (recommended), the way Hibernate Search 3.2 registers itself can lead to inconsistent indexing and generate assertion failures. All this is fixed in Hibernate Search 3.2.1, which you can get here; the fix has been ported to trunk as well. We have also added tests to cover the JTA area.

We highly recommend that people using Hibernate Search 3.2.0 migrate to 3.2.1.

For more information on this bug and others fixed in 3.2.1, check out the changelog.

Many thanks to Tom Waterhouse for helping us all along the discovery, fix and testing process.

Hibernate Search 3.2 has been in development for close to a year and now we are releasing it :) Instead of giving you a list of new features, let me highlight a couple of use cases we now cover:

Defining index mappings depending on customer / deployment

The primary way to express index mappings in Hibernate Search is via annotations. This works 95% of the time, but in some cases you want to adjust what gets indexed, and how, in a more fine-grained way. For example:

  • you deliver the same application to different customers and want to give them the opportunity to configure some of the available indexed properties
  • you deploy the same domain model in different apps where each needs specific search capabilities

To achieve this, we have introduced an easy to use, easy to read fluent programmatic API to express index mappings. Check out the programmatic API in the reference documentation or this blog entry.

Index/reindex my data easy and fast

In Hibernate Search, there have been a couple of best practices for initially indexing your data. You needed to read your data from the database in batches, call the index operation, flush and clear the session, and move on to the next batch.

Forget that now. We have a super easy API to index or reindex your data (as simple as two method calls). You can also configure how indexing is done via an intuitive fluent API (yes, we've caught the fluent API virus, and you haven't seen the end of it). Not only is the new approach easy to use, it's also massively faster than the previously recommended best practices. We highly recommend moving to this new approach.

Check out the reference documentation or this blog entry for more info.

I can't use JMS, but I need index clustering

Let me first state that setting up a JMS queue is trivial in any of the modern application servers (esp. JBoss AS :) ) and you get tons of benefits from it (reliability etc). Of course, if you like to waste time and build your own stack on top of Tomcat or equivalent, too bad for you.

Anyway, you can now use an alternative approach to JMS for clustering: we now support raw JGroups communication between cluster members.

My sysadmin needs a way to see what indexing operations have failed and restart them

Luckily, indexing failures do not happen often, but when they do, we need to do something about them.

You now have access to an API to listen to indexing errors and process them as you please. The default implementation logs the error, but you can easily decide to push the errors to some queue for display or replay, send a message via SNMP, etc. The actual error is provided, as well as the list of entities that should have been processed (quite handy for replay).
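To illustrate the idea, here is a self-contained sketch of such a callback that pushes failures onto a replay queue instead of only logging them. The types and method shape are hypothetical stand-ins, not the actual Hibernate Search SPI.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch of an error callback: the indexing backend (hypothetically) hands
// us the failure plus the ids of the entities that should have been indexed,
// and we queue them so a sysadmin can inspect and replay the work later.
public class ReplayingErrorHandler {

    /** A failed unit of indexing work: the cause and the affected entity ids. */
    public record FailedWork(Throwable cause, List<String> entityIds) {}

    private final Queue<FailedWork> replayQueue = new ArrayDeque<>();

    /** Called by the (hypothetical) indexing backend on failure. */
    public void handle(Throwable cause, List<String> entityIds) {
        // the default behavior described above is to log;
        // here we keep the work around for later replay instead
        replayQueue.add(new FailedWork(cause, entityIds));
    }

    public FailedWork nextToReplay() { return replayQueue.poll(); }
    public int pendingCount()        { return replayQueue.size(); }
}
```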

I have a single instance updating the index, can I make it faster?

Yes. If a single instance of Hibernate Search is responsible for updating the index, we can speed things up. Simply add hibernate.search.[default|<index name>].exclusive_index_use true to your configuration.

What else?

Hibernate Search 3.2 runs on Hibernate Core 3.5 and JPA 2. And as always we did many more things for this release, including various optimizations, bug fixes, simplified Hibernate Search settings (especially for dependencies), and a simpler API for bridges.

Check out the web site, download Hibernate Search, or browse the reference documentation. We also have a migration guide from earlier versions of Hibernate Search.

PS: For the Maven users, JBoss has migrated to a new maven repository. Read this user guide to know more.

PPS: We are already on Hibernate Search 3.3. Stay tuned.

Hibernate Search 3.2 CR1: the release train is on

Posted by    |       |    Tagged as Hibernate Search

I am happy to announce the release of Hibernate Search 3.2 CR1. Crossing fingers, this is the latest release before the final version targeted in a few days.

A good 75% of our time has been spent on bug fixes fresh and old (some even fossilized). But we have also added a few interesting new features:

  • we have polished the features we introduced in 3.2 Beta1
  • we moved to the Lucene 2.9 APIs as a first step towards the migration to Lucene 3.0, and we have also upgraded to Solr 1.4 for the declarative analyzer framework
  • Amin, Sanne and I have been working on a new API to catch and process indexing errors: the default implementation logs the errors, but you can write your own custom callback. You could log the failing indexing process to a DB, send a message to a sysadmin, queue the issues for automatic reprocessing, etc.
  • Hibernate Search 3.2 targets Hibernate Core 3.5 and uses some of the new APIs
  • a simpler API is at your disposal to add fields to a Lucene document in your custom bridges (thanks Sanne!)

You can download the release from SourceForge or our Maven repository and read the documentation. Try it out!

Many thanks for the bug reports / feature requests you have sent us: they helped polish this release. Besides the usual suspects, I would like to thank Gustavo Nalle Fernandez and Amin Mohammed-Coleman for their contributions.

Hibernate Search 3.2: fast index rebuild

Posted by    |       |    Tagged as Hibernate Search

One of the points of using Hibernate Search is that your valued business data is stored in the database: a reliable, transactional and relational store. So while Hibernate Search keeps the index in sync with the database during regular use, on several occasions you'll need to rebuild the indexes from scratch:

  • Data existed before you introduced Hibernate Search
  • New features are developed, index mapping changed
  • A database backup is restored
  • Batch changes are applied on the database
  • you get the point, this list could be very long

Evolving, user driven

An API to perform these operations has always existed in previous versions of Hibernate Search, but questions about how to make it faster weren't unusual on the forums. Keep in mind that rebuilding the whole index basically means loading all indexed entities from the database into the Java world to feed Lucene with data to index. I'm a user myself, and the code for the new MassIndexer API was tuned after field experience on several applications and with much involvement of the community.

QuickStart: MassIndexer API

Since version 3.0 the documentation provided a recommended re-indexing routine; this method is still available but a new API providing better performance was added in version 3.2. No configuration changes are required, just start it:

FullTextSession fullTextSession = ...
MassIndexer massIndexer = fullTextSession.createIndexer();
massIndexer.startAndWait();

The above code will block until all entities are reindexed. If you don't need to wait for it, use the asynchronous method:

fullTextSession.createIndexer().start();

Selecting the entities to rebuild the index for

You don't need to rebuild the index for all indexed entities; let's say you want to re-index only the DVDs:

fullTextSession.createIndexer( Dvd.class ).startAndWait();

This will include all of Dvd's subclasses, as all Hibernate Search APIs are polymorphic.

Index optimization and clearing

As in Lucene's world an update is implemented as a remove followed by an add, before adding all entities to the index we need to remove them all from it. This operation is known as purgeAll in Hibernate Search. By default the index is purged and optimized at start, and optimized again at the end; you might opt for a different strategy, but keep in mind that by disabling the purge operation you could later find duplicates. The optimization after purge is applied to save some disk space, as recommended in Hibernate Search in Action.

fullTextSession.createIndexer()
   .purgeAllOnStart( true ) // true by default, highly recommended
   .optimizeAfterPurge( true ) // true is default, saves some disk space
   .optimizeOnFinish( true ) // true by default
   .startAndWait();

Faster, Faster!

A MassIndexer is very sensitive to tuning; some settings can make it orders of magnitude faster when tuned properly, and the good values depend on your specific mapping, environment, database, and even your content. To find out which settings you need to tune you should be aware of some implementation details.

The MassIndexer uses a pipeline with different specialized threads working on it, so most processing is done concurrently. The following explains the process for a single entity type, but this is actually done in parallel jobs for each different entity when you have more than one indexed type:

  1. A single thread named identifier-loader scrolls over the primary keys of the type. The number of threads for this stage is always one, so that a single transaction can define the set of keys to consider. A first hint: use simple keys and avoid complex types, as the loading of keys will always be serialized.
  2. The loaded primary keys are pushed to an id queue; there's one such queue for each root type to be indexed.
  3. At the second stage a threadpool called entity-loader loads batches of entities using the provided primary keys. You can tune the number of threads working on this task concurrently (threadsToLoadObjects(int)) and the number of entities fetched per iteration (batchSizeToLoadObjects(int)). While idle threads might seem a waste, the cost is minor, and it's better to have some spare threads doing nothing than the opposite. Make sure you don't make it too hard for the database by requesting too much data: setting too big a batch size or way too many threads will also hurt; you will have to find the sweet spot. The queues work as buffers to mitigate the effect of performance highs and lows due to different data, so finding the sweet spot is not a quest for the exact value but for a reasonable one.
  4. The entity queue contains a stream of entities needing conversion into Lucene Documents, it's fed by the entity-loader threads and consumed by the document-creator threads.
  5. The document-creator threads will convert the entities into Lucene Documents (applying your Search mapping and custom bridges, transforming data into text). It is important to understand that it's still possible that during conversion some lazy object will be loaded from the database (step 7 in the picture). So this step could be cheap or expensive: depending on your domain model and how you mapped it, there could be more round trips to the database, or none at all. Second level cache interactions might help or hurt in this phase.
  6. The document queue should be a constant stream of Documents to be added to the index. If this queue is mostly near-empty, it means you're producing the data more slowly than Lucene is able to analyse and write it to the index. If this queue is mostly full, it means you're producing Documents faster than Lucene is able to write them to the index. Always consider that Lucene is analysing the text during the write operation, so if this is slow it's not necessarily related to I/O limits; you could have expensive analysis. To find out you'll need a profiler.
  7. The number of document indexer threads is also configurable, so in case of expensive analysis you can have more CPUs working on it.

The queues are blocking and bounded, so there's no danger in setting too many producer threads for any stage: if a queue fills up, the producers will be put on hold until space is available. All thread pools have names assigned, so if you connect with a profiler or debugger the different stages can be promptly identified.
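The staged design above can be sketched with plain java.util.concurrent primitives: one identifier-loader thread, a pool of entity loaders, bounded queues between the stages, and a writer draining the document queue. Stage names, queue sizes and the "document" representation are illustrative; the real MassIndexer has more stages and options.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Miniature version of the MassIndexer pipeline described above.
public class PipelineSketch {

    public static int index(List<Integer> primaryKeys, int loaderThreads) throws Exception {
        BlockingQueue<Integer> idQueue = new ArrayBlockingQueue<>(4);  // bounded: backpressure
        BlockingQueue<String> docQueue = new ArrayBlockingQueue<>(4);
        final Integer ID_END = Integer.MIN_VALUE;                      // poison pill

        // stage 1: a single identifier-loader thread scrolls the primary keys
        Thread idLoader = new Thread(() -> {
            try {
                for (Integer pk : primaryKeys) idQueue.put(pk);
                for (int i = 0; i < loaderThreads; i++) idQueue.put(ID_END);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, "identifier-loader");

        // stage 2: an entity-loader pool turns ids into "documents"
        ExecutorService entityLoaders = Executors.newFixedThreadPool(loaderThreads);
        for (int i = 0; i < loaderThreads; i++) {
            entityLoaders.submit(() -> {
                try {
                    Integer pk;
                    while (!(pk = idQueue.take()).equals(ID_END)) {
                        docQueue.put("doc-" + pk);  // stand-in for load + field bridge
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }

        idLoader.start();
        entityLoaders.shutdown();

        // stage 3: this thread plays the document indexer, draining the queue
        int written = 0;
        for (int i = 0; i < primaryKeys.size(); i++) {
            docQueue.take();  // stand-in for analysis + Lucene write
            written++;
        }
        entityLoaders.awaitTermination(10, TimeUnit.SECONDS);
        idLoader.join();
        return written;
    }
}
```

Because both queues are bounded, a fast stage simply blocks until the slower stage catches up, which is exactly the backpressure behavior described above.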

API for data load tuning

The following settings rebuild my personal reference database in 3 minutes, while I started at 6 hours before enabling these settings, or at 2 months before looking into any kind of Hibernate or Lucene tuning.

fullTextSession.createIndexer()
   .batchSizeToLoadObjects( 30 )
   .threadsForSubsequentFetching( 8 )
   .threadsToLoadObjects( 4 )
   .threadsForIndexWriter( 3 )
   .cacheMode( CacheMode.NORMAL ) // defaults to CacheMode.IGNORE
   .startAndWait();


When some information embedded in the index comes from entities with a low cardinality (and thus a high cache hit ratio), for example a ManyToOne relation to gender or country code, it might make sense to enable the cache, which is ignored by default. In most cases ignoring the cache gives the best performance, especially if you're using a distributed cache, which would introduce unneeded network events.

Offline job

While normally all changes done by Hibernate Search are coupled to a transaction, the MassIndexer uses several transactions, and consistency is not guaranteed if you make changes to the data while it's running. The index will only contain the entities which existed in the database when the job started, and any update made to the index in this timeframe by other means might be lost. While nothing wrong will happen to the data in the database, the index might be inconsistent if changes are made while the job is busy.

Performance checklist

After having parallelized the indexing code, there are some other bottlenecks to avoid:

  1. Check your database behavior: almost all databases provide a profiling tool which can give valuable information when run during mass indexing
  2. Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend for database connections
  3. Properties with EAGER loading that are not needed by Hibernate Search will be loaded anyway: avoid it
  4. Check for network latency
  5. The effect of tuning settings doesn't depend only on static information like the schema, mapping options and Lucene settings, but also on the contents of the data: don't assume five minutes of testing will highlight your normal behavior; collect metrics from real world scenarios. The queues are going to help as buffers for non-constant performance in the various stages.
  6. Don't forget to tune the IndexWriter. See the reference documentation: nothing changed in this area.
  7. If the indexing CPUs are not at 100% usage and the DB is not hitting its limits, you know you can improve the settings

Progress Monitoring

An API is on the road to let you plug in your own monitoring implementation; currently beta1 uses a logger to periodically print the status, so you can enable or disable the relevant loggers to control it. Let us know what kind of monitoring you would like to have, or contribute one!

Hibernate Search 3.2: programmatic mapping API

Posted by    |       |    Tagged as Hibernate Search

One of the innovations we have brought to Hibernate Search is an alternative way to define the mapping information: a programmatic API.

The traditional way to map an entity in Hibernate Search is to use annotations, and that's perfectly fine for 95% of the use cases. In some cases though, people have had a need for a more dynamic approach:

  • they use a metamodel to generate or customize what is indexed in their entities and need to reconfigure things either on redeployment or on the fly based on some contextual information.
  • they ship a product to multiple customers that require some customization.

What people asked for: the XML Way(tm)

For a while, people with this requirement have asked for an XML format equivalent to what annotations could do. Now the problem with XML is that:

  • it's very verbose in the way it duplicates the structural information of your code
<class name="Address">
  <property name="street1">
      <analyzer definition="ngram"/>
      <!-- ... -->
  </property>
</class>
  • while XML itself is type-safe, XML editors are still close to the stone age, and developers writing XML in Notepad are unfortunately quite common
  • even if XML is type-safe, one cannot refactor the Java code and expect compile-time errors, or, even better, automatic integrated refactoring. For example, if I rename Address to Location, I still need to remember to change this in my XML file
  • and finally, dynamically generating an XML stream to cope with the dynamic reconfiguration use case is not what I would call an intuitive solution

So we took a different road.

What they get: a fluent programmatic API

Instead of writing the mapping in XML, let's write it in Java. And to make things easier let's use a fluent contextual API (have intuitive method names, only expose the relevant operations).

SearchMapping mapping = new SearchMapping();

// (the entity and field details below are reconstructed illustratively)
mapping.analyzerDef( "ngram", StandardTokenizerFactory.class )
        .filter( LowerCaseFilterFactory.class )
        .filter( NGramFilterFactory.class )
            .param( "minGramSize", "3" )
            .param( "maxGramSize", "3" )
    .entity( Address.class ).indexed()
        .property( "addressId", METHOD ).documentId()
        .property( "street1", METHOD ).field().analyzer( "ngram" )
        .property( "country", METHOD ).field()
        .property( "movedIn", METHOD ).field();
As you can see, it's very easy to figure out what is going on here. Something you cannot see in this example, though, is that your IDE only offers the relevant methods contextually. For example, unless you have just declared a property(), you won't be able to add a field() to it. Likewise, you can set an analyzer on a field only if you are defining a field. It's like fluent APIs in dynamic languages, but better ;)
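To show why this works, here is a tiny self-contained sketch of a contextual fluent API: each call returns a context type that only exposes the operations valid at that point, so field() is reachable only from a property context and analyzer() only from a field context. All type and method names are illustrative, not the real SearchMapping classes.

```java
// Minimal contextual fluent API: the return type of each method is what
// constrains which calls the IDE offers next.
public class FluentMappingSketch {

    public static EntityContext entity(String name) {
        return new EntityContext(new StringBuilder("entity:" + name + ";"));
    }

    public static class EntityContext {
        final StringBuilder out;
        EntityContext(StringBuilder out) { this.out = out; }
        public PropertyContext property(String name) {
            out.append("property:").append(name).append(";");
            return new PropertyContext(this);
        }
        public String build() { return out.toString(); }
    }

    public static class PropertyContext {
        final EntityContext owner;
        PropertyContext(EntityContext owner) { this.owner = owner; }
        public FieldContext field() {              // only reachable after property()
            owner.out.append("field;");
            return new FieldContext(this);
        }
        public PropertyContext property(String name) { return owner.property(name); }
        public String build() { return owner.build(); }
    }

    public static class FieldContext {
        final PropertyContext owner;
        FieldContext(PropertyContext owner) { this.owner = owner; }
        public FieldContext analyzer(String def) { // only reachable after field()
            owner.owner.out.append("analyzer:").append(def).append(";");
            return this;
        }
        public PropertyContext property(String name) { return owner.property(name); }
        public String build() { return owner.build(); }
    }
}
```

Calling `entity("Address").property("street1").field().analyzer("ngram")` compiles, while `entity("Address").field()` does not even exist as an option: the compiler enforces the grammar of the mapping.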

The next step is to associate the programmatic mapping object to the Hibernate configuration.

//in Hibernate native
Configuration configuration = ...;
configuration.getProperties().put( "hibernate.search.model_mapping", mapping );
SessionFactory factory = configuration.buildSessionFactory();

//in JPA
Map<String,Object> properties = new HashMap<String,Object>(1);
properties.put( "hibernate.search.model_mapping", mapping );
EntityManagerFactory emf = Persistence.createEntityManagerFactory( "userPU", properties );

And voila!


The beauty of this API is that it's very easy for XML fan boys to create their own XML schema descriptors and use the programmatic API when parsing the XML stream. More interestingly, an application can expose specific configuration options (via a simple configuration file, a UI or any other form) and use this configuration to customize the mapping programmatically.

Please give this API a try, tell us what works and what does not, we are still figuring out things to make it as awesome as possible :)


Many thanks to Amin Mohammed-Coleman for taking my half done initiative and polishing it up.

You can get Hibernate Search 3.2 Beta1 here; the complete API documentation is in the distribution, chapter 4.4.

Hibernate Search 3.2.0 Beta1

Posted by    |       |    Tagged as Hibernate Search

It has been quite some time since the latest Hibernate Search release, but we are happy to announce the first beta release of version 3.2.0 with tons of bug fixes and exciting new features. In fact there are so many new features that we are planning to write a whole series of blog entries covering the following topics:

  • The new API for programmatic configuration of Hibernate Search via the hibernate.search.model_mapping property.
  • Ability to rebuild the index in parallel threads using the MassIndexer API. This can be as simple as fullTextSession.createIndexer().startAndWait(), but of course there are plenty of options to fine-tune the behavior.
  • Clustering via JGroups as an alternative to the existing JMS solution. The values for the hibernate.search.worker.backend option are jgroupsSlave and jgroupsMaster in this case.
  • Dynamic boosting via the new @DynamicBoost annotation.

Most of these new features are already documented in the Hibernate Search documentation available in the distribution packages. However, there might be still some gaps in the documentation. If you find any let us know via the Forum or Jira. Smaller new features are:

  • New built-in field bridges for java.util.Calendar and java.lang.Character.
  • Ability to configure Lucene's LockFactory using hibernate.search.<index>.locking_strategy with the values simple, native, single or none.
  • Ability to share IndexWriter instances across transactions. This can be activated via the hibernate.search.<indexname>.exclusive_index_use flag.

Of course we also fixed several bugs of which the following are worth mentioning explicitly:

  • HSEARCH-391 Multi level embedded objects don't get an index update
  • HSEARCH-353 Removing an entity and adding another with same PK (in same TX) will not add second entity to index

For a full changelog see the Jira release notes. Last but not least, Hibernate Search now depends on Hibernate Core 3.5 beta2 and Lucene 2.4.1, and is aligned with JPA 2.0 CR1.

Special thanks to our contributors Sanne Grinovero and Amin Mohammed-Coleman who put a lot of effort into this release.


A few people have asked me to publish my slides on Bean Validation and Hibernate Search. Here we go :)

Speaking of conferences, I will be presenting Hibernate Search and the magic of analyzers at Jazoon (Zurich) on Thursday 25th at 11:30. See you there.

Aaron and I will be talking about Hibernate Search and how it can complement your database when you need to scale big, like in... ahem a cloud. It's on Wednesday, June 3 at 9:45 am in Hall 134. I know it's early, someone in the program committee did not like us so much ;)

I will also do an author signing session of Hibernate Search in Action the same day Wed, June 3 at the JavaOne bookstore.

I will also discuss Bean Validation (JSR-303), what it does and how it integrates in Java EE 6 (which I will demo on stage) and any other architecture. This will be Thursday, June 4 at 13:30 in Hall 134. The latest version of the spec is always available here at least till we make it final. Hibernate Validator 4, the reference implementation is well underway, give it a try.
