In Relation To Hibernate Search

Hibernate Search 3.2 has been in development close to a year and now we are releasing it :) Instead of giving you a list of new features, let me highlight a couple of use cases we now cover:

Defining index mappings depending on customer / deployment

The primary way to express index mappings in Hibernate Search is via annotations. This works 95% of the time, but in some cases you want to adjust what gets indexed and how in a more fine grained way. For example:

  • you deliver the same application to different customers and want to give them the opportunity to configure some of the available indexed properties
  • you deploy the same domain model in different apps where each needs specific search capabilities

To achieve this, we have introduced an easy to use, easy to read fluent programmatic API to express index mappings. Check out the programmatic API in the reference documentation or this blog entry.

Index/reindex my data easy and fast

In Hibernate Search, there have been a couple of best practices for initially indexing your data. You needed to read your data from the database in batches, call the index operation, flush and clear the session, and move on to the next batch.

Forget that now. We have a super easy API to index or reindex your data (as simple as two method calls). You can also configure how indexing is done via an intuitive fluent API (yes, we've caught the fluent API virus and you haven't seen the end of it). Not only is the new approach easy to use, it's also massively faster than the previously recommended best practices. We highly recommend that people move to this new approach.

Check out the reference documentation or this blog entry for more info.

I can't use JMS, but I need index clustering

Let me first state that setting up a JMS queue is super simple and trivial in any of the modern application servers ( esp JBoss AS :) ) and you get tons of benefits from it (reliability etc). Of course, if you like to waste time and build your own stack on top of Tomcat or equivalent, too bad for you.

Anyways, you can now use an alternative approach to JMS for clustering. We now support raw JGroups communications between cluster members.

My sysadmin needs a way to see which indexing operations have failed and restart them

Luckily that does not happen often but when indexing failures happen, we need to do something about them.

You now have access to an API to listen to indexing errors and process them as you please. The default implementation logs the error but you can easily decide to push the errors to some queue for display or replay, send a message via SNMP etc etc. The actual error is provided as well as the list of entities that should have been processed (quite handy for replay).
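As a rough illustration of the "push the errors to some queue for replay" idea, here is a minimal sketch in plain Java. The IndexingErrorListener interface and its onError signature are local stand-ins invented for this example; they are not the Hibernate Search types, so check the reference documentation for the actual callback interface before adapting it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ReplayQueueSketch {

    /** Local stand-in for an indexing-error callback; NOT the Hibernate Search interface. */
    interface IndexingErrorListener {
        void onError(Throwable error, List<String> failedEntityIds);
    }

    /** Logs the error and remembers the affected entities so they can be replayed later. */
    static class QueueingListener implements IndexingErrorListener {
        final Queue<String> toReplay = new ConcurrentLinkedQueue<>();

        @Override
        public void onError(Throwable error, List<String> failedEntityIds) {
            System.err.println("Indexing failed: " + error.getMessage());
            toReplay.addAll(failedEntityIds);
        }
    }

    public static void main(String[] args) {
        QueueingListener listener = new QueueingListener();
        // Hypothetical failure affecting two entities
        listener.onError(new RuntimeException("disk full"), Arrays.asList("Book#1", "Book#2"));
        System.out.println("Entities waiting for replay: " + listener.toReplay);
    }
}
```

A sysadmin-facing tool could then drain the queue and trigger reindexing for each listed entity.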

I have a single instance updating the index, can I make it faster?

Yes, if a single instance of Hibernate Search is responsible for updating the index, we can speed things up. Simply add the following property to your configuration:

hibernate.search.[default|index name].exclusive_index_use true
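To make the bracket notation concrete, the sketch below builds the property key either for all indexes (default) or for one named index and stores it in a java.util.Properties object. The helper names and the index name com.example.Book are assumptions made for this example; how the properties reach your Hibernate configuration depends on your setup.

```java
import java.util.Properties;

public class ExclusiveIndexUseConfig {

    /** Builds the key for one index, or for all indexes when indexName is null. */
    static String exclusiveIndexUseKey(String indexName) {
        String scope = (indexName == null) ? "default" : indexName;
        return "hibernate.search." + scope + ".exclusive_index_use";
    }

    /** Returns properties that switch exclusive index use on for the given scope. */
    static Properties exclusiveIndexUse(String indexName) {
        Properties props = new Properties();
        props.setProperty(exclusiveIndexUseKey(indexName), "true");
        return props;
    }

    public static void main(String[] args) {
        // Applies to every index
        System.out.println(exclusiveIndexUse(null));
        // Applies to a single index; "com.example.Book" is a hypothetical index name
        System.out.println(exclusiveIndexUse("com.example.Book"));
    }
}
```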

What else?

Hibernate Search 3.2 runs on Hibernate Core 3.5 and JPA 2. And as always we did many more things for this release including various optimizations, bug fixes, simplifying the Hibernate Search settings (especially for dependencies), adding a simpler API for bridges.

Check it out on the web site, download Hibernate Search or browse the reference documentation. We also have a migration guide from earlier versions of Hibernate Search.

PS: For the Maven users, JBoss has migrated to a new maven repository. Read this user guide to know more.

PPS: We are already working on Hibernate Search 3.3. Stay tuned.

Hibernate Search 3.2 CR1: the release train is on

I am happy to announce the release of Hibernate Search 3.2 CR1. Fingers crossed, this is the last release before the final version, which is targeted a few days from now.

A good 75% of our time has been spent on bug fixes fresh and old (some even fossilized). But we have also added a few interesting new features:

  • we have polished the features we introduced in 3.2 Beta1
  • we moved to the Lucene 2.9 APIs as a first step towards the migration to Lucene 3.0, and we have also upgraded to Solr 1.4 for the declarative analyzer framework
  • Amin, Sanne and I have been working on a new API to catch and process indexing errors: the default implementation logs the errors but you can write your own custom callback (you could log the failing indexing process to a DB, send a message to a sysadmin, queue the issues for automatic reprocessing, etc.)
  • Hibernate Search 3.2 targets Hibernate Core 3.5 and uses some of the new APIs
  • a simpler API is at your disposal to add fields to a Lucene document in your custom bridges (thanks Sanne!)

You can download the release from SourceForge or our Maven repository and read the documentation. Try it out!

Many thanks for the bug reports / feature requests you have sent us: they helped polish this release. On top of the usual suspects, I would like to thank Gustavo Nalle Fernandez and Amin Mohammed-Coleman for their contributions.

Hibernate Search 3.2: fast index rebuild

One of the points of using Hibernate Search is that your valued business data is stored in the database: a reliable transactional and relational store. So while Hibernate Search keeps the index in sync with the database during regular use, on several occasions you'll need to be able to rebuild the indexes from scratch:

  • Data existed before you introduced Hibernate Search
  • New features are developed, index mapping changed
  • A database backup is restored
  • Batch changes are applied on the database
  • ...you get the point, this list could be very long

Evolving, user driven

An API to perform these operations has always existed in previous versions of Hibernate Search, but questions about how to make it faster weren't unusual on the forums. Keep in mind that rebuilding the whole index basically means that you have to load all indexed entities from the database into the Java world to feed Lucene with data to index. I'm a user myself, and the code for the new MassIndexer API was tuned after field experience on several applications and with much involvement of the community.

QuickStart: MassIndexer API

Since version 3.0 the documentation provided a recommended re-indexing routine; this method is still available but a new API providing better performance was added in version 3.2. No configuration changes are required, just start it:

FullTextSession fullTextSession = ...
MassIndexer massIndexer = fullTextSession.createIndexer();
massIndexer.startAndWait();

The above code will block until all entities are reindexed. If you don't need to wait for it use the asynchronous method:

fullTextSession.createIndexer().start();

Selecting the entities to rebuild the index for

You don't need to rebuild the index for all indexed entities; let's say you want to re-index only the DVDs:

fullTextSession.createIndexer( Dvd.class ).startAndWait();

This will include all of Dvd's subclasses, as all Hibernate Search's APIs are polymorphic.

Index optimization and clearing

As in Lucene's world an update is implemented as remove and then add, before adding all entities to the index we need to remove them all from the index. This operation is known as purgeAll in Hibernate Search. By default the index is purged and optimized at start, and optimized again at the end; you might opt for a different strategy but keep in mind that by disabling the purge operation you could later find duplicates. The optimization after purge is applied to save some disk space, as recommended in Hibernate Search in Action.

fullTextSession.createIndexer()
   .purgeAllOnStart( true ) // true by default, highly recommended
   .optimizeAfterPurge( true ) // true is default, saves some disk space
   .optimizeOnFinish( true ) // true by default
   .start();
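Why disabling the purge can leave duplicates is easiest to see with a toy model. The sketch below is not Lucene and not Hibernate Search: it uses a plain append-only list as a stand-in for the index, and all class and method names are invented for this example. A rebuild only adds documents, so anything already in the index stays there unless it is purged first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PurgeToyIndex {

    // Append-only list standing in for the documents of an index, one entry per entity id
    private final List<String> documents = new ArrayList<>();

    void add(String entityId) {
        documents.add(entityId);
    }

    void purgeAll() {
        documents.clear();
    }

    long count(String entityId) {
        return documents.stream().filter(entityId::equals).count();
    }

    /** Adds every entity again, optionally purging first, as a mass rebuild would. */
    void rebuild(List<String> entityIds, boolean purgeAllOnStart) {
        if (purgeAllOnStart) {
            purgeAll();
        }
        for (String id : entityIds) {
            add(id);
        }
    }

    public static void main(String[] args) {
        List<String> dvds = Arrays.asList("dvd-1", "dvd-2");
        PurgeToyIndex index = new PurgeToyIndex();
        index.rebuild(dvds, true);   // initial build
        index.rebuild(dvds, false);  // rebuild without purge: entries pile up
        System.out.println("copies of dvd-1 without purge: " + index.count("dvd-1"));
        index.rebuild(dvds, true);   // rebuild with purge: one entry per entity again
        System.out.println("copies of dvd-1 with purge: " + index.count("dvd-1"));
    }
}
```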

Faster, Faster!

A MassIndexer is very sensitive to tuning; some settings can make it orders of magnitude faster when tuned properly, and the good values depend on your specific mapping, environment, database, and even your content. To find out which settings you need to tune you should be aware of some implementation details.

The MassIndexer uses a pipeline with different specialized threads working on it, so most processing is done concurrently. The following explains the process for a single entity type, but this is actually done in parallel jobs for each different entity when you have more than one indexed type:

  1. A single thread named identifier-loader scrolls over the primary keys of the type. The number of threads for this stage is always one so that a transaction can define the set of keys to consider. So a first hint is to use simple keys, avoid complex types as the loading of keys will always be serialized.
  2. The loaded primary keys are pushed to an id queue; there's one such queue for each root type to be indexed.
  3. At the second stage a threadpool called entity-loader loads batches of entities using the provided primary keys. You can tune the number of threads working on this task concurrently (threadsToLoadObjects(int)) and the number of entities fetched per iteration (batchSizeToLoadObjects(int)). While idle threads might be considered a waste, this is minor and it's better to have some spare threads doing nothing than the opposite. Make sure you don't make it too hard for the database by requesting too much data: setting a too big batch size or way too many threads will also hurt, you will have to find the sweet spot. The queues will work as buffers to mitigate the effect of performance highs and lows due to different data, so finding the sweet spot is not a quest for the exact value but about finding a reasonable value.
  4. The entity queue contains a stream of entities needing conversion into Lucene Documents, it's fed by the entity-loader threads and consumed by the document-creator threads.
  5. The document-creator threads will convert the entities into Lucene Documents (apply your Search mapping and custom bridges, transform data into text). It is important to understand that it's still possible that during conversion some lazy object will be loaded from the database (step 7 in the picture). So this step could be cheap or expensive: depending on your domain model and how you mapped it, there could be more round trips to the database or none at all. Second level cache interactions might help or hurt in this phase.
  6. The document queue should be a constant stream of Documents to be added to the index. If this queue is mostly near-empty, it means you're producing the data more slowly than Lucene is able to analyse it and write it to the index. If this queue is mostly full, it means you're producing the Documents faster than Lucene is able to write them to the index. Always consider that Lucene is analysing the text during the write operation, so if this is slow it's not necessarily related to I/O limits: you could have expensive analysis. To find out you'll need a profiler.
  7. The document indexer thread number is also configurable, so in case of expensive analysis you can have more CPUs working on it.

The queues are blocking and bounded, so there's no danger in setting too many producer threads for any stage: if a queue fills up the producers will be set on hold until space is available. All thread pools have names assigned, so if you connect with a profiler or debugger the different stages can be promptly identified.
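The back-pressure behaviour of blocking, bounded queues can be sketched with plain java.util.concurrent classes. This is not the MassIndexer implementation: it is a simplified two-queue pipeline with invented class and marker names, borrowing only the thread naming idea from the description above. Producers calling put() are simply parked when a queue is full, so adding loader threads never overflows anything.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPipelineSketch {

    private static final Integer END_OF_IDS = Integer.MIN_VALUE; // marker: no more ids
    private static final String END_OF_DOCS = "<end>";           // marker: one loader is done

    /** Pushes ids 1..total through the pipeline and returns the number of documents written. */
    static int run(int total, int loaderThreads, int queueCapacity) {
        BlockingQueue<Integer> idQueue = new ArrayBlockingQueue<>(queueCapacity);
        BlockingQueue<String> documentQueue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger written = new AtomicInteger();
        Thread[] threads = new Thread[loaderThreads + 2];

        // Single producer: put() blocks while the id queue is full
        threads[0] = new Thread(() -> {
            try {
                for (int id = 1; id <= total; id++) idQueue.put(id);
                // One end marker per loader thread so that each of them terminates
                for (int i = 0; i < loaderThreads; i++) idQueue.put(END_OF_IDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "identifier-loader");

        // Pool of loaders: each one stops after taking one end marker
        for (int i = 0; i < loaderThreads; i++) {
            threads[i + 1] = new Thread(() -> {
                try {
                    Integer id;
                    while (!(id = idQueue.take()).equals(END_OF_IDS)) {
                        documentQueue.put("document for id " + id);
                    }
                    documentQueue.put(END_OF_DOCS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "entity-loader-" + i);
        }

        // Single consumer: finishes once every loader has signalled completion
        threads[loaderThreads + 1] = new Thread(() -> {
            try {
                int finishedLoaders = 0;
                while (finishedLoaders < loaderThreads) {
                    if (documentQueue.take().equals(END_OF_DOCS)) finishedLoaders++;
                    else written.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "index-writer");

        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return written.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 4, 8) + " documents written");
    }
}
```

Because every thread is named, a thread dump or profiler shows at a glance which stage is blocked on put() (its consumer is too slow) and which is blocked on take() (its producer is too slow).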

API for data load tuning

The following settings rebuild my personal reference database in 3 minutes, while I started at 6 hours before enabling these settings, or at 2 months before looking into any kind of Hibernate or Lucene tuning.

fullTextSession.createIndexer()
   .batchSizeToLoadObjects( 30 )
   .threadsForSubsequentFetching( 8 )
   .threadsToLoadObjects( 4 )
   .threadsForIndexWriter( 3 )
   .cacheMode(CacheMode.NORMAL) // defaults to CacheMode.IGNORE
   .startAndWait();

Caching

When some information is embedded in the index from entities having a low cardinality (a high cache hit ratio), for example when there's a ManyToOne relation to gender or countrycode it might make sense to enable the cache, which is ignored by default. In most cases ignoring the cache will result in best performance, especially if you're using a distributed cache which would introduce unneeded network events.

Offline job

While normally all changes done by Hibernate Search are coupled to a transaction, the MassIndexer uses several transactions and consistency is not guaranteed if you make changes to data while it's running. The index will only contain the entities which existed in the database when the job started, and any update made to the index in this timeframe by other means might be lost. While nothing wrong would happen to the data in the database, the index might be inconsistent if changes are made while the job is busy.

Performance checklist

After having parallelized the indexing code, there are some other bottlenecks to avoid:

  1. Check your database behavior, almost all databases provide a profiling tool which can provide valuable information when run during a mass indexing
  2. Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend for database connections
  3. Setting EAGER loading on properties that Hibernate Search does not need will load them anyway; avoid it
  4. Check for network latency
  5. The effect of tuning settings doesn't depend only on static information like schema, mapping options and Lucene settings but also on the contents of data: don't assume five minutes of testing will highlight your normal behavior and collect metrics from real world scenarios. The queues are going to help as buffers for non-constant performance in the various stages.
  6. Don't forget to tune the IndexWriter. See the reference documentation: nothing changed in this area.
  7. If the indexing CPUs are not at 100% usage and the DB is not hitting its limits, you know you can improve the settings

Progress Monitoring

An API is on the road to plug your own monitoring implementation; currently beta1 uses a logger to periodically print status, so you can enable or disable your loggers to control it. Let us know what kind of monitoring you would like to have, or contribute one!

Hibernate Search 3.2: programmatic mapping API

One of the innovations we have brought to Hibernate Search is an alternative way to define the mapping information: a programmatic API.

The traditional way to map an entity into Hibernate Search is to use annotations. And it's perfectly fine for 95% of the use cases. In some cases though, people have needed a more dynamic approach:

  • they use a metamodel to generate or customize what is indexed in their entities and need to reconfigure things either on redeployment or on the fly based on some contextual information.
  • they ship a product to multiple customers that require some customization.

What people asked for: the XML Way(tm)

For a while, people with this requirement have asked for an XML format equivalent to what annotations could do. Now the problem with XML is that:

  • it's very verbose in the way it duplicates the structural information of your code
    <class name="Address">
      <property name="street1">
        <field>
          <analyzer definition="ngram"/>
        </field>
      </property>
      <!-- ... -->
    </class>
  • while XML itself is type-safe, XML editors are still close to stone age, and developers writing XML in notepad are unfortunately quite common
  • even if XML is type-safe, one cannot refactor the Java code and expect to get compile time errors or even better automatic integrated refactoring. For example, if I rename Address to Location, I still need to remember to change this in my xml file
  • and finally, dynamically generating an XML stream to cope with the dynamic reconfiguration use case is not what I would call an intuitive solution

So we took a different road.

What they get: a fluent programmatic API

Instead of writing the mapping in XML, let's write it in Java. And to make things easier let's use a fluent contextual API (have intuitive method names, only expose the relevant operations).

SearchMapping mapping = new SearchMapping();

mapping
    .analyzerDef( "ngram", StandardTokenizerFactory.class )
        .filter( LowerCaseFilterFactory.class )
        .filter( NGramFilterFactory.class )
            .param( "minGramSize", "3" )
            .param( "maxGramSize", "3" )

    .entity(Address.class)
        .indexed()
        .property("addressId", METHOD)
            .documentId()
        .property("street1", METHOD)
            .field()
            .field()
                .name("street1_ngram")
                .analyzer("ngram")
        .property("country", METHOD)
            .indexedEmbedded()
        .property("movedIn", METHOD)
            .dateBridge(Resolution.DAY);

As you can see, it's very easy to figure out what is going on here. But something you cannot see in this example is that your IDE only offers the relevant methods contextually. For example, unless you have just declared a property(), you won't be able to add a field() to it. Likewise, you can set an analyzer on a field only if you are defining a field. It's like the fluent APIs of dynamic languages, but better ;)
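One common way to get this contextual behaviour is to have each step return a different context type that exposes only the operations valid at that point. The sketch below shows the technique in isolation; it is not the SearchMapping source code, and every class and method in it is a simplified stand-in written for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class FluentMappingSketch {

    private final List<String> entries = new ArrayList<>();

    EntityContext entity(String name) {
        entries.add("entity " + name);
        return new EntityContext();
    }

    List<String> entries() {
        return entries;
    }

    /** Returned by entity(): only property() is offered here. */
    class EntityContext {
        PropertyContext property(String name) {
            entries.add("property " + name);
            return new PropertyContext();
        }
    }

    /** Returned by property(): a field can be declared, or the next property started. */
    class PropertyContext {
        FieldContext field() {
            entries.add("field");
            return new FieldContext();
        }

        PropertyContext property(String name) {
            return new EntityContext().property(name);
        }
    }

    /** Returned by field(): analyzer() exists only on this type. */
    class FieldContext {
        FieldContext analyzer(String name) {
            entries.add("analyzer " + name);
            return this;
        }

        FieldContext field() {
            return new PropertyContext().field();
        }

        PropertyContext property(String name) {
            return new EntityContext().property(name);
        }
    }

    public static void main(String[] args) {
        FluentMappingSketch mapping = new FluentMappingSketch();
        mapping.entity("Address")
            .property("street1")
                .field()
                    .analyzer("ngram")
            .property("country");
        // mapping.entity("Address").analyzer("ngram") would not compile:
        // EntityContext has no analyzer() method
        System.out.println(mapping.entries());
    }
}
```

Since invalid call sequences are rejected by the compiler, the IDE's completion list doubles as documentation of what can be configured at each step.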

The next step is to associate the programmatic mapping object to the Hibernate configuration.

// in Hibernate native: setProperty() only accepts String values,
// so the mapping object is put into the underlying Properties
Configuration configuration = ...;
configuration.getProperties().put( "hibernate.search.model_mapping", mapping );
SessionFactory factory = configuration.buildSessionFactory();

// in JPA: the value is an object, hence Map<String,Object>
Map<String,Object> properties = new HashMap<String,Object>(1);
properties.put( "hibernate.search.model_mapping", mapping );
EntityManagerFactory emf = Persistence.createEntityManagerFactory( "userPU", properties );

And voila!

Extensibility

The beauty of this API is that it's very easy for XML fan boys to create their own XML schema descriptors and use the programmatic API when parsing the XML stream. More interestingly, an application can expose specific configuration options (via a simple configuration file, a UI or any other form) and use this configuration to customize the mapping programmatically.

Please give this API a try, tell us what works and what does not, we are still figuring out things to make it as awesome as possible :)

Download

Many thanks to Amin Mohammed-Coleman for taking my half done initiative and polishing it up.

You can get Hibernate Search 3.2 Beta 1 here. The complete API documentation is included in the distribution (chapter 4.4).

Hibernate Search 3.2.0 Beta1

It has been quite some time since the latest Hibernate Search release, but we are happy to announce the first beta release of version 3.2.0 with tons of bug fixes and exciting new features. In fact there are so many new features that we are planning to write a whole series of blog entries covering the following topics:

  • The new API for programmatic configuration of Hibernate Search via org.hibernate.search.cfg.SearchMapping.
  • Ability to rebuild the index in parallel threads using the MassIndexer API. This can be as simple as fullTextSession.createIndexer().startAndWait(), but of course there are plenty of options to fine-tune the behavior.
  • Clustering via JGroups as an alternative to the existing JMS solution. The values for the hibernate.search.worker.backend option are jgroupsSlave and jgroupsMaster in this case.
  • Dynamic boosting via the new @DynamicBoost annotation.

Most of these new features are already documented in the Hibernate Search documentation available in the distribution packages. However, there might be still some gaps in the documentation. If you find any let us know via the Forum or Jira. Smaller new features are:

  • New built-in field bridges for the java.util.Calendar and java.lang.Character
  • Ability to configure Lucene's LockFactory using hibernate.search.<index>.locking_strategy with the values simple, native, single or none.
  • Ability to share IndexWriter instances across transactions. This can be activated via the hibernate.search.<indexname>.exclusive_index_use flag.

Of course we also fixed several bugs of which the following are worth mentioning explicitly:

  • HSEARCH-391 Multi level embedded objects don't get an index update
  • HSEARCH-353 Removing an entity and adding another with same PK (in same TX) will not add second entity to index

For a full changelog see the Jira release notes. Last but not least, Hibernate Search depends now on Hibernate Core 3.5 beta2 and Lucene 2.4.1 and is aligned with JPA 2.0 CR1.

Special thanks to our contributors Sanne Grinovero and Amin Mohammed-Coleman who put a lot of effort into this release.

enjoy!

A few people have asked me to publish my slides on Bean Validation and Hibernate Search. Here we go :)

Speaking of conferences, I will be presenting Hibernate Search and the magic of analyzers at Jazoon (Zurich) on Thursday 25th at 11:30. See you there.

Aaron and I will be talking about Hibernate Search and how it can complement your database when you need to scale big, like in... ahem a cloud. It's on Wednesday, June 3 at 9:45 am in Hall 134. I know it's early, someone in the program committee did not like us so much ;)

I will also do an author signing session of Hibernate Search in Action the same day Wed, June 3 at the JavaOne bookstore.

I will also discuss Bean Validation (JSR-303), what it does and how it integrates in Java EE 6 (which I will demo on stage) and any other architecture. This will be Thursday, June 4 at 13:30 in Hall 134. The latest version of the spec is always available here at least till we make it final. Hibernate Validator 4, the reference implementation is well underway, give it a try.

Hibernate Search 3.1.1 GA

With work on version 3.2 of Hibernate Search well underway and a range of very interesting features in the pipeline (eg programmatic configuration API, bulk indexing and dynamic boosting), we decided to provide some of the bug fixes also for the 3.1 version of Hibernate Search. Hence here is Hibernate Search 3.1.1 GA. On top of several bug fixes which are listed in the release notes we also upgraded Lucene from 2.4 to 2.4.1.

We recommend that users of version 3.1 upgrade to 3.1.1 to benefit from these bug fixes.

You can download the release here.

enjoy!

I am pleased to announce the GA release of Hibernate Search 3.1. This release focuses on performance improvement and code robustness but also adds interesting new features focused on usability:

  • An analyzer configuration model to declaratively use and compose features like phonetic approximation, n-gram approximation, search by synonyms, stop words filtering, elision correction, unaccented search and many more.
  • A lot of performance improvements at indexing time (including reduced lock contention, parallel execution).
  • A lot of performance improvements at query time (including I/O reduction both for Lucene and SQL, and better concurrency).
  • Additional new features both for indexing and querying (including support for term vector, access to scoped analyzer at query time and access to results explanation).

A more detailed overview of the features follows.

Analyzer

  • Support for declarative analyzer composition through the Solr library

Analyzers can now be declaratively composed as a tokenizer and a set of filters. This enables easy composition of the following features: phonetic approximation, n-gram approximation, search by synonyms, stop words filtering, elision correction, unaccented search and so on.

  • Support for dynamic analyzers

Allows a given entity to define the analyzer used at runtime. A typical use case is multi-language support where the language varies from one entity instance to another.

Indexing

Indexing performance has been enhanced and new controls have been added

New features

  • Better control over massive manual indexing (flushToIndexes())
  • Support for term vector
  • Support for custom similarity
  • Better control over index writing (RAM consumption, non-compound file format flag, ...)
  • Better support for large index replication

Performance improvements

  • Improve contention and lock window during indexing
  • Reduce the number of index opening/closing
  • Indexing is done in parallel per directory

Querying

New useful features have been added to queries and performance has been improved.

New features

  • Expose entity-scoped and named analyzer for easy reuse at query time
  • Filter results (DocIdSet) can now be cached declaratively (default)
  • Query results Explanation is now exposed for better debugging information

Performance improvements

  • Reduces the number of database roundtrips on multi-entity searches
  • Faster Lucene queries on indexes containing a single entity type (generally the case)
  • Better performance on projected properties (no noticeable overhead compared to raw Lucene calls)
  • Reduction of I/O consumption on Lucene by reading only the necessary fields of a document (when possible)
  • Reduction of document reads (during pagination and on getResultSize() calls)
  • Faster index reopening (keeps unchanged segments opened)
  • Better index reader concurrency (use of the read only flag)

Libraries

  • Migration to Lucene 2.4 (and its performance improvements)
  • Upgrade to Hibernate Core 3.3
  • Use SLF4J as log facade

Bug fixes

  • Fix a few race conditions on multi core machines
  • Resources are properly discarded at SessionFactory.close()
  • Fix bug related to embedded entities and @Indexed (HSEARCH-142)
  • Filter instances are properly cached now
  • And more (see the change logs)

Download and all

You can download the release here. Changelogs are available on JIRA. Migrating to this version is recommended for all users (see the Migration guide).

Many thanks to everyone who contributed to the development and testing of this release. In particular, many thanks to Sanne and Hardy, who worked tirelessly pushing new features and enhancements and fixing bugs while John and I were finishing Hibernate Search in Action.

We have some features in the works that we could not put in 3.1, so stay tuned.

Hibernate Search 3.1 enters release candidate phase

Hibernate Search 3.1.0.CR1 is out. Download it here.

One of the main tasks was to align more closely with the new Lucene features:

  • read-only IndexReader for better concurrency at search time
  • use of DocIdSet rather than BitSet in filter implementations for greater flexibility
  • explicit use of Lucene's commit()

We have also added a few performance tricks:

  • avoid reading unnecessary fields from Lucene when possible
  • use Hibernate queries when projecting the object instance (as opposed to relying on batch size)

@DocumentId is now optional and defaults to the property marked as @Id. Scores are no longer normalized (i.e. they are no longer guaranteed to be <= 1).

The full changelog is available here. Expect a GA in the next two weeks. Help us iron this release and provide issue reports in JIRA.

Many thanks to Hardy for literally hammering out fix after fix and making the release happen on time.
