
Hibernate Search 4.4.0.Beta1 is ready for download! You can get it either from Maven repositories or from SourceForge.

Index Sharding

Sharding is a common practice among Apache Lucene users, and Hibernate Search has supported it for years. It means splitting the index storage across multiple Lucene indexes while hiding the resulting complexity from the application. It is most commonly used to:

  • Keep individual index sizes reasonable: handy for backups and performance
  • Specialize individual indexes for different language / terminology (more on this below)
  • Use separate master nodes to scale write throughput across multiple nodes
  • Meet legal requirements to store some data on physically independent media

Until now, however, you had to configure the number of shards in the Hibernate Search configuration, which required knowing in advance which shards your application would use.

Dynamic Sharding

With the new feature added in this 4.4.0.Beta1 release, you no longer have to know in advance which shards you might need at runtime. For example, if you shard your entities by description language, simply storing an entity in a new language can trigger the creation of the new index infrastructure on the fly.

All details can be found in the reference documentation.

With the previous sharding feature, which we now call static sharding and which is deprecated, you might have been used to dealing with an array of indexes. Shards were identified by their position in the array. In the new model, shards are identified by a name: a simple String which maps to their IndexManager name.
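
To make the difference between the two addressing models concrete, here is a minimal sketch in plain Java. It does not use any Hibernate Search classes: the class name ShardAddressingSketch, the modulo-hash routing and the in-memory map are all assumptions made for illustration, not the library's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration only: not part of Hibernate Search.
class ShardAddressingSketch {

    // Static model: the number of shards is fixed up front and a shard
    // is addressed by its position in an array.
    static int staticShardIndex(String entityId, int numberOfShards) {
        // floorMod keeps the result in [0, numberOfShards) even for negative hash codes
        return Math.floorMod(entityId.hashCode(), numberOfShards);
    }

    // Dynamic model: a shard is addressed by name and created on first use.
    private final Map<String, List<String>> shardsByName = new ConcurrentHashMap<>();

    void addToNamedShard(String shardName, String entityId) {
        shardsByName
            .computeIfAbsent(shardName, name -> new CopyOnWriteArrayList<>())
            .add(entityId);
    }

    Set<String> shardNames() {
        return shardsByName.keySet();
    }
}
```

In the static variant the caller must pick numberOfShards before any data exists; in the named variant a previously unseen name such as "fr" simply becomes a new entry.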

Implementors will need to create a ShardIdentifierProvider, which fulfills the following needs:

Discover existing shards at boot time

Since the shards are not defined in the configuration, you need to provide the list of known shards through code. A new mechanism was set up that allows you, for example, to query the database using a Hibernate Session during the initialization phase. See also the AnimalShardIdentifierProvider example implementation.

Discover new shards at runtime

The second operation that a ShardIdentifierProvider needs to provide is to watch for new shard identifiers and notify the framework about them.

List the known shard identifiers

Finally, the ShardIdentifierProvider implementation will need to keep a record of known shard names. That requires a bit of concurrent code; hopefully the example will serve as inspiration.
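
The three responsibilities described above can be sketched in plain Java as follows. This is a simplified stand-in, not an implementation of the real ShardIdentifierProvider interface (see the reference documentation for its exact signatures): the class name LanguageShardTracker, its method names and the idea of using the language as the shard name are assumptions made for illustration.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: mirrors the three responsibilities, but does NOT
// implement the real Hibernate Search interface.
class LanguageShardTracker {

    // Thread-safe set, because indexing threads may discover shards concurrently.
    private final Set<String> knownShards = ConcurrentHashMap.newKeySet();

    // 1. Boot time: register the shards that already exist, for example
    //    the result of a database query run during initialization.
    void initialize(Collection<String> shardsFoundAtBoot) {
        knownShards.addAll(shardsFoundAtBoot);
    }

    // 2. Runtime: decide which shard an entity belongs to and remember
    //    the shard if it has not been seen before.
    String shardFor(String language) {
        knownShards.add(language);
        return language;
    }

    // 3. Listing: hand out a read-only snapshot of the known shard names.
    Set<String> allShards() {
        return Collections.unmodifiableSet(new HashSet<>(knownShards));
    }
}
```

Returning a snapshot rather than the live set is a design choice in this sketch: callers can iterate over it without being affected by shards that are added concurrently.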

Optionally, you can also make your implementation really smart by watching for your custom FullTextFilters being applied to queries, to narrow down the set of shards on which a query should be executed. See more at Using filters in a sharded environment.
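
As a rough illustration of that narrowing step, the following plain-Java sketch picks a single shard when a query carries a language filter and falls back to all shards otherwise. The real filter objects are replaced here by a simple map from filter name to parameter value; that map, the filter name "language" and the class QueryShardNarrowing are hypothetical and not part of Hibernate Search.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: filters are modeled as a name-to-value map,
// not as the real full-text filter objects.
class QueryShardNarrowing {

    // Assumed name of a custom filter that restricts results to one language.
    static final String LANGUAGE_FILTER = "language";

    static Set<String> shardsForQuery(Set<String> allShards, Map<String, String> activeFilters) {
        String language = activeFilters.get(LANGUAGE_FILTER);
        if (language != null && allShards.contains(language)) {
            // The filter pins the query to one known shard: search only that one.
            return Collections.singleton(language);
        }
        // No usable filter: the query has to run on every known shard.
        return allShards;
    }
}
```

Falling back to all shards when the filter value is unknown is the conservative choice in this sketch: the query may do more work than necessary, but it never skips a shard that could hold matching documents.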

As usual the issue tracker is JIRA and all code is on GitHub: pull requests and feedback welcome.

For a detailed list of all changes in this release, see the release notes.

The next goal is to work towards a 4.4.0.Final release. If you can help us get there quickly, we'll finally branch towards the next major release and start the transformations needed to support Apache Lucene version 4.

5 comments:
 
28. Sep 2013, 14:07 CET | Link
Adrian

This is great news! Is there support for the mass indexer? We would like to be able to mass index on a per shard basis. In other words we would like to have a shard per customer and the ability to reindex per customer.

Thanks

 
29. Sep 2013, 12:24 CET | Link
Adrian wrote on Sep 28, 2013 08:07:
This is great news! Is there support for the mass indexer? We would like to be able to mass index on a per shard basis. In other words we would like to have a shard per customer and the ability to reindex per customer.

Hi Adrian, no, we hadn't thought about that, but it seems like an excellent idea! Please file it on JIRA as a feature request.

 
02. Oct 2013, 12:31 CET | Link
Adrian

It should be there! I remember commenting on it previously, but maybe it was an actual JIRA ticket.

 
02. Oct 2013, 12:36 CET | Link
Adrian

Okay, it was a comment related to https://hibernate.atlassian.net/browse/HSEARCH-499. Would this feature cover sharding? Both would be useful to me, as Hibernate Search often goes out of sync with the database (I thought this was supposed to be impossible with transactions?).

 
02. Oct 2013, 13:43 CET | Link
... as hibernate search often goes out of sync with the database (I thought this was supposed to be impossible with transactions?).

Maybe not impossible, but it should be very unfortunate, yes. I would be very interested to know more about how you get them out of sync. If you happen to find some clues, please start a thread on the forums so we can try thinking about it.
