


In Relation To Sanne Grinovero


New paths to indexing: do we know better?

Tagged as Hibernate Search

Hibernate Search 4.1.0.Beta2 is released, and contains a very interesting improvement: it is now possible to precisely express which paths will be indexed when using @IndexedEmbedded.

Previously, when using @IndexedEmbedded, we would walk the entity graph and index every traversed branch, following all paths to the same depth until either the specified maximum depth was reached or a smaller depth value was encountered. In a complex model it could become difficult to control exactly what would get indexed.

On the forums Zach Kurey, who was having this problem, asked me just out of curiosity why we didn't provide an explicit paths-to-be-included option. Surely, he wrote, there must be a reason. Truth be told, there was no reason: we just hadn't thought about it.

So, if you have suggestions, don't think we know better. Get in touch! Our role is to protect the quality of the code and catalyse the experience of many clever users: we need to hear from you to keep on improving.

After a long discussion about the API and implementation details, this release makes the new @IndexedEmbedded(includePaths) feature available for everyone to use. Thanks to Zach and Davide D'Alto: after contributing to the design they also provided the patches and tests that turned this brilliant idea into a working feature.

How does it work?

In the following Indexed Entity we declare that when indexing each Person we want to index the name and surname fields, and its parents as well by using the well known @IndexedEmbedded annotation:

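import java.util.Set;

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

import org.hibernate.search.annotations.ContainedIn;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.IndexedEmbedded;

@Entity
@Indexed
public class Person {

   @Id
   public int id;

   @Field
   public String name;

   @Field
   public String surname;

   // Index only the "name" of each parent, not all of its fields
   @OneToMany
   @IndexedEmbedded(includePaths = { "name" })
   public Set<Person> parents;

   // Inverse side: lets changes to a parent trigger re-indexing of the child
   @ContainedIn
   @ManyToOne
   public Person child;

   // ... other fields omitted
}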

What's new is the includePaths attribute of the annotation, which states that we don't want to recursively index all fields of the parent Person, but only its name field.

This was a very simple example; the reference documentation contains more examples and details. In short, the feature provides better control over which fields are indexed and avoids indexing unnecessary objects, which in turn improves overall performance.
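To make the difference between depth-based and path-based selection more concrete, here is a small plain-Java sketch. It is a toy model only: the class `IncludePathsSketch`, its methods and the dotted field names it produces are assumptions made for illustration, and it does not reproduce how Hibernate Search actually walks the entity graph.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: it mimics which index field names would be selected,
// and is NOT how Hibernate Search implements the traversal.
class IncludePathsSketch {

    // Hypothetical description of Person: plain fields and embedded associations.
    static final List<String> FIELDS = Arrays.asList("name", "surname");
    static final List<String> EMBEDDED = Arrays.asList("parents");

    // Depth-based selection: every field of every branch, down to maxDepth levels.
    public static SortedSet<String> byDepth(int maxDepth) {
        SortedSet<String> selected = new TreeSet<>();
        collect("", maxDepth, selected);
        return selected;
    }

    private static void collect(String prefix, int remainingDepth, SortedSet<String> selected) {
        for (String field : FIELDS) {
            selected.add(prefix + field);
        }
        if (remainingDepth == 0) {
            return;
        }
        for (String association : EMBEDDED) {
            collect(prefix + association + ".", remainingDepth - 1, selected);
        }
    }

    // Path-based selection: the entity's own fields plus only the listed paths.
    public static SortedSet<String> byIncludePaths(String association, String... includePaths) {
        SortedSet<String> selected = new TreeSet<>(FIELDS);
        for (String path : includePaths) {
            selected.add(association + "." + path);
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println("depth = 2:    " + byDepth(2));
        System.out.println("includePaths: " + byIncludePaths("parents", "name"));
    }
}
```

In this toy model a depth of 2 selects six field names, while listing the single path "name" selects only three: the two fields of the Person itself plus parents.name.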

Hibernate Search 4.1.0.Beta2 awaits you!

Of course this release also contains more bugfixes and improvements; for more details check the release notes.

Hibernate Search version 4.1.0.Beta1 was tagged; the most essential change compared to January's release 4.1.0.Alpha1 was HSEARCH-1034, made to allow Infinispan Query to use the fluent Programmatic Mapping API as already available to Hibernate users.

More changes are being developed: stay tuned for new MassIndexer improvements and some new performance-improving tricks. A fierce discussion is also going on about a new pragmatic way to define index mappings starting from the Query use cases.

Integrations with Infinispan

The Infinispan project released a new milestone version 5.1.1.FINAL, which is relevant to Hibernate Search users in many ways:

  • Hibernate Search can use Infinispan to distribute the index among several clustered nodes.
  • JBoss AS 7.1 will use this version as the fundamental clustering technology.
  • Hibernate OGM can map JPA entities to Infinispan instead of a database, use Hibernate Search as its query engine, and replicate the indexes by storing them in Infinispan.
  • Infinispan Query uses the Hibernate Search Engine component to make it possible to search across the values stored in Infinispan. All you need to do is add the dependency on infinispan-query, enable indexing in the configuration, and either annotate the objects you store in the grid as you would with Hibernate Search entities, or define the mappings using the programmatic API.

More details on Infinispan Query can be found in the Infinispan reference, but if you're familiar with Hibernate Search there's not much to learn, as the two share most features and configuration options as defined in the Hibernate Search reference manual.

Hibernate Search 4.1 is coming

Tagged as Hibernate Search

We tagged Hibernate Search 4.1.0.Alpha1, and artifacts are now ready to be downloaded. 4.1 is meant to mainly upgrade the core dependencies and will have a quick development cycle.

Upgraded dependencies

  • Apache Lucene 3.5
  • Infinispan 5.1
  • JGroups 3.0

To use the above versions, upgrading is required, as each of the mentioned projects changed some of the APIs used by Hibernate Search. Of course Hibernate Search shields you from these changes and remains fully backwards compatible.

MassIndexer performance

The MassIndexer is quick again! To be honest this is not an improvement but a fix for a performance regression. If you noticed a performance drop in mass indexing using 4.0.0.Final, please try again with this new release and you will see a significant improvement. While working towards 4.1 final we're going to improve its features and possibly its performance even more, finally taking advantage of the new internal design provided by 4.0.

Great contributions

Guillaume Smet identified and fixed a regression for which dirty collections would not be re-indexed when having a custom FieldBridge instead of the standard @IndexedEmbedded.

Davide D'Alto improved the algorithm identifying the elements which need to be loaded and re-indexed: it's now able to avoid some unnecessary database loads in specific use cases having complex relations, consequently also reducing the index size.

The usual links

As always, distribution bundles are available from Sourceforge, or you can get the release via Maven artifacts. User questions are welcome on the forums; bugs and improvements can be discussed on the mailing list or posted to JIRA directly, possibly with unit tests.

Complete details of all changes are tracked on JIRA.

After Devoxx, JBug Newcastle

Tagged as Events

Infinispan team at Devoxx

Two weeks ago we were at Devoxx: together with Pete Muir and Mircea Markus we ran a three-hour workshop about using Infinispan in a real world JEE application. All our notes for the presentation are available here, including the source code used for the demo and all slides.
The instructions point to both a zip of the source code and a Git repository; if you're familiar with Git, the history contains each step from the guide, so you can try to follow the workshop chapter by chapter. We hope it's clear enough for anyone not familiar with Infinispan; if not, questions and suggestions for improvements are welcome.

Hibernate OGM and Search updates

At the same conference, as members of the Hibernate OGM team, we met Greg Luck of EHCache fame and started making some concrete plans to support EHCache as a data store for Hibernate OGM. If anyone wants to write a custom module for OGM, please note that we now have an experimental integration layer and Infinispan is no longer a dependency: we have instead an example implementation using a HashMap, so it should be easy to integrate with any other NoSQL database. Some interest was shown in Neo4J, MongoDB and HBase integration, but we need a volunteer to start working on it... feel free to join!

In a different area, at the same conference we met Karel Maesen of Hibernate Spatial, so stay tuned for better integration there; if you're interested in geolocation you might want to have a look at the draft for integration in Hibernate Search being proposed by Emmanuel and Nicholas Helleringer at HSEARCH-923.

Next week: Arquillian at JBug Newcastle

Next week I'll be in the Newcastle office introducing Arquillian and Shrinkwrap together with Paul Robinson, the lead of the Web Service Transactions project. The talk is named Testing JEE Applications in the container using Arquillian: after an introduction on the coolest testing tools we plan to run a workshop and have everyone try it out.

The workshop is scheduled for Tuesday 13th December in the University of Newcastle, and as always discussions and questions are welcome on any JBoss technology. Full details of the event can be found here.

The OpenBlend conference in Ljubljana, Slovenia will be held on 15th September in the fabulous setting of the Ljubljana Castle.

Since it was incredibly complex to plan my travel to get there, I'll make it worth the effort by having two talks:

  1. Introducing Hibernate OGM: porting JPA applications to NoSQL
  2. Introduction to Byteman and The Jokre

Both projects are very young, in fact I think this is going to be the first time we reveal (1) the Jokre - a very innovative optimisation engine - and Hibernate OGM is definitely a hot topic.

I also look forward to seeing the other talks of the day and to meeting team mates such as Bela Ban from JGroups and Infinispan, Adam Warski the creator of Hibernate Envers (but presenting Torquebox & CDI), Aleš Justin the Weld lead and master of the conference, and everyone else attending. Above all, it's always nice to hear what people do or would like to do with the tools we build, and to meet more people willing to join the open source effort.

1- please don't cheat by downloading the source code yet: it's pointless, you won't understand it. If you do, please add some comments to the code.

A much requested Hibernate Search 3.4.1 released

Tagged as Hibernate Search

While our focus has been on the exciting new improvements in Hibernate Search 4, since the last stable release 3.4.0.Final we have received a lot of interesting feedback from the community, including bug reports and patches.

Since some contributors have asked for a bugfix release, here comes Hibernate Search 3.4.1.Final!

What's new

  • Some tricky indexing issues with @IndexedEmbedded entities in a @ManyToOne relation were fixed
  • Faceting was a new feature in 3.4, and several bugs in it were fixed
  • 3.4 introduced dirty checking of collections: a bug was solved and performance was improved even more

All details are tracked on JIRA.

A sad warning

As we now mention in the documentation too, Java 7 is not yet a recommended VM to use.

The usual links

As always, distribution bundles are available from Sourceforge, or you can get the release via Maven artifacts. Questions can be posted on the forums; bugs can be discussed with us or posted to JIRA directly, possibly with unit tests.

Many thanks to Mathieu Perez, Nikita D, Kyrill Alyoshin, Elmer van Chastelet, Guillaume Smet and Samppa Saarela for their code analysis, tests and fixes.

Hibernate Search 4 is coming

Tagged as Hibernate Search

The release cycle of Hibernate Search 4 has begun. Alpha 1 is out. We already have many things implemented, so this change is substantial, and more releases will come quickly.

The goals of Hibernate Search 4 are twofold:

  • Be compatible with the new Hibernate Core 4 releases.
  • Make the necessary architecture change to reach the future goals of Hibernate Search.

In particular, we want to make Hibernate Search independent from Hibernate Core and to allow more scalable cloud-tainted backends.

This release already includes lots of changes

Split between API / SPI and implementation classes

Each class is now categorized as either an API, an SPI or an implementation class.

  • APIs (in regular packages) are safe to be used in your application and we try very hard not to break these contracts
  • SPIs (in .spi packages) are classes that are used by frameworks integrating with Hibernate Search (like Infinispan's search module). These contracts are pretty stable but might change more often than APIs
  • Implementations (in .impl packages) are implementation details. Don't let your application depend on these.

If you were a good citizen and already used the API only, you should not be affected. If you were using SPI or internal classes, you will have to adjust. Check our wiki page for the migration guide.

Move to JBoss Logging and error codes

JBoss Logging has some nice features, including error internationalization and error codes in messages. You will be able to Google HSEARCH00043 and see why you have such a problem.

Integration with Hibernate Core 4

Nuf said.

Move to the per index backend architecture

This will give you more flexibility in how you want your entities indexed, and gives us the possibility of additional optimizations down the road. You can use different technologies for each index, for example a Lucene backend for some indexes and an Infinispan index for others which need real-time clustering. It's also now possible to configure the performance parameters of each index separately, from the async/sync option to the number of Worker threads and the queue size in the backend executor.

MassIndexer is no longer an exclusive mode

The MassIndexer no longer locks out the main backend listening for Hibernate events, so it can be started while other transactions run. Until it has finished, however, some results might be missing from the index.

New binary format of communication between remote backends

Lucene no longer guarantees the Serializable contract for Documents and Fields. This is a problem when you use a clustered model for Hibernate Search.

So we have introduced a new communication protocol in the JMS and JGroups backends so they can pass along Lucene works in a safe way. This aligns with our quest to shield you from incompatible changes Lucene may make in future versions. We also want to make it easier to upgrade a cluster of Hibernate Search nodes and let them interact without issue when possible.

So now you can use NumericField in a clustered environment, which was previously not possible as it has never been Serializable.

Aggressive on performance

The backend is now quite aggressive in write performance, enabling exclusive_index_use by default and having merged some of the performance tricks from the MassIndexer back into the standard backend to take advantage of them all the time. For example the new backend design allows us to analyse Documents in multiple threads while still guaranteeing writes happen in the proper order. This is configured with the worker.thread_pool.size property, defaulting to one, and applies even to backends configured for synchronous updates.
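The idea of analysing in parallel while still writing in order can be sketched in plain Java. This is only an illustration of the general technique, not the Hibernate Search backend: the class `OrderedWritesSketch`, the `analyse` stand-in and the use of an in-memory list as the "index" are all assumptions. The work is submitted to a thread pool, and results are consumed in submission order, so a slow task delays the writes behind it but never reorders them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OrderedWritesSketch {

    // Hypothetical stand-in for the expensive text analysis step.
    static String analyse(String document) {
        return document.trim().toLowerCase(Locale.ROOT);
    }

    // Analyses documents on several threads, then "writes" them in the original order.
    public static List<String> analyseThenWriteInOrder(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first: analysis of different documents may overlap.
            List<Future<String>> pending = new ArrayList<>();
            for (String document : documents) {
                Callable<String> task = () -> analyse(document);
                pending.add(pool.submit(task));
            }
            // Consume the futures in submission order: this preserves write order.
            List<String> written = new ArrayList<>();
            for (Future<String> result : pending) {
                written.add(result.get());
            }
            return written;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(analyseThenWriteInOrder(Arrays.asList(" First", "SECOND ", "Third"), 4));
    }
}
```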

Get the release

That's all for now; check out the release and make sure to read the Migration Guide.

Many thanks to the community and particularly Davide D'Alto for his contributions and Adam Harris, Samppa Saarela and Elmer van Chastelet for their suggestions for performance and design improvements.

As every month, we're having a JBoss user group meeting in Newcastle. Next Tuesday 12th July at 6pm I'll be presenting

Hibernate Search and OGM: taking advantage of NoSQL leaving out the complexity

and as always I'm looking forward to talking to all attendees: we'll follow up with free drinks and food, and a chance to discuss anything: open questions, suggestions, opportunities to describe your use cases in person, and to meet other developers.

Discussing code changes is also an option, so don't miss this opportunity to make sure your favourite tools are better fit for your needs!

full abstract and venue details here

Reminder: the location is in a highly secured building; if you arrive late, make sure to get in touch so that someone can open the door for you.

For more cool developer oriented events in Newcastle, see the JBug homepage

Visualizing data structures is not easy, and I'm confident that a great deal of the success of the exceptionally well received demo we presented at the JBoss World 2011 keynote originated from the nice web UIs projected on the multiple big screens. These web applications were effectively visualizing the tweets flowing in, the voted hashtags highlighted in the tag cloud, and the animated Infinispan grid, with the nodes dancing on an ideal hash wheel to visualize the data distribution among them.

So I bet that everybody in the room got a clear picture of the fact that the data was stored in Infinispan, and by live unplugging a random server everybody could see the data reorganize itself, making it seem a simple and natural way to process huge amounts of data. Not all technical details were explained, so in this and the following post we're going to detail what you did not see: how was the data stored, how could Drools filter the data, how could all visualizations load the grid stored data, and still be developed in record time?

JPA on a Grid

All those applications were simply using JPA: Java Persistence API. Think about the name: it's clearly meant to address all persistence needs of Java applications; in fact while it's traditionally associated with JDBC databases, we have just shown it's not necessarily limited to those databases: our applications were running an early preview of the Hibernate Object/Grid Mapper: Hibernate OGM, and the simple JPA model was mapped to Infinispan, a key/value store.

Collecting requirements

The initial plan didn't include Hibernate OGM, as it was still very experimental: it had never been released, nor even tagged. It was clear, however, that we wanted to use Infinispan, both to store and to search the tweets. Kevin Conner was the architect envisioning the technical needs; he managed to push each developer involved to do their part, and finally assembled it all into a working application in record time. So he came to Emmanuel and me with a simple list of requirements:

  • we want to showcase Infinispan
  • we want to store Tweets, many of them, in real time coming in from a live stream from Twitter
  • we need to be able to iterate them all in time order, to roll back the stream and re-process it (as you can see in the demo recording, we had a fake cheater and wanted to apply stricter rules to filter out invalid tweets at a second stage, without losing the originally collected tweets).
  • we need to know which projects are voted the most: people are going to express preferences via hashtags in their tweets
  • we want to know who's voting the most
  • it has to perform well, potentially on a big set of data

Using Lucene

So, to implement these search requirements, you have to consider that since Infinispan is a key/value store, performing queries is not as natural as it would be on a database. Infinispan currently provides two main approaches: using the Query module or defining some simple Map/Reduce tasks.

Also, consider those requirements. Using SQL, how were we going to count all tweets containing a specific hashtag, extract this count for all collected hashtags, and sort them by frequency? On a relational database, that would have been a very inefficient query involving at least a full table scan, possibly a scan per hashtag, and it would have required a prior list of hashtags to look for. We wanted to extract the most frequently mentioned tags, but we didn't really know what to look for, as people were free to vote for anything.

A totally different approach is to use an inverted index: every time you save a tweet, you tokenize it, extract all terms and so keep a list of terms with pointers to the containing tweets, and store the frequency as well. That's exactly how full-text search engines like Lucene work; in addition to that Lucene is able to apply state-of-the-art optimizations, caches and filtering capabilities. Both our Infinispan Query and Hibernate Search provide nice and easy integrations with Lucene (they actually share the same engine, one geared towards Infinispan users and one to Hibernate and JPA users).
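A minimal sketch of the inverted index idea described above, in plain Java. It is deliberately naive and is not how Lucene is implemented: the class `InvertedIndexSketch`, the whitespace tokenizer and the in-memory map are assumptions for illustration. Each term points to the list of tweets containing it, so the number of tweets mentioning a hashtag is just the length of that list, and the most popular hashtags can be found without knowing them in advance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class InvertedIndexSketch {

    // term -> ids of the tweets containing that term (the "postings list")
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    // Tokenizes the tweet on whitespace and records each distinct term once per tweet.
    public int add(String tweet) {
        int id = nextId++;
        Set<String> seen = new HashSet<>();
        for (String term : tweet.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty() && seen.add(term)) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
            }
        }
        return id;
    }

    // Number of tweets containing the term: a map lookup, no scan of the tweets.
    public int documentFrequency(String term) {
        List<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? 0 : ids.size();
    }

    // Hashtags ordered by descending frequency, ties broken alphabetically.
    public List<String> topHashtags(int limit) {
        List<String> tags = new ArrayList<>();
        for (String term : postings.keySet()) {
            if (term.startsWith("#")) {
                tags.add(term);
            }
        }
        tags.sort((a, b) -> {
            int byCount = Integer.compare(documentFrequency(b), documentFrequency(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(tags.subList(0, Math.min(limit, tags.size())));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("I vote #Infinispan");
        index.add("#infinispan rocks and so does #hibernate");
        System.out.println(index.topHashtags(5));
    }
}
```

A real analyzer would also strip punctuation, fold accents and drop stop words, which is what the analyzer definition on the entity takes care of.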

Counting who voted the most is technically comparable to counting term frequencies, so again Lucene would be perfect. Sorting all data on a timestamp is not by itself a good reason to introduce Lucene, but it's able to do that pretty well too, so Lucene would indeed solve all query needs for this application.

Hibernate OGM with Hibernate Search

So Infinispan Query could have been a good choice. But we opted for Hibernate OGM with Search, as they would provide the same indexing features and on top of that offer a nice JPA interface. I also have to admit that Hibernate OGM was initially discarded as it was lacking an HQL query parser: my fault, being late with implementing it, but in this case it was not a problem, as all the queries we needed were better solved using full-text queries, which are not defined via HQL.

Model

So what does our model look like? Very simple: it's a single JPA entity, enhanced with some Hibernate Search annotations.

@Indexed(index = "tweets")
@Analyzer(definition = "english")
@AnalyzerDef(name = "english",
	tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
	filters = {
		@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
		@TokenFilterDef(factory = LowerCaseFilterFactory.class),
		@TokenFilterDef(factory = StopFilterFactory.class, params = {
				@Parameter(name = "words", value = "stoplist.properties"),
				@Parameter(name = "resource_charset", value = "UTF-8"),
				@Parameter(name = "ignoreCase", value = "true")
		})
})
@Entity
public class Tweet {
	
	private String id;
	private String message = "";
	private String sender = "";
	private long timestamp = 0L;
	
	public Tweet() {}
	
	public Tweet(String message, String sender, long timestamp) {
		this.message = message;
		this.sender = sender;
		this.timestamp = timestamp;
	}

	@Id
	@GeneratedValue(generator = "uuid")
	@GenericGenerator(name = "uuid", strategy = "uuid2")
	public String getId() { return id; }
	public void setId(String id) { this.id = id; }

	@Field
	public String getMessage() { return message; }
	public void setMessage(String message) { this.message = message; }

	@Field(index=Index.UN_TOKENIZED)
	public String getSender() { return sender; }
	public void setSender(String sender) { this.sender = sender; }

	@Field
	@NumericField
	public long getTimestamp() { return timestamp; }
	public void setTimestamp(long timestamp) { this.timestamp = timestamp; }

}

Note the uuid generator for the identifier: that's currently the most efficient one to use in a distributed environment. On top of the standard @Entity, @Indexed enables the Lucene indexing engine; @AnalyzerDef and @Analyzer specify the text cleanup we want to apply to the indexed tweets; @Field selects the properties to be indexed; and @NumericField makes sure the numeric sort is performed efficiently, treating the indexed value as a number rather than as just another keyword. Always remember that Lucene is focused on natural language matching.
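As a small aside on why it matters whether a value is treated as a number or as a keyword, the following plain-Java sketch compares the two orderings. It only illustrates the general difference between text and numeric comparison; it says nothing about how Lucene encodes numeric fields internally, and the class name `NumericVersusKeywordSketch` is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NumericVersusKeywordSketch {

    // Values compared as text: ordering is character by character.
    public static List<String> sortedAsKeywords(List<Long> values) {
        List<String> keywords = new ArrayList<>();
        for (Long value : values) {
            keywords.add(String.valueOf(value));
        }
        Collections.sort(keywords);
        return keywords;
    }

    // Values compared as numbers: ordering follows their magnitude.
    public static List<Long> sortedAsNumbers(List<Long> values) {
        List<Long> numbers = new ArrayList<>(values);
        Collections.sort(numbers);
        return numbers;
    }

    public static void main(String[] args) {
        List<Long> timestamps = Arrays.asList(9L, 100L, 10L);
        System.out.println(sortedAsKeywords(timestamps)); // [10, 100, 9]
        System.out.println(sortedAsNumbers(timestamps));  // [9, 10, 100]
    }
}
```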

Example queries

This is going to look a bit verbose, as I'm expanding all functions for clarity:

public List<Tweet> allTweetsSortedByTime() {

	//this is needed only once but we want to show it in this context:
	QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity( Tweet.class ).get();

	//Define a Lucene query which is going to return all tweets:
	Query query = queryBuilder.all().createQuery();

	//Make a JPA Query out of it:
	FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery( query );

	//Currently needed to have Hibernate Search work with OGM:
	fullTextQuery.initializeObjectsWith( ObjectLookupMethod.PERSISTENCE_CONTEXT, DatabaseRetrievalMethod.FIND_BY_ID );

	//Specify the desired sort:
	fullTextQuery.setSort( new Sort( new SortField( "timestamp", SortField.LONG ) ) );

	//Run the query (or alternatively open a scrollable result):
	return fullTextQuery.getResultList();
}

Download it

To see the full example, I pushed a complete Maven project to GitHub. It includes a test of all queries and all project details needed to get it running, such as the Infinispan and JGroups configurations and the persistence.xml needed to enable Hibernate OGM.

Please clone it to start playing with OGM: https://github.com/Sanne/tweets-ogm

And see you on IRC, the Hibernate Search forums, or the brand new Hibernate OGM forums for any clarification.

How are entities persisted in the grid?

Emmanuel is going to blog about that soon, keep an eye on the blog! Update: blog on OGM published

Hibernate Search 3.4.0.CR2

Tagged as Hibernate Search, Infinispan

We decided to insert another candidate release in the roadmap for two improvements which were too good to leave out:

  • Lucene 3.1
  • Smart object graph analysis to skip expensive operations

As usual download links are here, as are instructions for Maven users. In case you spot some issue, the issue tracker didn't move either, or use the forums for questions.

using Apache Lucene 3.1

Lucene 3.1 was finally released. We had been waiting for it for a long time, so in just a week we were able to provide you with a compatible version of Hibernate Search.

As seems to be usual with Lucene, many APIs changed. The good news is that Hibernate Search appears to have shielded users from all breaking changes: code-wise, it's a drop-in replacement for previous versions.

Some things to consider during the migration:

  • It's possible that some Analyzers from Lucene and Solr extensions were moved around to other jars, but if you're depending on hibernate-search-analyzers via Maven, again it looks like you shouldn't need to change anything.
  • The max_field_length option is not meaningful anymore, see the docs on how to implement something similar if needed.
  • Hibernate Search 3.4.0.CR2 actually requires Lucene 3.1

more performance optimizations

Besides the nice boost inherited from the updated Lucene, our internal engine also got smarter.

It now figures out which work can be skipped in the object graph, and is much better at reacting to collection update events. See HSEARCH-679 for the hairy details, and many thanks to Tom Waterhouse for a complex functional test and the hard work of convincing me of the importance of this improvement.

Infinispan integration

There are several interactions between Hibernate Search and Infinispan. Beyond the most obvious usage of Infinispan as a second level cache, you can also:

Cluster indexes via Lucene

Nothing changed in our code; this is just a reminder that it's possible to replicate or distribute the Lucene indexes on an Infinispan grid, and that this is compatible with both Infinispan 4.2.1.FINAL and 5.0.0.BETA1.

Infinispan Query

In Infinispan 5 the query engine is Hibernate Search itself: the integration just got much better, making it easier to use and exposing all features from latest Search versions, including for example Faceting and clustering via Infinispan itself. More improvements coming, especially documentation.

join us for JUDCon!

I'm going to talk about this integration at JUDCon 2011 in Boston, May 2-3, during the talk Advanced Queries on the Infinispan Data Grid. See you there!
