In Relation To Sanne Grinovero
Visualizing data structures is not easy, and I'm confident that a great deal of the success of the exceptionally well received demo we presented at the JBoss World 2011 keynote came from the nice web UIs projected on the multiple big screens. These web applications visualized the tweets flowing in, the voted hashtags highlighted in the tag cloud, and the animated Infinispan grid, with the nodes dancing on an ideal hash wheel that showed how the data was distributed among them.

So I bet everybody in the room understood that the data was stored in Infinispan, and by live-unplugging a random server everybody could see the data reorganize itself, which made handling huge amounts of data look simple and natural. Not all technical details were explained, so in this and the following post we're going to detail what you did not see: how was the data stored, how could Drools filter it, and how could all the visualizations load the grid-stored data, while still being developed in record time?

JPA on a Grid

All those applications were simply using JPA: the Java Persistence API. Think about the name: it's clearly meant to address all persistence needs of Java applications. While it's traditionally associated with JDBC databases, we have just shown it's not necessarily limited to those: our applications were running an early preview of the Hibernate Object/Grid Mapper, Hibernate OGM, and the simple JPA model was mapped to Infinispan, a key/value store.
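To give an idea of how little this changes for application code, bootstrapping looks like plain JPA. A minimal sketch, where "tweets-ogm" is a hypothetical persistence unit name whose persistence.xml declares Hibernate OGM as the provider, and Tweet is the entity shown later in this post (transaction setup details depend on your environment):

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

//"tweets-ogm" is a made-up persistence unit name: its persistence.xml
//would point to Hibernate OGM instead of a JDBC database.
EntityManagerFactory emf = Persistence.createEntityManagerFactory( "tweets-ogm" );
EntityManager em = emf.createEntityManager();

em.getTransaction().begin();
em.persist( new Tweet( "Hello Infinispan!", "sanne", System.currentTimeMillis() ) );
em.getTransaction().commit();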

Collecting requirements

The initial plan didn't include Hibernate OGM, as it was still very experimental: it had never been released, nor even tagged. But it was clear that we wanted to use Infinispan, both to store and to search the tweets. Kevin Conner was the architect envisioning the technical needs; he pushed each involved developer to do their part and finally assembled it all into a working application in record time. So he came to Emmanuel and me with a simple list of requirements:

  • we want to showcase Infinispan
  • we want to store Tweets, many of them, arriving in real time from a live Twitter stream
  • we need to be able to iterate over them all in time order, to roll back the stream and re-process it (as you can see in the demo recording, we had a fake cheater and wanted to apply stricter rules to filter out invalid tweets at a second stage, without losing the originally collected tweets).
  • we need to know which projects are voted the most: people are going to express preferences via hashtags in their tweets
  • we want to know who's voting the most
  • it has to perform well, potentially on a big set of data

Using Lucene

So, to implement these search requirements, consider that since Infinispan is a key/value store, performing queries is not as natural as it would be on a database. Infinispan currently provides two main approaches: using the Query module, or defining some simple Map/Reduce tasks.

Also, consider those requirements. Using SQL, how were we going to count all tweets containing a specific hashtag, extract this count for every collected hashtag, and sort the hashtags by frequency? On a relational database that would have been a very inefficient query, involving at least a full table scan (possibly one scan per hashtag), and it would have required knowing the list of hashtags to look for in advance. We wanted to extract the most frequently mentioned tags, but we didn't really know what to look for, as people were free to vote for anything.

A totally different approach is to use an inverted index: every time you save a tweet you tokenize it, extract all terms, and keep a list of terms with pointers to the containing tweets, storing the frequency as well. That's exactly how full-text search engines like Lucene work; in addition, Lucene is able to apply state-of-the-art optimizations, caches and filtering capabilities. Both Infinispan Query and Hibernate Search provide nice and easy integrations with Lucene (they actually share the same engine, one geared towards Infinispan users and one towards Hibernate and JPA users).
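To make the idea concrete, here is a toy sketch of an inverted index in plain Java. This is not how Lucene implements it, just the concept: each term maps to the ids of the tweets containing it, so counting how often a hashtag was mentioned becomes a single map lookup instead of a table scan.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyInvertedIndex {

	//term -> ids of the tweets containing that term
	private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

	public void index(String tweetId, String message) {
		//naive tokenization; a real analyzer also folds accents and removes stopwords
		for ( String term : message.toLowerCase().split( "\\s+" ) ) {
			Set<String> tweetIds = postings.get( term );
			if ( tweetIds == null ) {
				tweetIds = new TreeSet<String>();
				postings.put( term, tweetIds );
			}
			tweetIds.add( tweetId );
		}
	}

	//in how many tweets does this term appear? no scan needed: one lookup
	public int frequency(String term) {
		Set<String> tweetIds = postings.get( term.toLowerCase() );
		return tweetIds == null ? 0 : tweetIds.size();
	}
}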

Counting who voted the most is technically comparable to counting term frequencies, so again Lucene would be perfect. Sorting all data on a timestamp is not by itself a good reason to introduce Lucene, but it handles that pretty well too, so Lucene would indeed solve all the query needs of this application.

Hibernate OGM with Hibernate Search

So Infinispan Query could have been a good choice. But we opted for Hibernate OGM with Hibernate Search, as the pair provides the same indexing features but exposes them through a nice JPA interface. I have to admit that Hibernate OGM was initially discarded because it lacked an HQL query parser: my fault, as I was late implementing it. In this case though it was not a problem, since all the queries we needed were better expressed as full-text queries, which are not defined via HQL.

Model

So how does our model look? Very simple: it's a single JPA entity, enhanced with some Hibernate Search annotations.

@Indexed(index = "tweets")
@Analyzer(definition = "english")
@AnalyzerDef(name = "english",
	tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
	filters = {
		@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
		@TokenFilterDef(factory = LowerCaseFilterFactory.class),
		@TokenFilterDef(factory = StopFilterFactory.class, params = {
				@Parameter(name = "words", value = "stoplist.properties"),
				@Parameter(name = "resource_charset", value = "UTF-8"),
				@Parameter(name = "ignoreCase", value = "true")
		})
})
@Entity
public class Tweet {
	
	private String id;
	private String message = "";
	private String sender = "";
	private long timestamp = 0L;
	
	public Tweet() {}
	
	public Tweet(String message, String sender, long timestamp) {
		this.message = message;
		this.sender = sender;
		this.timestamp = timestamp;
	}

	@Id
	@GeneratedValue(generator = "uuid")
	@GenericGenerator(name = "uuid", strategy = "uuid2")
	public String getId() { return id; }
	public void setId(String id) { this.id = id; }

	@Field
	public String getMessage() { return message; }
	public void setMessage(String message) { this.message = message; }

	@Field(index=Index.UN_TOKENIZED)
	public String getSender() { return sender; }
	public void setSender(String sender) { this.sender = sender; }

	@Field
	@NumericField
	public long getTimestamp() { return timestamp; }
	public void setTimestamp(long timestamp) { this.timestamp = timestamp; }

}

Note the uuid generator for the identifier: that's currently the most efficient one to use in a distributed environment. On top of the standard @Entity, @Indexed enables the Lucene indexing engine, the @AnalyzerDef and @Analyzer definitions specify the text cleanup we want to apply to the indexed tweets, @Field selects the properties to be indexed, and @NumericField makes sure numeric sorting will be performed efficiently, treating the indexed value as a real number rather than as an additional keyword: always remember that Lucene is focused on natural language matching.
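The queries in the next section run through a FullTextEntityManager: the Hibernate Search extension of the standard JPA EntityManager. As a reminder, a minimal sketch of how to obtain one (how the EntityManager itself is created is up to your environment):

import javax.persistence.EntityManager;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;

EntityManager entityManager = ...; //injected, or created from an EntityManagerFactory
FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager( entityManager );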

Example queries

This is going to look a bit verbose as I'm expanding all steps for clarity:

public List<Tweet> allTweetsSortedByTime() {

	//this is needed only once but we want to show it in this context:
	QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity( Tweet.class ).get();

	//Define a Lucene query which is going to return all tweets:
	Query query = queryBuilder.all().createQuery();

	//Make a JPA Query out of it:
	FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery( query );

	//Currently needed to have Hibernate Search work with OGM:
	fullTextQuery.initializeObjectsWith( ObjectLookupMethod.PERSISTENCE_CONTEXT, DatabaseRetrievalMethod.FIND_BY_ID );

	//Specify the desired sort:
	fullTextQuery.setSort( new Sort( new SortField( "timestamp", SortField.LONG ) ) );

	//Run the query (or alternatively open a scrollable result):
	return fullTextQuery.getResultList();
}
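To also give an idea of how the vote counting maps to full-text queries: a keyword query on the message field, combined with FullTextQuery.getResultSize(), returns the number of matching tweets without loading them. A minimal sketch, where the hashtag parameter is of course just an example value:

public int countVotesFor(String hashtag) {

	QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity( Tweet.class ).get();

	//Match all tweets mentioning the hashtag in their message:
	Query query = queryBuilder.keyword().onField( "message" ).matching( hashtag ).createQuery();

	FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery( query );

	//Count the matches without loading the entities:
	return fullTextQuery.getResultSize();
}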

Download it

To see the full example, I pushed a complete Maven project to GitHub. It includes a test of all queries and all the project details needed to get it running, such as the Infinispan and JGroups configurations and the persistence.xml needed to enable Hibernate OGM.

Please clone it to start playing with OGM: https://github.com/Sanne/tweets-ogm

And see you on IRC, the Hibernate Search forums, or the brand new Hibernate OGM forums for any clarification.

How are entities persisted in the grid?

Emmanuel is going to blog about that soon, keep an eye on the blog! Update: the blog post on OGM is now published.

Hibernate Search 3.4.0.CR2

Tagged as Hibernate Search, Infinispan

We decided to insert another candidate release in the roadmap for two improvements which were too good to leave out:

  • Lucene 3.1
  • Smart object graph analysis to skip expensive operations

As usual, the download links are here, as are the instructions for Maven users. In case you spot an issue, the issue tracker didn't move either; or use the forums for questions.

using Apache Lucene 3.1

It's finally released; we had been waiting for it so eagerly that in just a week we were able to provide you with a compatible version of Hibernate Search.

As seems to be the usual business with Lucene, many APIs changed. The good news is that Hibernate Search was able to shield users from all breaking changes: code-wise, it's a drop-in replacement for previous versions.

Some things to consider during the migration:

  • It's possible that some Analyzers from the Lucene and Solr extensions were moved to other jars, but if you're depending on hibernate-search-analyzers via Maven, again you shouldn't need to change anything.
  • The max_field_length option is not meaningful anymore, see the docs on how to implement something similar if needed.
  • Hibernate Search 3.4.0.CR2 actually requires Lucene 3.1

more performance optimizations

Besides the nice boost inherited from the updated Lucene, our internal engine also got smarter.

It now figures out which work can be skipped in the object graph, reacting much more efficiently to collection update events. See HSEARCH-679 for the hairy details, and many thanks to Tom Waterhouse for a complex functional test and for the hard work of convincing me of the importance of this improvement.

Infinispan integration

There are several integration points between Hibernate Search and Infinispan; beyond the most obvious usage of Infinispan as a second level cache, you can also:

Cluster indexes via Lucene

Nothing changed in our code; just a reminder that it's possible to replicate or distribute the Lucene indexes on an Infinispan grid, compatible with both Infinispan 4.2.1.FINAL and 5.0.0.BETA1.

Infinispan Query

In Infinispan 5 the query engine is Hibernate Search itself: the integration just got much better, making it easier to use and exposing all the features of the latest Search versions, including for example faceting, and clustering via Infinispan itself. More improvements are coming, especially to the documentation.

join us for JUDCon!

I'm going to talk about this integration at JUDCon 2011 in Boston, May 2-3, during the talk Advanced Queries on the Infinispan Data Grid; see you there!

Hibernate Search reached 3.4.0.CR1

Tagged as Hibernate Search

One week after 3.4.0.Beta1, two weeks after 3.4.0.Alpha1, we're on a run for 3.4.0.Final!

So this is your last chance to report issues on JIRA before the final release, which is of course planned for next week.

Download it from SourceForge or via Maven from the JBoss repository, and discuss it on the forums.

Changes

As all good candidate releases should, this one contains no big changes since the previous beta: we just made it a bit easier to embed the Hibernate Search engine into other frameworks, as Infinispan does with its Query API.

Infinispan Query

Since its first releases, Infinispan has provided a Query module which reuses the Hibernate Search engine. Infinispan 4 provided a technology preview; for version 5 we're polishing the API and making sure it's simpler to set up and use. Expect some API updates in the next Alpha of Infinispan 5, and be aware that you can reuse all Hibernate Search annotations and extensions!
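To give a feeling for the API, querying an indexing-enabled cache looks roughly like the following sketch; it targets the in-progress Infinispan 5 API, so details may still change, and the Book entity with its title field is a made-up example mapped with the usual Hibernate Search annotations:

import java.util.List;
import org.apache.lucene.search.Query;
import org.infinispan.Cache;
import org.infinispan.query.CacheQuery;
import org.infinispan.query.Search;
import org.infinispan.query.SearchManager;

Cache<String, Book> cache = ...; //an Infinispan cache with indexing enabled
SearchManager searchManager = Search.getSearchManager( cache );

//The familiar Hibernate Search query DSL is available here too:
Query luceneQuery = searchManager.buildQueryBuilderForClass( Book.class ).get()
		.keyword().onField( "title" ).matching( "infinispan" ).createQuery();

CacheQuery cacheQuery = searchManager.getQuery( luceneQuery, Book.class );
List<Object> results = cacheQuery.list();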

Hibernate Search 3.4.0.Alpha1

We did a big refactoring of the query engine, hence the alpha tag. If you could focus your tests on this area and see if there are any issues, we would be forever grateful.

This release comes with the usual mix of bug fixes, optimizations and new features.

Query engine refactoring

This was a biggie, and the objective is that it does not affect you :) We have a medium term goal to make Hibernate Search not depend on Hibernate Core; extracting the query engine core into well defined SPIs (Service Provider Interfaces) is a milestone towards this. The first beneficiary is the Infinispan query module, which will become much more resilient to Hibernate Search version changes.

As said earlier, test your application's queries. They should still run without any API change; let us know otherwise.

Update index only when an indexed property changes

With previous versions, when an @Indexed entity changes in Hibernate Core, it is reindexed by Hibernate Search even if none of the indexed properties have effectively changed (after all, not all properties are indexed). Hibernate Search 3.4 is smarter in this area and tries not to reindex entities whose indexed properties are unchanged. In some situations, like dynamic boosting or class-level bridges, Hibernate Search cannot be certain and always reindexes to be safe.

This optimization should speed things up quite significantly for some applications. Check the documentation for more information.
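For example, with an entity shaped like the following sketch (a made-up example), an update touching only viewCount no longer triggers any index operation, since no indexed property changed:

@Entity
@Indexed
public class BlogPost {

	@Id @GeneratedValue
	Long id;

	@Field
	String title; //indexed: changing this triggers reindexing

	int viewCount; //not indexed: updates touching only this property skip reindexing

}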

Look for entities in the second level cache

In some environments, most if not all of the searched entities are in the second level cache. This is especially true when you use a distributed second level cache like Infinispan. You can ask Hibernate Search to look in the second level cache before trying to fetch data from the database.

FullTextQuery query = session.createFullTextQuery(luceneQuery, User.class);
query.initializeObjectsWith(
    ObjectLookupMethod.SECOND_LEVEL_CACHE,
    DatabaseRetrievalMethod.QUERY
);
List results = query.getResultList();

Of course your entities must be cacheable :) If your entities are most likely in the persistence context (the Hibernate session), you can use ObjectLookupMethod.PERSISTENCE_CONTEXT instead.

MassIndexer improvements

The mass indexer module has had a few improvements:

  • multithreading of the text analysis phase
  • improved monitoring of the mass indexer state and progress, by letting you plug in a custom implementation (see MassIndexerProgressMonitor and the sketch after this list)
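Plugging in a monitor is a one-liner on the MassIndexer. A minimal sketch, where MyConsoleProgressMonitor is a hypothetical class implementing the MassIndexerProgressMonitor interface:

MassIndexerProgressMonitor monitor = new MyConsoleProgressMonitor(); //your implementation
fullTextSession.createIndexer()
	.progressMonitor( monitor )
	.startAndWait();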

Faceting

A much requested and useful feature: you can now play with the first alpha preview of the new faceting engine API.
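To give a taste of it, a discrete faceting request built with the query DSL could look like the following sketch; the API is an alpha preview so names may still change, and Book with its category field is a made-up example:

QueryBuilder builder = fullTextSession.getSearchFactory().buildQueryBuilder().forEntity( Book.class ).get();

//Count matching documents for each distinct value of the category field:
FacetingRequest categoryFacets = builder.facet()
	.name( "categoryFacets" )
	.onField( "category" )
	.discrete()
	.orderedBy( FacetSortOrder.COUNT_DESC )
	.includeZeroCounts( false )
	.createFacetingRequest();

FullTextQuery query = fullTextSession.createFullTextQuery( builder.all().createQuery(), Book.class );
query.getFacetManager().enableFaceting( categoryFacets );
List<Facet> facets = query.getFacetManager().getFacets( "categoryFacets" );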

Field Caching

It's now possible to use Lucene's FieldCache to provide an extra boost to query performance: see the reference documentation.
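Enabling it is a matter of a single annotation on the entity; a sketch of the new option (check the reference documentation for the exact contract and its trade-offs):

@Entity
@Indexed
@CacheFromIndex( { FieldCacheType.CLASS, FieldCacheType.ID } )
public class Essay {
	//a regular indexed entity: class and id values of matching
	//documents are now read from Lucene's FieldCache at query time
}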

Other performance improvements

We found a couple of tricks to improve overall performance: we cache some metadata more aggressively to lower the reflection overhead, and we push some additional buttons in Lucene for you to reduce query time.

Give us feedback

Please give us feedback! We have an aggressive release schedule ahead, as we are planning the GA version for the end of this month.

Check out the new release on JBoss.org's Maven repository or download the distribution. You can also read the documentation.

If you find an issue, shoot.

Hibernate Search 3.2: fast index rebuild

Tagged as Hibernate Search

One of the points of using Hibernate Search is that your valued business data is stored in the database: a reliable, transactional and relational store. So while Hibernate Search keeps the index in sync with the database during regular use, on several occasions you'll need to rebuild the indexes from scratch:

  • Data existed before you introduced Hibernate Search
  • New features are developed and the index mapping changes
  • A database backup is restored
  • Batch changes are applied on the database
  • ...you get the point, this list could be very long

Evolving, user driven

An API to perform this operation has always existed in previous versions of Hibernate Search, but questions about how to make it faster weren't unusual on the forums. Keep in mind that rebuilding the whole index basically means loading all indexed entities from the database into the Java world to feed Lucene with data to index. I'm a user myself, and the code for the new MassIndexer API was tuned after field experience on several applications and with much involvement from the community.

QuickStart: MassIndexer API

Since version 3.0 the documentation has provided a recommended re-indexing routine; this method is still available, but version 3.2 adds a new API providing better performance. No configuration changes are required, just start it:

FullTextSession fullTextSession = ...
MassIndexer massIndexer = fullTextSession.createIndexer();
massIndexer.startAndWait();

The above code will block until all entities are reindexed. If you don't need to wait for it, use the asynchronous method:

fullTextSession.createIndexer().start();

Selecting the entities to rebuild the index for

You don't need to rebuild the index for all indexed entities; let's say you want to re-index only the DVDs:

fullTextSession.createIndexer( Dvd.class ).startAndWait();

This will include all of Dvd's subclasses, as all of Hibernate Search's APIs are polymorphic.

Index optimization and clearing

Since in Lucene's world an update is implemented as a remove followed by an add, before adding all entities to the index we need to remove them all from it. This operation is known as purgeAll in Hibernate Search. By default the index is purged and optimized at start, and optimized again at the end; you might opt for a different strategy, but keep in mind that by disabling the purge operation you could later find duplicates. The optimization after purge is applied to save some disk space, as recommended in Hibernate Search in Action.

fullTextSession.createIndexer()
   .purgeAllOnStart( true ) // true by default, highly recommended
   .optimizeAfterPurge( true ) // true is default, saves some disk space
   .optimizeOnFinish( true ) // true by default
   .start();

Faster, Faster!

A MassIndexer is very sensitive to tuning; some settings can make it orders of magnitude faster when set properly, and the right values depend on your specific mapping, environment, database, and even your data. To find out which settings you need to tune, you should be aware of some implementation details.

The MassIndexer uses a pipeline with different specialized threads working on it, so most processing is done concurrently. The following explains the process for a single entity type, but when you have more than one indexed type this is actually done in parallel jobs, one per entity type:

  1. A single thread named identifier-loader scrolls over the primary keys of the type. The number of threads for this stage is always one, so that a single transaction can define the set of keys to consider. A first hint: use simple keys and avoid complex types, as the loading of keys will always be serialized.
  2. The loaded primary keys are pushed to an id queue; there's one such queue for each root type to be indexed.
  3. At the second stage a thread pool called entity-loader loads batches of entities using the provided primary keys. You can tune the number of threads working on this task concurrently (threadsToLoadObjects(int)) and the number of entities fetched per iteration (batchSizeToLoadObjects(int)). While idle threads might be considered a waste, this is minor, and it's better to have some spare threads doing nothing than the opposite. Make sure you don't make it too hard for the database by requesting too much data: setting too big a batch size or way too many threads will also hurt; you will have to find the sweet spot. The queues work as buffers to mitigate the effect of performance highs and lows due to different data, so finding the sweet spot is not a quest for the exact value but for a reasonable value.
  4. The entity queue contains a stream of entities needing conversion into Lucene Documents; it's fed by the entity-loader threads and consumed by the document-creator threads.
  5. The document-creator threads convert the entities into Lucene Documents (applying your Search mapping and custom bridges, transforming data into text). It is important to understand that during this conversion some lazy objects might still be loaded from the database. So this step could be cheap or expensive: depending on your domain model and how you mapped it, there could be more round trips to the database, or none at all. Second level cache interactions might help or hurt in this phase.
  6. The document queue should be a constant stream of Documents to be added to the index. If this queue is mostly near-empty, it means you're producing the data more slowly than Lucene is able to analyse and write it to the index. If this queue is mostly full, it means you're producing Documents faster than Lucene is able to write them to the index. Always consider that Lucene is analysing the text during the write operation, so if this is slow it's not necessarily related to I/O limits: you could have expensive analysis. To find out, you'll need a profiler.
  7. The number of document indexer threads is also configurable, so in case of expensive analysis you can have more CPUs working on it.

The queues are blocking and bounded, so there's no danger in setting too many producer threads for any stage: if a queue fills up, the producers are put on hold until space is available. All thread pools have names assigned, so if you connect with a profiler or debugger the different stages can be promptly identified.

API for data load tuning

The following settings rebuild my personal reference database in 3 minutes, down from 6 hours before enabling these settings, or from 2 months before looking into any kind of Hibernate or Lucene tuning.

fullTextSession.createIndexer()
   .batchSizeToLoadObjects( 30 )
   .threadsForSubsequentFetching( 8 )
   .threadsToLoadObjects( 4 )
   .threadsForIndexWriter( 3 )
   .cacheMode(CacheMode.NORMAL) // defaults to CacheMode.IGNORE
   .startAndWait();

Caching

When some information is embedded in the index from entities having a low cardinality (and therefore a high cache hit ratio), for example when there's a ManyToOne relation to gender or country code, it might make sense to enable the cache, which is ignored by default. In most cases ignoring the cache gives the best performance, especially if you're using a distributed cache, where lookups would introduce unneeded network events.

Offline job

While normally all changes done by Hibernate Search are coupled to a transaction, the MassIndexer uses several transactions, and consistency is not guaranteed if you make changes to the data while it's running. The index will only contain the entities which existed in the database when the job started, and any update made to the index in this timeframe by other means might be lost. While nothing wrong will happen to the data in the database, the index might be left inconsistent if changes are made while the job is busy.

Performance checklist

After having parallelized the indexing code, there are some other bottlenecks to avoid:

  1. Check your database behavior: almost all databases provide a profiling tool which can give valuable information when run during a mass indexing
  2. Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend for database connections
  3. Having set EAGER loading on properties not needed by Hibernate Search will have them loaded anyway: avoid it
  4. Check for network latency
  5. The effect of tuning settings doesn't depend only on static information like the schema, mapping options and Lucene settings, but also on the contents of the data: don't assume five minutes of testing will highlight your normal behavior; collect metrics from real world scenarios. The queues are going to help as buffers for non-constant performance in the various stages.
  6. Don't forget to tune the IndexWriter: see the reference documentation and the sketch after this list; nothing changed in this area.
  7. If the indexing CPUs are not at 100% usage and the DB is not hitting its limits, you know you can improve the settings
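For reference, the IndexWriter is tuned via configuration properties rather than code; a sketch with example values only, using the property names from the reference documentation (the MassIndexer uses the batch variants):

//Typically set in hibernate.properties or persistence.xml; values are just examples to tune:
properties.setProperty( "hibernate.search.default.indexwriter.batch.max_buffered_docs", "64" );
properties.setProperty( "hibernate.search.default.indexwriter.batch.merge_factor", "20" );
properties.setProperty( "hibernate.search.default.indexwriter.batch.ram_buffer_size", "256" );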

Progress Monitoring

An API to plug in your own monitoring implementation is on the way; currently beta1 uses a logger to periodically print the status, so you can enable or disable your loggers to control it. Let us know what kind of monitoring you would like to have, or contribute one!
