Visualizing data structures is not easy, and I'm confident that a great deal of the success of the exceptionally well-received demo we presented at the JBoss World 2011 keynote originated from the nice web UIs projected on the multiple big screens. These web applications were effectively visualizing the tweets flowing, the voted hashtags highlighted in the tag cloud, and the animated Infinispan grid while the nodes were dancing on an ideal hash wheel visualizing the data distribution among the nodes.

So I bet that everybody in the room got a clear picture of the fact that the data was stored in Infinispan, and by unplugging a random server live everybody could see the data reorganize itself, making it seem a simple and natural way to process huge amounts of data. Not all technical details were explained, so in this and the following post we're going to detail what you did not see: how was the data stored, how could Drools filter the data, and how could all visualizations load the grid-stored data and still be developed in record time?

JPA on a Grid

All those applications were simply using JPA, the Java Persistence API. Think about the name: it's clearly meant to address all persistence needs of Java applications. While it's traditionally associated with JDBC databases, we have just shown it's not necessarily limited to them: our applications were running an early preview of Hibernate OGM, the Hibernate Object/Grid Mapper, and the simple JPA model was mapped to Infinispan, a key/value store.

Collecting requirements

The initial plan didn't include Hibernate OGM, as it was still very experimental and had never been released or even tagged, but it was clear that we wanted to use Infinispan to store and to search the tweets. Kevin Conner was the architect envisioning the technical needs; he managed to push each involved developer to do their part and finally assembled it all into a working application in record time. He came to Emmanuel and me with a simple list of requirements:

we want to showcase Infinispan
we want to store Tweets, many of them, in real time coming in from a live stream from Twitter
we need to be able to iterate them all in time order, to roll back the stream and re-process it (as you can see in the demo recording, we had a fake cheater and wanted to apply stricter rules to filter out invalid tweets at a second stage, without losing the originally collected tweets).
we need to know which projects are voted the most: people are going to express preferences via hashtags in their tweets
we want to know who's voting the most
it has to perform well, potentially on a big set of data

Using Lucene

So, to implement these search requirements, you have to consider that since Infinispan is a key/value store, performing queries is not as natural as it would be on a database. Infinispan currently provides two main approaches: using the Query module or defining some simple Map/Reduce tasks.

Also, consider those requirements. Using SQL, how were we going to count all tweets containing a specific hashtag, extract this count for all collected hashtags, and sort them by frequency? On a relational database, that would have been a very inefficient query involving at least a full table scan, possibly a scan per hashtag, and it would have required knowing in advance the list of hashtags to look for. We wanted to extract the most frequently mentioned tags, but we didn't really know what to look for, as people were free to vote for anything.

A totally different approach is to use an inverted index: every time you save a tweet, you tokenize it, extract all terms, keep a list of terms with pointers to the containing tweets, and store the frequency as well. That's exactly how full-text search engines like Lucene work; in addition, Lucene applies state-of-the-art optimizations, caches and filtering capabilities. Both our Infinispan Query and Hibernate Search provide nice and easy integrations with Lucene (they actually share the same engine, one geared towards Infinispan users and one to Hibernate and JPA users).
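To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It is only a toy illustration and not how Lucene is implemented: the class name InvertedIndexSketch and its methods are made up for this example, and tokenization is reduced to splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy example: an in-memory inverted index, not Lucene's implementation.
class InvertedIndexSketch {

    // term -> ids of the tweets containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    // Tokenizes on whitespace, lowercases each token, and records which tweet contains it.
    int add(String tweet) {
        int id = nextId++;
        for (String token : tweet.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new LinkedHashSet<>()).add(id);
            }
        }
        return id;
    }

    // Number of tweets containing the term: a lookup, no scan over the tweets.
    int frequency(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? 0 : ids.size();
    }

    // Hashtags ordered by descending frequency; ties are broken alphabetically.
    List<String> topHashtags(int limit) {
        List<String> tags = new ArrayList<>();
        for (String term : postings.keySet()) {
            if (term.startsWith("#")) {
                tags.add(term);
            }
        }
        tags.sort(Comparator.comparingInt((String t) -> -frequency(t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(tags.subList(0, Math.min(limit, tags.size())));
    }
}
```

Note that the list of hashtags never has to be known in advance: it falls out of the index keys, which is the property we were missing in the SQL approach.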

Counting who voted the most is a problem technically comparable to counting term frequencies, so again Lucene would be perfect. Sorting all data on a timestamp is not by itself a good reason to introduce Lucene, but it's able to do that pretty well too, so Lucene would indeed solve all query needs for this application.
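As a rough illustration of why the two problems are comparable, the sketch below treats each sender name as one single term and counts its occurrences. It is plain Java with hypothetical names (TopVotersSketch, rankSenders) and has nothing to do with the actual Lucene data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy example: ranking senders by number of tweets.
class TopVotersSketch {

    // Each sender name is handled as one whole term and counted.
    static List<Map.Entry<String, Integer>> rankSenders(List<String> senders) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sender : senders) {
            counts.merge(sender, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken by sender name.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }
}
```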

Hibernate OGM with Hibernate Search

So Infinispan Query could have been a good choice. But we opted for Hibernate OGM with Hibernate Search, as together they provide the same indexing features with a nice JPA interface on top. I have to admit that Hibernate OGM was initially discarded as it was lacking an HQL query parser: my fault, being late with implementing it. In this case it was not a problem, as all the queries we needed were better solved using full-text queries, which are not defined via HQL.

Model

So what does our model look like? It's very simple: a single JPA entity, enhanced with some Hibernate Search annotations.

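// Imports assume the Hibernate Search 3.x API and the Solr analysis factories it used at the time
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.apache.solr.analysis.ASCIIFoldingFilterFactory;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.apache.solr.analysis.StopFilterFactory;
import org.hibernate.annotations.GenericGenerator;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.NumericField;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@Indexed(index = "tweets")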
@Analyzer(definition = "english")
@AnalyzerDef(name = "english",
	tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
	filters = {
		@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
		@TokenFilterDef(factory = LowerCaseFilterFactory.class),
		@TokenFilterDef(factory = StopFilterFactory.class, params = {
				@Parameter(name = "words", value = "stoplist.properties"),
				@Parameter(name = "resource_charset", value = "UTF-8"),
				@Parameter(name = "ignoreCase", value = "true")
		})
})
@Entity
public class Tweet {
	
	private String id;
	private String message = "";
	private String sender = "";
	private long timestamp = 0L;
	
	public Tweet() {}
	
	public Tweet(String message, String sender, long timestamp) {
		this.message = message;
		this.sender = sender;
		this.timestamp = timestamp;
	}

	@Id
	@GeneratedValue(generator = "uuid")
	@GenericGenerator(name = "uuid", strategy = "uuid2")
	public String getId() { return id; }
	public void setId(String id) { this.id = id; }

	@Field
	public String getMessage() { return message; }
	public void setMessage(String message) { this.message = message; }

	@Field(index=Index.UN_TOKENIZED)
	public String getSender() { return sender; }
	public void setSender(String sender) { this.sender = sender; }

	@Field
	@NumericField
	public long getTimestamp() { return timestamp; }
	public void setTimestamp(long timestamp) { this.timestamp = timestamp; }

}

Note the uuid generator for the identifier: that's currently the most efficient one to use in a distributed environment. On top of the standard @Entity, @Indexed enables the Lucene indexing engine, @AnalyzerDef and @Analyzer specify the text cleanup we want to apply to the indexed tweets, @Field selects the properties to be indexed, and @NumericField makes sure the numeric sort is performed efficiently by treating the indexed value as a number and not as an additional keyword: always remember that Lucene is focused on natural language matching.
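To give an idea of what the text cleanup defined by the analyzer amounts to, here is a rough sketch in plain Java that applies the same steps in the same order: tokenize, fold accented characters to ASCII, lowercase, drop stop words. It only approximates the behaviour of the real tokenizer and filters; the class name AnalyzerChainSketch is made up, and the stop word set is a stand-in, since the content of stoplist.properties is not shown in this post.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: mimics the order of the analyzer steps, not the Lucene/Solr code.
class AnalyzerChainSketch {

    // Assumed stand-in for the entries of stoplist.properties.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "is", "to"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // 1. Tokenize: split on anything that is not a letter or a digit (a crude approximation).
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            // 2. ASCII folding: decompose accented characters, then drop the combining marks.
            String folded = Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
            // 3. Lowercase.
            String lower = folded.toLowerCase(Locale.ROOT);
            // 4. Stop word removal.
            if (!STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }
}
```

With this sketch, analyze("The Caf\u00e9 is OPEN to everybody") yields the terms cafe, open and everybody, so variants of the same word end up as the same indexed term.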

Example queries

This is going to look a bit verbose as I'm expanding all functions for clarity:

public List<Tweet> allTweetsSortedByTime() {

	//this is needed only once but we want to show it in this context:
	QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity( Tweet.class ).get();

	//Define a Lucene query which is going to return all tweets:
	Query query = queryBuilder.all().createQuery();

	//Make a JPA Query out of it:
	FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery( query );

	//Currently needed to have Hibernate Search work with OGM:
	fullTextQuery.initializeObjectsWith( ObjectLookupMethod.PERSISTENCE_CONTEXT, DatabaseRetrievalMethod.FIND_BY_ID );

	//Specify the desired sort:
	fullTextQuery.setSort( new Sort( new SortField( "timestamp", SortField.LONG ) ) );

	//Run the query (or alternatively open a scrollable result):
	return fullTextQuery.getResultList();
}

Download it

To see the full example, I pushed a complete Maven project to GitHub. It includes a test of all queries and all project details needed to get it running, such as the Infinispan and JGroups configurations and the persistence.xml needed to enable Hibernate OGM.

Please clone it to start playing with OGM: https://github.com/Sanne/tweets-ogm

And see you on IRC, the Hibernate Search forums, or the brand new Hibernate OGM forums for any clarification.

How are entities persisted in the grid?

Emmanuel is going to blog about that soon, keep an eye on the blog! Update: blog on OGM published

In Relation To

What you did not see at the JBoss World 2011 keynote demo

Hibernate OGM is not maintained anymore