Hibernate Search is a library that integrates Hibernate ORM with Apache Lucene or Elasticsearch by automatically indexing entities, enabling advanced search functionality: full-text, geospatial, aggregations and more. For more information, see Hibernate Search on hibernate.org.

Inspired by these questions on the Search forum and Stack Overflow, I decided to blog about different solutions to the problem, using only the tools available in Search right now (4.1.1.Final). But let's start with the problem.

The Problem

How can I define a custom analyzer for a field added in a custom class bridge? Let's look at an example. Below, the class Foo declares a custom class bridge, FooBridge. How can I specify a custom analyzer for the field added by this bridge?

@Entity
@Indexed
@ClassBridge(impl = FooBridge.class)
public static class Foo {
	@Id
	@GeneratedValue
	private Integer id;
}

Solution 1

The straightforward approach looks something like this:

@Entity
@Indexed
@ClassBridge(name = "myCustomField", impl = FooBridge.class, analyzer = @Analyzer(impl = MyCustomAnalyzer.class))
public static class Foo {
	@Id
	@GeneratedValue
	private Integer id;
}

This works fine, provided the field you are adding in FooBridge has the name myCustomField. If you are adding a field (or even multiple fields) with a different name, this approach no longer works. In Lucene, analyzers are specified per field, and fields are identified by name. Since we cannot tell from your @ClassBridge definition which fields you are adding, there is no way of registering and applying the right analyzers. See also the related issue HSEARCH-904.
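
To see why the field name matters, here is a minimal plain-Java sketch of a per-field analyzer lookup. It is not Hibernate Search or Lucene code: the class PerFieldAnalyzerRegistry is hypothetical, analyzers are represented by plain strings, and the fallback to a default analyzer for unregistered fields is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-field analyzer registration (not a Search or Lucene class).
final class PerFieldAnalyzerRegistry {

	private final Map<String, String> analyzersByField = new HashMap<String, String>();
	private final String defaultAnalyzer;

	PerFieldAnalyzerRegistry(String defaultAnalyzer) {
		this.defaultAnalyzer = defaultAnalyzer;
	}

	// Registers an analyzer under exactly one field name.
	void register(String fieldName, String analyzer) {
		analyzersByField.put( fieldName, analyzer );
	}

	// Assumption: fields without a registration fall back to the default analyzer.
	String analyzerFor(String fieldName) {
		String analyzer = analyzersByField.get( fieldName );
		return analyzer != null ? analyzer : defaultAnalyzer;
	}
}

final class PerFieldAnalyzerDemo {

	public static void main(String[] args) {
		PerFieldAnalyzerRegistry registry = new PerFieldAnalyzerRegistry( "DefaultAnalyzer" );
		// The only name known from @ClassBridge(name = "myCustomField", ...)
		registry.register( "myCustomField", "MyCustomAnalyzer" );

		System.out.println( registry.analyzerFor( "myCustomField" ) );
		System.out.println( registry.analyzerFor( "someOtherField" ) );
	}
}
```

Only the name given in the annotation is known up front, so a field that the bridge adds under any other name never picks up the custom analyzer.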

Solution 2

In the second solution you are managing the analyzers on your own. The relevant part is in the bridge implementation:

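import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.hibernate.search.SearchException;
import org.hibernate.search.bridge.FieldBridge;
import org.hibernate.search.bridge.LuceneOptions;

public class FooBridge implements FieldBridge {

	@Override
	public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
		// The empty value is a placeholder; the indexed tokens come from the token stream set below
		Field field = new Field(
				name,
				"",
				luceneOptions.getStore(),
				luceneOptions.getIndex(),
				luceneOptions.getTermVector()
		);
		try {
			String text = "whatever you want to index";
			// Instantiate the analyzer by hand, bypassing the analyzers configured in Search
			MyCustomAnalyzer analyzer = new MyCustomAnalyzer( );
			field.setTokenStream( analyzer.reusableTokenStream( name, new StringReader( text ) ) );
			document.add( field );
		}
		catch ( IOException e ) {
			// Do not swallow the failure silently
			throw new SearchException( "Unable to analyze field " + name, e );
		}
	}
}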

As you can see, you need to instantiate the analyzer yourself and then set the token stream on the field you want to add. This works, but it does not integrate with @AnalyzerDef, which is often used to define and reuse analyzers globally across your application. Let's have a look at a solution that does.

Solution 3

Let's start directly with the code:

	@Entity
	@Indexed
	@AnalyzerDefs({
			@AnalyzerDef(name = "analyzer1",
					tokenizer = @TokenizerDef(factory = MyFirstTokenizer.class),
					filters = {
							@TokenFilterDef(factory = MyFirstFilter.class)
					}),
			@AnalyzerDef(name = "analyzer2",
					tokenizer = @TokenizerDef(factory = MySecondTokenizer.class),
					filters = {
							@TokenFilterDef(factory = MySecondFilter.class)
					}),
			@AnalyzerDef(name = "analyzer3",
					tokenizer = @TokenizerDef(factory = MyThirdTokenizer.class),
					filters = {
							@TokenFilterDef(factory = MyThirdFilter.class)
					})
	})
	@ClassBridge(impl = FooBridge.class)
	@AnalyzerDiscriminator(impl = FooBridge.class)
	public static class Foo {
		@Id
		@GeneratedValue
		private Integer id;
	}

	public static class FooBridge implements Discriminator, FieldBridge {

		public static final String[] fieldNames = new String[] { "field1", "field2", "field3" };
		public static final String[] analyzerNames = new String[] { "analyzer1", "analyzer2", "analyzer3" };

		@Override
		public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
			for ( String fieldName : fieldNames ) {
				Fieldable field = new Field(
						fieldName,
						"Your text to analyze and index",
						luceneOptions.getStore(),
						luceneOptions.getIndex(),
						luceneOptions.getTermVector()
				);
				field.setBoost( luceneOptions.getBoost() );
				document.add( field );
			}
		}

		@Override
		public String getAnalyzerDefinitionName(Object value, Object entity, String field) {
			for ( int i = 0; i < fieldNames.length; i++ ) {
				if ( fieldNames[i].equals( field ) ) {
					return analyzerNames[i];
				}
			}
			return null;
		}
	}

A lot is going on here, and the example shows several useful features of Search. First, @AnalyzerDefs is used to declaratively define and build your analyzers. These analyzers are globally available under their given names and can be reused across the application (see also SearchFactory#getAnalyzer(String)). You build an analyzer by first specifying its tokenizer and then a list of filters to apply. Have a look at named analyzers in the online documentation for more information.

Next, the example uses @ClassBridge(impl = FooBridge.class) to define the custom class bridge. Nothing special there. Which fields you add in the implementation is up to you.

Last but not least, we have @AnalyzerDiscriminator(impl = FooBridge.class). This annotation is normally used for dynamic analyzer selection based on the entity state. However, it can easily be used in this context as well. To keep things simple, I let the field bridge implement the required Discriminator interface directly. Discriminator#getAnalyzerDefinitionName will now be called for each field being added to the index. You are also passed the entity itself and the value (in case the field bridge is defined on a property), but this is not important in this case. All that remains is to return the right analyzer name for the field name passed in.
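
The name lookup itself can also be kept in a map instead of two parallel arrays. Below is a minimal plain-Java sketch, assuming the field and analyzer names from the example above. The class FieldAnalyzerMapping is hypothetical and not part of Hibernate Search. Like the example, it returns null for a field it does not know.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a Search class): one place that ties field names to analyzer names.
final class FieldAnalyzerMapping {

	private static final Map<String, String> ANALYZER_BY_FIELD;

	static {
		// LinkedHashMap keeps the fields in declaration order
		Map<String, String> mapping = new LinkedHashMap<String, String>();
		mapping.put( "field1", "analyzer1" );
		mapping.put( "field2", "analyzer2" );
		mapping.put( "field3", "analyzer3" );
		ANALYZER_BY_FIELD = Collections.unmodifiableMap( mapping );
	}

	private FieldAnalyzerMapping() {
	}

	// Field names the bridge is expected to add to the document.
	static Set<String> fieldNames() {
		return ANALYZER_BY_FIELD.keySet();
	}

	// Analyzer definition name for the given field, or null if the field is unknown.
	static String analyzerFor(String field) {
		return ANALYZER_BY_FIELD.get( field );
	}
}
```

In FooBridge, set(...) could then iterate over FieldAnalyzerMapping.fieldNames(), and getAnalyzerDefinitionName could simply return FieldAnalyzerMapping.analyzerFor( field ), so the two lists can no longer drift apart.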

Solution 3 may seem longer, but it is a declarative approach and it lets you reuse analyzer configurations. It also shows that while the current API still has its shortcomings (e.g. HSEARCH-904), there are often workarounds using the tools already available.

Happy analyzing,

Hardy

