Hibernate Search custom stop words list - Java

I need to customize the stop words list used when searching by Document title.
I have the following mapping:
@Entity
@Indexed
@AnalyzerDef(
    name = "documentAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(
            factory = StopFilterFactory.class,
            params = {
                @Parameter(name = "words", value = "stoplist.properties"),
                @Parameter(name = "ignoreCase", value = "true")
            }
        )
    }
)
public class Document {
    ...
    @Field(analyzer = @Analyzer(definition = "documentAnalyzer"))
    private String title;
    ...
}
The stoplist.properties file is in the resources directory and contains stop words that differ from the StandardAnalyzer defaults.
But the search doesn't return any results when the query contains a stop word that is enabled by default but missing from my stoplist.properties file, e.g. the word "will".
What is wrong with the current configuration?
How can I make Hibernate Search use a custom stop words list?
I use hibernate-search-orm version 5.6.1.
Results are validated in an integration test with an index created on the fly:
@Before
public void setUpLuceneIndex() throws InterruptedException {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
    fullTextEntityManager.createIndexer().startAndWait();
}

Your configuration looks sane as far as I can see.
Did you reindex your entities after having changed the stop words configuration? You need that for the new configuration to be taken into account at index time.
If you did and it still does not work, try adding a breakpoint in the StopFilterFactory constructor and its inform method to see what's going on!
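One way to check without a debugger: Hibernate Search exposes the analyzer it built, so you can feed it a sample string and print the tokens it emits. A minimal sketch, assuming the definition name "documentAnalyzer" from your mapping (Analyzer, TokenStream and CharTermAttribute come from org.apache.lucene.analysis):
Analyzer analyzer = fullTextEntityManager.getSearchFactory().getAnalyzer("documentAnalyzer");
// Tokenize a sample title and print what would actually be indexed
try (TokenStream stream = analyzer.tokenStream("title", "this will be indexed")) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(term.toString());
    }
    stream.end();
}
If "will" shows up in the output, your custom list (which does not contain it) is in effect; if it gets dropped, the default stop words are still being used and stoplist.properties is not being picked up.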

Related

How to define custom analyzer to do global search with hibernate-search and elasticsearch

I have an implementation of hibernate-search-orm (5.9.0.Final) with hibernate-search-elasticsearch (5.9.0.Final).
I defined a custom analyzer on an entity (see below) and I indexed two entities:
id: "1"
title: "Médiatiques : récit et société"
abstract:...
id: "2"
title: "Mediatique Com'7"
abstract:...
The search works fine when I search on the title field:
"title:médiatique" => 2 results.
"title:mediatique" => 2 results.
My problem is when I do a global search, with or without accents:
search on "médiatique" => 1 result (id: 1)
search on "mediatique" => 1 result (id: 2)
Is there a way to resolve this?
Thanks.
Entity definition:
@Entity
@Table(name = "bibliographic")
@DynamicUpdate
@DynamicInsert
@Indexed(index = "bibliographic")
@FullTextFilterDefs({
    @FullTextFilterDef(name = "fieldsElasticsearchFilter",
        impl = FieldsElasticsearchFilter.class)
})
@AnalyzerDef(name = "customAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    })
@Analyzer(definition = "customAnalyzer")
public class BibliographicHibernate implements Bibliographic {
    ...
    @Column(name = "title", updatable = false)
    @Fields({
        @Field,
        @Field(name = "titleSort", analyze = Analyze.NO, store = Store.YES)
    })
    @SortableField(forField = "titleSort")
    private String title;
    ...
}
Search method:
FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
QueryBuilder qb = ftem.getSearchFactory().buildQueryBuilder().forEntity(Bibliographic.class).get();
QueryDescriptor q = ElasticsearchQueries.fromQueryString(queryString);
FullTextQuery query = ftem.createFullTextQuery(q, Bibliographic.class).setFirstResult(start).setMaxResults(rows);
if (filters != null) {
    filters.stream().map((filter) -> filter.split(":")).forEach((f) -> {
        query.enableFullTextFilter("fieldsElasticsearchFilter")
            .setParameter("field", f[0])
            .setParameter("value", f[1]);
    });
}
if (facetFields != null) {
    facetFields.stream().map((facet) -> facet.split(":")).forEach((f) -> {
        query.getFacetManager()
            .enableFaceting(qb.facet()
                .name(f[0])
                .onField(f[0])
                .discrete()
                .orderedBy(FacetSortOrder.COUNT_DESC)
                .includeZeroCounts(false)
                .maxFacetCount(10)
                .createFacetingRequest());
    });
}
List<Bibliographic> bibs = query.getResultList();
To be honest I'm more surprised document 1 would match at all, since there's a trailing "s" on "Médiatiques" and you don't use any stemmer.
You are in a special case here: you are using a query string and passing it directly to Elasticsearch (that's what ElasticsearchQueries.fromQueryString(queryString) does). Hibernate Search has very little impact on the query being run, it only impacts the indexed content and the Elasticsearch mapping here.
When you run a QueryString query on Elasticsearch and you don't specify any field, it uses all fields in the document. I wouldn't bet that the analyzer used when analyzing your query is the same analyzer that you defined on your "title" field. In particular, it may not be removing accents.
An alternative solution would be to build a simple query string query using the QueryBuilder. The syntax of queries is a bit more limited, but is generally enough for end users. The code would look like this:
FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
QueryBuilder qb = ftem.getSearchFactory().buildQueryBuilder().forEntity(Bibliographic.class).get();
Query q = qb.simpleQueryString()
.onFields("title", "abstract")
.matching(queryString)
.createQuery();
FullTextQuery query = ftem.createFullTextQuery(q, Bibliographic.class).setFirstResult(start).setMaxResults(rows);
Users would still be able to target specific fields, but only in the list you provided (which, by the way, is probably safer, otherwise they could target sort fields and so on, which you probably don't want to allow). By default, all the fields in that list would be targeted.
This may lead to the exact same result as the query string, but the advantage is, you can override the analyzer being used for the query. For instance:
FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
QueryBuilder qb = ftem.getSearchFactory().buildQueryBuilder().forEntity(Bibliographic.class)
.overridesForField("title", "customAnalyzer")
.overridesForField("abstract", "customAnalyzer")
.get();
Query q = qb.simpleQueryString()
.onFields("title", "abstract")
.matching(queryString)
.createQuery();
FullTextQuery query = ftem.createFullTextQuery(q, Bibliographic.class).setFirstResult(start).setMaxResults(rows);
... and this will use your analyzer when querying.
As an alternative, you can also use a more advanced JSON query by replacing ElasticsearchQueries.fromQueryString(queryString) with ElasticsearchQueries.fromJsonQuery(json). You will have to craft the JSON yourself, though, taking some precautions to avoid any injection from the user (use Gson to build the Json), and taking care to follow the Elasticsearch query syntax.
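For illustration, a sketch of that JSON route using Gson, reusing the field names from your mapping and the "customAnalyzer" name (both assumptions on my side); since Gson escapes the user input, there is no injection risk:
JsonObject simpleQueryString = new JsonObject();
simpleQueryString.addProperty("query", queryString); // raw user input, escaped by Gson
simpleQueryString.addProperty("analyzer", "customAnalyzer");
JsonArray fields = new JsonArray();
fields.add("title");
fields.add("abstract");
simpleQueryString.add("fields", fields);
JsonObject query = new JsonObject();
query.add("simple_query_string", simpleQueryString);
JsonObject root = new JsonObject();
root.add("query", query);
QueryDescriptor q = ElasticsearchQueries.fromJsonQuery(root.toString());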
You can find more information about simple query string queries in the official documentation.
Note: you may want to add FrenchMinimalStemFilterFactory to your list of token filters in your custom analyzer. It's not the cause of your problem, but once you manage to use your analyzer in search queries, you will very soon find it useful.
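As a sketch, the filter could be appended to your existing definition like this (FrenchMinimalStemFilterFactory is part of lucene-analyzers-common; test it against your data before adopting it):
@AnalyzerDef(name = "customAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        // Minimal French stemmer: mostly strips plural endings, so
        // "médiatiques" and "médiatique" produce the same token
        @TokenFilterDef(factory = FrenchMinimalStemFilterFactory.class)
    })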

alfresco buildonly indexer for searching the properties created on the fly

I am using the latest version of Alfresco, 5.1.
One of my requirements is to create properties (key/value) where the user enters the key as well as the value.
So I have done that like this:
Map<QName, Serializable> props = new HashMap<QName, Serializable>();
props.put(QName.createQName("customProp1"), "prop1");
props.put(QName.createQName("customProp2"), "prop2");
ChildAssociationRef associationRef = nodeService.createNode(nodeService.getRootNode(storeRef), ContentModel.ASSOC_CHILDREN, QName.createQName(GUID.generate()), ContentModel.TYPE_CMOBJECT, props);
Now what I want to do is search the nodes with these newly created properties. I was able to search a newly created property like this:
public List<NodeRef> findNodes() throws Exception {
    authenticate("admin", "admin");
    StoreRef storeRef = new StoreRef(StoreRef.PROTOCOL_WORKSPACE, "SpacesStore");
    List<NodeRef> nodeList = null;
    Map<QName, Serializable> props = new HashMap<QName, Serializable>();
    props.put(QName.createQName("customProp1"), "prop1");
    props.put(QName.createQName("customProp2"), "prop2");
    ChildAssociationRef associationRef = nodeService.createNode(nodeService.getRootNode(storeRef), ContentModel.ASSOC_CHILDREN, QName.createQName(GUID.generate()), ContentModel.TYPE_CMOBJECT, props);
    NodeRef nodeRef = associationRef.getChildRef();
    String query = "@cm\\:customProp1:\"prop1\"";
    SearchParameters sp = new SearchParameters();
    sp.addStore(storeRef);
    sp.setLanguage(SearchService.LANGUAGE_LUCENE);
    sp.setQuery(query);
    try {
        ResultSet results = serviceRegistry.getSearchService().query(sp);
        nodeList = new ArrayList<NodeRef>();
        for (ResultSetRow row : results) {
            nodeList.add(row.getNodeRef());
            System.out.println(row.getNodeRef());
        }
        System.out.println(nodeList.size());
    } catch (Exception e) {
        e.printStackTrace();
    }
    return nodeList;
}
The alfresco-global.properties indexer configuration is:
index.subsystem.name=buildonly
index.recovery.mode=AUTO
dir.keystore=${dir.root}/keystore
Now my questions are:
Is it possible to achieve the same using the solr4 indexer?
Or is there any way to use the buildonly indexer for a particular query?
In your query
String query = "@cm\\:customProp1:\"prop1\"";
remove cm: you are building the QName on the fly, so it does not come under the cm (ContentModel) namespace. Your query will then be
String query = "@\\:customProp1:\"prop1\"";
Hope this works for you.
First, double check if you're simply experiencing eventual consistency, as described below. If you are, and if this presents a problem for you, consider switching to CMIS queries while staying on SOLR.
http://docs.alfresco.com/5.1/concepts/solr-event-consistency.html
Other than this, check if the node has been indexed at all. If it has, take a closer look at how you build your query.
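If eventual consistency turns out to be the culprit, here is a hedged sketch of the CMIS route; since Alfresco 4.2 you can additionally force a transactional (database-backed) query, though only properties defined in a content model can be queried that way, so your on-the-fly residual properties may first need a proper model:
SearchParameters sp = new SearchParameters();
sp.addStore(storeRef);
sp.setLanguage(SearchService.LANGUAGE_CMIS_ALFRESCO);
// Run against the database instead of the eventually-consistent SOLR index
sp.setQueryConsistency(QueryConsistency.TRANSACTIONAL);
sp.setQuery("SELECT * FROM cmis:document WHERE cmis:name = 'prop1'");
ResultSet results = serviceRegistry.getSearchService().query(sp);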

how to search case insensitive in hibernate search using lucene query?

I am using two analyzers while indexing: StandardAnalyzer for some fields and WhitespaceAnalyzer for fields holding values with special characters, like c++. I am writing the query as:
QueryBuilder qb = fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Professional.class).get();
BooleanQuery booleanQuery = new BooleanQuery();
query = qb.keyword().wildcard().onField(fieldName).ignoreFieldBridge().matching(fieldValue + "*").createQuery();
booleanQuery.add(query, BooleanClause.Occur.MUST);
The above query treats results as case sensitive, so c++ and C++ return different results.
I want to achieve case-insensitive results. Since I am not using the same analyzer at indexing time as at search time, am I doing something wrong?
Please help; I have been stuck on this for a week.
Thanks in advance.
I had the same issue. I was using keyword().wildcard() for one field and found that a word not written in lowercase could not be found.
The solution was very simple: instead of implementing an Analyzer, I converted the search term to lower case before building the query. In your case it would look like this:
fieldValue = fieldValue.toLowerCase();
QueryBuilder qb = fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Professional.class).get();
BooleanQuery booleanQuery = new BooleanQuery();
query = qb.keyword().wildcard().onField(fieldName).ignoreFieldBridge().matching(fieldValue+"*").createQuery();
booleanQuery.add(query, BooleanClause.Occur.MUST);
You should use a custom analyzer and add LowerCaseFilter after the WhitespaceTokenizer. Like this:
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        // LowerCaseFilter (not "LowerCaseAnalyzer") wraps the tokenizer's output
        TokenStream filter = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, filter);
    }
};
As of Hibernate Search 5.10.3, the syntax for creating a custom Lucene analyzer has changed slightly:
public class CustomAuthorAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream filter = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, filter);
    }
}
Then, in order to use this analyzer on a custom field, we just need to specify it through the @Analyzer annotation:
@Analyzer(impl = CustomAuthorAnalyzer.class)
@Field(index = Index.YES, analyze = Analyze.YES, store = Store.YES)
private String author;
Hope that helps.
Alternatively, Lucene also provides a mechanism to easily compose a new custom analyzer:
public static Analyzer create() throws IOException {
    Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
        .withTokenizer(WhitespaceTokenizerFactory.class)
        .addTokenFilter(LowerCaseFilterFactory.class)
        .addTokenFilter(StopFilterFactory.class, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();
    return ana;
}
The builder class is pretty easy to use for creating composite analyzers.
More information here: https://lucene.apache.org/core/6_4_2/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html

Search users under organization in Liferay

I have to search users under a specific organization in Liferay. At present we have a search available with
UserLocalService.search()
which is based on the companyId. I was wondering if there is any other way, perhaps using DynamicQueryFactoryUtil, to fetch users with an organization filter.
The dynamic query looks good, but I found another way. We can pass the organization id using the params Map:
LinkedHashMap<String, Object> params = new LinkedHashMap<String, Object>();
params.put("usersOrgs", orgId);
List<User> searchResult = liferayUserLocalService.search(companyId, keyword, WorkflowConstants.STATUS_APPROVED, params, 0, -1, null);
which will filter the users based on organization.
Of course you can use DynamicQuery to achieve this.
This can be done in two phases:
Fetch the user ids associated with the given organization.
Use search criteria along with the ids received in the first phase.
So the code will look as follows:
// Fetch the user id list from the organization id
long[] organizationUserIds = UserLocalServiceUtil.getOrganizationUserIds(orgId);
DynamicQuery searchQuery = DynamicQueryFactoryUtil.forClass(User.class, UserLocalServiceUtil.class.getClassLoader());
Criterion searchCriteria = PropertyFactoryUtil.forName("companyId").eq(companyid);
// Add the organization's user ids to the Criterion
if (organizationUserIds.length != 0) {
    searchCriteria = RestrictionsFactoryUtil.and(RestrictionsFactoryUtil.in("userId", ArrayUtils.toObject(organizationUserIds)), searchCriteria);
}
if (!firstName.isEmpty()) {
    searchCriteria = RestrictionsFactoryUtil.or(RestrictionsFactoryUtil.eq("firstName", firstName), searchCriteria);
}
if (!middleName.isEmpty()) {
    searchCriteria = RestrictionsFactoryUtil.or(RestrictionsFactoryUtil.eq("middleName", middleName), searchCriteria);
}
if (!lastName.isEmpty()) {
    searchCriteria = RestrictionsFactoryUtil.or(RestrictionsFactoryUtil.eq("lastName", lastName), searchCriteria);
}
if (!screenName.isEmpty()) {
    searchCriteria = RestrictionsFactoryUtil.or(RestrictionsFactoryUtil.eq("screenName", screenName), searchCriteria);
}
searchQuery.add(searchCriteria);
UserLocalServiceUtil.dynamicQuery(searchQuery);
P.S. I haven't tested this code, but this is the way to do it.
I hope it helps you.

Lucene: Changing the default facet delimiter?

First post on this wonderful site!
My goal is to use hierarchical facets for searching an index using Lucene. However, my facets need to be delimited by a character other than '/' (in this case, '~'). Example:
Categories
Categories~Category1
Categories~Category2
I have created a class that implements the FacetIndexingParams interface (a copy of DefaultFacetIndexingParams with the DEFAULT_FACET_DELIM_CHAR param set to '~').
Paraphrased indexing code (using FSDirectory for both the index and the taxonomy):
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34)
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer)
IndexWriter writer = new IndexWriter(indexDir, config)
TaxonomyWriter taxo = new LuceneTaxonomyWriter(taxDir, OpenMode.CREATE)
Document doc = new Document()
// Add bunch of Fields... hidden for the sake of brevity
List<CategoryPath> categories = new ArrayList<CategoryPath>()
row.tags.split('\\|').each { tag ->
    def cp = new CategoryPath()
    tag.split('~').each {
        cp.add(it)
    }
    categories.add(cp)
}
NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo, facetIndexingParams)
categoryDocBuilder.setCategoryPaths(categories).build(doc)
writer.addDocument(doc)
// Commit and close both writer and taxo.
Search code paraphrased:
// Create index and taxonomoy readers to get info from index and taxonomy
IndexReader indexReader = IndexReader.open(indexDir)
TaxonomyReader taxo = new LuceneTaxonomyReader(taxDir)
Searcher searcher = new IndexSearcher(indexReader)
QueryParser parser = new QueryParser(Version.LUCENE_34, "content", new StandardAnalyzer(Version.LUCENE_34))
parser.setAllowLeadingWildcard(true)
Query q = parser.parse(query)
TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true)
List<FacetResult> res = null
NewFacetIndexingParams facetIndexingParams = new NewFacetIndexingParams()
FacetSearchParams facetSearchParams = new FacetSearchParams(facetIndexingParams)
CountFacetRequest cfr = new CountFacetRequest(new CategoryPath(""), 99)
cfr.setDepth(2)
cfr.setSortBy(SortBy.VALUE)
facetSearchParams.addFacetRequest(cfr)
FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo)
def cp = new CategoryPath("Category~Category1", (char)'~')
searcher.search(DrillDown.query(q, cp), MultiCollector.wrap(tdc, facetsCollector))
The results always return a list of facets in the form of "Category/Category1".
I have used the Luke tool to look at the index and it appears the facets are being delimited by the '~' character in the index.
What is the best route to do this? Any help is greatly appreciated!
I have figured out the issue. The search and indexing are working as they are supposed to; it is how I was getting the facet results that was the issue. I was using:
res = facetsCollector.getFacetResults()
res.each { result ->
    result.getFacetResultNode().getLabel().toString()
}
What I needed to use was:
res = facetsCollector.getFacetResults()
res.each { result ->
    result.getFacetResultNode().getLabel().toString((char)'~')
}
The difference is the parameter sent to the toString function!
Easy to overlook, tough to find.
Hope this helps others.
