Renaming JCR nodes with custom names (in CQ/AEM) - Java

Authors make some comments once a month. These are stored in the JCR under "content" in a node called "remarks"; each comment is stored in a child node named "remarks_xxxx", where xxxx is a random string of letters and digits.
I need to rename all of the existing nodes to "remarks_mmddyy" and also assign future names in the same fashion.
Thanks

The best approach is to write the date of the remark into a property (of type Date) instead of encoding it in the node name. This eliminates the need to rename nodes and also improves your chances of leveraging JCR queries to your advantage.
To retrieve remarks for a certain date and time, use the JCR query API, which allows searching on properties (including the Date type, of course). Since AEM 6 and Jackrabbit Oak, you can define a custom index to make sure that a given property query is blazing fast. Note that "order by" is supported as well, in case ordering is an issue.
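A minimal sketch of that approach, assuming an open JCR Session named session, a hypothetical /content/remarks parent node, and an illustrative remarkDate property name:

import java.util.Calendar;
import javax.jcr.Node;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

// Store the remark date in a Date property instead of encoding it in the node name
Node remarks = session.getNode("/content/remarks");        // hypothetical parent path
Node remark = remarks.addNode("remark", "nt:unstructured");
remark.setProperty("remarkDate", Calendar.getInstance());  // creates a DATE property
session.save();

// Retrieve remarks ordered by date via JCR-SQL2, no renaming needed
QueryManager qm = session.getWorkspace().getQueryManager();
Query q = qm.createQuery(
    "SELECT * FROM [nt:unstructured] AS r "
    + "WHERE ISDESCENDANTNODE(r, '/content/remarks') "
    + "ORDER BY r.[remarkDate] DESC", Query.JCR_SQL2);
QueryResult result = q.execute();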
If you absolutely must stick with the detrimental data model of renaming nodes and encoding dates in node names, check out the following article on how to do it: How can you change the name of a JCR node?
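If you do stick with renaming, note that JCR has no rename call as such; the usual idiom is Session.move() to the new name under the same parent. A hedged sketch, assuming an existing remarkNode and an open session:

import java.text.SimpleDateFormat;
import java.util.Date;

// "Rename" by moving the node to a new path under the same parent
String newName = "remarks_" + new SimpleDateFormat("MMddyy").format(new Date());
session.move(remarkNode.getPath(), remarkNode.getParent().getPath() + "/" + newName);
session.save();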


Lucene - Keyword field confusion

I started learning Lucene, so I am reading Lucene in Action. An excerpt from this book regarding fields is:
Keyword—Isn’t analyzed, but is indexed and stored in the index verbatim.
This type is suitable for fields whose original value should be preserved in
its entirety, such as URLs, file system paths, dates, personal names, Social
Security numbers, telephone numbers, and so on
What I understood from this is: if a text is indexed with a Keyword field, it is not analyzed (not split into tokens) but is indexed. However, what I don't understand is the "and stored in the index verbatim" part.
I am confused about storing in the index. I assumed that if the text is indexed it would get stored in the index data structure.
Can anyone please explain with an example?
I think you must be reading the first edition of Lucene in Action. That book is 11 years old and hopelessly outdated. I wouldn't be inclined to worry too much about understanding the conventions of Lucene 1.4.
The Second Edition is available. It's five years old and is based on Lucene 3.0, so it's definitely somewhat outdated, especially since the big changes in Lucene version 4.0, but not hopelessly so. Reading that would certainly be much more useful.
The difference between storing and indexing a field does still exist though. In Lucene parlance:
Index - The field is indexed, and can be searched for. Keyword fields (or, more recently, StringField) are not analyzed, but they are indexed, so their complete content can be searched without tokenization.
Store - The field is stored, in its entirety, separately from the indexed form, for later retrieval. When you get a search result from Lucene (for instance, from IndexSearcher.doc(int)), the document you get back will only have stored fields in it.
As such, you can have a field that you can search on, but won't be returned in results, or a field that is returned in results but can't be searched.
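To make the stored/indexed distinction concrete, here is a small sketch with the Lucene 4.x field classes; the field names and values are made up for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41)));

Document doc = new Document();
// Not analyzed, indexed and stored: searchable as one exact token and returned in results
doc.add(new StringField("ssn", "123-45-6789", Field.Store.YES));
// Analyzed and indexed, but NOT stored: its tokens can be searched, yet the text
// will be missing from documents returned by IndexSearcher.doc(int)
doc.add(new TextField("body", "some long text you can search but not get back", Field.Store.NO));
writer.addDocument(doc);
writer.close();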

Finding number of unique terms over multiple fields

I need to find number (or list) of unique terms over a combination of two or more fields in Lucene-Java. I am using Java libraries for Lucene 4.1.0. I checked questions such as this and this, but they discuss finding list of unique terms from a single (specific) field, or over all the fields (no subset).
For example, I am interested in number(unique(height, gender)) rather than number(unique(height)), or number(unique(gender)).
Given the data:
height,gender
1,M
2,F
3,M
3,F
4,M
4,F
number(unique(height)) is 4, number(unique(gender)) is 2 and number(unique(gender,height)) is 6.
Any help will be greatly appreciated.
Thanks!
If you have predefined multiple fields, then the simplest and quickest option (in search terms) would be to index a combined field, i.e. heightGender (1.23:male). You can then just count the unique terms in this field; however, this doesn't offer any flexibility at search time.
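A rough sketch of that combined-field idea with the Lucene 4.1 API, assuming an open IndexWriter named writer, the index Directory in dir, and height/gender values from your data (the field name heightGender and the ':' separator are illustrative choices):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

// Index time: write the combination as a single, unanalyzed term
Document doc = new Document();
doc.add(new StringField("heightGender", height + ":" + gender, Field.Store.NO));
writer.addDocument(doc);

// Count time: the number of unique terms in that field is number(unique(height, gender))
DirectoryReader reader = DirectoryReader.open(dir);
Terms terms = MultiFields.getTerms(reader, "heightGender");
long unique = 0;
if (terms != null) {
    TermsEnum te = terms.iterator(null);
    while (te.next() != null) {
        unique++;
    }
}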
A more flexible approach would be to use facets (https://lucene.apache.org/core/4_1_0/facet/index.html). You would then constrain your query to each value of one field (e.g. gender (male/female)) and retrieve all the values (and document counts) of the other field.
However, if you do not have the ability to change the indexing process, then you are left with doing a brute-force search using Boolean queries to find the number of documents in the index for all combinations of the field values in which you are interested. I presume you are only counting combinations where the number of documents is non-zero.
It is worth noting that this question is exactly what Solr Pivot Facets address (http://lucidworks.com/blog/pivot-facets-inside-and-out/)

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list. Of course the best-matching result is in first position, but the following results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which of course does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure all the search words appear somewhere across the relevant fields, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
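Since you mention having done this with pure Lucene before, here is a hedged sketch of the same structure built programmatically with the pre-5.x Lucene BooleanQuery API (with Solr itself you would simply send the query string above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

String[] words = { "berlin", "house", "john" };
String[] fields = { "keywords", "name", "author" };

BooleanQuery query = new BooleanQuery();
for (String field : fields) {
    // one OR group per field ...
    BooleanQuery perField = new BooleanQuery();
    for (String word : words) {
        perField.add(new TermQuery(new Term(field, word)), Occur.SHOULD);
    }
    // ... and each field group as a whole is mandatory
    query.add(perField, Occur.MUST);
}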
As of Solr 4, defaultOperator is deprecated, so please don't rely on it.
Also, in my experience defaultOperator behaves the same as an operator specified explicitly in the query; I can't say why, that is just what I have observed.
Please try the query with the {!q.op=AND} local param.
I am assuming you use the default query parser; correct me if I am wrong.

Built-in libraries to perform effective searching on 100GB files

Is there any built-in library in Java for searching strings in large files of about 100GB? I am currently using binary search but it is not that efficient.
As far as I know Java does not contain any file search engine, with or without an index. There is a very good reason for that too: search engine implementations are intrinsically tied to both the input data set and the search pattern format. A minor variation in either could result in massive changes in the search engine.
For us to be able to provide a more concrete answer you need to:
Describe exactly the data set: the number, path structure and average size of files, the format of each entry and the format of each contained token.
Describe exactly your search patterns: are those fixed strings, glob patterns or, say, regular expressions? Do you expect the pattern to match a full line or a specific token in each line?
Describe exactly your desired search results: do you want exact or approximate matches? Do you want to get a position in a file, or extract specific tokens?
Describe exactly your requirements: are you able to build an index beforehand? Is the data set expected to be modified in real time?
Explain why can't you use third party libraries such as Lucene that are designed exactly for this kind of work.
Explain why your current binary search, which should have a complexity of O(log n), is not efficient enough. The only thing that might be faster, with constant complexity, would involve the use of a hash table.
It might be best if you described your problem in broader terms. For example, one might assume from your sample data set that what you have is a set of words and associated offset or document identifier lists. A simple way to approach searching in such a set would be to store a word/file-position index in a hash table, so that each associated list can be accessed in constant time.
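To illustrate that last point, a hypothetical sketch of a word-to-file-offset index backed by a hash table; for a 100GB data set the map itself may well not fit in memory, so this only demonstrates the constant-time lookup idea, not a production design:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OffsetIndex {
    // Build a map from each token to the byte offsets of the lines containing it
    public static Map<String, List<Long>> build(String path) throws IOException {
        Map<String, List<Long>> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                for (String token : line.trim().split("\\s+")) {
                    index.computeIfAbsent(token, k -> new ArrayList<>()).add(offset);
                }
                offset = raf.getFilePointer();
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<Long>> index = build(args[0]);
        // Constant-time lookup: all line offsets at which the word occurs
        System.out.println(index.get("someWord"));
    }
}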
If you don't want to use tools built for search, then store the data in a database and use SQL.

JCR 170 Data modeling: Node names

The situation:
Let's say we are implementing a blog engine based on JCR with support for localization.
The content structure looks something like this /blogname/content/[node name]
The problem:
What is the best way to name the content nodes (/blogname/content/[nodename]) to satisfy the following requirements:
The node name must be usable in HTML to support REST like URLs i.e.: blogname.com/content/nodename should point to a single content item.
The above requirement must not produce ugly URLs i.e.: /content/node_name is good, /content/node%20name is bad.
Programmatic retrieval should be easy given the node name i.e.: //content[#node_name=some-name]
The naming scheme must guarantee node name uniqueness.
PS: The JCR implementation used is Jackrabbit.
For 1. to 3. the answer is simple: just use characters you want to see in the node name, i.e. escape whatever input string you have (e.g. the blog post title) against a restricted character set such as the one for URIs.
For example, do not allow spaces (which are allowed in JCR node names, but would produce the ugly %20 in URLs) and other chars that must be encoded in URLs. You can remove those chars or simply replace them with an underscore, because that looks good in most cases.
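A hedged sketch of such an escaping step in Java (the exact replacement policy is an assumption; adapt it to your own naming rules):

// Derive a URL- and JCR-friendly node name from a title
String title = "My great post: part 2!";
String nodeName = title.toLowerCase()
        .replaceAll("[^a-z0-9]+", "_")   // collapse anything non-alphanumeric into '_'
        .replaceAll("^_+|_+$", "");      // trim leading/trailing underscores
// nodeName is now "my_great_post_part_2"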
Regarding unique names (4.), you can either include the current time, incl. milliseconds, in the name, or explicitly check for collisions. The first might look a bit ugly, but should practically never fail in a blog scenario. The latter can be done by reacting to the exception thrown if a node with that name already exists, adding e.g. an incrementing counter and trying again (e.g. my_great_post1, my_great_post2, etc.). You can also lock the parent node so that only one session can add a node at a time, which avoids a retry loop but comes at the cost of blocking.
Note: //content[#node_name=some-name] is not a valid JCR XPath query. You probably want to use /jcr:root/content//some-name for that.
Regarding item 3., I recently learned that XPath queries do not allow item names to start with a number. If your node name starts with a number it can still be queried by escaping the first byte of the name, but your queries will be more straightforward if you start all node names with a letter.
(I'm not sure about property names. Haven't ever seen one that didn't start with a letter.)
Unique names: To quickly generate a unique name from the first characters of a title plus a random number (to resolve conflicts), you could use the following algorithm:
import java.util.Random;

String title = "JCR 170 Data modeling: Node names";
// take (up to) the first 10 characters of the title as the base name
String name = title.substring(0, Math.min(title.length(), 10)).trim().replace(' ', '_');
if (!isUnique(name)) {  // isUnique() is a placeholder for your own check, e.g. parent.hasNode(name)
    name += "_";
    Random r = new Random();
    while (!isUnique(name)) {
        // append random digits until the name no longer collides
        name += Integer.toString(r.nextInt(10));
    }
}
The advantage of using a random number is: even if you have many similar names, this will resolve conflicts very quickly.
