I am new to Elasticsearch (ES) and have gone through basic tutorials like
this mkyong tutorial.
I have a question about the create part of any document.
CREATE Operation Example
To insert a new document at /mkyong/posts/1001 with the following request data:
{
  "title": "Java 8 Optional In Depth",
  "category": "Java",
  "published_date": "23-FEB-2017",
  "author": "Rambabu Posa"
}
Question 1 :- Will ES create the inverted index on all attributes of the above document (i.e. title/category/published_date/author) by default and provide full-text search, or do I need to configure it explicitly?
Question 2 :- In the above example we already have a unique ID, i.e. 1001. That's fine if I am already storing the document in a DB and generating the ID there. What if I need ES to generate the ID and do not have any DB?
Update :-
Got the answer to question 1 from Specify which fields are indexed in ElasticSearch
Question 1 :- Yes, by default ES will index each string field twice, as two separate types: once as "text" and once as "keyword" in a sub-field such as "title.keyword". The "text" type is run through language analyzers to support the standard full-text search case (removing stop words, stemming words, etc.). The "keyword" type makes no changes and indexes the data exactly as-is, to support exact matches and aggregations. You can explicitly give ES a mapping for any field, but if you don't, this is the default behavior.
Here is some information on the text vs keyword behavior:
https://www.elastic.co/blog/strings-are-dead-long-live-strings
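For illustration, the default behavior described above is equivalent to declaring the mapping explicitly. A minimal sketch for the title field (index name taken from the example above; recent ES versions, where mappings are typeless):

```json
PUT /mkyong
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
```

This is what dynamic mapping generates for a new string field: full-text searches hit the analyzed title, while exact matches, sorts, and aggregations use title.keyword.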
Question 2 :- ES automatically creates its own internal ID for every document you index, in a field called "_id". You can technically replace this with your own ID, but typically you don't want to, because it can make the hashing algorithm ES uses to spread data across shards perform poorly. It is usually better to add any IDs you need as regular fields in the document and let ES index them for you, ideally as the keyword type.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html
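Concretely, to have ES generate the ID, POST to the collection endpoint instead of PUTting to an explicit ID (URL shape taken from the example above):

```json
POST /mkyong/posts
{
  "title": "Java 8 Optional In Depth",
  "category": "Java"
}
```

The response then contains an auto-generated value in the "_id" field, which you can use to retrieve the document later.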
So for a hobby project of mine, I would like to create an application that translates HTTP requests and responses between two services.
The application does this based on a configuration that can be set by the user. The idea is that the application listens for an incoming API call, translates the call, and then forwards it.
The application then waits for the response, translates the response, and sends it back to the caller.
A translation can be as simple as renaming a field in a body object or moving a header field into the body.
I think a translation should begin with mapping the correct URL, so here is an example of what I was thinking a configuration should look like:
//request mapping
incoming URL = outgoing URL(
    //Rename header value
    header.someobject.renameto = "somevalue"
    //Replace body object to header
    body.someobject.replaceto.header
)
I was thinking that the configuration should be placed in a .txt file and read by the application.
My question is: are there other similar systems that use a configuration file for something like this? And are there other/better ways to declare such a configuration?
I have done something sort-of-similar in a different context (generating code from an input specification), so I will outline what I did as food for thought. I used Config4* (disclosure: I developed that). If the approach I describe below is of interest to you, then I suggest you read Chapters 2 and 3 of the Config4* Getting Started Guide to get an overview of the Config4* syntax and API. Alternatively, express the concepts below in a different configuration syntax, such as XML.
Config4* is a configuration syntax, and the subset of syntax relevant to this discussion is as follows:
# this is a comment
name1 = "simple value";
name2 = ["a", "list of", "values"];
# a list can be laid out in columns to simulate a table of information
name3 = [
# item colour
#------------------
"car", "red",
"jeans", "blue",
"roses", "red",
];
In a code generator application, I used a table to provide rules to specify how to generate code for assigning values to fields of messages. If no rule was specified for a particular field, then some built-in rules provided default behaviour. The table looked something like the following:
field_rules = [
# wildcarded message.field instruction
#----------------------------------------------------------------
"Msg1.username", "#config:username",
"Msg1.password", "#config:password",
"Msg3.price", "#order:price",
"*.account", "#string:foobar",
"*.secondary_account", "#ignore",
"*.heartbeat_interval", "#expr:_heartbeatInterval * 1000",
"*.send_timestamp", "#now",
];
When my code generator wanted to generate code to assign a value to a field, the code generator constructed a string of the form "<message-name>.<field-name>", for example, Msg3.price. Then it examined the field_rules table line-by-line (starting from the top) to find a line in which the first column matched "<message-name>.<field-name>". The matching logic permitted * as a wildcard character that could match zero or more characters. (Conveniently, Config4* provides a patternMatch() utility operation that provides this functionality.)
If a match was found, then the value in the instruction column told the code generator what sort of code to generate. (If no match was found, then built-in rules were used, and if none of those applied, then no code was generated for the field.)
Each instruction was a string of the form "#<keyword>:optional,arguments". That was tokenized to provide the keyword and the optional arguments. The keyword was converted to an enum, and that drove a switch statement for generating code. For example:
The #config:username instruction specified that code should be
generated to assign the value of the username variable in a runtime
configuration file to the field.
The #order:price instruction specified that code should be generated
to assign the value returned from calling orderObj->getPrice() to the field.
The #string:foobar instruction specified the string literal foobar
should be assigned to the field.
The #expr:_heartbeatInterval * 1000 instruction specified that code should
be generated to assign the value of the expression _heartbeatInterval * 1000
to the field.
The #ignore instruction specified that no code should be generated to
assign a value to the field.
The #now instruction specified that code should be generated to assign
the current clock time to the field.
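The tokenize-and-dispatch step described above can be sketched with only the standard library (the class and method names here are hypothetical, not part of Config4*):

```java
public class InstructionParser {
    // Splits an instruction like "#expr:_heartbeatInterval * 1000" into
    // a keyword ("expr") and optional arguments ("_heartbeatInterval * 1000").
    public static String[] parse(String instruction) {
        if (!instruction.startsWith("#")) {
            throw new IllegalArgumentException("instruction must start with '#'");
        }
        int colon = instruction.indexOf(':');
        if (colon < 0) {
            // e.g. "#ignore" or "#now": keyword only, no arguments
            return new String[] { instruction.substring(1), "" };
        }
        return new String[] {
            instruction.substring(1, colon),
            instruction.substring(colon + 1)
        };
    }

    public static void main(String[] args) {
        String[] t = parse("#config:username");
        System.out.println(t[0] + " / " + t[1]); // prints "config / username"
    }
}
```

In the real generator, the keyword would then be mapped to an enum driving the switch statement mentioned above.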
I have used the above technique in several projects, and each time I have invented instructions specific to the needs of the particular project. If you decide to use this technique, then obviously you will need to invent instructions to specify runtime translations rather than instructions to generate code. Also, don't feel you have to shoehorn all of your translation-based configuration into a single table. For example, you might use one table to provide a source URL -> destination URL mapping, and a different table to provide instructions for translating fields within messages.
If this technique works as well for you as it has worked for me on my projects, then you will end up with your translation application being an "engine" whose behaviour is driven entirely by a configuration file that, in effect, is a DSL (domain-specific language). That DSL file is likely to be quite compact (less than 100 lines), and will be the part of the application that is visible to users. Because of this, it is worthwhile investing effort to make the DSL as intuitive and easy-to-read/modify as possible, because doing that will make the translation application: (1) user friendly, and (2) easy to document in a user manual.
I want to filter results by a specific value in the aggregated array in the query.
Here is a little description of the problem.
A section belongs to a garden, a garden belongs to a district, and a district belongs to a province.
Users have multiple sections. Those sections belong to their gardens, the gardens to their districts, and the districts to their provinces.
I want to get the IDs of users that have the value 2 in their district array.
I tried to use the any operator but it doesn't work properly (syntax error).
Any help would be appreciated.
PS: this is possible to write in plain SQL.
rs = dslContext.select(
field("user_id"),
field("gardens_array"),
field("province_array"),
field("district_array"))
.from(table(select(
arrayAggDistinct(field("garden")).as("gardens_array"),
arrayAggDistinct(field("province")).as("province_array"),
arrayAggDistinct(field("distict")).as("district_array"))
.from(table("lst.user"))
.leftJoin(table(select(
field("section.user_id").as("user_id"),
field("garden.garden").as("garden"),
field("garden.province").as("province"),
field("garden.distict").as("distict"))
.from(table("lst.section"))
.leftJoin("lst.garden")
.on(field("section.garden").eq(field("garden.garden")))
.leftJoin("lst.district")
.on(field("district.district").eq(field("garden.district")))).as("lo"))
.on(field("user.user_id").eq(field("lo.user_id")))
.groupBy(field("user.user_id"))).as("joined_table"))
.where(val(2).equal(DSL.any("district_array")))
.fetch()
.intoResultSet();
Your code is calling DSL.any(T...), which corresponds to the expression any(?) in PostgreSQL, where the bind value in your case is a String[]. But you don't want "district_array" to be a bind value; you want it to be a column reference. So either assign your arrayAggDistinct() expression to a local variable and reuse that, or reuse your field("district_array") expression, or replicate it:
val(2).equal(DSL.any(field("district_array", Integer[].class)))
Notice that it's usually a good idea to be explicit about data types (e.g. Integer[].class) when working with the plain SQL templating API, or even better, use the code generator.
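For comparison, the predicate the corrected expression renders in plain SQL (assuming district_array is an integer array, as in the query above) is:

```sql
-- keep only rows whose aggregated array contains the value 2
WHERE 2 = ANY (district_array)
```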
I have installed MongoDB on my system and inserted 10 documents into the username collection. Each document contains name, roll no, and city fields. I need to count the number of fields in the username collection.
I expect 3... How do I get the 3 from a Java program?
If you are using Java in your application, you are manipulating DBObject (http://api.mongodb.org/java/2.6/com/mongodb/DBObject.html), and you can get the key set from it; its size will be the number of attributes. (In your case it will be 4, since you also have the _id attribute.)
But this is PER DOCUMENT. Remember that in a collection each document can have its own "structure": in your case one user could have 4 attributes, another could have 10, and some of them could have subdocuments with their own structure. MongoDB does not have any "catalog".
Some systems "sample" the data to analyze the global structure of the documents and provide a catalog, but this will not be exact.
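A document is essentially a map of field names to values, so the per-document count described above is just the size of its key set. A minimal sketch using a plain Map standing in for a DBObject (which exposes the same keySet() idea):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldCount {
    public static void main(String[] args) {
        // A MongoDB document maps field names to values.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", "507f1f77bcf86cd799439011"); // added automatically by MongoDB
        doc.put("name", "alice");
        doc.put("roll no", 42);
        doc.put("city", "Berlin");

        // keySet().size() is the number of fields in this particular document.
        System.out.println(doc.keySet().size()); // prints 4, including _id
    }
}
```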
Try this:
System.out.println(coll.getCount());
More Detail here: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-java-driver/
I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
if the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which is an enormous list where, of course, the best-matching result is in the first position, but the following results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which, of course, does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure that all words appear somewhere in the fields, I would suggest copying all relevant fields into one field at index time and querying that one instead. To do so, you need to introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach is to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keyword and one or more terms must match in name and one or more terms must match in author.
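A minimal Java sketch of building such a query string from the user's free-text input (class and method names are hypothetical, not part of Solr):

```java
import java.util.List;

public class QueryBuilder {
    // For each field, OR together all terms, then AND the per-field groups,
    // so every field must match at least one of the terms.
    public static String build(List<String> fields, List<String> terms) {
        StringBuilder query = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                query.append(" AND ");
            }
            query.append("(");
            for (int j = 0; j < terms.size(); j++) {
                if (j > 0) {
                    query.append(" OR ");
                }
                query.append(fields.get(i)).append(":\"").append(terms.get(j)).append("\"");
            }
            query.append(")");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(
            List.of("keywords", "name", "author"),
            List.of("berlin", "house", "john")));
    }
}
```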
From Solr 4 on, defaultOperator is deprecated. Please don't use it.
Also, in my experience, defaultOperator behaves the same as an operator specified in the query; I can't say why, that is just what I have observed.
Please try the query with the param {!q.op=AND}.
I guess you are using the default query parser; correct me if I am wrong.
I want to use a single field to index the document's title and body, in an effort to improve performance.
The idea was to do something like this:
Field title = new Field("text", "alpha bravo charlie", Field.Store.NO, Field.Index.ANALYZED);
title.setBoost(3);
Field body = new Field("text", "delta echo foxtrot", Field.Store.NO, Field.Index.ANALYZED);
Document doc = new Document();
doc.add(title);
doc.add(body);
And then I could just do a single TermQuery instead of a BooleanQuery for two separate fields.
However, it turns out that the effective boost of a field is the product of the boosts of all fields with the same name in the document. In my case, that means both fields end up with a boost of 3.
Is there a way I can get what I want without resorting to using two different fields? One way would be to add the title field several times to the document, which increases the term frequency. This works, but seems incredibly brain-dead.
I also know about payloads, but that seems like an overkill for what I'm after.
Any ideas?
If you want to take a page out of Google's book (at least their old book), then you may want to create separate indexes: one for document bodies, another for titles. I'm assuming there is a field stored that points to a true UID for each actual document.
The alternative answer is to write a custom implementation of [Similarity][1] to get the behavior you want. Unfortunately, I find that Lucene often needs this kind of customization when unique problems arise.
[1]: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)
You can index title and body separately with title field boosted by a desired value. Then, you can use MultiFieldQueryParser to search multiple fields.
While, technically, searching multiple fields takes longer, even with this overhead Lucene tends to be extremely fast (on the order of a few tens or hundreds of milliseconds).