How to use queryString() in Elasticsearch (Java API)?

I am working with Elasticsearch v1.1.1.
I ran into a problem with search queries and want to know how to solve the obstacle below.
Here is my mapping
{
  "token" : {
    "type" : "string"
  }
}
The data in the indexed record is:
{
  "token" : "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd"
}
My search term is
4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd
I want an exact match on the record. Which query do I need to use to get an exact match? Can it be done with
QueryBuilders.queryString() ?
I checked queryString() and concluded it is not useful for exact matching.
Please suggest an approach.

You can put quotes around the string to do an exact match:
QueryBuilders.queryString("\"4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd\"");
If you don't want partial matches on the above string, index an untokenized version of the value and search on that. In your mapping add:
"token": {
"type": "multi_field",
"fields": {
"untouched": {
"type": "string",
"index": "not_analyzed"
}
}
}
Then search:
{
  "query": {
    "match": {
      "token.untouched": "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd"
    }
  }
}
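The same search can be issued from the Java API; a minimal sketch (an initialized Client named client and an index name of my_index are assumptions, not from the question):
// Match against the untokenized sub-field added by the multi_field mapping.
SearchResponse response = client.prepareSearch("my_index")
        .setQuery(QueryBuilders.matchQuery("token.untouched",
                "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd"))
        .execute()
        .actionGet();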

Change the mapping so Elasticsearch doesn't analyze your data while indexing:
{
  "token" : {
    "type" : "string",
    "index": "not_analyzed"
  }
}
And then run a TermQuery from Java like this:
QueryBuilders.termQuery("token", "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd");
That should give you your exact match.
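Putting it together, a minimal sketch of running that term query from Java (the client and index name are assumptions, not from the question):
// Term queries bypass analysis, so the not_analyzed token is matched exactly.
SearchResponse response = client.prepareSearch("my_index")
        .setQuery(QueryBuilders.termQuery("token", "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd"))
        .execute()
        .actionGet();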

Related

Google DLP - Can I use a delimiter to instruct DLP infotype detectors to search only inside that for sensitive text?

I have an issue while trying to deidentify some data with DLP. I use an object mapper to parse the object into a string, send it to DLP for deidentification, get the deidentified string back, and use the object mapper to parse the string back into the initial object. Sometimes DLP returns a string that cannot be parsed back into the initial object (it breaks the JSON format the object mapper expects).
I use an objectMapper to parse an Address object to a string like this:
Address(
  val postal_code: String,
  val street: String,
  val city: String,
  val provence: String
)
and my objectMapper will transform this object into a string, e.g. "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\",\"city\":\"My City\",\"provence\":\"My Provence\"}", which is sent to DLP and deidentified (using the LOCATION or STREET_ADDRESS detectors).
The issue is that my object mapper expects to take the deidentified string back and parse it into my Address object using the same JSON format, e.g.:
"{\"postal_code\":\"LOCATION_TOKEN(10):asdf\",\"street\":\"LOCATION_TOKEN(10):asdf\",\"city\":\"LOCATION_TOKEN(10):asdf\",\"provence\":\"LOCATION_TOKEN(10):asdf\"}"
But a lot of the time DLP returns something like
"{"LOCATION_TOKEN(25):asdfasdfasdf)\",\"provence\":\"LOCATION_TOKEN(10):asdf\"}" - basically breaking the JSON format, and I am unable to parse the string from DLP back into my initial object.
Is there a way to instruct DLP infotype detectors to keep the json format, or to look for sensitive text only inside \" * \"?
Thanks
There are some options here using a custom regex and a detection ruleset in order to define a boundary on matches.
The general idea is that you require that findings must match both an infoType (e.g. STREET_ADDRESS, LOCATION, PERSON_NAME, etc.) and your custom infoType before reporting as a finding or for redaction. By requiring that both match, you can set bounds on where the infoType can detect.
Here is an example.
{
  "item": {
    "value": "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\",\"city\":\"My City\",\"provence\":\"My Provence\"}"
  },
  "inspectConfig": {
    "customInfoTypes": [
      {
        "infoType": {
          "name": "CUSTOM_BLOCK"
        },
        "regex": {
          "pattern": "(:\")([^,]*)(\")",
          "groupIndexes": [
            2
          ]
        },
        "exclusionType": "EXCLUSION_TYPE_EXCLUDE"
      }
    ],
    "infoTypes": [
      {
        "name": "EMAIL_ADDRESS"
      },
      {
        "name": "LOCATION"
      },
      {
        "name": "PERSON_NAME"
      }
    ],
    "ruleSet": [
      {
        "infoTypes": [
          {
            "name": "LOCATION"
          }
        ],
        "rules": [
          {
            "exclusionRule": {
              "excludeInfoTypes": {
                "infoTypes": [
                  {
                    "name": "CUSTOM_BLOCK"
                  }
                ]
              },
              "matchingType": "MATCHING_TYPE_INVERSE_MATCH"
            }
          }
        ]
      }
    ]
  },
  "deidentifyConfig": {
    "infoTypeTransformations": {
      "transformations": [
        {
          "primitiveTransformation": {
            "replaceWithInfoTypeConfig": {}
          }
        }
      ]
    }
  }
}
Example output:
"item": {
"value": "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\",\"city\":\"My City\",\"provence\":\"My [LOCATION]\"}"
},
By setting "groupIndexes" to 2 we are indicating that we only want the custom infoType to match the middle (or second) regex group and not allow the :" or " to be part of the match. Also, in this example we mark the custom infoType as EXCLUSION_TYPE_EXCLUDE so that it does not report itself:
"exclusionType": "EXCLUSION_TYPE_EXCLUDE"
If you remove this line, anything matching your infoType could also get redacted. This can be useful for testing though - example output:
"item": {
"value": "{\"postal_code\":\"[CUSTOM_BLOCK]\",\"street\":\"[CUSTOM_BLOCK]\",\"city\":\"[CUSTOM_BLOCK]\",\"provence\":\"[CUSTOM_BLOCK][LOCATION]\"}"
},
...
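As a side note, the effect of "groupIndexes" can be reproduced with plain Java regex; this standalone sketch (not part of the DLP API) prints only what group 2 of the pattern above captures:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {
    public static void main(String[] args) {
        // Same pattern as the CUSTOM_BLOCK infoType: group 2 is the field
        // value between :" and the closing quote.
        Pattern p = Pattern.compile("(:\")([^,]*)(\")");
        String json = "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\"}";
        Matcher m = p.matcher(json);
        while (m.find()) {
            System.out.println(m.group(2)); // prints 123ABC, then Street Name
        }
    }
}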
Hope this helps.

Handling of "JSON does not allow non-finite numbers" in Elasticsearch

Hi, I am trying to do an aggregation on a field in my index using the MAX function. The issue is that when I aggregate with GROUP BY, it fails whenever that value is absent from a time frame.
POST /_opendistro/_sql
{
"query": "SELECT date(timeStamp) time_unit, MAX(testField) test_field_alias FROM my_index where orgId = 'xyz' and timeStamp <= '2020-07-04T23:59:59' and timeStamp >= '2020-07-01T00:00' group by date(timeStamp) order by time_unit desc"
}
My index mapping is as given below
"mappings" : {
"properties" : {
"orgId" : {
"type" : "text"
},
"testField" : {
"type" : "long",
"null_value" : 0
},
"timeStamp": {
"type":"date"
}
}
When I try the above query I get
{
  "error": {
    "root_cause": [
      {
        "type": "j_s_o_n_exception",
        "reason": "JSON does not allow non-finite numbers."
      }
    ],
    "type": "j_s_o_n_exception",
    "reason": "JSON does not allow non-finite numbers."
  },
  "status": 500
}
I understand this happens for time frames where the documents in my index don't have the above-mentioned field, so I added "null_value": 0 to my mapping, but still to no avail.
The thing is, I want to do an aggregation using the MAX function grouped by time scales; if there is another approach that works, that's enough for me. It doesn't have to be in SQL format.
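For illustration only, the same grouping can be expressed as a native date_histogram/max aggregation instead of SQL; a sketch using the Elasticsearch high-level REST Java client (a 7.x client, the index name my_index, and the missing(0) setting are my assumptions, with missing(0) intended to substitute 0 for documents that lack the field):
// Sketch: group by day and take MAX(testField), in the spirit of the SQL above.
SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.boolQuery()
                .filter(QueryBuilders.termQuery("orgId", "xyz"))
                .filter(QueryBuilders.rangeQuery("timeStamp")
                        .gte("2020-07-01T00:00")
                        .lte("2020-07-04T23:59:59")))
        .aggregation(AggregationBuilders.dateHistogram("time_unit")
                .field("timeStamp")
                .calendarInterval(DateHistogramInterval.DAY)
                .subAggregation(AggregationBuilders.max("test_field_alias")
                        .field("testField")
                        .missing(0))); // assumed workaround for empty buckets
SearchRequest request = new SearchRequest("my_index").source(source);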

Elastic Search multi match gets wrong result

I am sending a query to Elasticsearch to find all segments that have a field matching the query.
We are implementing a "free search" in which the user can write any text he wants, and we build a query that searches this text through all the segments' fields.
Each segment for which one (or more) of its fields contains this text should be returned.
For example:
I would like to get all the segments with the name "tony lopez".
Each segment has a field of "first_name" and a field of "last_name".
The query our service builds:
"multi_match" : {
"query": "tony lopez",
"type": "best_fields"
"fields": [],
"operator": "OR"
}
The result from Elasticsearch using this query is a segment whose "first_name" field is "tony" and whose "last_name" field is "lopez", but also a segment whose "first_name" field is "joe" and whose "last_name" is "tony".
For this type of query, I would like to receive only the segments whose name is "tony (first_name) lopez (last_name)".
How can I fix that issue?
I hope I'm not jumping to conclusions too soon, but if you want to get only tony and lopez as first name and last name, use this:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "first": "tony"
          }
        },
        {
          "match": {
            "last": "lopez"
          }
        }
      ]
    }
  }
}
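The same bool/must query can be built from the Java API; a sketch (field names as above, QueryBuilders from the Elasticsearch client):
// Both match clauses must be satisfied for a document to be returned.
QueryBuilder qb = QueryBuilders.boolQuery()
        .must(QueryBuilders.matchQuery("first", "tony"))
        .must(QueryBuilders.matchQuery("last", "lopez"));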
But if one of your indexed documents contains for example tony s as firstname, the query above will return it too.
Why? Because firstname is a text datatype:
A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed.
More Details
If you run this query via kibana:
POST my_index/_analyze
{
"field": "first",
"text": ["tony s"]
}
You will see that tony s is analyzed as two tokens tony and s.
passed through an analyzer to convert the string into a list of individual terms (tony as a term and s as a term).
That is why the above query returns tony s in results, it matches tony.
If you want to get only tony and lopez exact match then you should use this query:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "first.keyword": {
              "value": "tony"
            }
          }
        },
        {
          "term": {
            "last.keyword": {
              "value": "lopez"
            }
          }
        }
      ]
    }
  }
}
Read about keyword datatype
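The term-on-keyword variant looks like this from the Java API (a sketch with the same field names):
// Term queries on the .keyword sub-field skip analysis, giving exact matches.
QueryBuilder exact = QueryBuilders.boolQuery()
        .must(QueryBuilders.termQuery("first.keyword", "tony"))
        .must(QueryBuilders.termQuery("last.keyword", "lopez"));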
UPDATE
Try this query. It is not perfect (it has the same issue as my tony s example), and if you have a document with first name lopez and last name tony it will find that too.
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "tony lopez",
      "fields": [],
      "type": "cross_fields",
      "operator": "AND",
      "analyzer": "standard"
    }
  }
}
The cross_fields type is particularly useful with structured documents where multiple fields should match. For instance, when querying the first_name and last_name fields for “Will Smith”, the best match is likely to have “Will” in one field and “Smith” in the other
cross fields
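A Java API sketch of the cross_fields variant (MultiMatchQueryBuilder and Operator come from org.elasticsearch.index.query in recent clients):
// cross_fields treats first and last as one combined field; AND requires
// every term (tony and lopez) to appear in some field.
QueryBuilder crossFields = QueryBuilders.multiMatchQuery("tony lopez", "first", "last")
        .type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
        .operator(Operator.AND)
        .analyzer("standard");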
Hope it helps

Elasticsearch lowercase tokenizer quirk?

I am testing a mapping for URLs in Elasticsearch.
I want to be able to search for an entry both by domain name with the TLD (e.g. example.com)
and without the TLD (e.g. example), and for the full domain document to be returned
(like http://example.com, www.example.com and similar).
I PUT this mapping to ES - in Sense:
PUT /en_docs
{
  "mappings": {
    "url": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "urlzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "urlzer": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": [ "stopwords_filter" ]
        }
      },
      "filter": {
        "stopwords_filter": {
          "type": "stop",
          "stopwords": ["http", "https", "ftp", "www"]
        }
      }
    }
  }
}
Now, when I index a URL document, e.g.
POST /en_docs/url
{
  "content": "http://example.com"
}
I can get it by searching for example.com, but just example doesn't return anything.
The lowercase tokenizer I'm using in my analyzer, as the docs say and as direct testing of the analyzer shows, produces the tokens example and com, but when I search the indexed document, example returns nothing:
GET /en_docs/url/_search?q=example
gets no results, but if the query is example.com, a result is returned.
What am I missing?
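For reference, the analyzer test mentioned above can also be run from the Java API; a sketch (an initialized Client named client is an assumption, the question does not show one):
// Ask the index to run the custom analyzer over the raw URL and print the tokens.
AnalyzeResponse analyzed = client.admin().indices()
        .prepareAnalyze("en_docs", "http://example.com")
        .setAnalyzer("urlzer")
        .execute()
        .actionGet();
for (AnalyzeResponse.AnalyzeToken token : analyzed.getTokens()) {
    System.out.println(token.getTerm()); // expected: example, com
}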

Update nested field in an index of ElasticSearch with Java API

I am using the Java API for CRUD operations on Elasticsearch.
I have a type with a nested field, and I want to update this field.
Here is my mapping for the type:
"enduser": {
"properties": {
"location": {
"type": "nested",
"properties":{
"point":{"type":"geo_point"}
}
}
}
}
Of course my enduser type will have other parameters.
Now I want to add this document in my nested field:
"location":{
"name": "London",
"point": "44.5, 5.2"
}
I was searching in documentation on how to update nested document but I couldn't find anything. For example I have in a string the previous JSON obect (let's call this string json). I tried the following code but seems to not working:
params.put("location", json);
client.prepareUpdate(index, ElasticSearchConstants.TYPE_END_USER,id).setScript("ctx._source.location = location").setScriptParams(params).execute().actionGet();
I got a parsing error from Elasticsearch. Does anyone know what I am doing wrong?
You don't need the script, just update it.
UpdateRequestBuilder br = client.prepareUpdate("index", "enduser", "1");
br.setDoc("{\"location\":{ \"name\": \"london\", \"point\": \"44.5,5.2\" }}".getBytes());
br.execute();
I tried to recreate your situation and solved it by using the .setScript method in a different way.
Your update request would now look like:
client.prepareUpdate(index, ElasticSearchConstants.TYPE_END_USER,id).setScript("ctx._source.location =" + json).execute().actionGet()
Hope it will help you.
I am not sure which ES version you were using, but the solution below worked perfectly for me on 2.2.0. I had to store information about named entities for news articles. I guess if you wish to have multiple locations in your case, it would suit you as well.
This is the nested object I wanted to update:
"entities" : [
{
"disambiguated" : {
"entitySubTypes" : [],
"disambiguatedName" : "NameX"
},
"frequency" : 1,
"entityType" : "Organization",
"quotations" : ["...", "..."],
"name" : "entityX"
},
{
"disambiguated" : {
"entitySubType" : ["a", "b" ],
"disambiguatedName" : "NameQ"
},
"frequency" : 5,
"entityType" : "secondTypeTest",
"quotations" : [ "...", "..."],
"name" : "entityY"
}
],
and this is the code:
UpdateRequest updateRequest = new UpdateRequest();
updateRequest.index(indexName);
updateRequest.type(mappingName);
updateRequest.id(url); // docID is a url
XContentBuilder jb = XContentFactory.jsonBuilder();
jb.startObject(); // article
jb.startArray("entities"); // multiple entities
for ( /*each namedEntity*/) {
jb.startObject() // entity
.field("name", name)
.field("frequency",n)
.field("entityType", entityType)
.startObject("disambiguated") // disambiguation
.field("disambiguatedName", disambiguatedNameStr)
.field("entitySubTypes", entitySubTypeArray) // multi value field
.endObject() // disambiguation
.field("quotations", quotationsArray) // multi value field
.endObject(); // entity
}
jb.endArray(); // array of nested objects
jb.endObject(); // article
updateRequest.doc(jb);
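And then execute it (actionGet() blocks until the update response arrives):
client.update(updateRequest).actionGet();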
Blblblblblblbl's answer couldn't work for me at the moment, because scripts are not enabled on our server. I didn't try Bask's answer yet. Alcanzar's gave me a hard time, because I supposedly couldn't correctly formulate the JSON string that setDoc receives: I was constantly getting errors that either I was using objects instead of fields or vice versa. I also tried wrapping the JSON string with doc{} as indicated here, but I didn't manage to make it work. As you mentioned, it is difficult to understand how to formulate a curl statement with ES's Java API.
A simple way to update an array list and an object value using the Java API:
UpdateResponse update = client.prepareUpdate("indexname","type",""+id)
.addScriptParam("param1", arrayvalue)
.addScriptParam("param2", objectvalue)
.setScript("ctx._source.field1=param1;ctx._source.field2=param2").execute()
.actionGet();
arrayvalue -
[
  {
    "text": "stackoverflow",
    "datetime": "2010-07-27T05:41:52.763Z",
    "obj1": {
      "id": 1,
      "email": "sa#gmail.com",
      "name": "bass"
    },
    "id": 1
  }
]
object value -
"obj1": {
  "id": 1,
  "email": "sa#gmail.com",
  "name": "bass"
}
