How to programmatically dump data (bulk write) to Elasticsearch - Java

I am trying to learn the ropes of Elasticsearch. As part of QA testing, I want to write a massive number of records to ES (say 10K records). Each record is a geolocation (x, y) coordinate. Each write will arbitrarily increase the value of (x, y). I could keep a counter that I update on every loop iteration in Java and write to ES one record at a time, but I am guessing there may be a better way (in the ES documentation I came across the _bulk keyword).
Is there an ES-native way of doing massive programmatic writes?

You can use the bulk API in Elasticsearch to add multiple documents at once. The API and examples are described in Elasticsearch: The Definitive Guide, which can be found here: https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html
I'm guessing you would want to build a string like this:
{ "index": { "_index": "locationstorage", "_type": "geolocation" }} \n
{ "xcoord": 1, "ycoord": 2 } \n
{ "index": { "_index": "locationstorage", "_type": "geolocation" }} \n
{ "xcoord": 2, "ycoord": 3 } \n
and POST it to the /_bulk API. Every line, including the last one, must end with a newline character (the \n shown above).
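A minimal sketch of how the counter loop from the question could produce that string in Java. The class and method names are hypothetical, the index, type, and field names are the ones from the example above, and sending the result over HTTP is left out:

```java
class BulkPayloadBuilder {

    /**
     * Builds a newline-delimited bulk body with one action line and one
     * source line per record. Record i (starting at 0) gets xcoord = i + 1
     * and ycoord = i + 2, mirroring the example payload above.
     */
    static String buildBulkBody(String index, String type, int count) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < count; i++) {
            // Action line: tells the bulk API where to index the next document.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_type\":\"").append(type).append("\"}}\n");
            // Source line: the document itself.
            body.append("{\"xcoord\":").append(i + 1)
                .append(",\"ycoord\":").append(i + 2).append("}\n");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Prints the two-record payload; for 10K records pass 10000 instead.
        System.out.print(buildBulkBody("locationstorage", "geolocation", 2));
    }
}
```

The resulting string would be the request body of a single POST to /_bulk. For very large record counts it may be worth splitting the records into several smaller requests.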


Looking for Java code for Mongo aggregation query

Can somebody guide me on the aggregation query in Java for the following Mongo query. I am trying to sum up the distance covered every day by the vehicle. There are some duplicate records (which I cannot eliminate) so I have to use group by to filter them out.
db.collection1.aggregate({ $match: { "vehicleId": "ABCDEFGH", $and: [{ "timestamp": { $gt: ISODate("2022-08-24T00:00:00.000+0000") } }, { "timestamp": { $lt: ISODate("2022-08-25T00:00:00.000+0000") } }, { "distanceMiles": { "$gt": 0 } }] } }, { $group: {"_id": {vehicleId: "$vehicleId", "distanceMiles" : "$distanceMiles" } } }, { $group: { _id: null, distance: { $sum: "$_id.distanceMiles" } } })
If possible can you also suggest some references? I am stuck at the last group by involving $_id part.
The Java code that I have except the last group by is:
Criteria criteria = new Criteria();
criteria.andOperator(Criteria.where("timestamp").gte(start).lte(end),
        Criteria.where("vehicleId").in(vehicleIdList));
Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(criteria),
        Aggregation.sort(Direction.DESC, "timestamp"),
        Aggregation.project("distanceMiles", "vehicleId", "timestamp")
                .and("timestamp").dateAsFormattedString("%Y-%m-%d").as("yearMonthDay"),
        Aggregation.group("vehicleId", "yearMonthDay")
                .first("vehicleId").as("vehicleId")
                .first("timestamp").as("lastReported")
                .sum("distanceMiles").as("distanceMiles"));
Note: there is a slight difference between the raw Mongo query and the Java query in the date parameter.
Generally if you are looking for advice on how to directly convert an aggregation pipeline into Java code (not necessarily using the builders), check out this answer.
I'm not really clear on what component you're currently stuck on though. Is it just the direct translation between the aggregation pipeline and the Java code? Is the aggregation pipeline not giving correct results? You haven't mentioned some information such as driver version that would help us advise further if needed.
A few other general things come to mind that might be worth mentioning:
The sample .aggregate() snippet you provided does not have the square brackets ([ and ]) wrapping the pipeline which would be needed in the shell.
When referencing existing field names, you probably need to prefix them with $ in the Java code similar to how you do in the shell.
You should be able to access the values nested inside of the _id field after the first $group stage using dot notation (eg "$_id.distanceMiles") as you are in the sample aggregation.
Depending on which specific driver you are using, documentation such as this may be helpful with respect to working with the builders.
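To make the dot-notation point concrete, here is a rough sketch of the two $group stages from the question built as nested plain Java maps. It uses only java.util so that it runs without a driver. The class and method names are hypothetical, and wrapping each map in org.bson.Document before passing the list to the driver is an assumption that depends on your driver version:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DistancePipeline {

    /** Wraps a stage specification under its operator name, e.g. "$group". */
    static Map<String, Object> stage(String operator, Map<String, Object> spec) {
        Map<String, Object> stage = new LinkedHashMap<>();
        stage.put(operator, spec);
        return stage;
    }

    /** Builds the de-duplicating $group followed by the summing $group. */
    static List<Map<String, Object>> groupStages() {
        // First $group: one output document per distinct (vehicleId, distanceMiles) pair.
        Map<String, Object> dedupKey = new LinkedHashMap<>();
        dedupKey.put("vehicleId", "$vehicleId");
        dedupKey.put("distanceMiles", "$distanceMiles");
        Map<String, Object> dedup = new LinkedHashMap<>();
        dedup.put("_id", dedupKey);

        // Second $group: the value now lives inside _id, hence "$_id.distanceMiles".
        Map<String, Object> sum = new LinkedHashMap<>();
        sum.put("$sum", "$_id.distanceMiles");
        Map<String, Object> total = new LinkedHashMap<>();
        total.put("_id", null); // a null _id collapses everything into one group
        total.put("distance", sum);

        return List.of(stage("$group", dedup), stage("$group", total));
    }

    public static void main(String[] args) {
        groupStages().forEach(System.out::println);
    }
}
```

The nesting mirrors the shell pipeline one to one, which is usually the easiest way to check a translation before switching to the builders.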

Query Elastic document field with and without characters

I have the following documents stored in my Elasticsearch index (my_index):
{
"name": "111666"
},
{
"name": "111A666"
},
{
"name": "111B666"
}
and I want to be able to query these documents using both the exact value of the name field as well as a character-trimmed version of the value.
Examples
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "111666"
}
}
}
}
should return all of the (3) documents mentioned above.
On the other hand:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "111a666"
}
}
}
}
should return just one document (the one that matches exactly the provided value of the name field).
I didn't find a way to configure the settings of my_index in order to support such functionality (custom search/index analyzers etc..).
I should mention here that I am using ElasticSearch's Java API (QueryBuilders) in order to implement the above-mentioned queries, so I thought of doing it the Java-way.
Logic
1) Check if the provided query-string contains a letter
2) If yes (e.g 111A666), then search for 111A666 using a standard search analyzer
3) If not (e.g 111666), then use a custom search analyzer that trims the characters of the `name` field
Questions
1) Is it possible to implement this by somehow configuring how the data are stored/indexed at Elastic Search?
2) If not, is it possible to conditionally change the analyzer of a field at Runtime? (using Java)
You can easily use any built-in analyzer or any custom analyzer to map your documents in Elasticsearch. More information on analyzers is here
The "term" query searches for an exact match. You can find more information about exact matching here (Finding Exact Values)
But you cannot change the mapping of an existing field once the index has been created. If you want to change it, you have to create a new index and migrate all your data to the new index.
Your question is about different logic for the analyzer at index and query time.
The solution for your Q1 is to generate two tokens at index time (111a666 -> [111a666, 111666]) but only one token at query time (111a666 -> 111a666 and 111666 -> 111666).
In my opinion you would have to write a new token filter similar to
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern_replace-tokenfilter.html that supports "preserve_original" the way https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-capture-tokenfilter.html does.
Or you could use two fields (one with original and one without letters) and search over both.
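A small sketch of the two-field idea on the Java side, assuming the letter-free copy is computed by the application before indexing. The class, the method names, and the field name name_digits are hypothetical and not part of any Elasticsearch API:

```java
class NameFields {

    /** Returns the name with every letter removed, e.g. "111A666" becomes "111666". */
    static String stripLetters(String name) {
        return name.replaceAll("\\p{L}", "");
    }

    /** True if the query string contains at least one letter. */
    static boolean containsLetter(String query) {
        return query.chars().anyMatch(Character::isLetter);
    }

    /**
     * Chooses the field to search, following the logic in the question:
     * queries with a letter go to the original field, letter-free queries
     * go to the stripped copy. "name_digits" is a hypothetical field name.
     */
    static String fieldFor(String query) {
        return containsLetter(query) ? "name" : "name_digits";
    }

    public static void main(String[] args) {
        System.out.println(stripLetters("111A666"));
        System.out.println(fieldFor("111a666"));
        System.out.println(fieldFor("111666"));
    }
}
```

With the stripped value stored next to the original at index time, the query 111666 would match all three documents through the second field, while 111a666 would only match through the original one.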

Logstash + Kibana terms panel without breaking words

I have a Java application that writes to a log file in json format.
The fields that come in the logs are variable.
The logstash reads this logfile and sends it to Kibana.
I've configured the logstash with the following file:
input {
file {
path => ["[log_path]"]
codec => "json"
}
}
filter{
json {
source => "message"
}
date {
match => [ "data", "dd-MM-yyyy HH:mm:ss.SSS" ]
timezone => "America/Sao_Paulo"
}
}
output {
elasticsearch_http {
flush_size => 1
host => "[host]"
index => "application-%{+YYYY.MM.dd}"
}
}
I've managed to show correctly everything in Kibana without any mapping.
But when I try to create a terms panel to show a count of the servers that sent those messages, I have a problem.
I have a field called server in my json that shows the server's name (like: a1-name-server1), but the terms panel splits the server name because of the "-".
I would also like to count the number of times that an error message appears, but the same problem occurs: the terms panel splits the error message because of the spaces.
I'm using Kibana 3 and Logstash 1.4.
I've searched a lot on the web and couldn't find any solution.
I also tried using the .raw from logstash, but it didn't work.
How can I manage this?
Thanks for the help.
Your problem here is that your data is being tokenized. This is helpful for searching your data: ES (by default) splits the contents of a field such as message into separate tokens so that they can be searched individually. For example, you may want to search for the word ERROR in your logs and see results such as "There was an error in your cluster" or "Error processing whatever". If you don't analyze the data for that field with tokenizers, you won't be able to search like this.
This analyzed behaviour is helpful when you want to search, but it doesn't let you group messages that have the same content, which is your use case. The solution is to update your mapping and set not_analyzed on the specific field that you don't want split into tokens. This will probably work for your host field, but will probably break the search.
What I usually do in this kind of situation is use index templates and multi-fields. The index template lets me set a mapping for every index whose name matches a pattern, and the multi-fields let me have both the analyzed and the not_analyzed behaviour on the same field.
The following request would do the job for your problem:
curl -XPUT https://example.org/_template/name_of_index_template -d '
{
"template": "indexname*",
"mappings": {
"type": {
"properties": {
"field_name": {
"type": "multi_field",
"fields": {
"field_name": {
"type": "string",
"index": "analyzed"
},
"untouched": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
And then in your terms panel you can use field_name.untouched to consider the entire content of the field when counting the different elements.
If you don't want to use index templates (maybe your data is in a single index), setting the mapping with the Put Mapping API would do the job too. And if you use multifields, there is no need to reindex the data, because from the moment that you set the new mapping for the index, the new data will be duplicated in these two subfields (field_name and field_name.untouched). If you just change the mapping from analyzed to not_analyzed you won't be able to see any change until you reindex all your data.
Since you didn't define a mapping in Elasticsearch, the default settings apply to every field in your type in your index. The default setting for string fields (like your server field) is to analyze the field, meaning that Elasticsearch will tokenize the field contents. That is why it's splitting your server names into parts.
You can overcome this issue by defining a mapping. You don't have to define all your fields, only the ones that you don't want Elasticsearch to analyze. In your particular case, sending the following PUT request will do the trick:
http://[host]:9200/[index_name]/_mapping/[type]
{
"type" : {
"properties" : {
"server" : {"type" : "string", "index" : "not_analyzed"}
}
}
}
You can't do this on an already existing index because switching from analyzed to not_analyzed is a major change in the mapping.

How to write query with index intersection with Mongo java driver

I googled and read the official MongoDB documentation (http://docs.mongodb.org/manual/core/index-intersection/), but didn't find any tutorial or indication of the syntax for a query that uses index intersection.
Does mongodb apply automatically index intersection when the query involves 2 fields which are separately indexed by a single index? I don't think so.
Here is what cursor.explain() shows when I run a query between 2 dates and a given "name" ("name" is a field; both date and name are indexed):
{
"cursor": "BtreeCursor Name_1",
"isMultiKey": false,
"n": 99330,
"nscannedObjects": 337500,
"nscanned": 337500,
"nscannedObjectsAllPlans": 337601,
"nscannedAllPlans": 337705,
"scanAndOrder": false,
"indexOnly": false,
"nYields": 18451,
"nChunkSkips":
"millis": 15430,
"indexBounds": {
"Name": [
[
"blabla",
"blabla"
]
]
},
"allPlans": [
{
"cursor": "BtreeCursor Name_1",
"isMultiKey": false,
"n": 99330,
"nscannedObjects": 337500,
"nscanned": 337500,
"scanAndOrder": false,
"indexOnly": false,
"nChunkSkips": 0,
"indexBounds": {
"Name": [
[
"blabla",
"blabla"
]
]
}
},
{
"cursor": "BtreeCursor Date_1",
"isMultiKey": false,
"n": 0,
"nscannedObjects": 101,
"nscanned": 102,
"scanAndOrder": false,
"indexOnly": false,
"nChunkSkips": 0,
"indexBounds": {
"Date": [
[
"2014-08-23 10:28:50.221",
"2014-08-23 13:28:50.221"
]
]
}
},
{
"cursor": "Complex Plan",
"n": 0,
"nscannedObjects": 0,
"nscanned": 103,
"nChunkSkips": 0
}
The complex plan shows nothing, and the elapsed time is 16s. If I query only by name without the date, it takes only 0.9s.
I want to learn how to write a query using index intersection with the Mongo Java driver, something like hint() in the mongo shell. Any example or tutorial link is welcome.
I know how to write basic queries with the MongoDB Java driver, so you can just post the essential code example if it saves your time.
Thanks in advance.
After reading these links: http://docs.mongodb.org/manual/core/query-plans/#index-filters
https://jira.mongodb.org/browse/SERVER-3071
I have come to the conclusion that there is currently no way to force a query to use index intersection.
In fact, when several candidate indexes are possible for a query, MongoDB runs them in parallel and waits for one index to "win the match". The winning index is the one that completes the whole query first or returns a threshold number of matching results first. MongoDB then uses this index for the query.
If your queries vary a lot and you cannot build many compound indexes, you are out of luck: you can only trust MongoDB's test.
Sometimes one index is more selective than another, but that doesn't mean it returns the result more quickly. In my case, the "name" index is more selective and may fetch fewer documents, but it requires a date comparison to determine whether each fetched document matches the whole query. The "date" index, on the other hand, fetches more documents from disk but only needs a simple equality test on the "name" field to determine whether a document matches. That is possibly why it can win the test.
About the index intersection, it has never been used in my several query tests. I doubt if it is useful and expect mongodb to improve its performance in future version.
If my conclusion is wrong, please point it out. Still learning about MongoDB :)
Does mongodb apply automatically index intersection when the query
involves 2 fields which are separately indexed by a single index?
has been answered here: MongoDB index intersection
You can't force MongoDB to apply index intersection; rather, you can modify your queries so that the MongoDB query optimizer is able to apply the index intersection strategy to them.
To learn how your query parameters affect the indexing process, see this link, though it is for compound indexes.
http://java.dzone.com/articles/optimizing-mongodb-compound
And the Java API provides two methods to use hint() with the find() operation:
MongoDB Java API
public DBCursor hint(String indexName)
public DBCursor hint(DBObject indexKeys)
Informs the database of indexed fields of the collection in order to
improve performance.
which can be used as below,
DBCursor cursor = collection.find(query).hint(indexName);

MongoDb - Update collection atomically if set does not exist

I have the following document in my collection:
{
"_id":NumberLong(106379),
"_class":"x.y.z.SomeObject",
"name":"Some Name",
"information":{
"hotelId":NumberLong(106379),
"names":[
{
"localeStr":"en_US",
"name":"some Other Name"
}
],
"address":{
"address1":"5405 Google Avenue",
"city":"Mountain View",
"cityIdInCitiesCodes":"123456",
"stateId":"CA",
"countryId":"US",
"zipCode":"12345"
},
"descriptions":[
{
"localeStr":"en_US",
"description": "Some Description"
}
],
},
"providers":[
],
"some other set":{
"a":"bla bla bla",
"b":"bla,bla bla",
}
"another Property":"fdfdfdfdfdf"
}
I need to run through all documents in collection and if "providers": [] is empty I need to create new set based on values of information section.
I'm far from being a MongoDB expert, so I have a few questions:
Can I do it as atomic operation?
Can I do this using MongoDB console? As far as I understand, I can do it using the $addToSet and $each commands?
If not is there any Java based driver that can provide such functionality?
Can I do it as atomic operation?
Every document will be updated atomically. There is no "atomic" in MongoDB in the RDBMS sense, meaning that all operations succeed or fail together, but you can prevent other writes from interleaving by using the $isolated operator.
Can I do this using MongoDB console?
Sure you can. To find all empty providers array you can issue a command like:
db.zz.find({providers : { $size : 0}})
To update all documents where the array is of zero length with a fixed set of strings, you can issue a query such as
db.zz.update({providers : { $size : 0}}, {$addToSet : {providers : "zz"}})
If you want to add a portion to your document based on the document's own data, you can use the notorious $where query (do mind the warnings appearing in that link) or, as you mentioned, query for the empty providers array and use cursor.forEach().
If not is there any Java based driver that can provide such functionality?
Sure, there is a Java driver, as there is for every other major programming language. It can do practically everything described here, and basically everything you can do from the shell. I suggest you get started from the Java Language Center.
There are also several frameworks that facilitate working with MongoDB and bridge the object-document world. I will not give a list here as I'm pretty biased, but I'm sure a quick Google search will do.
db.so.find({ providers: { $size: 0} }).forEach(function(doc) {
doc.providers.push( doc.information.hotelId );
db.so.save(doc);
});
This will push the information.hotelId of the corresponding document into an empty providers array. Replace that with whatever field you would rather insert into the providers array.
