First, let me introduce my use case: I have a collection with documents in which I store my XML requests and the corresponding responses. Each document also has plenty of accompanying properties, some of which are indexed, but request and response aren't.
Whenever I search using an indexed field, the performance is sufficient. But there are situations where I have to search using a regular expression based on the request or response value.
For now I do something like this:
db.traffic.find(
  { $or:
    [ { request: { $regex: "some.* code=\"123\"" } },
      { response: { $regex: "some.* code=\"123\"" } } ] })
and I translated this into Java code. But querying is too slow; it takes a significant amount of time compared to other queries.
I can see two solutions:
indexing requests and responses - but I suppose this is not a good idea, as they are really long and the index would most probably be huge.
querying by some indexed field first and then applying the already mentioned query, in descending order of the found records, picking the very first match. So I would like to do something like:
db.traffic.find({"conversationID": { $regex: "vendorName" }}).sort({"counter": -1})
.findOne(
{ $or:
[ { request: { $regex: "some.* code=\"123\"} },
{ response: { $regex: "some.* code=\"123\"} }] })
So wrapping up, my question is: should I choose the simpler solution, i.e. indexing requests and responses? And what impact would it have on the size of my index?
Or should I choose the second way? But is my code correct, and does it do what I want?
Have you tried an explain on both methods?
Use the mongo shell to test queries, and add explain before the query, so:
db.traffic.explain()...
Try both and you should get some information that indicates the direction.
In the end I tried the second solution, but I had to change it slightly because I can't run findOne on the result of a query. I found the equivalent syntax, though.
So it now looks something like this:
db.traffic.findOne({ $query:
  { $or:
    [ { request: { $regex: "some.* code=\"123\"" } },
      { response: { $regex: "some.* code=\"123\"" } } ] },
  $orderby: { "counter": -1 } })
and the performance is much better now.
Also I used explain to check the real "speed".
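For anyone translating this into Java: a minimal sketch with the current MongoDB Java driver (the asker's driver version is unknown, so take the exact API as an assumption; the collection and field names come from the question):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.regex.Pattern;

MongoClient client = MongoClients.create("mongodb://localhost:27017"); // connection string is an assumption
MongoCollection<Document> traffic = client.getDatabase("test").getCollection("traffic");

// the same regex applied to both fields, as in the shell query above
Pattern pattern = Pattern.compile("some.* code=\"123\"");
Bson filter = Filters.or(
        Filters.regex("request", pattern),
        Filters.regex("response", pattern));

// descending sort on counter plus first() is the $orderby/findOne equivalent
Document newest = traffic.find(filter)
        .sort(Sorts.descending("counter"))
        .first();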
Related
Can somebody guide me on the aggregation query in Java for the following MongoDB query? I am trying to sum up the distance covered every day by the vehicle. There are some duplicate records (which I cannot eliminate), so I have to use a group stage to filter them out.
db.collection1.aggregate(
    { $match: { "vehicleId": "ABCDEFGH",
                $and: [ { "timestamp": { $gt: ISODate("2022-08-24T00:00:00.000+0000") } },
                        { "timestamp": { $lt: ISODate("2022-08-25T00:00:00.000+0000") } },
                        { "distanceMiles": { "$gt": 0 } } ] } },
    { $group: { "_id": { vehicleId: "$vehicleId", "distanceMiles": "$distanceMiles" } } },
    { $group: { _id: null, distance: { $sum: "$_id.distanceMiles" } } })
If possible, can you also suggest some references? I am stuck at the last group stage involving the $_id part.
The Java code that I have, except for the last group stage, is:
Criteria criteria = new Criteria();
criteria.andOperator(Criteria.where("timestamp").gte(start).lte(end),
        Criteria.where("vehicleId").in(vehicleIdList));
Aggregation aggregation = Aggregation.newAggregation(Aggregation.match(criteria),
        Aggregation.sort(Direction.DESC, "timestamp"),
        Aggregation.project("distanceMiles", "vehicleId", "timestamp").and("timestamp")
                .dateAsFormattedString("%Y-%m-%d").as("yearMonthDay"),
        Aggregation.group("vehicleId", "yearMonthDay").first("vehicleId").as("vehicleId")
                .first("timestamp").as("lastReported").sum("distanceMiles").as("distanceMiles"));
Note: there is a slight difference between the raw MongoDB query and the Java query in the date param.
Generally if you are looking for advice on how to directly convert an aggregation pipeline into Java code (not necessarily using the builders), check out this answer.
I'm not really clear on what component you're currently stuck on though. Is it just the direct translation between the aggregation pipeline and the Java code? Is the aggregation pipeline not giving correct results? You haven't mentioned some information such as driver version that would help us advise further if needed.
A few other general things come to mind that might be worth mentioning:
The sample .aggregate() snippet you provided does not have the square brackets ([ and ]) wrapping the pipeline which would be needed in the shell.
When referencing existing field names, you probably need to prefix them with $ in the Java code similar to how you do in the shell.
You should be able to access the values nested inside of the _id field after the first $group stage using dot notation (e.g. "$_id.distanceMiles"), as you are doing in the sample aggregation; see the sketch after this list.
Depending on which specific driver you are using, documentation such as this may be helpful with respect to working with the builders.
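To make the dot-notation point concrete, here is a hedged sketch of how the missing final stage might look with the Spring Data builders (untested; it assumes that Aggregation.group() with no fields renders _id: null, and that "_id.distanceMiles" resolves against the output of the previous group stage):

// first group collapses duplicates: one bucket per (vehicleId, distanceMiles) pair;
// second group sums the deduplicated distances across all buckets (_id: null)
Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(criteria),
        Aggregation.group("vehicleId", "distanceMiles"),
        Aggregation.group().sum("_id.distanceMiles").as("distance"));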
I have an Apache Beam streaming job which reads data from Kafka and writes to Elasticsearch using ElasticsearchIO.
The issue I'm having is that messages in Kafka already have a key field, and using ElasticsearchIO.Write.withIdFn() I'm mapping this field to the document _id field in Elasticsearch.
Given the big volume of data, I don't want the key field to also be written to Elasticsearch as part of _source.
Is there an option/workaround that would allow doing that?
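For context, the kind of write being described looks roughly like this (a minimal, self-contained sketch assuming Beam's ElasticsearchIO; the cluster address, index/type names and the key field name are assumptions):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class WriteWithIdFn {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // stand-in for the Kafka source: one JSON document per element
        p.apply(Create.of("{\"key\":\"k1\",\"payload\":\"v1\"}"))
         .apply(ElasticsearchIO.write()
                 .withConnectionConfiguration(ElasticsearchIO.ConnectionConfiguration.create(
                         new String[] {"http://localhost:9200"}, "my-index", "_doc"))
                 // map the existing "key" field to the document _id; the field
                 // still ends up in _source as well, which is the problem here
                 .withIdFn(node -> node.get("key").asText()));
        p.run().waitUntilFinish();
    }
}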
Using the Ingest API and the remove processor, you'll be able to solve this pretty easily using only your Elasticsearch cluster. You can also simulate the ingest pipeline and inspect the results.
I've prepared an example which will probably cover your case:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "remove id form incoming docs",
"processors": [
{"remove": {
"field": "id",
"ignore_failure": true
}}
]
},
"docs": [
{"_source":{"id":"123546", "other_field":"other value"}}
]
}
As you can see, there is one test document containing a field "id". This field is no longer present in the response/result:
{
"docs" : [
{
"doc" : {
"_index" : "_index",
"_type" : "_type",
"_id" : "_id",
"_source" : {
"other_field" : "other value"
},
"_ingest" : {
"timestamp" : "2018-12-03T16:33:33.885909Z"
}
}
}
]
}
I've created a ticket in the Apache Beam JIRA describing this issue.
For now, the original issue cannot be resolved as part of the indexation process using the Apache Beam API.
The workaround that Etienne Chauchot, one of the maintainers, proposed is to have a separate task which will clear the indexed data afterwards.
See Remove a field from a Elasticsearch document for example.
If someone would like to leverage such a feature in the future, you might want to follow the linked ticket.
I wanted to understand the right approach to follow when designing an HTTP GET response as per REST. I have the following requirement:
class Employee {
private long employeeID;
private String name;
private Date dob;
private String address;
private String department;
}
Modeling as per REST, HTTP GET /employees would return the array of all employees. Similarly, HTTP GET /employees/1 will return the employee with ID 1.
Now there is one UI-driven workflow where I need to display only the name and employeeID of each employee. Hence the existing response from HTTP GET /employees is heavyweight (the other fields are transferred unnecessarily). Therefore I want to limit the response to contain only the name and employeeID of each employee.
I am evaluating the following options
Approach 1 :
Use the Content-Type HTTP header to indicate that the client making the HTTP GET /employees request needs a trimmed list of attributes in the response, i.e. have some custom media type (viz. application/summary+json) that results in only the 2 attributes being included in the response.
Approach 2 :
Use an additional query parameter, as in HTTP GET /employees?isSummary=true. In this case, on the server side, depending on the value of the isSummary parameter, I can return only the 2 attributes for each employee.
Approach 3 :
Create a new REST endpoint that supports the trimmed-down response, i.e. HTTP GET /employees/summaryDetails.
In this case only the 2 attributes will be returned by the above endpoint.
Of these 3 approaches, which follows REST most closely?
Thanks
I think something in the realm of Approach #2 is the way to go here. Fundamentally, you are still accessing the same search and result set in terms of the resource (Employee), and it is the same resource, so Approach #3 isn't really suitable.
That said, there are various ways to go about #2. One approach is to have a query string parameter representing the projection - kind of like a SQL projection. So something like:
GET /employees?fields=ID,name
I've worked with a few APIs that work this way and they work quite well.
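As a rough, framework-agnostic illustration of the idea (a sketch, not from the answer above; the Employee getters are assumptions, since the class in the question only shows private fields):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class EmployeeProjections {
    // build the full representation once, then keep only the requested fields
    static Map<String, Object> project(Employee e, Set<String> fields) {
        Map<String, Object> view = new LinkedHashMap<>();
        view.put("employeeID", e.getEmployeeID());
        view.put("name", e.getName());
        view.put("dob", e.getDob());
        view.put("address", e.getAddress());
        view.put("department", e.getDepartment());
        if (fields != null && !fields.isEmpty()) {
            view.keySet().retainAll(fields); // e.g. ?fields=employeeID,name
        }
        return view;
    }
}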
This answer is added as a follow-up to the discussion in comments below this answer.
Basically, yes, I'm against HATEOAS. Why? The idea in general is good, but:
IMO it tends to remove the reasonable limits on the number of endpoints. It seems that the answer of some developers to Will it be RESTful...? or How to do it in REST...? is very often: add a new endpoint /data/{id}/add/ and it will be documented via the _links meta field. This is not how it should be done. This way you can always add a new endpoint and the appropriate link, and you end up with a huge number of endpoints that no one can understand or verify. E.g. this link returns the basic set of data:
"http://foo.bar/employees/1"
this returns further details:
"http://foo.bar/employees/1/details"
What if I need another subset of the details? Will I add a new endpoint? And what if there are multiple different clients that need mutually exclusive subsets of data? Will each of these clients have a dedicated endpoint? This is a nightmare!
Links are not only about URLs but about query params as well. Are they included in the links? In what form? Templates? Then I can't follow the link as-is. Does every param have a default value? I guess it is simply not possible to provide a default value for every query param.
The mentioned discoverability and documentation: trust me, for most APIs you design, develop and deploy, you'll need to write docs. Why? Because the company which ordered the API needs them. Does stripe.com follow HATEOAS rules? No! So why is it so successful? Because it's extremely well documented and has libraries, along with examples, in the most popular languages and tools.
Find time to view this talk, it's worth it.
And now, after this short note about HATEOAS, on to the approaches.
Approach #1
Headers should be used when the resource representation itself changes (new fields are added or removed), not when you need particular fields or a subset of a resource. So IMO this is not the way to go.
Approach #3
It's a totally bad idea for the reasons mentioned in the intro to this answer.
Approach #2
IMHO this is the way to go. As @leeor answered, this is a popular, accepted and flexible pattern.
You can extend it by adding a query param called e.g. view, which is an enum (SIMPLE, EXTENDED, FULL) and represents a list of predefined views; a sketch follows below. This is the way to avoid adding new endpoints: instead you add and document(!) new views. It's up to you whether view and fields are mutually exclusive, and in which order they are processed.
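A sketch of what such predefined views could look like (the field sets chosen for each view are illustrative assumptions, reusing the projection idea from the earlier answer):

import java.util.Set;

enum View {
    SIMPLE(Set.of("employeeID", "name")),
    EXTENDED(Set.of("employeeID", "name", "department")),
    FULL(Set.of("employeeID", "name", "dob", "address", "department"));

    final Set<String> fields; // the fields serialized for this view

    View(Set<String> fields) { this.fields = fields; }
}

// e.g. GET /employees?view=SIMPLE  ->  View.valueOf("SIMPLE").fields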
The problem with each of the listed approaches is that it complicates the API and violates probably the most fundamental principle of REST, which is discoverability. The response gives no clue that any of the above API exists; you would have to read documentation (horror!). The fundamental rule of REST is HATEOAS: Hypertext As The Engine Of Application State.
So, if you want an API that is maximally RESTful, consider the following, which follows the standard known as HAL:
This:
HTTP GET /employees
yields:
[ {
"employeeID": 1,
"name": "Joe",
"_links": [ {
"rel": "self",
"href": "http://foo.bar/employees/1"
}, {
"rel": "details",
"href": "http://foo.bar/employees/1/details"
} ]
}, {
"employeeID": 2,
"name": "Sam",
"_links": [ {
"rel": "self",
"href": "http://foo.bar/employees/2"
}, {
"rel": "details",
"href": "http://foo.bar/employees/2/details"
} ]
} ]
Following a link:
HTTP GET /employees/1
yields:
{
"employeeID": 1,
"name": "Joe",
"_links": [ {
"rel": "self",
"href": "http://foo.bar/employees/1"
}, {
"rel": "details",
"href": "http://foo.bar/employees/1/details"
} ]
}
And following another link:
HTTP GET /employees/1/details
yields:
{
"employeeID": 1,
"name": "Joe",
"dob": "1985-04-23",
"address": "123 Main St",
"department": "Department of Redundant Links",
"_links": [ {
"rel": "self",
"href": "http://foo.bar/employees/1"
}, {
"rel": "details",
"href": "http://foo.bar/employees/1/details"
} ]
}
For inspiration, check out the JIRA REST API, probably the best one I've seen.
I am using Elasticsearch for filtering and searching from a JSON file, and I am a newbie with this technology, so I am a little bit confused about how to write a LIKE query in Elasticsearch.
select * from table_name where field_name like 'a%'
This is a MySQL query. How do I write this query in Elasticsearch? I am using Elasticsearch version 0.90.7.
I would highly suggest updating your Elasticsearch version if possible; there have been significant changes since 0.9.x.
This question is not quite specific enough, as there are many ways Elasticsearch can fulfill this functionality, and they differ slightly depending on your overall goal. If you are looking to replicate that SQL query exactly, then in this case use the wildcard query or the prefix query.
Using a wildcard query:
Note: Be careful with wildcard searches, they are slow. Avoid using wildcards at the beginning of your strings.
GET /my_index/table_name/_search
{
"query": {
"wildcard": {
"field_name": "a*"
}
}
}
Or Prefix query
GET /my_index/table_name/_search
{
"query": {
"prefix": {
"field_name": "a"
}
}
}
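For the Java API, the equivalents of the two queries above would look roughly like this (a sketch against org.elasticsearch.index.query.QueryBuilders; exact availability depends on your client version):

import org.elasticsearch.index.query.QueryBuilders;

// LIKE 'a%' as a wildcard query (avoid leading wildcards, they are slow)
QueryBuilders.wildcardQuery("field_name", "a*");

// LIKE 'a%' as a prefix query
QueryBuilders.prefixQuery("field_name", "a");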
Or partial matching:
Note: Do NOT blindly use partial matching; while there are corner cases for its use, correct use of analyzers is almost always better.
Also, this exact query will be equivalent to LIKE '%a%', which again could be better set up with correct use of mappings and a normal query search!
GET /my_index/table_name/_search
{
"query": {
"match_phrase": {
"field_name": "a"
}
}
}
If you are reading this wondering about querying ES similarly for search-as-you-type, I would suggest reading up on edge n-grams, which relate to proper use of mappings depending on what you are attempting to do =)
GET /indexName/table_name/_search
{
"query": {
"match_phrase": {
"field_name": "your partial text"
}
}
}
You can use "type" : "phrase_prefix" to prefix or post fix you search
Java code for the same:
AndFilterBuilder andFilterBuilder = FilterBuilders.andFilter();
andFilterBuilder.add(FilterBuilders.queryFilter(
        QueryBuilders.matchPhraseQuery("field_name", "your partial text")));
I gave an 'and filter' example so that you can append extra filters if you want to.
Check this for more detail:
https://www.elastic.co/guide/en/elasticsearch/guide/current/slop.html
Below is the query I wrote; it is equivalent to something like:
SELECT * FROM TABLE WHERE api='payment' AND api_v='v1' AND status='200' AND response LIKE '%expired%' AND response LIKE '%token%'
Please note: a table corresponds to an index here.
Both GET and POST are accepted.
GET /transactions-d-2021.06.24/_search
{
"query":{
"bool":{
"must":[
{
"match":{
"api":"payment"
}
},
{
"match":{
"api_v":"v1"
}
},
{
"match":{
"status":"200"
}
},
{
"wildcard":{
"response":"*expired*"
}
},
{
"wildcard":{
"response":"*token*"
}
}
]
}
}
}
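If you need the same query from Java, a rough equivalent with QueryBuilders might look like this (a sketch; the field names are taken from the JSON above):

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

BoolQueryBuilder query = QueryBuilders.boolQuery()
        .must(QueryBuilders.matchQuery("api", "payment"))
        .must(QueryBuilders.matchQuery("api_v", "v1"))
        .must(QueryBuilders.matchQuery("status", "200"))
        // the two LIKE '%...%' conditions become wildcard queries
        .must(QueryBuilders.wildcardQuery("response", "*expired*"))
        .must(QueryBuilders.wildcardQuery("response", "*token*"));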
Writing a custom bool query worked for me
#Query("{\"bool\":{\"should\":[{\"query_string\":{\"fields\":[\"field_name\"],\"query\":\"?0*\"}}]}}")
I am trying to query alphanumeric values from the index using a terms query, but it is not giving me the output.
Query:
{
"size" : 10000,
"query" : {
"bool" : {
"must" : {
"terms" : {
"caid" : [ "A100945","A100896" ]
}
}
}
},
"fields" : [ "acco", "bOS", "aid", "TTl", "caid" ]
}
I want to get all the entries that have caid A100945 or A100896.
The same query works fine for numeric fields.
I am not planning to use query_string/match queries, as I am trying to build a general query builder that can build the query for every request. Hence I am looking to get the entries using the terms query only.
Note: I am using the Java API org.elasticsearch.index.query.QueryBuilders for building the query,
e.g. QueryBuilders.termsQuery("caid", "A10xxx", "A101xxx")
Please help.
Regards,
Mik
If you have not customized the mappings/analysis for the caid field, then your values are indexed as e.g. a100945, a100896 (note the lowercasing).
The terms query does not do query-time text analysis, so you'll be searching for A100945, which does not match a100945.
This is quite a common problem, and is explained a bit more in this article on Troubleshooting Elasticsearch searches, for Beginners.
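In other words, if you want to stay with the terms query, lowercasing the values yourself should make them match the indexed form (a sketch, assuming the default standard analyzer was used at index time):

import org.elasticsearch.index.query.QueryBuilders;

// terms are not analyzed at query time, so supply the lowercased indexed form
QueryBuilders.termsQuery("caid", "a100945", "a100896");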
You'd better use a match query: match queries are analyzed (the default analyzer is applied to the query text), like:
QueryBuilders.matchQuery("caid", "A10xxx A101xxx");