Limiting the nested fields in Elasticsearch - java

I am trying to index JSON documents in Elasticsearch with dynamic mapping turned on. Some of the documents have an unpredictable number of keys (nested levels), because of which I started getting this error from the ES Java API.
[ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Limit of total fields [1000] in index [my_index] has been exceeded]]]failure in bulk execution
I was wondering if there is an option that can be configured at the index level where I can tell Elasticsearch to map fields only up to a certain level (maybe 2) and store the rest of the document as a string or in flattened form. I did come across settings like index.mapping.depth.limit, but it seems that if I set it to 2, this setting rejects any document that has more levels.

For the total fields limit:
PUT <index_name>/_settings
{
  "index.mapping.total_fields.limit": 2000
}
For the depth limit:
PUT <index_name>/_settings
{
  "index.mapping.depth.limit": 2
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping.html
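If you prefer to apply these settings from Java (the question mentions the ES Java API), a minimal sketch assuming the high-level REST client could look like the following; the index name and client setup are placeholders:
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class IndexSettingsUpdater {

    // Sketch: raises the total fields limit on an existing index; "my_index" is a placeholder.
    public static void raiseFieldLimit(RestHighLevelClient client) throws Exception {
        UpdateSettingsRequest request = new UpdateSettingsRequest("my_index");
        request.settings(Settings.builder()
                .put("index.mapping.total_fields.limit", 2000));
        client.indices().putSettings(request, RequestOptions.DEFAULT);
    }
}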

Add this to your index _settings:
"settings": {
"index.mapping.nested_fields.limit": 150,
...
}
"mappings": {
...
}

Related

optimize mongo query to get max date in a very short time

I'm using the query below to get the max date (a field named extractionDate) in a collection called KPI, and I'm only interested in the field extractionDate:
@Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation, ProjectionOperation projectionOperation) {
    return Mono.from(mongoTemplate.aggregate(
            newAggregation(
                    matchOperation,
                    projectionOperation,
                    group().max(EXTRACTION_DATE).as("result"),
                    project().andExclude("_id")
            ),
            "kpi",
            DBObject.class
    ));
}
As you see above, I first filter the result using the match operation (matchOperation); after that, I do a projection operation to extract only the max of the field "extractionDate" and rename it as result.
But this query costs a lot of time (sometimes more than 20 seconds) because I have a huge amount of data. I already added an index on the field extractionDate but did not gain much, so I'm looking for a way to make it as fast as possible.
Update:
Number of documents we have in the collection kpi: 42.8M documents
The query being executed:
Streaming aggregation: [{ "$match" : { "type" : { "$in" : ["INACTIVE_SITE", "DEVICE_NOT_BILLED", "NOT_REPLYING_POLLING", "MISSING_KEY_TECH_INFO", "MISSING_SITE", "ACTIVE_CIRCUITS_INACTIVE_RESOURCES", "INCONSISTENT_STATUS_VALUES"]}}}, { "$project" : { "extractionDate" : 1, "_id" : 0}}, { "$group" : { "_id" : null, "result" : { "$max" : "$extractionDate"}}}, { "$project" : { "_id" : 0}}] in collection kpi
explain plan:
Example of a document in the collection KPI:
And finally, the indexes that already exist on this collection:
Index tuning will depend more on the properties in the $match expression. You should be able to run the query in mongosh and get an explain plan to determine whether your query is scanning the collection.
Other things to consider are the size of the collection versus the working set of the server.
Perhaps update your question with the $match expression, the explain plan, and the current set of index definitions, and we can refine the indexing strategy.
Finally, "huge" is rather subjective? Are you querying millions or billions or documents, and what is the average document size?
Update:
Given that you're filtering on only one field, and aggregating on one field, you'll find the best result will be an index
{ "type":1,"extractionDate":1}
That index should cover your query. The $in will mean that an index scan is selected, but a scan over a small index is significantly better than one over the whole collection of documents.
NB. The existing index extractionDate_1_customer.irType_1 will not be any help for this query.
I was able to optimize the request thanks to previous answers using this approach:
@Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation, ProjectionOperation projectionOperation) {
    return Mono.from(mongoTemplate.aggregate(
            newAggregation(
                    matchOperation,
                    sort(Sort.Direction.DESC, EXTRACTION_DATE),
                    limit(1),
                    projectionOperation
            ),
            "kpi",
            DBObject.class
    ));
}
Also, I had to create a compound index on extractionDate and type (the field I had in matchOperation), like below:
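The screenshot of the index definition isn't included above; as a hedged sketch (assuming Spring Data's blocking MongoTemplate, with field and collection names taken from the question and everything else illustrative), creating such a compound index could look like:
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.index.Index;

public class KpiIndexCreator {

    // Sketch: compound index on type (the $match field) and extractionDate (the aggregated field).
    public static void createCompoundIndex(MongoTemplate mongoTemplate) {
        Index compoundIndex = new Index()
                .on("type", Sort.Direction.ASC)
                .on("extractionDate", Sort.Direction.DESC);
        mongoTemplate.indexOps("kpi").ensureIndex(compoundIndex);
    }
}
A reactive MongoTemplate exposes an equivalent indexOps that returns a Mono you would need to subscribe to.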

Specifying keyword type on String field

I started using hibernate-search-elasticsearch (5.8.2) because it seemed easy to integrate: it keeps the Elasticsearch indices up to date without writing any code. It's a cool lib, but I'm starting to think that it implements only a small subset of the Elasticsearch functionality. I'm executing a query with a painless script filter which needs to access a String field whose type is 'text' in the index mapping, and this is not possible without enabling fielddata. But I'm not very keen on enabling it, as it consumes a lot of heap memory. Here's what the Elasticsearch team suggests for my case:
Fielddata documentation
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
Unfortunately I can't find a way to do this with the hibernate-search annotations. Can someone tell me if this is possible, or whether I have to migrate to the vanilla Elasticsearch lib and stop using any wrappers?
With the current version of Hibernate Search, you need to create a different field for that (i.e. you can't have different flavors of the same field). Note that that's what Elasticsearch is doing under the hood anyway.
@Field(analyzer = "your-text-analyzer") // your default full text search field with the default name
@Field(name = "myPropertyAggregation", index = Index.NO, normalizer = "keyword")
@SortableField(forField = "myPropertyAggregation")
private String myProperty;
It should create an unanalyzed field with doc values. You then need to refer to the myPropertyAggregation field for your aggregations.
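For the script-filter use case from the question, once that extra keyword field exists you would reference it through doc values. A hedged sketch using the raw Elasticsearch QueryBuilders API (not the Hibernate Search DSL; the compared value is illustrative):
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.script.Script;

public class ScriptFilterSketch {

    // Sketch: a painless script filter that reads the unanalyzed field via doc values,
    // avoiding fielddata on the analyzed text field. The compared value is illustrative.
    public static QueryBuilder keywordScriptFilter() {
        return QueryBuilders.scriptQuery(
                new Script("doc['myPropertyAggregation'].value == 'some-exact-value'"));
    }
}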
Note that we will expose many more Elasticsearch features in the API in the future Search 6. In Search 5, the APIs were designed with Lucene in mind and we couldn't break them.

Check for substring efficiently for large data sets

I have:
a database table with 400 000 000 rows (Cassandra 3)
a list of circa 10 000 keywords
both data sets are expected to grow in time
I need to:
check if a specified column contains a keyword
sum how many rows contained the keyword in the column
Which approach should I choose?
Approach 1 (Secondary index):
Create secondary SASI index on the table
Find matches for given keyword "on fly" anytime
However, I am afraid of
capacity problems - secondary indices can consume extra space, and for such a large table it could be too much
performance - I am not sure whether finding a keyword among hundreds of millions of rows can be achieved in a reasonable time
Approach 2 (Java job - brute force):
Java job that continuously iterates over data
Matches are saved into cache
Cache is updated during the next iteration
// Paginate through data...
String page = null;
do {
    PagingState state = page == null ? null : PagingState.fromString(page);
    PagedResult<DataRow> res = getDataPaged(query, status, PAGE_SIZE, state);
    // Iterate through the current page...
    for (DataRow row : res.getResult()) {
        // Skip empty titles
        if (row.getTitle().length() == 0) {
            continue;
        }
        // Find a match in the title
        for (String k : keywords) {
            if (k.length() > row.getTitle().length()) {
                continue;
            }
            if (row.getTitle().toLowerCase().contains(k.toLowerCase())) {
                // TODO: save match
                break;
            }
        }
    }
    status = res.getResult();
    page = res.getPage();
    // TODO: wait here to reduce DB load
} while (page != null);
Problems
It could be very slow to iterate through the whole table. If I waited one second per every 1,000 rows, this cycle would take about 4.6 days to finish.
This would require extra space for the cache; moreover, frequent deletions from the cache would produce tombstones in Cassandra.
A better way would be to use a search engine like Solr or Elasticsearch. Full-text search is their speciality. You could easily dump your data from Cassandra to Elasticsearch and implement your Java job on top of Elasticsearch.
EDIT:
With Cassandra you can request your query result as JSON, and Elasticsearch 'speaks' only JSON, so you will be able to transfer your data very easily.
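As a hedged sketch of what the Java job could look like on top of Elasticsearch (assuming the high-level REST client and that the titles were indexed into a "titles" index with an analyzed "title" field; all names are illustrative), counting matches for a keyword becomes a single count request:
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.CountRequest;
import org.elasticsearch.client.core.CountResponse;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class KeywordCounter {

    // Sketch: counts documents whose analyzed "title" field matches the keyword.
    public static long countKeyword(RestHighLevelClient client, String keyword) throws Exception {
        CountRequest request = new CountRequest("titles"); // assumed index name
        request.source(new SearchSourceBuilder()
                .query(QueryBuilders.matchQuery("title", keyword)));
        CountResponse response = client.count(request, RequestOptions.DEFAULT);
        return response.getCount();
    }
}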
Elasticsearch
Solr

HBase: how to choose pre-split strategies and how they affect your rowkeys

I am trying to pre-split an HBase table. One of the HBaseAdmin Java APIs creates an HBase table as a function of a start key, an end key, and the number of regions. Here's the Java API that I use from HBaseAdmin: void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
Is there any recommendation on choosing startkey and endkey based on dataset?
My approach is: let's say we have 100 records in the dataset. I want the data divided into approximately 10 regions so each will have roughly 10 records. To find the start key I will run scan '/mytable', {LIMIT => 10} and pick the last rowkey as my start key, and then scan '/mytable', {LIMIT => 90} and pick the last rowkey as my end key.
Does this approach to finding the start key and end key look OK, or is there a better practice?
EDIT
I tried the following approaches to pre-split an empty table. All three didn't work the way I used them. I think I will need to salt the key to get an equal distribution.
PS: I am displaying only some of the region info.
1)
byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);
This gives regions with boundaries like:
{
"startkey":"-INFINITY",
"endkey":"11111111",
"numberofrows":3628951,
},
{
"startkey":"11111111",
"endkey":"22222222",
},
{
"startkey":"22222222",
"endkey":"33333333",
},
{
"startkey":"33333333",
"endkey":"44444444",
},
{
"startkey":"88888888",
"endkey":"99999999",
},
{
"startkey":"99999999",
"endkey":"aaaaaaaa",
},
{
"startkey":"aaaaaaaa",
"endkey":"bbbbbbbb",
},
{
"startkey":"eeeeeeee",
"endkey":"INFINITY",
}
This is useless, as my rowkeys are of a composite form like 'deptId|month|roleId|regionId' and don't fit into the above boundaries.
2)
byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);
This has same issue:
{
"startkey":"-INFINITY",
"endkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99",
}
{
"startkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\
"endkey":"33333332",
}
{
"startkey":"33333332",
"endkey":"L\\xCC\\xCC\\xCC\\xCC\\xCC\\xCC\\xCB",
}
{
"startkey":"\\xE6ffffffa",
"endkey":"INFINITY",
}
3) I tried supplying a start key and an end key and got the following useless regions.
hBaseAdmin.createTable(tabledescriptor, Bytes.toBytes("04120|200808|805|1999"),
Bytes.toBytes("01253|201501|805|1999"), 10);
{
"startkey":"-INFINITY",
"endkey":"04120|200808|805|1999",
}
{
"startkey":"04120|200808|805|1999",
"endkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
}
{
"startkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
"endkey":"000ptq<200wp6\\xBC805|1999",
}
{
"startkey":"001\\x11\\x15\\x13\\x1C201\\x15\\x902\\x5C805|1999",
"endkey":"01253|201501|805|1999",
}
{
"startkey":"01253|201501|805|1999",
"endkey":"INFINITY",
}
First question: From my experience with HBase, I am not aware of any hard rule for choosing the number of regions with a start key and end key.
But the underlying thing is: with your rowkey design, data should be distributed across the regions and not hotspotted (see 36.1. Hotspotting).
However, if you define a fixed number of regions, as you mentioned 10, there may not be 10 after a heavy data load. If a region reaches a certain limit, it will split again and the number of regions will grow.
For your way of creating the table, the HBaseAdmin documentation says: "Creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has a null start key and the last region has a null end key)."
Moreover, I prefer creating a table through a script with pre-splits, say 0-10, and designing the rowkey such that it is salted and spread across the region servers to avoid hotspotting, like the sketch below.
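A minimal, hypothetical salting sketch (assuming 10 pre-split regions keyed by a single-digit prefix; the key format is taken from the question, everything else is illustrative):
public class RowKeySalter {

    private static final int SALT_BUCKETS = 10; // matches the number of pre-split regions

    // Sketch: prefix the natural key with a salt bucket so writes spread across regions.
    public static String saltedKey(String naturalKey) {
        int bucket = Math.abs(naturalKey.hashCode() % SALT_BUCKETS);
        return bucket + "|" + naturalKey; // e.g. "3|deptId|month|roleId|regionId"
    }
}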
EDIT: If you want to implement your own region split, you can provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override
public byte[][] split(int numberOfSplits)
Second question:
My understanding:
You want to find the start rowkey and end rowkey for the inserted data in a specific table... below are the ways.
If you want to find the start and end rowkeys, scan the '.meta' table to see what your start and end rowkeys are.
You can access the UI at http://hbasemaster:60010, where you can see how the rowkeys are spread across each region; the start and end rowkeys for each region will be shown there.
To know how your keys are organized after pre-splitting your table and inserting into HBase, use FirstKeyOnlyFilter,
for example : scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'
which displays all your 100 rowkeys.
If you have huge data (not 100 rows as you mentioned) and want to take a dump of all rowkeys, you can use the following from outside the shell:
echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt

Logstash + Kibana terms panel without breaking words

I have a Java application that writes to a log file in json format.
The fields that come in the logs are variable.
Logstash reads this log file and sends it to Kibana.
I've configured Logstash with the following file:
input {
  file {
    path => ["[log_path]"]
    codec => "json"
  }
}
filter {
  json {
    source => "message"
  }
  date {
    match => [ "data", "dd-MM-yyyy HH:mm:ss.SSS" ]
    timezone => "America/Sao_Paulo"
  }
}
output {
  elasticsearch_http {
    flush_size => 1
    host => "[host]"
    index => "application-%{+YYYY.MM.dd}"
  }
}
I've managed to show everything correctly in Kibana without any mapping.
But when I try to create a terms panel to show a count of the servers who sent those messages I have a problem.
I have a field called server in my JSON that shows the server's name (like a1-name-server1), but the terms panel splits the server name because of the "-".
I would also like to count the number of times an error message appears, but the same problem occurs: the terms panel splits the error message because of the spaces.
I'm using Kibana 3 and Logstash 1.4.
I've searched a lot on the web and couldn't find any solution.
I also tried using the .raw from logstash, but it didn't work.
How can I manage this?
Thanks for the help.
Your problem here is that your data is being tokenized, which is helpful for searching over your data. ES (by default) will split your message field into different parts to be able to search them. For example, you may want to search for the word ERROR in your logs, so you probably would like to see messages like "There was an error in your cluster" or "Error processing whatever" in the results. If you don't analyze the data for that field with tokenizers, you won't be able to search like this.
This analyzed behaviour is helpful when you want to search things, but it doesn't allow you to group different messages that have the same content, which is your use case. The solution is to update your mapping, setting not_analyzed on the specific field that you don't want split into tokens. This would probably work for your host field, but it would also break full-text search on it.
What I usually do for these kinds of situations is use index templates and multi-fields. The index template lets me set a mapping for every index that matches a pattern, and the multi-fields let me have both the analyzed and not_analyzed behaviour on the same field.
Using the following query would do the job for your problem:
curl -XPUT https://example.org/_template/name_of_index_template -d '
{
  "template": "indexname*",
  "mappings": {
    "type": {
      "properties": {
        "field_name": {
          "type": "multi_field",
          "fields": {
            "field_name": {
              "type": "string",
              "index": "analyzed"
            },
            "untouched": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}'
And then in your terms panel you can use field_name.untouched to consider the entire content of the field when you calculate the count of the different elements.
If you don't want to use index templates (maybe your data is in a single index), setting the mapping with the Put Mapping API would do the job too. And if you use multifields, there is no need to reindex the data, because from the moment that you set the new mapping for the index, the new data will be duplicated in these two subfields (field_name and field_name.untouched). If you just change the mapping from analyzed to not_analyzed you won't be able to see any change until you reindex all your data.
Since you didn't define a mapping in Elasticsearch, the default settings apply to every field in your type in your index. The default setting for string fields (like your server field) is to analyze the field, meaning that Elasticsearch will tokenize the field contents. That is why it's splitting your server names into parts.
You can overcome this issue by defining a mapping. You don't have to define all your fields, only the ones that you don't want Elasticsearch to analyze. In your particular case, sending the following PUT command will do the trick:
http://[host]:9200/[index_name]/_mapping/[type]
{
  "type" : {
    "properties" : {
      "server" : {"type" : "string", "index" : "not_analyzed"}
    }
  }
}
You can't do this on an already existing index because switching from analyzed to not_analyzed is a major change in the mapping.
