I am trying to index a data set of 5 billion rows, or even more, using Lucene.
Does the time of indexing increase exponentially as the record set increases?
My initial indexing of 10 million records happened very quickly, but when I tried to index more than 100 million records, it took more time than I expected, with respect to the 10 million record indexing time.
Is the time increasing exponentially because each new document is being indexed against more existing documents? If not, what could be the reason behind this behavior, and is there any way to optimize it? (Please note that currently all fields in all the documents are of type StringField. Will changing them to IntField help in this direction?)
My second question would be how will the search performance be in case of indexing 5 billion records. Any ideas on that?
Let me know if you need more information from my end on this.
Our current use case seems somewhat similar to yours: 1.6 billion rows, most fields are exact matches, periodic addition of files/rows, regular searching. Our initial indexing is not distributed or parallelized in any way, currently, and takes around 9 hours. I only offer that number to give you a very vague sense of what your indexing experience may be.
To try and answer your questions:
Our indexing time does not grow exponentially with the number of rows already indexed, though it does slow down very gradually. For us, perhaps 20% slower by the end, though it could also be specific to our data.
If you are experiencing significant slow-down, I support femtoRgon's suggestion that you profile to see what's eating the time. Lucene has never been the slowest/weakest component in our system.
Yes, you can write to your index in parallel, and you may see improved throughput. Whether it helps or not depends on where your bottlenecks are, of course. Consider using Solr - it may ease your efforts here.
We use a mixture of StringField, LongField, and TextField. It seems unlikely that the type of field is causing your slowdown on its own.
These answers are all anecdotal, but perhaps they'll be of some use to you.
This page is now quite dated, but if you exhaust all your other options, it may provide hints about which levers you can pull to tweak performance: How to make indexing faster
Have you profiled to see what is actually causing your performance issues? You could find something unexpected is eating up all that time. When I profiled a similar performance issue that I thought was caused by Lucene, it turned out the problem was mostly string concatenations.
As to whether you should use StringField or IntField (or TextField, or whatever), you should determine that based on what is in the field and how you are going to search it. If you might want to search the field as a range of numeric values, it should be an IntField, not a StringField. By the way, StringField indexes the entire value as a single term, and skips analysis, so it is also the wrong field for full text, for which you should use a TextField. Basically, using a StringField for everything looks very much like a code smell to me, and could cause performance issues at index time, but I would definitely expect the much larger problems to appear when you start trying to search.
As far as "how will the search performance be with 5 billion values", that's far too vague a question to even attempt to answer. No idea. Try it and see.
I am sending large arrays (more than 100 elements) to my Java backend every second.
Spring (Jackson) is converting this array and mapping it to a local String[].
I can map this to a String value for better performance.
Is deserializing in such scenarios a major time consumption activity or not a big deal? If not, when does this become a big deal?
So, basically I am trying to understand the difference between mapping to String vs String[] for big values like an array of 100 elements.
FWIW, internally it is using Jackson parser. And I have to scale this to support concurrent users sending such serialized array data.
Is deserializing in such scenarios a major time consumption activity or not a big deal?
That's a vague question, so it deserves a vague answer - it depends.
In some cases it could be insignificant; in others it could be a major bottleneck. It depends on your application and how often you perform this serialization/deserialization.
It's a bit like asking whether 1$ is a lot. Paying 1$ for something important once a year when you're already spending 1,000,000$ on other things is insignificant.
Paying 1$ every second when the rest of your expenses are 1$ per day is probably a lot.
If you want to improve the performance of your application, you should start by measuring. See what takes too much time and/or resources, and optimize that.
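As a rough illustration of "start by measuring", here is a minimal timing sketch. It uses Python's `json` module rather than Jackson, and the payloads and the helper name `seconds_per_parse` are made up for the example, so the absolute numbers will not carry over to a Spring application; only the method does.

```python
import json
import timeit

# Hypothetical payloads: a 100-element array, and the same data sent as one string.
elements = [f"item-{i}" for i in range(100)]
as_array = json.dumps(elements)               # '["item-0", "item-1", ...]'
as_string = json.dumps(",".join(elements))    # '"item-0,item-1,..."'


def seconds_per_parse(payload: str, repeats: int = 2000) -> float:
    """Average wall-clock seconds for one json.loads call on `payload`."""
    total = timeit.timeit(lambda: json.loads(payload), number=repeats)
    return total / repeats


if __name__ == "__main__":
    print(f"array : {seconds_per_parse(as_array) * 1e6:.1f} microseconds per parse")
    print(f"string: {seconds_per_parse(as_string) * 1e6:.1f} microseconds per parse")
```

Comparing the per-parse cost against the one-request-per-second rate and the expected number of concurrent users gives a first idea of whether deserialization is worth optimizing at all.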
I am trying to implement type-ahead in my app, and I got search suggest to work with an element range index as recommended in the documentation. The problem is, it doesn't fit my use case.
As anyone who has used it knows, it will not return results unless the search string is at the beginning of the content being searched. Barring the use of a leading and trailing wildcard, this won't return what I need.
I was thinking instead of simply doing a search based on the term, then returning the result snippets (truncated in my server-side code) as the suggestions in my type-ahead.
As I don't have a good way of comparing performance, I was hoping for some insight on whether this would be practical, or if it would be too slow.
Also, since it may come up in the answers, yes I have read the post about "chunked Element Range Indexes", but being new to MarkLogic, I can't make heads or tails of it and haven't been able to adapt it to my app.
I wrote the Chunked Element Range Indexes blog post, and found out last-minute that my performance numbers were skewed by a surprisingly large document in my index. When I removed that large document, many of the other techniques such as wildcard matching were suddenly much faster. That surprised me because all the other search engines I'd used couldn't offer such fast performance and flexibility for type-ahead scenarios, especially if I tried introducing a wildcard search. I decided not to push my post publicly, but someone else accidentally did it for me, so we decided to leave it out there since it still presents a valid option.
Since MarkLogic offers multiple wildcard indexes, there's really a lot you can do in that area. However, search snippets would not be the right way to do that, as I believe they'd add some overhead. Call cts:search or one of the other cts calls to match a lexicon. I'm guessing you'd want cts:element-value-match. That does wildcard matches against a range index, which is all in memory and so faster. Turn on all your wildcard indexes on your db if you can.
It should be called from a custom XQuery script in a MarkLogic HTTP server. I'm not recommending a REST extension as I usually would, because you need to be as stream-lined as possible to do most type-ahead scenarios correctly (that is, fast enough).
I'd suggest you find ways to whittle down the set of values in the range index to less than 100,000 so there's less to match against and you're not letting in any junk suggestions. Also, make sure that you filter the set of matches based on the rest of the query (if a user already started typing other words or phrases). Make sure your HTTP script limits the number of suggestions returned since a user can't usually benefit from a long list of suggestions. And craft some algorithms to rank the suggestions so the most helpful ones make it to the top. Finally, be very, very careful not to present suggestions that are more distracting than helpful. If you're going to give your users type-ahead, it will interrupt their searching and train-of-thought, so don't interrupt them if you're going to suggest search phrases that won't help them get what they want. I've seen that way too often, even on major websites. Don't do type-ahead unless you're willing to measure the usage of the feature, and tune it over time or remove it if it's distracting users.
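The advice above (match against a small set of values, cap the number of suggestions, rank the helpful ones first) can be sketched independently of MarkLogic. This is plain Python, not a MarkLogic API: the function name `suggest`, the ranking rule, and the sample values are all assumptions for illustration, and the substring test stands in for a leading-and-trailing wildcard match.

```python
def suggest(values, typed, limit=10):
    """Return up to `limit` values containing `typed`, most helpful first."""
    needle = typed.strip().lower()
    if not needle:
        return []  # nothing typed yet: do not interrupt the user
    # Substring containment plays the role of a "*needle*" wildcard match.
    matches = [v for v in values if needle in v.lower()]
    # Assumed ranking: prefix matches first, then shorter values, then alphabetical.
    matches.sort(key=lambda v: (not v.lower().startswith(needle), len(v), v.lower()))
    return matches[:limit]


if __name__ == "__main__":
    sample_values = ["New York", "York", "Yorkshire pudding", "Newark"]
    print(suggest(sample_values, "york", limit=2))
```

In a real deployment the matching would be done by the database against its in-memory index; the part worth keeping from this sketch is that limiting and ranking happen before anything is sent to the browser.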
Hoping that helps!
You mention you are using a range index to populate your suggestions, but you can use word lexicons as well. Word lexicons would produce suggestions based on tokenized character data, not entire values of elements (or json properties). It might be worth looking into that.
Alternatively, since you are mentioning wildcards, perhaps cts:value-match could be of interest to you. It runs on values (not words) from range indexes, but takes a wild-carded expression as input. It would perform far better than a snippet approach, which would need to pull up and process actual contents.
HTH!
In mongodb docs the author mentions it's a good idea to shorten property names:
Use shorter field names.
and in an old blog post from How To Node (edit, April 2022: it is now offline):
....oft-reported issue with mongoDB is the
size of the data on the disk... each and every record stores all the field-names
.... This means that it can often be
more space-efficient to have properties such as 't', or 'b' rather
than 'title' or 'body', however for fear of confusion I would avoid
this unless truly required!
I am aware of the solutions for how to do it. I am more interested in when this is truly required.
To quote Donald Knuth:
Premature optimization is the root of all evil (or at least most of
it) in programming.
Build your application however seems most sensible, maintainable and logical. Then, if you have performance or storage issues, deal with those that have the greatest impact until either performance is satisfactory or the law of diminishing returns means there's no point in optimising further.
If you are uncertain of the impact of particular design decisions (like long property names), create a prototype to test various hypotheses (like "will shorter property names save much space"). Don't expect the outcome of testing to be conclusive, however it may teach you things you didn't expect to learn.
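A prototype for the "will shorter property names save much space" hypothesis can be very small. The sketch below is only a proxy: it measures compact JSON, not BSON and not what ends up on disk after compression, and the helper `field_name_share` and the sample documents are invented for the example (the `title`/`body` versus `t`/`b` names come from the quoted blog post).

```python
import json


def field_name_share(doc: dict) -> float:
    """Fraction of the compact JSON encoding taken up by top-level key characters."""
    encoded = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    key_bytes = sum(len(key.encode("utf-8")) for key in doc)
    return key_bytes / len(encoded)


# Hypothetical documents with identical values and different key lengths.
long_doc = {"title": "Hello", "body": "World"}
short_doc = {"t": "Hello", "b": "World"}

if __name__ == "__main__":
    print(f"long names : {field_name_share(long_doc):.1%} of the encoded bytes are keys")
    print(f"short names: {field_name_share(short_doc):.1%} of the encoded bytes are keys")
```

Running this over a sample of real documents shows whether keys are a large share of the payload. If they are not, shortening them cannot save much, whatever the storage engine does.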
Keep the priority for meaningful names above the priority for short names unless your own situation and testing provides a specific reason to alter those priorities.
As mentioned in the comments of SERVER-863, if you're using MongoDB 3.0+ with the WiredTiger storage option with snappy compression enabled, long field names become even less of an issue as the compression effectively takes care of the shortening for you.
Bottom line up front: keep names as compact as they can be while still staying meaningful.
I don't think it is ever truly required to shorten names to a single letter. Still, you should shorten them as much as you can while remaining comfortable with them. Let's say you have a user's name: {FirstName, MiddleName, LastName}. You may be good to go with name: {first, middle, last}. If you feel comfortable, you may even be fine with name: {f, m, l}.
You should use short names because long names consume disk space and memory, and thus may somewhat slow down your application (fewer objects fit in memory, lookups are slower due to the bigger size, and queries take longer because seeking over the data takes longer).
Good schema documentation can tell the developer that t stands for town and not for title. Depending on your stack, you may even be able to shield the developer from these shortcuts through some helper utils that map them.
Finally, I would say that there's no guideline on when and how much you should shorten your schema names. It depends heavily on your environment and requirements. But you're good to keep it compact if you can supply good documentation explaining everything and/or offer utils to ease the life of developers and admins. Admins are likely to interact directly with MongoDB anyway, so I guess good documentation shouldn't be skipped.
I performed a little benchmark: I uploaded 252 rows of data from an Excel file into two collections, testShortNames and testLongNames, as follows:
Long Names:
{
    "_id": ObjectId("6007a81ea42c4818e5408e9c"),
    "countryNameMaster": "Andorra",
    "countryCapitalNameMaster": "Andorra la Vella",
    "areaInSquareKilometers": 468,
    "countryPopulationNumber": NumberInt("77006"),
    "continentAbbreviationCode": "EU",
    "currencyNameMaster": "Euro"
}
Short Names:
{
    "_id": ObjectId("6007a81fa42c4818e5408e9d"),
    "name": "Andorra",
    "capital": "Andorra la Vella",
    "area": 468,
    "pop": NumberInt("77006"),
    "continent": "EU",
    "currency": "Euro"
}
I then got the stats for each, saved in disk files, then did a "diff" on the two files:
pprint.pprint(db.command("collstats", dbCollectionNameLongNames))
The image below shows two variables of interest: size and storageSize.
My reading showed that storageSize is the amount of disk space used after compression, and that size is basically the uncompressed size. So we see that the storageSize is identical. Apparently the WiredTiger engine compresses field names quite well.
I then ran a program to retrieve all data from each collection, and checked the response time.
Even though it was a sub-second query, the long names consistently took about 7 times longer. It of course will take longer to send the longer names across from the database server to the client program.
-------LongNames-------
Server Start DateTime=2021-01-20 08:44:38
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606964546 EndTimeM= 606965328
ElapsedTime MilliSeconds= 782
-------ShortNames-------
Server Start DateTime=2021-01-20 08:44:39
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606965328 EndTimeM= 606965421
ElapsedTime MilliSeconds= 93
In Python, I just did the following (I had to actually loop through the items to force the reads, otherwise the query returns only the cursor):
results = dbCollectionLongNames.find(query)
# find() only returns a cursor; iterating is what forces the documents to be read
for result in results:
    pass
Adding my 2 cents on this..
Long-named attributes (or "AbnormallyLongNameAttributes") can be avoided while designing the data model. In my previous organisation we tested a strategy of keeping attribute names short, using organisation-defined 4-5 letter encoded strings, e.g.:
First Name = FSTNM,
Last Name = LSTNM,
Monthly Profit Loss Percentage = MTPCT,
Year on Year Sales Projection = YOYSP, and so on.
While we observed an improvement in query performance, largely due to the reduction in the size of the data being transferred over the network, or (since we used Java with MongoDB) the reduction in the length of "keys" held in the MongoDB document/Java Map heap space, the overall improvement in performance was less than 15%.
In my personal opinion, this was a micro-optimization that came at the additional cost (and huge headache) of designing and maintaining an additional system to manage a Data Attribute Dictionary for each of the data models. This system was required for organisation-wide transparency when debugging the application or answering client queries.
If you find yourself in a position where a performance increase of up to 20% from this strategy looks attractive, maybe it is time to scale up your MongoDB servers, choose some other data modelling/querying strategy, or choose a different database altogether.
If you are using verbose XML, trying to ameliorate that with custom names could be very important. A user comment in the SERVER-863 ticket said that in his case: 'I'm storing externally-defined XML objects, with verbose naming: the fieldnames are, perhaps, 70% of the total record size. So fieldname tokenization could be a giant win, both in terms of I/O and memory efficiency.'
Collection with smaller name - InsertCompress
Collection with bigger name - InsertNormal
I performed this test on our sharded MongoDB cluster, and the analysis shows:
There is around a 10-15% gain with shorter names when saving, which seems to be purely due to network latency. I used bulk inserts from multiple threads, so with single inserts the saving could be larger.
My average document size is 280 B for InsertCompress and 350 B for InsertNormal, and I inserted 25 million records. InsertNormal shows 8.1 GB and InsertCompress shows 6.6 GB. This is the data size.
Surprisingly, the index data size shows as 2.2 GB for the InsertCompress collection and 2 GB for the InsertNormal collection.
Again, the storage size is 2.2 GB for the InsertCompress collection, while for InsertNormal it is around 1.6 GB.
Overall, apart from network latency, nothing is gained in storage, so it is not worth putting effort into this direction to save storage. Consider it only if you have much bigger documents where shorter field names would save a lot of data.
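The data-size figures reported above can be sanity-checked with a little arithmetic. The sketch assumes that the "GB" in the report means GiB (2^30 bytes) and that the quoted averages are rounded; both are guesses, not something stated in the post.

```python
RECORDS = 25_000_000
GIB = 2 ** 30  # assumption: the reported "GB" is really GiB


def total_gib(avg_doc_bytes: int, records: int = RECORDS) -> float:
    """Uncompressed data size in GiB for `records` documents of the given average size."""
    return avg_doc_bytes * records / GIB


normal_gib = total_gib(350)    # InsertNormal, longer field names
compact_gib = total_gib(280)   # InsertCompress, shorter field names
relative_saving = 1 - 280 / 350

if __name__ == "__main__":
    print(f"InsertNormal  : {normal_gib:.2f} GiB")
    print(f"InsertCompress: {compact_gib:.2f} GiB")
    print(f"uncompressed saving: {relative_saving:.0%}")
```

Under those assumptions the computed sizes land close to the reported 8.1 GB and 6.6 GB, so the uncompressed saving is roughly a fifth, even though the compressed storage size did not improve.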
It is extremely difficult to illustrate the complexity of frameworks (hibernate, spring, apache-commons, ...)
The only thing I could think of was to compare the file sizes of the jar libraries or even better, the number of classes contained in the jar files.
Of course this is not a mathematically sound proof of complexity. But at least it should make clear that some frameworks are lightweight compared to others.
Of course it would take quite some time to calculate these statistics. In an attempt to save time, I was wondering whether somebody has already done so?
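For what it's worth, the class-count comparison is quick to script, because a jar is an ordinary zip archive. A minimal sketch follows; the function name `jar_stats` is made up, and inner and anonymous classes are counted as separate classes since each has its own .class entry.

```python
import sys
import zipfile


def jar_stats(jar_path):
    """Return (number of .class entries, their total uncompressed size in bytes)."""
    with zipfile.ZipFile(jar_path) as jar:
        classes = [info for info in jar.infolist() if info.filename.endswith(".class")]
    return len(classes), sum(info.file_size for info in classes)


if __name__ == "__main__":
    # Usage: python jar_stats.py first.jar second.jar ...
    for path in sys.argv[1:]:
        count, size = jar_stats(path)
        print(f"{path}: {count} classes, {size} bytes of bytecode")
```

Dependencies pulled in transitively are not included, so a jar that looks small may still drag in many others.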
EDIT:
Yes, there are a lot of tools to calculate the complexity of individual methods and classes. But this question is about third party jar files.
Also, please note that 40% of the phrases in my original question stress that everybody is well aware that complexity is hard to measure and that file size and number of classes may indeed not be sufficient. So it is not necessary to elaborate on this any further.
There are tools out there that can measure the complexity of code. However this is more of a psychological question as you cannot mathematically define the term 'complex code'. And obviously giving two random persons some piece of code will give you very different answers.
In general, the issue with complexity arises from the fact that a human brain cannot process more than a certain number of lines of code simultaneously (actually functional pieces, but normal lines of code should be exactly that). The exact number of lines that one can hold and understand in memory at the same time of course varies based on many factors (including time of day, day of the week and status of your coffee machine) and therefore depends completely on the audience. However, the fewer lines of code you have to keep in your 'internal memory register' for one task, the better, so this should be the general factor when trying to determine the complexity of an API.
There is, however, a pitfall with this way of calculating complexity: many APIs offer you a fast way of solving a problem (easy entry level), but this solution later turns out to cause several very complex coding decisions that overall make your code very difficult to understand. In contrast, other APIs require you to do a very complex setup that is hard to understand at first, but the rest of your code will be extremely easy because of that initial setup.
Therefore a good way of measuring API complexity is to define a task to solve with that API that is representative and big enough, and then measure the average number of lines of code one has to keep in mind simultaneously to implement that task. And once you're done, please publish the result in a scientific paper of your choice. ;)
We have a small index - less than 1MB in size and covering roughly 10,000 documents. The only fields that are stored are quite short which explains the small index size.
After the documents are loaded into the index, an update of an existing document can take between 1 and 2 seconds (there's quite a variance in this range though). We've tried utilizing various best practices (such as those in the Lucene wiki) but can't find what's wrong. We've even gone ahead and are now using RAMDirectory to remove the possibility of IO being the problem.
Is this really the performance to expect?
UPDATE
As requested below, I'm adding some more details:
We're treating Lucene as a black-box, we just time the amount of time it takes to reindex/update an object. We don't know what's going on inside.
The objects (or documents, in Lucene's terms) are quite small, with a total size of 2KB of data each.
A code snippet outlining your entire update procedure would help. Are you committing after each update? This is not necessary and for top performance you must use Near Realtime Readers. Newer Lucene versions have an NRTManager that handles most of the boilerplate involved.
In many cases the best practice is to commit only rarely or never (except when shutting down). If your service shuts down ungracefully, you lose your index, but even if you didn't, you'd have to rebuild it upon restart anyway to account for all the changes that happened in the meantime.