Best practices for variable names using Mongo [duplicate] - java

In mongodb docs the author mentions it's a good idea to shorten property names:
Use shorter field names.
and in an old blog post from How To Node (offline as of the April 2022 edit):
....oft-reported issue with mongoDB is the
size of the data on the disk... each and every record stores all the field-names
.... This means that it can often be
more space-efficient to have properties such as 't', or 'b' rather
than 'title' or 'body', however for fear of confusion I would avoid
this unless truly required!
I am aware of ways to do this. What I am more interested in is: when is it truly required?

To quote Donald Knuth:
Premature optimization is the root of all evil (or at least most of
it) in programming.
Build your application however seems most sensible, maintainable and logical. Then, if you have performance or storage issues, deal with those that have the greatest impact until either performance is satisfactory or the law of diminishing returns means there's no point in optimising further.
If you are uncertain of the impact of particular design decisions (like long property names), create a prototype to test various hypotheses (like "will shorter property names save much space"). Don't expect the outcome of testing to be conclusive; however, it may teach you things you didn't expect to learn.

Keep the priority for meaningful names above the priority for short names unless your own situation and testing provides a specific reason to alter those priorities.
As mentioned in the comments of SERVER-863, if you're using MongoDB 3.0+ with the WiredTiger storage option with snappy compression enabled, long field names become even less of an issue as the compression effectively takes care of the shortening for you.

Bottom line up front: keep names as compact as they can be while still staying meaningful.
I don't think it is ever truly required to shorten them to one-letter names. Still, you should shorten them as much as you feel comfortable with. Let's say you have a user's name: {FirstName, MiddleName, LastName}; you may be good to go with name: {first, middle, last}. If you feel comfortable, you may even be fine with name: {f, m, l}.
You should use short names because long ones consume disk space and memory, and thus may somewhat slow down your application (fewer objects fit in memory, lookups are slower due to the bigger size, and queries take longer because seeking over the data takes longer).
Good schema documentation can tell the developer that t stands for town and not for title. Depending on your stack, you may even be able to shield the developer from these shortcuts with helper utilities that map them.
Finally, I would say that there is no guideline for when and how much you should shorten your schema names. It depends heavily on your environment and requirements. But you're fine keeping it compact if you can supply good documentation explaining everything and/or offer utilities to ease the life of developers and admins. Admins are likely to interact directly with MongoDB anyway, so I guess good documentation shouldn't be skipped.
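As a rough illustration of the "helper utils" idea above, here is a minimal sketch that translates between readable names in application code and short names in storage. The mapping and the function names (FIELD_MAP, to_storage, from_storage) are hypothetical and not part of any MongoDB driver; it assumes flat documents with no nested fields.

```python
# Hypothetical mapping from readable names to the short names kept in storage.
FIELD_MAP = {"title": "t", "body": "b", "town": "tn"}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}


def to_storage(doc):
    """Rename readable keys to their short form; unknown keys pass through."""
    return {FIELD_MAP.get(key, key): value for key, value in doc.items()}


def from_storage(doc):
    """Rename short keys back to their readable form; unknown keys pass through."""
    return {REVERSE_MAP.get(key, key): value for key, value in doc.items()}


stored = to_storage({"title": "Hello", "body": "World", "_id": 1})
restored = from_storage(stored)
```

Keeping the mapping in one place also gives you the schema documentation for free: the dictionary itself says that t means title and tn means town.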

I performed a little benchmark: I uploaded 252 rows of data from an Excel file into two collections, testShortNames and testLongNames, as follows:
Long Names:
{
"_id": ObjectId("6007a81ea42c4818e5408e9c"),
"countryNameMaster": "Andorra",
"countryCapitalNameMaster": "Andorra la Vella",
"areaInSquareKilometers": 468,
"countryPopulationNumber": NumberInt("77006"),
"continentAbbreviationCode": "EU",
"currencyNameMaster": "Euro"
}
Short Names:
{
"_id": ObjectId("6007a81fa42c4818e5408e9d"),
"name": "Andorra",
"capital": "Andorra la Vella",
"area": 468,
"pop": NumberInt("77006"),
"continent": "EU",
"currency": "Euro"
}
I then got the stats for each, saved in disk files, then did a "diff" on the two files:
pprint.pprint(db.command("collstats", dbCollectionNameLongNames))
The image below shows two variables of interest: size and storageSize.
From my reading, storageSize is the amount of disk space used after compression, and size is basically the uncompressed size. So we see that storageSize is identical. Apparently the WiredTiger engine compresses field names quite well.
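To get a feel for how much of the uncompressed size difference comes from the names alone, here is a back-of-the-envelope sketch. It relies on the fact that BSON stores every element's key as a NUL-terminated string inside each document; it counts only key bytes for the two sample documents above and says nothing about the compressed storageSize.

```python
def name_bytes(field_names):
    # Each BSON key is stored as its UTF-8 bytes plus one NUL terminator.
    return sum(len(name.encode("utf-8")) + 1 for name in field_names)


# Field names copied from the two sample documents (_id is the same in both).
LONG_NAMES = [
    "countryNameMaster", "countryCapitalNameMaster", "areaInSquareKilometers",
    "countryPopulationNumber", "continentAbbreviationCode", "currencyNameMaster",
]
SHORT_NAMES = ["name", "capital", "area", "pop", "continent", "currency"]
DOCS = 252  # rows uploaded in the benchmark

saved_per_doc = name_bytes(LONG_NAMES) - name_bytes(SHORT_NAMES)
saved_total = saved_per_doc * DOCS
print(f"{saved_per_doc} bytes per document, {saved_total} bytes over {DOCS} documents")
```

For a collection this small the difference is a few tens of kilobytes before compression, which fits the observation that it disappears once the storage engine compresses the data.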
I then ran a program to retrieve all data from each collection, and checked the response time.
Even though it was a sub-second query, the long names consistently took about 7 times longer. It of course will take longer to send the longer names across from the database server to the client program.
-------LongNames-------
Server Start DateTime=2021-01-20 08:44:38
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606964546 EndTimeM= 606965328
ElapsedTime MilliSeconds= 782
-------ShortNames-------
Server Start DateTime=2021-01-20 08:44:39
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606965328 EndTimeM= 606965421
ElapsedTime MilliSeconds= 93
In Python, I just did the following (I had to actually loop through the items to force the reads, otherwise the query returns only the cursor):
results = dbCollectionLongNames.find(query)
for result in results:
    pass

Adding my 2 cents on this..
Long-named attributes (or "AbnormallyLongNameAttributes") can be avoided while designing the data model. In my previous organisation we tested a strategy of keeping attribute names short, using organisation-defined 4-5 letter encoded strings, e.g.:
First Name = FSTNM,
Last Name = LSTNM,
Monthly Profit Loss Percentage = MTPCT,
Year on Year Sales Projection = YOYSP, and so on.
While we observed an improvement in query performance, largely due to the reduction in the size of data transferred over the network, or (since we used Java with MongoDB) the reduction in the length of the keys held in MongoDB document/Java Map heap space, the overall improvement in performance was less than 15%.
In my personal opinion, this was a micro-optimization that came at the additional cost (and huge headache) of designing and maintaining an additional system, a Data Attribute Dictionary, for each of the data models. This system was required for organisation-wide transparency when debugging the application or answering client queries.
If you find yourself in a position where an increase of up to 20% in performance from this strategy is attractive, maybe it is time to scale up your MongoDB servers, choose some other data modelling/querying strategy, or else choose a different database altogether.

If you are storing verbose XML, ameliorating that with custom names could be very important. A user commenting on the SERVER-863 ticket said that in his case: 'I'm storing externally-defined XML objects, with verbose naming: the fieldnames are, perhaps, 70% of the total record size. So fieldname tokenization could be a giant win, both in terms of I/O and memory efficiency.'

Collection with shorter names - InsertCompress
Collection with longer names - InsertNormal
I performed this test on our sharded Mongo cluster, and the analysis shows:
There is around a 10-15% gain from shorter names while saving, and it seems to be purely due to network latency. I used bulk inserts from multiple threads, so with single inserts the saving could be larger.
My average document size is 280 B for InsertCompress and 350 B for InsertNormal, and I inserted 25 million records. InsertNormal shows 8.1 GB and InsertCompress shows 6.6 GB. This is the data size.
Surprisingly, the index size shows as 2.2 GB for the InsertCompress collection and 2 GB for the InsertNormal collection.
Again, the storage size is 2.2 GB for the InsertCompress collection, while for InsertNormal it is around 1.6 GB.
Overall, apart from network latency nothing is gained in storage, so it is not worth the effort to go in this direction to save storage. Consider it only if you have much bigger documents where shorter field names save a lot of data.

Related

'Big dictionary' implementation in Java

I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it offers out of the box that is useful for your task:
Persistence to disk via memory-mapped files (see the comment by Michał Kosmulski)
Lazy loading (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than available memory, the operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in the standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already need to parallelize text processing or plan to in the future.
Once your data structure is anywhere from a few hundred MB to the order of your RAM, you're better off not initializing a data structure at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once your file gets so large that you're running up against the -Xmx settings of your JVM. This is because if your file is as large as, or much larger than, your maximum heap size setting, you're inevitably going to crash your JVM.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB at the beginning. You can parallelize this process if there are other parts of your code execution that don't depend on it.
Please let me know if you have any questions!
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there's a way if you want to go low level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good tries implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess, the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.
I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
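The base 128 encoding mentioned above for child pointers can be sketched as follows. This is only an illustration of the variable-length idea, not a trie implementation; it assumes the relative offsets are non-negative, which the answer does not state, and the function names are made up for the example.

```python
def encode_varint(n):
    """Encode a non-negative integer in base 128, least-significant group first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative offsets")
    out = bytearray()
    while True:
        group = n & 0x7F          # lowest 7 bits
        n >>= 7
        if n:
            out.append(group | 0x80)  # high bit set: more bytes follow
        else:
            out.append(group)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one value starting at pos; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
```

Offsets below 128 take a single byte, and a full 32-bit offset takes five, which matches the 1-5 byte range quoted above.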
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.
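The limited-size cache suggested above could look roughly like this. It is a minimal sketch of a least-recently-used cache in front of a slower lookup; the class name and the load callback are assumptions for the example, and the plain dictionary stands in for the database or index.

```python
from collections import OrderedDict


class LRUCache:
    """Keep the most recently used lookups in memory, up to a fixed capacity."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load              # slow lookup, e.g. a database query
        self.entries = OrderedDict()  # ordered from oldest to most recent use
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.load(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


# A plain dict stands in for the slow backing store in this example.
backing_store = {"apple": 1, "pear": 2, "plum": 3}
cache = LRUCache(capacity=2, load=backing_store.__getitem__)
```

With a capacity of around 10000, as suggested, the frequently used words stay in memory while the long tail still goes to the backing store.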

Why is my JDBC call consuming 4 times more memory than the actual size of the data?

I wrote a small java program which loads data from DB2 database using simple JDBC call. I am using select query to get data and using java statement for this purpose. I have properly closed statement and connection objects. I am using 64 bit JVM for compilation and for running the program.
The query is returning 52 million records, each row having 24 columns, which takes me around 4 minutes to load complete data in Unix (having multiprocessor environment). I am using HashMap as data-structure to load the data: Map<String, Map<String, GridTradeStatus>>. The bean GridTradeStatus is a simple getter/setter bean with 24 properties in it.
The memory required for the program is alarmingly high. The Java heap size goes up to 5.8 - 6GB to load the complete data, while the actual used heap size remains between 4.7 - 4.9GB. I know that we should not load this much data into memory, but my business requirements leave no alternative.
The question is this: when I put the whole data of my table in a flat file, it comes out to roughly ~1.2GB. I want to know why my Java program is consuming 4 times more memory than the actual size of the data.
There is nothing surprising here (to me at least).
a.) Strings in Java consume double the space compared to most common text formats (because Strings are represented as UTF-16 on the heap, at least before the compact strings introduced in Java 9). Also, String as an object has quite some overhead (the String object itself, the reference to the char[] it contains, hashCode etc.). For small strings the String object easily costs as much memory as the data it contains.
b.) You put stuff into a HashMap. HashMap is not exactly memory efficient. First it uses a default load factor of 75%, which means a map with many entries has also a big bucket array. Then, each entry in the map is an object itself, which costs at least two references (key and value) plus object overhead.
In conclusion you pretty much have to expect the memory requirements to increase quite a bit. A factor of 4 is reasonable if your average data String is relatively short.
If you think you cannot afford a 1:4 ratio between the size of the data in a flat file and the memory necessary to load the Strings into a HashMap, you should consider not using Java but a lower-level language such as C++ or even C.
Of course there are possible optimizations :
use byte[] instead of String (about half the size)
do not use default HashMap parameters (initial size / load factor) but tweak them to meet your actual requirements.
What follows is mainly based on experience and opinion. I generally use 4 language levels:
high-level scripting language (Python, Ruby, or even bash ...) when performance is not a requirement and speed of development is
mid-level language (Java, less frequently high-level C++) when performance matters but I also want simplicity of development and robustness (strong typing, ...)
low-level language (low-level C++, or C) when performance is a high requirement and I accept spending much more time writing and testing individual modules
assembly language for the small parts where performance is critical and has been proved to be so by profiling.
IMHO you can tweak Java code to greatly reduce the memory footprint, but you risk losing much of the benefit of Java by giving up its excellent string and collections support. It might be as easy, and perhaps more efficient, to code a small part of the application in C++ and use JNI to tie it all together.

Mergesort or Database?

I have a rather complex database-query which gives me 30 million records - roughly 15 times the amount of data which would fit into memory. I need to access all records from the database sequentially (i.e. sorted). For performance reasons it is not possible to use an "order by" statement as the preparation of the ordered ResultSet uses roughly 40 minutes.
I see two possible options to solve my problem:
Dump the resulting data into an unordered file and use some form of merge-sort to arrive with a sorted file
Flatten data and dump it into a secondary database and reselect it using ordering mechanisms of the database.
Which would you prefer for reasons of elegance and performance?
If your choice is number two, do you have a suggestion for the database to use? Would you prefer SQLite, MySQL or Apache Derby?
For sorting large amounts of data, one solution is to split them into blocks that you can load into memory, e.g. a 30th of the data each (15 * 2), and sort the records in each block. This will give you 30 sorted files.
Take the 30 sorted files and do a merge sort between them. (This requires at least 30 records in memory.) You can process the records as you merge them.
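The two steps above (sort blocks that fit in memory, then merge the sorted files) can be sketched like this. It is a minimal illustration under stated assumptions: records are newline-terminated text lines compared as strings, chunk_size stands in for "whatever fits in memory", and the function name is made up for the example.

```python
import heapq
import os
import tempfile


def external_sort(lines, chunk_size, workdir):
    """Yield newline-terminated lines in sorted order using bounded memory."""
    chunk_paths = []
    chunk = []

    def flush():
        # Sort one in-memory block and write it out as its own file.
        chunk.sort()
        fd, path = tempfile.mkstemp(dir=workdir, suffix=".chunk")
        with os.fdopen(fd, "w") as out:
            out.writelines(chunk)
        chunk_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    # Merge phase: heapq.merge holds only one pending line per sorted file.
    files = [open(path) for path in chunk_paths]
    try:
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
        for path in chunk_paths:
            os.remove(path)
```

Because the function is a generator, the caller can process each record as it comes out of the merge, without ever holding the full sorted result in memory.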
BTW: It is also possible that it's time to buy a more powerful computer. You can buy a PC with 16 GB of memory and an SSD for close to $1000. For $2000 you can get a fast PC with 32 GB of memory. This could save you a lot of time. ;)
For the best performance, definitely option 1. Dumping the data to a flat file, sorting with a good external sort program, and then reading back in will use the minimum amount of resource from all the options. If you want to post specifics on the record length and system configuration (memory, disk speeds) I can let you know how long it should take.
The problem with option 2 is that it may simply reproduce the problem you currently have in another form. I can't tell from your post how complex your query is (how many tables you're joining), and it may be that a lot of your 40 minutes is being spent in the join. But even if that is the case, option 2 still has to do an external sort if your data is 15 times the size of available memory. The only databases that do this well are those that are designed to use a commercial external sort under the covers, so you're back to option 1 anyway.
As far as elegance is concerned, that's often in the eye of the beholder ;-). Personally, I find ultra-high performance elegant in its own right, but it's kinda subjective.
It's hard to say which method will be better for you. You really have to benchmark it.
A good idea is to increase your memory and keep an ordered index there. Then retrieve the data from disk/database (based on index of the item that you need)

Bitcask ok for simple and high performant file store?

I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of xml-files in a batch-process. XML files may be up to a few megs large, most in the 100KB-range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no means to access a Bitcask repo via Java (or is there?)
So my question boils down to:
is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents
are there any viable alternatives to Bitcask available via Java? (Berkeley DB comes to mind...)
(for Riak specialists) Does Riak add much overhead implementation/management/resource-wise compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.
Note the above assumes that the XML documents are updated relatively frequently. If updates are rare and / or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
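The model discussed in this thread (append-only writes plus an in-memory key index) can be illustrated with a toy store. This is not Bitcask's file format or API, only a sketch of the idea; the class name is hypothetical, and the index here lives purely in memory and is not rebuilt when an existing file is reopened.

```python
import os
import tempfile


class AppendOnlyStore:
    """Toy key-value store: values are appended to one file, keys live in a dict."""

    def __init__(self, path):
        self.index = {}                 # key -> (offset, length) of the latest value
        self.file = open(path, "a+b")   # append mode: every write goes to the end

    def put(self, key, value):
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(value)          # sequential write, old data is never touched
        self.file.flush()
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        self.file.seek(offset)          # one seek per lookup
        return self.file.read(length)

    def close(self):
        self.file.close()
```

Overwriting a key only appends a new value and repoints the index, so the old bytes stay in the file as wasted space. That wasted space is what the merge process described above exists to reclaim.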
I am not sure on the status of a Java version/wrapper.

Avoid an "out of memory error" in Java(eclipse), when using large data structure?

OK, so I am writing a program that unfortunately needs to use a huge data structure to complete its work, but it is failing with a "out of memory error" during its initialization. While I understand entirely what that means and why it is a problem, I am having trouble overcoming it, since my program needs to use this large structure and I don't know any other way to store it.
The program first indexes a large corpus of text files that I provide. This works fine.
Then it uses this index to initialize a large 2D array. This array will have n² entries, where "n" is the number of unique words in the corpus of text. For the relatively small chunk I am testing it on (about 60 files), it needs to make approximately 30,000 x 30,000 entries. This will probably be bigger once I run it on my full intended corpus too.
It consistently fails every time, after it indexes, while it is initializing the data structure(to be worked on later).
Things I have done include:
revamp my code to use a primitive int[] instead of a TreeMap
eliminate redundant structures, etc...
Also, I have run the program with -Xmx2g to max out my allocated memory
I am fairly confident this is not going to be a simple line of code solution, but is most likely going to require a very new approach. I am looking for what that approach is, any ideas?
Thanks,
B.
It sounds like (making some assumptions about what you're using your array for) most of the entries will be 0. If so, you might consider using a sparse matrix representation.
If you really have that many entries (your current array is somewhere over 3 gigabytes already, even assuming no overhead), then you'll have to use some kind of on-disk storage, or a lazy-load/unload system.
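To make both points concrete, here is a small sketch: the arithmetic behind the dense array size, and a sparse representation that stores only the non-zero cells. It assumes 4-byte int cells and that the matrix holds counts; the class and function names are made up for the example.

```python
from collections import defaultdict


def dense_bytes(n, bytes_per_cell=4):
    """Raw size of an n x n array, ignoring any per-array overhead."""
    return n * n * bytes_per_cell


class SparseCounts:
    """Store only the cells that have been touched; every other cell reads as 0."""

    def __init__(self):
        self.cells = defaultdict(int)   # (row, col) -> count

    def add(self, row, col, amount=1):
        self.cells[(row, col)] += amount

    def get(self, row, col):
        # .get does not create an entry, so reads never grow the structure.
        return self.cells.get((row, col), 0)


print(dense_bytes(30_000))  # 3600000000 bytes, i.e. roughly 3.35 GiB
```

A dictionary entry costs far more than 4 bytes, so this only pays off when the large majority of cells really are zero.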
There are several causes of out of memory issues.
Firstly, the simplest case is that you simply need more heap. For example, you're using a 512M max heap when your program could operate correctly with 2G. Increase it with -Xmx2048m as a JVM option and you're fine. Also be aware that 64-bit VMs can use up to twice the memory of 32-bit VMs, depending on the makeup of the data.
If your problem isn't that simple, then you can look at optimization: replacing objects with primitives and so on. This might be an option. I can't really say based on what you've posted.
Ultimately, however, you come to a crossroads where you have to make a choice between virtualization and partitioning.
Virtualizing in this context simply means some form of pretending there is more memory than there is. Operating systems use this with virtual address spaces and using hard disk space as extra memory. This could mean only keeping some of the data structure in memory at a time and persisting the rest to secondary storage (eg file or database).
Partitioning is splitting your data across multiple servers (either real or virtual). For example, if you were keeping track of stock trades on the NASDAQ you could put stock codes starting with "A" on server1, "B" on server2, etc. You need to find a reasonable approach to slice your data such that you reduce or eliminate the need for cross-communication because that cross-communication is what limits your scalability.
So, as a simple case, if what you're storing is 30K words and 30K x 30K combinations of words, you could divide it up across four servers:
A-M x A-M
A-M x N-Z
N-Z x A-M
N-Z x N-Z
That's just one idea. Again, it's hard to comment without knowing specifics.
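The four-way split above can be expressed as a small routing function. This is only a sketch of the idea; it assumes every word starts with an ASCII letter, and the numbering of the servers (0 to 3, in the order listed) is an assumption for the example.

```python
def half(word):
    """0 for words starting with A-M, 1 for words starting with N-Z."""
    return 0 if word[0].upper() <= "M" else 1


def server_for(word_a, word_b):
    """Map a word pair to one of four partitions, in the order listed above."""
    return 2 * half(word_a) + half(word_b)
```

Every lookup for a given pair lands on exactly one server, so no cross-server communication is needed for single-cell reads and writes.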
This is a common problem dealing with large datasets. You can optimize as much as you want, but the memory will never be enough (probably), and as soon as the dataset grows a little more you are still smoked. The most scalable solution is simply to keep less in memory, work on chunks, and persist the structure on disk (database/file).
If you don't need a full 32 bits (size of integer) for each value in your 2D array, perhaps a smaller type such as a byte would do the trick? Also you should give it as much heap space as possible - 2GB is still relatively small for a modern system. RAM is cheap, especially if you're expecting to be doing a lot of processing in-memory.
