I have a huge amount of data in MongoDB that I want to compare, but the comparison needs some logic to achieve the goal. I know that using Java for this case seems complicated because of memory usage.
USE CASE:
I want to compare two collections ("tables") in MongoDB, each with a large amount of data (>1.46M documents). We need an approach, or some logic, that lets us do this efficiently without exhausting memory or hitting the worst case.
I've tried loading the data into Java collections and, to find the discrepancies between them, I implemented the comparison logic with threads, but I got a lot of memory leaks and the process takes a long time.
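For reference, one memory-friendly way to approach this kind of comparison is to stream both collections sorted by a shared key and merge-compare them, instead of loading everything into Java collections. Below is a minimal sketch assuming the official MongoDB Java sync driver; the database, collection and field names are placeholders, and it assumes the _id values sort consistently as strings (e.g. ObjectId):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class CollectionDiff {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("mydb");               // placeholder names
            MongoCollection<Document> colA = db.getCollection("tableA");
            MongoCollection<Document> colB = db.getCollection("tableB");

            // Stream both collections sorted by _id and merge-compare them,
            // so neither collection is ever held fully in memory.
            try (MongoCursor<Document> a = colA.find().sort(Sorts.ascending("_id")).iterator();
                 MongoCursor<Document> b = colB.find().sort(Sorts.ascending("_id")).iterator()) {

                Document docA = a.hasNext() ? a.next() : null;
                Document docB = b.hasNext() ? b.next() : null;
                while (docA != null && docB != null) {
                    // Assumes _id values sort consistently as strings (e.g. ObjectId hex).
                    int cmp = docA.get("_id").toString().compareTo(docB.get("_id").toString());
                    if (cmp < 0) {
                        System.out.println("Only in tableA: " + docA.get("_id"));
                        docA = a.hasNext() ? a.next() : null;
                    } else if (cmp > 0) {
                        System.out.println("Only in tableB: " + docB.get("_id"));
                        docB = b.hasNext() ? b.next() : null;
                    } else {
                        if (!docA.equals(docB)) {
                            System.out.println("Same _id, different content: " + docA.get("_id"));
                        }
                        docA = a.hasNext() ? a.next() : null;
                        docB = b.hasNext() ? b.next() : null;
                    }
                }
                // Anything left on either cursor exists only in that collection.
            }
        }
    }
}
```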
Related
What is the fastest way to populate a Hazelcast Data Grid. Reading through documentation I can see couple of variants:
Use multithreading and IMap.set
Use multithreading and IMap.putAll
Use a Distributed Execution in order to start populating the grid from all participants.
My performance benchmark shows that IMap.putAll is faster than IMap.set. But the Hazelcast documentation states that IMap.putAll does not guarantee that everything will be inserted atomically.
Can someone clarify a little bit about what would be the fastest way to populate a data grid with data ?
Is variant number 3 good ?
I would see the same three options. Anyhow, as you mentioned, option two does not guarantee that everything is put into the map atomically, but if you just load the data and wait for all threads to finish loading via IMap::putAll, you should be fine.
Apart from that, IMap::set would be the alternative. In any case you want to multithread the loading process. I would experiment a bit with different thread counts; loading data from a client is normally recommended, to keep the cluster nodes free for storage operations.
I personally never benchmarked your third option, anyhow it would be possible as well. Just not sure it is worth the additional work.
How much data do you want to load that you're concerned it could be slow? Do you already know that loading is slow? Do you use Java Serialization (this is a huge performance killer)? Do you use indexes (those have to be generated while putting data)?
There's normally a lot of optimizations to apply to speed up, not only, data loading but also normal operation.
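To make the multithreaded putAll idea concrete, here is a minimal sketch assuming a Hazelcast 4.x/5.x Java client; the map name, key/value types, batch size and thread count are placeholders to tune for your data:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GridLoader {

    private static final int BATCH_SIZE = 10_000;   // tune for your value sizes
    private static final int THREADS = 4;           // tune for your client machine

    public static void load(List<Map.Entry<Long, String>> records) throws InterruptedException {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<Long, String> map = client.getMap("bulk-data");   // placeholder map name

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        Map<Long, String> batch = new HashMap<>();
        for (Map.Entry<Long, String> e : records) {
            batch.put(e.getKey(), e.getValue());
            if (batch.size() == BATCH_SIZE) {
                Map<Long, String> toSend = batch;
                pool.submit(() -> map.putAll(toSend));   // one bulk network call per batch
                batch = new HashMap<>();
            }
        }
        if (!batch.isEmpty()) {
            map.putAll(batch);                           // flush the last partial batch
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);        // wait until all batches are in
        client.shutdown();
    }
}
```

Waiting for the pool to drain before continuing is what makes the "no atomicity guarantee" of putAll a non-issue for a pure load-then-use scenario.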
I'm writing an android application which stores a set of ~50.000 strings, and I need input on how to best store them.
My objective is to be able to query with low latency for a list of strings matching a pattern (like Hello W* or *m Aliv*), but avoid a huge initialization time.
I thought of the following 2 ways:
A java collection. I imagine a java collection should be quick to search, but given that it's fairly large I'm afraid it might have a big impact on the app initialization time.
A table in a SQLite database. I imagine this would go easy on initialization time (since it doesn't need to be loaded to memory), but I'm afraid the query would impose some relevant latency since it needs to start a SQLite process (or doesn't it?).
Are my "imagine"s correct or horribly wrong? Which way would be best?
If you want quick (as in instant) search times, what you need is a full-text index of your strings. Fortunately, SQLite has full-text search support via the FTS extension. SQLite is part of the Android APIs and its initialisation time is negligible. What you do have to watch is that the index (the .sqlite file) either has to be shipped with your app in the .apk, or be re-created the first time the app opens (and that can take quite some time).
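A minimal sketch of the FTS approach (table and column names are made up, and it assumes a writable SQLiteDatabase has already been opened; note that FTS prefix queries cover patterns like Hello W*, but not leading wildcards like *m Aliv*):

```java
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

import java.util.ArrayList;
import java.util.List;

public class StringIndex {

    private final SQLiteDatabase db;

    public StringIndex(SQLiteDatabase db) {
        this.db = db;
        // FTS4 virtual table holding one column with the indexed strings.
        db.execSQL("CREATE VIRTUAL TABLE IF NOT EXISTS strings USING fts4(content)");
    }

    public void add(String s) {
        db.execSQL("INSERT INTO strings(content) VALUES (?)", new Object[] { s });
    }

    // Prefix patterns like "Hello W*" map directly onto the FTS MATCH syntax.
    public List<String> search(String pattern) {
        List<String> results = new ArrayList<>();
        try (Cursor c = db.rawQuery(
                "SELECT content FROM strings WHERE content MATCH ?", new String[] { pattern })) {
            while (c.moveToNext()) {
                results.add(c.getString(0));
            }
        }
        return results;
    }
}
```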
Look at data structures like a patricia trie (http://en.wikipedia.org/wiki/Radix_tree) or a Ternary Search Tree (http://en.wikipedia.org/wiki/Ternary_search_tree). They will dramatically reduce your search time and depending on the amount of overlap in your strings may actually reduce the memory requirements. The Java collections are good for many purposes but are not optimal for large sets of short strings.
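If you want to try the trie route without writing the data structure yourself, Apache Commons Collections ships a PatriciaTrie (an assumed extra dependency, not part of the Android SDK); prefixMap walks only the matching subtree, so prefix lookups stay fast even with ~50.000 entries. Suffix or infix patterns like *m Aliv* would still need a different structure. A minimal sketch:

```java
import org.apache.commons.collections4.trie.PatriciaTrie;

import java.util.SortedMap;

public class TrieExample {
    public static void main(String[] args) {
        // Keys are the strings themselves; values can carry any payload (here just a flag).
        PatriciaTrie<Boolean> trie = new PatriciaTrie<>();
        trie.put("Hello World", Boolean.TRUE);
        trie.put("Hello Where", Boolean.TRUE);
        trie.put("I'm Alive", Boolean.TRUE);

        // All entries whose key starts with "Hello W".
        SortedMap<String, Boolean> matches = trie.prefixMap("Hello W");
        matches.keySet().forEach(System.out::println);
    }
}
```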
I would definitely stick with SQLite. It's really fast in both initialization and querying. SQLite runs in the application process, so there are almost no time penalties on initialization. A query is normally fired from a background thread so as not to block the main thread. It will be very fast on 50.000 records, and you won't load all the data into memory, which is also important.
Your strings number only 50; in this case you can use a Java collection, as a database would be time-consuming.
I have a List of Strings I need to store locally (assume the list can run between 10 and 100 items). I want to know if I should write the list into a flat file or use serialization to flatten the object containing the list. Which is more expensive (CPU-wise)? What are the conditions that make one more expensive than the other?
Thanks!!
Especially since they are Strings, just write them out one per line to a file. Simple, fast, and far easier to test.
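A minimal sketch of that one-per-line approach using java.nio (the file name is a placeholder, and it assumes the strings contain no embedded newlines):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class LineStore {

    // Writes each string on its own line; overwrites any existing file.
    public static void save(List<String> strings, Path file) throws IOException {
        Files.write(file, strings, StandardCharsets.UTF_8);
    }

    // Reads the list back, one entry per line.
    public static List<String> load(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("strings.txt");            // placeholder location
        save(Arrays.asList("alpha", "beta", "gamma"), file);
        load(file).forEach(System.out::println);
    }
}
```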
I have a List of Strings I need to store locally (assume the list can run between 10 items to 100 items).
Assuming that the total length of the strings is small (e.g. less than 10K), the user-space CPU time used to do the saving is likely to be a few milliseconds using either serialization or a flat file. In other words, it will be so fast that the user won't notice the difference.
You should be looking at the other reasons for choosing between the two alternatives (and others):
How easy it is to write the code.
How many extra dependencies the alternative pulls in.
Human readability / editability of the saved data file ... in case you need to do this.
How easy / hard it would be to change the "schema" of the stuff saved to file ... in case you need to do this.
Whether you can update one string without rewriting the whole file ... if this is relevant.
Support for other things such as atomic update, transactions, complex queries, etc ... if these are relevant.
And if, despite what I said above, you still want to know which will be faster (and by how much), then benchmark it. The real world performance will depend on factors that you haven't specified.
Here are a couple of important references on how to write a Java benchmark so that it gives meaningful results.
How NOT to write a Java micro-benchmark
Robust Java benchmarking, Part 1: Issues.
Robust Java benchmarking, Part 2: Statistics and solutions
And you can experiment to answer this part of your question:
What are the conditions that make one more expensive than the other?
(See above)
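If you do benchmark it, a harness such as JMH (an assumed extra dependency, run via its own runner or build plugin) avoids the usual micro-benchmark pitfalls those articles describe. A minimal sketch of a flat-file benchmark; the data sizes are made up:

```java
import org.openjdk.jmh.annotations.*;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class SaveBenchmark {

    List<String> strings;
    Path target;

    @Setup
    public void setUp() throws IOException {
        strings = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            strings.add("some reasonably sized string #" + i);   // placeholder payload
        }
        target = Files.createTempFile("bench", ".txt");
    }

    @Benchmark
    public void flatFile() throws IOException {
        Files.write(target, strings, StandardCharsets.UTF_8);
    }
}
```

An equivalent @Benchmark method could wrap the serialization variant so the two approaches are measured under the same conditions.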
I am not sure about the exact expense, but the object representation often contains a lot of metadata (and structure), which can result in a much larger size than the original data. An example is storing an XML structure in a DOM object: it takes roughly 4X the memory of the original data.
Based on the above, I think serializing the object might be more expensive. You may also want to consider how the end product is consumed: if you want the produced file to be human readable, you will have to write the String data out as plain text rather than a binary serialized object.
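For comparison with the flat-file sketch above, a minimal sketch of the serialization alternative (ObjectOutputStream produces a binary, non-human-readable file; the file name is a placeholder):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerializedStore {

    // Serializes the whole list as one binary object graph.
    public static void save(List<String> strings, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(strings));   // ArrayList is Serializable
        }
    }

    @SuppressWarnings("unchecked")
    public static List<String> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (List<String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = new File("strings.ser");             // placeholder location
        save(Arrays.asList("alpha", "beta", "gamma"), file);
        System.out.println(load(file));
    }
}
```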
I'm attempting to transfer a large two dimensional array (17955 X 3) from my server to the client using Asynchronous RPC calls. This is taking a very long period of time which is especially bad because the data is needed in order to initialize the application. I've read that using a JSON object might be faster, but I'm not sure how to do the conversion in Java as I'm pretty new to the language and GWT, and I don't know if the speed difference is significant. I also read somewhere that I can zip the data, but I only read that in a forum and I'm not sure if it's actually possible as I couldn't find information for it elsewhere. Is there any way to transfer large amounts of data from server to client? Thanks for your time.
Read this article on adding JSON capabilities to GWT. Regarding compression, this article explains gzipping with GWT.
Also, the size of your array is still very large even with whatever compression you achieve through gzipping (which will vary depending on how much data is repeated in your array). You may want to consider logically breaking the array up into multiple RPC calls if at all possible.
I would recommend revisiting your design if your application needs such a large amount of data to initialize.
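A sketch of what breaking the transfer into multiple RPC calls could look like with a standard GWT async service; the service interface, method name and chunk size are assumptions, not your actual API:

```java
import com.google.gwt.user.client.rpc.AsyncCallback;

import java.util.ArrayList;
import java.util.List;

public class ChunkedLoader {

    // Assumed async service with a paged fetch method on the server side.
    interface DataServiceAsync {
        void getRows(int offset, int limit, AsyncCallback<List<String[]>> callback);
    }

    private static final int CHUNK = 2000;

    private final DataServiceAsync service;
    private final List<String[]> rows = new ArrayList<String[]>();

    public ChunkedLoader(DataServiceAsync service) {
        this.service = service;
    }

    public void loadFrom(final int offset) {
        service.getRows(offset, CHUNK, new AsyncCallback<List<String[]>>() {
            public void onSuccess(List<String[]> chunk) {
                rows.addAll(chunk);
                if (chunk.size() == CHUNK) {
                    loadFrom(offset + CHUNK);   // keep fetching until a short chunk arrives
                } else {
                    // all rows loaded; initialize the application here
                }
            }

            public void onFailure(Throwable caught) {
                // report the error or retry
            }
        });
    }
}
```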
As others pointed out, you should reconsider your design, because even if you somehow solve the data transfer speed issue, you will likely find other issues waiting for you:
Processing a large amount of data in the browser can be slow.
A lot of data means a lot of used-up memory.
What you can think about is:
Partitioning the data:
How is your user going to cope with a lot of data? Your user will probably need some kind of user interface aid to be able to work with such a huge amount of data. If you are going to use paging, tabs or other means to partition the data for the user's consumption, why not load the data on demand? For example, you can load a single page of records if you are using a paging grid, or you can load a single tab's worth of records if you are going to use tabs. Similarly, if you are going to allow filtering of the records, you can set a default filter after the load to keep the data to a minimum.
Summarizing the data:
You can also summarize the data on the server if you are not going to show each row to the user. For example, you can initially show a summary for each group of records and let the user drill down into a specific group.
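As an illustration of the server-side summarizing idea, a JDBC sketch (table and column names are made up) that returns one row per group instead of shipping every record to the client:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

public class SummaryDao {

    // Returns per-group counts; the client drills into a group with a separate query.
    public Map<String, Integer> countByCategory(Connection conn) throws SQLException {
        String sql = "SELECT category, COUNT(*) AS cnt FROM records GROUP BY category";
        Map<String, Integer> summary = new LinkedHashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                summary.put(rs.getString("category"), rs.getInt("cnt"));
            }
        }
        return summary;
    }
}
```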
We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase memory limits, we hesitate to do so since it requires having a high allocation when most of the time it's not necessary.
We are considering using a customized java.util.List implementation to spool to disk when we hit peak loads like this, but under lighter circumstances will remain in memory.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
The application is, essentially, a batch processor that loads in data from multiple database tables and conducts extensive business logic on it. All of the data in the list is required, since aggregate operations are part of the logic performed.
I just came across this post which offers a very good option: STXXL equivalent in Java
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
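A minimal sketch of that idea using Guava's AbstractIterator over a JDBC ResultSet (Guava is an assumed dependency here, and the column name is a placeholder):

```java
import com.google.common.collect.AbstractIterator;

import java.sql.ResultSet;
import java.sql.SQLException;

public class ResultSetIterator extends AbstractIterator<String> {

    private final ResultSet rs;

    public ResultSetIterator(ResultSet rs) {
        this.rs = rs;
    }

    @Override
    protected String computeNext() {
        try {
            // Pull one row at a time; nothing is materialized into a List.
            if (rs.next()) {
                return rs.getString("payload");   // placeholder column
            }
            return endOfData();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The processing loop then just iterates and aggregates as rows arrive, so the collection never exists in memory at all.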
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it is hard to tell from what you've described.
I'm an optimist and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3 to 5 days.
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
I would also question why you need to load all of the data in memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.
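For the "process it as it is being loaded" suggestion, a JDBC sketch (table, column and aggregate are made up, and fetch-size behaviour varies by driver) that aggregates while streaming rows, so the full data set never sits in memory:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingProcessor {

    public long sumAmounts(Connection conn) throws SQLException {
        long total = 0;
        try (PreparedStatement ps = conn.prepareStatement("SELECT amount FROM big_table")) {
            ps.setFetchSize(1000);   // hint the driver to stream rows in chunks
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    total += rs.getLong("amount");   // process each row as it arrives
                }
            }
        }
        return total;
    }
}
```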