Mongodb performance degrades significantly over time with upsert. - java

I'm using Mongodb as a cache right now. The application will be fed with 3 CSVs over night and the CSVs get bigger because new products will be added all the time. Right now, I'm reached 5 million records and it took about 2 hours to process everything. As the cache is refreshed everyday it'll become impractical to refresh the data.
For example
CSV 1
ID, NAME
1, NAME!
CSV 2
ID, DESCRIPTION
1, DESC
CSV 3
ID, SOMETHING_ELSE
1, SOMETHING_ELSE
The application will read CSV 1 and put it in the database. Then CSV 2 will be read if there're new information it will be added to the same document or create a new record. The same logic applies for CSV 3. So, one document will get different attributes from different CSVs hence the upsert. After everything is done then all the documents will be indexes.
Right now the first 1 million documents is relatively quick, but I can see the performance degrades considerably over time. I'm guessing it's because of the upsert as Mongodb has to find the document and update the attributes otherwise create it. I'm using Java Driver and MongoDB 2.4. Is there anyway I could improve or even do batch upsert in mongodb java driver?

What do you mean by 'after everything is done then all the documents will be indexed'?
If it is because you want to add additional indexes, it is debatable to do it at the end, but it is fine.
If you have absolutely no indexes, then this is likely your issue.
You want to ensure that all inserts/upserts you are doing are using an index. You can run one command and use .explain() to see if an index is getting used appropriately.
You need an index, otherwise you are scanning 1 million documents for each insert/update.
Also, can you also give more details about your application?
are you going to do the import in 3 phases only once, or will you do frequent updates?
do CSV2 and CSV3 modify a large percentage of the documents?
do the modifications of CSV2 and CSV3 add or replace documents?
what is the average size of your documents?
Let's assume you are doing a lot updates on the same documents many times. For example, CSV2 and CSV3 have updates on the same documents. Instead of importing for CSV1, then doing updates for CSV2, then another set of updates for CSV3, you may want to simply keep the documents in the memory of your application, apply all the updates in memory, then push your documents in the database. That assumes that you have enough RAM to do the operation, otherwise you will be using the disk again.

Related

SQL query performance, archive vs status change

Straight to the point, I've tried searching on google and on SO but cant find what I'm looking for. It could be because of not wording my searching correctly.
My question is,
I have a couple of tables which will be holding anywhere between 1,000 lines to 100,000 per year. I'm trying to figure out, do I/ how should I handle archiving the data? I'm not well experienced with databases, but below are a few method's I've came up with and I'm unsure which is a better practice. Of course taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method because when we want to search for old data, our application will need to search into a different database and it'll be a hassle for me to add in more code for this.
Method 2
Archive the data into a separate database for data older than 2-3 years.
And use status on the lines to improve the performance. (See method 3) This is something I'm leaning towards as an 'Optimal' solution where the code is not as complex to do but also keeps by DB relatively clean.
Method 3
Just have status for each line (eg: A=active, R=Archived) to possibly improving the performance of the query. Just having a "select * from table where status = 'A' " to reduce the the number of line to look through.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs - which doesn't seems to be the case here.
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use-case, I would further suggest having views on the data only for the active data. This would only read one partition, so the performance should be pretty much the same as reading a table with only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.

Trying bulk/ingest "large" amount of documents SQL Db to Elasticsearch

Hi I need to read multiple tables from my databases and join the tables. Once the tables are joined I would like to push them to Elasticsearch.
The tables are joined from an external process as the data can come from multiple sources. This is not an issue in fact I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, which then a single JsonDocument is produced for each key.
Then there is a separate process reads the denormalized JsonDocuments and bulks them to Elasticsearch at an average of 3000 documents per second.
I'm having troubles trying to find a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3000 documents per second. I was thinking somehow split the multimap that holds the Joined json docs.
Anyways I'm building a custom application for this. So I was wondering is there any tools that can be put together to do all this? Either some form of ETL, or stream processing or something?
While streaming would make records more readily available then bulk processing, and would reduce the overhead in the java container regarding large object management, you can have a hit on the latency. Usually in these kind of scenarios you have to find an optimum for the bulk size. In this I follow the following steps:
1) Build a streaming bulk insert (so stream but still get more then 1 record (or build more then 1 JSON in your case at the time)
2) Experiment with several bulk sizes: 10,100,1000,10000 for example and plot them in a quick graph. Run a sufficient amount of records to see if performance does not go down over time: It can be that the 10 is extremely fast per record, but that there is an incremental insert overhead (for example the case in SQL Server on the primary key maintenance). If you run the same number of total records for every test, it should be representative of your performance.
3) Interpolate in your graph and maybe try out 3 values between your best values of run 2
Then use the final result as your optimal stream bulk insertion size.
Once you have this value, you can add one more step:
Run multiple processes in parallel. This then fills the gaps in you process a bit. Watch the throughput and adjust your bulk sizes maybe one more time.
This approach once helped me with a multi TB import process to speed up from 2 days to about 12hrs, so it can work out pretty positive.

Processing a big list from DB in Java

I have a big list of over 20000 items to be fetched from DB and process it daily in a simple console based Java App.
What is the best way to do that. Should I fetch the list in small sets and process it or should I fetch the complete list into an array and process it. Keeping in an array means huge memory requirement.
Note: There is only one column to process.
Processing means, I have to pass that string in column to somewhere else as a SOAP request.
20000 items are string of length 15.
It depends. 20000 is not really a big number. If you are only processing 20000 short strings or numbers, the memory requirement isn't that large. But if it's 20000 images that is a bit larger.
There's always a tradeoff. Multiple chunks of data means multiple trips to the database. But a single trip means more memory. Which is more important to you? Also can your data be chunked? Or do you need for example record 1 to be able to process record 1000.
These are all things to consider. Hopefully they help you come to what design is best for you.
Correct me If I am Wrong , fetch it little by little , and also provide a rollback operation for it .
If the job can be done on a database level i would fo it using SQL sripts, should this be impossible i can recommend you to load small pieces of your data having two columns like the ID-column and the column which needs to be processed.
This will enable you a better performance during the process and if you have any crashes you will not loose all processed data, but in a crash case you eill need to know which datasets are processed and which not, this can be done using a 3rd column or by saving the last processed Id each round.

java - MongoDB + Solr performances

I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial responses, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to few hundred millions), and I want to implement full-text searches on some properties of those documents, so I guess Solr is the best way to do this.
What I want to know is how should I configure/execute everything so that it has good performances? right now, here's what I do (and I know its not optimal):
1- When inserting an object in MongoDB, I then add it to Solr
SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();
2- When updating a property of the object, since Solr cannot update just one field, first I retrieve the object from MongoDB then I update the Solr index with all properties from object and new ones and do something like
StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();
3- When querying, first I query Solr and then when retrieving the list of documents SolrDocumentList I go through each document and:
get the id of the document
get the object from MongoDB having the same id to be able to retrieve the properties from there
4- When deleting, well I haven't done that part yet and not really sure how to do it in Java
So anybody has suggestions on how to do this in more efficient ways for each of the scenarios described here? like the process to do it in a way that it won't take 1hour to rebuild the index when having a lot of documents in Solr and adding one document at a time? my requirements here are that users may want to add one document at a time, many times and I'd like them to be able to retrieve it right after
Your approach is actually good. Some popular frameworks like Compass are performing what you describe at a lower level in order to automatically mirror to the index changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).
In addition to what you describe, I would also regularly re-index all the data which lives in MongoDB in order to ensure both Solr and Mongo are sync'd (probably not as long as you might think, depending on the number of document, the number of fields, the number of tokens per field and the performance of the analyzers : I often create index from 5 to 8 millions documents (around 20 fields, but text fields are short) in less than 15 minutes with complex analyzers, just ensure your RAM buffer is not too small and do not commit/optimize until all documents have been added).
Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters the most to you, you could change the value of mergefactor in Solrconfig.xml (high values improve write performance whereas low values improve read performance, 10 is a good value to start with).
You seem to be afraid of the index build time. However, since Lucene indexes storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). However, the warm-up time will increase, so you should ensure that
there are typical (especially for sorts in order to load the fieldcaches) but not too complex queries in the firstSearcher and newSearcher parameters in your solrconfig.xml config file,
useColdSearcher is set to
false in order to have good search performance, or
true if you want changes performed to the index to be taken faster into account at the price of a slower search.
Moreover, if it is acceptable for you if the data becomes searchable only a few X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of UpdateHandler. This way Solr will have to commit less often.
For more information about Solr performance factors, see
http://wiki.apache.org/solr/SolrPerformanceFactors
To delete documents, you can either delete by document ID (as defined in schema.xml) or by query :
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
You can also wait for more documents and indexing them only each X minutes. (Of course this highly depend of your application & requirements)
If your documents are small and you don't need all data (which are stored in MongoDB) you can put only the field you need in the Solr Document by storing them but not indexing
<field name="nameoyourfield" type="stringOrAnyTypeYouuse"indexed="false"stored="true"/>

Performance problem on Java DB Derby Blobs & Delete

I’ve been experiencing a performance problem with deleting blobs in derby, and was wondering if anyone could offer any advice.
This is primarily with 10.4.2.0 under windows and solaris, although I’ve also tested with the new 10.5.1.1 release candidate (as it has many lob changes), but this makes no significant difference.
The problem is that with a table containing many large blobs, deleting a single row can take a long time (often over a minute).
I’ve reproduced this with a small test that creates a table, inserts a few rows with blobs of differing sizes, then deletes them.
The table schema is simple, just:
create table blobtest( id integer generated BY DEFAULT as identity, b blob )
and I’ve then created 7 rows with the following blob sizes : 1024 bytes, 1Mb, 10Mb, 25Mb, 50Mb, 75Mb, 100Mb.
I’ve read the blobs back, to check they have been created properly and are the correct size.
They have then been deleted using the sql statement ( “delete from blobtest where id = X” ).
If I delete the rows in the order I created them, average timings to delete a single row are:
1024 bytes: 19.5 seconds
1Mb: 16 seconds
10Mb: 18 seconds
25Mb: 15 seconds
50Mb: 17 seconds
75Mb: 10 seconds
100Mb: 1.5 seconds
If I delete them in reverse order, the average timings to delete a single row are:
100Mb: 20 seconds
75Mb: 10 seconds
50Mb: 4 seconds
25Mb: 0.3 seconds
10Mb: 0.25 seconds
1Mb: 0.02 seconds
1024 bytes: 0.005 seconds
If I create seven small blobs, delete times are all instantaneous.
It thus appears that the delete time seems to be related to the overall size of the rows in the table more than the size of the blob being removed.
I’ve run the tests a few times, and the results seem reproducible.
So, does anyone have any explanation for the performance, and any suggestions on how to work around it or fix it? It does make using large blobs quite problematic in a production environment…
I have exact the same issue you have.
I found that when I do DELETE, derby actually "read through" the large segment file completely. I use Filemon.exe to observe how it run.
My file size it 940MB, and it takes 90s to delete just a single row.
I believe that derby store the table data in a single file inside. And some how a design/implementation bug that cause it read everything rather then do it with a proper index.
I do batch delete rather to workaround this problem.
I rewrite a part of my program. It was "where id=?" in auto-commit.
Then I rewrite many thing and it now "where ID IN(?,.......?)" enclosed in a transaction.
The total time reduce to 1/1000 then it before.
I suggest that you may add a column for "mark as deleted", with a schedule that do batch actual deletion.
As far as I can tell, Derby will only store BLOBs inline with the other database data, so you end up with the BLOB split up over a ton of separate DB page files. This BLOB storage mechanism is good for ACID, and good for smaller BLOBs (say, image thumbnails), but breaks down with larger objects. According to the Derby docs, turning autocommit off when manipulating BLOBs may also improve performance, but this will only go so far.
I strongly suggest you migrate to H2 or another DBMS if good performance on large BLOBs is important, and the BLOBs must stay within the DB. You can use the SQuirrel SQL client and its DBCopy plugin to directly migrate between DBMSes (you just need to point it to the Derby/JavaDB JDBC driver and the H2 driver). I'd be glad to help with this part, since I just did it myself, and haven't been happier.
Failing this, you can move the BLOBs out of the database and into the filesystem. To do this, you would replace the BLOB column in the database with a BLOB size (if desired) and location (a URI or platform-dependent file string). When creating a new blob, you create a corresponding file in the filesystem. The location could be based off of a given directory, with the primary key appended. For example, your DB is in "DBFolder/DBName" and your blobs go in "DBFolder/DBName/Blob" and have filename "BLOB_PRIMARYKEY.bin" or somesuch. To edit or read the BLOBs, you query the DB for the location, and then do read/write to the file directly. Then you log the new file size to the DB if it changed.
I'm sure this isn't the answer you want, but for a production environment with throughput requirements I wouldn't use Java DB. MySQL is just as free and will handle your requirements a lot better. I think you are really just beating your head against a limitation of the solution you've chosen.
I generally only use Derby as a test case, and especially only when my entire DB can fit easily into memory. YMMV.
Have you tried increasing the page size of your database?
There's information about this and more in the Tuning Java DB manual which you may find useful.

Categories