Well, I had my Java process running overnight. First of all, here is what I already have:
80 million entries (things Persons have written) and
50 million Person entries.
Now I have a CSV file that connects the two via IDs.
My first Java implementation managed about 200 entries/sec (noTx),
while my latest does ~2,000/sec (Tx).
But now, looking at the current state of the system, I still see CPU and RAM usage changing and the process is still running. When I look at the IO values, though, it's only reading.
So I was thinking that maybe the lines just contain IDs that are not in the database. Maybe! But I have a System.out.println that reports the current state every 10,000 lines, and it isn't coming up anymore, so that can't be it.
Btw I'm at line 16,777,000 right now, and I'd say it's somehow frozen: it's working really hard but getting nothing done =/
Btw2 I:
use Transactions every 100 lines
STORAGE_KEEP_OPEN=true
ENVIRONMENT_CONCURRENT=false
OIntentMassiveInsert=true
setUsingLog=false
You can find the log here https://groups.google.com/forum/#!topic/orient-database/Whedj893mIY
You need to pay attention to the magic size here: 2^24 = 16,777,216, which is exactly where the import stalled, as seen in the comments.
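For reference, here is a rough sketch of how that setup might fit together with the legacy ODatabaseDocumentTx API (the database URL, credentials, class names, and the record lookup are placeholders, not the actual code):

// Sketch only: batched transactions with OIntentMassiveInsert, matching the settings listed above.
import com.orientechnologies.orient.core.config.OGlobalConfiguration;
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;

import java.io.BufferedReader;
import java.io.FileReader;

public class CsvLinkImporter {
    public static void main(String[] args) throws Exception {
        OGlobalConfiguration.STORAGE_KEEP_OPEN.setValue(true);        // STORAGE_KEEP_OPEN=true
        OGlobalConfiguration.ENVIRONMENT_CONCURRENT.setValue(false);  // ENVIRONMENT_CONCURRENT=false

        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/data/mydb").open("admin", "admin");
        db.declareIntent(new OIntentMassiveInsert());                 // OIntentMassiveInsert

        final int batchSize = 100;   // commit every 100 lines
        long lineCount = 0;

        try (BufferedReader in = new BufferedReader(new FileReader("links.csv"))) {
            db.begin();
            db.getTransaction().setUsingLog(false);                   // setUsingLog=false

            String line;
            while ((line = in.readLine()) != null) {
                String[] ids = line.split(",");
                // ... look up the Person and the written entry by id and create the link here ...

                if (++lineCount % batchSize == 0) {
                    db.commit();
                    db.begin();
                    db.getTransaction().setUsingLog(false);
                }
                if (lineCount % 10000 == 0) {
                    System.out.println("processed " + lineCount + " lines");
                }
            }
            db.commit();
        } finally {
            db.declareIntent(null);
            db.close();
        }
    }
}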
We have this line of code in a Spark consumer program.
hbaseConnection.getValue().getConnection().getTable(someTableName).put(puts); // 1 //
This line saves to HBase.
Here puts is a collection of Put objects.
It could be quite large - 30K in size, for example.
Right after this line we have this other line of code puts.clear(); // 2 //
Now... sometimes we have a serious problem: some records are just not saved to HBase, even though they have the correct key and they are present in the puts collection. No exception is thrown while saving to HBase, either.
I wonder if clearing the puts too early could be causing the problem.
In other words, is the call at // 1 // synchronous or not?
Could it be that we are clearing the puts collection too early while the saving (from line // 1 //) is still taking place?
Why am I thinking this? Because when running the same code with less data (say 500 records), no issue occurs - the data is saved to HBase. But it's the same code?! So what could the problem be here?! The only difference I see is the size of the puts collection in the two scenarios.
So I am thinking we might be clearing the puts too early and line // 1 // might be asynchronous.
Note: This is not my code, it was written by another person.
It's been only a week or two since I started looking into it.
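For context, here is a simplified sketch of a more defensive version of that write (the names come from the snippet above; the try-with-resources and the placement of clear() are assumptions, not the real code):

// Sketch only: assumes hbaseConnection.getValue().getConnection() yields an
// org.apache.hadoop.hbase.client.Connection and someTableName is a TableName.
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;

import java.io.IOException;
import java.util.List;

public final class HBaseWriter {

    /** Writes the batch and clears it only after put() has returned without throwing. */
    static void writeBatch(Connection connection, TableName someTableName, List<Put> puts) throws IOException {
        // The Table javadoc describes put(List<Put>) as a blocking call; if that holds, clearing
        // afterwards is safe. The try-with-resources also closes the Table, flushing any
        // client-side buffering it may hold.
        try (Table table = connection.getTable(someTableName)) {
            table.put(puts);   // 1
        }
        puts.clear();          // 2: only reached if 1 completed without an exception
    }
}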
I am using Apache POI to read/write an Excel file for my company as an intern here. My program goes through the Excel file, which is a big matrix with computer names across the top row and user names down the left column: 240 computers and 342 users. sheet[computer][user] starts at 0 everywhere, and the program calls PSLoggedon for each computer, takes the username(s) currently logged on, and increments their 0, so after running it for a month it shows who is logged in the most on each computer. So far it runs in about 25 minutes, since I use a socket to check socket.connect before actually calling PSLoggedon.
Without reading or writing the Excel file at all, just making the PSLoggedon calls to each computer takes about 9 minutes, so the reading and writing apparently takes 10-15 minutes. The thing is, I call PSLoggedon on a computer, then open the Excel file to find the [x][y] spot of the [computer][user], write a +=1 to it, and then close it again. So I suppose the reason it takes this long is that it opens and closes the file so much? I could be completely wrong. But I can't think of a way to make this faster by reading/writing everything at once and only opening and closing the file once. Any ideas?
Normally Apache POI is very fast; if you are running into issues, you might need to check the points below:
POI's logging might be on; you need to turn it off.
You can add this -D option to your JVM settings to do so (see the example command line after this list):
-Dorg.apache.poi.util.POILogger=org.apache.poi.util.NullLogger
You may be setting your JVM heap too low; try increasing it.
Prefer XLS over XLSX.
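For example (the 2 GB heap and the jar name are just placeholders):

java -Xmx2g -Dorg.apache.poi.util.POILogger=org.apache.poi.util.NullLogger -jar your-app.jar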
Get HSQLDB (or another in-process database, but this is what I've used in the past). Add it to your build.
You can now create either a file-based or in-memory database (I would use file-based, as it lets you persist state between runs) simply by using JDBC. Create a table with the columns User, Computer, Count
In your reading thread(s), INSERT or UPDATE your table whenever you find a user with PSLoggedon
Once your data collection is complete, you can SELECT Computer, User, Count from Data ORDER BY Computer, User (or switch the order depending on your excel file layout), loop through the ResultSet and write the results directly.
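A minimal sketch of that flow with plain JDBC against a file-based HSQLDB (table and column names tweaked slightly to LoginCounts/ComputerName/UserName/LoginCount to avoid SQL reserved words; the JDBC URL and SA credentials are the HSQLDB defaults):

// Sketch only: file-based HSQLDB via JDBC; run the CREATE TABLE once on an empty database.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class UsageStore {
    private final Connection conn;

    public UsageStore(String dbPath) throws Exception {
        conn = DriverManager.getConnection("jdbc:hsqldb:file:" + dbPath, "SA", "");
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE LoginCounts (ComputerName VARCHAR(64), UserName VARCHAR(64), "
                     + "LoginCount INT, PRIMARY KEY (ComputerName, UserName))");
        }
    }

    /** Called from the reading thread(s) whenever PSLoggedon reports a user on a computer. */
    public synchronized void increment(String computer, String user) throws Exception {
        try (PreparedStatement up = conn.prepareStatement(
                "UPDATE LoginCounts SET LoginCount = LoginCount + 1 WHERE ComputerName = ? AND UserName = ?")) {
            up.setString(1, computer);
            up.setString(2, user);
            if (up.executeUpdate() == 0) {   // no row yet for this pair: insert it
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO LoginCounts (ComputerName, UserName, LoginCount) VALUES (?, ?, 1)")) {
                    ins.setString(1, computer);
                    ins.setString(2, user);
                    ins.executeUpdate();
                }
            }
        }
    }

    /** After collection is complete: read everything back in a stable order for the Excel write. */
    public void dump() throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT ComputerName, UserName, LoginCount FROM LoginCounts ORDER BY ComputerName, UserName")) {
            while (rs.next()) {
                // write rs.getString(1), rs.getString(2), rs.getInt(3) into the spreadsheet here
            }
        }
    }

    public void close() throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("SHUTDOWN");   // flush the file-based store cleanly
        }
    }
}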
This is an old question, but from what I see:
Since you are sampling and using Excel, is it safe to assume that consistency and atomicity isn't critical? You're just estimating fractional usage and don't care if a user logged in and logged out between observations.
Is the Excel file stored over a slow network link? Opening and closing a file 240 times could bring significant overhead. How about the following:
You only need to open the Excel file once to get the list of computers. At that time, just snapshot the entire contents of the matrix into a Map<ComputerName, Map<UserName, Count>>. Also build a List<ComputerName> and List<UserName> to remember the row/column headings. The entire spreadsheet holds fewer than 90,000 integers, so there is no need to bring in heavy database machinery.
9 minutes for 240 computers, single-threaded, is roughly 2.25 seconds per computer. Is that the expected throughput of PSLoggedOn? Can you create a thread pool and query all 240 computers at once or in a small number of rounds?
Then, parse the results, increment your map and dump it back to the Excel file. Is there a possibility that you might see new users that were not previously in the Excel? Those will need to be added to the Map and List<UserName>.
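A rough sketch of that approach (the sheet layout is assumed to be computer names across row 0 starting at column 1 and user names down column 0 starting at row 1; queryLoggedOnUsers stands in for the existing PSLoggedon/socket logic, and the file name is a placeholder):

// Sketch only: open the workbook once, query computers in parallel, write the workbook once.
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.*;
import java.util.concurrent.*;

public class LoginCounter {

    public static void main(String[] args) throws Exception {
        Workbook wb;
        try (FileInputStream in = new FileInputStream("logins.xlsx")) {
            wb = WorkbookFactory.create(in);                 // open the file once
        }
        Sheet sheet = wb.getSheetAt(0);

        // Snapshot the headings once.
        List<String> computers = new ArrayList<>();
        Row header = sheet.getRow(0);
        for (int c = 1; c < header.getLastCellNum(); c++) {
            computers.add(header.getCell(c).getStringCellValue());
        }
        Map<String, Integer> userRow = new HashMap<>();      // user name -> row index
        for (int r = 1; r <= sheet.getLastRowNum(); r++) {
            userRow.put(sheet.getRow(r).getCell(0).getStringCellValue(), r);
        }

        // Query all computers with a thread pool instead of one at a time.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        Map<String, Future<List<String>>> results = new LinkedHashMap<>();
        for (String computer : computers) {
            results.put(computer, pool.submit(() -> queryLoggedOnUsers(computer)));
        }

        // Increment the in-memory counts, then write the workbook back once.
        for (int c = 0; c < computers.size(); c++) {
            for (String user : results.get(computers.get(c)).get()) {
                Integer r = userRow.get(user);
                if (r == null) continue;                     // unknown user: would need a new row
                Cell cell = sheet.getRow(r).getCell(c + 1);
                cell.setCellValue(cell.getNumericCellValue() + 1);
            }
        }
        pool.shutdown();

        try (FileOutputStream out = new FileOutputStream("logins.xlsx")) {
            wb.write(out);                                   // write the file once
        }
        wb.close();
    }

    /** Placeholder for the existing PSLoggedon/socket.connect logic. */
    static List<String> queryLoggedOnUsers(String computer) {
        return Collections.emptyList();
    }
}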
I've inserted ~2M nodes (via the Java API) and deleted them after a day or two of usage (through Java too). Now my DB has 16k nodes, but weighs 6 GB.
Why wasn't this space freed?
What could be the cause?
The data/graph.db directory contains multiple items:
Store itself, split into multiple files
Indexes
Transaction log files
Log files (messages.log)
All your operations are stored in the transaction logs and then expire according to the keep_logical_logs setting. I'm not sure what the default value is, but I presume you might have quite some space in use there.
I'd suggest checking what is actually taking up the space.
Also, we have sometimes seen that the space in use (as reported with du, for example) differs between when Neo4j is running and when it is stopped.
In addition to Alberto's answer, the store is not compacted. It leaves the empty records for reuse, and they will stay there forever. As far as I know, there is no available tool to compact the store (I've considered writing one myself, but usually convince myself that there aren't that many use cases affected by this).
If you do have a lot of churn where you are inserting and deleting records often, it's a good idea to restart your database often so it will reuse the records that it has marked as deleted.
As Alberto mentions, one of the first things I set when I install a new Neo4j (the other being the heap size) is keep_logical_logs, to something like 1-7 days. If you let the logs grow forever (the default), they will get quite large.
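For example, with Neo4j 2.x this goes into conf/neo4j.properties (the property name and file location can differ between versions):

# keep transaction logs for a week instead of indefinitely
keep_logical_logs=7 days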
I am trying to write a huge amount of data, fetched from a MySQL DB, to CSV using Super CSV. How can I best manage the performance here? Does Super CSV impose any limits on writes?
Since you included almost no detail in your question about how you are approaching the problem, it's hard to make concrete recommendations. So, here's a general one:
Unless you are writing your file to a really slow medium (some old USB stick or something), the slowest step in your process should be reading the data from the database.
There are two general ways you can structure your program:
The bad way: Reading all the data from the database into your application's memory first and then, in a second step, writing it all in one shot to the csv file.
The right way: "Stream" the data from the db into the csv file, i.e. write the data to the csv file as it comes in to your application (record by record or batch by batch).
The idea is to set up something usually referred to as a "pipeline". Think of it like a conveyor belt in a factory: you have multiple stations in your process of assembling some widget. What you don't want is for station 1 to process all the widgets while stations 2 and 3 sit idle, then pass the whole container of widgets to station 2 to begin work while stations 1 and 3 sit idle, and so forth. Instead, station 1 should send small batches (1 at a time, or 10 at a time or so) of finished widgets to station 2 immediately, so that it can start working on them as soon as possible. The goal is to keep all stations as busy as possible at all times.
In your example, station 1 is mysql retrieving the records, station 2 is your application that forwards (and processes?) them, and station 3 is supercsv. So, simply make sure that supercsv can start working as soon as possible, rather than having to wait for mysql to finish the entire request.
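For illustration, a rough sketch of such a pipeline with plain JDBC and Super CSV's CsvListWriter (the query, column names, JDBC URL, and the MySQL streaming hint via setFetchSize(Integer.MIN_VALUE) are assumptions to adapt):

// Sketch only: streams rows from MySQL straight into the CSV instead of buffering everything in memory.
import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlToCsv {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             ICsvListWriter csv = new CsvListWriter(
                     new FileWriter("export.csv"), CsvPreference.STANDARD_PREFERENCE)) {

            // Ask the MySQL driver to stream rows instead of materialising the whole result set.
            stmt.setFetchSize(Integer.MIN_VALUE);

            csv.writeHeader("id", "name", "created_at");
            try (ResultSet rs = stmt.executeQuery("SELECT id, name, created_at FROM my_table")) {
                while (rs.next()) {
                    // Each row goes to the file as soon as it arrives from the database.
                    csv.write(rs.getLong("id"), rs.getString("name"), rs.getTimestamp("created_at"));
                }
            }
        }
    }
}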
If you do this right, you should be able to generate the csv file as quickly as mysql can throw records at you*, and then, if it's still too slow, you need to rethink your database backend.
*I haven't used supercsv yet, so I don't know how well it performs, but given how trivial its job is and how popular it is, I would find it hard to believe that it would end up performing worse (as measured in processing time per record) than mysql in this task. But this might be something worth verifying...
So I've got these huge text files that are filled with a single comma-delimited record per line. I need a way to process the files line by line, removing lines that meet certain criteria. Some of the removals are easy, such as one of the fields being less than a certain length. The hardest criterion is that these lines all have timestamps. Many records are identical except for their timestamps, and I have to remove all but one of the records that are identical and within 15 seconds of one another.
So I'm wondering if others can come up with the best approach for this. I did come up with a small program in Java that accomplishes the task, using JodaTime for the timestamp handling, which makes it really easy. However, the initial way I coded the program was running into OutOfMemoryError (heap space) errors. I refactored the code a bit and it seemed OK for the most part, but I do still believe it has some memory issues, as once in a while the program just seems to hang. It also seems to take far too long. I'm not sure if this is a memory leak issue, a poor coding issue, or something else entirely. And yes, I tried increasing the heap size significantly but was still having issues.
I will say that the program needs to be in either Perl or Java. I might be able to make a python script work too but I'm not overly familiar with python. As I said, the timestamp stuff is easiest (to me) in Java because of the JodaTime library. I'm not sure how I'd accomplish the timestamp stuff in Perl. But I'm up for learning and using whatever would work best.
I will also add that the files being read in vary tremendously in size, but some big ones are around 100 MB with something like 1.3 million records.
My code essentially reads in all the records and puts them into a HashMap, with the keys being a specific subset of the data from a record that similar records would share. So, a subset of the record not including the timestamps, which would be different. This way you end up with some number of records with identical data that occurred at different times (so completely identical minus the timestamps).
The value of each key then, is a Set of all records that have the same subset of data. Then I simply iterate through the Hashmap, taking each set and iterating through it. I take the first record and compare its times to all the rest to see if they're within 15 seconds. If so the record is removed. Once that set is finished it's written out to a file until all the records have been gone through. Hopefully that makes sense.
This works but clearly the way I'm doing it is too memory intensive. Anyone have any ideas on a better way to do it? Or, a way I can do this in Perl would actually be good because trying to insert the Java program into the current implementation has caused a number of other headaches. Though perhaps that's just because of my memory issues and poor coding.
Finally, I'm not asking someone to write the program for me. Pseudo code is fine. Though if you have ideas for Perl I could use more specifics. The main thing I'm not sure how to do in Perl is the time comparison stuff. I've looked a little into Perl libraries but haven't seen anything like JodaTime (though I haven't looked much). Any thoughts or suggestions are appreciated. Thank you.
Reading all the rows in is not ideal, because you need to store the whole lot in memory.
Instead you could read line by line, writing out the records that you want to keep as you go. You could keep a cache of the rows you've hit previously, bounded to those within 15 seconds of the current line. In very rough pseudo-code, for every line you'd read:
var line = ReadLine()
DiscardAnythingInCacheOlderThan(line.Date().Minus(15 seconds));
if (!cache.ContainsSomethingMatchingCriteria()) {
// it's a line we want to keep
WriteLine(line);
}
UpdateCache(line); // make sure we store this line so we don't write it out again.
As pointed out, this assumes that the lines are in time stamp order. If they aren't, then I'd just use UNIX sort to make it so they are, as that'll quite merrily handle extremely large files.
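As an illustration of that approach, here is a rough Java sketch of the sliding-window cache, using java.time instead of JodaTime (the column layout and the ISO timestamp format are assumptions to adapt to the real data):

// Rough sketch: keeps a 15-second window of "keys already seen" so memory stays bounded.
// Assumes: input sorted by timestamp, timestamp is the first comma-separated field (ISO-8601),
// and the de-dup key is the rest of the line.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.time.Duration;
import java.time.Instant;
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class DedupWithin15Seconds {
    public static void main(String[] args) throws Exception {
        Duration window = Duration.ofSeconds(15);
        Map<String, Instant> lastSeen = new HashMap<>();               // key -> most recent timestamp
        Deque<Map.Entry<String, Instant>> recent = new ArrayDeque<>(); // same entries, oldest first

        try (BufferedReader in = new BufferedReader(new FileReader("input.csv"));
             PrintWriter out = new PrintWriter("output.csv")) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",", 2);
                Instant ts = Instant.parse(parts[0]);                  // timestamp field
                String key = parts[1];                                 // everything but the timestamp

                // Discard anything in the cache older than (current timestamp - 15 seconds),
                // so the cache never holds more than a 15-second slice of the file.
                Instant cutoff = ts.minus(window);
                while (!recent.isEmpty() && recent.peekFirst().getValue().isBefore(cutoff)) {
                    Map.Entry<String, Instant> old = recent.pollFirst();
                    // Only drop the key if no newer line has refreshed it in the meantime.
                    if (old.getValue().equals(lastSeen.get(old.getKey()))) {
                        lastSeen.remove(old.getKey());
                    }
                }

                Instant prev = lastSeen.get(key);
                if (prev == null || Duration.between(prev, ts).compareTo(window) > 0) {
                    out.println(line);                                 // nothing identical within 15s: keep it
                }
                // Remember this line (kept or not) so later near-duplicates are dropped.
                lastSeen.put(key, ts);
                recent.addLast(new AbstractMap.SimpleEntry<>(key, ts));
            }
        }
    }
}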
You might read the file and output just the line numbers to be deleted (to be sorted and used in a separate pass). Your hash map could then contain just the minimum data needed plus the line number. This could save a lot of memory if the data needed is small compared to the line size.