We have this line of code in a Spark consumer program.
hbaseConnection.getValue().getConnection().getTable(someTableName).put(puts); // 1 //
This line saves data to HBase.
Here puts is a collection of Put objects.
It can be quite large, e.g. around 30K elements.
Right after this line we have this other line of code: puts.clear(); // 2 //
Now... sometimes we have a serious problem: some records are simply not saved to HBase, even though they have the correct key and are in the puts collection. No exception is thrown while saving to HBase either.
I wonder if clearing the puts too early could be causing the problem.
In other words is the call to // 1 // synchronous or not?
Could it be that we are clearing the puts collection too early while the saving (from line // 1 //) is still taking place?
Why am I thinking this? Because when the same code runs with less data (say 500 records), no issue occurs and the data is saved to HBase. But it's the same code, so what could be the problem? The only difference I see is the size of the puts collection in the two scenarios.
So I am thinking we might be clearing the puts too early and line // 1 // might be asynchronous.
Note: This is not my code, it was written by another person.
It's been only a week or two since I started looking into it.
I am trying to count the number of rows that a Java process reads and writes. The process is using the SQL API dealing with Datasets of Row. Adding .count() at various points seems to slow it down a lot, even if I do a .persist() prior to those points.
I have also seen code that does a
.map(row -> {
accumulator.add(1);
return row;
}, SomeEncoder)
which works well enough but the deserialization and re-serialization of the whole row seems unnecessary and it isn't mentally automatic since one has to come up with the correct SomeEncoder at each point.
A third option is maybe to call a UDF0 that does the counting and then drop the dummy object it would return but I'm not sure if Spark would be allowed to optimize the whole code away if it can tell the UDF0 isn't changing the output.
Is there a good way of counting without deserializing the rows? Or alternatively, is there a method that does the equivalent of Java's streams' .peek() where the returned data isn't important?
EDIT: to clarify, the job isn't just counting. The counting is just for record-keeping purposes. The job is doing other things. In fact, this is a pretty generic problem, I've got lots of jobs that are doing some transformations on data and saving them somewhere, I just want to keep a running record of how many rows these jobs read and wrote.
Thank you
Well, I had my Java process running overnight. First of all, here is what I already have.
I basically have:
80 million entries (things that Persons have written) and
50 million Person entries
Now I have a CSV file that connects the two via IDs.
My first Java implementation managed about 200 entries/sec. (noTx)
My latest one manages ~2000/sec. (Tx)
But now I'm looking at the current state of the system. I still see CPU and RAM usage changing and the process is still running, but when I look at the I/O values, it's only reading.
So I was thinking that maybe the lines just contain IDs that are not in the database. Maybe! But I have a System.out.println that shows me the current state every 10,000 lines, and it's not printing anymore. So this cannot be it.
Btw, I'm at line 16,777,000 right now, and it seems to be frozen. It's working really hard but doing nothing =/
Btw2 I:
use Transactions every 100 lines
STORAGE_KEEP_OPEN=true
ENVIRONMENT_CONCURRENT=false
OIntentMassiveInsert=true
setUsingLog=false
You can find the log here https://groups.google.com/forum/#!topic/orient-database/Whedj893mIY
You need to pay attention to the magic size: 2^24 is 16,777,216, as seen in the comments.
I am trying to write a huge amount of data, fetched from a mysql db, to CSV using supercsv. How can I handle the performance issue in a simple way? Does supercsv have any limits on writing?
Since you included almost no detail in your question about how you are approaching the problem, it's hard to make concrete recommendations. So, here's a general one:
Unless you are writing your file to a really slow medium (some old USB stick or something), the slowest step in your process should be reading the data from the database.
There are two general ways to structure your program:
The bad way: Reading all the data from the database into your application's memory first and then, in a second step, writing it all in one shot to the csv file.
The right way: "Stream" the data from the db into the csv file, i.e. write the data to the csv file as it comes in to your application (record by record or batch by batch).
The idea is to set up something usually referred to as a "pipeline". Think of it like a conveyor belt in a factory: You have multiple steps in your process of assembling some widget. What you don't want to do is have station 1 process all widgets while stations 2 and 3 sit idle, and then pass the whole container of widgets to station 2 to begin work while stations 1 and 3 sit idle, and so forth. Instead, station 1 needs to send small batches (1 at a time or 10 at a time or so) of finished widgets to station 2 immediately so that it can start working on them as soon as possible. The goal is to keep all stations as busy as possible at all times.
In your example, station 1 is mysql retrieving the records, station 2 is your application that forwards (and processes?) them, and station 3 is supercsv. So, simply make sure that supercsv can start working as soon as possible, rather than having to wait for mysql to finish the entire request.
If you do this right, you should be able to generate the csv file as quickly as mysql can throw records at you*, and then, if it's still too slow, you need to rethink your database backend.
*I haven't used supercsv yet, so I don't know how well it performs, but given how trivial its job is and how popular it is, I would find it hard to believe that it would end up performing less well (as measured in processing time needed for one record) than mysql in this task. But this might be something that is worth verifying...
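As a rough illustration of the "right way" above, here is a minimal sketch that writes each record the moment it arrives instead of collecting everything first. The names are hypothetical: the Iterator stands in for whatever hands you rows from mysql (for example a loop over a JDBC ResultSet), and the plain comma join stands in for supercsv, which would also take care of quoting and escaping.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

final class StreamingCsvExport {

    // Writes every record as soon as the source yields it, so memory use stays
    // constant no matter how many records there are. Returns the record count.
    static long export(Iterator<String[]> records, Writer out) throws IOException {
        BufferedWriter writer = new BufferedWriter(out);
        long count = 0;
        while (records.hasNext()) {
            // Naive join: assumes fields contain no commas, quotes or newlines.
            writer.write(String.join(",", records.next()));
            writer.write('\n');
            count++;
        }
        writer.flush();
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data source; a real one would pull rows from the database lazily.
        List<String[]> rows = List.of(new String[] {"1", "alice"}, new String[] {"2", "bob"});
        StringWriter out = new StringWriter();
        long written = export(rows.iterator(), out);
        System.out.println(written + " records written");
        System.out.print(out);
    }
}
```

The point of the design is that nothing in export holds more than one record at a time; whether the database driver itself streams or buffers the result set is a separate setting worth checking.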
So, I have been playing with Cassandra, and have set up a cluster with three nodes. I am trying to figure out how redundancy works with ConsistencyLevels. Currently, I am writing data with ConsistencyLevel.ALL and am reading data with ConsistencyLevel.ONE. From what I have been reading, this seems to make sense. I have three Cassandra nodes, and I want to write to all three of them. I only care about reading from one of them, so I will take the first response. To test this, I have written a bunch of data (again, with ConsistencyLevel.ALL). I then kill one of my nodes (not the "seed" or "listen_address" machine).
When I then try to read, I expect, maybe after some delay, to get my data back. Initially, I get a TimeoutException... which I expect. This is what one gets when Cassandra is trying to deal with an unexpected node loss, right? After about 20 seconds, I try again, and now am getting an UnavailableException, which is described as "Not all the replicas required could be created and/or read".
Well, I don't care about all the replicas... just one (as in ConsistencyLevel.ONE on my get statement), right?
Am I missing the ConsistencyLevel point here? How can I configure this to still get my information if a node dies?
Thanks
It sounds like you have Replication Factor (RF) set to 1, meaning only one node holds any given row. Thus, when you take a node down, no matter what consistency level you use, you won't be able to read or write 1/3 of your data. Your expectations match what should happen with RF = 3.
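To make the arithmetic concrete, here is a small model of the rule (not the Cassandra driver API; the class and method names are made up for illustration). A request can only be served if the number of live replicas for that row is at least the number the consistency level requires, and the number of replicas per row is the replication factor, not the number of nodes in the cluster.

```java
final class ReplicaAvailability {
    enum Level { ONE, QUORUM, ALL }

    // Number of replicas that must respond for the given consistency level.
    static int required(Level level, int replicationFactor) {
        switch (level) {
            case ONE:
                return 1;
            case QUORUM:
                return replicationFactor / 2 + 1;
            default: // ALL
                return replicationFactor;
        }
    }

    // True if enough replicas of a row are alive to satisfy the request.
    static boolean canServe(Level level, int replicationFactor, int liveReplicas) {
        return liveReplicas >= required(level, replicationFactor);
    }

    public static void main(String[] args) {
        // RF = 1: the only replica of a row lives on the node that was killed.
        System.out.println(canServe(Level.ONE, 1, 0)); // false
        // RF = 3: two replicas of every row survive the loss of one node.
        System.out.println(canServe(Level.ONE, 3, 2)); // true
        System.out.println(canServe(Level.ALL, 3, 2)); // false
    }
}
```

The last line also shows the other side of the trade-off: with RF = 3 and one node down, reads at ONE keep working, but writes at ALL will fail until the node is back.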
I have one CSV file, which is being written continuously by a script. It writes a timestamp and some other data per row. I have to read the latest data first.
Currently I am using RandomAccessFile in Java to read the file in reverse. But as it is written continuously, I have to read the new data with priority. I keep track of which timestamps have already been sent and work from there. It results in unnecessary scanning operations.
Is there any better way to deal with this scenario?
Thanks in advance,
You could consider having one thread that reads new lines as they appear and pushes them onto a stack of unprocessed rows, and a second thread that pops the stack and processes the new rows in reverse order.
Depending on how long it takes to process a new row compared to how quickly they are generated, this might be sufficient. If new rows are generated faster than you can process them then this approach probably won't work - the stack will get too big and you'll run out of memory. In that case, depending on your requirements, you might be able to get away with a size-limited stack that discards old entries.
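A minimal sketch of such a size-limited stack, assuming a single reader thread and using java.util.concurrent.LinkedBlockingDeque (the class name BoundedRowStack is made up for illustration). When the stack is full, the oldest row is thrown away to make room for the newest one.

```java
import java.util.concurrent.LinkedBlockingDeque;

final class BoundedRowStack {
    private final LinkedBlockingDeque<String> deque;

    BoundedRowStack(int capacity) {
        this.deque = new LinkedBlockingDeque<>(capacity);
    }

    // Called by the reader thread: push the newest row on top.
    // offerFirst returns false while the deque is full, so drop from the bottom.
    void push(String row) {
        while (!deque.offerFirst(row)) {
            deque.pollLast(); // discard the oldest unprocessed row
        }
    }

    // Called by the processing thread: blocks until a row exists, newest first.
    String pop() throws InterruptedException {
        return deque.takeFirst();
    }

    int size() {
        return deque.size();
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedRowStack stack = new BoundedRowStack(3);
        for (String row : new String[] {"a", "b", "c", "d"}) {
            stack.push(row); // "a" is discarded when "d" arrives
        }
        System.out.println(stack.pop()); // d
        System.out.println(stack.pop()); // c
        System.out.println(stack.pop()); // b
    }
}
```

Whether silently dropping old rows is acceptable depends on your requirements; if every row must eventually be processed, the dropped rows would have to be picked up again from the file later.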
Two ideas:
Use a fixed size record format instead of CSV. Then you can tell exactly what offsets your records are at instead of having to seek around looking for newlines.
If that isn't possible, have a thread that reads items from the file and pushes them onto a stack. Another thread pops items from the stack and processes them. Because it's a stack it'll always be dealing with the most recent available item. You'll need to figure out how you want to deal with cases where the stack gets too big. Do you just want to throw away items that are too old?
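The first idea can be sketched as follows, assuming the writing script can be changed to pad every record to a fixed width. RECORD_SIZE and the space padding are assumptions chosen for the example, not something from the question. With a fixed width, record i always starts at byte i * RECORD_SIZE, so the newest record can be read with a single seek.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FixedRecordReader {
    static final int RECORD_SIZE = 32; // hypothetical width in bytes, space padded

    // Pads a value to exactly RECORD_SIZE bytes (assumes shorter, ASCII-only input).
    static byte[] encode(String value) {
        String padded = String.format("%-" + RECORD_SIZE + "s", value);
        return padded.getBytes(StandardCharsets.US_ASCII);
    }

    // Integer division ignores a trailing record that is only partly written.
    static long recordCount(RandomAccessFile file) throws IOException {
        return file.length() / RECORD_SIZE;
    }

    // Jumps straight to record `index`; no scanning for newlines is needed.
    static String read(RandomAccessFile file, long index) throws IOException {
        byte[] buffer = new byte[RECORD_SIZE];
        file.seek(index * RECORD_SIZE);
        file.readFully(buffer);
        return new String(buffer, StandardCharsets.US_ASCII).trim();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("records", ".dat");
        try (OutputStream out = Files.newOutputStream(path)) {
            out.write(encode("1000,first"));
            out.write(encode("1001,second"));
            out.write(encode("1002,third"));
        }
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            // Walk from the newest record back to the oldest.
            for (long i = recordCount(file) - 1; i >= 0; i--) {
                System.out.println(read(file, i));
            }
        }
        Files.delete(path);
    }
}
```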
If you have access to the original script, write the record to a database, in addition to the CSV file. Then you can do whatever you want with the database; access the last record, run a report, etc.
If your application is running in a Unix environment, you could run
tail -f /csv-file | custom-program
custom-program would simply accept standard input and echo that to a socket connection with your Java program.
I'm assuming that your Java program is some sort of server app that can't be started from the command line like that. If that would actually be okay, then you could replace custom-program with your Java program.
It results in unnecessary scanning operations.
I presume that you are referring to the overheads of seeking to some point, and then finding the next valid CSV row start position by reading until you hit the next newline.
I can think of three ways to do this that may be more efficient than what you are currently doing:
Read the entire file and parse out the rows in the forwards direction, storing them in memory. Then process the in-memory rows in reverse order.
Scan the file from the beginning looking for row starts, and store the row start positions in memory. Then iterate through the positions in reverse order, seeking to each one to read the corresponding row. (You can do the input more efficiently by processing multiple rows in each seek.)
Map the file into memory using a MappedByteBuffer, then you can step through the Byte buffer forwards or backwards to find the row boundaries.
The first approach requires that you can buffer the entire file in memory, but has the lowest I/O overhead because you read the file just once with a minimum number of system calls. The third approach has the same issue, though you could map an extremely large file into memory in (large) sections to reduce the memory requirements.
But ultimately, there is no simple and efficient way of reading a file backwards in Java.
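For what it's worth, the second approach above might look roughly like this. It is a sketch under simplifying assumptions: rows end with a newline, the content is UTF-8, and each row is read with its own seek rather than several rows per seek. The class name is made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class ReverseLineReader {

    // One forward scan. Returns the byte offset of every row start, followed by
    // the number of bytes scanned, so row i spans [result[i], result[i + 1]).
    static List<Long> rowBoundaries(Path file) throws IOException {
        List<Long> boundaries = new ArrayList<>();
        long position = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            boolean atRowStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atRowStart) {
                    boundaries.add(position);
                    atRowStart = false;
                }
                if (b == '\n') {
                    atRowStart = true;
                }
                position++;
            }
        }
        boundaries.add(position); // end of the data seen by this scan
        return boundaries;
    }

    // Seeks to each recorded row start, newest row first.
    static List<String> readReversed(Path file) throws IOException {
        List<Long> boundaries = rowBoundaries(file);
        List<String> rows = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            for (int i = boundaries.size() - 2; i >= 0; i--) {
                long start = boundaries.get(i);
                byte[] buffer = new byte[(int) (boundaries.get(i + 1) - start)];
                raf.seek(start);
                raf.readFully(buffer);
                // stripTrailing removes the line terminator (and other trailing whitespace).
                rows.add(new String(buffer, StandardCharsets.UTF_8).stripTrailing());
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("rows", ".csv");
        Files.writeString(path, "1,a\n2,b\n3,c\n");
        System.out.println(readReversed(path)); // [3,c, 2,b, 1,a]
        Files.delete(path);
    }
}
```

Because the end offset comes from the scan itself, rows appended while the method runs are simply left for the next pass. To avoid rescanning from the beginning each time, you could keep the last boundary and start the next scan there.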