How to scan hbase in a map reduce job

I need to run hourly analysis of elements that are stored in hbase, which means for each element, I need to get the previous hour's data to analyze. The only problem is it's taking me more than 1 hour to scan everything when using a for loop without mapreduce. There are over 100,000 elements we are storing data on.
The data isn't contiguous and it is in time series, coming in every 2 minutes. There is a hash to prevent scanning/writing hotspots so the key looks like this:
hash_elementName_epochTimestamp, e.g. 100_element_1234567890.
I tried to write a mapreduce program using the TableMapReduceUtil.initTableMapperJob method to run my scan. However, the TableMapReduceUtil.initTableMapperJob method takes one scan object and I can't figure out how to scan all the elements that I need without making 100,000 separate scan objects and without scanning the whole table.
Someone else said there was a spark-hbase connector library I could use to leverage Spark and store everything in memory to query quickly. The problem is this code is written in Scala, which I don't know and can't spend time to learn right now.
Is there a spark-hbase connector that's in Java? Is there a way to use the TableMapReduceUtil.initTableMapperJob method to scan HBase quickly, even with 100,000 non-contiguous scan ranges?
EDIT: Someone suggested that I should use a start and end time filter. The problem is I'm not interested in every single element. There are only certain elements I'm interested in. When I made my query via the for loop, I passed in the elements I was interested in. Now with the TableMapReduceUtil, I can only make one scan object.
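For reference, here is a minimal sketch of how the per-element row ranges for the previous hour could be computed from the key layout described above (hash_elementName_epochTimestamp). The hashPrefix function is a hypothetical stand-in, since the real hashing scheme isn't shown, and the timestamps are assumed to be epoch seconds with a fixed number of digits so that string order matches time order. A list of ranges like this is the kind of input a single scan with a multi-range row filter would need (HBase ships a MultiRowRangeFilter; check whether your version has it).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowRangeSketch {
    // ASSUMPTION: the real hash prefix scheme is not shown in the question.
    // This stand-in just buckets the element name into 0..999.
    static int hashPrefix(String element) {
        return Math.floorMod(element.hashCode(), 1000);
    }

    // Returns one {startKey, stopKey} pair per element, covering the hour
    // that ends at endEpochSeconds. The stop key is meant to be exclusive.
    // Relies on timestamps having the same digit count, so that
    // lexicographic key order matches chronological order.
    static List<String[]> previousHourRanges(List<String> elements, long endEpochSeconds) {
        long startEpochSeconds = endEpochSeconds - 3600;
        List<String[]> ranges = new ArrayList<>();
        for (String element : elements) {
            String prefix = hashPrefix(element) + "_" + element + "_";
            ranges.add(new String[] {prefix + startEpochSeconds, prefix + endEpochSeconds});
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> wanted = Arrays.asList("element", "otherElement");
        for (String[] range : previousHourRanges(wanted, 1234567890L)) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```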

Related

How to count input and output rows on the Spark SQL API from Java?

I am trying to count the number of rows that a Java process reads and writes. The process is using the SQL API dealing with Datasets of Row. Adding .count() at various points seems to slow it down a lot, even if I do a .persist() prior to those points.
I have also seen code that does a
.map(row -> {
accumulator.add(1);
return row;
}, SomeEncoder)
which works well enough but the deserialization and re-serialization of the whole row seems unnecessary and it isn't mentally automatic since one has to come up with the correct SomeEncoder at each point.
A third option is maybe to call a UDF0 that does the counting and then drop the dummy object it would return but I'm not sure if Spark would be allowed to optimize the whole code away if it can tell the UDF0 isn't changing the output.
Is there a good way of counting without deserializing the rows? Or alternatively, is there a method that does the equivalent of Java's streams' .peek() where the returned data isn't important?
EDIT: to clarify, the job isn't just counting. The counting is just for record-keeping purposes. The job is doing other things. In fact, this is a pretty generic problem, I've got lots of jobs that are doing some transformations on data and saving them somewhere, I just want to keep a running record of how many rows these jobs read and wrote.
Thank you

Parsing 20 GB input file to an ArrayList

I need to sort a 20 GB file (which consists of random numbers) in ascending order, but I don't understand which technique I should use. I tried to use an ArrayList in my Java program, but it runs out of memory. Increasing the heap size didn't work either; I guess 20 GB is too big. Can anybody guide me on how I should proceed?

You should use an external sorting algorithm; do not try to fit this in memory.
http://en.wikipedia.org/wiki/External_sorting
If you think it is too complex, try this:
1. Include the H2 database in your project.
2. Create a new on-disk database (it will be created automatically on first connection).
3. Create a simple table where the numbers will be stored.
4. Read the data number by number and insert it into the database (do not forget to commit every 1000 numbers or so).
5. Select the numbers with an ORDER BY clause :)
6. Use a JDBC ResultSet to fetch the results on the fly and write them to an output file.
H2 database is simple, works very well with Java and can be embedded in your JAR (does not need any kind of installation or setup).

You don't really need any special tools for this. This is a textbook case for external merge sort: read in one part of the large file at a time (say 100M), sort it, and write the sorted result to an external file. Read in another part, sort it, spit it back out, and repeat until there's nothing left to sort. Then read the sorted chunks back in, a smaller piece of each at a time (say 10M), and merge them into the final output. The tricky point is to merge those sorted bits together in the right way. Read the external sorting page on Wikipedia as well, as already mentioned. Also, here's an implementation in Java that does this kind of external merge sorting.
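To make the two phases concrete, here is a rough sketch of an external merge sort, not a tuned implementation. It assumes the input has one integer per line that fits in a long (the question only says "random numbers"), and the class and method names are made up for illustration. Phase 1 writes sorted chunks to temp files; phase 2 does a k-way merge with a priority queue that holds the current head of each chunk.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {
    // Phase 1: read chunkSize numbers at a time, sort in memory, write each chunk out.
    static List<Path> writeSortedChunks(Path input, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        long[] buf = new long[chunkSize];
        int n = 0;
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                String t = line.trim();
                if (t.isEmpty()) continue;
                buf[n++] = Long.parseLong(t);
                if (n == chunkSize) { chunks.add(writeChunk(buf, n)); n = 0; }
            }
        }
        if (n > 0) chunks.add(writeChunk(buf, n));
        return chunks;
    }

    static Path writeChunk(long[] buf, int n) throws IOException {
        Arrays.sort(buf, 0, n);
        Path chunk = Files.createTempFile("sort-chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(chunk)) {
            for (int i = 0; i < n; i++) { out.write(Long.toString(buf[i])); out.newLine(); }
        }
        return chunk;
    }

    // Phase 2: k-way merge. Each queue entry is {value, index of the chunk it came from}.
    static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<long[]> heads = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int i = 0; i < chunks.size(); i++) {
                BufferedReader r = Files.newBufferedReader(chunks.get(i));
                readers.add(r);
                String first = r.readLine();
                if (first != null) heads.add(new long[] {Long.parseLong(first), i});
            }
            while (!heads.isEmpty()) {
                long[] smallest = heads.poll();
                out.write(Long.toString(smallest[0]));
                out.newLine();
                // Refill the queue from the chunk the smallest value came from.
                String next = readers.get((int) smallest[1]).readLine();
                if (next != null) heads.add(new long[] {Long.parseLong(next), smallest[1]});
            }
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path c : chunks) Files.deleteIfExists(c);
        }
    }

    static void sort(Path input, Path output, int chunkSize) throws IOException {
        merge(writeSortedChunks(input, chunkSize), output);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ExternalSortSketch.java input.txt output.txt
        sort(Paths.get(args[0]), Paths.get(args[1]), 10_000_000);
    }
}
```

Only one chunk's worth of numbers is ever held in memory, plus one buffered line per chunk during the merge, so the chunk size can be chosen to fit whatever heap is available.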

Fastest way to process through million line files with timestamps

So I've got these huge text files that are filled with a single comma-delimited record per line. I need a way to process the files line by line, removing lines that meet certain criteria. Some of the removals are easy, such as when one of the fields is shorter than a certain length. The hardest criterion involves timestamps, which all of these lines have. Many records are identical except for their timestamps, and I have to remove all but one of the records that are identical and within 15 seconds of one another.
So I'm wondering if some others can come up with the best approach for this. I did come up with a small program in Java that accomplishes the task, using JodaTime for the timestamp stuff, which makes it really easy. However, the initial way I coded the program was running into OutOfMemoryError (Java heap space) errors. I refactored the code a bit and it seemed OK for the most part, but I still believe it has some memory issues, as once in a while the program just seems to hang. On top of that, it just seems to take way too long. I'm not sure if this is a memory leak issue, a poor coding issue, or something else entirely. And yes, I tried increasing the heap size significantly but was still having issues.
I will say that the program needs to be in either Perl or Java. I might be able to make a python script work too but I'm not overly familiar with python. As I said, the timestamp stuff is easiest (to me) in Java because of the JodaTime library. I'm not sure how I'd accomplish the timestamp stuff in Perl. But I'm up for learning and using whatever would work best.
I will also add that the files being read in vary tremendously in size, but some big ones are around 100 MB with something like 1.3 million records.
My code essentially reads in all the records and puts them into a HashMap, with the keys being a specific subset of the data from a record that similar records would share, that is, a subset of the record not including the timestamps, which would be different. This way you'd end up with some number of records with identical data but that occurred at different times. (So completely identical minus the timestamps.)
The value of each key, then, is a Set of all records that have the same subset of data. Then I simply iterate through the HashMap, taking each set and iterating through it. I take the first record and compare its time to all the rest to see if they're within 15 seconds. If so, the record is removed. Once that set is finished it's written out to a file, and so on until all the records have been gone through. Hopefully that makes sense.
This works but clearly the way I'm doing it is too memory intensive. Anyone have any ideas on a better way to do it? Or, a way I can do this in Perl would actually be good because trying to insert the Java program into the current implementation has caused a number of other headaches. Though perhaps that's just because of my memory issues and poor coding.
Finally, I'm not asking someone to write the program for me. Pseudo code is fine. Though if you have ideas for Perl I could use more specifics. The main thing I'm not sure how to do in Perl is the time comparison stuff. I've looked a little into Perl libraries but haven't seen anything like JodaTime (though I haven't looked much). Any thoughts or suggestions are appreciated. Thank you.

Reading all the rows in is not ideal, because you need to store the whole lot in memory.
Instead you could read line by line, writing out the records that you want to keep as you go. You could keep a cache of the rows you've hit previously, bounded to those within 15 seconds of the current line's timestamp. In very rough pseudo-code, for every line you'd read:
var line = ReadLine();
DiscardAnythingInCacheOlderThan(line.Date().Minus(15 seconds));
if (!cache.ContainsSomethingMatchingCriteria(line)) {
    // it's a line we want to keep
    WriteLine(line);
}
UpdateCache(line); // make sure we store this line so we don't write it out again.
As pointed out, this assumes that the lines are in time stamp order. If they aren't, then I'd just use UNIX sort to make it so they are, as that'll quite merrily handle extremely large files.
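In Java, the same idea might look roughly like the sketch below. The record layout is an assumption, since the question doesn't show it: here the first field is an ISO-8601 timestamp and everything after the first comma is the part that must match for two records to count as identical. It uses java.time rather than JodaTime so that it has no dependencies, and it assumes the input is already sorted by timestamp.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class DedupeSketch {
    static final Duration WINDOW = Duration.ofSeconds(15);

    // Streams lines from in to out, dropping a line when an identical record
    // (ignoring the timestamp) was seen within the last 15 seconds.
    static void dedupe(BufferedReader in, Writer out) throws IOException {
        // Record key -> time it was last seen. Entries are re-inserted on every
        // hit, so iteration order is oldest first as long as the input is sorted.
        Map<String, Instant> lastSeen = new LinkedHashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0) {               // not a record we understand: pass it through
                out.write(line);
                out.write('\n');
                continue;
            }
            // ASSUMPTION: field 0 is an ISO-8601 timestamp, the rest is the identity.
            Instant ts = Instant.parse(line.substring(0, comma));
            String key = line.substring(comma + 1);

            // Evict everything older than the window; memory stays bounded.
            Instant cutoff = ts.minus(WINDOW);
            Iterator<Instant> it = lastSeen.values().iterator();
            while (it.hasNext() && it.next().isBefore(cutoff)) {
                it.remove();
            }

            if (!lastSeen.containsKey(key)) {
                out.write(line);           // first occurrence inside the window: keep it
                out.write('\n');
            }
            lastSeen.remove(key);          // move the key to the newest position
            lastSeen.put(key, ts);
        }
    }
}
```

Like the pseudo-code, this refreshes the cached time even for lines that get dropped, so a long run of near-duplicates spaced less than 15 seconds apart collapses to its first record. If you want the window measured from the last kept record instead, only update the map when a line is written.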

You might read the file and output just the line numbers to be deleted (to be sorted and used in a separate pass.) Your hash map could then contain just the minimum data needed plus the line number. This could save a lot of memory if the data needed is small compared to the line size.

measuring statistics in java simulation

I have a group of nodes that send measurements to a bootstrap server. In the end I want the bootstrap server to sum all the measurements and write the result to a file. One way to do that is to overwrite the data in the file each time a measurement message is received (after summing up the current measurements). But this would be very inefficient. I want to store the measurement data and write it to the file only once, after the simulation is completed.
But the problem is that the simulator code that I am using is not under my control; it's a library that I am using. So I can't tell when exactly the simulation is going to end (and hence I can't tell which measurement message will be the last one).
I naively tried to store the measurement data in a static class, but this data is not accessible when the simulation terminates. Is there any other way that I can do this?
Thanks,

I would find the last message using a timeout.
Write to disk if you have new data but you haven't got anything for a while e.g. a second.
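A rough sketch of that timeout idea, assuming the measurements are plain long values that get summed (the question doesn't say what type they are) and using made-up class and method names. The current time is passed in explicitly so the logic is easy to test; startPolling shows one way to drive it from a background timer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class IdleFlushSketch {
    private final long idleMillis;
    private long sum;
    private long lastMessageAt;
    private boolean dirty;          // true when there is data not yet written

    IdleFlushSketch(long idleMillis) {
        this.idleMillis = idleMillis;
    }

    // Call this from the message handler for every measurement received.
    synchronized void onMeasurement(long value, long nowMillis) {
        sum += value;
        lastMessageAt = nowMillis;
        dirty = true;
    }

    // Writes the running sum only if there is new data and nothing has
    // arrived for idleMillis. Returns true when a write happened.
    synchronized boolean flushIfIdle(long nowMillis, Path file) throws IOException {
        if (!dirty || nowMillis - lastMessageAt < idleMillis) {
            return false;
        }
        Files.write(file, Long.toString(sum).getBytes(StandardCharsets.UTF_8));
        dirty = false;
        return true;
    }

    // Checks periodically in the background. The caller is responsible for
    // shutting the returned executor down, otherwise its thread keeps the JVM alive.
    ScheduledExecutorService startPolling(Path file) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        long period = Math.max(1, idleMillis / 4);
        timer.scheduleAtFixedRate(() -> {
            try {
                flushIfIdle(System.currentTimeMillis(), file);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, period, period, TimeUnit.MILLISECONDS);
        return timer;
    }
}
```

If more messages arrive after a flush, the next idle period simply overwrites the file with the updated sum, so the file is written once per burst rather than once per message.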

If you cannot store the data you need in the process (which it seems you can't, since the static class failed), you need to persist the data some other way. To an on-disk file is one option, and another common one would be to a database.

Need suggestion on my approach : to read a file which is being written continuously?

I have a CSV file that is being written continuously by a script. It writes a timestamp and some other data per row. I have to read the latest data first.
Currently I am using RandomAccessFile in Java to read the file in reverse. But as it's written continuously, I have to read the new data with priority. I keep track of which timestamps have already been sent and work from there. This results in unnecessary scanning operations.
Is there any better way to deal with this scenario?
Thanks in advance,

You could consider having one thread that reads new lines as they appear and pushes them onto a stack of unprocessed rows, and a second thread that pops the stack and processes the new rows in reverse order.
Depending on how long it takes to process a new row compared to how quickly they are generated, this might be sufficient. If new rows are generated faster than you can process them then this approach probably won't work - the stack will get too big and you'll run out of memory. In that case, depending on your requirements, you might be able to get away with a size-limited stack that discards old entries.
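A size-limited stack of that kind could be sketched on top of java.util.concurrent.LinkedBlockingDeque as below. The class name and capacity are made up for illustration, and the sketch assumes a single reader thread doing the pushing.

```java
import java.util.concurrent.LinkedBlockingDeque;

class LatestFirstBuffer<T> {
    private final LinkedBlockingDeque<T> deque;

    LatestFirstBuffer(int capacity) {
        this.deque = new LinkedBlockingDeque<>(capacity);
    }

    // Reader thread: put the newest row on top. When the buffer is full,
    // throw away the oldest row (the bottom of the stack) to make room.
    void push(T row) {
        while (!deque.offerFirst(row)) {
            deque.pollLast();
        }
    }

    // Processing thread: blocks until a row is available, then returns the newest one.
    T pop() throws InterruptedException {
        return deque.takeFirst();
    }

    // Non-blocking variant: returns null when the buffer is empty.
    T tryPop() {
        return deque.pollFirst();
    }

    public static void main(String[] args) throws InterruptedException {
        LatestFirstBuffer<String> buffer = new LatestFirstBuffer<>(1000);

        // Stand-in for the thread that tails the CSV file.
        Thread reader = new Thread(() -> {
            for (int i = 1; i <= 5; i++) {
                buffer.push("row " + i);
            }
        });
        reader.start();
        reader.join();

        // The processing side always sees the most recent row first.
        String row;
        while ((row = buffer.tryPop()) != null) {
            System.out.println(row);
        }
    }
}
```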

Two ideas:
Use a fixed size record format instead of CSV. Then you can tell exactly what offsets your records are at instead of having to seek around looking for newlines.
If that isn't possible, have a thread that reads items from the file and pushes them onto a stack. Another thread pops items from the stack and processes them. Because it's a stack it'll always be dealing with the most recent available item. You'll need to figure out how you want to deal with cases where the stack gets too big. Do you just want to throw away items that are too old?

If you have access to the original script, write the record to a database, in addition to the CSV file. Then you can do whatever you want with the database; access the last record, run a report, etc.

If your application is running in a Unix environment, you could run
tail -f /csv-file | custom-program
custom-program would simply accept standard input and echo that to a socket connection with your Java program.
I'm assuming that your Java program is some sort of server app that can't be started from the command line like that. If that would actually be okay, then you could replace custom-program with your Java program.

It results unnecessary scanning operations.
I presume that you are referring to the overheads of seeking to some point, and then finding the next valid CSV row start position by reading until you hit the next newline.
I can think of three ways to do this that may be more efficient than what you are currently doing:
Read the entire file and parse out the rows in the forward direction, storing the rows in memory. Then process the in-memory rows in reverse order.
Scan the file from the beginning looking for row starts, and store the row start positions in memory. Then iterate through the positions in reverse order, seeking to each one to read the corresponding row. (You can do the input more efficiently by processing multiple rows in each seek.)
Map the file into memory using a MappedByteBuffer, then you can step through the Byte buffer forwards or backwards to find the row boundaries.
The first approach requires that you can buffer the entire file in memory, but has the lowest I/O overhead because you read the file just once with a minimum number of system calls. The third approach has the same issue, though you could map an extremely large file into memory in (large) sections to reduce the memory requirements.
But ultimately, there is no simple and efficient way of reading a file backwards in Java.
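For what it's worth, the second approach (index the row starts, then seek to them in reverse) could be sketched like this. It assumes UTF-8 text with "\n" or "\r\n" row terminators, and the class and method names are made up. Because the file is still being written, the last row may be incomplete when it is read.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ReverseLinesSketch {
    // Pass 1: scan forward once and record the byte offset where each row starts.
    static List<Long> rowStarts(File file) throws IOException {
        List<Long> starts = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            long pos = 0;
            boolean atRowStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atRowStart) {
                    starts.add(pos);
                    atRowStart = false;
                }
                if (b == '\n') {
                    atRowStart = true;
                }
                pos++;
            }
        }
        return starts;
    }

    // Pass 2: walk the offsets backwards, seeking to each one and reading that row.
    static List<String> readReversed(File file) throws IOException {
        List<Long> starts = rowStarts(file);
        List<String> rows = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            // End of the indexed region; for a growing file this may cut the last row short.
            long end = raf.length();
            for (int i = starts.size() - 1; i >= 0; i--) {
                long start = starts.get(i);
                byte[] buf = new byte[(int) (end - start)];
                raf.seek(start);
                raf.readFully(buf);
                String row = new String(buf, StandardCharsets.UTF_8);
                rows.add(row.replaceAll("[\\r\\n]+$", ""));   // strip the row terminator
                end = start;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        for (String row : readReversed(new File(args[0]))) {
            System.out.println(row);
        }
    }
}
```

For the continuously growing file in the question, the offset list could be kept between runs so that only the newly appended bytes need to be indexed each time.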
