How to split a big Sequence file into multiple sequence files? - java

I have a large sequence file with around 60 million entries (almost 4.5GB).
I want to split it. For example, I want to split it into three parts, each having 20 million entries. So far my code is like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable,VectorWritable> part=seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)!
Can anyone suggest a better/valid approach?

Perhaps not the exact answer you are looking for, but it might be worth trying the second overload of sequenceFile for reading, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the number of partitions.
Your code should then look like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
seqVectors.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
Another thing that may cause problems is that some SequenceFiles are not splittable.
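If the file turns out not to be splittable (or the minPartitions hint is ignored), a hedged alternative is to force a shuffle with repartition, which always yields the requested number of partitions at the cost of moving the data. A sketch reusing the variables from the question:
// Forces a shuffle so the result has exactly 3 partitions, even if the input arrives as a single split.
JavaPairRDD<IntWritable, VectorWritable> parts = seqVectors.repartition(3);
parts.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);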

Maybe I'm not understanding your question right, but why not just read your file line by line (= entry by entry?) and build your three files this way?
It would be something like this:
int i = 0;
List<PrintWriter> files = new ArrayList<PrintWriter>();
files.add(new PrintWriter("the-file-name1.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name2.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name3.txt", "UTF-8"));
for (String line : Files.readAllLines(Paths.get(fileName))) {
    files.get(i % 3).println(line);
    i++;
}
for (PrintWriter file : files) {
    file.close();
}
In this case, every third line goes into the first, the second and the third file in turn.
Another solution would be to do a binary read, if the file is not a text file, using Files.readAllBytes(Paths.get(inputFileName)) and writing into your output files with Files.write(Paths.get(output1), byteToWrite).
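For what it's worth, a minimal sketch of that byte-level split into three parts (note that Files.readAllBytes cannot handle files over 2 GB, and this ignores any record boundaries, so for a Hadoop SequenceFile it would break the format; the "input"/"output" paths are placeholders):
byte[] all = Files.readAllBytes(Paths.get("input"));      // whole file in memory (< 2 GB only)
int chunk = (all.length + 2) / 3;                         // ceiling of length / 3
for (int part = 0; part < 3; part++) {
    int from = part * chunk;
    int to = Math.min(from + chunk, all.length);
    Files.write(Paths.get("output" + (part + 1)), Arrays.copyOfRange(all, from, to));
}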
However, I do not have an answer as to why the output takes up so much more space the way you are doing it. Maybe the encoding is to blame? I think Java encodes in UTF-8 by default, and your input file might be encoded in ASCII.

Related

Compare data of each line from file1 to data from file 2

I have two large txt files, around 150 MB each. I want to read some data from each line of file1 and scan through all the lines of file2 till I find the matching data. If the matching data is not found, I want to output that line to another file.
I want the program to use as little memory as possible. Time is not a constraint.
Edit1
I have tried a couple of options
Option 1: I read file2 using BufferedReader, Scanner and Apache Commons FileUtils.lineIterator, loaded the data of file2 into a HashMap by reading each line, then read the data from file1 one line at a time and compared it with the data in the HashMap. If it didn't match, I wrote the line to file3.
Option 2: Read file2 n times, once for every record in file1, using the three readers mentioned above. After every read I had to close the file and read it again. I am wondering what's the best way. Is there any other option I can look into?
I have to make some assumptions about the file.
I am going to assume the lines are long, and you want the lines that are not the same in the 2 files.
I would read the files 4 times (2 times per file).
Of course, it's not as efficient as reading them 2 times (1 time per file), but reading them only 2 times means lots of memory is used (you would have to keep one file's worth of lines in memory).
Pseudo code for 1st read of each file:
Map<MyComparableByteArray, Long> digestMap = new HashMap<>();
try (BufferedReader br = ...)
{
    long lineNr = 0;
    String line;
    while ((line = br.readLine()) != null)
    {
        digestMap.put(createDigest(line), lineNr++);
    }
}
If the digests are different/unique, I know that the line does not occur in the other file.
If the digests are the same, we will need to check the lines and actually compare them to make sure that they are really the same - this can occur during the second read.
Now what is also important is that we need to be careful of the digest we choose.
If we choose a short digest (e.g. MD5), we might run into lots of collisions, but it is appropriate for files with short lines, and we will need to handle the collisions separately (e.g. by converting the map to a Map<digest, List<lineNr>> structure).
If we choose a long digest (e.g. SHA-512), we won't run into many collisions (it is still safer to handle them as I mentioned above), BUT we won't save as much memory unless the file lines are very long.
So the general technique is:
Read each file and generate hashes.
Compare the hashes to mark the lines that need to be compared.
Read each file again and generate the output. Recheck all collisions found by the hashes in this step.
By the way, MyComparableByteArray is a custom wrapper around a byte[], to enable it to be used as a HashMap key (i.e. by implementing equals() and hashCode()). A plain byte[] cannot be used as a key, because arrays only get identity-based equals() and hashCode(). There are 2 ways to handle this:
custom wrapper as I've mentioned - this will be more efficient than the alternative.
convert it to a String using Base64. This will make the memory usage around 2.5x worse than option 1, but does not need the custom code.
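For illustration, one possible shape for that wrapper (only an assumption of how it could look, with the digest bytes coming from something like MessageDigest):
import java.util.Arrays;

final class MyComparableByteArray {
    private final byte[] bytes;

    MyComparableByteArray(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public boolean equals(Object o) {
        // value-based comparison of the digest bytes, unlike byte[]'s identity equals
        return o instanceof MyComparableByteArray
                && Arrays.equals(bytes, ((MyComparableByteArray) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}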

Java: What's the most efficient way to read relatively large txt files and store its data?

I was supposed to write a method that reads a DNA sequence in order to test some string matching algorithms on it.
I took some existing code I use to read text files (don't really know any others):
try {
    FileReader fr = new FileReader(file);
    BufferedReader br = new BufferedReader(fr);
    while ((line = br.readLine()) != null) {
        seq += line;
    }
    br.close();
}
catch (FileNotFoundException e) { e.printStackTrace(); }
catch (IOException e) { e.printStackTrace(); }
This seems to work just fine for small text files with ~3000 characters, but it takes forever (I just cancelled it after 10 minutes) to read files containing more than 45 million characters.
Is there a more efficient way of doing this?
One thing I notice is that you are doing seq += line. seq is probably a String? If so, then you have to remember that Strings are immutable, so what you are actually doing is creating a new String every time you append a line. Please use a StringBuilder instead. Also, if possible you don't want to create the whole string first and then process it; that way you have to go over the data twice. Ideally you want to process as you read, but I don't know your situation.
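For example, the read loop from the question could look roughly like this with a StringBuilder (same file variable, just accumulating into a builder and converting once at the end):
StringBuilder seqBuilder = new StringBuilder();
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        seqBuilder.append(line);        // amortized O(1) append, no intermediate Strings
    }
} catch (IOException e) {
    e.printStackTrace();
}
String seq = seqBuilder.toString();     // build the final String once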
The main element slowing your progress is the "concatenation" of the String seq and line when you call seq += line. I use quotes around concatenation because in Java, Strings cannot be modified once they are created (i.e. they are immutable, as user1598503 mentioned). Initially this is not an issue, as the Strings are small; however, once the Strings become very long, e.g. hundreds of thousands of characters, memory must be reallocated for the new String, which takes quite a bit of time. StringBuilder will let you do these concatenations in place, meaning you will not be creating a new object every single time.
Your problem is not that the reading takes too much time, but that the concatenating takes too much time. Just to verify this I ran your code (it didn't finish) and then simply commented out the concatenation line (seq += line), and it ran in under a second. You could try using seq = seq.concat(line), since it has been reported to be quite a bit faster most of the time, but I tried that too and it still hadn't finished after 1-2 minutes (for a 9.6 MB input file). My solution would be to store your lines in an ArrayList (or a container of your choice); the ArrayList version worked in about 2-3 seconds with the same input file (so the content of your while loop would be list.add(line);). If you really, really want to store your entire file in a single String, you could do something like this (using the Scanner class):
String content = new Scanner(new File("input")).useDelimiter("\\Z").next();
^^This works in a matter of seconds as well. I should mention that "\Z" is the end-of-input delimiter, which is why it reads the whole thing in one go.

Java reading nth line

I am trying to read a specific line from a text file, however I don't want to load the file into memory (it can get really big).
I have been looking, but every example I have found requires either reading every line (this would slow my code down, as there are over 100,000 lines) or loading the whole thing into an array and getting the correct element (the file will have a lot of lines to input).
An example of what I want to do:
String line = File.getLine(5);
"code is not actual code, it is made up to show the principle of what i want"
Is there a way to do this?
-----Edit-----
I have just realized this file will be written to in between reading lines (adding to the end of the file).
Is there a way to do this?
Not unless the lines are of a fixed number of bytes each, no.
You don't have to actually keep each line in memory - but you've got to read through the whole file to get to the line you want, as otherwise you won't know where to start reading.
You have to read the file line by line. Otherwise how do you know when you have gotten to line 5 (as in your example)?
Edit:
You might also want to check out Random Access Files which could be helpful if you know how many bytes per line, as Jon Skeet has said.
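If the lines really are a fixed number of bytes each, the seek is a one-liner; a small sketch (the record length and file name are made up):
int recordLength = 80;                              // assumed fixed bytes per line, including the line terminator
int lineNumber = 5;                                 // 1-based line to fetch
try (RandomAccessFile raf = new RandomAccessFile("foo.in", "r")) {
    raf.seek((long) (lineNumber - 1) * recordLength);
    String line = raf.readLine();                   // reads up to the next line terminator
    System.out.println(line);
}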
The easiest way to do this would be to use a BufferedReader (http://docs.oracle.com/javase/1.5.0/docs/api/java/io/BufferedReader.html), because you can specify your buffer size. You could do something like:
BufferedReader in = new BufferedReader(new FileReader("foo.in"), 1024);
in.readLine();
in.readLine();
in.readLine();
in.readLine();
String line = in.readLine();
1) read a line which the user selects,
If you only need to read a user-selected line once or infrequently (or if the file is small enough), then you just have to read the file line-by-line from the start, until you get to the selected line.
If, on the other hand you need to read a user-selected line frequently, you should build an index of line numbers and offsets. So, for example, line 42 corresponds to an offset of 2347 bytes into the file. Ideally, then, you would only read the entire file once and store the index--for example, in a map, using the line numbers as keys and offsets as values.
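A sketch of building such an index in one pass with RandomAccessFile (the file name is a placeholder; getFilePointer() after each readLine() gives the offset where the next line starts):
Map<Integer, Long> index = new HashMap<>();
try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r")) {
    int lineNo = 1;
    long offset = 0;
    while (raf.readLine() != null) {
        index.put(lineNo++, offset);    // offset where this line starts
        offset = raf.getFilePointer();  // start of the following line
    }
}
// Later, e.g. to fetch line 42 without rereading everything before it:
try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r")) {
    raf.seek(index.get(42));
    String line42 = raf.readLine();
}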
2) read new lines added since the last read. I plan to read the file every 10 seconds. I have got the line count and can find out the new line numbers, but I need to read that line
For the second point, you can simply save the current offset into the file instead of saving the current line number--but it certainly wouldn't hurt to continue building the index if it will continue to provide a significant performance benefit.
Use RandomAccessFile.seek(long offset) to set the file pointer to the most recently saved offset (confirm the file is longer than the most recently saved offset first--if not, nothing new has been appended).
Use RandomAccessFile.readLine() to read a line of the file
Call RandomAccessFile.getFilePointer() to get the current offset after reading the line and optionally put(currLineNo+1, offset) into the index.
Repeat steps 2-3 until reaching the end of the file.
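Roughly, steps 1-4 could look like this, assuming savedOffset, currLineNo and the index map survive between the 10-second polls and process() stands in for whatever you do with each new line:
try (RandomAccessFile raf = new RandomAccessFile("data.txt", "r")) {
    if (raf.length() > savedOffset) {               // something was appended since the last poll
        raf.seek(savedOffset);
        String line;
        while ((line = raf.readLine()) != null) {
            long offset = raf.getFilePointer();
            index.put(currLineNo + 1, offset);      // optional: keep extending the index
            currLineNo++;
            savedOffset = offset;                   // resume here next time
            process(line);                          // placeholder for your own handling
        }
    }
}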
But don't get too carried away with performance optimizations unless the performance is already a problem or is highly likely to be a problem.
For small files:
String line = Files.readAllLines(Paths.get("file.txt")).get(n);
For large files:
String line;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
    line = lines.skip(n).findFirst().get();
}
Java 7:
String line;
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
    for (int i = 0; i < n; i++)
        br.readLine();
    line = br.readLine();
}
Source: Reading nth line from file
The only way to do this is to build an index of where each line is (you only need to record the end of each line). Without a way to randomly access a line based on such an index, you have to read every byte before that line.
BTW: Reading 100,000 lines might take only one second on a fast machine.
If performance is a big concern here and you are frequently reading random lines from a static file then you can optimize this a bit by reading through the file and building an index (basically just a long[]) of the starting offset of each line of the file.
Once you have this you know exactly where to jump to in the file and then you can read up to the next newline character to retrieve the full line.
Here is a snippet of some code I had which will read a file and write every 10th line, including the first line, to a new file (writer). You can replace the try section with whatever you want to do. To change which lines are written, just change the "0" in the condition lc.endsWith("0") to whatever you need. But if the file is being written to while you read it, this code will only see the data that is in the file when you run it.
// Count the lines first.
LineNumberReader lnr = new LineNumberReader(new FileReader(new File(file)));
lnr.skip(Long.MAX_VALUE);
int linecount = lnr.getLineNumber();
lnr.close();
// 'bufferedReader' reads the same file again; 'writer' writes to the new file
// ('outputFile' is assumed to be its path).
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
Writer writer = new BufferedWriter(new FileWriter(outputFile));
for (int i = 0; i <= linecount; i++) {
    // read lines
    String line = bufferedReader.readLine();
    if (line == null) {
        break;
    }
    String lc = String.valueOf(i);
    if (lc.endsWith("0")) {
        try {
            writer.append(line + "\n");
            writer.flush();
        } catch (Exception ee) {
            // ignore failed writes for this line
        }
    }
}
bufferedReader.close();
writer.close();

dividing input file into multiple files based on one of the column

I have a semicolon delimited input file where first column is a 3 char fixed width code, while the remaining columns are some string data.
001;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
I want to divide above file into number of files based on different values of first column.
For e.g. in above example, there are three different values in the first column, so I will divide the file into three files viz. 001.txt, 002.txt, 003.txt
The output file should contain item count as line one and data as remaining lines.
There are 5 rows with code 001, so 001.txt will be:
5
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
Similarly, the 002 file will have 4 as its first line followed by 4 lines of data, and the 003 file will have 5 as its first line followed by 5 lines of data.
What would be the most efficient way to achieve this, considering a very large input file with more than 100,000 rows?
I have written below code to read lines from the file:
try {
    FileInputStream fstream = new FileInputStream(this.inputFilePath);
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;
    while ((strLine = br.readLine()) != null) {
        String[] tokens = strLine.split(";");
    }
    in.close();
} catch (IOException e) {
    e.printStackTrace();
}
for each line
extract the chunk name, e.g. 001
look for a file named "001-tmp.txt"
if one exists, read its first line - it gives you the number of lines - then increment the value and write it back into the same file, using seek with argument 0 and then writeUTF to overwrite the string. Some string-length bookkeeping has to be applied here; leave a placeholder of 10 characters, for example.
if one does not exist, create it and write 1 as the first line, padded with spaces to 10 characters
append current line to the file
close current file
proceed with next line of source file
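A hedged sketch of that per-line update, using a fixed-width 10-character counter written with writeBytes rather than writeUTF (the helper name appendWithCount and the -tmp.txt suffix are only illustrative):
// Appends one data line to "<code>-tmp.txt" and keeps the record count in a fixed-width first line,
// so the count can be overwritten in place without shifting the rest of the file.
static void appendWithCount(String code, String dataLine) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(code + "-tmp.txt", "rw")) {
        long count = 1;
        if (raf.length() > 0) {
            count = Long.parseLong(raf.readLine().trim()) + 1;  // existing file: bump the stored count
        }
        raf.seek(0);
        raf.writeBytes(String.format("%-10d", count) + "\n");   // always 11 bytes, so the header never grows
        raf.seek(raf.length());
        raf.writeBytes(dataLine + "\n");                        // append the current line at the end
    }
}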
One of the solutions that comes to mind is to keep a Map of open files and only open every file once. But you may not be able to do this if there are many distinct codes among your roughly 100,000 rows, because no OS will allow you that many open file descriptors.
So one way is to open each file in append mode, write to it, and close it again. But because of the huge number of file open/close calls, the process may slow down. You can test it for yourself though.
If the above does not provide satisfying results, you may try a mix of approaches 1 and 2, whereby you keep only, say, 100 files open at any time and only close a file when a new file that is not already open needs to be written to.
First, create a HashMap<String, ArrayList<String>> map to collect all the data from the file.
Second, use strLine.split(";", 2) instead of strLine.split(";"). The result will be an array of length 2: the first element is the code and the second is the data.
Then, add the decoded string to the map:
ArrayList<String> list = map.get(tokens[0]);
if (list == null) {
    map.put(tokens[0], list = new ArrayList<String>());
}
list.add(tokens[1]);
At the end, scan the map.keySet() and for each key, create a file named as that key and write list's size and list's content to it.
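Putting that last step into code, something along these lines could write one file per key (file names are simply the code plus .txt, as in the question):
for (Map.Entry<String, ArrayList<String>> entry : map.entrySet()) {
    try (PrintWriter out = new PrintWriter(new FileWriter(entry.getKey() + ".txt"))) {
        out.println(entry.getValue().size());       // item count as the first line
        for (String data : entry.getValue()) {
            out.println(data);                      // the data lines, code already stripped
        }
    }
}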
For each three character code, you're going to have a list of input lines. To me the obvious solution would be to use a Map, with String keys (your three character codes) pointing to the corresponding List that contains all of the lines.
For each of those keys, you'd create a file with the relevant name, the first line would be the size of the list, and then you'd iterate over it to write the remaining lines.
I guess you are not fixed to three files, so I suggest you create a map of writers with your three-character code as key and the writer as value.
For each line you read, you select or create the required writer and write the line to it. You also need a second map to maintain the line count for each file.
Once you are done with reading the source file, you flush and close all writers and read the files one by one again. This time you just add the line count in front of the file. There is no other way but to rewrite the entire file, to my knowledge, because it is not directly possible to add anything to the beginning of a file without buffering and rewriting the whole thing. I suggest you use a temporary file for this.
This answer applies only in case your file is too large to be stored fully in memory. If storing it is possible, there are faster solutions, like storing the contents of the file fully in StringBuffer objects before writing them to files.
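For reference, a rough sketch of the writer-map plus rewrite approach described above (the input/output names, the .tmp suffix and the split(";", 2) trick are assumptions, the last one borrowed from the other answer):
Map<String, PrintWriter> writers = new HashMap<>();
Map<String, Integer> counts = new HashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] parts = line.split(";", 2);              // code + rest of the line
        PrintWriter w = writers.get(parts[0]);
        if (w == null) {
            w = new PrintWriter(new FileWriter(parts[0] + ".tmp"));
            writers.put(parts[0], w);
        }
        w.println(parts[1]);
        counts.merge(parts[0], 1, Integer::sum);          // line count per code
    }
}
for (PrintWriter w : writers.values()) {
    w.close();                                            // flush and close all temp files
}
// Second pass: rewrite each temp file with its count as the first line.
for (Map.Entry<String, Integer> e : counts.entrySet()) {
    try (BufferedReader br = new BufferedReader(new FileReader(e.getKey() + ".tmp"));
         PrintWriter out = new PrintWriter(new FileWriter(e.getKey() + ".txt"))) {
        out.println(e.getValue());
        String line;
        while ((line = br.readLine()) != null) {
            out.println(line);
        }
    }
}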

Read Specific line using SuperCSV

Is it possible to read a specific line using SuperCsv?
Suppose a .csv file contains 100 lines and I want to read line number 11.
CSV files usually contain variable-length records, which means it is impossible to "jump" to a specified record. The only solution is to sequentially read CSV records from the beginning of the file, while keeping a count, until you reach the needed record.
I have not found any special API in SuperCsv for doing this skipping of lines, so I guess you will have to manually call CsvListReader#read() method 11 times to get the line you want.
I don't know if other CSV reading libraries will have a "jump-to-line" feature, and even if they do, it is unlikely to perform any better than manually skipping to the required line, for the reason given in the first paragraph.
Here is a simple solution which you can adapt:
listReader = new CsvListReader(new InputStreamReader(new FileInputStream(CSVFILE), CHARSET), CsvPreference.TAB_PREFERENCE);
listReader.getHeader(false);
while ((listReader.read(processors)) != null) {
    if (listReader.getLineNumber() == 1) {
        System.out.println("Do whatever you need.");
    }
}
