I have a semicolon-delimited input file where the first column is a 3-character fixed-width code and the remaining columns are string data.
001;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
I want to divide the above file into several files based on the distinct values in the first column.
For example, there are three different values in the first column above, so the file will be split into three files: 001.txt, 002.txt, 003.txt.
Each output file should contain the item count as its first line and the data as the remaining lines.
There are five 001 rows, so 001.txt will be:
5
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
Similarly, the 002 file will have 4 as its first line followed by 4 lines of data, and the 003 file will have 5 as its first line followed by five lines of data.
What would be the most efficient way to achieve this, considering a very large input file with more than 100,000 rows?
I have written below code to read lines from the file:
try {
    FileInputStream fstream = new FileInputStream(this.inputFilePath);
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String strLine;
    while ((strLine = br.readLine()) != null) {
        String[] tokens = strLine.split(";");
        // process tokens here
    }
    br.close();
} catch (IOException e) {
    e.printStackTrace();
}
for each line
extract the chunk name, e.g. 001
look for a file named "001-tmp.txt"
if one exists, read its first line - it gives you the number of lines; increment the value, seek to offset 0, and overwrite that first line (e.g. with writeUTF or writeBytes). Some length handling is needed so the rewritten count never shifts the rest of the file - reserve a fixed width of, say, 10 characters for it.
if one does not exist, create it and write 1 as the first line, padded to 10 characters
append current line to the file
close current file
proceed with next line of source file
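A rough sketch of that per-line update, assuming the count always fits in 10 characters; the helper name appendLine is made up, and writeBytes is used rather than writeUTF because writeUTF adds a two-byte length prefix that would complicate the fixed-width header:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

static void appendLine(String code, String data) throws IOException {
    File f = new File(code + "-tmp.txt");
    try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
        long count;
        if (raf.length() == 0) {
            count = 1;                                              // brand new chunk file
        } else {
            count = Long.parseLong(raf.readLine().trim()) + 1;      // first line holds the count
        }
        raf.seek(0);
        raf.writeBytes(String.format("%-10d", count) + "\n");       // overwrite the fixed-width count
        raf.seek(raf.length());                                     // jump to the end of the file
        raf.writeBytes(data + "\n");                                // append the data line
    }
}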
One of the solutions that comes to mind is to keep a Map of writers and only open every file once. But you won't be able to do this because you have around 100,000 rows, so no OS will allow you that many open file descriptors.
So one way is to open the file in append mode, write to it, and close it again for every line. But because of the huge number of open/close calls, the process may slow down. You can test it for yourself, though.
If that does not give satisfactory results, you may try a mix of approaches 1 and 2, where you keep at most 100 files open at any time and only close a file when a new file that is not already open needs to be written to.
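One way to sketch that hybrid, assuming Java 8+ and using a LinkedHashMap in access order as a small LRU cache of writers; the cap of 100, the class name, and the file naming are illustrative:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class SplitWriters {
    private static final int MAX_OPEN = 100;
    private final Map<String, BufferedWriter> writers =
            new LinkedHashMap<String, BufferedWriter>(128, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, BufferedWriter> eldest) {
                    if (size() > MAX_OPEN) {                    // keep at most 100 writers open
                        try {
                            eldest.getValue().close();          // close the least recently used one
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                        return true;
                    }
                    return false;
                }
            };

    void write(String code, String data) throws IOException {
        BufferedWriter w = writers.get(code);
        if (w == null) {
            w = new BufferedWriter(new FileWriter(code + "-tmp.txt", true)); // append mode, so reopened files keep earlier content
            writers.put(code, w);
        }
        w.write(data);
        w.newLine();
    }
}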
First, create a HashMap<String, ArrayList<String>> map to collect all the data from the file.
Second, use strLine.split(";", 2) instead of strLine.split(";"). The result will be an array of length 2: the first element is the code and the second is the data.
Then, add decoded string to the map:
ArrayList<String> list = map.get(tokens[0]);
if (list == null) {
    map.put(tokens[0], list = new ArrayList<String>());
}
list.add(tokens[1]);
At the end, scan map.keySet() and, for each key, create a file named after that key and write the list's size followed by the list's contents to it.
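A minimal sketch of that final write-out step, assuming java.nio is available and UTF-8 output; the ".txt" naming follows the question:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

for (Map.Entry<String, ArrayList<String>> entry : map.entrySet()) {
    List<String> lines = new ArrayList<>();
    lines.add(String.valueOf(entry.getValue().size()));   // item count goes first
    lines.addAll(entry.getValue());                        // then the data lines
    Files.write(Paths.get(entry.getKey() + ".txt"), lines, StandardCharsets.UTF_8);
}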
For each three character code, you're going to have a list of input lines. To me the obvious solution would be to use a Map, with String keys (your three character codes) pointing to the corresponding List that contains all of the lines.
For each of those keys, you'd create a file with the relevant name, the first line would be the size of the list, and then you'd iterate over it to write the remaining lines.
I guess you are not fixed to three files, so I suggest you create a map of writers with the three-character code as key and the writer as value.
For each line you read, you select or create the required writer and write the line to it. You also need a second map to maintain the line counts for all files.
Once you are done reading the source file, you flush and close all writers and read the files one by one again. This time you just add the line count in front of the file. To my knowledge there is no other way: it is not possible to prepend anything to the beginning of a file without rewriting the entire file. I suggest you use a temporary file for this.
This answer applies only in case your file is too large to be stored fully in memory. If storing is possible, there are faster solutions, like collecting the contents fully in StringBuffer objects before writing them to files.
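A possible sketch of that second pass, streaming through a temporary file so nothing large is held in memory; the method name prependCount and the UTF-8 charset are assumptions:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

static void prependCount(Path dataFile, long count) throws IOException {
    Path tmp = Files.createTempFile(dataFile.toAbsolutePath().getParent(), "prepend", ".tmp");
    try (BufferedReader in = Files.newBufferedReader(dataFile, StandardCharsets.UTF_8);
         BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
        out.write(Long.toString(count));         // the line count goes first
        out.newLine();
        String line;
        while ((line = in.readLine()) != null) { // then copy the data lines unchanged
            out.write(line);
            out.newLine();
        }
    }
    Files.move(tmp, dataFile, StandardCopyOption.REPLACE_EXISTING); // swap the rewritten file in
}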
I have two large txt files, around 150 MB each. I want to read some data from each line of file1 and scan through all the lines of file2 until I find matching data. If matching data is not found, I want to output that line to another file.
I want the program to use as little memory as possible. Time is not a constraint.
Edit1
I have tried a couple of options:
Option 1: I read file2 using BufferedReader, Scanner, and Apache Commons FileUtils.lineIterator, loading the data of file2 into a HashMap line by line. I then read the data from file1 one line at a time and compared it with the data in the HashMap. If it didn't match, I wrote the line to file3.
Option 2: Read file2 n times, once for every record in file1, using the three readers mentioned above. After every read I had to close the file and read it again. I am wondering what the best way is. Is there any other option I can look into?
I have to make some assumptions about the file.
I am going to assume the lines are long, and you want the lines that are not the same in the 2 files.
I would read the files 4 times (2 times per file).
Of course, it's not as efficient as reading it 2 times (1 time per file), but reading it 2 times means lots of memory is used.
Pseudo code for 1st read of each file:
Map<MyComparableByteArray, Long> digestMap = new HashMap<>();
try (BufferedReader br = ...)
{
    long lineNr = 0;
    String line;
    while ((line = br.readLine()) != null)
    {
        digestMap.put(createDigest(line), lineNr++);   // remember which line each digest came from
    }
}
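One way createDigest could look, assuming SHA-256 via java.security.MessageDigest (the algorithm choice is only an example; the trade-offs are discussed below, and MyComparableByteArray is the byte[] wrapper described further down):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

static MyComparableByteArray createDigest(String line) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");             // hash the whole line
    return new MyComparableByteArray(md.digest(line.getBytes(StandardCharsets.UTF_8)));
}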
If the digests are different/unique, I know that the line does not occur in the other file.
If the digests are the same, we will need to check the lines and actually compare them to make sure that they are really the same - this can occur during the second read.
Now what is also important is that we need to be careful of the digest we choose.
If we choose a short digest (e.g. MD5), we might run into lots of collisions, but it is appropriate for files with short lines, and we will need to handle the collisions separately (i.e. convert the map to a Map<digest, List> structure).
If we choose a long digest (e.g. SHA-512), we won't run into many collisions (it is still safer to handle them as mentioned above), but we won't save as much memory unless the file lines are very long.
So the general technique is:
Read each file and generate hashes.
Compare the hashes to mark the lines that need to be compared.
Read each file again and generate the output. Recheck all collisions found by the hashes in this step.
By the way, MyComparableByteArray is a custom wrapper around a byte[] that enables it to be used as a HashMap key (i.e. by implementing the equals() and hashCode() methods). A raw byte[] cannot be used as a key, because arrays only have identity-based equals() and hashCode(). There are two ways to handle this (a sketch of the wrapper follows the list):
a custom wrapper as I've mentioned - this will be more efficient than the alternative.
convert it to a String using Base64. This makes the memory usage around 2.5x worse than option 1, but needs no custom code.
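A minimal sketch of such a wrapper; the class name follows the answer, and the implementation details are an assumption:
import java.util.Arrays;

final class MyComparableByteArray {
    private final byte[] bytes;

    MyComparableByteArray(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MyComparableByteArray
                && Arrays.equals(bytes, ((MyComparableByteArray) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);   // content-based, unlike byte[]'s identity hash
    }
}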
I have a large sequence file with around 60 million entries (almost 4.5GB).
I want to split it. For example, I want to split it into three parts, each having 20 million entries. So far my code is like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable,VectorWritable> part=seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)!
Can anyone suggest a better/valid approach?
Perhaps not the exact answer you are looking for, but it might be worth trying the second method for sequenceFile reading, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the partitions.
Your code should then look like this:
//Read from sequence file
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
seqVectors.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
Another thing that may cause problems is that some SequenceFiles are not splittable.
Maybe I'm not understanding your question right, but why not just read your file line by line (= entry by entry?) and build your three files this way?
It would be something like this:
int i = 0;
List<PrintWriter> files = new ArrayList<PrintWriter>();
files.add(new PrintWriter("the-file-name1.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name2.txt", "UTF-8"));
files.add(new PrintWriter("the-file-name3.txt", "UTF-8"));
for (String line : Files.readAllLines(Paths.get(fileName))) {
    files.get(i % 3).println(line);   // round-robin across the three files
    i++;
}
In this case, the lines go into the first, the second, and the third file in turn, one after another.
Another solution, if the file is not a text file, would be a binary read using Files.readAllBytes(Paths.get(inputFileName)) and writing into your output files with Files.write(Paths.get(output1), bytesToWrite).
However, I do not have an answer to why the output takes so much more space the way you are doing it. Maybe the encoding is to blame? I think Java encodes in UTF-8 by default and your input file might be encoded in ASCII.
I am trying to read a specific line from a text file, but I don't want to load the whole file into memory (it can get really big).
I have been looking, but every example I have found requires either reading every line (this would slow my code down as there are over 100,000 lines) or loading the whole thing into an array and getting the correct element (the file will have a lot of lines).
An example of what I want to do:
String line = File.getLine(5);
"code is not actual code, it is made up to show the principle of what i want"
Is there a way to do this?
-----Edit-----
I have just realized this file will also be written to in between reads (lines are added to the end of the file).
Is there a way to do this?
Not unless the lines are of a fixed number of bytes each, no.
You don't have to actually keep each line in memory - but you've got to read through the whole file to get to the line you want, as otherwise you won't know where to start reading.
You have to read the file line by line. Otherwise how do you know when you have gotten to line 5 (as in your example)?
Edit:
You might also want to check out random access files, which could be helpful if you know how many bytes are in each line, as Jon Skeet has said.
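A tiny sketch of that fixed-width case, assuming every line occupies exactly recordLen bytes including the line terminator; the method name is illustrative:
import java.io.IOException;
import java.io.RandomAccessFile;

static String readFixedWidthLine(String path, int recordLen, long lineNo) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        raf.seek(lineNo * recordLen);   // 0-based line number: skip lineNo full records
        return raf.readLine();
    }
}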
The easiest way to do this would be to use a BufferedReader (http://docs.oracle.com/javase/1.5.0/docs/api/java/io/BufferedReader.html), because you can specify your buffer size. You could do something like:
BufferedReader in = new BufferedReader(new FileReader("foo.in"), 1024);
in.readLine();
in.readLine();
in.readLine();
in.readLine();
String line = in.readLine();
1) read a line which the user selects,
If you only need to read a user-selected line once or infrequently (or if the file is small enough), then you just have to read the file line-by-line from the start, until you get to the selected line.
If, on the other hand you need to read a user-selected line frequently, you should build an index of line numbers and offsets. So, for example, line 42 corresponds to an offset of 2347 bytes into the file. Ideally, then, you would only read the entire file once and store the index--for example, in a map, using the line numbers as keys and offsets as values.
2) read new lines added since the last read. I plan to read the file
every 10 seconds. I have got the line count and can find out the new
line numbers, but I need to read those lines.
For the second point, you can simply save the current offset into the file instead of saving the current line number--but it certainly wouldn't hurt to continue building the index if it will continue to provide a significant performance benefit.
1. Use RandomAccessFile.seek(long offset) to set the file pointer to the most recently saved offset (first confirm the file is longer than that offset; if not, nothing new has been appended).
2. Use RandomAccessFile.readLine() to read a line of the file.
3. Call RandomAccessFile.getFilePointer() to get the current offset after reading the line, and optionally put(currLineNo + 1, offset) into the index.
4. Repeat steps 2-3 until reaching the end of the file.
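A rough sketch of those steps, assuming the index maps line numbers to byte offsets; the class name and the way the loop is driven (e.g. from a 10-second timer) are illustrative:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;
import java.util.TreeMap;

class TailReader {
    private long savedOffset = 0;                           // where the last read stopped
    private long nextLineNo = 1;
    private final Map<Long, Long> index = new TreeMap<>();  // line number -> byte offset

    void readNewLines(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (raf.length() <= savedOffset) {
                return;                                 // nothing new has been appended
            }
            raf.seek(savedOffset);                      // step 1: jump past what was read before
            String line;
            while ((line = raf.readLine()) != null) {   // step 2: read the next line
                index.put(nextLineNo++, savedOffset);   // step 3: remember where it started
                savedOffset = raf.getFilePointer();     //         and where the next one starts
                // process `line` here
            }
        }
    }
}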
But don't get too carried away with performance optimizations unless the performance is already a problem or is highly likely to be a problem.
For small files:
String line = Files.readAllLines(Paths.get("file.txt")).get(n);
For large files:
String line;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
line = lines.skip(n).findFirst().get();
}
Java 7:
String line;
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
for (int i = 0; i < n; i++)
br.readLine();
line = br.readLine();
}
Source: Reading nth line from file
The only way to do this is to build an index of where each line is (you only need to record the end of each line). Without a way to randomly access a line based on an index from the start, you have to read every byte before that line.
BTW: Reading 100,000 lines might take only one second on a fast machine.
If performance is a big concern here and you are frequently reading random lines from a static file then you can optimize this a bit by reading through the file and building an index (basically just a long[]) of the starting offset of each line of the file.
Once you have this you know exactly where to jump to in the file and then you can read up to the next newline character to retrieve the full line.
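A minimal sketch of that approach, assuming a one-byte-per-character encoding (RandomAccessFile.readLine does not decode multi-byte charsets); the method names are hypothetical:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

static long[] buildLineIndex(String path) throws IOException {
    List<Long> offsets = new ArrayList<>();
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        offsets.add(0L);                              // line 0 starts at offset 0
        while (raf.readLine() != null) {
            offsets.add(raf.getFilePointer());        // start of the following line
        }
    }
    offsets.remove(offsets.size() - 1);               // drop the trailing end-of-file offset
    return offsets.stream().mapToLong(Long::longValue).toArray();
}

static String readLineAt(String path, long[] index, int lineNo) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
        raf.seek(index[lineNo]);                      // jump straight to that line
        return raf.readLine();
    }
}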
Here is a snippet of some code I had which reads a file and writes every 10th line, including the first line, to a new file (writer). You can replace the try section with whatever you want to do. To change which lines are copied, change the "0" in the if statement lc.endsWith("0"). If the file is being written to while you read it, this code will only see the data that is in the file when you run it.
LineNumberReader lnr = new LineNumberReader(new FileReader(new File(file)));
lnr.skip(Long.MAX_VALUE);
int linecount = lnr.getLineNumber();
lnr.close();

BufferedReader bufferedReader = new BufferedReader(new FileReader(new File(file)));
BufferedWriter writer = new BufferedWriter(new FileWriter(new File(outputFile))); // output path is illustrative
for (int i = 0; i < linecount; i++) {
    String line = bufferedReader.readLine();       // read the next line
    String lc = String.valueOf(i);
    if (lc.endsWith("0")) {                        // every 10th line: 0, 10, 20, ...
        try {
            writer.append(line + "\n");
            writer.flush();
        } catch (Exception ee) {
            ee.printStackTrace();
        }
    }
}
bufferedReader.close();
writer.close();
Is it possible to read a specific line using SuperCsv?
Suppose a .csv file contains 100 lines and I want to read line number 11.
CSV files usually contain variable-length records, which means it is impossible to "jump" to a specified record. The only solution is to sequentially read CSV records from the beginning of the file, while keeping a count, until you reach the needed record.
I have not found any special API in SuperCsv for doing this skipping of lines, so I guess you will have to manually call CsvListReader#read() method 11 times to get the line you want.
I don't know if other CSV reading libraries will have a "jump-to-line" feature, and even if they do, it is unlikely to perform any better than manually skipping to the required line, for the reason given in the first paragraph.
Here is a simple solution which you can adapt:
listReader = new CsvListReader(new InputStreamReader(new FileInputStream(CSVFILE), CHARSET), CsvPreference.TAB_PREFERENCE);
listReader.getHeader(false);
while (listReader.read(processors) != null) {
    if (listReader.getLineNumber() == 11) {   // the line you want to read
        System.out.println("Do whatever you need.");
    }
}
I'm trying to create a random access file in Java.
I write something on a new line.
How can I return the address of that line in Java?
Also, I'm a bit confused with RAFs.
For example, I have a file that consists of the following entries in alphabetical order:
George 10 10 8
Mary 9 10 10
Nick 8 8 8
Nickolas 10 10 9
I would like to return the grades of Nickolas.
How can I declare that with a RAF?
Is there any method that can read("Nickolas") and return the line to me?
Thanks in advance.
Random access files usually contain binary data rather than ASCII (e.g. plain text) data. The example you are showing is ASCII.
Since the data is ASCII, it's not as easy to seek to specific places in the file. In fact, the general approach to get the grades for Nickolas would be to read the file line by line, parse each line into columns, and compare the first column with "Nickolas".
For example,
BufferedReader in = new BufferedReader(new FileReader("grades.txt"));
String line = in.readLine();
while (null != line) {
    String[] columns = line.split(" ");
    if (columns[0].equals("Nickolas")) {
        System.out.println("I found the line! " + line);
    }
    line = in.readLine();
}
in.close();
EDIT:
There are a number of ways to speed this up. Here are three:
Storing all data in a HashMap
If you don't have too many records, or if each record doesn't take much space, you could read them all into RAM. You can also use a HashMap to map the name of the student to their record. For example:
HashMap<String, Student> grades = new HashMap<String, Student>();
BufferedReader in = new BufferedReader(new FileReader("grades.txt"));
String line = in.readLine();
while (null != line) {
    String[] columns = line.split(" ");
    grades.put(columns[0],
            new Student(/* create a Student instance from the remaining columns */));
    line = in.readLine();
}
in.close();
Now, lookups will be extremely fast.
Using a Binary Search
If you have too many records to fit in RAM, you can write all of the student data to a random access (binary) file. Here, you have a couple of options: you can either make each record vary in length, or you can make each record have a fixed length. Fixed length records are easier for some kinds of searching, like binary searches.
For example, if you know each record is 100 bytes, then you know how to get to the n'th record in the binary file: skip 100*(n-1) bytes, and the next 100 bytes are the n'th record.
Thus, if the records are sorted by student name, you can very easily use a binary search to find a specific student. This approach will still be fast, albeit not as fast as the RAM-based data structure.
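A rough sketch of that binary search, assuming fixed-length ASCII records sorted by the name in the first column; the record size, layout, and method name are assumptions:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

static String findRecord(RandomAccessFile raf, String name, int recordLen) throws IOException {
    long lo = 0;
    long hi = raf.length() / recordLen - 1;
    byte[] buf = new byte[recordLen];
    while (lo <= hi) {
        long mid = (lo + hi) / 2;
        raf.seek(mid * recordLen);                 // jump straight to the middle record
        raf.readFully(buf);
        String record = new String(buf, StandardCharsets.US_ASCII);
        int cmp = record.split(" ")[0].compareTo(name);   // compare on the name column
        if (cmp == 0) {
            return record;                         // found it
        } else if (cmp < 0) {
            lo = mid + 1;                          // name is in the upper half
        } else {
            hi = mid - 1;                          // name is in the lower half
        }
    }
    return null;                                   // not present
}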
Using a HashMap as an index
Yet another option is to combine the two approaches I mentioned above. Write the data to a binary file, and store the byte offsets of the records in a hash map. The hash map can use the student name as the key as before, but then stores a long integer offset to the record in the random access file. Thus, to look up a specific student, you find the byte offset using the hash map, and then "seek" to the record in the file and then read it. This last approach works even if the records vary in length.
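And a sketch of the index-based variant: one pass records each student's byte offset in a hash map, after which a lookup is a single seek and read; the class and method names are made up for illustration:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

class GradeIndex {
    private final Map<String, Long> offsets = new HashMap<>();   // student name -> byte offset

    void buildIndex(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long pos = raf.getFilePointer();          // offset where the next line starts
            String line;
            while ((line = raf.readLine()) != null) {
                offsets.put(line.split(" ")[0], pos); // first column is the student name
                pos = raf.getFilePointer();
            }
        }
    }

    String lookup(String path, String name) throws IOException {
        Long pos = offsets.get(name);
        if (pos == null) {
            return null;                              // no such student
        }
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(pos);                            // jump straight to the record
            return raf.readLine();
        }
    }
}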
There is no such thing as a 'line'. There are, however, line delimiters (newline, which is '\n'). You can write a line, but that only writes the data followed by a newline. You can read a line, but again that only reads until it finds a newline character or the end of the file.
So to find line n, you have to keep reading until you've counted n-1 newline characters, and keep reading until you find the next one (or the end of the file).