Java: Handling a stream to read from file - java

For a lab at my university I'm developing a system in Java that stores data read from a file (the file path is given as a String). I was trying to handle the problem using a stream of lines from the path, but I got stuck at an early stage. The file is organised as follows: the fields of a line are separated by ";" and each line starts with a "P" or a "D". Depending on this parameter, I'll use the content of the line to create a "Patient" object or a "Doctor" object, which is then stored in one of two maps (patients or doctors). I did the following:
Path p = Paths.get(path);
Stream<String> lines = Files.lines(p, StandardCharsets.UTF_8);
lines.flatMap(l -> Stream.of(l.split("; ")))....
My idea was to check the word at the beginning of the line and, according to that, pass the remaining elements as parameters to a method that creates and stores the corresponding object (insertPatient or insertDoctor). But I haven't got the faintest idea of how to do that. I know other ways to do the same thing, but I really want to develop the solution using a stream, at least for reading the lines.
Thanks,
Gianluca.

try (Stream<String> lines = Files.lines(p, StandardCharsets.UTF_8)) { // 1
    lines.map(line -> line.split(";")) // 2
         .forEach(lineAsArray -> {
             if (lineAsArray[0].equals("D")) {
                 insertDoctor(lineAsArray);
             }
             else if (lineAsArray[0].equals("P")) {
                 insertPatient(lineAsArray);
             }
         });
}
1. Use try-with-resources to make sure the stream, and thus the file reader, is closed.
2. Don't use flatMap, since you want to act on complete lines, and not on cells.

Related

How to get a value in a file with coordinates in Java

My program needs to read a file that has different data structures with a variable separator.
In my properties-file you can set the separator and put coordinates for values of different variables:
separator = ;
variable1 = 1,7
variable2 = 2,42
I would like to have a way where I can access a column and a line with some kind of coordinates.
I'm thinking of a syntax like this:
file.get(1,7,";")
(Which would give you the value of the 1st line and 7th column with the specific separator)
Does someone know a library or a code snippet that does exactly this?
Using String.split():
public String get(File file, int lineNumber, int column, String separator) {
    // Getting to line lineNumber of the file is omitted here;
    // suppose that line is already in a String named "line".
    // Note: split() treats the separator as a regular expression.
    return line.split(separator)[column - 1];
}
You can use OpenCSV or SuperCSV, for example. I'm not aware of any library that offers your "coordinates" lookup directly, but it's as simple as reading the CSV with the given separator into a list of lists and then calling
csv.get(1).get(7)
This seems to be simple file processing. You should first process the file:
Create an ArrayList<ArrayList<String>> named processedFile
Read every line and split it using line.split(separator)
Store the resulting array as a list in processedFile at the current index
Increase the index with every line
Once processedFile is ready, you can simply use processedFile.get(row).get(column) (both indices are zero-based). Once the file is processed, every further lookup is O(1). These hints should be enough; try writing the code yourself and you will learn more.
PS: Take care of NullPointerExceptions wherever required.
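For reference, the list-of-lists idea described above could look roughly like the sketch below. The class name DelimitedTable and its get method are made up for illustration (they are not from any library), and the sketch assumes the file fits in memory and that coordinates are 1-based, as in the "1,7" notation from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: loads a delimited file once, then answers (line, column) lookups.
final class DelimitedTable {
    private final List<List<String>> rows = new ArrayList<>();

    DelimitedTable(Path file, String separator) throws IOException {
        // Pattern.quote makes separators such as "|" or "." literal, since split() expects a regex.
        Pattern sep = Pattern.compile(Pattern.quote(separator));
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            // Limit -1 keeps trailing empty cells, so column positions stay stable.
            rows.add(Arrays.asList(sep.split(line, -1)));
        }
    }

    // Both coordinates are 1-based (an assumption taken from the question's example).
    String get(int lineNumber, int column) {
        return rows.get(lineNumber - 1).get(column - 1);
    }
}
```

With this, new DelimitedTable(path, ";").get(1, 7) would return the 7th cell of the 1st line. A coordinate outside the file throws an IndexOutOfBoundsException, which the caller may want to catch.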

Java processing lines in file and data structures

I have read a bit about multidimensional arrays. Would it make sense to solve this problem using such data structures in Java, or how should I proceed?
Problem
I have a text file containing records which contain multiple lines. One record is anything between <SUBBEGIN and <SUBEND.
The lines in a record follow no predefined order, and any of them may be absent from a record. In the input file (see below) I am only interested in the MSISDN, CB, CF and ODBIC fields.
For each of these fields I would like to apply regular expressions to extract the value to the right of the equals sign.
Output file would be a comma separated file containing these values, example
MSISDN=431234567893 the value 431234567893 is written to the output file
error checking
NoMSISDNnofound when no MSISDN is found in a record
noCFUALLPROVNONE when no CFU-ALL-PROV-NONE is found in a record
Search and replace operations
CFU-ALL-PROV-NONE should be replaced by CFU-ALL-PROV-1/1/1
CFU-TS10-ACT-914369223311 should be replaced by CFU-TS10-ACT-1/1/0/4369223311
Output for first record
431234567893,BAOC-ALL-PROV,BOIC-ALL-PROV,BOICEXHC-ALL-PROV,BICROAM-ALL-PROV,CFU-ALL-PROV-1/1/1,CFB-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFU-TS10-ACT-1/1/1/4369223311,BAIC,BAOC
Input file
<BEGINFILE>
<SUBBEGIN
IMSI=11111111111111;
MSISDN=431234567893;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
IMEISV=4565676567576576;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-
YES-YES-NO;
ODBIC=BAIC;
ODBOC=BAOC;
ODBROAM=ODBOHC;
ODBPRC=ENTER;
ODBPRC=INFO;
ODBPLMN=NONE;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=YES;
ODBADULTSMS=YES;
<SUBEND
<SUBBEGIN
IMSI=11111111111133;
MSISDN=431234567899;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO+-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-NO-NONE-YES-65535-YES-YES-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFD-TS10-REG-91430000000-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-YES-YES-NO;
ODBIC=BICCROSSDOMESTIC;
ODBOC=BAOC;
ODBROAM=ODBOH;
ODBPRC=INFO;
ODBPLMN=PLMN1
ODBPLMN=PLMN3;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=NO;
ODBADULTSMS=YES;
<SUBEND
From what I understand, you are simply reading a text file and processing it and maybe replacing some words. You do not therefore need a data structure to store the words in. Instead you can simply read the file line by line and pass it through a bunch of if statements (maybe a couple booleans to check if the specific parameters you are searching for have been found?) and then rewrite the line you want to a new file.
When dealing with big files to feed data into machine learning algorithms, I read the whole file content into one variable and then used the String.split("delimiter") method (available since Java 1.4) to break the content into a one-dimensional array, where each cell holds the text before a delimiter.
First read the file via a Scanner or in whatever way you prefer (let content be the variable holding the text), and then break it up with
content.split("<SUBEND");
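Building on that split, a rough sketch of the extraction step might look like this. The class and method names are invented for illustration, and the sketch assumes every wanted field starts at the beginning of a line and ends with ";". It only extracts the raw values; the search-and-replace rules from the question (for example the CFU-ALL-PROV-NONE rewrite) are deliberately left out and would still need to be applied to each value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class SubscriberRecords {
    // A wanted field starts a line, and its value runs up to the next ';'.
    // [^;]* also spans line breaks, which covers values wrapped onto a second line.
    private static final Pattern FIELD =
            Pattern.compile("^(MSISDN|CB|CF|ODBIC)=([^;]*);", Pattern.MULTILINE);

    // Returns one comma-separated output line per record found in the file content.
    static List<String> toCsvLines(String content) {
        List<String> out = new ArrayList<>();
        for (String record : content.split("<SUBEND")) {
            if (!record.contains("<SUBBEGIN")) {
                continue; // trailing text after the last record
            }
            List<String> values = new ArrayList<>();
            boolean hasMsisdn = false;
            Matcher m = FIELD.matcher(record);
            while (m.find()) {
                hasMsisdn |= m.group(1).equals("MSISDN");
                values.add(m.group(2).replaceAll("\\s+", "")); // re-join wrapped values
            }
            if (!hasMsisdn) {
                values.add(0, "NoMSISDNnofound"); // error token as written in the question
            }
            out.add(String.join(",", values));
        }
        return out;
    }
}
```

Values appear in the order in which they occur in the record. If MSISDN must always come first regardless of file order, it would have to be collected separately.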

How to delete all lines from a file one-by-one after reading the line?

I'm writing a Java program that does the following:
1. Reads in a line from a file
2. Does some action based on that line
3. Deletes the line (or replaces it with ""), and if 2 is not successful, writes it to a new file
4. Continues on to the next line, for all lines in the file (as opposed to removing an arbitrary line)
Currently I have:
try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
    String line;
    while ((line = br.readLine()) != null) {
        try {
            if (!do_stuff(line)) { // do_stuff returns bool based on success
                write_non_success(line);
            }
        } catch (Exception e) {
            e.printStackTrace(); // eat the exception for now, do something in the future
        }
    }
}
Obviously I'm going to need to not use a BufferedReader for this, as it can't write, but what class should I use? Also, read order doesn't matter
This differs from this question because I want to remove all lines, as opposed to an arbitrary line number as the other OP wants, and if possible I'd like to avoid writing the temp file after every line, as my files are approximately 1 million lines
If you do everything according to the algorithm that you describe, the content left in the original file would be the same as the content of "new file" from step #3:
If a line is processed successfully, it gets removed from the original file
If a line is not processed successfully, it gets added to the new file, and it also stays in the original file.
It is easy to see why at the end of this process the original file is the same as the "new file". All you need to do is to carry out your algorithm to the end, and then copy the new file in place of the original.
If your concern is that the process is going to crash in the middle, the situation becomes very different: now you have to write out the current state of the original file after processing each line, without writing over the original until you are sure that it is going to be in a consistent state. You can do it by reading all lines into a list, deleting the first line from the list once it has been processed, writing the content of the entire list into a temporary file, and copying it in place of the original. Obviously, this is very expensive, so it shouldn't be attempted in a tight loop. However, this approach ensures that the original file is not left in an inconsistent state, which is important when you are looking to avoid doing the same work multiple times.
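The simpler variant (process everything, then put the "new file" in place of the original) might be sketched as follows. The names LineProcessor and processAndKeepFailures are made up for this example, and the Predicate stands in for the asker's do_stuff. The sketch does not cover the crash-safe variant described in the last paragraph.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Predicate;

// Hypothetical helper, not part of any library.
final class LineProcessor {
    // Processes every line once; lines for which doStuff returns false (or throws)
    // are kept, and the input file ends up containing only those lines.
    static void processAndKeepFailures(Path input, Predicate<String> doStuff) throws IOException {
        // The temp file is created next to the input so the final move stays on one file system.
        Path failures = Files.createTempFile(input.toAbsolutePath().getParent(), "failed-", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(failures, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                boolean ok;
                try {
                    ok = doStuff.test(line);
                } catch (RuntimeException e) {
                    ok = false; // treat an exception as "not processed"
                }
                if (!ok) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
        // Swap the failure file in only after the whole input has been read.
        Files.move(failures, input, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Because the file is written once at the end rather than after every line, this should stay cheap even for a file with around a million lines.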

dividing input file into multiple files based on one of the column

I have a semicolon delimited input file where first column is a 3 char fixed width code, while the remaining columns are some string data.
001;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
I want to divide the above file into a number of files based on the different values of the first column.
For example, there are three different values in the first column above, so I will divide the file into three files, viz. 001.txt, 002.txt and 003.txt.
The output file should contain item count as line one and data as remaining lines.
So there are 5 001 rows, so 001.txt will be:
5
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
Similarly, the 002 file will have 4 as its first line followed by 4 lines of data, and the 003 file will have 5 as its first line followed by five lines of data.
What would be the most efficient way to achieve this, considering a very large input file with more than 100,000 rows?
I have written the code below to read lines from the file:
try {
    FileInputStream fstream = new FileInputStream(this.inputFilePath);
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;
    while ((strLine = br.readLine()) != null) {
        String[] tokens = strLine.split(";");
    }
    in.close();
} catch (IOException e) {
    e.printStackTrace();
}
For each line:
extract the chunk name, e.g. 001
look for a file named "001-tmp.txt"
if one exists, read its first line, which gives you the number of lines; then increment the value and write it back into the same file by calling seek with argument 0 and overwriting the string (note that writeUTF prepends a two-byte length, so writeBytes is probably the safer call for a plain text file). Some string length calculation has to be applied here; reserve a placeholder of 10 characters, for example.
if one does not exist, create it and write 1 as the first line, padded to 10 characters
append the current line to the file
close the current file
proceed with the next line of the source file
One solution that comes to mind is to keep a Map and open every file only once. But you won't be able to do this, because you have around 1 lakh (100,000) rows, so no OS will allow you that many open file descriptors.
So one way is to open the file in append mode, write to it, and close it again. But because of the huge number of file open and close calls, the process may slow down. You can test that for yourself, though.
If the above does not give satisfying results, you may try a mix of approaches 1 and 2, whereby you keep at most 100 files open at any time and only close a file when a new file that is not already open needs to be written to.
First, create a HashMap<String, ArrayList<String>> map to collect all the data from the file.
Second, use strLine.split(";", 2) instead of strLine.split(";"). The result will be an array of length 2: the first element is the code and the second is the data.
Then, add the decoded string to the map:
ArrayList<String> list = map.get(tokens[0]);
if (list == null) {
    map.put(tokens[0], list = new ArrayList<String>());
}
list.add(tokens[1]);
At the end, iterate over map.keySet() and, for each key, create a file named after that key and write the list's size and the list's contents to it.
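Put together, the map-based steps above could look something like this sketch. FileSplitter, splitByCode and the outputDir parameter are names invented for illustration, and the sketch assumes the whole input fits in memory (which should be fine for a few hundred thousand short rows).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class FileSplitter {
    // Groups the data part of each line by its leading code, then writes <code>.txt files.
    static void splitByCode(Path input, Path outputDir) throws IOException {
        Map<String, List<String>> byCode = new LinkedHashMap<>();
        for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            String[] tokens = line.split(";", 2); // tokens[0] = code, tokens[1] = rest of the line
            if (tokens.length < 2) {
                continue; // skip blank or malformed lines
            }
            byCode.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(tokens[1]);
        }
        for (Map.Entry<String, List<String>> e : byCode.entrySet()) {
            List<String> out = new ArrayList<>();
            out.add(String.valueOf(e.getValue().size())); // item count as line one
            out.addAll(e.getValue());
            Files.write(outputDir.resolve(e.getKey() + ".txt"), out, StandardCharsets.UTF_8);
        }
    }
}
```

Each output file is opened exactly once, so the number of open file descriptors is never a concern with this variant.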
For each three character code, you're going to have a list of input lines. To me the obvious solution would be to use a Map, with String keys (your three character codes) pointing to the corresponding List that contains all of the lines.
For each of those keys, you'd create a file with the relevant name, the first line would be the size of the list, and then you'd iterate over it to write the remaining lines.
I guess you are not fixed to three files, so I suggest you create a map of writers with your three-character code as the key and the writer as the value.
For each line you read, you select or create the required writer and write the line to it. You also need a second map to maintain the line count for each file.
Once you are done reading the source file, you flush and close all writers and read the files one by one again. This time you just add the line count at the front of the file. To my knowledge there is no other way than to rewrite the entire file, because it's not directly possible to add anything to the beginning of a file without buffering and rewriting all of it. I suggest you use a temporary file for this.
This answer applies only if your file is too large to be held fully in memory. If holding it in memory is possible, there are faster solutions, such as storing the contents of the file fully in StringBuffer objects before writing them to files.

Read Specific line using SuperCSV

Is it possible to read a specific line using SuperCsv?
Suppose a .csv file contains 100 lines and I want to read line number 11.
CSV files usually contain variable-length records, which means it is impossible to "jump" to a specified record. The only solution is to sequentially read CSV records from the beginning of the file, while keeping a count, until you reach the needed record.
I have not found any special API in SuperCsv for doing this skipping of lines, so I guess you will have to manually call CsvListReader#read() method 11 times to get the line you want.
I don't know if other CSV reading libraries will have a "jump-to-line" feature, and even if they do, it is unlikely to perform any better than manually skipping to the required line, for the reason given in the first paragraph.
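To illustrate the "read sequentially while keeping a count" point without relying on any CSV library, here is a minimal sketch in plain Java. LineSkipper is a made-up name, and the sketch counts physical lines only, so it assumes no CSV record contains a quoted line break; for such files a real CSV reader would be needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of SuperCSV or any other library.
final class LineSkipper {
    // Returns the physical line with the given 1-based number, or null if the file is shorter.
    static String readLine(Path file, int lineNumber) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = null;
            for (int i = 0; i < lineNumber; i++) {
                line = reader.readLine();
                if (line == null) {
                    return null; // ran out of lines before reaching the target
                }
            }
            return line;
        }
    }
}
```

The cost is proportional to the target line number, which matches the reasoning above: nothing can be skipped without being read.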
Here is a simple solution which you can adapt:
listReader = new CsvListReader(new InputStreamReader(new FileInputStream(CSVFILE), CHARSET), CsvPreference.TAB_PREFERENCE);
listReader.getHeader(false);
while ((listReader.read(processors)) != null) {
    if (listReader.getLineNumber() == 1) {
        System.out.println("Do whatever you need.");
    }
}
