I'm trying to read specific lines between two sections using Java 8.
I need to get the information between ~CURVE INFORMATION and ~PARAMETER INFORMATION.
I was able to get it by checking startsWith() or equals() and storing the lines in a StringBuilder or collection. But is there a method available to get specific lines between two sections?
I was looking at the questions below for reference.
How to read specific parts of a .txt file in JAVA
How to read specific parts of the text file using Java
Sample data from file:
~WELL INFORMATION
#MNEM.UNIT DATA TYPE INFORMATION
#---------- ------------ ------------------------------
STRT.FT 5560.0000: START DEPTH
STOP.FT 16769.5000: STOP DEPTH
STEP.FT 0.5000: STEP LENGTH
NULL. -999.2500: NULL VALUE
COMP. SHELL: COMPANY
~CURVE INFORMATION
#MNEM.UNIT API CODE CURVE DESCRIPTION
#---------- ------------ ------------------------------
DEPT.F :
SEWP.OHMM 99 000 00 00:
SEMP.OHMM 99 120 00 00:
SEDP.OHMM 99 120 00 00:
SESP.OHMM 99 220 01 00:
SGRC.GAPI 99 310 01 00:
SROP.FT/HR 99 000 00 00:
SBDC.G/C3 45 350 01 00:
SCOR.G/C3 99 365 01 00:
SPSF.DEC 99 890 03 00:
~PARAMETER INFORMATION
#MNEM.UNIT VALUE DESCRIPTION
#---------- ------------ ------------------------------
RMF .OHMM -: RMF
MFST.F -: RMF MEAS. TEMP.
RMC .OHMM -: RMC
MCST.F -: RMC MEAS. TEMP.
MFSS. -: SOURCE RMF.
MCSS. -: SOURCE RMC.
WITN. MILLER: WITNESSED BY
~OTHER INFORMATION
Using Java 9 you can do this elegantly with streams:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static void main(String[] args) {
    try (Stream<String> stream = Files.lines(Paths.get(args[0]))) {
        System.out.println(stream
                .dropWhile(string -> !"~CURVE INFORMATION".equals(string))
                .takeWhile(string -> !"~PARAMETER INFORMATION".equals(string))
                .skip(1)
                .collect(Collectors.joining("\n")));
    } catch (IOException e) {
        e.printStackTrace();
    }
}
What makes it pleasing is the declarative nature of streams: you're literally writing code that says "drop elements until the start mark, then take elements until the end mark, and join them using "\n""! Java 9 added takeWhile and dropWhile; I'm sure you can implement them yourself or get an implementation from a library for Java 8. Of course, this is just another way to achieve the original goal.
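For Java 8 itself, where takeWhile and dropWhile don't exist, a plain loop over the lines does the same job. Here is a minimal sketch (the marker strings are taken from the sample file above; the class and method names are just for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SectionExtractor {

    // Returns the lines strictly between the start marker and the end marker.
    static List<String> between(List<String> lines, String start, String end) {
        List<String> result = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (end.equals(line)) {
                break;                // stop at the end marker
            }
            if (inside) {
                result.add(line);     // collect lines after the start marker
            }
            if (start.equals(line)) {
                inside = true;
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        between(lines, "~CURVE INFORMATION", "~PARAMETER INFORMATION")
                .forEach(System.out::println);
    }
}
```

Not as declarative as the stream version, but it only reads the file once and works on any Java version.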
I am trying to read a folder of gzipped CSVs (without the .gz extension), each with a list of variables, e.g.:
CSV file 1: TIMESTAMP | VAR1 | VAR2 | VAR3
CSV file 2: TIMESTAMP | VAR1 | VAR3
Each file represents a day. The order of the columns can be different (or columns can be missing from one file).
The first option, reading the whole folder in one shot using spark.read, is discarded because the join between the files takes into account the column order rather than the column names.
My next option is to read file by file:
for (String key : pathArray) {
    Dataset<Row> rawData = spark.read().option("header", true).csv(key);
    allDatasets.add(rawData);
}
And then do a full outer join on the column names:
Dataset<Row> data = allDatasets.get(0);
for (int i = 1; i < allDatasets.size(); i++) {
    ArrayList<String> columns = new ArrayList<>(Arrays.asList(data.columns()));
    columns.retainAll(new ArrayList<>(Arrays.asList(allDatasets.get(i).columns())));
    data = data.join(allDatasets.get(i), JavaConversions.asScalaBuffer(columns), "outer");
}
But this process is very slow, as it loads one file at a time.
The next approach is to use sc.binaryFiles, since with sc.readFiles it is not possible to work around adding custom Hadoop codecs (needed in order to read gzipped files without the .gz extension).
Using the latter approach and translating this code to Java, I have the following:
A JavaPairRDD<String, Iterable<Tuple2<String, String>>> containing the name of the variable (VAR1) and an iterable of tuples (TIMESTAMP, VALUE) for that variable.
I would like to form a DataFrame from this representing all the files, but I am completely lost on how to transform this final PairRDD into a DataFrame. The DataFrame should represent the contents of all the files together. An example of the final DataFrame that I would like to have is the following:
TIMESTAMP | VAR1 | VAR2 | VAR3
01        | 32   | 12   | 32     <== start of contents of file 1
02        | 10   | 5    | 7      <== end of contents of file 1
03        | 1    |      | 5      <== start of contents of file 2
04        | 4    |      | 8      <== end of contents of file 2
Any suggestions or ideas?
Finally I got it working with very good performance:
Reading by month in the "background" (using a Java Executor to read the other folders of CSVs in parallel); with this approach, the time the driver spends scanning each folder is reduced because it is done in parallel.
Next, the process extracts, on the one hand, the headers and, on the other hand, their contents (tuples of varname, timestamp, value).
Finally, it unions the contents using the RDD API and builds the DataFrame from the headers.
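Spark specifics aside, the shape of that final merge, a full outer join keyed on TIMESTAMP where each file contributes only the columns it actually has, can be sketched in plain Java with maps. This is only an illustration of the data shape (the timestamps, variable names, and values are stand-ins, not the real data):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OuterJoinSketch {

    // Each file is modeled as: timestamp -> (column name -> value).
    // Merging files keyed on timestamp mimics a full outer join on column
    // names: a timestamp that lacks a column in one file simply has no
    // entry for that column, like a null cell in the joined DataFrame.
    static TreeMap<String, Map<String, String>> merge(List<Map<String, Map<String, String>>> files) {
        TreeMap<String, Map<String, String>> joined = new TreeMap<>();
        for (Map<String, Map<String, String>> file : files) {
            for (Map.Entry<String, Map<String, String>> row : file.entrySet()) {
                joined.computeIfAbsent(row.getKey(), k -> new LinkedHashMap<>())
                      .putAll(row.getValue());
            }
        }
        return joined;
    }
}
```

With the sample above, file 1 contributes rows 01 and 02 with VAR1..VAR3, and file 2 contributes rows 03 and 04 with only VAR1 and VAR3; the merged map is exactly the desired table, with the absent VAR2 cells simply missing.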
Flatfile1.txt
HDR06112016FLATFILE TXT
CLM12345 ABCDEF....
DTL12345 100A00....
DTL12345 200B00....
CLM54321 ABCDEF....
DTL54321 100C00....
DTL54321 200D00....
Flatfile2.txt
HDR06112016FLATFILE TXT
CLM54321 FEDCBA....
DTL54321 100C00....
DTL54321 200D00....
CLM12345 ABCDEF....
DTL12345 100A00....
DTL12345 200B00....
The mapping for both files is the same:
Header:
Field StartPosition EndPos Length
Identifier 1 3 3
Date 4 12 8
and so on
Clm:
Field StartPosition EndPos Length
Identifier 1 3 3
Key 4 12 8
Data 13 19 6
and so on
Dtl:
Field StartPosition EndPos Length
Identifier 1 3 3
Key 4 12 8
and so on
This is a sample; the real files can be up to 500 MB with about 50 fields. I need to compare the two files based on their mapping. The file format is: one header, then claim data (12345) on one line, with one or more detail lines following it. The claims can appear in any order in the other file; it is not a line-to-line mapping. The ordering of the detail data will be the same in both files.
Desired output :
For Key 54321 , Data(pos 13:19) is not same.
Would you please help me compare the two files? Will it be feasible in Java, given that the files will be huge?
Java would work fine. You don't need to have the files entirely in memory; you can open them both and read from both incrementally, comparing as you go along.
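One way to sketch that idea: since the claims can appear in any order, index the first file's CLM records by their key, then walk the second file and compare field by field. The 0-based substring offsets below are derived from the mapping stated above (key in positions 4-12, data in positions 13-19) and may need adjusting to the real layout; the diff message format follows the desired output:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClaimCompare {

    // Index a file's CLM records: claim key -> data field.
    static Map<String, String> indexClaims(List<String> lines) {
        Map<String, String> byKey = new HashMap<>();
        for (String line : lines) {
            if (line.startsWith("CLM")) {
                String key = line.substring(3, 11).trim(); // key: positions 4-12
                String data = line.substring(12, 18);      // data: positions 13-19
                byKey.put(key, data);
            }
        }
        return byKey;
    }

    // Report claims whose data field differs between the two files.
    static List<String> compare(List<String> file1, List<String> file2) {
        Map<String, String> first = indexClaims(file1);
        List<String> diffs = new ArrayList<>();
        for (Map.Entry<String, String> claim : indexClaims(file2).entrySet()) {
            String other = first.get(claim.getKey());
            if (other != null && !other.equals(claim.getValue())) {
                diffs.add("For Key " + claim.getKey() + " , Data(pos 13:19) is not same.");
            }
        }
        return diffs;
    }
}
```

Because only the key and the compared field are kept in memory, the footprint is proportional to the number of claims, not the 500 MB file size; the DTL lines can be handled the same way, streaming each file with a BufferedReader instead of loading it whole.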
I have a file, and here is a portion of it. The common word in all lines is PIC, and I am able to find the index of PIC. I am trying to extract the description from each line. How can I extract the word before the word PIC?
15 EXTR-SITE PIC X.
05 EXTR-DBA PIC X.
TE0305* 05 EXTR-BRANCH PIC X(05).
TE0305* 05 EXTR-NUMBER PIC X(06).
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
CW0104* 05 FILLER PIC X(567).
I have to get result like below
EXTR-SITE
EXTR-DBA
EXTR-NUMBER
-------
FILLER
Is there any expression I can use to find the word before 'PIC'?
Here is my code to get lines that contain 'PIC':
int wordStartIndex = line.indexOf("PIC");
int wordEndIndex = line.indexOf(".");
if ((wordStartIndex > -1) && (wordEndIndex >= wordStartIndex)) {
    System.out.println(line);
}
15 EXTR-SITE PIC X.
05 EXTR-DBA PIC X.
TE0305* 05 EXTR-BRANCH PIC X(05).
TE0305* 05 EXTR-NUMBER PIC X(06).
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
CW0104* 05 FILLER PIC X(567).
I think you need to find out more about COBOL before you approach this task.
Columns 1-6 can contain a sequence number, can be blank, or can contain anything. If you are attempting to parse COBOL code you need to ignore columns 1-6.
Column 7 is called the Indicator area. It may be blank, or contain an * which indicates a comment, or a -, which indicates the line is a continuation of the previous non-blank/non-comment line, or contain a D which indicates it is a debugging line.
Columns 73-80 may contain another sequence number, or blank, or anything, and must be ignored.
If your COBOL source was "free format", things would be a bit different, but it is not.
There is no sense in extracting data from comment lines, so your expected output is not valid. It is also unclear where you get the line of dashes in your expected output.
If you are trying to parse COBOL source, you must have valid COBOL source. This is not valid:
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
A level-number (the 05) is a group-item if it is followed by higher level-numbers (the two 10s). A group-item cannot have a PICture.
PIC itself can also be written in full, as PICTURE.
PIC can quite easily appear in an identifier/data-name (EPIC-CODE). As could PICTURE, in theory.
PIC and PICTURE could appear in a comment line, even if not a commented line of code.
The method you want to use to find the "description" (which is the identifier, or data-name) is flawed.
01 the-record.
05 fixed-part-of-record.
10 an-individual-item PIC X.
10 another-item COMP-1.
10 and-another COMP-3 PIC 9(3).
10 PIC X.
05 variable-part-of-record.
10 entry-name OCCURS 10 TIMES.
15 entry-name-client-first-name
PIC X(30).
15 entry-name-client-surname
PIC X(30).
That is just a short example, not to be considered all-encompassing.
From that, your method would retrieve
an-individual-item
COMP-3
and two lines of "whatever happens when PIC is the first thing on a line".
To save this becoming a chameleon question, you need to ask a new question (or sort it out yourself) with a different method.
Depending on the source of the COBOL source, there are better ways to deal with this. If the source is an IBM Mainframe COBOL, then the source for your source should either be a compile listing or the SYSADATA from the compile.
From either of those, you'd pick up the identifier/data-name at a specific location under a specific condition. No parsing to do at all.
If you cannot get that, then I'd suggest you look for the level-number, and find the first thing after that. You will still have some work to do.
Level-numbers can be one or two digits, in the range 1-49, plus 66, 77, 88. Some compilers also have 78. If your extract is only "records" (likely) you won't see 77 or 78. You'll likely not see 66 (only seen it used once) and quite probably will see 88s, which you may or may not want to include in your output (depending on what you need it for).
1.
01.
01 FILLER.
01 data-name-name-1.
01 data-name-name-2 PIC X(80).
5.
05.
05 FILLER.
05 FILLER PIC X.
05 data-name-name-3.
05 data-name-name-4 PIC X.
The use of a single-digit for a level-number and not spelling FILLER explicitly are fairly "new" (from the 1985 Standard) and it is quite possible you don't have any of those. But you might.
The output from the above should be:
FILLER
FILLER
FILLER
data-name-name-1
data-name-name-2
FILLER
FILLER
FILLER
FILLER
data-name-name-3
data-name-name-4
I have no idea what you'd want to do with that output. With no context, it doesn't have a lot of meaning.
It is possible that your selected method would work with your actual data (assuming your sample is representative, and that what you get is valid code).
However, it would still be simpler to say "if the first word on a line is one- or two-digit numeric, if there is a second word, that's what we want, else use FILLER". Noting, of course, the previous comments about what you should ignore.
Unless your source contains 88-levels. Because it would be quite common for a range of values to require a second line, and if the values happen to be numeric, and one or two digits, then that won't work either.
So, identify the source of your source. If it is an IBM Mainframe, attempt to get output from the compile. Then your task is really easy, and 100% accurate.
If you can't get that, then understand your data thoroughly. If you have really simple structures such that your method works, doing it from the level-number will still be easier.
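A sketch of that level-number approach, assuming fixed-format source (sequence area in columns 1-6, indicator in column 7), and deliberately ignoring the caveats discussed above (88-level continuation lines, columns 73-80, single source lines only):

```java
import java.util.ArrayList;
import java.util.List;

public class LevelNumberScan {

    // For each non-comment line whose first word (after columns 1-7) is a
    // one- or two-digit level number: take the second word as the data-name,
    // or FILLER if there is none. Trailing periods are stripped.
    static List<String> dataNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String raw : lines) {
            if (raw.length() < 8) {
                continue;                   // nothing beyond the indicator area
            }
            char indicator = raw.charAt(6); // column 7
            if (indicator == '*' || indicator == '/') {
                continue;                   // comment line
            }
            String[] words = raw.substring(7).trim().split("\\s+");
            String first = words[0].replaceAll("\\.$", "");
            if (!first.matches("\\d{1,2}")) {
                continue;                   // first word is not a level number
            }
            String name = words.length > 1 ? words[1] : "FILLER";
            names.add(name.replaceAll("\\.$", ""));
        }
        return names;
    }
}
```

Note this accepts 66, 77, and 88 as level numbers too; filtering those out (or handling 88-level value lists spanning lines) is exactly the extra work the text above warns about.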
If you need to come back to this, please ask a new question. Otherwise you're leaving the people who have already spent their time voluntarily answering your existing question out to dry.
If you are not committed to writing a Cobol parser yourself, a couple of options include:
Use the Cobol compiler to process the Cobol copybook. This will create a listing of the Cobol-Copybook in a format that is easier to parse. I have worked at companies that converted all their Cobol-Copybooks to the equivalent Easytrieve copybooks automatically by compiling each Cobol-Copybook in a Hello-World-type program and processing the output.
Products like File-Aid have Cobol parsers that produce an easily digested version of the Cobol copybook.
The java project cb2xml will convert a Cobol-Copybook to Xml. The project provides some examples of processing the Xml with Jaxb.
To parse a Cobol-Copybook into a Java list of items using cb2xml (taken from Demo2.java):
JAXBContext jc = JAXBContext.newInstance(Condition.class, Copybook.class, Item.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
Document doc = Cb2Xml2.convertToXMLDOM(
new File(Code.getFullName("BitOfEverything.cbl").getFile()),
false,
Cb2xmlConstants.USE_STANDARD_COLUMNS);
JAXBElement<Copybook> copybook = unmarshaller.unmarshal(doc, Copybook.class);
The program Demo2.java will then print the contents of a cobol copybook out:
List<Item> items = copybook.getValue().getItem();
for (Item item : items) {
Code.printItem(" ", item);
}
And to print a Cobol-Item Code.java:
public static void printItem(String indent, Item item) {
char[] nc = new char[Math.max(1, 50 - indent.length()
- item.getName().length())];
String picture = item.getPicture();
Arrays.fill(nc, ' ');
if (picture == null) {
picture = "";
}
System.out.println(indent + item.getLevel() + " " + item.getName()
+ new String(nc) + item.getPosition()
+ " " + item.getStorageLength() + "\t" + picture);
List<Item> childItems = item.getItem();
for (Item child : childItems) {
printItem(indent + " ", child);
}
}
The output from Demo2 is like (gives you the level, field name, start, length and picture):
01 CompFields 1 5099
03 NumA 1 25 --,---,---,---,---,--9.99
03 NumB 26 3 9V99
03 NumC 29 3 999
03 text 32 20 x(20)
03 NumD 52 3 VPPP999
03 NumE 55 3 999PPP
03 float 58 4
03 double 62 8
03 filler 70 23
05 RBI-REPETITIVE-AREA 70 13
10 RBI-REPEAT 70 13
15 RBI-NUMBER-S96SLS 70 7 S9(06)
15 RBI-NUMBER-S96DISP 77 6 S9(06)
05 SFIELD-SEP 83 10 S9(7)V99
Another cb2xml example is DemoCobolJTreeTable.java, which displays a COBOL copybook in a tree table.
You can try a regex like this (note the replacement is "$2", the second captured group):
public static void main(String[] args) {
    String s = "15 EXTR-SITE PIC X.";
    System.out.println(s.replaceAll("(.*?\\s+)+(.*?)(?=\\s+PIC).*", "$2"));
}
O/P:
EXTR-SITE
Explanation:
(.*?\\s+)+(.*?)(?=\\s+PIC).* with replacement "$2":
(.*?\\s+)+ --> Matches one or more groups of "anything" followed by whitespace (the leading columns and the level number).
(.*?)(?=\\s+PIC) --> Captures the shortest run of characters that is followed by whitespace and the word "PIC" -- the word we want.
.* --> Matches everything from PIC onward, so the replacement discards it.
$2 --> The replacement keeps only the second captured group, i.e. the word immediately before PIC.
PS : This works with all your current inputs :P
// let 'lines' be an array of all your lines,
// with one complete line as a String per element
for (String line : lines) {
    String[] splitted = line.trim().split("\\s+"); // split on runs of whitespace, not single spaces
    for (int i = 1; i < splitted.length; i++) {
        if (splitted[i].equals("PIC")) {
            System.out.println(splitted[i - 1]);
        }
    }
}
Please note that I didn't test this code yet (but will in a few minutes). However, the general approach should be clear now.
Try to use String.split("\\s+"). This method splits the original string into an array of Strings (String[]). Then, using Arrays.asList(...) you can transform your array into a List, so you can search for a particular object using indexOf.
Here is an extract of a possibile solution:
String words = "TE0305* 05 EXTR-BRANCH PIC X(05).";
List<String> list = Arrays.asList(words.split("\\s+"));
int index = list.indexOf("PIC");
// Prints EXTR-BRANCH
System.out.println(index > 0 ? list.get(index - 1) : ""); // Added a guard
In my honest opinion, this code lets Java work for you, and not the opposite. It is concise, readable, and therefore more maintainable.
Can someone help me read a specific column of data in a text file into a list?
e.g.: if the text file data is as follows
-------------
id name age
01 ron 21
02 harry 12
03 tom 23
04 jerry 25
-------------
From the above data, I need to gather the column "name" into a list in Java and print it.
java.util.Scanner could be used to read the file, discarding the unwanted columns.
Either print the wanted column values as the file is processed or add() them to a java.util.ArrayList and print them once processing is complete.
A small example with limited error checking:
Scanner s = new Scanner(new File("input.txt"));
List<String> names = new ArrayList<String>();
// Skip the separator line and the column headings.
while (s.hasNext() && !s.hasNextInt())
{
    s.next();
}
// Read each line, ensuring correct format.
while (s.hasNextInt())
{
    s.nextInt();          // read and skip 'id'
    names.add(s.next());  // read and store 'name'
    s.nextInt();          // read and skip 'age'
}
for (String name: names)
{
System.out.println(name);
}
Use a file reader and read the file line by line, splitting on the spaces and adding whichever column you want to the List. Use a BufferedReader to grab the lines, something like this:
BufferedReader br = new BufferedReader(new FileReader("C:\\readFile.txt"));
Then you can do to grab a line:
String line = br.readLine();
Finally you can split the string into an array by column by doing this:
String[] columns = line.split("\\s+"); // split on runs of whitespace, not a single space
Then you can access the columns and add them into the list depending on if you want column 0, 1, or 2.
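Putting those pieces together, a small sketch: the file name comes from the snippet above, column index 1 corresponds to "name" in the sample, and data rows are assumed to start with the numeric id (which also skips the heading and the dashed separator lines):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class ColumnReader {

    // Collect one whitespace-separated column from the data rows.
    static List<String> readColumn(Reader source, int column) throws IOException {
        List<String> values = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] columns = line.trim().split("\\s+");
                // keep only data rows: the first column is the numeric id,
                // which skips "id name age" and the "-----" separators
                if (columns.length > column && columns[0].matches("\\d+")) {
                    values.add(columns[column]);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        readColumn(new FileReader("C:\\readFile.txt"), 1) // column 1 is "name"
                .forEach(System.out::println);
    }
}
```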
Are the columns delimited by tabs?
Look at Java CSV, an open-source library for reading comma-delimited or tab-delimited text files. Should do most of the job. I've never used it myself, but I assume you'd be able to ask for all the values from column 1 (or similar).
Alternatively, you could read the file one line at a time using a BufferedReader (which has a readLine() method), then call String.split() and grab the parts you want.