Find a word in a line based on next word - java

I have a file, and here is a portion of it. The word common to all lines is PIC, and I am able to find the index of PIC. I am trying to extract the description from each line. How can I extract the word before the word PIC?
15 EXTR-SITE PIC X.
05 EXTR-DBA PIC X.
TE0305* 05 EXTR-BRANCH PIC X(05).
TE0305* 05 EXTR-NUMBER PIC X(06).
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
CW0104* 05 FILLER PIC X(567).
I have to get result like below
EXTR-SITE
EXTR-DBA
EXTR-NUMBER
-------
FILLER
Is there any expression I can use to find the word before 'PIC'?
Here is my code to get lines that contain 'PIC':
int wordStartIndex = line.indexOf("PIC");
int wordEndIndex = line.indexOf(".");
if ((wordStartIndex > -1) && (wordEndIndex >= wordStartIndex)) {
System.out.println(line); }

15 EXTR-SITE PIC X.
05 EXTR-DBA PIC X.
TE0305* 05 EXTR-BRANCH PIC X(05).
TE0305* 05 EXTR-NUMBER PIC X(06).
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
CW0104* 05 FILLER PIC X(567).
I think you need to find out more about COBOL before you approach this task.
Columns 1-6 can contain a sequence number, can be blank, or can contain anything. If you are attempting to parse COBOL code you need to ignore columns 1-6.
Column 7 is called the Indicator area. It may be blank, or contain an * which indicates a comment, or a -, which indicates the line is a continuation of the previous non-blank/non-comment line, or contain a D which indicates it is a debugging line.
Columns 73-80 may contain another sequence number, or blank, or anything, and must be ignored.
If your COBOL source was "free format", things would be a bit different, but it is not.
There is no sense in extracting data from comment lines, so your expected output is not valid. It is also unclear where you get the line of dashes in your expected output.
If you are trying to parse COBOL source, you must have valid COBOL source. This is not valid:
TE0305 05 FILLER PIC X(11).
CW0104 10 EXTR-TEXT6 PIC X(67).
CW0104 10 EXTR-TEXT7 PIC X(67).
A level-number (the 05) is a group-item if it is followed by higher level-numbers (the two 10s). A group-item cannot have a PICture.
PIC itself can also be written in full, as PICTURE.
PIC can quite easily appear in an identifier/data-name (EPIC-CODE). As could PICTURE, in theory.
PIC and PICTURE could appear in a comment line, even if not a commented line of code.
The method you want to use to find the "description" (which is the identifier, or data-name) is flawed.
01 the-record.
05 fixed-part-of-record.
10 an-individual-item PIC X.
10 another-item COMP-1.
10 and-another COMP-3 PIC 9(3).
10 PIC X.
05 variable-part-of-record.
10 entry-name OCCURS 10 TIMES.
15 entry-name-client-first-name
PIC X(30).
15 entry-name-client-surname
PIC X(30).
That is just a short example, not to be considered all-encompassing.
From that, your method would retrieve
an-individual-item
COMP-3
and two lines of "whatever happens when PIC is the first thing on line"
To save this becoming a chameleon question, you need to ask a new question (or sort it out yourself) with a different method.
Depending on the source of the COBOL source, there are better ways to deal with this. If the source is an IBM Mainframe COBOL, then the source for your source should either be a compile listing or the SYSADATA from the compile.
From either of those, you'd pick up the identifier/data-name at a specific location under a specific condition. No parsing to do at all.
If you cannot get that, then I'd suggest you look for the level-number, and find the first thing after that. You will still have some work to do.
Level-numbers can be one or two digits, in the range 1-49, plus 66, 77, 88. Some compilers also have 78. If your extract is only "records" (likely) you won't see 77 or 78. You'll likely not see 66 (only seen it used once) and quite probably will see 88s, which you may or may not want to include in your output (depending on what you need it for).
1.
01.
01 FILLER.
01 data-name-name-1.
01 data-name-name-2 PIC X(80).
5.
05.
05 FILLER.
05 FILLER PIC X.
05 data-name-name-3.
05 data-name-name-4 PIC X.
The use of a single-digit for a level-number and not spelling FILLER explicitly are fairly "new" (from the 1985 Standard) and it is quite possible you don't have any of those. But you might.
The output from the above should be:
FILLER
FILLER
FILLER
data-name-name-1
data-name-name-2
FILLER
FILLER
FILLER
FILLER
data-name-name-3
data-name-name-4
I have no idea what you'd want to do with that output. With no context, it doesn't have a lot of meaning.
It is possible that your selected method would work with your actual data (assuming you pickled your sample, and that what you get is valid code).
However, it would still be simpler to say "if the first word on a line is one- or two-digit numeric, if there is a second word, that's what we want, else use FILLER". Noting, of course, the previous comments about what you should ignore.
Unless your source contains 88-levels. Because it would be quite common for a range of values to require a second line, and if the values happen to be numeric, and one or two digits, then that won't work either.
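The level-number rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a COBOL parser: the class and method names are made up, only PIC/PICTURE are treated as a sign that the data-name was omitted, continuation lines are simply skipped, and the 88-level caveat is not handled at all.

```java
import java.util.Optional;

/** Hypothetical helper; the class and method names are illustrative only. */
class CobolDataNames {

    /**
     * Returns the code area (columns 8-72) of a fixed-format line, or empty
     * when the indicator column (7) is not blank, e.g. for comment,
     * continuation and debugging lines.
     */
    static Optional<String> codeArea(String line) {
        if (line.length() < 8 || line.charAt(6) != ' ') {
            return Optional.empty();
        }
        return Optional.of(line.substring(7, Math.min(line.length(), 72)));
    }

    /**
     * Applies the rule to an already isolated code area: a leading one- or
     * two-digit level-number, then the data-name, else FILLER.
     */
    static Optional<String> dataName(String code) {
        String[] words = code.trim().split("\\s+");
        String level = stripPeriod(words[0]);
        if (!level.matches("\\d{1,2}")) {
            return Optional.empty(); // not the start of a data definition
        }
        if (words.length < 2) {
            return Optional.of("FILLER");
        }
        String second = stripPeriod(words[1]);
        // Assumption: a clause keyword straight after the level-number means
        // the data-name was omitted. Only PIC/PICTURE are checked here.
        if (second.equalsIgnoreCase("PIC") || second.equalsIgnoreCase("PICTURE")) {
            return Optional.of("FILLER");
        }
        return Optional.of(second);
    }

    private static String stripPeriod(String word) {
        return word.endsWith(".") ? word.substring(0, word.length() - 1) : word;
    }

    public static void main(String[] args) {
        String[] sample = {"1.", "01 FILLER.", "01 data-name-name-2 PIC X(80).",
                           "05 FILLER PIC X.", "10 PIC X."};
        for (String code : sample) {
            System.out.println(dataName(code).orElse("(skipped)"));
        }
    }
}
```

For real fixed-format source you would pass each line through codeArea first and only call dataName on the lines it returns.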
So, identify the source of your source. If it is an IBM Mainframe, attempt to get output from the compile. Then your task is really easy, and 100% accurate.
If you can't get that, then understand your data thoroughly. If you have really simple structures such that your method works, doing it from the level-number will still be easier.
If you need to come back to this, please ask a new question. Otherwise you're hanging out to dry the people who have already spent their time voluntarily answering your existing question.

If you are not committed to writing a Cobol parser yourself, a couple of options include:
Use the Cobol Compiler to process the Cobol copybook. This will create a listing of the Cobol-Copybook in a format that is easier to parse. I have worked at companies that converted all their Cobol-Copybooks to the equivalent Easytrieve copybooks automatically by compiling the Cobol-Copybook in a Hello-World type program and processing the output.
Products like File-Aid have Cobol parsers that produce an easily digested version of the Cobol Copybook.
The Java project cb2xml will convert a Cobol-Copybook to Xml. The project provides some examples of processing the Xml with Jaxb.
To parse a Cobol-Copybook into a Java list of items using cb2xml (taken from Demo2.java):
JAXBContext jc = JAXBContext.newInstance(Condition.class, Copybook.class, Item.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
Document doc = Cb2Xml2.convertToXMLDOM(
new File(Code.getFullName("BitOfEverything.cbl").getFile()),
false,
Cb2xmlConstants.USE_STANDARD_COLUMNS);
JAXBElement<Copybook> copybook = unmarshaller.unmarshal(doc, Copybook.class);
The program Demo2.java will then print the contents of a cobol copybook out:
List<Item> items = copybook.getValue().getItem();
for (Item item : items) {
Code.printItem(" ", item);
}
And to print a Cobol-Item (from Code.java):
public static void printItem(String indent, Item item) {
char[] nc = new char[Math.max(1, 50 - indent.length()
- item.getName().length())];
String picture = item.getPicture();
Arrays.fill(nc, ' ');
if (picture == null) {
picture = "";
}
System.out.println(indent + item.getLevel() + " " + item.getName()
+ new String(nc) + item.getPosition()
+ " " + item.getStorageLength() + "\t" + picture);
List<Item> childItems = item.getItem();
for (Item child : childItems) {
printItem(indent + " ", child);
}
}
The output from Demo2 is like (gives you the level, field name, start, length and picture):
01 CompFields 1 5099
03 NumA 1 25 --,---,---,---,---,--9.99
03 NumB 26 3 9V99
03 NumC 29 3 999
03 text 32 20 x(20)
03 NumD 52 3 VPPP999
03 NumE 55 3 999PPP
03 float 58 4
03 double 62 8
03 filler 70 23
05 RBI-REPETITIVE-AREA 70 13
10 RBI-REPEAT 70 13
15 RBI-NUMBER-S96SLS 70 7 S9(06)
15 RBI-NUMBER-S96DISP 77 6 S9(06)
05 SFIELD-SEP 83 10 S9(7)V99
Another cb2xml example is DemoCobolJTreeTable.java, which displays a COBOL copybook in a tree table.

You can try a regex like this:
public static void main(String[] args) {
    String s = "15 EXTR-SITE PIC X.";
    System.out.println(s.replaceAll(".*?(\\S+)\\s+PIC\\b.*", "$1"));
}
O/P:
EXTR-SITE
Explanation:
.*?(\\S+)\\s+PIC\\b.* with the replacement "$1":
.*? --> Lazily skip everything before the word we want.
(\\S+) --> Capture a run of non-whitespace characters (group 1).
\\s+PIC\\b --> The captured word must be followed by whitespace and the whole word "PIC".
.* --> Select everything after PIC.
$1 --> Replace the whole line with the first captured group, i.e. the data between `()`.
PS : This works with all your current inputs :P

// let 'lines' be an array of all your lines,
// with one complete line as a String per element
for (String line : lines) {
    // split on runs of whitespace so repeated spaces do not yield empty words
    String[] splitted = line.trim().split("\\s+");
    for (int i = 1; i < splitted.length; i++) {
        if (splitted[i].equals("PIC")) System.out.println(splitted[i - 1]);
    }
}
Please note that I haven't tested this code yet (but will in a few minutes). However, the general approach should be clear now.

Try to use String.split("\\s+"). This method splits the original string into an array of Strings (String[]). Then, using Arrays.asList(...) you can transform your array into a List, so you can search for a particular object using indexOf.
Here is an extract of a possible solution:
String words = "TE0305* 05 EXTR-BRANCH PIC X(05).";
List<String> list = Arrays.asList(words.split("\\s+"));
int index = list.indexOf("PIC");
// Prints EXTR-BRANCH
System.out.println(index > 0 ? list.get(index - 1) : ""); // Added a guard
In my honest opinion, this code lets Java do the work for you, not the other way around. It is concise, readable and therefore more maintainable.

Related

Writing Html file with java duplicates the entry

I have a program that does some calculations in Excel and writes the output to a table tag in an HTML file. I add rows dynamically at runtime depending on the number of results. When writing to the HTML file, the entries are not correct.
Suppose I have 50 rows in the HTML file. I append 49 rows to the template file at runtime and replace the values $id0, $age0, $time0 ... $id49, $age49, $time49 in the HTML file. The first 10 rows are written properly. From the 11th row on, the values are written wrong, although the logs show the correct ones.
for (int i = 0; i < c; i++) {
    htmlString = htmlString.replace("$id" + i, cycle.get("id" + i).toString().trim());
    htmlString = htmlString.replace("$time" + i, cycle.get("time" + i).toString().trim());
    htmlString = htmlString.replace("$name" + i, cycle.get("name" + i).toString().trim());
}
The entry comes in html as
id Name age time
9 abc 8 8.08
10 xyz 12 9.19
11 xyz1 121 9.191
12 xyz12 122 9.192
The values for ids 11 and 12 are wrong. They show the values of an earlier id with 1, 2, etc. appended.
I was able to resolve it by adding an extra character after the placeholder, e.g. $id1: instead of $id1.
Example:
id1=abc
id2=xyz
Without the extra character, $id11 came out as abc1.
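The behaviour described here is consistent with a prefix collision: replacing "$id1" also rewrites the first four characters of "$id11". One possible way around it, without changing the template, is to match each placeholder as a whole token. The sketch below is only an illustration: the class name, the Map<String, String> shape of cycle, and the placeholder pattern are assumptions, not taken from the original program.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderFill {

    // A placeholder is '$', a name made of letters, then ALL following digits.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$([A-Za-z]+\\d+)");

    /** Replaces each $key with values.get(key); unknown placeholders stay as-is. */
    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cycle = new HashMap<>();
        cycle.put("id1", "abc");
        cycle.put("id11", "xyz");
        String template = "<td>$id1</td><td>$id11</td>";
        // Naive sequential replace: "$id1" also matches the start of "$id11".
        System.out.println(template.replace("$id1", cycle.get("id1")));
        // Whole-token replace: each placeholder is looked up exactly once.
        System.out.println(fill(template, cycle));
    }
}
```

Because the digits are matched greedily, $id11 is always looked up as id11 and never as id1 followed by a literal 1.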

How to get specific lines from the file ( between two sections)?

I'm trying to read specific lines in-between two sections using Java 8.
I need to get the information in between ~CURVE INFORMATION and ~PARAMETER INFORMATION
I was able to get it by checking each line with startsWith() or equals() and storing the lines in a StringBuilder or a collection. But is there any built-in method for getting the lines between two sections?
I was looking at below questions for reference.
How to read specific parts of a .txt file in JAVA
How to read specific parts of the text file using Java
Sample data from file:
~WELL INFORMATION
#MNEM.UNIT DATA TYPE INFORMATION
#---------- ------------ ------------------------------
STRT.FT 5560.0000: START DEPTH
STOP.FT 16769.5000: STOP DEPTH
STEP.FT 0.5000: STEP LENGTH
NULL. -999.2500: NULL VALUE
COMP. SHELL: COMPANY
~CURVE INFORMATION
#MNEM.UNIT API CODE CURVE DESCRIPTION
#---------- ------------ ------------------------------
DEPT.F :
SEWP.OHMM 99 000 00 00:
SEMP.OHMM 99 120 00 00:
SEDP.OHMM 99 120 00 00:
SESP.OHMM 99 220 01 00:
SGRC.GAPI 99 310 01 00:
SROP.FT/HR 99 000 00 00:
SBDC.G/C3 45 350 01 00:
SCOR.G/C3 99 365 01 00:
SPSF.DEC 99 890 03 00:
~PARAMETER INFORMATION
#MNEM.UNIT VALUE DESCRIPTION
#---------- ------------ ------------------------------
RMF .OHMM -: RMF
MFST.F -: RMF MEAS. TEMP.
RMC .OHMM -: RMC
MCST.F -: RMC MEAS. TEMP.
MFSS. -: SOURCE RMF.
MCSS. -: SOURCE RMC.
WITN. MILLER: WITNESSED BY
~OTHER INFORMATION
Using Java 9 you can do it elegantly with streams:
public static void main(String[] args) {
    try (Stream<String> stream = Files.lines(Paths.get(args[0]))) {
        System.out.println(stream
                .dropWhile(string -> !"~CURVE INFORMATION".equals(string))
                .takeWhile(string -> !"~PARAMETER INFORMATION".equals(string))
                .skip(1)
                .collect(Collectors.joining("\n")));
    } catch (IOException e) {
        e.printStackTrace();
    }
}
What makes it pleasing is the declarative nature of streams: you're literally writing code that says drop elements until the start mark, then take elements until the end mark, and join them using "\n". Java 9 added takeWhile and dropWhile; I'm sure you can implement them or get an implementation from a library for Java 8. Of course this is just another way to achieve the original goal.
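For Java 8, where dropWhile and takeWhile are not available on Stream, a plain loop gives the same result. The following is a minimal sketch of that idea; the class and method names are hypothetical, and it takes any Iterable of lines so it can be tried without a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SectionLines {

    /** Returns the lines strictly between the start and end marker lines. */
    static List<String> between(Iterable<String> lines, String start, String end) {
        List<String> result = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (!inside) {
                inside = line.equals(start); // the marker itself is not collected
            } else if (line.equals(end)) {
                break;                       // stop reading at the end marker
            } else {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "~WELL INFORMATION", "STRT.FT 5560.0000: START DEPTH",
                "~CURVE INFORMATION", "DEPT.F :", "SEWP.OHMM 99 000 00 00:",
                "~PARAMETER INFORMATION", "RMF .OHMM -: RMF");
        System.out.println(between(lines, "~CURVE INFORMATION", "~PARAMETER INFORMATION"));
    }
}
```

With a real file, the result of Files.readAllLines(path) can be passed in as the lines argument.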

delete unwanted characters from URL

I have this variable String var = class.getSomething that contains this url http://www.google.com§°§#[]|£%/^<> . The output that comes out is this: http://www.google.comç°§#[]|£%/^<>. How can I delete that Ã? Thanks!
You could do this; it replaces every occurrence of the character with an empty string, which serves your purpose.
str = str.replace("Â", "");
That replaces every Â with nothing, giving the result you want.
Use String.replace
var = var.replace("Ã", "");
Specify the charset as UTF-8 to get rid of the unwanted extra chars:
String var = class.getSomething;
var = new String(var.getBytes(),"UTF-8");
Do you really want to delete only that one character, or all invalid characters? In the latter case you can check each character with CharUtils.isAsciiPrintable(char ch) from Apache Commons Lang. However, according to RFC 3986 even fewer characters are allowed in URLs (alphanumerics and "-_.+=!*'()~,:;/?$#&%", see Characters allowed in a URL).
In any case, you have to create a new String object (like with replace in the answer by Elias MP or putting valid characters one by one into a StringBuilder and convert it to a String) as Strings are immutable in Java.
The string in var is output using utf-8, which results in the byte sequence:
c2 a7 c2 b0 c2 a7 23 5b 5d 7c c2 a3 25 2f 5e 3c 3e
This happens to be the iso-8859-1 encoding of the characters as you see them:
Â§Â°Â§#[]|Â£%/^<>
C2 is the encoding for Â.
I'm not sure how the Ã was produced; its encoding is C3.
We need the full code to learn how this happened, and a description how the character encoding for text files on your system is configured.
Modifying the variable var is useless.
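The effect described in this answer can be reproduced in a few lines. This is only a sketch of the mechanism (UTF-8 bytes read back as ISO-8859-1), not a reconstruction of what the asker's system actually does; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    /** Encodes text as UTF-8 and then (wrongly) decodes the bytes as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the demo independent of the source file encoding.
        // This is the tail of the URL from the question.
        String original = "\u00a7\u00b0\u00a7#[]|\u00a3%/^<>";
        String garbled = misdecode(original);
        // Each of the four non-ASCII characters gained a leading U+00C2.
        System.out.println(garbled.length() - original.length());
        // Reversing the wrong step restores the text, provided nothing was lost in between.
        String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original));
    }
}
```

This is why the data in the variable is not the thing to fix: the string itself is intact, and the extra characters only appear when the bytes are interpreted with the wrong charset on output.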

Extracting data from a text file - repeated values

79 0009!017009!0479%0009!0479 0009!0469%0009!0469
0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449
0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419
0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009
0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339
0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032
In this data, I'm supposed to extract the numbers 47, 46, 45, 44 and so on, and ignore the rest. The numbers always follow this pattern: 9!0, then the number, then 9%
for example: 9!0 42 9%
Which language should I go about to solve this and which function might help me?
Is there any function that can position a special character and copy the next two or three elements?
Ex: 9!0 42 9% and ' 009
look out for ! and then copy 42 from there and look out for ' that refers to another value (009). It's like two different regex to be used.
You can use whatever language you want, or even a unix command line utility like sed, awk, or grep. The regex should be something like this: you want to match 9!0 followed by digits followed by 9%. Use this regex: 9!0(\d+)9% (or, if the numbers are all two digits, 9!0(\d{2})9%).
The other answers are fine, my regex solution is simply "9!.(\d\d)"
And here's a full solution in powershell, which can be easily correlated to other .net langs
$t="79 0009!017009!0479%0009!0479 0009!0469%0009!0469 0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449 0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419 0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009 0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339 0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032"
$p="9!.(\d\d)"
$ms=[regex]::match($t,$p)
while ($ms.Success) {write-host $ms.groups[1].value;$ms=$ms.NextMatch()}
This is Perl:
@result = $subject =~ m/(?<=9!0)\d+(?=9%)/g;
It will give you an array of all your numbers. You didn't provide a language so I don't know if this is suitable for you or not.
Pattern regex = Pattern.compile("(?<=9!0)\\d+(?=9%)");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
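As a complete, runnable variant of the loop above, the matches can be collected into a list. The class and method names here are hypothetical; the pattern is the one from the answer, and the input is the first line of the sample data from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractValues {

    // Digits that are preceded by "9!0" and followed by "9%".
    private static final Pattern VALUE = Pattern.compile("(?<=9!0)\\d+(?=9%)");

    /** Returns every run of digits that sits between "9!0" and "9%". */
    static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(text);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    public static void main(String[] args) {
        // First line of the sample data from the question.
        String line = "79 0009!017009!0479%0009!0479 0009!0469%0009!0469";
        System.out.println(extract(line)); // [47, 46]
    }
}
```

Occurrences such as 9!0479 followed by a space are skipped, because the lookahead requires 9% directly after the digits.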

Read file contents in j2me

I have a file like the one shown below:
OrderNo id name count Format
1 AA1 sdflsdfsdfd 12 01
2 AB2 asdaewqrftr 13 02
3 AA3 aerefytrsu 12 01
I want to read this file and sort it by OrderNo. Please suggest a way to read and sort it (in J2ME). Thanks...
Create an object representing this
Read the file line by line ('\n' new line )
Sort them in memory and write them back.
Note:
Be careful about memory
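The steps above might look roughly like this. It is a sketch under several assumptions: the Order class and method names are invented, the header line has already been skipped, and reading the file itself is left out because it depends on the device API in use. It deliberately avoids generics and Collections.sort and uses only Vector with a hand-written insertion sort, since, as far as I know, J2ME's CLDC profile offers a much smaller java.util than Java SE.

```java
import java.util.Vector;

class OrderSort {

    /** One row of the file; only the order number is parsed, the rest is kept as text. */
    static class Order {
        final int orderNo;
        final String line;

        Order(String line) {
            String trimmed = line.trim();
            int space = trimmed.indexOf(' ');
            // The order number is everything up to the first space.
            this.orderNo = Integer.parseInt(space < 0 ? trimmed : trimmed.substring(0, space));
            this.line = trimmed;
        }
    }

    /** Insertion sort: puts each order in front of the first entry with a larger number. */
    static Vector sortByOrderNo(String[] dataLines) {
        Vector sorted = new Vector();
        for (int i = 0; i < dataLines.length; i++) {
            Order order = new Order(dataLines[i]);
            int pos = 0;
            while (pos < sorted.size()
                    && ((Order) sorted.elementAt(pos)).orderNo <= order.orderNo) {
                pos++;
            }
            sorted.insertElementAt(order, pos);
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] dataLines = {"3 AA3 aerefytrsu 12 01",
                              "1 AA1 sdflsdfsdfd 12 01",
                              "2 AB2 asdaewqrftr 13 02"};
        Vector sorted = sortByOrderNo(dataLines);
        for (int i = 0; i < sorted.size(); i++) {
            System.out.println(((Order) sorted.elementAt(i)).line);
        }
    }
}
```

Keeping only the parsed number and the original line per row is one way to respect the memory note: nothing beyond the sort key is duplicated.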
