extract text between matches using java

Here's my input text
1. INTRODUCTION
This is a test document. This document lines can span multiple lines.
This is another line.
2. PROCESS
This is a test process. This is another line.
3. ANOTHER HEADING
...
I want to extract text between the main titles, 1,2,3 and so on. I am using this regular expression to match the titles - ^[ ]{0,2}?[0-9]{0,2}\\.(.*)$
How do I extract text between matches?
EDIT
I tried using this code -
while(matcher.find()) {
}
If I look ahead for the next match's starting index inside this while loop, it will change the state of the Matcher. How do I get the text in between using String.substring? I need the end of the current match and the beginning of the next match to take the substring.

How do I extract text between matches?
Do you mean between 1. INTRODUCTION and 2. PROCESS and so on? If so, if the next line is not a "header" line, add the text to some buffer. If it is a header, add the buffer to a running list and then clear the buffer.
Something like (in pseudo code)
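List<String> content
currentContent = ""
while line = readNextLine()
    if not matched header
        currentContent += line
    else
        // found a new header: add the buffered text to the list, then clear the buffer
        if currentContent != ""
            content.add(currentContent)
            currentContent = ""
// after the loop, add whatever follows the last header
if currentContent != ""
    content.add(currentContent)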
Edit: as one big string
// Split the text into lines
String[] bits = yourString.split("\\n");
String currentContent = ""; // Text between headers
List<String> content = new ArrayList<String>(); // Running list of text between headers
// Loop through each line
for (String bit : bits) {
    Matcher m = yourPattern.matcher(bit); // Pattern provides matcher(), not match()
    if (m.matches()) {
        // Found a header: store the text collected so far
        if (currentContent.length() != 0) {
            content.add(currentContent);
            currentContent = "";
        }
    } else {
        // Not a header: append the line, restoring the newline that split() removed
        currentContent += bit + "\n";
    }
}
// Store the text that follows the last header
if (currentContent.length() != 0) {
    content.add(currentContent);
}
Something like that would work. I suppose you could do a complicated multi-line regex, but this seems easier to me.

How about this:
String text =
" 1. INTRODUCTION\n"
+ " This is a test document. This document lines can span multiple lines.\n"
+ " This is another line.\n"
+ " 2. PROCESS\n"
+ " This is a test process. This is another line.\n"
+ " 3. ANOTHER HEADING\n";
Pattern pat = Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);
Matcher m = pat.matcher(text);
int start = 0;
while (m.find()) {
if (start < m.start()) {
System.out.println("*** paragraphs:");
System.out.println(text.substring(start, m.start()));
}
System.out.println("*** title:");
System.out.println(m.group());
start = m.end();
}
The results are:
*** title:
1. INTRODUCTION
*** paragraphs:
This is a test document. This document lines can span multiple lines.
This is another line.
*** title:
2. PROCESS
*** paragraphs:
This is a test process. This is another line.
*** title:
3. ANOTHER HEADING
You may wish to remove the newlines before and after the paragraphs.
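The loop above prints as it goes and does not emit any text that follows the last title. As a rough sketch of the same start/end bookkeeping, the version below collects each title with its trimmed body into a map. The class name SectionExtractor and the method sections are made up for illustration, and it assumes titles are unique (a repeated title would overwrite the earlier entry).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionExtractor {
    // Same title pattern as in the question, compiled in MULTILINE mode
    // so that ^ and $ match at every line boundary.
    private static final Pattern TITLE =
            Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);

    /** Maps each title line to the trimmed text that follows it. */
    static Map<String, String> sections(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = TITLE.matcher(text);
        String title = null; // title whose body is currently open
        int bodyStart = 0;   // index just past that title
        while (m.find()) {
            if (title != null) {
                // Body runs from the end of the previous title to the start of this one.
                result.put(title, text.substring(bodyStart, m.start()).trim());
            }
            title = m.group().trim();
            bodyStart = m.end();
        }
        if (title != null) {
            // Text after the last title runs to the end of the input.
            result.put(title, text.substring(bodyStart).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "1. INTRODUCTION\nThis is a test document.\nThis is another line.\n"
                + "2. PROCESS\nThis is a test process.\n3. ANOTHER HEADING\n";
        sections(text).forEach((k, v) -> System.out.println(k + " -> [" + v + "]"));
    }
}
```

Only one Matcher is used and it is never asked to look ahead; the end of the previous match is simply remembered in a local variable until the next match is found.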

Related

Need Regular Expression to parse multi-line environmental variables

I want to parse a file that is a list of environmental variables similar to this example:
TPS_LIB_DIR = "$DEF_VERSION_DIR\lib\ver215";
TPS_PH_DIR = "$DEF_VERSION_DIR";
TPS_SCHEMA_DIR = "~TPS_DIR\Supersedes\code;" +
"~TPR_DIR\..\Supersedes\code;" +
"~TPN_DIR\..\..\Supersedes\code;" +
"$TPS_VERSION_DIR";
TPS_LIB_DIR = "C:\prog\lib";
BASE_DIR = "C:\prog\base";
SPARS_DIR = "C:\prog\spars";
SIGNALFILE_DIR = "E:\SIGNAL_FILES";
SIGNALFILE2_DIR = "E:\SIGNAL_FILES2";
SIGNALFILE3_DIR = "E:\SIGNAL_FILES2";
I came up with this regular expression that matches the single line definitions fine, but it will not match the multi-line definitions.
(\w+)\s*=\s*(.*);[\r\n]+
Does anyone know of a regular expression which will parse all lines in this file where the environmental variable name is in group 1 and the value (on right side of =) is in group 2? Even better would be if the multiple paths were in separate groups, but I can handle that part manually.
UPDATE:
Here is what I ended up implementing. The first pattern "Pattern p" matches the individual environmental variable blocks. The second pattern, "Pattern valpattern" parses the one or more values for each environmental variable. Hope someone finds this useful.
private static void parse(File filename) {
Pattern p = Pattern.compile("(\\w+)\\s*=\\s*([\\s\\S]+?\";)");
Pattern valpattern = Pattern.compile("\\s*\"(.+)\"\\s*");
try {
String str = readFile(filename, StandardCharsets.UTF_8);
Matcher matcher = p.matcher(str);
while(matcher.find()) {
String key = matcher.group(1);
Matcher valmatcher = valpattern.matcher(matcher.group(2));
System.out.println(key);
while(valmatcher.find()) {
System.out.println("\t" + valmatcher.group(1).replaceAll(System.getProperty("line.separator"), ""));
}
}
} catch (IOException e) {
System.out.println("Error: ProcessENV.parse -- problem parsing file: " + filename + System.lineSeparator());
e.printStackTrace();
}
}
static String readFile(File file, Charset encoding) throws IOException {
byte[] encoded = Files.readAllBytes(file.toPath());
return new String(encoded, encoding);
}

It is simpler to split on '=' and '";' (shown here in Python).
[c.strip().split(' = ') for c in s.split('";')]
Or with a nested comprehension to get the individual paths:
[[name.strip(), *[p.strip(' "\n') for p in value.split('+')]] for name, value in (c.split(' = ', 1) for c in s.split('";') if ' = ' in c)]
The split could also be done with re, adding \s* to absorb the surrounding whitespace:
r = re.split(r'\s*=\s*|";\s*', text, flags=re.MULTILINE)
The even elements r[::2] are then the variable names and the odd elements r[1::2] the values.
Finally, strip the remaining quotes and whitespace from the values.

You can use the following regex:
(\w+)\s*=\s*([\s\S]+?)";
It matches group 1 (one or more word characters), optional whitespace, an equals sign, optional whitespace, then group 2 (one or more of any character, including line breaks, matched non-greedily), and finally a closing double quote followed by a semicolon.
That will match all the lines.
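As a sketch of how this pattern might be used from Java, the class below collects each variable name with its list of paths. The names EnvFileParser and parse are illustrative only. It uses a slight variation of the pattern that keeps the closing quote inside group 2, and it assumes that a value is a series of quoted fragments joined with + and that the individual paths are separated by semicolons inside those fragments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnvFileParser {
    // Group 1: variable name. Group 2: the quoted fragments, up to the quote before the final ';'.
    private static final Pattern ENTRY =
            Pattern.compile("(\\w+)\\s*=\\s*([\\s\\S]+?\");");
    // One quoted fragment of a value.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    /** Returns variable name -> list of paths; a later definition replaces an earlier one. */
    static Map<String, List<String>> parse(String content) {
        Map<String, List<String>> vars = new LinkedHashMap<>();
        Matcher entry = ENTRY.matcher(content);
        while (entry.find()) {
            // Concatenate the quoted fragments, as the + operators in the file would.
            StringBuilder joined = new StringBuilder();
            Matcher quoted = QUOTED.matcher(entry.group(2));
            while (quoted.find()) {
                joined.append(quoted.group(1));
            }
            // Split the concatenated value into individual paths.
            List<String> paths = new ArrayList<>();
            for (String path : joined.toString().split(";")) {
                if (!path.isEmpty()) {
                    paths.add(path);
                }
            }
            vars.put(entry.group(1), paths);
        }
        return vars;
    }

    public static void main(String[] args) {
        String content = "LIB_DIR = \"C:\\prog\\lib\";\n"
                + "SCHEMA_DIR = \"~A\\code;\" +\n"
                + "    \"~B\\code;\" +\n"
                + "    \"$VERSION_DIR\";\n";
        System.out.println(parse(content));
    }
}
```

Because the sample file defines TPS_LIB_DIR twice, a map keeps only the last definition; use a list of entries instead if every definition matters.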

How to extract the parameters from the output of a formated string in Java

I am trying to parse the output of a program and extract the parameters used to generate these results. The output is in the form of sentences generated by the format function in Python, e.g.:
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. is generated from Opening browser '%s' to base url '%s'
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. is generated from Clicking element '%s'.
I want to extract the initial input parameters in the format function. My function would look something like:
private List<String> extractParameters(String output, String format){
// code would come here
}
The function takes as input the generated string and the format string that was used to generate it (e.g. "Clicking element '%s'.") and returns a sorted list of the parameters that were used (e.g. "xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]")
I started working on a method using regex, but I have many formats to manage and, not being a regex expert, the solution I am moving towards is really ugly and unmaintainable. So the question is:
Is there an elegant way to achieve my goal in Java?

Regex should do the trick, but make sure the patterns are well written and efficient. For your examples above I made a simple line analyzer based on regex patterns:
class RegexLineAnalyzer {
private List<Pattern> patterns = new ArrayList<>();
public RegexLineAnalyzer() {
patterns.add(Pattern.compile("^Opening browser '(.+)' to base url '(.+)'", Pattern.CASE_INSENSITIVE));
patterns.add(Pattern.compile("^Clicking element '(.+)'", Pattern.CASE_INSENSITIVE));
// add other patterns
}
public List<String> extractParameters(String line) {
for (Pattern pattern : patterns) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
List<String> parameters = new ArrayList<>(matcher.groupCount());
for (int i = 0; i < matcher.groupCount(); i++) {
parameters.add(matcher.group(i + 1));
}
return parameters;
}
}
return Collections.emptyList();
}
}
I assume that the log files are processed line by line.
Example usage of the above analyzer could look like this:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
List<String> lines = new ArrayList<>();
lines.add("Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.");
lines.add("Clicking element 'xpath=.//a[contains(normalize-space(#class), \"cc-btn cc-dismiss\")]'.");
RegexLineAnalyzer regexLineAnalyzer = new RegexLineAnalyzer();
for (String line : lines) {
System.out.println(line + " => " + regexLineAnalyzer.extractParameters(line));
}
}
}
Prints:
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. => [Google Chrome, https://https://stackoverflow.com]
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. => [xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]]
EDITED
I thought you had a list of patterns to match against each line. If you instead need to guess the pattern and then find its arguments, you can use a simpler solution based on the split function. We have to assume that each line contains an even number of ' characters. Lines such as Jon's browser is 'IE' or User last name is 'O'Reilly' would be a problem, and so would User's last name is 'O'Reilly'. A simple implementation could look like this:
class SplitLineAnalyzer {
public List<String> extractParameters(String line) {
final String regex = "'";
final String[] split = line.split(regex);
if (split.length % 2 == 0) {
System.out.println("Line contains unexpected number of parts. Hard to guess pattern for line = " + line);
return Collections.emptyList();
}
List<String> args = new ArrayList<>();
for (int i = 1; i < split.length; i += 2) {
args.add(split[i]);
split[i] = "%s";
}
Arrays.stream(split).reduce((s1, s2) -> s1 + regex + s2).ifPresent(s -> System.out.println("Possible pattern: " + s));
return args;
}
}
Example usage:
public class Main {
public static void main(String[] args) throws Exception {
List<String> lines = new ArrayList<>();
lines.add("Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.");
lines.add("Clicking element 'xpath=.//a[contains(normalize-space(#class), \"cc-btn cc-dismiss\")]'.");
lines.add("'Firefox' is used by user 'Tom'.");
lines.add("Lines like this' could be broken.");
lines.add("User's first name is 'Jerry'.");
lines.add("User's last name is 'O'Reilly'");
SplitLineAnalyzer regexLineAnalyzer = new SplitLineAnalyzer();
for (String line : lines) {
System.out.println(line + " => " + regexLineAnalyzer.extractParameters(line));
System.out.println("");
}
}
}
Prints:
Possible pattern: Opening browser '%s' to base url '%s'.
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. => [Google Chrome, https://https://stackoverflow.com]
Possible pattern: Clicking element '%s'.
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. => [xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]]
Possible pattern: '%s' is used by user '%s'.
'Firefox' is used by user 'Tom'. => [Firefox, Tom]
Line contains unexpected number of parts. Hard to guess pattern for line = Lines like this' could be broken.
Lines like this' could be broken. => []
Line contains unexpected number of parts. Hard to guess pattern for line = User's first name is 'Jerry'.
User's first name is 'Jerry'. => []
Line contains unexpected number of parts. Hard to guess pattern for line = User's last name is 'O'Reilly'
User's last name is 'O'Reilly' => []
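Neither analyzer above takes the format string as an argument, which is what the signature in the question asks for. One possible sketch, under the assumption that %s is the only placeholder in use and that the format describes the whole line, is to turn the format into a regex by quoting its literal parts and putting a capturing group in place of each %s. The class name FormatParameterExtractor is made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormatParameterExtractor {
    /**
     * Returns the values that filled the %s placeholders of format,
     * or an empty list if output was not produced from that format.
     */
    static List<String> extractParameters(String output, String format) {
        StringBuilder regex = new StringBuilder();
        // The limit -1 keeps trailing empty parts, so a format ending in %s still works.
        String[] literals = format.split("%s", -1);
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("(.*)"); // one capturing group per placeholder
            }
            regex.append(Pattern.quote(literals[i])); // literal text, metacharacters escaped
        }
        Matcher m = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(output);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        List<String> params = new ArrayList<>(m.groupCount());
        for (int i = 1; i <= m.groupCount(); i++) {
            params.add(m.group(i));
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(extractParameters(
                "Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.",
                "Opening browser '%s' to base url '%s'."));
    }
}
```

Since matches() must consume the whole line, the format has to include any trailing punctuation. The result can still be ambiguous when a parameter itself contains the literal text that separates two placeholders.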

cannot split a specific kind of strings using Java

I am working in Java. I have a list of parameters stored in a string which comes from Excel. I want to split it only at the hyphen that starts each new line. Such a string is stored in every Excel cell, and I am trying to extract it using Apache POI. The format is as below:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
What I want
array or arraylist which looks like this
[I am string one,
I am string two,
I am string-three with new line,
I am string-four,
I am string five]
What I Tried
I tried to use split function like this:
String[] newline_split = text.split("-");
but the output I get is not what I want
My O/P
[, I am string one,
I am string two,
I am string, // wrong
three // wrong
with new line, // wrong
I am string, // wrong!
four, // wrong!
I am string five]
I might have to tweak the split call a bit, but I cannot work out how, because there are so many hyphens and new lines in the string.
P.S.
If I try splitting only at new lines, then the item - I am string-three \n with new line breaks into two parts, which again is not correct.
EDIT:
Please note that the data inside the string is incorrectly formatted, just as shown above. It comes from an Excel file which I have received. I am trying to use Apache POI to extract the content of each Excel cell as a string.
I intentionally kept the format as the client gave it to me. For those who are confused about the description inside A, I have changed it because I cannot post the actual contents here; that would violate my workplace's privacy rules.

You can:
1. Replace each line separator with a space if it is not followed by - (at the start of the next line): .replaceAll("\\R(?!-)", " ") should do the trick.
\R (written as "\\R" in a string literal) can be used since Java 8 to represent line separators.
(?!...) is a negative lookahead: it ensures that there is no - right after the position where it is used. The - itself is not included in the match, so it will not be removed.
2. Then remove the - placed at the start of each line (let's also include any whitespace that follows it, to trim the start of the string). In other words, remove a - placed
after a line separator: can be represented by "\\R"
after the start of the string: can be represented by ^
This should do the trick: .replaceAll("(?<=\\R|^)-\\s*","")
3. Split on the remaining line separators: .split("\\R")
Demo:
String text =
"- I am string one\n" +
"-I am string two\n" +
"- I am string-three\n" +
"with new line\n" +
"-I am string-four\n" +
"- I am string five";
String[] split = text.replaceAll("\\R(?!-)", " ")
.replaceAll("(?<=\\R|^)-\\s*","")
.split("\\R");
for (String s: split){
System.out.println("'"+s+"'");
}
Output (surrounded with ' to show start and end of results):
'I am string one'
'I am string two'
'I am string-three with new line'
'I am string-four'
'I am string five'
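A possible variation, sketched below, is to split directly at the line breaks that are followed by a hyphen and then clean up each piece. The class name HyphenListSplitter is hypothetical, and the sketch assumes, as in the question's data, that the hyphen is the first character of the line with no indentation before it.

```java
import java.util.ArrayList;
import java.util.List;

class HyphenListSplitter {
    /** Splits text into items, where each item starts with '-' at the beginning of a line. */
    static List<String> items(String text) {
        List<String> result = new ArrayList<>();
        // The lookahead (?=-) splits only at line breaks that are followed by a hyphen.
        for (String chunk : text.split("\\R(?=-)")) {
            String item = chunk
                    .replaceFirst("^-\\s*", "")      // drop the leading marker
                    .replaceAll("\\s*\\R\\s*", " ")  // join continuation lines with one space
                    .trim();
            if (!item.isEmpty()) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "- I am string one\n-I am string two\n- I am string-three\n"
                + "with new line\n-I am string-four\n- I am string five";
        items(text).forEach(s -> System.out.println("'" + s + "'"));
    }
}
```

Hyphens inside an item, such as the one in string-three, are left alone because only a hyphen directly after a line break triggers a split.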

This is how I would do it:
import java.util.*;
public class MyClass {
public static void main(String args[]) {
String A = "- I am string one \n" +
" -I am string two\n" +
" - I am string-three \n" +
" with new line\n" +
" -I am string-four\n" +
"- I am string five";
String[] s2 = A.split("\r?\n");
List<String> lines = new ArrayList<String>();
String line = "";
for (int i = 0; i < s2.length; i++) {
String ss = s2[i].trim();
if (i == 0) { // first line MUST start with "-"
line = ss.substring(1).trim();
} else if (ss.startsWith("-")) {
lines.add(line);
ss = ss.substring(1).trim();
line = ss;
} else {
line = line + " " + ss;
}
}
lines.add(line);
System.out.println(lines.toString());
}
}
I hope it helps.
A little explanation:
I will process line by line, trimming each one.
If it starts with '-' it means the end of the previous line, so I include it in the list. If not, I concatenate with the previous line.

It looks as if you want to split at the FIRST - of each line, so you need to remove the - from every "newline followed by -":
str = str.replace("\n-", "\n");
Then remove the initial "-":
str = str.substring(1);

Parsing a Log File to Display Data from Multiple Lines Using Regular Expressions

So I'm trying to parse a bit of code here to get message text from a log file. I'll explain as I go. Here's the code:
// Print to interactions
try
{
// assigns the input file to a filereader object
BufferedReader infile = new BufferedReader(new FileReader(log));
sc = new Scanner(log);
while(sc.hasNext())
{
String line=sc.nextLine();
if(line.contains("LANTALK")){
Document doc = Jsoup.parse(line);
Element idto = doc.select("MBXTO").first();
Element msg = doc.select("MSGTEXT").first();
System.out.println(" to " + idto.text() + " " +
msg.text());
System.out.println();
} // End of if
} // End of while
try
{
// Print to output file
sc = new Scanner (log);
while(sc.hasNext())
{
String line=sc.nextLine();
if(line.contains("LANTALK")){
Document doc = Jsoup.parse(line);
Element idto = doc.select("MBXTO").first();
Element msg = doc.select("MSGTEXT").first();
outFile.println(" to " + idto.text() + " " +
msg.text());
outFile.println();
outFile.println();
} // End of if
} // End of while
} // end of try
I'm getting input from a log file, here's a sample of what it looks like and the lines that I'm filtering out:
08:25:20.740 [D] [T:000FF0] [F:LANTALK2C] <CMD>LANMSG</CMD>
<MBXID>1124</MBXID><MBXTO>5760</MBXTO><SUBTEXT>LanTalk</SUBTEXT><MOBILEADDR>
</MOBILEADDR><LAP>0</LAP><SMS>0</SMS><MSGTEXT>and I talked to him and he
gave me a credit card number</MSGTEXT>
08:25:20.751 [+] [T:000FF0] [S:1:1:1124:5607:5] LANMSG [15/2 | 0]
08:25:20.945 [+] [T:000FF4] [S:1:1:1124:5607:5] LANMSGTYPESTOPPED [0/2 | 0]
08:25:21.327 [+] [T:000FE8] [S:1:1:1124:5607:5] LANMSGTYPESTARTED [0/2 | 0]
So far, I've been able to filter the line that contains the message (LANMSG). And from that, I've been able to get the id number of the recipient (MBXTO). But the next line contains the sender's id, which I need to pull out and display. ([S:1:1:1124:SENDERID:5]). How should I do this? Below is a copy of the output I'm getting:
to 5760 and I talked to him and he gave me a credit card number
And here's what I need to get:
SENDERID to 5760 and I talked to him and he gave me a credit card number
Any help you guys could give me on this would be great. I'm just not sure how to go about getting the information I need.

Your question isn't clear enough, and it seems you have not used regex in this code... remember to say what you have tried before asking.
Anyway, the regex you're looking for is:
(\d{2}:\d{2}:\d{2}\.\d{3})\s\[D\].+<MBXID>(\d+)<\/MBXID><MBXTO>(\d+)<\/MBXTO>.+<MSGTEXT>(.+)<\/MSGTEXT>
It should capture:
$1: 08:25:20.740
$2: 1124
$3: 5760
$4: and I talked to him and he
gave me a credit card number (note that it also captures the \n, or newline, character).
(Also, you'll use matcher.group(number) instead of $number in Java).
And then you can use these substitution (group reference) terms to get your formatted output.
E.g.: $1 [$2] to [$3] $4
Should return:
08:25:20.740 [1124] to [5760] and I talked to him and he
gave me a credit card number
Remember that when you implement the regex in Java code you must escape every backslash (\) in the string literal, which is why the regex looks longer:
Pattern pattern = Pattern.compile("(\\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s\\[D\\].+<MBXID>(\\d+)<\\/MBXID><MBXTO>(\\d+)<\\/MBXTO>.+<MSGTEXT>(.+)<\\/MSGTEXT>", Pattern.MULTILINE | Pattern.DOTALL);
// DOTALL makes '.' also match line breaks, which is needed because one record spans several lines.
// MULTILINE only changes how ^ and $ behave and has no effect on this pattern; repeated matches come from the find() loop below.
// Note: with DOTALL the greedy .+ can run across several records, so use .+? if the input contains more than one message.
Matcher matcher = pattern.matcher(input);
while (matcher.find()){
String output = matcher.group(1) + " [" + matcher.group(2) + "] to [" + matcher.group(3) + "] " + matcher.group(4);
System.out.println(output);
}
And for your second problem (you have since edited it out of the question, but I'll still answer):
You can parse $2 and $3 as integers:
int id1 = Integer.parseInt(matcher.group(2));
int id2 = Integer.parseInt(matcher.group(3));
This way you can create a method to return a name for these IDs. e.g.: UserUtil.getName(int id)
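The question also asks for the sender id, which sits in the [S:...] field of the record that follows the message. A sketch of one way to capture it is shown below. It assumes the whole log has been read into a single string, that the sender is the fourth number in the [S:...] field (as the question describes), and that the first [S:...] field after a message belongs to that message. The class name LanMessageParser is illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LanMessageParser {
    // Group 1: recipient (MBXTO), group 2: message text, group 3: sender id.
    // All quantifiers over '.' are reluctant so that one match covers one message only.
    private static final Pattern MESSAGE = Pattern.compile(
            "<MBXTO>(\\d+)</MBXTO>.*?<MSGTEXT>(.*?)</MSGTEXT>"
            + ".*?\\[S:\\d+:\\d+:\\d+:(\\d+):\\d+\\]",
            Pattern.DOTALL);

    /** Returns one "SENDER to RECIPIENT text" line per message found in the log. */
    static List<String> format(String log) {
        List<String> lines = new ArrayList<>();
        Matcher m = MESSAGE.matcher(log);
        while (m.find()) {
            // The message text may be wrapped across lines; collapse the whitespace.
            String text = m.group(2).replaceAll("\\s+", " ").trim();
            lines.add(m.group(3) + " to " + m.group(1) + " " + text);
        }
        return lines;
    }

    public static void main(String[] args) {
        String log = "08:25:20.740 [D] [T:000FF0] [F:LANTALK2C] <CMD>LANMSG</CMD>\n"
                + "<MBXID>1124</MBXID><MBXTO>5760</MBXTO><SUBTEXT>LanTalk</SUBTEXT><MOBILEADDR>\n"
                + "</MOBILEADDR><LAP>0</LAP><SMS>0</SMS><MSGTEXT>and I talked to him and he\n"
                + "gave me a credit card number</MSGTEXT>\n"
                + "08:25:20.751 [+] [T:000FF0] [S:1:1:1124:5607:5] LANMSG [15/2 | 0]\n";
        format(log).forEach(System.out::println);
    }
}
```

Reading line by line with a Scanner, as in the question, would split a wrapped message across iterations, which is why this sketch works on the full text instead.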

How do I sort my pattern match results as complete match, half a match and so on [duplicate]

Possible Duplicate:
Regular expression not matching subwords in phrase
My program displays the matching results, but I want to sort the results as complete match (100%), half a match and so on.
My text file contains the following lines:
Red car
Red
Car
So if I search for “red car”, I get the following results:
Red car
Red
Car
So what I want to do is to sort the found results as follows:
"red car" 100% match
"red" 40% match
"car" 40% match
Any help is appreciated. My code is as follows:
public static void main(String[] args) {
// TODO code application logic here
String strLine;
try{
// Open the file that is the first
// command line parameter
FileInputStream fstream = new FileInputStream("C:\\textfile.txt");
// Get the object of DataInputStream
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
Scanner input = new Scanner (System.in);
System.out.print("Enter Your Search: "); // String key="red or yellow";
String key = input.nextLine();
while ((strLine = br.readLine()) != null) {
Pattern p = Pattern.compile(key); // regex pattern to search for
Matcher m = p.matcher(strLine); // src of text to search
boolean b = false;
while(b = m.find()) {
System.out.println( " " + m.group()); // prints the matched text
// Print the content on the console
}
}
//Close the input stream
in.close();
}catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}

Assuming you are searching for "Red" or "Yellow", and or is the only logical operator you need (no 'and' or 'xor') and you don't want to use any wildcards or regular-expressions in what you search for, then I would simply loop through, trying to match each String in turn against the line. In pseudo-code, something like:
foreach (thisLine: allLinesInTheFile) {
    numOfCharsMatching = 0
    foreach (thisString: allSearchStrings) {
        if (thisLine.contains(thisString)) {
            numOfCharsMatching = numOfCharsMatching + thisString.length
        }
    }
    score = ( numOfCharsMatching / thisLine.length ) * 100
}
If you don't want spaces to count in your score, then you'd need to exclude them from thisString.length (and not allow them in your search terms).
One other problem is that numOfCharsMatching will be incorrect if matches can overlap. For example, when searching for 'row' or 'brown' in 'brown row' and counting every occurrence, 'row' is found twice and 'brown' once, which gives 11 matching characters, more than the 9 characters in the string. You could use a BitSet to track which characters have been involved in a match, something like:
foreach (thisLine: allLinesInTheFile) {
    whichCharsMatch = new BitSet()
    foreach (thisString: allSearchStrings) {
        if (thisLine.contains(thisString)) {
            whichCharsMatch.set(startPositionOfMatch, endPositionOfMatch, true)
        }
    }
    score = ( whichCharsMatch.cardinality() / thisLine.length ) * 100
}
Have a look at the BitSet javadoc, particularly the set and cardinality methods
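The BitSet idea could look roughly like this in Java. This is a sketch rather than a finished implementation: the class name MatchScorer is made up, matching is done case-insensitively as an assumption based on the question's example, and every occurrence of every search term is marked.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Locale;

class MatchScorer {
    /** Percentage of the characters in line that are covered by at least one search term. */
    static double score(String line, List<String> terms) {
        String haystack = line.toLowerCase(Locale.ROOT);
        if (haystack.isEmpty()) {
            return 0.0;
        }
        BitSet covered = new BitSet(haystack.length());
        for (String term : terms) {
            String needle = term.toLowerCase(Locale.ROOT);
            if (needle.isEmpty()) {
                continue;
            }
            // Mark every occurrence; overlapping matches set the same bits only once.
            int from = haystack.indexOf(needle);
            while (from >= 0) {
                covered.set(from, from + needle.length()); // end index is exclusive
                from = haystack.indexOf(needle, from + 1);
            }
        }
        return 100.0 * covered.cardinality() / haystack.length();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("row", "brown");
        System.out.println(score("brown row", terms));
    }
}
```

For 'brown row' with the terms 'row' and 'brown', eight of the nine characters are covered (everything except the space), so the score stays below 100 even though the matches overlap.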
