Regex search a sequence and stop at first "," - java

I need some help with a regex in the following case.
I'm reading a folder with multiple files named like A.AAA2000.XYZ or B.BBB2000.AY.
I have to search every file for a line (or lines) matching a pattern like this:
CALL (or CALL-PROC or ENTER) $XX.whatever,whatever1,whatever2 and so on.
The XX.whatever part may be another file in my folder, or it may not exist at all. I need to find which files contain that pattern, check whether each XX.whatever named in it exists as a file, and output the ones that don't exist. The problem is that I have to stop at the first occurrence of ",", otherwise I get false results, and I can't get that to work properly. Everything else works; I just can't get rid of the ",". I've attached some code and an example below, please help if you can:
Example (as intended to work):
Searching file A.AAA2000.XYZ
Found procedure(s): $XX.B.BBB.2000.AY,LALA,LALA1,LALA2
Searching file B.BBB.2000.AY
Found procedure(s): $XX.C.CCC.2000.XYZ,LALALA,LALALALA,LALALALA
Searching file C.CCC.2000.XYZ
ERROR: File doesn't exist or no procedures called
Procedures found:
B.BBB.2000.AY
Procedures not found:
C.CCC.2000.XYZ
Example2 (how it's working right now):
Searching file A.AAA2000.XYZ
Found procedure(s): #XX.B.BBB.2000.AY,LALA,LALA1,LALA2
Searching file B.BBB.2000.AY
Found procedure(s): #XX.C.CCC.2000.XYZ,LALALA,LALALALA,LALALALA
Searching file C.CCC.2000.XYZ
ERROR: File doesn't exist or no procedures called
...........................
...........................
...........................
Procedures found:
XX.C.CCC.2000.XYZ,LALALA,LALALALA,LALALALALA
Procedures not found:
B.BBB.2000.AY,LALA,LALA1,LALA2
C.CCC.2000.XYZ,LALALA,LALALALA,LALALALA
Parts of code:
private static final String[] _keyWords = {"CALL-PROC", "CALL", "ENTER"};
private static final String _procedureRegex = ".* \\$PR\\..*";
private static final String _lineSkipper = "/REMARK";
private static final String _procedureNameFormat = "\\$PR\\..+";
private static boolean CallsProcedure(String givenLine) {
    for (String keyWord : _keyWords) {
        if (givenLine.contains(keyWord) && !givenLine.contains(_lineSkipper)) {
            Pattern procedurePattern = Pattern.compile(_procedureRegex);
            Matcher procedureMatcher = procedurePattern.matcher(givenLine);
            return procedureMatcher.find();
        }
    }
    // No keyword on this line, or the line is a /REMARK: not a procedure call.
    return false;
}
READING:
private void ReadContent(File givenFile,
HashMap<String, HashSet<String>> whereToAddProcedures,
HashMap<String, HashSet<String>> whereToAddFiles) throws IOException {
System.out.println("Processing file " + givenFile.getAbsolutePath());
BufferedReader fileReader = new BufferedReader(new FileReader(givenFile));
String currentLine;
while ((currentLine = fileReader.readLine()) != null) {
if (CallsProcedure(currentLine)) {
String CProc = currentLine.split("\\$PR\\.")[1];
if (whereToAddProcedures.containsKey(CProc)) {
System.out.println("Procedure " + CProc + " already exists, adding more paths.");
whereToAddProcedures.get(CProc).add(givenFile.getAbsolutePath());
} else {
System.out.println("Adding Procedure " + CProc);
whereToAddProcedures.put(CProc,
new HashSet<>(Collections.singletonList(givenFile.getAbsolutePath())));
}
if (givenFile.getName().matches(_procedureNameFormat)) {
if (whereToAddFiles.containsKey(givenFile.getAbsolutePath())) {
System.out.println("File " + givenFile.getName()
+ " already has procedure calls, adding " + CProc);
whereToAddProcedures.get(givenFile.getName()).add(CProc);
} else {
System.out.println("Adding Procedure Call for " + CProc + " to "
+ givenFile.getName());
whereToAddProcedures.put(givenFile.getName(),
new HashSet<>(Collections.singletonList(CProc)));
}
}
}
}
fileReader.close();
}

If the comma marks the end of the pattern, use a negated character class so that matching stops at the first comma. Like this:
_procedureRegex = ".* \\$PR\\.[^,]*";
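Note that the reading code in the question takes the name from currentLine.split("\\$PR\\.")[1], so the text after the comma would still end up in CProc. A minimal sketch of pulling the name out with a capturing group instead is shown here. The class and method names are hypothetical, the $PR. prefix and /REMARK skip marker are taken from the question's code, and stopping at whitespace as well as at a comma is an assumption.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
class ProcedureNameExtractor {
    // Keyword, whitespace, "$PR.", then capture everything up to the
    // first comma or whitespace character (assumption: names contain neither).
    private static final Pattern CALL = Pattern.compile(
            "\\b(?:CALL-PROC|CALL|ENTER)\\s+\\$PR\\.([^,\\s]+)");

    static Optional<String> procedureName(String line) {
        if (line.contains("/REMARK")) {
            return Optional.empty(); // remark lines are skipped, as in the question
        }
        Matcher matcher = CALL.matcher(line);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(procedureName("CALL $PR.B.BBB.2000.AY,LALA,LALA1,LALA2"));
    }
}
```

With this, CProc could be taken from procedureName(currentLine) rather than from the split, so only the part before the first comma is stored.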

Related

.contains() not finding multiple lines in java

I'm trying to use String.contains() to match specific lines of text.
I'm reading in the lines of a file using Files.readAllLines.
This is what I'm trying to do:
Path c1=Paths.get(prop.getProperty("testPWP"));
List<String> newLines1 = new ArrayList<String>();
for (String line : Files.readAllLines(c1, StandardCharsets.UTF_8)) {
if (line.contains("return test ;\r\n" + " }")) {
newLines1.add( line.replace("return test ;\r\n" +
" }", "return test ;\r\n" +
" }*/"));
}
else {
newLines1.add(line);
}
}
Files.write(c1, newLines1, StandardCharsets.UTF_8);
I'm basically trying to comment out the } after the return statement, but contains() doesn't recognize it because the } is on a new line in the file.
Any help with this issue?
As you may have noticed, Files.readAllLines reads all lines and returns a list in which each string represents a line. To accomplish what you are trying to do, you either need to read the entire file into a single string, or concatenate the strings you already have, or change your approach of substitution. The easiest way would be to read the entire contents of the file into one string, which can be accomplished as follows:
String content = new String(Files.readAllBytes(Paths.get("path to file")));
or if you are using Java 11 or higher:
String content = Files.readString(Paths.get("path to file"));
Once the content is in a single string, you can use replaceAll with a capturing group and refer to that group in the replacement string.
Demo:
public class Main {
public static void main(String[] args) {
String find = "return test ;\r\n" + " }";
String str = "Hello return test ;\r\n" + " } Hi Bye";
boolean found = str.contains(find);
System.out.println(found);
if (found) {
str = str.replaceAll("(" + find + ")", "/*$1*/");
}
System.out.println(str);
}
}
Output:
true
Hello /*return test ;
}*/ Hi Bye
Here $1 specifies the capturing group, group(1).
In your program, the value of str can be populated as follows:
String str = Files.readString(path, StandardCharsets.US_ASCII);
In case your Java version is less than 11, you do it as follows:
String str = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.US_ASCII);
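Because the search text here is a fixed string rather than a pattern, a plain String.replace avoids regex escaping altogether. The following is a minimal sketch under a few assumptions: the class and method names are hypothetical, the file uses \r\n line endings, and the whitespace before the } is exactly as written in the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original code.
class CommentOutBrace {
    // Literal (non-regex) replacement across a line break.
    static String commentOutClosingBrace(String content) {
        String target = "return test ;\r\n" + " }";
        return content.replace(target, "return test ;\r\n" + " }*/");
    }

    // Reads the whole file as one string, so the match can span two lines.
    static void rewrite(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        String updated = commentOutClosingBrace(content);
        Files.write(file, updated.getBytes(StandardCharsets.UTF_8));
    }
}
```

If the file might use \n line endings instead, the target string would need to be adjusted accordingly.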

using regex output in a method

I'm using a regex to read data from a file, but I'm having trouble using the data I read.
Here is my code:
File file = new File(eventsFile);
try {
Scanner sc = new Scanner(file);
while(sc.hasNext()){
String eventLine = sc.nextLine();
Pattern pattern = Pattern.compile("^Event=(?<event>[^,]*),time=(?<time>[^,]*)(,rings=(?<rings>[^,]*))?$");
Matcher matcher = pattern.matcher(eventLine);
while (matcher.find()) {
System.out.print(matcher.group("event") + " " + matcher.group("time"));
String eventName = matcher.group("event");
int time = Integer.parseInt(matcher.group("time"));
Class<?> eventClass = Class.forName(eventName);
Constructor<?> constructor = eventClass.getConstructor(long.class);
Event event = (Event) constructor.newInstance(time);
addEvent(event);
if (matcher.group(4) != null) {
System.out.println(" " + matcher.group(4));
} else {
System.out.println();
}
}
}
The print statements are only there temporarily, to make sure the file scanning and the regex work. What I'm trying to accomplish is to use matcher.group(1) and matcher.group(2) as follows: addEvent(new eventname(time)), where eventname is matcher.group(1) and time is matcher.group(2).
I tried creating variables to store group(1) and group(2) and using them in addEvent, but that didn't really work. Any ideas on how to approach this?
EDIT:
Example of text file
Event=ThermostatNight,time=0
Event=LightOn,time=2000
Event=WaterOff,time=10000
Event=ThermostatDay,time=12000
Event=Bell,time=9000,rings=5
Event=WaterOn,time=6000
Event=LightOff,time=4000
Event=Terminate,time=20000
Event=FansOn,time=7000
Event=FansOff,time=8000
I'm trying to reach a situation where an addEvent call runs for each of these lines in the text file, following this example: addEvent(new ThermostatNight(0));
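One possible way to get from a parsed line to an event object is sketched here. It swaps the reflection used in the code above for a lookup table of constructors, which keeps the set of allowed event names explicit. The Event, LightOn and WaterOff classes are hypothetical stand-ins for the asker's own classes, and only the event and time fields are used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real Event classes are not shown in the question.
class EventLineParser {
    static abstract class Event {
        final long time;
        Event(long time) { this.time = time; }
    }
    static class LightOn extends Event { LightOn(long time) { super(time); } }
    static class WaterOff extends Event { WaterOff(long time) { super(time); } }

    // Same pattern as in the question.
    private static final Pattern LINE = Pattern.compile(
            "^Event=(?<event>[^,]*),time=(?<time>[^,]*)(,rings=(?<rings>[^,]*))?$");

    // Maps an event name from the file to the matching constructor.
    private static final Map<String, LongFunction<Event>> FACTORIES = new HashMap<>();
    static {
        FACTORIES.put("LightOn", LightOn::new);
        FACTORIES.put("WaterOff", WaterOff::new);
    }

    // Returns the event described by the line, or null if the line does not
    // match or names an unknown event.
    static Event parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.matches()) {
            return null;
        }
        LongFunction<Event> factory = FACTORIES.get(matcher.group("event"));
        if (factory == null) {
            return null;
        }
        return factory.apply(Long.parseLong(matcher.group("time")));
    }
}
```

Each line read from the file could then be handled with something like addEvent(EventLineParser.parse(eventLine)), after a null check.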

How to extract the parameters from the output of a formatted string in Java

I am trying to parse the output of a program and extract the parameters used to generate these results. The output is in the form of sentences generated by the format function in Python, e.g.:
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. is generated from Opening browser '%s' to base url '%s'
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. is generated from Clicking element '%s'.
I want to extract the initial input parameters of the format function. My function would look something like:
private List<String> extractParameters(String output, String format){
// code would come here
}
The function takes as input the generated string and the format string that was used to generate it (e.g. "Clicking element '%s'.") and returns a sorted list of the parameters that were used (e.g. "xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]")
I started working on a method using regex, but I have many formats to manage, and since I'm not a regex expert, the solution I am moving towards is really ugly and unmaintainable. So the question is:
Is there an elegant way to achieve my goal in Java?
Regex should do the trick, but you should make sure the patterns are optimized and well written. For your examples above I made a simple line analyzer based on regex patterns:
class RegexLineAnalyzer {
private List<Pattern> patterns = new ArrayList<>();
public RegexLineAnalyzer() {
patterns.add(Pattern.compile("^Opening browser '(.+)' to base url '(.+)'", Pattern.CASE_INSENSITIVE));
patterns.add(Pattern.compile("^Clicking element '(.+)'", Pattern.CASE_INSENSITIVE));
// add other patterns
}
public List<String> extractParameters(String line) {
for (Pattern pattern : patterns) {
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
List<String> parameters = new ArrayList<>(matcher.groupCount());
for (int i = 0; i < matcher.groupCount(); i++) {
parameters.add(matcher.group(i + 1));
}
return parameters;
}
}
return Collections.emptyList();
}
}
I assume that the log files are split into lines. You can find out how to read and split files by lines efficiently on this page.
Example usage of the above analyzer could look like this:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
List<String> lines = new ArrayList<>();
lines.add("Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.");
lines.add("Clicking element 'xpath=.//a[contains(normalize-space(#class), \"cc-btn cc-dismiss\")]'.");
RegexLineAnalyzer regexLineAnalyzer = new RegexLineAnalyzer();
for (String line : lines) {
System.out.println(line + " => " + regexLineAnalyzer.extractParameters(line));
}
}
}
Prints:
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. => [Google Chrome, https://https://stackoverflow.com]
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. => [xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]]
EDITED
I thought you had a list of patterns that you can match against each line. In case you need to guess the pattern first and then analyse it to find the arguments, you can use a simpler solution based on the split function. We have to assume that each line contains an even number of ' characters. We would have a problem with lines like: Jon's browser is 'IE' or User last name is 'O'Reilly', or we could even face User's last name is 'O'Reilly'. A simple implementation could look like this:
class SplitLineAnalyzer {
public List<String> extractParameters(String line) {
final String regex = "'";
final String[] split = line.split(regex);
if (split.length % 2 == 0) {
System.out.println("Line contains unexpected number of parts. Hard to guess pattern for line = " + line);
return Collections.emptyList();
}
List<String> args = new ArrayList<>();
for (int i = 1; i < split.length; i += 2) {
args.add(split[i]);
split[i] = "%s";
}
Arrays.stream(split).reduce((s1, s2) -> s1 + regex + s2).ifPresent(s -> System.out.println("Possible pattern: " + s));
return args;
}
}
Example usage:
public class Main {
public static void main(String[] args) throws Exception {
List<String> lines = new ArrayList<>();
lines.add("Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'.");
lines.add("Clicking element 'xpath=.//a[contains(normalize-space(#class), \"cc-btn cc-dismiss\")]'.");
lines.add("'Firefox' is used by user 'Tom'.");
lines.add("Lines like this' could be broken.");
lines.add("User's first name is 'Jerry'.");
lines.add("User's last name is 'O'Reilly'");
SplitLineAnalyzer regexLineAnalyzer = new SplitLineAnalyzer();
for (String line : lines) {
System.out.println(line + " => " + regexLineAnalyzer.extractParameters(line));
System.out.println("");
}
}
}
Prints:
Possible pattern: Opening browser '%s' to base url '%s'.
Opening browser 'Google Chrome' to base url 'https://https://stackoverflow.com'. => [Google Chrome, https://https://stackoverflow.com]
Possible pattern: Clicking element '%s'.
Clicking element 'xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]'. => [xpath=.//a[contains(normalize-space(#class), "cc-btn cc-dismiss")]]
Possible pattern: '%s' is used by user '%s'.
'Firefox' is used by user 'Tom'. => [Firefox, Tom]
Line contains unexpected number of parts. Hard to guess pattern for line = Lines like this' could be broken.
Lines like this' could be broken. => []
Line contains unexpected number of parts. Hard to guess pattern for line = User's first name is 'Jerry'.
User's first name is 'Jerry'. => []
Line contains unexpected number of parts. Hard to guess pattern for line = User's last name is 'O'Reilly'
User's last name is 'O'Reilly' => []
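Since the question's signature takes both the output and the format string, another option is to build the regex from the format string itself. The sketch below is one hedged way to do that: the class name is hypothetical, only %s placeholders are handled, and the format is assumed to match the whole output, including any trailing period. Literal parts are quoted with Pattern.quote so characters like . or ( are not treated as regex syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answers above.
class FormatParameterExtractor {
    static List<String> extractParameters(String output, String format) {
        // Turn "a '%s' b '%s'" into \Qa '\E(.*?)\Q' b '\E(.*?)\Q'\E
        StringBuilder regex = new StringBuilder();
        boolean first = true;
        for (String literal : format.split("%s", -1)) {
            if (!first) {
                regex.append("(.*?)"); // one capturing group per %s
            }
            regex.append(Pattern.quote(literal));
            first = false;
        }
        Matcher matcher = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(output);
        List<String> parameters = new ArrayList<>();
        if (matcher.matches()) {
            for (int i = 1; i <= matcher.groupCount(); i++) {
                parameters.add(matcher.group(i));
            }
        }
        return parameters; // empty if the output does not fit the format
    }
}
```

The result can be ambiguous when a parameter itself contains the literal text that follows its placeholder; the lazy groups then pick the shortest match for the earlier parameter.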

NPE in a do/while loop due to EOF...catching the EOF earlier to avoid the NPE [duplicate]

This question already has answers here:
What is a NullPointerException, and how do I fix it?
(12 answers)
Closed 5 years ago.
I have written this program to compare 2 files. They are 500mb to 2.8gb in size and are created every 6 hours. I have 2 files from 2 sources (NMD and XMP). They are broken up into lines of text that have fields separated by the pipe(|) character. Each line is a single record and may be up to 65,000 characters long. The data is about TV shows and movies, showing times and descriptive content. I have determined that any particular show or movie has a minimum of 3 pieces of data that will uniquely identify that show or movie. IE: CallSign, ProgramId and StartLong. The two sources for this data are systems called NMD and XMP hence that acronym added to various variables. So my goal is to compare a file created by NMD and one created by XMP and confirm that everything that NMD produces is also produced by XMP and that the data in each matched record is the same.
What I am trying to accomplish here is this:
1. Read the NMD file record by record for the 3 unique data fields.
2. Read the XMP file record by record and look for a match for the current record in the NMD file.
3. The NMD file should iterate one record at a time. Each NMD record should then be searched for in the entire XMP file, record by record.
4. Write a log entry in one of 2 files indicating success or failure and what that data was.
I have an NPE issue when I reach the end of the testdataXMP.txt file. I assume the same thing will happen for testdataNMD.txt. I'm trying to break out of the loop right after the readLine, since epgsRecordNMD or epgsRecordXMP will have just reached the end of the file at that point. The original NPE came from trying to do a string split on null data at the end of the file. Now I'm getting an NPE here, according to the debugger.
if (epgsRecordXMP.equals(null)) {
break;
}
Am I doing this wrong? If I'm really at the end of the file, readLine ought to return null, right?
I did it this way too, but in my limited experience they feel like effectively the same thing. It threw an NPE too.
if (epgsRecordXMP.equals(null)) break;
Here's the code...
public static void main(String[] args) throws java.io.IOException {
String epgsRecordNMD = null;
String epgsRecordXMP = null;
BufferedWriter logSuccessWriter = null;
BufferedWriter logFailureWriter = null;
BufferedReader readXMP = null;
BufferedReader readNMD = null;
int successCount = 0;
readNMD = new BufferedReader(new FileReader("d:testdataNMD.txt"));
readXMP = new BufferedReader(new FileReader("d:testdataXMP.txt"));
do {
epgsRecordNMD = readNMD.readLine();
if (epgsRecordNMD.equals(null)) {
break;
}
String[] epgsSplitNMD = epgsRecordNMD.split("\\|");
String epgsCallSignNMD = epgsSplitNMD[0];
String epgsProgramIdNMD = epgsSplitNMD[2];
String epgsStartLongNMD = epgsSplitNMD[9];
System.out.println("epgsCallsignNMD: " + epgsCallSignNMD + " epgsProgramIdNMD: " + epgsProgramIdNMD + " epgsStartLongNMD: " + epgsStartLongNMD );
do {
epgsRecordXMP = readXMP.readLine();
if (epgsRecordXMP.equals(null)) {
break;
}
String[] epgsSplitXMP = epgsRecordXMP.split("\\|");
String epgsCallSignXMP = epgsSplitXMP[0];
String epgsProgramIdXMP = epgsSplitXMP[2];
String epgsStartLongXMP = epgsSplitXMP[9];
System.out.println("epgsCallsignXMP: " + epgsCallSignXMP + " epgsProgramIdXMP: " + epgsProgramIdXMP + " epgsStartLongXMP: " + epgsStartLongXMP);
if (epgsCallSignXMP.equals(epgsCallSignNMD) && epgsProgramIdXMP.equals(epgsProgramIdNMD) && epgsStartLongXMP.equals(epgsStartLongNMD)) {
logSuccessWriter = new BufferedWriter (new FileWriter("d:success.log", true));
logSuccessWriter.write("NMD match found in XMP " + "epgsCallsignNMD: " + epgsCallSignNMD + " epgsProgramIdNMD: " + epgsProgramIdNMD + " epgsStartLongNMD: " + epgsStartLongNMD);
logSuccessWriter.write("\n");
successCount++;
logSuccessWriter.write("Successful matches: " + successCount);
logSuccessWriter.write("\n");
logSuccessWriter.close();
System.out.println ("Match found");
System.out.println ("Successful matches: " + successCount);
}
} while (epgsRecordXMP != null);
readXMP.close();
if (successCount == 0) {
logFailureWriter = new BufferedWriter (new FileWriter("d:failure.log", true));
logFailureWriter.write("NMD match not found in XMP" + "epgsCallsignNMD: " + epgsCallSignNMD + " epgsProgramIdNMD: " + epgsProgramIdNMD + " epgsStartLongNMD: " + epgsStartLongNMD);
logFailureWriter.write("\n");
logFailureWriter.close();
System.out.println ("Match NOT found");
}
} while (epgsRecordNMD != null);
readNMD.close();
}
}
You should not do this:
if (epgsRecordXMP.equals(null)) {
break;
}
If you want to know whether epgsRecordXMP is null, the if should look like this:
if (epgsRecordXMP == null) {
break;
}
To sum up: your app throws an NPE when it tries to call the equals method on epgsRecordXMP, because epgsRecordXMP is null at that point.
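A common way to avoid the separate null check is to fold the read and the test into the loop condition. This is a minimal sketch of that idiom; the class and method names are hypothetical, and a StringReader stands in for the real files so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the read-until-null idiom.
class RecordReader {
    static List<String[]> readRecords(BufferedReader reader) throws IOException {
        List<String[]> records = new ArrayList<>();
        String line;
        // readLine() returns null at end of file, which ends the loop
        // before any method is called on the null reference.
        while ((line = reader.readLine()) != null) {
            records.add(line.split("\\|"));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(new StringReader("A|x|P1\nB|y|P2\n"))) {
            for (String[] fields : readRecords(reader)) {
                System.out.println(fields[0] + " " + fields[2]);
            }
        }
    }
}
```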

Crawling a URL in order to extract all the other URLs in that page

I am trying to crawl URLs in order to extract the other URLs inside each page. To do so, I read the HTML code of the page line by line, match each line against a pattern and then extract the needed part, as shown below:
public class SimpleCrawler {
static String pattern="https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
static Pattern UrlPattern = Pattern.compile (pattern);
static Matcher UrlMatcher;
public static void main(String[] args) {
try {
URL url = new URL("https://stackoverflow.com/");
BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
String line;
while ((line = br.readLine()) != null) {
UrlMatcher= UrlPattern.matcher(line);
if(UrlMatcher.find())
{
String extractedPath = UrlMatcher.group(1);
String extractedPath2 = UrlMatcher.group(2);
System.out.println("http://www."+extractedPath+".com"+extractedPath2);
}
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
However, there are some issues with it that I would like to address:
1. How can I make http, www, or both of them optional? I have encountered many links that lack one or both parts, so the regex does not match them.
2. In my code I make two groups: one from http up to the domain extension, and a second for whatever comes after it. This, however, causes two sub-problems:
2.1 Since it is HTML code, the rest of the HTML tags that may come after the URL are extracted too.
2.2 In System.out.println("http://www."+extractedPath+".com"+extractedPath2); I cannot be sure it shows the right URL (regardless of the previous issues), because I do not know which domain extension was matched.
3. Last but not least, how can I match both http and https?
How about:
try {
boolean foundMatch = subjectString.matches(
"(?imx)^\n" +
"(# Scheme\n" +
" [a-z][a-z0-9+\\-.]*:\n" +
" (# Authority & path\n" +
" //\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=]+@)? # User\n" +
" ([a-z0-9\\-._~%]+ # Named host\n" +
" |\\[[a-f0-9:.]+\\] # IPv6 host\n" +
" |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\]) # IPvFuture host\n" +
" (:[0-9]+)? # Port\n" +
" (/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/? # Path\n" +
" |# Path without authority\n" +
" (/?[a-z0-9\\-._~%!$&'()*+,;=:@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?)?\n" +
" )\n" +
"|# Relative URL (no scheme or authority)\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/? # Relative path\n" +
" |(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path\n" +
")\n" +
"# Query\n" +
"(\\?[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
"# Fragment\n" +
"(\\#[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
"$");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
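For the narrower questions asked (optional http/https, optional www, not swallowing the trailing HTML, and knowing which extension matched), a much smaller pattern may be enough. The following is only a sketch, not a full URL grammar: the class name is hypothetical, the list of extensions is the one from the question, and the path is assumed to end at whitespace, a quote or an angle bracket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answers here.
class UrlExtractor {
    // group(1): scheme (http or https), optional
    // group(2): host without the extension, "www." is skipped if present
    // group(3): extension that actually matched
    // group(4): path, stopping before whitespace, quotes and angle brackets
    private static final Pattern URL = Pattern.compile(
            "(?:(https?)://)?(?:www\\.)?([a-z0-9-]+(?:\\.[a-z0-9-]+)*)\\.(com|net|org)\\b(/[^\\s\"'<>]*)?",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractUrls(String line) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(line);
        while (matcher.find()) {       // a line may contain several links
            urls.add(matcher.group()); // the whole match, exactly as written in the page
        }
        return urls;
    }
}
```

Using matcher.group() prints the URL as it appeared, so there is no need to rebuild it from the parts; matcher.group(3) gives the extension if it is needed separately.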
Alternatively, use a library. I used HtmlCleaner, and it does the job.
You can find it at:
http://htmlcleaner.sourceforge.net/javause.php
Another example (not tested), with jsoup:
http://jsoup.org/cookbook/extracting-data/example-list-links
It is rather readable.
You can enhance it to choose <a> tags or others, HREF, etc.,
or be more precise with case (HreF, HRef, ...), as an exercise.
import java.util.Map;
import java.util.Vector;
import org.htmlcleaner.*;
public static Vector<String> HTML2URLS(String _source)
{
Vector<String> result=new Vector<String>();
HtmlCleaner cleaner = new HtmlCleaner();
// Principal Node
TagNode node = cleaner.clean(_source);
// All nodes
TagNode[] myNodes =node.getAllElements(true);
int s=myNodes.length;
for (int pos=0;pos<s;pos++)
{
TagNode tn=myNodes[pos];
// all attributes
Map<String,String> mss=tn.getAttributes();
// Name of tag
String name=tn.getName();
// Is there href ?
String href="";
if (mss.containsKey("href")) href=mss.get("href");
if (mss.containsKey("HREF")) href=mss.get("HREF");
if (name.equals("a")) result.add(href);
if (name.equals("A")) result.add(href);
}
return result;
}
