Regex Matching Conflict with Overlapping Symbol

I'm trying to match tokens that all contain the symbol < or >, but there are some conflicts. In particular, my tokens are <, >, </, />, and a comment that starts with <!-- and ends with -->.
My regexes for these are as follows:
String LTHAN = "<";
String GTHAN = ">";
String LTHAN_SLASH = "</";
String GTHAN_SLASH = "/>";
String COMMENT = "<!--.*-->";
And I compile them by adding them to a list using the general method:
public void add(String regex, int token) {
tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
}
Here is what my TokenInfo class looks like:
private class TokenInfo {
public final Pattern regex;
public final int token;
public TokenInfo(Pattern regex, int token) {
super();
this.regex = regex;
this.token = token;
}
}
I match and display the list as follows:
public void tokenize(String str) {
String s = new String(str);
tokens.clear();
while (!s.equals("")) {
boolean match = false;
for (TokenInfo info : tokenInfos) {
Matcher m = info.regex.matcher(s);
if (m.find()) {
match = true;
String tok = m.group().trim();
tokens.add(new Token(info.token, tok));
s = m.replaceFirst("");
break;
}
}
}
}
Read and display:
try {
BufferedReader br;
String curLine;
String EOF = null;
Scanner scan = new Scanner(System.in);
StringBuilder sb = new StringBuilder();
try {
File dir = new File("C:\\Users\\Me\\Documents\\input files\\example.xml");
br = new BufferedReader(new FileReader(dir));
while ((curLine = br.readLine()) != EOF) {
sb.append(curLine);
// System.out.println(curLine);
}
br.close();
} catch (IOException e) {
System.out.println(e.getMessage());
}
tokenizer.tokenize(sb.toString());
for (Tokenizer.Token tok : tokenizer.getTokens()) {
System.out.println("" + tok.token + " " + tok.sequence);
}
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
Sample input:
<!-- Sample input file with incomplete recipe -->
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
<title>Basic bread</title>
<ingredient amount="3" unit="cups">Flour</ingredient>
<instructions>
<step>Mix all ingredients together.</step>
</instructions>
</recipe>
However, the outputted token list recognizes < and / (including whatever characters come after it) as separate tokens, meaning it can never seem to recognize the tokens </ and />. Same issue with the comments. Is this a problem with my regex? Why isn't it recognizing the patterns </ and />?
Hope my question is clear. Happy to provide more details/examples if necessary.

Problems:
Each of your regexes is wrapped in ^( ... ), so it is only tried at the start of the remaining input, and the list is searched in the order the patterns were added. If ^(<) is added before ^(</) and ^(<!--.*-->), it wins whenever the input starts with <, and the longer tokens are never reached. So you will have to fix that.
If the entire tag (without the text content, such as Basic bread or Mix all ingredients together) is considered a token, then it should be matched by a single regex.
Solution
Try changing the Regex to the following:
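For a single tag - <[^>]*>
For a single closing tag - </[^>]*>
For comments - <!--.*--> (this already works for a single comment; with several comments in the input, the greedy .* runs to the last -->, which the reluctant <!--.*?--> avoids)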
Sample Program
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
private static ArrayList<TokenInfo> tokenInfoList = new ArrayList<>();
private static ArrayList<String> tokensList = new ArrayList<>();
public static void add(String regex, int token) {
tokenInfoList.add(new TokenInfo(Pattern.compile(regex), token));
}
static {
String LTHAN = "<[^>]*>";
String LTHAN_SLASH = "</[^>]*>";
String COMMENT = "<!--.*-->";
add(LTHAN, 1);
add(LTHAN_SLASH, 3);
add(COMMENT, 5);
}
private static class TokenInfo {
public final Pattern regex;
public final int token;
public TokenInfo(Pattern regex, int token) {
super();
this.regex = regex;
this.token = token;
}
}
public static void tokenize(String str) {
String s = new String(str);
while (!s.equals("")) {
boolean match = false;
for (TokenInfo info : tokenInfoList) {
Matcher m = info.regex.matcher(s);
if (m.find()) {
match = true;
String tok = m.group().trim();
tokensList.add(tok);
s = m.replaceFirst("");
break;
}
}
// The following is under the assumption that the Text nodes within the document are not considered tokens and replaced
if (!match) {
break;
}
}
}
public static void main(String[] args) {
try {
BufferedReader br;
String curLine;
String EOF = null;
StringBuilder sb = new StringBuilder();
try {
File dir = new File("recipe.xml");
br = new BufferedReader(new FileReader(dir));
while ((curLine = br.readLine()) != EOF) {
sb.append(curLine);
// System.out.println(curLine);
}
br.close();
} catch (IOException e) {
System.out.println(e.getMessage());
}
tokenize(sb.toString());
for (String eachToken : tokensList) {
System.out.println(eachToken);
}
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
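If you do want <, >, </ and /> as separate tokens, as in the question, the ordering point can be sketched roughly like this. It is only a sketch under assumptions: the class name OrderedTokenizer, the rule names, and the catch-all TEXT rule are made up for illustration and are not part of the original code. Longer patterns are registered before the shorter ones they overlap with, and lookingAt() on a region replaces the ^ anchor plus replaceFirst("").

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OrderedTokenizer {
    // LinkedHashMap keeps insertion order, so longer overlapping patterns are tried first.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("COMMENT", Pattern.compile("<!--.*?-->", Pattern.DOTALL));
        RULES.put("LTHAN_SLASH", Pattern.compile("</"));
        RULES.put("GTHAN_SLASH", Pattern.compile("/>"));
        RULES.put("LTHAN", Pattern.compile("<"));
        RULES.put("GTHAN", Pattern.compile(">"));
        // Hypothetical catch-all: anything else, or a lone slash, so the scan always advances.
        RULES.put("TEXT", Pattern.compile("[^<>/]+|/"));
    }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) { // the match has to start exactly at pos
                    String text = m.group().trim();
                    if (!text.isEmpty()) {
                        tokens.add(rule.getKey() + ":" + text);
                    }
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Stop instead of looping forever on input that no rule covers.
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<!-- c --><a b=\"1\"/></a>"));
    }
}
```

Swapping the order of LTHAN and LTHAN_SLASH in this sketch brings back the behaviour from the question: < matches first and the slash ends up in a separate token.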
References
http://www.regular-expressions.info/ is a great resource for learning regular expressions.

Related

StringTokenizer doesn't read the first line of the file.txt

I'm trying to take every single word from a text file and put it into an ArrayList, but the StringTokenizer doesn't read the first line of the text file. What's wrong?
public class BufferReader {
public static void main(String[] args) throws FileNotFoundException, IOException {
BufferedReader reader = new BufferedReader(new FileReader("C://Java-projects//EsameJava//prova.txt"));
String line = reader.readLine();
List<String> str = new ArrayList<>();
while ((line = reader.readLine()) != null) {
StringTokenizer token = new StringTokenizer(line);
while (token.hasMoreTokens()) {
str.add(token.nextToken());
}
}
System.out.println(str);
}
}
The only solution I found is to start the text file from the second line but it's not what I want...

This is how you could marry the (very) old and the new(er) to provide a collection of words:
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import java.nio.file.Files;
import java.nio.file.Paths;
public class WordCollector {
public static void main(String[] args) {
try {
List<String> words = WordCollector.getWords(Files.lines(Paths.get(args[0])));
System.out.println(words);
} catch (Throwable t) {
t.printStackTrace();
}
}
public static List<String> getWords(Stream<String> lines) {
List<String> result = new ArrayList<>();
BreakIterator boundary = BreakIterator.getWordInstance();
lines.forEach(line -> {
boundary.setText(line);
int start = boundary.first();
for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
String candidate = line.substring(start, end).replaceAll("\\p{Punct}", "").trim();
if (candidate.length() > 0) {
result.add(candidate);
}
}
});
return result;
}
}
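As for why the original code loses the first line: String line = reader.readLine(); reads line one before the loop starts, and the loop condition immediately reads line two over it. A minimal sketch of the fix, keeping StringTokenizer, could look like the following. The class name FirstLineFix and the in-memory StringReader input are only there to make the example self-contained; with a file you would pass a FileReader as in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class FirstLineFix {
    static List<String> readWords(BufferedReader reader) throws IOException {
        List<String> words = new ArrayList<>();
        String line; // declared only; the first readLine() happens in the loop condition
        while ((line = reader.readLine()) != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                words.add(tokens.nextToken());
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file from the question.
        try (BufferedReader reader = new BufferedReader(new StringReader("first line\nsecond line"))) {
            System.out.println(readWords(reader)); // prints [first, line, second, line]
        }
    }
}
```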

merging two array lists in java

I have two arraylists
arraylist dName has values:
mark, 22
peter, 34
ken, 55
arraylist dest has values:
mark, London
peter, Bristol
mark, Cambridge
I want to merge them so that their output gives:
mark
London
Cambridge
peter
Bristol
Ken
This is the code I have for now. I'm not really sure how to split on the comma and search the other array.
public class Sample {
BufferedReader br;
BufferedReader br2;
public Sample() {
ArrayList<String> dName = new ArrayList<String>();
ArrayList<String> dest = new ArrayList<String>();
String line = null;
String lines = null;
try {
br = new BufferedReader(new FileReader("taxi_details.txt"));
br2 = new BufferedReader(new FileReader("2017_journeys.txt"));
while ((line = br.readLine()) != null &&
(lines = br2.readLine()) != null){
String name [] = line.split(";");
String destination [] = lines.split(",");
// add values to ArrayList
dName.add(line);
dest.add(lines);
// iterate through destination
for (String str : destination) {
}
}
}
catch (FileNotFoundException ex) {
ex.printStackTrace();
} catch (IOException ex) {
ex.printStackTrace();
} finally {
try {
if (br != null)
br.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
public static void main(String[] args) throws IOException {
}
}

Now, I'm not sure whether this is the proper way, but at least it is working.
taxi_details.txt
mark, 22
peter, 34
ken, 55
2017_journeys.txt
mark, London
peter, Bristol
mark, Cambridge
FileReader
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;
public class FileReader {
public List<String> read(String fileName) throws IOException{
// readAllLines closes the file itself; a Files.lines stream would have to be closed explicitly
return Files.readAllLines(new File(fileName).toPath());
}
}
This class lets you avoid all the messy try-catch blocks.
Line
public class Line{
public static final String DELIMITER = ",";
public static final int INDEX_NAME = 0;
public static final int INDEX_VALUE = 1;
private String line;
private String[] values;
public Line(String line) {
this.line = line;
this.values = line.split(DELIMITER);
}
public String getName(){
return this.values[INDEX_NAME];
}
public String getValue(){
return this.values[INDEX_VALUE];
}
public void emptyValue(){
this.values[INDEX_VALUE] = "";
}
@Override
public String toString() {
return this.line;
}
}
This class has the mere purpose of preparing the data as needed for merging.
Main
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
public class Main {
public static void main(String[] args) throws IOException {
FileReader fileReader = new FileReader();
// Read lines
List<String> dName = fileReader.read("taxi_details.txt");
List<String> dest = fileReader.read("2017_journeys.txt");
// Convert into proper format
List<Line> dNameLines = dName.stream().map(Line::new).collect(Collectors.toList());
List<Line> destLines = dest.stream().map(Line::new).collect(Collectors.toList());
// Remove ID
dNameLines.forEach(Line::emptyValue);
// Merge lists
Map<String, String> joined = join(dNameLines, destLines);
// Print
for (Entry<String, String> line: joined.entrySet()) {
System.out.println(line.getKey() + " --> " + line.getValue());
}
}
public static Map<String, String> join(List<Line> a, List<Line> b){
Map<String, String> joined = new HashMap<>();
// Put first list into map, as there is no danger of overwriting existing values
a.forEach(line -> {
joined.put(line.getName(), line.getValue());
});
// Put second list into map, but check for existing keys
b.forEach(line -> {
String key = line.getName();
if(joined.containsKey(key)){ // Actual merge
String existingValue = joined.get(key);
String newValue = line.getValue();
if(!existingValue.isEmpty()){
newValue = existingValue + Line.DELIMITER + newValue;
}
joined.put(key, newValue);
}else{ // Add entry normally
joined.put(line.getName(), line.getValue());
}
});
return joined;
}
}
You might want to move the join method into its own class.
Output
peter --> Bristol
ken -->
mark --> London, Cambridge

You should iterate on array B.
For each string, split on the comma and search in A for a string that starts with the first part of the split.
Then append the second part of the split to the entry found in A.
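A rough sketch of those steps, under a few assumptions: the class name MergeByName is made up, the two lists are hard-coded instead of read from the files, and the search in A is done with a map keyed by name rather than a startsWith scan over the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeByName {
    static Map<String, List<String>> merge(List<String> dName, List<String> dest) {
        // One entry per name from A, in the original order, with an empty destination list.
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (String entry : dName) {
            merged.put(entry.split(",")[0].trim(), new ArrayList<>());
        }
        // Iterate over B, split on the comma, and append the second part to the matching name.
        for (String entry : dest) {
            String[] parts = entry.split(",");
            List<String> places = merged.get(parts[0].trim());
            if (places != null && parts.length > 1) {
                places.add(parts[1].trim());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> dName = Arrays.asList("mark, 22", "peter, 34", "ken, 55");
        List<String> dest = Arrays.asList("mark, London", "peter, Bristol", "mark, Cambridge");
        // Print each name followed by its destinations, one per line.
        merge(dName, dest).forEach((name, places) -> {
            System.out.println(name);
            places.forEach(System.out::println);
        });
    }
}
```

Names in B that do not appear in A are silently skipped here; whether that is the right behaviour depends on your data.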

Find the "FriendlyName" of a com port given its COMx name under windows in java

I need to determine the "friendly name" of a COM port given its COM# name.
I found some answers, but they were either for C# or C++.
Is there a method in (possibly pure) java?

After a long search I ended up writing the following class, which seems to work for me; obviously it's a Win32-only thing.
I hope it helps other people.
package registry;
import static com.sun.jna.platform.win32.WinReg.HKEY_LOCAL_MACHINE;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.sun.jna.platform.win32.Advapi32Util;
import com.sun.jna.platform.win32.Win32Exception;
public class FriendlyName {
private static final String ENUM = "SYSTEM\\CurrentControlSet\\Enum\\USB";
private Map<String, String> friendlyNames;
private static final String KEY = "HARDWARE\\DEVICEMAP\\SERIALCOMM";
public FriendlyName() {
friendlyNames = new HashMap<>();
Pattern p = Pattern.compile(".*?\\(([^)]+)\\)");
for (String dev : Advapi32Util.registryGetKeys(HKEY_LOCAL_MACHINE, ENUM)) {
String sb = ENUM + "\\" + dev;
for (String itm : Advapi32Util.registryGetKeys(HKEY_LOCAL_MACHINE, sb)) {
String si = sb + "\\" + itm;
String fn = null;
try {
fn = Advapi32Util.registryGetStringValue(HKEY_LOCAL_MACHINE, si, "FriendlyName");
} catch (Win32Exception e) {}
if (fn != null) {
Matcher m = p.matcher(fn);
if (m.matches()) {
friendlyNames.put(m.group(1), fn);
}
}
}
}
}
String get(String key) {
return friendlyNames.get(key);
}
public String getCOM(String name) {
try {
for (Entry<String, Object> sub : Advapi32Util.registryGetValues(HKEY_LOCAL_MACHINE, KEY).entrySet()) {
String n = (String) sub.getValue();
String fn = get(n);
if (fn != null && fn.startsWith(name))
return n;
}
} catch (IllegalArgumentException e) {
System.err.println(e);
}
return null;
}
public static void main(String[] args) {
FriendlyName fn = new FriendlyName();
System.out.println(fn.getCOM(args[0]));
}
}

Trying to read a text file using regex to check each line

I am trying to write a program that lets a user enter the name of a movie and then prints the date associated with it. I have a text file that lists dates and the movies that belong to each. I am reading the file via Scanner, and I created a movie class that stores an ArrayList of movies and a String for the date. I am having trouble reading the file. Can anyone please assist me? Thank you!
Here is a part of the text file:
10/1/2014
Der Anstandige
"Men, Women and Children"
Nas: Time is Illmatic
10/2/2014
Bang Bang
Haider
10/3/2014
Annabelle
Bitter Honey
Breakup Buddies
La chambre bleue
Drive Hard
Gone Girl
The Good Lie
A Good Marriage
The Hero of Color City
Inner Demons
Left Behind
Libertador
The Supreme Price
Here is my movie class
import java.util.ArrayList;
public class movie
{
private ArrayList<String> movies;
private String date;
public movie(ArrayList<String> movies, String date)
{
this.movies = movies;
this.date = date;
}
public String getDate()
{
return date;
}
public void setDate(String date)
{
this.date = date;
}
public ArrayList<String> getMovies()
{
return movies;
}
}
Here is the readFile class
package Read;
import java.util.List;
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
public class readFile
{
public static List<movie> movies;
public static String realPath;
public static ArrayList<String> mov;
public static String j;
public static String i;
public static void main(String[]args)
{
//movies = new ArrayList<movie>();
realPath = "movie_release_dates.txt";
File f = new File(realPath);
try
{
String regex1 = "[^(0-9).+]";
String regex2 = "[^0-9$]";
Scanner sc = new Scanner(f);
while (sc.hasNextLine())
{
System.out.println("Hello");
//movies
if(!sc.nextLine().matches(regex2))
{
i = sc.nextLine();
System.out.println("Hello2");
System.out.println(i);
}
//date
while(sc.nextLine().matches(regex1))
{
System.out.println("Hello3");
if(!sc.nextLine().matches(regex1))
{
j = sc.nextLine();
mov.add(sc.nextLine());
System.out.println("Hello4");
}
}
movie movie = new movie(mov,i);
movies.add(movie);
}
// sc.close();
}
catch(Exception e)
{
System.out.println("CANT");
}
}
}

You shouldn't be calling sc.nextLine() in every check. Every nextLine() call reads the next line, which means that you are checking one line and then processing the following one.
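A minimal sketch of reading each line exactly once and deciding afterwards whether it is a date or a title. The class name MovieDates and the map-based grouping are assumptions made for illustration, and the date regex assumes the m/d/yyyy format shown in the sample file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class MovieDates {
    static Map<String, List<String>> parse(Scanner sc) {
        Map<String, List<String>> byDate = new LinkedHashMap<>();
        List<String> current = null; // titles for the most recent date line
        while (sc.hasNextLine()) {
            String line = sc.nextLine().trim(); // the only nextLine() call per iteration
            if (line.isEmpty()) {
                continue;
            }
            if (line.matches("\\d{1,2}/\\d{1,2}/\\d{4}")) {
                current = new ArrayList<>();
                byDate.put(line, current);
            } else if (current != null) {
                current.add(line);
            }
        }
        return byDate;
    }

    public static void main(String[] args) {
        // Stand-in for new Scanner(new File("movie_release_dates.txt")).
        Scanner sc = new Scanner("10/1/2014\nDer Anstandige\n10/2/2014\nBang Bang\nHaider");
        System.out.println(parse(sc));
        // prints {10/1/2014=[Der Anstandige], 10/2/2014=[Bang Bang, Haider]}
    }
}
```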

package com.stackoverflow.q26269799;
import java.util.List;
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
public class ReadFile {
public static List<Movie> movies = new ArrayList<Movie>();
public static String realPath;
public static ArrayList<String> mov;
public static String j;
public static String i;
public static void main(String[] args) {
//movies = new ArrayList<movie>();
realPath = "movie_release_dates.txt";
File f = new File(realPath);
if ( !f.exists()) {
System.err.println("file path not specified");
}
try {
String regex1 = "[^(0-9).+]";
String regex2 = "[^0-9$]";
Scanner sc = new Scanner(f);
while (sc.hasNextLine()) {
System.out.println("Hello");
// movies
String nextLine = sc.nextLine();
if (nextLine != null) {
if ( !nextLine.matches(regex2)) {
i = nextLine;
System.out.println("Hello2");
System.out.println(i);
}
// date
while (nextLine != null && nextLine.matches(regex1)) {
System.out.println("Hello3");
if ( !nextLine.matches(regex1)) {
j = nextLine;
mov.add(nextLine);
System.out.println("Hello4");
}
nextLine = sc.nextLine();
}
}
Movie movie = new Movie(mov, i);
movies.add(movie);
}
// sc.close();
} catch(Exception e) {
throw new RuntimeException(e);
}
}
}
The commented-out initialization is needed: movies = new ArrayList<movie>(); (mov has to be initialized as well before mov.add is called).
Every call to nextLine() moves the scanner to the next line, so call it once per iteration, store the result (String nextLine = sc.nextLine();), and check that stored value against the regexes.
Please also check whether the file path is specified correctly.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
public class ReadFile
{
Map<String, String> movies;
public static void main(String[] args) throws IOException
{
ReadFile readFile = new ReadFile();
readFile.movies = new TreeMap<>();
try
{
readFile.importData();
printf(readFile.queryData("Der Anstandige"));
printf(readFile.queryData("Bitter"));
printf(readFile.queryData("blah"));
printf(readFile.queryData("the"));
}
catch(IOException e)
{
throw(e);
}
}
void importData() throws IOException, FileNotFoundException
{
LineNumberReader reader = null;
File file = new File("c:/movie_release_dates.txt");
try
{
reader = new LineNumberReader(new FileReader(file), 1024*64); //
String line;
String date = null, movie = null;
while((line = reader.readLine()) != null)
{
line = line.trim();
if(line.equals("")) continue;
if(line.matches(PATTERN_DATE))
{
date = line;
date = strf("%s/%s",
date.substring(date.length() - 4),
date.substring(0, date.length() - 5));
continue;
}
else
{
movie = line.trim();
}
movies.put(movie, date);
}
}
catch(FileNotFoundException e)
{
throw(e);
}
finally
{
reader.close();
}
}
String queryData(String title)
{
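// In a replacement string a backslash escapes the next character, so "\\\\s+" is needed to produce the regex \s+
String regex = "(?i)" + title.replaceAll("\\s", "\\\\s+");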
String[] matches = new String[movies.size()];
int i = 0; for(Entry<String , String> movie : movies.entrySet())
{
String key = movie.getKey();
String val = movie.getValue();
if(key.matches(regex))
{
matches[i++] = strf("{movie=%s, date=%s}", key, val);
}
else if(key.toUpperCase().trim()
.contains(title.toUpperCase().trim()))
{
matches[i++] = strf("{movie=%s, date=%s}", key, val);
}
}
String string = "";
if(matches[0] == null)
{
string = "Not found\n";
}
else
{
i = 0; while(matches[i] != null)
{
string += matches[i++] + "\n";
}
}
return string;
}
final String strf(String arg0, Object ... arg1)
{
return String.format(arg0, arg1);
}
final static void printf(String format, Object ... args)
{
System.out.printf(format, args);
}
final static void println(String x)
{
System.out.println(x);
}
final String PATTERN_DATE = "\\d{1,2}\\/\\d{1,2}\\/\\d{4}";
}
Console output:
{movie=Der Anstandige, date=2014/10/1}
{movie=Bitter Honey, date=2014/10/3}
Not found
{movie=The Good Lie, date=2014/10/3}
{movie=The Hero of Color City, date=2014/10/3}
{movie=The Supreme Price, date=2014/10/3}

IllegalStateException while using java matcher class

I am trying to get a webpage, load it into a StringBuilder using a BufferedReader, and then use a regex to look for and retrieve words, or in this case groups of words (department names like computer-science, Electrical-Engineering, etc.), that match the regex pattern. I am using the Pattern and Matcher classes that Java provides but am running into an IllegalStateException. I have been staring at this code for quite a while and would like some fresh perspective on what the problem might be. I know it has something to do with the m.find() and m.group() methods. Any help would be greatly appreciated.
From the output I am getting, I would say it recognizes the first word that matches the regex and starts throwing IllegalStateException after that.
I have also posted my code below:
public class Parser{
static StringBuilder theWebPage;
ArrayList<String> courseNames;
//ArrayList<parserObject> courseObjects;
public static void main(String[] args)
{
Parser p = new Parser();
theWebPage = new StringBuilder();
try {
URL theUrl = new URL("http://ocw.mit.edu/courses/");
BufferedReader reader = new BufferedReader(new InputStreamReader(theUrl.openStream()));
String str = null;
while((str = reader.readLine())!=null)
{
theWebPage.append(" ").append(str);
//System.out.println(theWebPage);
}
//System.out.println(theWebPage);
reader.close();
} catch (MalformedURLException e) {
System.out.println("MalformedURLException");
} catch (IOException e) {
System.out.println("IOException");
}
p.matchString();
}
public Parser()
{
//parserObject courseObject = new parserObject();
//courseObjects = new ArrayList<parserObject>();
courseNames = new ArrayList<String>();
//theWebPage=" ";
}
public void matchString()
{
String matchRegex = "#\\w+(-\\w+)+";
Pattern p = Pattern.compile(matchRegex);
Matcher m = p.matcher(theWebPage);
int i=0;
int x=0;
//m.reset();
while(!(m.matches()))
{
System.out.println("inside matches method " + i);
try{
m.find();
x = m.end();
System.out.println( m.group());
PrintStream out = new PrintStream(new FileOutputStream("/Users/xxxx/Desktop/output.txt"));
System.setOut(out);
//courseNames.add(i,m.group());
i++;
}catch(IllegalStateException e)
{
System.out.println("IllegalStateException");
} catch (FileNotFoundException e) {
System.out.println("FileNotFound Exception");
}
}
}
}

The problem is that you call:
x = m.end();
even though you may not have a match. Why not incorporate your call to find() into your while statement, thereby making it a guard statement also:
while (m.find()) {
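Spelled out, the guarded loop might look like the sketch below. The regex is the one from the question; the class name FindLoop and the sample input string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindLoop {
    static List<String> findAll(CharSequence text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("#\\w+(-\\w+)+").matcher(text);
        // group() and end() are only valid after find() has returned true,
        // so find() itself is the loop condition.
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<a href=\"#computer-science\">CS</a> <a href=\"#electrical-engineering\">EE</a>";
        System.out.println(findAll(page)); // prints [#computer-science, #electrical-engineering]
    }
}
```

Once find() returns false the loop ends, so group() is never called without a current match.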

Your solution overcomplicates things a bit. How about this?
package MitOpenCourseWareCrawler;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Parser {
private List<String> courseNames = new ArrayList<String>();
private URL url;
public Parser(String url) throws MalformedURLException {
this.url = new URL(url);
}
public static void main(String[] args) throws IOException {
Parser parser = new Parser("http://ocw.mit.edu/courses/");
parser.parse();
for (String courseName : parser.courseNames)
System.out.println(courseName);
}
public void parse() throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
Pattern pattern = Pattern.compile(".*<u>(.+)</u>.*");
Matcher matcher;
String line;
while ((line = reader.readLine()) != null) {
matcher = pattern.matcher(line);
if (matcher.matches())
courseNames.add(matcher.group(1));
}
reader.close();
}
}
Besides, I agree with Reimeus that it would probably be a better strategy to use a parsing tool or library than to try to do HTML parsing using regex patterns. But I guess as long as you know the structure of the page and know exactly what you want, a quick'n'dirty solution like yours or mine is okay.
