I'm trying to match tokens that all contain the symbol < or >, but there are some conflicts. In particular, my tokens are <, >, </, />, and a comment that starts with <!-- and ends with -->.
My regexes for these are as follows:
String LTHAN = "<";
String GTHAN = ">";
String LTHAN_SLASH = "</";
String GTHAN_SLASH = "/>";
String COMMENT = "<!--.*-->";
And I compile them by adding them to a list using the general method:
public void add(String regex, int token) {
    tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
}
Here is what my TokenInfo class looks like:
private class TokenInfo {
    public final Pattern regex;
    public final int token;

    public TokenInfo(Pattern regex, int token) {
        this.regex = regex;
        this.token = token;
    }
}
I match and display the list as follows:
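public void tokenize(String str) {
    String s = new String(str);
    while (!s.equals("")) {
        boolean match = false;
        // Patterns are tried in the order they were added; the first one that matches wins
        for (TokenInfo info : tokenInfos) {
            Matcher m = info.regex.matcher(s);
            if (m.find()) {
                match = true;
                String tok = m.group().trim();
                tokens.add(new Token(info.token, tok));
                s = m.replaceFirst("");
                break;
            }
        }
        if (!match) {
            throw new IllegalArgumentException("Unexpected character in input: " + s);
        }
    }
}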
Read and display:
try {
BufferedReader br;
String curLine;
String EOF = null;
Scanner scan = new Scanner(System.in);
StringBuilder sb = new StringBuilder();
try {
File dir = new File("C:\\Users\\Me\\Documents\\input files\\example.xml");
br = new BufferedReader(new FileReader(dir));
while ((curLine = br.readLine()) != EOF) {
// System.out.println(curLine);
} catch (IOException e) {
for (Tokenizer.Token tok : tokenizer.getTokens()) {
System.out.println("" + tok.token + " " + tok.sequence);
} catch (Exception e) {
Sample input:
<!-- Sample input file with incomplete recipe -->
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
<title>Basic bread</title>
<ingredient amount="3" unit="cups">Flour</ingredient>
<step>Mix all ingredients together.</step>
However, the outputted token list recognizes < and / (including whatever characters come after it) as separate tokens, meaning it can never seem to recognize the tokens </ and />. Same issue with the comments. Is this a problem with my regex? Why isn't it recognizing the patterns </ and />?
Hope my question is clear. Happy to provide more details/examples if necessary.
Because every pattern is wrapped as ^(...), your initial regex ^(<) matches any input that starts with <, and it consumes only that one character. If it is tried before </ or the comment pattern, it wins every time and the longer tokens are never recognized. So you will have to fix it.
If the entire tag (without the text content, such as Basic bread or Mix all ingredients together) is considered a token, then each kind of tag should be matched by a single regex.
Try changing the Regex to the following:
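For a single tag - <[^>]*> (this also matches closing tags and comments, so try it after the other two patterns)
For a single closing tag - </[^>]*>
For comments - <!--.*--> (this is already correct when the input holds one comment; with several comments the greedy .* runs to the last -->, so <!--.*?--> is safer)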
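The order in which the patterns are tried also matters. The sketch below is not the code from the question. It is a minimal illustration, assuming the five token types from the question and hypothetical names (OrderedTokenizer, SPECS), of trying longer patterns before shorter ones and anchoring each attempt at the current position with Matcher.lookingAt(). Skipping the text between tokens is an assumption about what the tokenizer should do with it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedTokenizer {
    // Hypothetical token table: insertion order is the order in which patterns are tried,
    // so the longer patterns come before the single-character ones.
    private static final Map<String, Pattern> SPECS = new LinkedHashMap<>();
    static {
        SPECS.put("COMMENT", Pattern.compile("<!--.*?-->", Pattern.DOTALL));
        SPECS.put("LTHAN_SLASH", Pattern.compile("</"));
        SPECS.put("GTHAN_SLASH", Pattern.compile("/>"));
        SPECS.put("LTHAN", Pattern.compile("<"));
        SPECS.put("GTHAN", Pattern.compile(">"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> spec : SPECS.entrySet()) {
                Matcher m = spec.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) { // match must start exactly at pos
                    tokens.add(spec.getKey() + ":" + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                pos++; // assumption: characters outside the five tokens are skipped
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<!-- note --><title>Basic bread</title><br/>"));
    }
}
```

Working on an index into the original string instead of repeatedly calling replaceFirst("") also avoids copying the remaining input after every token.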
Sample Program
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
private static ArrayList<TokenInfo> tokenInfoList = new ArrayList<>();
private static ArrayList<String> tokensList = new ArrayList<>();
public static void add(String regex, int token) {
    tokenInfoList.add(new TokenInfo(Pattern.compile(regex), token));
}

static {
    String LTHAN = "<[^>]*>";
    String LTHAN_SLASH = "</[^>]*>";
    String COMMENT = "<!--.*-->";
    // Register the most specific patterns first: <[^>]*> also matches closing tags and comments
    add(COMMENT, 5);
    add(LTHAN_SLASH, 3);
    add(LTHAN, 1);
}

private static class TokenInfo {
    public final Pattern regex;
    public final int token;

    public TokenInfo(Pattern regex, int token) {
        this.regex = regex;
        this.token = token;
    }
}
public static void tokenize(String str) {
String s = new String(str);
while (!s.equals("")) {
boolean match = false;
for (TokenInfo info : tokenInfoList) {
Matcher m = info.regex.matcher(s);
if (m.find()) {
match = true;
String tok = m.group().trim();
s = m.replaceFirst("");
// The following is under the assumption that the Text nodes within the document are not considered tokens and replaced
if (!match) {
public static void main(String[] args) {
try {
BufferedReader br;
String curLine;
String EOF = null;
StringBuilder sb = new StringBuilder();
try {
File dir = new File("recipe.xml");
br = new BufferedReader(new FileReader(dir));
while ((curLine = br.readLine()) != EOF) {
// System.out.println(curLine);
} catch (IOException e) {
for (String eachToken : tokensList) {
} catch (Exception e) {
http://www.regular-expressions.info/ is a great resource for learning regular expressions.
I'm trying to take every single word from a text file and put them into an ArrayList, but the StringTokenizer doesn't read the first line of the text file. What's wrong?
public class BufferReader {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        BufferedReader reader = new BufferedReader(new FileReader("C://Java-projects//EsameJava//prova.txt"));
        String line = reader.readLine();
        List<String> str = new ArrayList<>();
        while ((line = reader.readLine()) != null) {
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                str.add(token.nextToken());
            }
        }
    }
}
The only solution I found is to start the text file from the second line but it's not what I want...
This is how you could marry the (very) old and the new(er) to provide a collection of words:
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WordCollector {
    public static void main(String[] args) {
        try {
            List<String> words = WordCollector.getWords(Files.lines(Paths.get(args[0])));
            System.out.println(words);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

    public static List<String> getWords(Stream<String> lines) {
        List<String> result = new ArrayList<>();
        BreakIterator boundary = BreakIterator.getWordInstance();
        lines.forEach(line -> {
            // Point the iterator at the current line before scanning it
            boundary.setText(line);
            int start = boundary.first();
            for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
                // Strip punctuation and whitespace; keep only non-empty candidates
                String candidate = line.substring(start, end).replaceAll("\\p{Punct}", "").trim();
                if (candidate.length() > 0) {
                    result.add(candidate);
                }
            }
        });
        return result;
    }
}
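As for why the first line goes missing in the question's code: reader.readLine() is called once when line is declared and then again in the while condition, so the first line is read and discarded before the loop body ever runs. A minimal sketch of the same StringTokenizer loop without the extra call follows. The class and method names (FirstLineDemo, readWords) are made up for illustration, and a StringReader stands in for the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class FirstLineDemo {
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line; // declared only; the first read happens in the loop condition
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokens = new StringTokenizer(line);
                while (tokens.hasMoreTokens()) {
                    words.add(tokens.nextToken());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // The words of the first line are now part of the result
        System.out.println(readWords(new StringReader("first line\nsecond line")));
    }
}
```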
I have two arraylists
arraylist dName has values:
mark, 22
peter, 34
ken, 55
arraylist dest has values:
mark, London
peter, Bristol
mark, Cambridge
I want to merge them so that their output gives:
This is the code I have for now; I'm not really sure how to split on the comma and search the other array.
public class Sample {
BufferedReader br;
BufferedReader br2;
public Sample() {
ArrayList<String> dName = new ArrayList<String>();
ArrayList<String> dest = new ArrayList<String>();
String line = null;
String lines = null;
try {
br = new BufferedReader(new FileReader("taxi_details.txt"));
br2 = new BufferedReader(new FileReader("2017_journeys.txt"));
while ((line = br.readLine()) != null &&
(lines = br2.readLine()) != null){
String name [] = line.split(";");
String destination [] = lines.split(",");
// add values to ArrayList
// iterate through destination
for (String str : destination) {
catch (FileNotFoundException ex) {
} catch (IOException ex) {
} finally {
try {
if (br != null)
} catch (IOException ex) {
public static void main(String[] args) throws IOException {
Now, I'm not sure whether this is the proper way, but at least it is working.
taxi_details.txt:
mark, 22
peter, 34
ken, 55
2017_journeys.txt:
mark, London
peter, Bristol
mark, Cambridge
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;
import java.util.stream.Collectors;

public class FileReader {
    public List<String> read(String fileName) throws IOException{
        return Files.lines(new File(fileName).toPath()).collect(Collectors.toList());
    }
}
This class lets you avoid all the messy try-catch blocks.
public class Line{
    public static final String DELIMITER = ",";
    public static final int INDEX_NAME = 0;
    public static final int INDEX_VALUE = 1;

    private String line;
    private String[] values;

    public Line(String line) {
        this.line = line;
        this.values = line.split(DELIMITER);
    }

    public String getName(){
        return this.values[INDEX_NAME];
    }

    public String getValue(){
        return this.values[INDEX_VALUE];
    }

    public void emptyValue(){
        this.values[INDEX_VALUE] = "";
    }

    public String toString() {
        return this.line;
    }
}
This class has the mere purpose of preparing the data as needed for merging.
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) throws IOException {
        FileReader fileReader = new FileReader();

        // Read lines
        List<String> dName = fileReader.read("taxi_details.txt");
        List<String> dest = fileReader.read("2017_journeys.txt");

        // Convert into proper format
        List<Line> dNameLines = dName.stream().map(Line::new).collect(Collectors.toList());
        List<Line> destLines = dest.stream().map(Line::new).collect(Collectors.toList());

        // Remove ID
        dNameLines.forEach(Line::emptyValue);

        // Merge lists
        Map<String, String> joined = join(dNameLines, destLines);

        // Print
        for (Entry<String, String> line: joined.entrySet()) {
            System.out.println(line.getKey() + " --> " + line.getValue());
        }
    }

    public static Map<String, String> join(List<Line> a, List<Line> b){
        Map<String, String> joined = new HashMap<>();

        // Put first list into map, as there is no danger of overwriting existing values
        a.forEach(line -> {
            joined.put(line.getName(), line.getValue());
        });

        // Put second list into map, but check for existing keys
        b.forEach(line -> {
            String key = line.getName();
            if(joined.containsKey(key)){ // Actual merge
                String existingValue = joined.get(key);
                String newValue = line.getValue();
                newValue = existingValue + Line.DELIMITER + newValue;
                joined.put(key, newValue);
            }else{ // Add entry normally
                joined.put(line.getName(), line.getValue());
            }
        });

        return joined;
    }
}
You might want to move the join method into its own class.
peter --> Bristol
ken -->
mark --> London, Cambridge
You should iterate on array B.
For each string, split on the comma and search in A for a string that starts with the first part of the split.
Then append the second part of the split to the entry found in A.
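Those three steps could be sketched as follows. This is only an illustration under assumptions: list A holds "name, value" strings, list B holds "name, destination" strings, names are compared exactly after trimming, and the names JoinSketch and merge are invented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JoinSketch {
    static List<String> merge(List<String> a, List<String> b) {
        List<String> result = new ArrayList<>(a); // copy so the input list is left untouched
        for (String entry : b) {
            String[] parts = entry.split(",", 2);
            if (parts.length < 2) {
                continue; // skip lines without a comma
            }
            String name = parts[0].trim();
            String value = parts[1].trim();
            for (int i = 0; i < result.size(); i++) {
                String current = result.get(i);
                // The name is the text before the first comma
                if (current.split(",", 2)[0].trim().equals(name)) {
                    result.set(i, current + ", " + value);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dName = Arrays.asList("mark, 22", "peter, 34", "ken, 55");
        List<String> dest = Arrays.asList("mark, London", "peter, Bristol", "mark, Cambridge");
        merge(dName, dest).forEach(System.out::println);
    }
}
```

The nested loop is quadratic in the list sizes; for large files a Map keyed by name, as in the other answer, would scale better.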
I need to determine the "friendly name" of a COM port given its COM# name.
I found some answers, but they were either for C# or C++.
Is there a method in (possibly pure) java?
After a long search, I ended up writing the following class, which seems to work for me; obviously it's a Win32-only thing.
I hope it helps other people.
package registry;

import static com.sun.jna.platform.win32.WinReg.HKEY_LOCAL_MACHINE;

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.sun.jna.platform.win32.Advapi32Util;
import com.sun.jna.platform.win32.Win32Exception;

public class FriendlyName {
    private static final String ENUM = "SYSTEM\\CurrentControlSet\\Enum\\USB";
    private Map<String, String> friendlyNames;
    private static final String KEY = "HARDWARE\\DEVICEMAP\\SERIALCOMM";

    public FriendlyName() {
        friendlyNames = new HashMap<>();
        // Captures the text inside the parentheses of the friendly name
        Pattern p = Pattern.compile(".*?\\(([^)]+)\\)");
        for (String dev : Advapi32Util.registryGetKeys(HKEY_LOCAL_MACHINE, ENUM)) {
            String sb = ENUM + "\\" + dev;
            for (String itm : Advapi32Util.registryGetKeys(HKEY_LOCAL_MACHINE, sb)) {
                String si = sb + "\\" + itm;
                String fn = null;
                try {
                    fn = Advapi32Util.registryGetStringValue(HKEY_LOCAL_MACHINE, si, "FriendlyName");
                } catch (Win32Exception e) {}
                if (fn != null) {
                    Matcher m = p.matcher(fn);
                    if (m.matches()) {
                        friendlyNames.put(m.group(1), fn);
                    }
                }
            }
        }
    }

    String get(String key) {
        return friendlyNames.get(key);
    }

    public String getCOM(String name) {
        try {
            for (Entry<String, Object> sub : Advapi32Util.registryGetValues(HKEY_LOCAL_MACHINE, KEY).entrySet()) {
                String n = (String) sub.getValue();
                String fn = get(n);
                if (fn != null && fn.startsWith(name))
                    return n;
            }
        } catch (IllegalArgumentException e) {
        }
        return null;
    }

    public static void main(String[] args) {
        FriendlyName fn = new FriendlyName();
    }
}
I am trying to write a program that lets a user input the name of a movie and then prints the date associated with it. I have a text file that lists dates and the movies that belong to each date. I am reading the file with a Scanner, and I created a movie class that stores an ArrayList of movies and a String for the date. I am having trouble reading the file. Can anyone please assist me? Thank you!
Here is a part of the text file:
Der Anstandige
"Men, Women and Children"
Nas: Time is Illmatic
Bang Bang
Bitter Honey
Breakup Buddies
La chambre bleue
Drive Hard
Gone Girl
The Good Lie
A Good Marriage
The Hero of Color City
Inner Demons
Left Behind
The Supreme Price
Here is my movie class
import java.util.ArrayList;

public class movie
{
    private ArrayList<String> movies;
    private String date;

    public movie(ArrayList<String> movies, String date)
    {
        this.movies = movies;
        this.date = date;
    }

    public String getDate()
    {
        return date;
    }

    public void setDate(String date)
    {
        this.date = date;
    }

    public ArrayList<String> getMovies()
    {
        return movies;
    }
}
Here is the readFile class
package Read;
import java.util.List;
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
public class readFile
public static List<movie> movies;
public static String realPath;
public static ArrayList<String> mov;
public static String j;
public static String i;
public static void main(String[]args)
//movies = new ArrayList<movie>();
realPath = "movie_release_dates.txt";
File f = new File(realPath);
String regex1 = "[^(0-9).+]";
String regex2 = "[^0-9$]";
Scanner sc = new Scanner(f);
while (sc.hasNextLine())
i = sc.nextLine();
j = sc.nextLine();
movie movie = new movie(mov,i);
// sc.close();
catch(Exception e)
You shouldn't call sc.nextLine() for every check. Every nextLine() call reads the next line, which means you are checking one line and processing the next one.
package com.stackoverflow.q26269799;
import java.util.List;
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
public class ReadFile {
public static List<Movie> movies = new ArrayList<Movie>();
public static String realPath;
public static ArrayList<String> mov;
public static String j;
public static String i;
public static void main(String[] args) {
//movies = new ArrayList<movie>();
realPath = "movie_release_dates.txt";
File f = new File(realPath);
if ( !f.exists()) {
System.err.println("file path not specified");
try {
String regex1 = "[^(0-9).+]";
String regex2 = "[^0-9$]";
Scanner sc = new Scanner(f);
while (sc.hasNextLine()) {
// movies
String nextLine = sc.nextLine();
if (nextLine != null) {
if ( !nextLine.matches(regex2)) {
i = nextLine;
// date
while (nextLine != null && nextLine.matches(regex1)) {
if ( !nextLine.matches(regex1)) {
j = nextLine;
nextLine = sc.nextLine();
Movie movie = new Movie(mov, i);
// sc.close();
} catch(Exception e) {
throw new RuntimeException(e);
This line is needed, so uncomment it: movies = new ArrayList<movie>();
Every time you call nextLine(), the scanner moves on to the next line. So call it once per iteration, store the result, and check that against the regexes: String nextLine = sc.nextLine();
Please also check whether the file path is specified.
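To make the "read each line once" advice concrete, here is a minimal sketch. It assumes that the file consists of a date line such as 10/3/2014 (digits separated by slashes) followed by the titles that belong to that date. The date format, the class name MovieDates, and the choice of a Map from title to date are assumptions for illustration, not part of the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

class MovieDates {
    static Map<String, String> parse(Scanner sc) {
        Map<String, String> dateByTitle = new LinkedHashMap<>();
        String currentDate = null;
        while (sc.hasNextLine()) {
            String line = sc.nextLine().trim(); // the only nextLine() call per iteration
            if (line.isEmpty()) {
                continue;
            }
            if (line.matches("\\d{1,2}/\\d{1,2}/\\d{4}")) {
                currentDate = line; // assumed date format; a date line starts a new group
            } else if (currentDate != null) {
                dateByTitle.put(line, currentDate); // a title belongs to the last date seen
            }
        }
        return dateByTitle;
    }

    public static void main(String[] args) {
        String sample = "10/1/2014\nDer Anstandige\n10/3/2014\nBang Bang\nGone Girl\n";
        System.out.println(parse(new Scanner(sample)).get("Gone Girl"));
    }
}
```

Looking up a title then becomes a single Map.get call instead of a scan over a list of movie objects.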
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
public class ReadFile
Map<String, String> movies;
public static void main(String[] args) throws IOException
ReadFile readFile = new ReadFile();
readFile.movies = new TreeMap<>();
printf(readFile.queryData("Der Anstandige"));
catch(IOException e)
void importData() throws IOException, FileNotFoundException
LineNumberReader reader = null;
File file = new File("c:/movie_release_dates.txt");
reader = new LineNumberReader(new FileReader(file), 1024*64); //
String line;
String date = null, movie = null;
while((line = reader.readLine()) != null)
line = line.trim();
if(line.equals("")) continue;
date = line;
date = strf("%s/%s",
date.substring(date.length() - 4),
date.substring(0, date.length() - 5));
movie = line.trim();
movies.put(movie, date);
catch(FileNotFoundException e)
String queryData(String title)
String regex = "(?i)" + title.replaceAll("\\s", "\\s+");
String[] matches = new String[movies.size()];
int i = 0; for(Entry<String , String> movie : movies.entrySet())
String key = movie.getKey();
String val = movie.getValue();
matches[i++] = strf("{movie=%s, date=%s}", key, val);
else if(key.toUpperCase().trim()
matches[i++] = strf("{movie=%s, date=%s}", key, val);
String string = "";
if(matches[0] == null)
string = "Not found\n";
i = 0; while(matches[i] != null)
string += matches[i++] + "\n";
return string;
final String strf(String arg0, Object ... arg1)
return String.format(arg0, arg1);
final static void printf(String format, Object ... args)
System.out.printf(format, args);
final static void println(String x)
final String PATTERN_DATE = "\\d{1,2}\\/\\d{1,2}\\/\\d{4}";
Console output:
{movie=Der Anstandige, date=2014/10/1}
{movie=Bitter Honey, date=2014/10/3}
Not found
{movie=The Good Lie, date=2014/10/3}
{movie=The Hero of Color City, date=2014/10/3}
{movie=The Supreme Price, date=2014/10/3}
I am trying to fetch a webpage, load it into a StringBuilder using a BufferedReader, and then use a regex to find and retrieve words, or in this case groups of words (department names like computer-science, Electrical-Engineering, etc.), that match the pattern. I am using the Pattern and Matcher classes that Java provides, but I am running into an IllegalStateException. I have been staring at this code for quite a while and would like a fresh perspective on what the problem might be. I know it has something to do with the m.find() and m.group() methods. Any help would be greatly appreciated.
Judging from the output I am getting, it recognizes the first word that matches the regex and starts throwing IllegalStateException after that.
I have also posted my code below:
public class Parser{
static StringBuilder theWebPage;
ArrayList<String> courseNames;
//ArrayList<parserObject> courseObjects;
public static void main(String[] args)
Parser p = new Parser();
theWebPage = new StringBuilder();
try {
URL theUrl = new URL("http://ocw.mit.edu/courses/");
BufferedReader reader = new BufferedReader(new InputStreamReader(theUrl.openStream()));
String str = null;
while((str = reader.readLine())!=null)
theWebPage.append(" ").append(str);
} catch (MalformedURLException e) {
} catch (IOException e) {
public Parser()
//parserObject courseObject = new parserObject();
//courseObjects = new ArrayList<parserObject>();
courseNames = new ArrayList<String>();
//theWebPage=" ";
public void matchString()
String matchRegex = "#\\w+(-\\w+)+";
Pattern p = Pattern.compile(matchRegex);
Matcher m = p.matcher(theWebPage);
int i=0;
int x=0;
System.out.println("inside matches method " + i);
x = m.end();
System.out.println( m.group());
PrintStream out = new PrintStream(new FileOutputStream("/Users/xxxx/Desktop/output.txt"));
}catch(IllegalStateException e)
} catch (FileNotFoundException e) {
System.out.println("FileNotFound Exception");
The problem is that you call:
x = m.end();
even though you may not have a match. Why not incorporate your call to find() into your while statement, thereby making it a guard statement also:
while (m.find()) {
    x = m.end();
    System.out.println(m.group());
}
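Matcher.group() and Matcher.end() throw an IllegalStateException unless the most recent find(), matches(), or lookingAt() call returned true, which is why the loop should be driven by find(). A minimal sketch, reusing the regex from the question on a made-up input string (the class name FindLoopDemo and the sample text are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    static List<String> findAll(CharSequence text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("#\\w+(-\\w+)+").matcher(text);
        // group() is only called after find() has reported a match
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#computer-science\">x</a> <a href=\"#electrical-engineering\">y</a>";
        System.out.println(findAll(html));
    }
}
```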
Your solution overcomplicates things a bit. How about this?
package MitOpenCourseWareCrawler;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {
    private List<String> courseNames = new ArrayList<String>();
    private URL url;

    public Parser(String url) throws MalformedURLException {
        this.url = new URL(url);
    }

    public static void main(String[] args) throws IOException {
        Parser parser = new Parser("http://ocw.mit.edu/courses/");
        parser.parse();
        for (String courseName : parser.courseNames)
            System.out.println(courseName);
    }

    public void parse() throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
        // Matches a whole line that contains <u>...</u> and captures the underlined text
        Pattern pattern = Pattern.compile(".*<u>(.+)</u>.*");
        Matcher matcher;
        String line;
        while ((line = reader.readLine()) != null) {
            matcher = pattern.matcher(line);
            if (matcher.matches())
                courseNames.add(matcher.group(1));
        }
        reader.close();
    }
}
Besides, I agree with Reimeus that it would probably be a better strategy to use a parsing tool or library than to try to do HTML parsing using regex patterns. But I guess as long as you know the structure of the page and know exactly what you want, a quick'n'dirty solution like yours or mine is okay.