IllegalStateException while using Java Matcher class - java

I am trying to fetch a webpage, load it into a StringBuilder using a BufferedReader, and then use a regex to find and retrieve words, or in this case groups of words (department names like computer-science, Electrical-Engineering, etc.), that match the pattern. I am using the Pattern and Matcher classes that Java provides, but I am running into an IllegalStateException. I have been staring at this code for quite a while and would like a fresh perspective on what the problem might be. I know it has something to do with the m.find() and m.group() methods. Any help would be greatly appreciated.
Judging from the output, it recognizes the first words that match the regex and starts throwing IllegalStateException after that.
I have also posted my code below:
public class Parser{
static StringBuilder theWebPage;
ArrayList<String> courseNames;
//ArrayList<parserObject> courseObjects;
public static void main(String[] args)
{
Parser p = new Parser();
theWebPage = new StringBuilder();
try {
URL theUrl = new URL("http://ocw.mit.edu/courses/");
BufferedReader reader = new BufferedReader(new InputStreamReader(theUrl.openStream()));
String str = null;
while((str = reader.readLine())!=null)
{
theWebPage.append(" ").append(str);
//System.out.println(theWebPage);
}
//System.out.println(theWebPage);
reader.close();
} catch (MalformedURLException e) {
System.out.println("MalformedURLException");
} catch (IOException e) {
System.out.println("IOException");
}
p.matchString();
}
public Parser()
{
//parserObject courseObject = new parserObject();
//courseObjects = new ArrayList<parserObject>();
courseNames = new ArrayList<String>();
//theWebPage=" ";
}
public void matchString()
{
String matchRegex = "#\\w+(-\\w+)+";
Pattern p = Pattern.compile(matchRegex);
Matcher m = p.matcher(theWebPage);
int i=0;
int x=0;
//m.reset();
while(!(m.matches()))
{
System.out.println("inside matches method " + i);
try{
m.find();
x = m.end();
System.out.println( m.group());
PrintStream out = new PrintStream(new FileOutputStream("/Users/xxxx/Desktop/output.txt"));
System.setOut(out);
//courseNames.add(i,m.group());
i++;
}catch(IllegalStateException e)
{
System.out.println("IllegalStateException");
} catch (FileNotFoundException e) {
System.out.println("FileNotFound Exception");
}
}
}
}

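The problem is that you call:
x = m.end();
even though you may not have a match: the return value of m.find() is ignored, and calling m.end() or m.group() after a failed find() throws IllegalStateException. Also, m.matches() only succeeds if the entire input matches the pattern, so it is not a useful loop condition here. Why not incorporate your call to find() into your while statement, so that it also acts as the guard:
while (m.find()) {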
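For illustration, here is a minimal self-contained sketch of that loop. The class name FindLoopDemo, the helper findAll, and the sample input string are made up for this example; only the regex comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original Parser.
class FindLoopDemo {
    // Same regex as in the question.
    private static final Pattern ANCHOR = Pattern.compile("#\\w+(-\\w+)+");

    // Returns every substring of text that matches the pattern.
    static List<String> findAll(CharSequence text) {
        List<String> found = new ArrayList<>();
        Matcher m = ANCHOR.matcher(text);
        // find() returns false once no further match exists, so group()
        // is only ever called after a successful match.
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the downloaded page.
        String sample = "<a href=\"#computer-science\">CS</a> "
                + "<a href=\"#electrical-engineering\">EE</a>";
        for (String name : findAll(sample)) {
            System.out.println(name);
        }
    }
}
```

Since the loop ends as soon as find() returns false, no try/catch for IllegalStateException should be needed.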

Your solution overcomplicates things a bit. How about this?
package MitOpenCourseWareCrawler;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Parser {
private List<String> courseNames = new ArrayList<String>();
private URL url;
public Parser(String url) throws MalformedURLException {
this.url = new URL(url);
}
public static void main(String[] args) throws IOException {
Parser parser = new Parser("http://ocw.mit.edu/courses/");
parser.parse();
for (String courseName : parser.courseNames)
System.out.println(courseName);
}
public void parse() throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
Pattern pattern = Pattern.compile(".*<u>(.+)</u>.*");
Matcher matcher;
String line;
while ((line = reader.readLine()) != null) {
matcher = pattern.matcher(line);
if (matcher.matches())
courseNames.add(matcher.group(1));
}
reader.close();
}
}
Besides, I agree with Reimeus that it would probably be a better strategy to use a parsing tool or library than to try to do HTML parsing with regex patterns. But I guess as long as you know the structure of the page and know exactly what you want, a quick'n'dirty solution like yours or mine is okay.

Related

How to convert a website to a .txt file to search it for a word?

How can I convert a website to a .txt file so that I can search it for a word (e.g. "Абрамов Николай Викторович")? My code only reads the HTML. In other words, I want to re-check the website every second. If my word appears (added by the author of the website), my code should print "Yes".
And how can I make the application check for any other word?
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class web {
public static void main(String[] args) {
for (;;) {
try {
// Create a URL for the desired page
URL url = new URL("http://abit.itmo.ru/page/195");
// Read all the text returned by the server
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String str = null;
while (in.readLine() != null) {
str = in.readLine().toString();
System.out.println(str);
// str is one line of text; readLine() strips the newline character(s)
}
in.close();
Pattern p = Pattern.compile("Абрамов Николай Викторович");
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println("Yes");
System.exit(0);
}
} catch (IOException ignored) {
}
}
}
}
You don't need to convert it to TXT.
If you just want to search for the word, you can check for it directly. But be careful: if the polling period is too short, it can look like a denial-of-service (DoS) attack and you may get blocked.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
public class Main {
public static String wordToFind = "30 Day";
public static String siteURL = "https://stackoverflow.com/";
public static void checkSite()
{
try {
URL google = new URL(siteURL);
BufferedReader in = new BufferedReader(new InputStreamReader(google.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) { // Process each line.
if (inputLine.contains( wordToFind)) // System.out.println(inputLine);
{
System.out.println( "Yes" );
return;
}
}
in.close();
} catch (MalformedURLException me) {
System.out.println(me);
} catch (IOException ioe) {
System.out.println(ioe);
}
}
public static void main(String[] args) {
Integer initalDelay = 0;
Integer period = 10; //number of seconds to repeat
ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
exec.scheduleAtFixedRate(new Runnable() {
@Override
public void run() {
checkSite();
// do stuff
}
}, initalDelay, period, TimeUnit.SECONDS);
}
}

My HTML fetcher program in java returns incomplete results

My java code is:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class celebGrepper {
static class CelebData {
URL link;
String name;
CelebData(URL link, String name) {
this.link=link;
this.name=name;
}
}
public static String grepper(String url) {
URL source;
String data = null;
try {
source = new URL(url);
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.connect();
InputStream is = connection.getInputStream();
/**
* Attempting to fetch an entire line at a time instead of just a character each time!
*/
StringBuilder str = new StringBuilder();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
while((data = br.readLine()) != null)
str.append(data);
data=str.toString();
} catch (IOException e) {
e.printStackTrace();
}
return data;
}
public static ArrayList<CelebData> parser(String html) throws MalformedURLException {
ArrayList<CelebData> list = new ArrayList<CelebData>();
Pattern p = Pattern.compile("<td class=\"image\".*<img src=\"(.*?)\"[\\s\\S]*<td class=\"name\"><a.*?>([\\w\\s]+)<\\/a>");
Matcher m = p.matcher(html);
while(m.find()) {
CelebData current = new CelebData(new URL(m.group(1)),m.group(2));
list.add(current);
}
return list;
}
public static void main(String... args) throws MalformedURLException {
String html = grepper("https://www.forbes.com/celebrities/list/");
System.out.println("RAW Input: "+html);
System.out.println("Start Grepping...");
ArrayList<CelebData> celebList = parser(html);
for(CelebData item: celebList) {
System.out.println("Name:\t\t "+item.name);
System.out.println("Image URL:\t "+item.link+"\n");
}
System.out.println("Grepping Done!");
}
}
It's supposed to fetch the entire HTML content of https://www.forbes.com/celebrities/list/. However, when I compare the actual result below to the original page, I find the entire table that I need is missing! Is it because the page isn't completely loaded when I start getting the bytes from the page via the input stream? Please help me understand.
The Output of the page:
https://jsfiddle.net/e0771aLz/
What can I do to just extract the Image link and the names of the celebs?
I know it's an extremely bad practice to try to parse HTML using regex and is the stuff of nightmares, but on a certain video training course for android, that's exactly what the guy did, and I just wanna follow along since it's just in this one lesson.

Regex Matching Conflict with Overlapping Symbol

I'm trying to match tokens that all contain the symbol < or >, but there are some conflicts. In particular, my tokens are <, >, </, />, and a comment that starts with <!-- and ends with -->.
My regexes for these are as follows:
String LTHAN = "<";
String GTHAN = ">";
String LTHAN_SLASH = "</";
String GTHAN_SLASH = "/>";
String COMMENT = "<!--.*-->";
And I compile them by adding them to a list using the general method:
public void add(String regex, int token) {
tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
}
Here is what my TokenInfo class looks like:
private class TokenInfo {
public final Pattern regex;
public final int token;
public TokenInfo(Pattern regex, int token) {
super();
this.regex = regex;
this.token = token;
}
}
I match and display the list as follows:
public void tokenize(String str) {
String s = new String(str);
tokens.clear();
while (!s.equals("")) {
boolean match = false;
for (TokenInfo info : tokenInfos) {
Matcher m = info.regex.matcher(s);
if (m.find()) {
match = true;
String tok = m.group().trim();
tokens.add(new Token(info.token, tok));
s = m.replaceFirst("");
break;
}
}
}
}
Read and display:
try {
BufferedReader br;
String curLine;
String EOF = null;
Scanner scan = new Scanner(System.in);
StringBuilder sb = new StringBuilder();
try {
File dir = new File("C:\\Users\\Me\\Documents\\input files\\example.xml");
br = new BufferedReader(new FileReader(dir));
while ((curLine = br.readLine()) != EOF) {
sb.append(curLine);
// System.out.println(curLine);
}
br.close();
} catch (IOException e) {
System.out.println(e.getMessage());
}
tokenizer.tokenize(sb.toString());
for (Tokenizer.Token tok : tokenizer.getTokens()) {
System.out.println("" + tok.token + " " + tok.sequence);
}
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
Sample input:
<!-- Sample input file with incomplete recipe -->
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
<title>Basic bread</title>
<ingredient amount="3" unit="cups">Flour</ingredient>
<instructions>
<step>Mix all ingredients together.</step>
</instructions>
</recipe>
However, the outputted token list recognizes < and / (including whatever characters come after it) as separate tokens, meaning it can never seem to recognize the tokens </ and />. Same issue with the comments. Is this a problem with my regex? Why isn't it recognizing the patterns </ and />?
Hope my question is clear. Happy to provide more details/examples if necessary.
Problems:
Your initial regex ^(<) matches any input that merely starts with <. If it is tried before the longer patterns, it consumes the < on its own, so </ and <!-- are never recognized. So you will have to fix it.
If the entire tag (without the text content, like Basic bread or Mix all ingredients together) is considered a token, then the corresponding regex should be a single regex.
Solution
Try changing the Regex to the following:
For a single tag - <[^>]*>
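For a single closing tag - </[^>]*>
For comments - <!--.*--> (this already works for a single comment; since .* is greedy, <!--.*?--> is safer if several comments can end up in the same string)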
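As an alternative way to deal with the conflict, the patterns can be tried in a fixed order, most specific first, at the current position. The sketch below assumes made-up token names (COMMENT, CLOSE_TAG, OPEN_TAG, TEXT) and a hypothetical class OrderedTokenizerDemo. It uses Matcher.region() and lookingAt() instead of replaceFirst(), which is a swapped-in technique and not what the question's code does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; token names and rule set are illustrative assumptions.
class OrderedTokenizerDemo {
    // LinkedHashMap keeps insertion order, so more specific rules are tried first.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("COMMENT", Pattern.compile("<!--.*?-->", Pattern.DOTALL));
        RULES.put("CLOSE_TAG", Pattern.compile("</[^>]*>"));
        RULES.put("OPEN_TAG", Pattern.compile("<[^>]*>"));
        RULES.put("TEXT", Pattern.compile("[^<]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() only succeeds if the match starts exactly at pos.
                if (m.lookingAt()) {
                    tokens.add(rule.getKey() + ":" + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Fail loudly instead of looping forever on unmatched input.
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("<!-- note --><title>Basic bread</title>")) {
            System.out.println(token);
        }
    }
}
```

Because every rule either consumes at least one character or throws, the loop cannot spin on input that no rule matches.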
Sample Program
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
private static ArrayList<TokenInfo> tokenInfoList = new ArrayList<>();
private static ArrayList<String> tokensList = new ArrayList<>();
public static void add(String regex, int token) {
tokenInfoList.add(new TokenInfo(Pattern.compile(regex), token));
}
static {
String LTHAN = "<[^>]*>";
String LTHAN_SLASH = "</[^>]*>";
String COMMENT = "<!--.*-->";
add(LTHAN, 1);
add(LTHAN_SLASH, 3);
add(COMMENT, 5);
}
private static class TokenInfo {
public final Pattern regex;
public final int token;
public TokenInfo(Pattern regex, int token) {
super();
this.regex = regex;
this.token = token;
}
}
public static void tokenize(String str) {
String s = new String(str);
while (!s.equals("")) {
boolean match = false;
for (TokenInfo info : tokenInfoList) {
Matcher m = info.regex.matcher(s);
if (m.find()) {
match = true;
String tok = m.group().trim();
tokensList.add(tok);
s = m.replaceFirst("");
break;
}
}
// The following is under the assumption that the Text nodes within the document are not considered tokens and replaced
if (!match) {
break;
}
}
}
public static void main(String[] args) {
try {
BufferedReader br;
String curLine;
String EOF = null;
StringBuilder sb = new StringBuilder();
try {
File dir = new File("recipe.xml");
br = new BufferedReader(new FileReader(dir));
while ((curLine = br.readLine()) != EOF) {
sb.append(curLine);
// System.out.println(curLine);
}
br.close();
} catch (IOException e) {
System.out.println(e.getMessage());
}
tokenize(sb.toString());
for (String eachToken : tokensList) {
System.out.println(eachToken);
}
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
References
http://www.regular-expressions.info/ is a great resource for learning regular expressions.

How to find simple word in Java file?

I need help. I'm a beginner programmer and I'm trying to write a program that uses a regular expression.
I'm trying to find every occurrence of the word life in my file. I have code like this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class myClass {
public int howManyWord () {
int count = 0;
try {
BufferedReader br = new BufferedReader(new FileReader("C:/myFile.txt"));
String line = "";
while ((line = br.readLine()) != null) {
Matcher m = Pattern.compile("life").matcher(line);
while (m.find()) {
System.out.println("found");
count++;
}
}
} catch (IOException e) {
e.printStackTrace();
}
return count;
}
}
That works. But I want to change it, because when the program finds something like "lifelife" while searching for my word, the count is 2.
What should I change?
Sorry for my English, but please help me.
Use Pattern p = Pattern.compile("\\blife\\b"); and set the pattern once before the while loop.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class myClass {
public int howManyWord () {
int count = 0;
try {
BufferedReader br = new BufferedReader(new FileReader("C:/myFile.txt"));
String line = "";
Pattern p = Pattern.compile("\\blife\\b"); // compile pattern only once
while ((line = br.readLine()) != null) {
Matcher m = p.matcher(line);
while (m.find()) {
System.out.println("found");
count++;
}
}
} catch (IOException e) {
e.printStackTrace();
}
return count;
}
}
"(?<=^|\\W)life(?=$|\\W)" will find words "life" but not "lifelife" or "xlife".
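To compare the patterns side by side, here is a small sketch. The class WordCountDemo, the helper count, and the sample sentence are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the three patterns on one sample string.
class WordCountDemo {
    // Counts how many times regex is found in text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "life lifelife xlife life.";
        // Plain substring search: also counts "life" inside longer words.
        System.out.println(count("life", text));                     // prints 5
        // Word boundaries: only standalone "life".
        System.out.println(count("\\blife\\b", text));               // prints 2
        // Lookaround variant from the second answer.
        System.out.println(count("(?<=^|\\W)life(?=$|\\W)", text));  // prints 2
    }
}
```

On this sample the word-boundary and lookaround versions agree; both skip "lifelife" and "xlife".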

Java - Merging two sets of code

I've written two separate pieces of code and now I want to merge them. One part opens a text file and displays its contents; the second validates manually entered postcodes. I want to read a text file and then automatically validate the postcodes within it. I'm not sure how to merge them. If you have any questions, please ask, as I'm stuck.
package postcodesort;
import java.util.*;
import java.util.Random;
import java.util.Queue;
import java.util.TreeSet;
import java.io.File;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedList;
import java.util.StringTokenizer;
public class PostCodeSort
{
Queue<String> postcodeStack = new LinkedList<String>();
public static void main(String[] args) throws IOException
{
FileReader fileReader = null;
// Create the FileReader object
try {
fileReader = new FileReader("postcodes1.txt");
BufferedReader br = new BufferedReader(fileReader);
String str;
while((str = br.readLine()) != null)
{
System.out.println(str + "");
}
}
catch (IOException ex)
{
// handle exception;
}
finally
{
fileReader.close();
}
// Close the input
}
}
Second part that manually validates postcodes:
List<String> zips = new ArrayList<String>();
//Valid ZIP codes
zips.add("SW1W 0NY");
zips.add("PO16 7GZ");
zips.add("GU16 7HF");
zips.add("L1 8JQ");
//Invalid ZIP codes
zips.add("Z1A 0B1");
zips.add("A1A 0B11");
String regex = "^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$";
Pattern pattern = Pattern.compile(regex);
for (String zip : zips)
{
Matcher matcher = pattern.matcher(zip);
System.out.println(matcher.matches());
}
You should create a class called something like ZipCodeValidator that contains the functionality of your second snippet. It will look something like this
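import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZipCodeValidator {
    private static final String regex = "^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$";
    private static final Pattern pattern = Pattern.compile(regex);

    public boolean isValid(String zipCode) {
        // matches() succeeds only if the entire string matches the pattern
        Matcher matcher = pattern.matcher(zipCode);
        return matcher.matches();
    }
}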
Then you can create an instance of this class
ZipCodeValidator zipCodeValidator = new ZipCodeValidator();
and then use it in your main method
boolean valid = zipCodeValidator.isValid(zipCode);
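To check the validator against the sample postcodes from the question, a small sketch like the following might help. PostcodeCheckDemo is a made-up name, and isValid is written as a static method here for brevity, which is an assumption rather than part of the answer above. Because Matcher.matches() already requires the whole string to match, the ^ and $ anchors are kept only for fidelity with the original regex.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex and sample values come from the question.
class PostcodeCheckDemo {
    private static final Pattern POSTCODE = Pattern.compile(
            "^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$");

    // Static variant of isValid, chosen here only to keep the sketch short.
    static boolean isValid(String postcode) {
        return POSTCODE.matcher(postcode).matches();
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "SW1W 0NY", "PO16 7GZ", "GU16 7HF", "L1 8JQ", // expected valid
                "Z1A 0B1", "A1A 0B11");                        // expected invalid
        for (String postcode : samples) {
            System.out.println(postcode + " -> " + isValid(postcode));
        }
    }
}
```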
Merging your question and the answer by @hiflyer, I posted this answer. It assumes that the file postcodes1.txt has each zip code on a separate line.
package postcodesort;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostCodeSort
{
    Queue<String> postcodeStack = new LinkedList<String>();

    public static void main(String[] args) throws IOException
    {
        ZipCodeValidator zipCodeValidator = new ZipCodeValidator();
        // try-with-resources closes the reader even if opening or reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("postcodes1.txt")))
        {
            String str;
            while ((str = br.readLine()) != null)
            {
                if (zipCodeValidator.isValid(str)) {
                    System.out.println(str + " is valid");
                }
                else {
                    System.out.println(str + " is not valid");
                }
            }
        }
        catch (IOException ex)
        {
            System.err.println("Could not read postcodes1.txt: " + ex.getMessage());
        }
    }
}

// Not public, so it can live in the same source file as PostCodeSort
class ZipCodeValidator {
    private static final String regex = "^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$";
    private static final Pattern pattern = Pattern.compile(regex);

    public boolean isValid(String zipCode) {
        // matches() succeeds only if the entire line matches the pattern
        Matcher matcher = pattern.matcher(zipCode);
        return matcher.matches();
    }
}
