Regex in scanner only finds first match - java

I have a text file containing a number of articles which I need to parse through.
I need to retrieve every single word in each article, excluding any full stops, commas, etc. The articles are separated by a specific two lines, and I'm trying to use a regex pattern to find these points.
An example of the document is as follows:
.I 1
.W
this is article one.
.I 2
.W
this is article two.
.I 3
.W
this is article three.
The code below seems to find the first occurrence .I 1 and add all subsequent words, but once it gets to the next separator it adds it as a word instead of skipping it.
Scanner scanner = new Scanner(document);
scanner.useDelimiter("[^\\w']+");
String separator;
while (scanner.hasNext()){
separator = scanner.findInLine(Pattern.compile(".I \\d"));
if (separator!= null) {
System.out.println("Found: " + separator);
scanner.nextLine();
scanner.nextLine();
}
list.add(scanner.next());
}
scanner.close();
If possible I'd also like to be able grab the actual article number, which is the number attached to each separator.
What's wrong in my code?

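The problem is that you tell the Scanner to treat everything except word characters and apostrophes as a delimiter. The dot in front of I is therefore swallowed as delimiter text by scanner.next() each time, so the separator is no longer intact when your findInLine search gets to it, and the I that follows is returned as an ordinary word.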
You can fix this by reading input by line instead of reading it by word, like this:
list.add(scanner.nextLine());
To get the article number, parse the separator starting at index 3 (just past the ".I " prefix):
int num = Integer.parseInt(separator.substring(3));
Here is a demo that reads from standard input:
Scanner scanner = new Scanner(System.in);
scanner.useDelimiter("[^\\w']+");
String separator;
Pattern rx = Pattern.compile("\\.I \\d+"); // escaped dot; \\d+ also matches multi-digit article numbers
while (scanner.hasNext()){
separator = scanner.findInLine(rx);
if (separator != null) {
int num = Integer.parseInt(separator.substring(3));
System.out.println("Found: " + separator + ", article number: " + num);
scanner.nextLine();
scanner.nextLine();
}
System.out.println(scanner.nextLine());
}
scanner.close();
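The line-by-line approach can also collect the words of each article under its number, which is what the question ultimately asks for. The following is only a sketch under stated assumptions: the class name ArticleParser and the helper parse are hypothetical, the separator is assumed to be a line of the form ".I <number>" followed by a ".W" line, and a word is taken to be a run of word characters or apostrophes (the same set the question's delimiter keeps).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArticleParser {
    // Separator line such as ".I 12"; group 1 captures the article number.
    private static final Pattern SEPARATOR = Pattern.compile("^\\.I (\\d+)$");
    // A word is a run of word characters or apostrophes (assumption taken from the question).
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Hypothetical helper: maps each article number to the words of that article.
    static Map<Integer, List<String>> parse(String document) {
        Map<Integer, List<String>> articles = new LinkedHashMap<>();
        List<String> current = null;
        try (Scanner scanner = new Scanner(document)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().trim();
                Matcher sep = SEPARATOR.matcher(line);
                if (sep.matches()) {
                    // Start a new article keyed by the captured number.
                    current = new ArrayList<>();
                    articles.put(Integer.parseInt(sep.group(1)), current);
                } else if (line.equals(".W") || current == null) {
                    // Skip the ".W" marker and any text before the first separator.
                    continue;
                } else {
                    Matcher word = WORD.matcher(line);
                    while (word.find()) {
                        current.add(word.group());
                    }
                }
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        String document = ".I 1\n.W\nthis is article one.\n.I 2\n.W\nthis is article two.\n";
        System.out.println(parse(document));
    }
}
```

Capturing the number with a regex group avoids the fixed substring(3) offset, and reading whole lines means the dot of the separator is never consumed as a delimiter.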

Related

How to split a string with space being the delimiter using Scanner

I am trying to split the input sentence based on space between the words. It is not working as expected.
public static void main(String[] args) {
Scanner scaninput=new Scanner(System.in);
String inputSentence = scaninput.next();
String[] result=inputSentence.split("-");
// for(String iter:result) {
// System.out.println("iter:"+iter);
// }
System.out.println("result.length: "+result.length);
for (int count=0;count<result.length;count++) {
System.out.println("==");
System.out.println(result[count]);
}
}
It gives the output below when I use "-" in split:
fsfdsfsd-second-third
result.length: 3
==
fsfdsfsd
==
second
==
third
When I replace "-" with space " ", it gives the below output.
first second third
result.length: 1
==
first
Any suggestions as to what is the problem here? I have already referred to the stackoverflow post How to split a String by space, but it does not work.
Using split("\\s+") gives this output:
first second third
result.length: 1
==
first
Change
scaninput.next()
to
scaninput.nextLine()
From the Javadoc:
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace.
Calling next() returns the next word.
Calling nextLine() returns the next line.
The next() method of Scanner already splits the input on whitespace; that is, it returns the next token, the text up to the next whitespace. So, if you add an appropriate println, you will see that inputSentence is equal to the first word, not the entire line.
Replace scaninput.next() with scaninput.nextLine().
The problem is that scaninput.next() will only read until the first whitespace character, so it's only pulling in the word first. So the split afterward accomplishes nothing.
Instead of using Scanner, I suggest using java.io.BufferedReader, which will let you read an entire line at once.
Another alternative is the BufferedReader class, which works well. Note that readLine() throws IOException, so the enclosing method must declare or handle it.
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String inputSentence = br.readLine();
String[] result = inputSentence.split("\\s+");
System.out.println("result.length: " + result.length);
for (int count = 0; count < result.length; count++) {
    System.out.println("==");
    System.out.println(result[count]);
}
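To make the difference concrete, here is a small sketch that feeds the same input to next() and to nextLine(). It reads from a String rather than System.in so it can be run as is; the class and method names are hypothetical.

```java
import java.util.Scanner;

public class NextVersusNextLine {
    // Reads a single token with next(), as the question's code does.
    static String firstToken(String input) {
        try (Scanner sc = new Scanner(input)) {
            return sc.next();
        }
    }

    // Reads the whole line with nextLine(), then splits it on runs of whitespace.
    static String[] wordsOfFirstLine(String input) {
        try (Scanner sc = new Scanner(input)) {
            return sc.nextLine().trim().split("\\s+");
        }
    }

    public static void main(String[] args) {
        String input = "first second third\n";
        System.out.println(firstToken(input));              // prints "first"
        System.out.println(wordsOfFirstLine(input).length); // prints 3
    }
}
```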

Delimiter is not working for scanner

The user will enter a=(number here). I then want it to cut off the a= and retain the number. It works when I use s.next() but of course it makes me enter it two times which I don't want. With s.nextLine() I enter it once and the delimiter does not work. Why is this?
Scanner s = new Scanner(System.in);
s.useDelimiter("a=");
String n = s.nextLine();
System.out.println(n);
Because nextLine() doesn't care about delimiters. The delimiters only affect Scanner when you tell it to return tokens. nextLine() just returns whatever is left on the current line without caring about tokens.
A delimiter is not the way to go here; the purpose of delimiters is to tell the Scanner what can come between tokens, but you're trying to use it for a purpose it wasn't intended for. Instead:
String n = s.nextLine().replaceFirst("^a=","");
This inputs a line, then strips off a= if it appears at the beginning of the string (i.e. it replaces it with the empty string ""). replaceFirst takes a regular expression, and ^ means that it only matches if the a= is at the beginning of the string. This won't check to make sure the user actually entered a=; if you want to check this, your code will need to be a bit more complex, but the key thing here is that you want to use s.nextLine() to return a String, and then do whatever checking and manipulation you need on that String.
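If you do want to verify that the user really typed a= followed by a number, one possible sketch is to match the whole line against a pattern and capture the number. The class name AssignmentInput and the helper parseAssignment are hypothetical, and the accepted format (an optionally signed integer, optional surrounding spaces) is an assumption about what counts as valid input.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AssignmentInput {
    // "a=" followed by an optionally signed integer; surrounding spaces are tolerated (assumption).
    private static final Pattern ASSIGNMENT = Pattern.compile("\\s*a=(-?\\d+)\\s*");

    // Hypothetical helper: returns the number, or empty if the line is not of the form a=<int>.
    static OptionalInt parseAssignment(String line) {
        Matcher m = ASSIGNMENT.matcher(line);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // The digits do not fit in an int.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAssignment("a=42"));  // a present value of 42
        System.out.println(parseAssignment("b=42"));  // empty: wrong prefix
    }
}
```

In a real program the argument would be the result of s.nextLine(), and an empty result is the cue to prompt the user again.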
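Try StringTokenizer if Scanner#useDelimiter() is not suitable for your case. Note that StringTokenizer treats each character of its second argument as a separate delimiter, so "a=" here means "split on a or on =", not on the two-character sequence.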
Scanner s = new Scanner(System.in);
String n = s.nextLine();
StringTokenizer tokenizer = new StringTokenizer(n, "a=");
while (tokenizer.hasMoreTokens()) {
System.out.println(tokenizer.nextToken());
}
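Or try the String#split() method, which splits on the whole sequence a=. When the input starts with a=, the first element of the resulting array is an empty string, so this loop prints an empty line first: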
for (String str : n.split("a=")) {
System.out.println(str);
}
input:
a=123a=546a=78a=9
output:
123
546
78
9

Problems with counting spaces - Java

I'm having problems with the code below. It asks for the user to type in a sentence basically.
System.out.println("Enter a string containing spaces: ");
inputString = keyboard.next();
int lengthString = inputString.length();
System.out.println("You entered: " + inputString + "\n" + "The string length is: " + lengthString);
The problem is when it prints the statement, it only prints the first word, then counts the characters contained within the first word. I'm really new to Java, so I was wondering what should I do to make the program count the WHOLE string.
You should use nextLine() method instead of next():
inputString = keyboard.nextLine();
Scanner#next() reads the next token. Tokens are separated by a delimiter, and Scanner's default delimiter is whitespace (any character for which Character.isWhitespace returns true).
keyboard.nextLine() will pull in the entire line, whereas next() only gets the next word. By default next() uses whitespace to decide where a token ends, but you can also set a custom delimiter if you want it to pull in different tokens.
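As an illustration of the custom delimiter mentioned above, the sketch below tokenizes the same text twice: once with the default whitespace delimiter and once with a comma followed by optional whitespace. The class name DelimiterDemo, the helper tokens, and the sample text are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Collects every token; a null delimiterRegex keeps Scanner's default (whitespace).
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            if (delimiterRegex != null) {
                sc.useDelimiter(delimiterRegex);
            }
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "red apple, green pear";
        System.out.println(tokens(text, null).size());     // prints 4: split on whitespace
        System.out.println(tokens(text, ",\\s*").size());  // prints 2: "red apple" and "green pear"
    }
}
```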

Scanner class skips over whitespace

I am using a nested Scanner loop to extract the digit from a string line (from a text file) as follows:
String str = testString;
Scanner scanner = new Scanner(str);
while (scanner.hasNext()) {
String token = scanner.next();
// Here each token will be used
}
The problem is this code will skip all the spaces " ", but I also need to use those "spaces" too. So can Scanner return the spaces or I need to use something else?
My text file could contain something like this:
0
011
abc
d2d
sdwq
sda
Those blank lines contain 1 " " each, and those " " are what I need returned.
Use Scanner's hasNextLine() and nextLine() methods and you'll find your solution since this will allow you to capture empty or white-space lines.
By default, a scanner uses white space to separate tokens.
Use the Scanner#nextLine method. From its documentation: "Advances this scanner past the current line and returns the input that was skipped. This method returns the rest of the current line, excluding any line separator at the end. The position is set to the beginning of the next line."
To use a different token separator, invoke useDelimiter(), specifying a regular expression. For example, suppose you wanted the token separator to be a comma, optionally followed by white space. You would invoke:
scanner.useDelimiter(",\\s*");
Read more from http://docs.oracle.com/javase/tutorial/essential/io/scanning.html
You have to understand what is a token. Read the documentation of Scanner:
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace.
You could use the nextLine() method to get the whole line without dropping any whitespace.
Alternatively, you could define what counts as a token with the useDelimiter method.
This will work for you
Scanner scanner = new Scanner(new File("D:\\sample.txt"));
while (scanner.hasNextLine()) {
String token = scanner.nextLine();
System.out.println(token);
}
For a more functional approach (Java 9 or later, where Scanner.tokens() is available), you could use something like this:
String fileContent = new Scanner(new File("D:\\sample.txt"))
.useDelimiter("")
.tokens()
.reduce("", String::concat);
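To see that hasNextLine()/nextLine() really do hand back the whitespace-only lines, here is a small sketch that scans an in-memory string instead of a file. The class name KeepBlankLines, the helper lines, and the sample text are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class KeepBlankLines {
    // Returns every line of the text unchanged, including lines that hold only a space.
    static List<String> lines(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNextLine()) {
                out.add(sc.nextLine());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "0\n \n011\n \nabc\n";
        for (String line : lines(text)) {
            // Brackets make the single-space lines visible in the output.
            System.out.println("[" + line + "]");
        }
    }
}
```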

Scanner through a line with whitespace and comma

I am new to Java and looking for some help with Java's Scanner class. Below is the problem.
I have a text file with multiple lines, each containing multiple pairs of digits, where each pair is written as ( digit,digit ). For example 3,3 6,4 7,9. The pairs are separated from each other by whitespace. Below is an example from the text file.
1 2,3 3,2 4,5
2 1,3 4,2 6,13
3 1,2 4,2 5,5
What I want is to retrieve each digit separately, so that I can create an array of linked lists out of them. Below is what I have achieved so far.
Scanner sc = new Scanner(new File("a.txt"));
Scanner lineSc;
String line;
Integer vertix = 0;
Integer length = 0;
sc.useDelimiter("\\n"); // For line feeds
while (sc.hasNextLine()) {
line = sc.nextLine();
lineSc = new Scanner(line);
lineSc.useDelimiter("\\s"); // For Whitespace
// What should i do here. How should i scan through considering the whitespace and comma
}
Thanks
Consider using a regular expression; data that doesn't conform to your expectation is then easy to identify and deal with.
CharSequence inputStr = "2 1,3 4,2 6,13";
// One pair: digits, a comma, digits
String patternStr = "(\\d+),(\\d+)";
// Compile and use regular expression
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    // Group 0 is the whole match; groups 1 and 2 are the two numbers of the pair
    for (int i = 0; i <= matcher.groupCount(); i++) {
        String groupStr = matcher.group(i);
    }
}
Group one and group two will correspond to the first and second number in each pairing, respectively. The leading number on each line is not part of a pair and has to be read separately.
1. Use the nextLine() method of Scanner to get each entire line of text from the file.
2. Then use the BreakIterator class with its static method getCharacterInstance() to get the individual characters; it will automatically handle commas, spaces, etc.
3. BreakIterator also gives you many flexible methods to separate out sentences, words, etc.
For more details see this:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
Use the StringTokenizer class. http://docs.oracle.com/javase/1.4.2/docs/api/java/util/StringTokenizer.html
//this is in the while loop
//read each line
String line=sc.nextLine();
//create StringTokenizer, parsing with space and comma
StringTokenizer st1 = new StringTokenizer(line," ,");
Then each digit is read as a string when you call nextToken() like this, if you wanted all digits in the line
while(st1.hasMoreTokens())
{
String temp=st1.nextToken();
//now if you want it as an integer
int digit=Integer.parseInt(temp);
//now you have the digit! insert it into the linkedlist or wherever you want
}
Hope this helps!
Use split(regex), which is simpler:
while (sc.hasNextLine()) {
    final String[] tokens = sc.nextLine().split(" |,");
    for (String token : tokens) {
        // split returns strings, so convert each one
        int num = Integer.parseInt(token);
        // Do your job
    }
}
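Staying with Scanner, as the question does, another option is to give the per-line scanner a delimiter that covers both whitespace and commas and then read ints directly. This is only a sketch: the class name AdjacencyLineParser and the helper numbers are hypothetical, and reading each line as "vertex, then (neighbour,length) pairs" is an assumption based on the variable names in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class AdjacencyLineParser {
    // Returns all integers on a line, treating runs of whitespace and commas as separators.
    static List<Integer> numbers(String line) {
        List<Integer> out = new ArrayList<>();
        try (Scanner sc = new Scanner(line)) {
            sc.useDelimiter("[\\s,]+");
            while (sc.hasNextInt()) {
                out.add(sc.nextInt());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> n = numbers("2 1,3 4,2 6,13");
        int vertex = n.get(0); // assumption: the first number on a line is the vertex
        // Assumption: the remaining numbers come in (neighbour, length) pairs.
        for (int i = 1; i + 1 < n.size(); i += 2) {
            System.out.println(vertex + " -> " + n.get(i) + " (length " + n.get(i + 1) + ")");
        }
    }
}
```

Because hasNextInt() stops at the first token that is not an integer, malformed lines simply yield fewer numbers; stricter validation would need an explicit check.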
