Java Scanner class fails in tokenization when the 1024th character is the delimiter - java

I've found a strange behaviour of java.util.Scanner class.
I need to split a String variable into a set of tokens separated by ";".
If I consider a string of 1022 "a" characters followed by n ";" characters, I expect n tokens.
However, if n = 3 the Scanner class fails: it "sees" just 2 tokens instead of 3. I think it is related to the internal char buffer size of the Scanner class.
a[x1022]; -> 1 token: correct
a[x1022];; -> 2 tokens: correct
a[x1022];;; -> 2 tokens: wrong (I expect 3 tokens)
a[x1022];;;; -> 4 tokens: correct
I attach a simple example:
import java.util.Scanner;

public class ScannerTokenTest {
    public static void main(String[] args) {
        // generate test string: (1022x "a") + (3x ";")
        String testLine = "";
        for (int i = 0; i < 1022; i++) {
            testLine = testLine + "a";
        }
        testLine = testLine + ";;;";
        // set up the Scanner variable
        String delimiter = ";";
        Scanner lineScanner = new Scanner(testLine);
        lineScanner.useDelimiter(delimiter);
        int p = 0;
        // tokenization
        while (lineScanner.hasNext()) {
            p++;
            String currentToken = lineScanner.next();
            System.out.println("token" + p + ": '" + currentToken + "'");
        }
        lineScanner.close();
    }
}
I would like to work around this incorrect behaviour. Could you help me?
Thanks

My recommendation is to report the bug to Oracle, and then work around it by using a BufferedReader to read your InputStream (you'll also need the InputStreamReader class). What Scanner does isn't magic, and working directly with BufferedReader in this case only requires slightly more code than you were already using.
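For illustration, here is a minimal sketch of that workaround, assuming the input arrives as an InputStream (the ByteArrayInputStream and the UTF-8 charset below are only placeholders for whatever stream you actually have). It reads one character at a time through a BufferedReader and closes a token at every ';':

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SemicolonTokenizer {
    public static void main(String[] args) throws IOException {
        // build the same test input: 1022 'a' characters followed by ";;;"
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 1022; i++) {
            input.append('a');
        }
        input.append(";;;");
        InputStream in = new ByteArrayInputStream(input.toString().getBytes(StandardCharsets.UTF_8));

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            StringBuilder token = new StringBuilder();
            int p = 0;
            int c;
            while ((c = reader.read()) != -1) {
                if (c == ';') {
                    // every ';' closes a token, even an empty one
                    p++;
                    System.out.println("token" + p + ": '" + token + "'");
                    token.setLength(0);
                } else {
                    token.append((char) c);
                }
            }
        }
    }
}

Note that any text after the last ';' would still need to be flushed as a final token; the input above ends with a delimiter, so that step is omitted here.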

Related

How to break a file into tokens based on regex using Java

I have a file in the following format: records are separated by newlines, but some records have line feeds inside them, like below. I need to get each record and process it separately. The file could be a few MB in size.
<?aaaaa>
<?bbbb
bb>
<?cccccc>
I have the code:
FileInputStream fs = new FileInputStream(FILE_PATH_NAME);
Scanner scanner = new Scanner(fs);
scanner.useDelimiter(Pattern.compile("<\\?"));
while (scanner.hasNext()) {
    String line = scanner.next();
    System.out.println(line);
}
scanner.close();
But in the result I got, the beginning <? is removed:
aaaaa>
bbbb
bb>
cccccc>
I know that Scanner consumes any input that matches the delimiter pattern. All I can think of is to add the delimiter pattern back to each record manually.
Is there a way to NOT have the delimiter pattern removed?
Break on a newline only when preceded by a ">" char:
scanner.useDelimiter("(?<=>)\\R"); // Note you can pass a string directly
\R matches any system-independent line break
(?<=>) is a lookbehind that asserts (without consuming) that the previous char is a >
Plus it's cool because <=> looks like Darth Vader's TIE fighter.
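A minimal, self-contained sketch of that approach, assuming Java 8 or later for \R; the input string simply mirrors the sample from the question:

import java.util.Scanner;

public class LookbehindSplit {
    public static void main(String[] args) {
        String input = "<?aaaaa>\n<?bbbb\nbb>\n<?cccccc>";
        try (Scanner scanner = new Scanner(input)) {
            // break only on line breaks that immediately follow a '>'
            scanner.useDelimiter("(?<=>)\\R");
            while (scanner.hasNext()) {
                System.out.println(scanner.next()); // each record keeps its leading "<?"
            }
        }
    }
}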
I'm assuming you want to ignore the newline character '\n' everywhere.
I would read the whole file into a String and then remove all of the '\n's in the String. The part of the code this question is about looks like this:
String fileString = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
fileString = fileString.replace("\n", "");
Scanner scanner = new Scanner(fileString);
... //your code
Feel free to ask any further questions you might have!
Here is one way of doing it by using a StringBuilder:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class RecordPrinter {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner in = new Scanner(new File("C:\\test.txt"));
        StringBuilder builder = new StringBuilder();
        String input = null;
        while (in.hasNextLine() && null != (input = in.nextLine())) {
            for (int x = 0; x < input.length(); x++) {
                builder.append(input.charAt(x));
                if (input.charAt(x) == '>') {
                    System.out.println(builder.toString());
                    builder = new StringBuilder();
                }
            }
        }
        in.close();
    }
}
Input:
<?aaaaa>
<?bbbb
bb>
<?cccccc>
Output:
<?aaaaa>
<?bbbb bb>
<?cccccc>

How to solve 'NumberFormatException.forInputString()'?

My code is not working. The text file is in the same folder as my classes. I used the pathname, which worked, but I don't think that would work if I send the file to someone else. And converting the Strings to primitive types using the parse methods isn't working, either. Not sure what I'm doing wrong. Can anyone help?
Here is my code:
import java.util.Scanner;
import java.util.StringTokenizer;
import java.io.FileNotFoundException;
import java.io.FileInputStream;

public class TestInventory {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        Inventory movieList = new Inventory();
        Scanner inputStream = null;
        try {
            inputStream = new Scanner(new FileInputStream("movies_db.txt"));
        }
        catch (FileNotFoundException e) {
            System.out.println("File not found or could not be opened");
            System.exit(0);
        }
        while (inputStream.hasNextLine()) {
            String s = inputStream.nextLine();
            StringTokenizer st = new StringTokenizer(s, " - ");
            String t1 = st.nextToken();
            String t2 = st.nextToken();
            String t3 = st.nextToken();
            String t4 = st.nextToken();
            int y = Integer.parseInt(t2);
            double r = Double.parseDouble(t4);
            int d = Integer.parseInt(t3);
            Movie m = new Movie(t1, y, r, d);
            movieList.addMovie(m);
        }
    }
}
And this is the output I get:
run:
Exception in thread "main" java.lang.NumberFormatException: For input string: "America:"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at TestInventory.main(TestInventory.java:29)
C:\Users\customer\AppData\Local\NetBeans\Cache\8.1\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 0 seconds)
The error occurs because you are passing the String "America:" to parseInt(), which cannot parse it as an integer.
"All characters in the delim argument are the delimiters for separating tokens." (source)
That means that instead of splitting the text only where the exact sequence " - " appears, it will split wherever there is a " " or a "-".
I think you would be better off using String.split(String regex). This would allow you to split on " - " and get a String array in return.
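A minimal sketch of that suggestion; the sample record below is made up and only mirrors the four-field "text - int - int - double" shape implied by the parse calls in the question:

public class SplitExample {
    public static void main(String[] args) {
        // hypothetical record: title - year - duration - rating
        String s = "Captain America: The First Avenger - 2011 - 124 - 7.0";
        String[] parts = s.split(" - "); // splits only on the exact " - " sequence
        if (parts.length == 4) {
            String title = parts[0];
            int year = Integer.parseInt(parts[1]);
            int duration = Integer.parseInt(parts[2]);
            double rating = Double.parseDouble(parts[3]);
            System.out.println(title + " | " + year + " | " + duration + " | " + rating);
        } else {
            System.out.println("Unexpected number of fields: " + parts.length);
        }
    }
}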
You've fallen into one trap with the StringTokenizer class: the second parameter is read as a set of individual delimiter characters, not as a string that must be present as a whole.
This means that instead of splitting on the exact string " - ", it will split wherever there is a space or a -. This means that t2 probably does not contain what you think it will contain.
Assuming each line should always contain 4 tokens, you can test this by checking whether st.hasMoreTokens() is still true after reading them, in which case it has split the string into more parts than you intended.
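To see the trap concretely, here is a small sketch (the record is made up) showing how StringTokenizer with " - " splits on every space and every hyphen:

import java.util.StringTokenizer;

public class TokenizerTrap {
    public static void main(String[] args) {
        // hypothetical record in the "text - int - int - double" shape
        String line = "Captain America: The First Avenger - 2011 - 124 - 7.0";
        StringTokenizer st = new StringTokenizer(line, " - "); // delimiters are ' ' and '-', not " - "
        while (st.hasMoreTokens()) {
            System.out.println("[" + st.nextToken() + "]");
        }
        // prints [Captain] [America:] [The] [First] [Avenger] [2011] [124] [7.0]
    }
}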

NoSuchElementException for StringTokenizer.nextToken()

When I try to run the code:
import java.io.*;
import java.util.*;

class dothis {
    public static void main(String[] args) throws IOException {
        BufferedReader f = new BufferedReader(new FileReader("ride.in"));
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("ride.out")));
        StringTokenizer st = new StringTokenizer(f.readLine());
        String s1 = st.nextToken();
        String s2 = st.nextToken();
        char[] arr = new char[6];
        if (find(s1, arr, 1) == find(s2, arr, 1)) {
            out.print("one");
        } else {
            out.println("two");
        }
        out.close();
    }
}
With the data file:
ABCDEF
WERTYU
it keeps on outputting:
Exception in thread "main" java.util.NoSuchElementException
    at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
    at dothis.main(Unknown Source)
I did see a similar question on Stack Overflow, but in that case the second line of the text file was blank, so there wasn't a second token to be read. However, the first two lines of this data file both contain a String. How come a token would not be read from the second line?
The StringTokenizer docs say that if you don't pass a token delimiter in the constructor, it's assumed to be:
" \t\n\r\f": the space character, the tab character, the newline character, the carriage-return character, and the form-feed character
You're then asking for two tokens to be read out of the string returned by f.readLine(), which is "ABCDEF" (containing none of those delimiters), so an exception is thrown.
You are creating the StringTokenizer from the first line only, so when you set the value of s2 there is no token left, because s1 already holds the first and only token (ABCDEF). If I am not wrong, you are getting the exception when you try to set s2?
When you read a line, it returns the String up to the "\n" or "\r", and in your case each line contains one token (I believe).
You really don't need StringTokenizer.
Each line you read is already a token for you.
Also, if you expect a line to contain more than one token, you need to make sure you supply the right delimiter to your tokenizer.
StringTokenizer uses the default delimiter set, which is " \t\n\r\f": the space character, the tab character, the newline character, the carriage-return character, and the form-feed character. Delimiter characters themselves will not be treated as tokens.
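A minimal sketch of reading each line directly as a token, assuming ride.in holds one word per line as in the question, so each readLine() call already yields a whole token:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadTwoLines {
    public static void main(String[] args) throws IOException {
        try (BufferedReader f = new BufferedReader(new FileReader("ride.in"))) {
            String s1 = f.readLine(); // first line: "ABCDEF"
            String s2 = f.readLine(); // second line: "WERTYU"
            System.out.println(s1 + " / " + s2);
        }
    }
}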
If you are ready for Java 7/8, you could make it even simpler without the need for StringTokenizer.
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

class dothis {
    public static void main(String[] args) throws IOException {
        String contents = new String(Files.readAllBytes(Paths.get("ride.in")));
        String[] lines = contents.split("\n");
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("ride.out")));
        String s1 = lines[0];
        String s2 = lines[1];
        char[] arr = new char[6];
        if (find(s1, arr, 1) == find(s2, arr, 1)) {
            out.print("one");
        } else {
            out.println("two");
        }
        out.close();
    }
}
In your code:
StringTokenizer st = new StringTokenizer(f.readLine()); // f.readLine() reads only one line
String s1 = st.nextToken();
String s2 = st.nextToken(); // that line contains only one token, so this throws the exception
Try this:
String result = "";
String tmp;
while ((tmp = f.readLine()) != null) {
    result += tmp + "\n";
}
StringTokenizer st = new StringTokenizer(result);
ArrayList<String> str = new ArrayList<String>();
int count = st.countTokens();
for (int i = 0; i < count; i++) {
    str.add(st.nextToken());
}
Now check by using the above arraylist.
StringTokenizer is a legacy class whose use is discouraged in new code (it is not formally deprecated), so it is suggested to use the split() method of the String class instead.
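A small sketch of that suggestion; the sample line is hypothetical, since ride.in in the question has only one token per line anyway:

public class SplitInsteadOfTokenizer {
    public static void main(String[] args) {
        // hypothetical line with two whitespace-separated tokens
        String line = "ABCDEF WERTYU";
        String[] tokens = line.split("\\s+"); // split on runs of whitespace
        System.out.println(tokens[0] + " / " + tokens[1]);
    }
}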

Deal with PatternSyntaxException and scanning texts

I want to find names in a collection of text documents from a huge list of about 1 million names. I'm making a Pattern from the names of the list first:
BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));
String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine(); // skip first line (header)
String combined = "";
while (dataRow != null) {
    String[] dataArray = dataRow.split("\t");
    String name = dataArray[1];
    combined += name.replace("\"", "") + "|";
    dataRow = TSVFile.readLine(); // Read next line of data.
}
TSVFile.close();
Pattern all = Pattern.compile(combined);
After doing so I got a PatternSyntaxException because some names contain a '+' or other regex metacharacters. I tried solving this by ignoring the few offending names:
if (name.contains("\"")) {
    // ignore this name
}
This didn't work properly, and it is also messy because you have to escape everything manually, run it many times, and waste your time.
Then I tried using the quote method:
Pattern all = Pattern.compile(Pattern.quote(combined));
However, now I don't find any matches in the text documents anymore, even when I also apply quote to them. How can I solve this issue?
I agree with the comment of @dragon66: you should not quote the pipe "|". So your code would look like the code below, using Pattern.quote():
BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));
String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine(); // skip first line (header)
String combined = "";
while (dataRow != null) {
    String[] dataArray = dataRow.split("\t");
    String name = dataArray[1];
    combined += Pattern.quote(name.replace("\"", "")) + "|"; // line changed
    dataRow = TSVFile.readLine(); // Read next line of data.
}
TSVFile.close();
Pattern all = Pattern.compile(combined);
Also, I suggest checking whether your problem domain needs this optimization: replace the string concatenation on combined (String is immutable) with a mutable StringBuilder, to avoid creating unnecessary intermediate String objects inside the loop.
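For illustration, a minimal sketch of that suggestion, assuming the same names.tsv layout as in the question (tab-separated, name in the second column); it also strips the trailing '|' so the final pattern does not match the empty string:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class CombineNames {
    public static void main(String[] args) throws IOException {
        StringBuilder combined = new StringBuilder();
        try (BufferedReader tsv = new BufferedReader(new FileReader("names.tsv"))) {
            String dataRow = tsv.readLine(); // header line, discarded
            dataRow = tsv.readLine();        // first data row
            while (dataRow != null) {
                String name = dataRow.split("\t")[1];
                // quote each name so metacharacters like '+' are treated literally
                combined.append(Pattern.quote(name.replace("\"", ""))).append('|');
                dataRow = tsv.readLine();
            }
        }
        if (combined.length() > 0) {
            combined.setLength(combined.length() - 1); // drop the trailing '|'
        }
        Pattern all = Pattern.compile(combined.toString());
        System.out.println(all.pattern());
    }
}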
guilhermerama presented the bugfix to your code.
I will add some performance improvements. As I pointed out, Java's regex library does not scale well and is even slower when used for searching.
But one can do better with multi-string-search algorithms, for example by using the StringsAndChars string search:
// setting up a test file
Iterable<String> lines = createLines();
Files.write(Paths.get("names.tsv"), lines, CREATE, WRITE, TRUNCATE_EXISTING);

// read the pattern from the file
BufferedReader TSVFile = new BufferedReader(new FileReader("names.tsv"));
Set<String> combined = new LinkedHashSet<>();
String dataRow = TSVFile.readLine();
dataRow = TSVFile.readLine(); // skip first line (header)
while (dataRow != null) {
    String[] dataArray = dataRow.split("\t");
    String name = dataArray[1];
    combined.add(name);
    dataRow = TSVFile.readLine(); // Read next line of data.
}
TSVFile.close();

// search the pattern in a small text
StringSearchAlgorithm stringSearch = new AhoCorasick(new ArrayList<>(combined));
StringFinder finder = stringSearch.createFinder(new StringCharProvider("test " + name(38) + "\n or " + name(799) + " : " + name(99999), 0));
System.out.println(finder.findAll());
The result will be
[5:10(00038), 15:20(00799), 23:28(99999)]
The search (finder.findAll()) does take (on my computer) < 1 millisecond. Doing the same with java.util.regex took around 20 milliseconds.
You may tune this performance by using other algorithms provided by RexLex.
Setting up the test file needs the following code:
private static Iterable<String> createLines() {
    List<String> list = new ArrayList<>();
    for (int i = 0; i < 100000; i++) {
        list.add(i + "\t" + name(i));
    }
    return list;
}

private static String name(int i) {
    String s = String.valueOf(i);
    while (s.length() < 5) {
        s = '0' + s;
    }
    return s;
}

Splitting up data file in Java Scanner

I have the following data which I want to split up.
(1,167,2,'LT2A',45,'Weekly','1,2,3,4,5,6,7,8,9,10,11,12,13'),
to obtain each of the values:
1
167
2
'LT2A'
45
'Weekly'
'1,2,3,4,5,6,7,8,9,10,11,12,13'
I am using the Scanner class to do that, with "," as the delimiter.
But I face problems due to the last string: ('1,2,3,4,5,6,7,8,9,10,11,12,13').
I would hence like some suggestions on how I could split this data.
I have also tried using ",'" as the delimiter, but some fields do not have quotes around them.
The question is quite specific to my needs, but I would appreciate it if someone could give me suggestions on how I could split this data up.
Thanks!
You can use simple logic, for example:
String str = "1,167,2,'LT2A',45,'Weekly','1,2,3,4,5,6,7,8,9,10,11,12,13'";
Scanner s = new Scanner(str);
s.useDelimiter(",");
while (s.hasNext()) {
    String element = s.next();
    if (element.startsWith("'") && !element.endsWith("'")) {
        while (s.hasNext()) {
            element += "," + s.next();
            if (element.endsWith("'")) {
                break;
            }
        }
    }
    System.out.println(element);
}
Try this:
String s = "1,167,2,'LT2A',45,'Weekly','1,2,3,4,5,6,7,8,9,10,11,12,13'";
Scanner sc = new Scanner(s);
sc.useDelimiter(",");
while (sc.hasNext()) {
    String n = sc.next();
    if (n.startsWith("'") && !n.endsWith("'")) {
        n = n + sc.findInLine(".+?'");
    }
    System.out.println(n);
}
