StringTokenizer wrong usage in Java

I believe I am not using StringTokenizer correctly. Here is my code:
buffer = new byte[(int) (end - begin)];
fin.seek(begin);
fin.read(buffer, 0, (int) (end - begin));
StringTokenizer strk = new StringTokenizer(new String(buffer),
DELIMS,true);
As you can see, I am reading a chunk of lines from a file (end and begin are line numbers) and passing the data to a StringTokenizer. My delimiters are:
DELIMS = "\r\n ";
because I want to separate words that have a space between them, or are on the next line.
However, this code sometimes splits whole words as well. What could be the explanation? Is my DELIMS string wrong?
I am also passing true as an argument to the tokenizer because I want the delimiters to be returned as tokens as well (I need this to keep track of the line I am currently on).
Could you please help me? Thanks a lot.

To start with, your method for converting bytes into a String is a bit suspect: new String(buffer) decodes with the platform's default charset, and read() is not guaranteed to fill the whole buffer. The overall approach will also be less than efficient, especially for a larger file.
Are you required to use StringTokenizer? If not, I'd strongly recommend using Scanner instead. Its Javadocs are quite comprehensive and already contain good examples, so refer to them for details. Note that Scanner accepts delimiters as well, but as regular expressions rather than as a set of characters, so just be aware.
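As a minimal sketch of the Scanner approach, assuming the text is already available as a String: the class and method names below (ScannerWords, words) are made up for illustration, and the delimiter is a regular expression covering the same three characters as the question's DELIMS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWords {
    // Splits text on runs of spaces, carriage returns, and line feeds.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // A regular expression, unlike StringTokenizer's plain set of characters.
            scanner.useDelimiter("[ \\r\\n]+");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("alpha beta\r\ngamma  delta\n"));
    }
}
```

Because the pattern ends in +, a run of several delimiters (such as "\r\n" or two spaces) is treated as one separator, so no empty tokens are produced.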

You could always wrap your input stream in a LineNumberReader. That will keep track of the line number for you. LineNumberReader extends BufferedReader, which has a readLine() method. With that, you could use a regular StringTokenizer to get your words as tokens. You could use regular expressions or Scanner, but for this case, StringTokenizer is simpler for beginners to understand and quicker.
You must have a RandomAccessFile. You didn't specify that, but I'm guessing based on the methods you used. Try something like:
byte[] buffer = ...; // you know how to get this.
ByteArrayInputStream stream = new ByteArrayInputStream(buffer);
// if you have java.util.Scanner
{
    int lineNum = 0;
    Scanner s = new Scanner(stream);
    while (s.hasNextLine()) {
        lineNum++;
        String line = s.nextLine();
        System.out.format("I am on line %s%n", lineNum);
        Scanner lineScanner = new Scanner(line);
        while (lineScanner.hasNext()) {
            String word = lineScanner.next();
            // do whatever with word
        }
    }
}
// if you don't have java.util.Scanner, or want to use StringTokenizer
// (use one block or the other: each one reads the stream to its end)
{
    LineNumberReader reader = new LineNumberReader(
            new InputStreamReader(stream));
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println("I am on line " + reader.getLineNumber());
        StringTokenizer tok = new StringTokenizer(line);
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken();
            // do whatever with word
        }
    }
}
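For comparison, the question's original idea (returning the delimiters as tokens in order to count lines) can be sketched as follows. This is only an illustration under the assumption that lines end in "\n" or "\r\n"; the class and method names are hypothetical, and the "line:word" output format is chosen purely to make the result easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterLineCounter {
    // Returns each word tagged with its 1-based line number, e.g. "2:gamma".
    static List<String> wordsWithLines(String text) {
        List<String> result = new ArrayList<>();
        int line = 1;
        // returnDelims = true: every delimiter comes back as a one-character token.
        StringTokenizer tokens = new StringTokenizer(text, "\r\n ", true);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.equals("\n")) {
                line++; // count only '\n', so "\r\n" is a single line break
            } else if (!token.equals("\r") && !token.equals(" ")) {
                result.add(line + ":" + token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordsWithLines("alpha beta\r\ngamma\n\ndelta"));
    }
}
```

A file that uses a bare "\r" as its line terminator would not be counted correctly by this sketch; LineNumberReader handles that case for you.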

Related

How to break a file into tokens based on regex using Java

I have a file in the following format. Records are separated by newlines, but some records contain a line feed themselves, as shown below. I need to get each record and process it separately. The file could be a few MB in size.
<?aaaaa>
<?bbbb
bb>
<?cccccc>
I have the code:
FileInputStream fs = new FileInputStream(FILE_PATH_NAME);
Scanner scanner = new Scanner(fs);
scanner.useDelimiter(Pattern.compile("<\\?"));
if (scanner.hasNext()) {
    String line = scanner.next();
    System.out.println(line);
}
scanner.close();
But the result I got has the leading <? removed:
aaaaa>
bbbb
bb>
cccccc>
I know the Scanner consumes any input that matches the delimiter pattern. All I can think of is to add the delimiter pattern back to each record manually.
Is there a way to NOT have the delimiter pattern removed?
Break on a newline only when preceded by a ">" char:
scanner.useDelimiter("(?<=>)\\R"); // Note you can pass a string directly
\R matches any line break sequence (\n, \r\n, and so on), independent of the system; it requires Java 8 or later
(?<=>) is a lookbehind that asserts (without consuming) that the previous char is a >
Plus it's cool because <=> looks like Darth Vader's TIE fighter.
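A small sketch of that delimiter applied to the sample data from the question, assuming Java 8 or later for \R. The input is a String here rather than a file so that the example is self-contained; RecordSplitter and records are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class RecordSplitter {
    // Splits on a line break only when the character before it is '>'.
    static List<String> records(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("(?<=>)\\R");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String record : records("<?aaaaa>\n<?bbbb\nbb>\n<?cccccc>\n")) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Since the lookbehind consumes nothing, each record keeps both its leading <? and its trailing >, and the line break inside the second record is left in place.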
I'm assuming you want to ignore the newline character '\n' everywhere.
I would read the whole file into a String and then remove all of the '\n's in the String. The part of the code this question is about looks like this:
String fileString = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
fileString = fileString.replace("\n", "");
Scanner scanner = new Scanner(fileString);
... //your code
Feel free to ask any further questions you might have!
Here is one way of doing it by using a StringBuilder:
public static void main(String[] args) throws FileNotFoundException {
    Scanner in = new Scanner(new File("C:\\test.txt"));
    StringBuilder builder = new StringBuilder();
    String input = null;
    while (in.hasNextLine() && null != (input = in.nextLine())) {
        for (int x = 0; x < input.length(); x++) {
            builder.append(input.charAt(x));
            if (input.charAt(x) == '>') {
                System.out.println(builder.toString());
                builder = new StringBuilder();
            }
        }
    }
    in.close();
}
Input:
<?aaaaa>
<?bbbb
bb>
<?cccccc>
Output:
<?aaaaa>
<?bbbbbb>
<?cccccc>

Find a string in a very large formatted text file in java

Here is the thing:
I have a really big text file and it has a format like this:
0007476|000011434982|00249626000|R|2008-01-11 00:00:00|9999-12-31 23:59:59|000019.99
0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99
...
...
And I need to find if a particular pattern exists in the file, say
0007476|whatever|00313865000|whatever
All I need is a boolean saying yes or no.
Now what I have done is to read the file line by line and do a regular expression matching:
Pattern pattern = Pattern.compile(regex);
Scanner scanner = new Scanner(new File(fileName));
String line;
while (scanner.hasNextLine()) {
    line = scanner.nextLine();
    if (pattern.matcher(line).matches()) {
        scanner.close();
        return true;
    }
}
and the regex has the form
"0007476\|\d{12}\|0031386500.*"
This method works, but it usually takes about 15 seconds to find a string that is far from the first line. Is there a faster way to achieve that? Thanks
The Java String class has a contains method which returns a boolean. If your string is fixed, this is a lot faster than a regular expression:
if (string.contains("0007476|") && string.contains("|00313865000|")) {
// whatever
}
Hope that helped, if not, leave a comment.
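One caveat with two independent contains() calls is that they can also match values sitting in other columns. A hedged sketch that stays regex-free but ties the check to the first and third fields is shown below; PipeLineMatcher and its parameters are hypothetical names, and it assumes '|' never occurs inside a field.

```java
class PipeLineMatcher {
    // True when the first field equals `first` and the third field equals `third`.
    static boolean matches(String line, String first, String third) {
        if (!line.startsWith(first + "|")) {
            return false;
        }
        // Find the bar that ends the second field.
        int secondBar = line.indexOf('|', first.length() + 1);
        if (secondBar < 0) {
            return false;
        }
        // The third field must start right after it and be followed by a bar.
        return line.startsWith(third + "|", secondBar + 1);
    }

    public static void main(String[] args) {
        String line = "0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99";
        System.out.println(matches(line, "0007476", "00313865000"));
    }
}
```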
I assume that you need the Scanner because the file is too big to read into a single String?
If that is not the case, you can probably use a regular expression that finds the match directly. Depending on whether or not you care about the specific text at the start of the line, you can use something along the lines of:
"(?m)^0007476\|\d{12}\|0031386500.*$"
If you do need to break it up into smaller chunks because of memory usage, I would suggest not reading line by line (since the lines are rather short) but processing bigger chunks, using something like a BufferedReader.
I fiddled around a bit with a 1.25GB file and the following is about 2.5 times faster than your implementation:
private static boolean matches() throws IOException {
    // Backslashes must be doubled inside a Java string literal.
    String regex = "(?m)^0007476\\|\\d{12}\\|0031386500.*$";
    Pattern pattern = Pattern.compile(regex);
    try (BufferedReader br = new BufferedReader(new FileReader(FILENAME))) {
        for (String lines; (lines = readLines(br, 10000)) != null; ) {
            if (pattern.matcher(lines).find()) {
                return true;
            }
        }
    }
    return false;
}

private static String readLines(BufferedReader br, int amount) throws IOException {
    StringBuilder builder = new StringBuilder();
    int lineCounter = 0;
    // Check the counter first so that no line is read and then dropped at a chunk boundary.
    for (String line; lineCounter < amount && (line = br.readLine()) != null; lineCounter++) {
        builder.append(line).append(System.lineSeparator());
    }
    return lineCounter > 0 ? builder.toString() : null;
}

Why does this take so long to run?

I'm a newbie to java, and I'm reading in a ~25 MB file, and it takes forever to just load... Are there any alternatives to make this faster? Is it the Scanner that can't handle large files?
String text = "";
Scanner sc = new Scanner(new File("text.txt"));
while(sc.hasNext()) {
text += sc.next();
}
You are concatenating to text every iteration, and Strings are immutable in Java. This means it creates a new String object in memory every time text is "modified," resulting in long load times for large files. You should always try and use a StringBuilder when you are continuously altering a String.
You could do:
StringBuilder text = new StringBuilder();
Scanner sc = new Scanner(new File("text.txt"));
while (sc.hasNext()) {
    text.append(sc.next());
}
When you want to access the contents of text, you can call text.toString().
The culprit is the String +=, which creates a new, ever-growing String object on every iteration.
In fact, for files no larger than about 25 MB one could do, among other things:
StringBuilder sb = new StringBuilder();
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(new File("text.txt")), "UTF-8"));
for (;;) {
    String line = in.readLine();
    if (line == null)
        break;
    sb.append(line).append("\n");
}
in.close();
String text = sb.toString();
readLine yields the line up to the newline character(s), not including them.
In Java 7 one could do:
Path path = Paths.get("text.txt");
String text = new String(Files.readAllBytes(path), "UTF-8");
The encoding is given explicitly, as UTF-8. "Windows-1252" would be for Windows Latin-1 etcetera.
Try to use buffered streams, e.g. BufferedInputStream or BufferedReader; they will speed things up. For more information about buffered streams, take a look here:
http://docs.oracle.com/javase/tutorial/essential/io/buffers.html
Also use StringBuilder instead of String: since Strings are immutable in Java, += creates a new String on each iteration of the while loop.

no line found exception

Help again guys, why do I always get this kind of error when using scanner, even though I'm sure that the file exists.
java.util.NoSuchElementException: No line found
I am trying to count the number of occurrences of the letter a using a for loop. The text file contains lines of sentences. At the same time, I want to print the sentences in their exact original format.
Scanner scanLine = new Scanner(new FileReader("C:/input.txt"));
while (scanLine.nextLine() != null) {
    String textInput = scanLine.nextLine();
    char[] stringArray = textInput.toCharArray();
    for (char c : stringArray) {
        switch (c) {
            case 'a':
            default:
                break;
        }
    }
}
I'd say the problem is here:
while(scanLine.nextLine() != null) {
    String textInput = scanLine.nextLine();
}
The while condition itself consumes a line, so every iteration reads two lines and discards the first. Once the last line has been read you are at EOF, and the next call to nextLine() throws NoSuchElementException (nextLine() never returns null). Either change the loop condition to scanLine.hasNextLine() or try another approach to reading the file.
Another way of reading the txt file can be like this:
BufferedReader reader = new BufferedReader(new InputStreamReader(new BufferedInputStream(new FileInputStream(new File("text.txt")))));
String line = null;
while ((line = reader.readLine()) != null) {
// do something with your read line
}
reader.close();
or this:
byte[] bytes = Files.readAllBytes(Paths.get("text.txt"));
String text = new String(bytes, StandardCharsets.UTF_8);
You should use : scanner.hasNextLine() instead of scanner.nextLine() in the while condition
Scanner implements the Iterator interface which works by this pattern:
See if there is a next item (hasNext())
Retrieve the next item (next())
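Applied to the question, the hasNextLine()/nextLine() pattern might look like the sketch below. It reads from a String instead of C:/input.txt so that it is self-contained; LetterCounter and countChar are hypothetical names.

```java
import java.util.Scanner;

class LetterCounter {
    // Prints every line unchanged and returns how often `target` occurs in the text.
    static int countChar(String text, char target) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {          // check first ...
                String line = scanner.nextLine();    // ... then read exactly once per iteration
                System.out.println(line);
                for (int i = 0; i < line.length(); i++) {
                    if (line.charAt(i) == target) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countChar("banana\nalpha\n", 'a'));
    }
}
```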
To count the number of occurrences of "a" or for that matter any string in a string, you can use StringUtils from apache-commons-lang like:
System.out.println(StringUtils.countMatches(textInput,"a"));
I think it will be more efficient than converting the string to character array and then looping over the whole array to find the number of occurrences of "a". Moreover, StringUtils methods are null safe

JAVA - import CSV to ArrayList

I'm trying to import a CSV file into an ArrayList using StringTokenizer:
public class Test
{
    public static void main(String [] args)
    {
        List<ImportedXls> datalist = new ArrayList<ImportedXls>();
        try
        {
            FileReader fr = new FileReader("c:\\temp.csv");
            BufferedReader br = new BufferedReader(fr);
            String stringRead = br.readLine();
            while( stringRead != null )
            {
                StringTokenizer st = new StringTokenizer(stringRead, ",");
                String docNumber = st.nextToken( );
                String note = st.nextToken( ); /** PROBLEM */
                String index = st.nextToken( ); /** PROBLEM */
                ImportedXls temp = new ImportedXls(docNumber, note, index);
                datalist.add(temp);
                // read the next line
                stringRead = br.readLine();
            }
            br.close( );
        }
        catch(IOException ioe){...}
        for (ImportedXls item : datalist) {
            System.out.println(item.getDocNumber());
        }
    }
}
I don't understand how nextToken works, because if I initialize all three variables (docNumber, note and index) with nextToken(), it fails with:
Exception in thread "main" java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(Unknown Source)
at _test.Test.main(Test.java:32)
If I keep docNumber only, it works. Could you help me?
It seems that some of the rows of your input file have fewer than 3 comma-separated fields. You should always check whether the tokenizer has more tokens (StringTokenizer.hasMoreTokens), unless you are 100% sure your input is correct.
Parsing CSV files correctly is not a trivial task. Why not use a library that does it well, such as http://opencsv.sourceforge.net/ ?
It seems like your code is reaching a line that the tokenizer breaks up into only 1 part instead of 3. Is it possible to have lines with missing data? If so, you need to handle this.
Most probably your input file is missing an element delimited by , in at least one line. Please show us your input, if possible the line that fails.
However, you don't need to use StringTokenizer. Using String#split() might be easier:
...
while( stringRead != null )
{
    String[] elements = stringRead.split(",");
    if(elements.length < 3) {
        throw new RuntimeException("line too short"); //handle missing entries
    }
    String docNumber = elements[0];
    String note = elements[1];
    String index = elements[2];
    ImportedXls temp = new ImportedXls(docNumber, note, index);
    datalist.add(temp);
    // read the next line
    stringRead = br.readLine();
}
...
You should be able to check your tokens using the hasMoreTokens() method. If it returns false, the line you've read has no further tokens (for example, it is an empty string).
It would be better, though, to use the String.split() method: StringTokenizer is not deprecated, but its Javadoc describes it as a legacy class whose use is discouraged in new code.
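One practical difference between the two is how empty fields are handled, which is a likely way to end up with fewer tokens than expected. The sketch below only counts fields; CsvFieldCount and its methods are hypothetical names, and it ignores quoting, so it is not a substitute for a real CSV parser.

```java
import java.util.StringTokenizer;

class CsvFieldCount {
    // StringTokenizer treats consecutive delimiters as one, so empty fields vanish.
    static int tokenizerCount(String line) {
        return new StringTokenizer(line, ",").countTokens();
    }

    // split with a negative limit keeps empty fields, including trailing ones.
    static int splitCount(String line) {
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        String line = "D1,,7"; // the note column is empty
        System.out.println(tokenizerCount(line) + " vs " + splitCount(line));
    }
}
```

With a line such as "D1,,7", a third nextToken() call would throw NoSuchElementException, while split(",", -1) still yields three elements, the middle one being an empty string.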
