Scanner.next() from a BufferedReader

I am trying to scan text from a reader based on a delimiter "()()" by using the Scanner.next() method. Here is my code:
public static void main(String[] args)
{
String buffer = "i want this text()()without this text()()";
InputStream in = new ByteArrayInputStream(buffer.getBytes());
InputStreamReader isr = new InputStreamReader(in);
BufferedReader reader = new BufferedReader(isr);
Scanner scan = new Scanner(reader);
scan.useDelimiter("/(/)/(/)");
String found = scan.next();
System.out.println(found);
}
The problem is, the entire buffer is returned:
i want this text()()without this text()()
I only want the first next() iteration to return:
i want this text
and the next next() iteration to return:
without this text
Any suggestions how I can scan the reader by only delimiting strings ending in ()()?

Your useDelimiter argument is incorrect - you're using forward slashes instead of backslashes when you try to escape the parentheses. You need to escape the backslashes in Java terms, too:
scan.useDelimiter("\\(\\)\\(\\)");
EDIT: Rather than escaping this yourself, you can use Pattern.quote:
String rawDelimiter = "()()";
String escaped = Pattern.quote(rawDelimiter);
scan.useDelimiter(escaped);
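As a minimal sketch of the Pattern.quote approach applied to the string from the question (the class name QuotedDelimiterDemo and the helper tokens are made up for illustration, and a plain String source is used instead of the BufferedReader):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class QuotedDelimiterDemo {
    // Splits the input on a literal delimiter by quoting it before
    // handing it to the Scanner, so "(" and ")" lose their regex meaning.
    static List<String> tokens(String input, String rawDelimiter) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(input)) {
            scan.useDelimiter(Pattern.quote(rawDelimiter));
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String buffer = "i want this text()()without this text()()";
        for (String token : tokens(buffer, "()()")) {
            System.out.println(token);
        }
    }
}
```

Quoting the delimiter keeps the code correct if the delimiter later changes to something containing other regex metacharacters.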

Related

How to break a file into tokens based on regex using Java

I have a file in the following format, records are separated by newline but some records have line feed in them, like below. I need to get each record and process them separately. The file could be a few Mb in size.
<?aaaaa>
<?bbbb
bb>
<?cccccc>
I have the code:
FileInputStream fs = new FileInputStream(FILE_PATH_NAME);
Scanner scanner = new Scanner(fs);
scanner.useDelimiter(Pattern.compile("<\\?"));
if (scanner.hasNext()) {
String line = scanner.next();
System.out.println(line);
}
scanner.close();
But the result I got has the beginning <? removed:
aaaaa>
bbbb
bb>
cccccc>
I know the Scanner consumes any input that matches the delimiter pattern. All I can think of is to add the delimiter pattern back to each record manually.
Is there a way to NOT have the delimiter pattern removed?

Break on a newline only when preceded by a ">" char:
scanner.useDelimiter("(?<=>)\\R"); // Note you can pass a string directly
\R matches any line break sequence, independent of the platform (available since Java 8)
(?<=>) is a look behind that asserts (without consuming) that the previous char is a >
Plus it's cool because <=> looks like Darth Vader's TIE fighter.
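A small sketch of that delimiter run against the sample records, assuming the input is already in memory as a String (the class name LookbehindDelimiterDemo and the helper records are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LookbehindDelimiterDemo {
    // Splits on a line break only when the character before it is '>',
    // so line breaks inside a record stay part of the token.
    static List<String> records(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("(?<=>)\\R");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<?aaaaa>\n<?bbbb\nbb>\n<?cccccc>\n";
        for (String record : records(sample)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Because the lookbehind does not consume the '>', each record keeps both its leading <? and its trailing >.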

I'm assuming you want to ignore the newline character '\n' everywhere.
I would read the whole file into a String and then remove all of the '\n's in the String. The part of the code this question is about looks like this:
String fileString = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
fileString = fileString.replace("\n", "");
Scanner scanner = new Scanner(fileString);
... //your code
Feel free to ask any further questions you might have!

Here is one way of doing it by using a StringBuilder:
public static void main(String[] args) throws FileNotFoundException {
Scanner in = new Scanner(new File("C:\\test.txt"));
StringBuilder builder = new StringBuilder();
String input = null;
while (in.hasNextLine() && null != (input = in.nextLine())) {
for (int x = 0; x < input.length(); x++) {
builder.append(input.charAt(x));
if (input.charAt(x) == '>') {
System.out.println(builder.toString());
builder = new StringBuilder();
}
}
}
in.close();
}
Input:
<?aaaaa>
<?bbbb
bb>
<?cccccc>
Output:
<?aaaaa>
<?bbbbbb>
<?cccccc>

reading String with spaces java

I am trying to read input with a Scanner, and I want to read the spaces as well.
For example, "john smith" should be read as "john smith".
My code is shown below.
When it gets to the space after "john" it just hangs and doesn't read any more.
Any help would be appreciated.
Scanner in = new Scanner(new InputStreamReader(sock.getInputStream()));
String userName = "";
while (in.hasNext()) {
userName.concat(in.next());
}

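Scanner.next() returns the next token, delimited by whitespace. Note also that String.concat returns a new String rather than modifying userName, so its result has to be assigned. If you would like to read the entire line, along with the spaces, use nextLine() instead: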
String userName = in.nextLine();
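To make the difference concrete, here is a minimal sketch that reads from an in-memory String rather than the socket in the question (the class name NextLineDemo and both helper methods are invented for illustration):

```java
import java.util.Scanner;

class NextLineDemo {
    // next() stops at the first whitespace after the token.
    static String firstToken(String input) {
        try (Scanner in = new Scanner(input)) {
            return in.next();
        }
    }

    // nextLine() returns everything up to the line break, spaces included.
    static String firstLine(String input) {
        try (Scanner in = new Scanner(input)) {
            return in.nextLine();
        }
    }

    public static void main(String[] args) {
        String input = "john smith\n";
        System.out.println(firstToken(input));
        System.out.println(firstLine(input));
    }
}
```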

Scanner scan = new Scanner(file);
scan.useDelimiter("\\Z");
String content = scan.next();
or
private String readFileAsString(String filePath) throws IOException {
StringBuffer fileData = new StringBuffer();
BufferedReader reader = new BufferedReader(
new FileReader(filePath));
char[] buf = new char[1024];
int numRead=0;
while((numRead=reader.read(buf)) != -1){
String readData = String.valueOf(buf, 0, numRead);
fileData.append(readData);
}
reader.close();
return fileData.toString();
}

When you read a token with Scanner.next(), the scanner splits the input on a delimiter. The default delimiter is \p{javaWhitespace}+, which matches one or more characters for which Character.isWhitespace(char) returns true. You can retrieve the current delimiter by calling Scanner.delimiter(), and you can set a custom one with Scanner.useDelimiter().
If you want to read a whole line as a string, use nextLine(). If you already know the type of the next token in the input, Scanner provides a family of next*() methods (nextInt(), nextDouble(), and so on) that convert the token to that type. See the Scanner documentation for more information.
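A short sketch of a custom delimiter combined with a typed next*() method; the comma-separated input, the class name TypedTokenDemo, and the helper sum are assumptions made for the example:

```java
import java.util.Scanner;

class TypedTokenDemo {
    // Sums comma-separated integers. The delimiter is a comma with
    // optional surrounding whitespace instead of the default whitespace.
    static int sum(String csv) {
        int total = 0;
        try (Scanner sc = new Scanner(csv)) {
            sc.useDelimiter("\\s*,\\s*");
            while (sc.hasNextInt()) {
                total += sc.nextInt();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum("10, 20,30"));
    }
}
```

Checking hasNextInt() before nextInt() avoids an InputMismatchException when a token is not an integer; in that case the loop simply stops.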

Why does this take so long to run?

I'm a newbie to java, and I'm reading in a ~25 MB file, and it takes forever to just load... Are there any alternatives to make this faster? Is it the Scanner that can't handle large files?
String text = "";
Scanner sc = new Scanner(new File("text.txt"));
while(sc.hasNext()) {
text += sc.next();
}

You are concatenating to text every iteration, and Strings are immutable in Java. This means it creates a new String object in memory every time text is "modified," resulting in long load times for large files. You should always try and use a StringBuilder when you are continuously altering a String.
You could do:
StringBuilder text = new StringBuilder();
Scanner sc = new Scanner(new File("text.txt"));
while(sc.hasNext()) {
text.append(sc.next());
}
When you want to access the contents of text, you can call text.toString().

The culprit is the String +=, which creates a new, ever-growing String object on every iteration.
In fact, for files smaller than 25 MB one could, among other things, do:
StringBuilder sb = new StringBuilder();
BufferedReader in = new BufferedReader(new InputStreamReader(
new FileInputStream(new File("text.txt")), "UTF-8"));
for (;;) {
String line = in.readLine();
if (line == null)
break;
sb.append(line).append("\n");
}
in.close();
String text = sb.toString();
readLine yields the line up to the newline character(s), not including them.
In Java 7 one could do:
Path path = Paths.get("text.txt");
String text = new String(Files.readAllBytes(path), "UTF-8");
The encoding is given explicitly, as UTF-8. "Windows-1252" would be the name for Windows Latin-1, and so on.
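A variant of the Java 7 approach, sketched with the StandardCharsets.UTF_8 constant so that no charset name has to be looked up at runtime. The class name ReadWholeFileDemo, the helper readAll, and the temporary file are placeholders for the example:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadWholeFileDemo {
    // Reads the whole file into memory and decodes it as UTF-8.
    static String readAll(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for text.txt here.
        Path tmp = Files.createTempFile("demo", ".txt");
        try {
            Files.write(tmp, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
            System.out.print(readAll(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```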

Try using buffered streams, e.g. BufferedInputStream or BufferedReader; they will speed it up. For more information about buffered streams, see:
http://docs.oracle.com/javase/tutorial/essential/io/buffers.html
Also, use a StringBuilder instead of a String: since Strings are immutable in Java, += creates a new String on each iteration of the while loop.

string tokenizer wrong usage in java

I believe I am not using StringTokenizer correctly. Here is my code:
buffer = new byte[(int) (end - begin)];
fin.seek(begin);
fin.read(buffer, 0, (int) (end - begin));
StringTokenizer strk = new StringTokenizer(new String(buffer),
DELIMS,true);
As you can see, I am reading a chunk of lines from a file (end and begin are line numbers) and passing the data to a StringTokenizer. My delimiters are:
DELIMS = "\r\n ";
because I want to separate words that have a space between them or that are on the next line.
However, this code sometimes splits whole words as well. What could be the explanation? Is my DELIMS string wrong?
Also, I am passing true as an argument to the tokenizer because I want the delimiters to be treated as tokens as well (I want this so that I can count which line I am currently on).
Could you please help me? Thanks a lot.

To start with, your method for converting bytes into a String is a bit suspect, and this overall method will be less-than-efficient, especially for a larger file.
Are you required to use StringTokenizer? If not, I'd strongly recommend using Scanner instead. I'd provide you with an example, but will ask that you just refer to the Javadocs instead, which are quite comprehensive and already contain good examples. That said, it accepts delimiters as well - but as Regular Expressions, so just be aware.

You could always wrap your input stream in a LineNumberReader. That will keep track of the line number for you. LineNumberReader extends BufferedReader, which has a readLine() method. With that, you could use a regular StringTokenizer to get your words as tokens. You could use regular expressions or Scanner, but for this case, StringTokenizer is simpler for beginners to understand and quicker.
You must have a RandomAccessFile. You didn't specify that, but I'm guessing based on the methods you used. Try something like:
byte [] buffer = ...; // you know how to get this.
ByteArrayInputStream stream = new ByteArrayInputStream(buffer);
// if you have java.util.Scanner
{
int lineNum = 0;
Scanner s = new Scanner(stream);
while (s.hasNextLine()) {
lineNum++;
String line = s.nextLine();
System.out.format("I am on line %s%n", lineNum);
Scanner lineScanner = new Scanner(line);
while (lineScanner.hasNext()) {
String word = lineScanner.next();
// do whatever with word
}
}
}
// if you don't have java.util.Scanner, or want to use StringTokenizer
{
LineNumberReader reader = new LineNumberReader(
new InputStreamReader(stream));
String line = null;
while ((line = reader.readLine()) != null) {
System.out.println("I am on line " + reader.getLineNumber());
StringTokenizer tok = new StringTokenizer(line);
while (tok.hasMoreTokens()) {
String word = tok.nextToken();
// do whatever with word
}
}
}
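If you do want to keep the original idea of a StringTokenizer that returns its delimiters, the line count can be derived from the "\n" tokens. The following is only a sketch under the assumption that lines end in "\n" or "\r\n"; the class name LineCountingTokenizerDemo and the helper wordsWithLineNumbers are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class LineCountingTokenizerDemo {
    // Returns entries of the form "line:word". With returnDelims = true,
    // every delimiter character comes back as its own one-character token.
    static List<String> wordsWithLineNumbers(String text) {
        List<String> out = new ArrayList<>();
        int line = 1;
        StringTokenizer st = new StringTokenizer(text, "\r\n ", true);
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals("\n")) {
                line++;                      // a line feed starts a new line
            } else if (!tok.equals("\r") && !tok.equals(" ")) {
                out.add(line + ":" + tok);   // anything else is a word
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordsWithLineNumbers("a b\r\nc d\ne"));
    }
}
```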

Check line for unprintable characters while reading text file

My program must read text files - line by line.
Files in UTF-8.
I am not sure that files are correct - can contain unprintable characters.
Is possible check for it without going to byte level?
Thanks.

Open the file with a FileInputStream, then use an InputStreamReader with the UTF-8 Charset to read characters from the stream, and use a BufferedReader to read lines, e.g. via BufferedReader#readLine, which will give you a string. Once you have the string, you can check for characters that aren't what you consider to be printable.
E.g. (without error checking), using try-with-resources (available in Java 7 and later):
String line;
try (
InputStream fis = new FileInputStream("the_file_name");
InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(isr);
) {
while ((line = br.readLine()) != null) {
// Deal with the line
}
}

While it's not hard to do this manually using BufferedReader and InputStreamReader, I'd use Guava:
List<String> lines = Files.readLines(file, Charsets.UTF_8);
You can then do whatever you like with those lines.
EDIT: Note that this will read the whole file into memory in one go. In most cases that's actually fine - and it's certainly simpler than reading it line by line, processing each line as you read it. If it's an enormous file, you may need to do it that way as per T.J. Crowder's answer.

Just found out that with the Java NIO (java.nio.file.*) you can easily write:
List<String> lines=Files.readAllLines(Paths.get("/tmp/test.csv"), StandardCharsets.UTF_8);
for(String line:lines){
System.out.println(line);
}
instead of dealing with FileInputStreams and BufferedReaders...

If you want to check whether a string has unprintable characters, you can use a regular expression:
[^\p{Print}]
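A minimal sketch of applying that expression per line (the class name UnprintableRegexDemo and the helper hasUnprintable are made up here). One caveat: in Java, \p{Print} is a POSIX class that by default covers only printable ASCII, so accented letters and other non-ASCII text are reported as unprintable too:

```java
import java.util.regex.Pattern;

class UnprintableRegexDemo {
    // Matches any single character outside the printable ASCII range.
    private static final Pattern NON_PRINTABLE = Pattern.compile("[^\\p{Print}]");

    // True if the line contains at least one such character.
    static boolean hasUnprintable(String line) {
        return NON_PRINTABLE.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(hasUnprintable("hello world"));
        System.out.println(hasUnprintable("bell\u0007"));
    }
}
```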

How about below:
FileReader fileReader = new FileReader(new File("test.txt"));
BufferedReader br = new BufferedReader(fileReader);
String line = null;
// if no more lines the readLine() returns null
while ((line = br.readLine()) != null) {
// reading lines until the end of the file
}
Source: http://devmain.blogspot.co.uk/2013/10/java-quick-way-to-read-or-write-to-file.html

I can find the following ways to do it.
private static final String fileName = "C:/Input.txt";
public static void main(String[] args) throws IOException {
Stream<String> lines = Files.lines(Paths.get(fileName));
lines.toArray(String[]::new);
List<String> readAllLines = Files.readAllLines(Paths.get(fileName));
readAllLines.forEach(s -> System.out.println(s));
File file = new File(fileName);
Scanner scanner = new Scanner(file);
while (scanner.hasNext()) {
System.out.println(scanner.next());
}
}

The answer by @T.J.Crowder targets Java 6. In Java 7 the better answer is the one by @McIntosh, though using Charset.forName for UTF-8 is discouraged in favor of StandardCharsets.UTF_8:
List<String> lines = Files.readAllLines(Paths.get("/tmp/test.csv"),
StandardCharsets.UTF_8);
for(String line: lines){ /* DO */ }
This closely resembles the Guava approach posted by Skeet above, and of course the same caveats apply. For big files (Java 7), read line by line instead:
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
for (String line = reader.readLine(); line != null; line = reader.readLine()) {}

If every char in the file is properly encoded in UTF-8, you won't have any problem reading it using a reader with the UTF-8 encoding. Up to you to check every char of the file and see if you consider it printable or not.
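What counts as "printable" is a judgment call, so the following is only one possible sketch. It assumes that ISO control characters other than tab, and the U+FFFD replacement character that a decoder substitutes for malformed bytes, are the ones to reject. The class name PrintableCheckDemo and the helper firstUnprintable are hypothetical:

```java
class PrintableCheckDemo {
    // Returns the index of the first character considered unprintable,
    // or -1 if the whole line passes. Iterates by code point so that
    // characters outside the Basic Multilingual Plane are handled.
    static int firstUnprintable(String line) {
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            boolean control = Character.isISOControl(cp) && cp != '\t';
            boolean replacement = cp == 0xFFFD; // sign of malformed input
            if (control || replacement) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnprintable("caf\u00e9\tok"));
        System.out.println(firstUnprintable("a\u0000b"));
    }
}
```

Unlike the ASCII-only regular expression, this check lets accented and other non-ASCII letters through.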
