I'm trying to scan a file that has the DOS ^M as end-of-line using something like:
Scanner file = new Scanner(new File(saveToFilePath)).useDelimiter("(?=\^M)")
In other words, I want to read the text line by line but also keep the ^M that marks the end of the line. This would be easy with \n but I'm not good with regexes and the DOS end-of-line is driving me crazy.
After some research I finally got it. The following regex finds and keeps ^M. I didn't know that ^M meant CTRL-M (the carriage return, which is a control character), so some of your responses helped with that. That also seems to be why the "M" does not appear in the regex: \p{Cntrl} matches control characters in general. With the lookahead, the scanner splits at the elusive "^M" without consuming it.
Scanner file = new Scanner(source).useDelimiter("(?=\\p{Cntrl})");
Thank you, everyone.
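A note on the pattern above: a lookahead delimiter splits just before the control character, so each following token starts with it rather than ending with it. If every line should keep its terminator at the end, one alternative is a lookbehind on the full \r\n pair. The sketch below assumes CRLF-terminated input; the class name CrlfLines and the method split are hypothetical and not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper for illustration only.
final class CrlfLines {

    // Splits text into lines and leaves the trailing "\r\n" attached to each line.
    static List<String> split(String text) {
        List<String> lines = new ArrayList<>();
        // Zero-width lookbehind: the delimiter matches the empty position
        // right after "\r\n", so no characters are consumed.
        try (Scanner scanner = new Scanner(text).useDelimiter("(?<=\\r\\n)")) {
            while (scanner.hasNext()) {
                lines.add(scanner.next());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : split("first\r\nsecond\r\n")) {
            // Make the terminator visible when printing.
            System.out.println(line.replace("\r", "\\r").replace("\n", "\\n"));
        }
    }
}
```

A final line without a terminator is still returned, just without the trailing \r\n.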
Related
I am having issues with the delimiter in my Scanner. I am currently using a Scanner to read a text file and put tokens into a string. My tutor told me to use the delimiter useDelimiter("\t|\n"). However, each token it grabs ends in \r (due to a carriage return in the text file). This is fine for printing purposes, but I need to get the string length, and instead of returning the number of actual characters, it returns the number of characters including that \r. Is there a better delimiter I can use that will accomplish the same thing (without grabbing the \r)? The code is as follows:
studentData.useDelimiter("\t|\n");
while (studentData.hasNext())
{
    token = studentData.next();
    int tokenLength = token.length();
    statCalc(tokenLength);
}
I am well aware that I could simply remove the last character of the string token. However, for many reasons, I just want it to grab the token without the \r. Any and all help would be greatly appreciated.
Try this:
studentData.useDelimiter("\\t|\\R");
The \R pattern (available since Java 8) matches any line break, including \r\n as a single unit; see the documentation for java.util.regex.Pattern.
I guess the remaining \r character is a partially consumed line break from a Windows environment. With the delimiter above, the scanner consumes the whole line break.
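To illustrate the effect on token lengths, here is a minimal sketch that runs the suggested delimiter over an in-memory string instead of a file. The class name TokenLengths and the sample data are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; the sample data is invented.
final class TokenLengths {

    // Returns the length of every token when splitting on tabs and line breaks.
    static List<Integer> lengths(String text) {
        List<Integer> result = new ArrayList<>();
        // "\\R" treats "\r\n" as one line break, so no stray "\r" stays in a token.
        try (Scanner studentData = new Scanner(text).useDelimiter("\\t|\\R")) {
            while (studentData.hasNext()) {
                result.add(studentData.next().length());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Windows-style line endings, as in the question.
        System.out.println(lengths("Ann\t42\r\nBob\t7\r\n")); // prints [3, 2, 3, 1]
    }
}
```

With the original "\t|\n" delimiter the same input would report 3 for "42\r" and 2 for "7\r", which is the off-by-one described in the question.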
Remove all line feeds and carriage returns from your string. Try this:
s = s.replaceAll("\\n", "");
s = s.replaceAll("\\r", "");
A Windows-style line ending is usually \r\n, but your delimiter does not include \r. Your regex pattern (\t|\n) can be improved by using:
(\t|\r\n|\r|\n)
However, it looks to me like what you're trying to accomplish is to create a "tokenizer" which breaks a text file into words (since you're also looking for \t), so my guess is that you're better off with:
studentData.useDelimiter("\\s+");
which treats any run of whitespace (tabs, spaces, \r, \n) as a single delimiter.
You can learn more about regular expressions.
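As a small sketch of the whitespace tokenizer idea: the quantifier matters here, because a delimiter that can also match the empty string (for example one ending in `*`) makes Scanner split between every character. The class name WhitespaceTokens and the sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not from the original answer.
final class WhitespaceTokens {

    // Splits on runs of one or more whitespace characters.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("\\s+")) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "\r\n" counts as one run of whitespace, so no token keeps a "\r".
        System.out.println(tokens("Ann\t42\r\nBob\t7\r\n")); // prints [Ann, 42, Bob, 7]
    }
}
```

Note that this also splits on ordinary spaces, so it only fits data whose fields contain no spaces.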
I've run into a bit of a rough spot in this Java program I'm writing and thought I would ask for some help. I'm using regex to replace certain lines in a file being read in and I'm not getting the desired result. I want to replace every series of 3 new lines in my file and thought this would be straightforward, since my regex works in Notepad++, but I guess not. Below is an example of what the file looks like:
FIRST SENTENCECRLF
CRLF
CRLF
CRLF
CRLF
CRLF
SECOND SENTENCECRLF
So, in other words, I want to remove 3 of those carriage return/line feed instances between the first and second sentence lines. Below is what I've tried so far. The first, when tried in Java, results in no change to the file (it works fine in Notepad++). The second and third are essentially the same pattern written differently, and they also work in Notepad++ but not in Java. Does anyone have any helpful suggestions as to what might work in this situation? At this point anything would be greatly appreciated!
^(\r\n){3}
^\r\n(\r\n)(\r\n)
^\r\n\r\n\r\n
Try the following regex:
(?m)^(\r\n){3}
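The (?m) flag enables multi-line mode in Java, as explained in "How to use java regex to match a line". Without it, ^ matches only at the very start of the input, so the pattern can never match in the middle of the file.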
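A minimal sketch of that regex in use, assuming the whole file content has already been read into a single String (if the file is read line by line, the line terminators are stripped before the regex ever sees them). The class and method names are hypothetical.

```java
// Hypothetical demo class; the sample text mirrors the example in the question.
final class BlankLineTrimmer {

    // Removes each run of three CRLF pairs that starts at the beginning of a line.
    static String removeThreeBlankLines(String text) {
        // (?m) lets ^ match after every line terminator, not only at the start of the input.
        return text.replaceAll("(?m)^(\\r\\n){3}", "");
    }

    public static void main(String[] args) {
        String input = "FIRST SENTENCE\r\n\r\n\r\n\r\n\r\n\r\nSECOND SENTENCE\r\n";
        String output = removeThreeBlankLines(input);
        // Show the line terminators explicitly.
        System.out.println(output.replace("\r\n", "<CRLF>"));
    }
}
```

For this input, three of the five empty lines are removed and two remain between the sentences.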
I'm helping my sisters with a simple java program and I'm stumped. They've only learned scanner classes to read file contents, so I think they're supposed to use the scanner class. Each line contains letters and potentially a blank space, and we're hoping to store each line in an array. This works fine and dandy until one of the lines contains something like:
abcde f (the blank space after f should be read in as part of the line).
However, scanner.nextLine() seems to disregard this last blank space. I figured I could set my scanner delimiter to \n like so:
scanner.useDelimiter("\n")
and then use scanner.next() from there, but this still doesn't seem to work. I've googled around and taken a look at a few Stack Overflow questions. This question seems to suggest this is not easily done with the Scanner class: How to read whitespace with scanner.next()
Any ideas? I feel like there's an easy way I'm overlooking.
This is how I'm reading in the lines:
while (scanner.hasNextLine()) {
    String nextLine = scanner.nextLine();
}
Using the above example, my string would read abcde f. It will get rid of the empty space at the end.
I've also tried to use hasNext and next.
Pardon my formatting, I'm editing on a phone.
Save your text file as ANSI encoding and try again.
By rights, scanner.nextLine() should capture everything in the line, including trailing whitespace; only the line separator is dropped.
scanner.next() will not capture whitespace as the delimiter is whitespace by default.
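A quick way to check this is to run nextLine() over an in-memory string and print each line between brackets so trailing spaces are visible. This is only a sketch; the class name TrailingSpaceDemo is made up, and it does not cover the case where an editor or tool has already stripped the trailing space from the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class for illustration.
final class TrailingSpaceDemo {

    // Reads all lines with nextLine(), which drops only the line separator.
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : readLines("abcde f \nxyz\n")) {
            // Brackets make a trailing blank visible: [abcde f ]
            System.out.println("[" + line + "]");
        }
    }
}
```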
I am using Java's Scanner to parse some text. Say I have set as a delimiter a variety of characters [#$]
With next I get the text till that delimiter, but I would like for a way to learn if parsing stopped because it found # or because it found $.
Is there some way to do that? Or should I break it in two, as in try with the first delimiter, and if you fail try with the second?
Found it! :)
You can use
scanner.findWithinHorizon("[\\#]", 2)
to see if # was the delimiter found.
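Here is a rough sketch of that idea. It assumes single-character delimiters and non-empty tokens, and it uses a horizon of 1 so that only the character directly after the token is inspected. Note that findWithinHorizon consumes the delimiter it finds. The class name DelimiterProbe is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; assumes single-character delimiters and non-empty tokens.
final class DelimiterProbe {

    // Pairs each token with the delimiter that ended it.
    static List<String> describe(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("[#$]")) {
            while (scanner.hasNext()) {
                String token = scanner.next();
                // After next(), the scanner sits just before the delimiter, so
                // looking one character ahead returns it (or null at end of input).
                String delimiter = scanner.findWithinHorizon("[#$]", 1);
                result.add(token + " ended by "
                        + (delimiter == null ? "end of input" : delimiter));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String entry : describe("a#b$c")) {
            System.out.println(entry);
        }
    }
}
```

Because the probe consumes the delimiter, two adjacent delimiters would no longer yield an empty token in between, which is why the sketch assumes non-empty tokens.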
I've got some very basic code like
while (scan.hasNextLine())
{
    String temp = scan.nextLine();
    System.out.println(temp);
}
where scan is a Scanner over a file.
However, on one particular line, which is about 6k chars long, temp cuts out after something like 2470 characters. There's nothing special about when it cuts out; it's in the middle of the word "Australia." If I delete characters from the line, the place where it cuts out changes; e.g. if I delete characters 0-100 in the file then Scanner will get what was previously 100-2570.
I've used Scanner for larger strings before. Any idea what could be going wrong?
At a guess, you may have a rogue character at the cut-off point: look at the file in a hex editor instead of just a text editor. Perhaps there's an embedded null character, or possibly \r in the middle of the string? It seems unlikely to me that Scanner.nextLine() would just chop it arbitrarily.
As another thought, are you 100% sure that it's not all there? Perhaps System.out.println is chopping the string - again due to some "odd" character embedded in it? What happens if you print temp.length()?
EDIT: I'd misinterpreted the bit about what happens if you cut out some characters. Sorry about that. A few other things to check:
If you read the lines with BufferedReader.readLine() instead of Scanner, does it get everything?
Are you specifying the right encoding? I can't see why this would show up in this particular way, but it's something to think about...
If you replace all the characters in the line with "A" (in the file) does that change anything?
If you add an extra line before this line (or remove a line before it) does that change anything?
Failing all of this, I'd just debug into Scanner.nextLine() - one of the nice things about Java is that you can debug into the standard libraries.
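The checks above can be sketched in code. One concrete way a "rogue character" can matter: Scanner.nextLine() also treats \u0085, \u2028 and \u2029 as line separators, while BufferedReader.readLine() only splits on \n, \r and \r\n. The sketch below compares the two readers on an in-memory string and lists control characters in a line. Whether this is what happened in the asker's file is an assumption; the class name and sample text are invented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical diagnostic helper, not from the original answer.
final class LineReaderComparison {

    static List<String> withScanner(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNextLine()) {
                lines.add(scan.nextLine());
            }
        }
        return lines;
    }

    static List<String> withBufferedReader(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Returns the positions of control characters (such as NUL or \u0085) in a line.
    static List<Integer> controlCharIndexes(String line) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            if (Character.isISOControl(line.charAt(i))) {
                indexes.add(i);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        String text = "Austr\u0085alia\n"; // invented sample with an embedded NEL character
        System.out.println("Scanner lines:        " + withScanner(text).size());
        System.out.println("BufferedReader lines: " + withBufferedReader(text).size());
        System.out.println("Control chars at:     "
                + controlCharIndexes(withBufferedReader(text).get(0)));
    }
}
```

If the two readers disagree on the number of lines, the positions reported by controlCharIndexes are a good place to start looking in a hex editor.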