I'm currently in the process of developing a parser for a subset of Java, and I was wondering:
Are there any cases in which newlines are more than token separators?
That is, where they couldn't just be replaced by a space.
Should I ignore newlines, in the same way that I ignore white-space?
That is, just use them to detect token separation.
Yes, almost every newline character in Java source code can be replaced by a space. The one exception outside literals is the newline that ends a // comment: replacing it with a space would pull the following line into the comment. However, do not touch \n (backslash n), because that is the escape sequence for a newline character inside a String literal.
And yes, to the parser newlines are the same as spaces, as long as you are outside String literals. If you are inside a String literal and you remove a newline, you suppress a syntax error, because Java does not allow a raw newline character in a String literal. So, this is wrong:
String str = "first line
same line";
So, it depends on whether you want your parser to detect syntax errors. Do you only parse valid code or not? That is the question you should ask yourself.
The only situation I can think of where it makes a difference is within String-literals.
If there is a linebreak between two "s it would cause a syntax error while a space would not.
Note that \n can also appear inside a string literal as an escape sequence, which you must leave alone. And if you do replace real newlines, still increment your line counter by 1 for each one, because you will need line numbers in the later phases of your project.
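As a rough illustration of the points above, here is a minimal sketch that treats newlines as plain whitespace outside string literals, keeps a line counter, and reports a raw newline found inside a string literal. The class and method names are made up for this example, and it deliberately ignores comments and char literals, which a real lexer would also have to handle.

```java
final class StringNewlineCheck {
    /**
     * Returns the 1-based line on which a string literal is broken by a raw
     * newline, or -1 if no such literal exists. Hypothetical helper: comments
     * and char literals are not handled.
     */
    static int firstBrokenStringLine(String src) {
        int line = 1;
        boolean inString = false;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                 // skip the escaped character, e.g. the n of \n
                } else if (c == '"') {
                    inString = false;    // closing quote
                } else if (c == '\n') {
                    return line;         // raw newline inside a literal: syntax error
                }
            } else if (c == '"') {
                inString = true;         // opening quote
            } else if (c == '\n') {
                line++;                  // outside literals a newline is only whitespace
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String bad = "String str = \"first line\nsame line\";";
        System.out.println(firstBrokenStringLine(bad));
    }
}
```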
To remove quotation marks in Java, I understand I can use
replaceAll("\"", "");
Ex: "Hello World" becomes Hello World.
However, it only removes this type of quotation marks "". Is there a way to remove quotes like this “Hello World” ?
If you simply want to remove those 3 kinds of double-quotes, irrespective of the context:
replaceAll("[\"“”]", "");
If there are other kinds of quote characters that you want to remove, just add them before the ].
These pages list some of the other quote characters that you might encounter:
https://unicode-table.com/en/sets/quotation-marks/
https://en.wikipedia.org/wiki/Quotation_mark
And also see:
Is there a regex to grab all quotation marks?
which talks about the difficulty in creating a regex to match all of them in a future-proof fashion.
Note that since we are including some "funky" characters (non-ASCII) in the source code (above), it is important that the Java compiler is aware of the character encoding that the source code uses. We could avoid that by using Unicode escapes instead. For example:
replaceAll("[\"\u201c\u201d]", "");
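For reference, a small self-contained version of the Unicode-escape variant might look like this. The class and method names are invented for the example; the regex is the one shown above.

```java
final class QuoteStripper {
    // Removes ASCII double quotes plus the left (U+201C) and right (U+201D)
    // curly double quotes, regardless of where they appear in the text.
    static String stripDoubleQuotes(String s) {
        return s.replaceAll("[\"\u201c\u201d]", "");
    }

    public static void main(String[] args) {
        // Input contains both straight and curly quotes.
        System.out.println(stripDoubleQuotes("\"Hello\" \u201cWorld\u201d")); // prints Hello World
    }
}
```

Because the curly quotes are written as escapes, the result does not depend on the encoding the compiler assumes for the source file.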
You may try a regex replacement here, e.g.
String input = "“Hello World”";
System.out.println(input.replaceAll("“(.*?)”", "$1")); // prints Hello World
I am trying to split the string "HI. HOW ARE YOU? I AM FINE!" into a string array using split function with the following syntax
String[] i = "HI. HOW ARE YOU? I AM FINE! ".split("[\\. |? |! ]+");
Expected output
HI
HOW ARE YOU
I AM FINE
But in IntelliJ, it's saying "Duplicate character literal" and it's considering space as a separate delimiter.
How do I make sure that it takes the full stop plus space, question mark plus space and exclamation mark plus space, without considering space as a separate delimiter?
Which would be the correct regex for it?
If it can be done without regex, even that is okay.
Thanks.
A better pattern would be
"[?!.]\\s*"
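As for the error you are getting, that is because the | operator does not work within a character class; inside [...] it is just a literal character, so the repeated | and space are flagged as duplicates. If you want your pattern to work, change the square brackets [...] to parentheses (...) and escape the question mark as \\?, since a bare ? is a quantifier outside a character class.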
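To check the suggested pattern against the input from the question, a quick sketch (class and method names are only for illustration):

```java
import java.util.Arrays;

final class SentenceSplit {
    // Splits after '?', '!' or '.', swallowing any whitespace that follows.
    static String[] splitSentences(String text) {
        return text.split("[?!.]\\s*");
    }

    public static void main(String[] args) {
        String[] parts = splitSentences("HI. HOW ARE YOU? I AM FINE! ");
        System.out.println(Arrays.toString(parts)); // prints [HI, HOW ARE YOU, I AM FINE]
    }
}
```

The trailing "! " does not produce an empty last element, because split() drops trailing empty strings by default.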
I'd like to parse some text using a hand-written descending parser. I used Scanner with the following delimiter: "\\s*". Unfortunately, the fact that this pattern matches an empty String seems to make every hasNextFoo and nextFoo call match nothing any more.
The documentation doesn't say anything about possibly empty delimiters.
You have some objection to the '+' character?
Are you sure you want to use a regular expression at all, and not just an if statement testing for space characters? You say 'runtime'. Is your data in a string, or coming on a stream, or what?
Yes, because I want to use the scanner as a runtime lexer. In short, I want to be able to ask scanner.next(pattern), which would either return the matched string, or throw an exception without consuming the stream. Spaces should be ignored. If there is a better class to do this than Scanner, I would be glad to use it.
I cannot think of any off-the-shelf library class that will do this for you. The normal model of a scanner / lexer is that any invalid character sequence (i.e. one that results in an exception) will be consumed. So, I think you are going to have to implement your own scanner by hand, taking care to treat the read-ahead characters as unconsumed. You could do this with a "pushback" reader or (if that model is not convenient) by explicitly buffering the characters yourself with some kind of mark / reset model. If all you are doing is splitting into tokens separated by one or more spaces, then the pushback reader approach should be fine.
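A minimal sketch of the "buffer the characters yourself" idea, assuming the whole input fits in memory as a CharSequence (so this is not the PushbackReader variant). BufferedLexer and its next(Pattern) method are hypothetical names, not a library API. The position only advances when the pattern matches, so a failed call leaves the input unconsumed.

```java
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BufferedLexer {
    private final CharSequence input;
    private int pos = 0; // offset of the first unconsumed character

    BufferedLexer(CharSequence input) {
        this.input = input;
    }

    /** Skips whitespace, then matches p at the current position or throws without consuming. */
    String next(Pattern p) {
        int start = pos;
        while (start < input.length() && Character.isWhitespace(input.charAt(start))) {
            start++;
        }
        Matcher m = p.matcher(input).region(start, input.length());
        if (!m.lookingAt()) {
            // pos is left untouched, so the caller can try another pattern.
            throw new NoSuchElementException("no match for " + p + " at offset " + start);
        }
        pos = m.end();
        return m.group();
    }

    public static void main(String[] args) {
        BufferedLexer lexer = new BufferedLexer("  foo 42");
        System.out.println(lexer.next(Pattern.compile("[a-z]+"))); // prints foo
        System.out.println(lexer.next(Pattern.compile("\\d+")));   // prints 42
    }
}
```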
You might also consider StreamTokenizer. Here is an example of using it for one-symbol look-ahead in a recursive-descent parser.
It's possible to use lookbehinds/lookaheads to explicitly define which delimiters are omittable.
For instance this scanner uses whitespaces as a delimiter but doesn't need them between numbers and words:
new Scanner("1A.23 4 BC-5")
.useDelimiter("\\s+|(?<=\\d)(?=[A-Z])|(?<=[A-Z])(?=[-+.\\d])");
It produces:
1
A
.23
4
BC
-5
The regex consists of three alternatives:
\s+ consecutive whitespace characters are a delimiter.
(?<=\d)(?=[A-Z]) an empty string between a digit and a letter is a delimiter.
(?<=[A-Z])(?=[-+.\d]) an empty string between a letter and '-', '+', '.' or a digit is a delimiter.
(Note: \w can't be used here as it matches digits.)
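Wrapped into a runnable form, the scanner above could be exercised like this. The class and method names are only for illustration; the delimiter is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class LookaroundTokens {
    // Whitespace is a delimiter, and so is the empty string at a
    // digit/letter or letter/number-start boundary.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter("\\s+|(?<=\\d)(?=[A-Z])|(?<=[A-Z])(?=[-+.\\d])");
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1A.23 4 BC-5"));
    }
}
```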
I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.
scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));
The problem is simply this: if I add "-" to the delimiter, words that contain a "-" get split and counted as two words. So just escaping it with \- isn't the solution of choice.
How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?
Thanks ;)
OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?
Example:
This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.
So, what you need as a delimiter is either a hyphen surrounded by at least one whitespace character on each side, or any run of whitespace and/or punctuation (which you already have covered with the regex you showed). The regex operator for "or" is "|". Most regex implementations have a shortcut for the whitespace character class (spaces, tabs, and newlines): "\s", written "\\s" inside a Java string literal. The hyphen alternative goes first, because otherwise the character class would match the space before the hyphen and leave "-" behind as a token.
"\\s+-\\s+|[.,:;()?!\"\\s]+"
If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.
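A short sketch of this idea with Scanner, under the assumption that a hyphen counts as a separator only when it has whitespace on both sides. The class and method names are invented, and the hyphen alternative is tried first so that the whitespace in front of a lone "-" is not consumed by the punctuation class on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class HyphenWords {
    // Delimiter: a free-standing hyphen, or any run of punctuation/whitespace.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter("\\s+-\\s+|[.,:;()?!\"\\s]+");
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> w = words("foo-bar baz - qux, quux.");
        System.out.println(w.size() + " words: " + w);
    }
}
```

Here "foo-bar" stays one word, while the hyphen between "baz" and "qux" is dropped.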
Maybe this is what you are looking for:
string.split("\\s+(\\W*\\s)?")
Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.
This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.
It might be easier to just ignore words returned by the scanner that consist entirely of hyphens.
Scanner scanner = new Scanner("one two2 - (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");
while (scanner.hasNext()) {
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}
NB
the next(String) method asserts that you get only words since the original useDelimiter() method misses "|"
NB
you have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]"
This should be simple enough: [^\\w-]\\W*|-\\W+
But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are the simpler forms. Keep in mind that the complete solution, which would also handle dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
I don't want to replace it with /u000A, and do NOT want it to look like "asdxyz/u000A".
I want to replace it with the actual newline CHARACTER.
Based on your response to Natso's answer, it seems that you have a fundamental misunderstanding of what's going on. The two-character sequence \n isn't the new-line character. But it's the way we represent that character in code because the actual character is hard to see. The compiler knows that when it encounters that two-character sequence in the context of a string literal, it should interpret them as the real new-line character, which has the ASCII value 10.
If you print that two-character sequence to the console, you won't see them. Instead, you'll see the cursor advance to the next line. That's because the compiler has already replaced those two characters with the new-line character, so it's really the new-line character that got sent to the console.
If you have input to your program that contains backslashes and lowercase N's, and you want to convert them to new-line characters, then Zach's answer might be sufficient. But if you want your program to allow real backslashes in the input, then you'll need some way for the input to indicate that a backslash followed by a lowercase N is really supposed to be those two characters. The usual way to do that is to prefix the backslash with another backslash, escaping it. If you use Zach's code in that situation, you may end up turning the three-character sequence \\n into the two-character sequence consisting of a backslash followed by a new-line character.
The sure-fire way to read strings that use backslash escaping is to parse them one character at a time, starting from the beginning of the input. Copy characters from the input to the output, except when you encounter a backslash. In that case, check what the next character is, too. If it's another backslash, copy a single backslash to the output. If it's a lowercase N, then write a new-line character to the output. If it's any other character, do whatever you define to be the right thing. (Examples include rejecting the whole input as erroneous, pretending the backslash wasn't there, and omitting both the backslash and the following character.)
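The character-by-character approach described above can be sketched as follows. The class and method names are hypothetical, and rejecting unknown escapes is just one of the possible policies mentioned; pick whichever fits your input.

```java
final class BackslashUnescape {
    /** Converts \n to a new-line character and \\ to a single backslash. */
    static String unescape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c != '\\') {
                out.append(c);           // ordinary character: copy as-is
                continue;
            }
            if (i + 1 == in.length()) {
                throw new IllegalArgumentException("dangling backslash at end of input");
            }
            char next = in.charAt(++i);  // look at the character after the backslash
            switch (next) {
                case '\\': out.append('\\'); break;
                case 'n':  out.append('\n'); break;
                default:
                    // Assumed policy: reject anything else as erroneous.
                    throw new IllegalArgumentException("unknown escape: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("line one\\nline two")); // printed on two lines
        System.out.println(unescape("keep \\\\n literal"));  // prints keep \n literal
    }
}
```

Unlike a blanket replace, this keeps an escaped backslash followed by n as those two characters instead of turning it into a backslash plus a new-line.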
If you're trying to observe the contents of a variable in the debugger, it's possible that the debugger may detect the new-line character and convert it back to the two-character sequence \n. So, if you're stepping through your code trying to figure out what's in it, you may fall victim to the debugger's helpfulness. And in most situations, it really is being helpful. Programmers usually want to know exactly what characters are in a string; they're less concerned about how those characters will appear on the screen.
Just use String.replace():
newtext = text.replace("\\n", "\n");
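Use the "(char)(10)" code to generate the true ASCII value.
// replaceAll() takes a regex, so a literal backslash must be written as "\\\\",
// and the replacement must be a String, not a char
newstr = oldstr.replaceAll("\\\\n", String.valueOf((char) 10));
// -or-, without a regex:
newstr = oldstr.replace("\\n", "" + ((char) 10));
//(been a while)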