Strange String.split("\n") behavior - java

I have a .txt text file containing some lines.
I load the content using the RequestBuilder object and split the responseText with
words = String.split("\n"); but I wonder why the result contains the "\n" part.
For example, my text:
abc
def
ghi
the result is,
words[0] = "abc\n"
words[1] = "def\n"
words[2] = "ghi\n"
Any help is highly appreciated. Thanks in advance.

Try using string.split("\\n+"). Or even better - split("[\\r\\n]+")
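To make the suggestion above concrete, here is a minimal sketch. The class name SplitLinesDemo and the helper splitLines are made up for illustration, and the sample strings are assumptions standing in for the file content. It also shows that split("\n") on Windows-style text leaves a trailing carriage return on each piece, which is easy to miss when printing.

```java
import java.util.Arrays;

public class SplitLinesDemo {
    // Splits on any run of CR and/or LF characters, as suggested in the answer above.
    static String[] splitLines(String text) {
        return text.split("[\\r\\n]+");
    }

    public static void main(String[] args) {
        String unix = "abc\ndef\nghi";
        String windows = "abc\r\ndef\r\nghi";

        System.out.println(Arrays.toString(splitLines(unix)));    // [abc, def, ghi]
        System.out.println(Arrays.toString(splitLines(windows))); // [abc, def, ghi]

        // Splitting Windows-style text on "\n" alone keeps the '\r' at the end of each piece.
        System.out.println(windows.split("\n")[0].endsWith("\r")); // true
    }
}
```

Note that "[\\r\\n]+" also swallows empty lines, which may or may not be what you want.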

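You may also want to consider String[] lines = text.split("\\\\n"); which splits on a literal backslash followed by n, in case the text contains that two-character sequence rather than a real line break.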

Windows carriage returns ("\r\n") shouldn't make a visible difference to your results, nor should you need to escape the regular expression you pass to String.split().
Here's proof of both the above using str.split("\n"): http://ideone.com/4PnZi
If you do have Windows carriage returns, though, you should (although it is not strictly necessary) use str.split("\r\n"): http://ideone.com/XcF3C

If split uses regex you should use "\\n" instead of "\n"
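Since the answers in this thread disagree about how many backslashes are needed, here is a small sketch that compares the variants. The class NewlineEscapes and the helper parts are hypothetical names used only for this illustration. As far as I can tell, "\n" and "\\n" both match a real line feed, while "\\\\n" only matches a backslash followed by the letter n.

```java
public class NewlineEscapes {
    // Number of pieces produced by splitting text on the given regex.
    static int parts(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String real = "abc\ndef";      // contains an actual line feed
        String literal = "abc\\ndef";  // contains a backslash followed by 'n'

        System.out.println(parts(real, "\n"));        // 2
        System.out.println(parts(real, "\\n"));       // 2
        System.out.println(parts(real, "\\\\n"));     // 1
        System.out.println(parts(literal, "\\\\n"));  // 2
    }
}
```

So if "\\\\n" is the only variant that works, the text most likely contains the two characters backslash and n, not a line break.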

Try using string.split("\\\\n"). It works for me.

Maybe it's trivial, but the .split method is sensitive to spaces and text breaks. If we don't know how the original text is written, we have to consider that this could make some differences (single line, breaks, multilines, etc).
Single line text:
const inlineText = "Hello world!";
console.log(inlineText.split(' ')) //['Hello', 'world!']
Multi lines text:
const multilinesText = `
Hello world!
`
console.log(multilinesText.split(' ')) // ['\nHello', 'world!', '\n']

Java Western + Arabic String concatenation issues

I'm having trouble in concatenating pieces of text mixing Western and Arabic chars.
I've a list of tokens like this:
-LRB-
دریای
مازندران
-RRB-
,
I use the following procedure to concatenate this list of tokens:
String str = "";
for (String tok : tokens) {
str += tok + " ";
}
This is the output of my procedure:
-LRB- دریای مازندران -RRB- ,
As can be seen, the position of the Arabic words is inverted.
How can I solve this (maybe suggesting to Java to ignore the information about text direction)?
EDIT
Actually, it seems that my problem was a false problem.
Now I've a new one. I need to wrap each word inside a string like this (word *) so that my output will be like this:
(word1 *)(word2 *)(word3 *)...
The procedure that I use is the following:
String str = "";
for (String tok : tokens) {
str += "(" + tok + " *)";
}
However, the result that I got is this:
(-LRB- *)(دریای *)(مازندران *)(-RRB- *)(, *)
instead of:
(-LRB- *)(دریای)(* مازندران *)(-RRB- *)(, *)
** EDIT2 **
Actually, I've discovered that my problem is not a problem. I wrote my string on a file and I opened it with nano (in the console). And it was correctly concatenated.
So the problem was due to the Eclipse console (and also gedit) which --let's say-- incorrectly rendered the string.
Anyway, thanks for your help!
The output is correct, and if you are presenting this text to an Arabic-speaking user you should not override the directionality of the text. Arabic is written from right to left. When you concatenate two Arabic strings together, the first will appear to the right of the second. This is controlled by the BiDi algorithm, the details of which are covered in http://www.unicode.org/reports/tr9/.
First, I would suggest using StringBuilder instead of raw String concatenation. Your garbage collector will be a lot happier. Second, without seeing the input or how your StringTokenizer is set up, I would venture a guess that you are having problems tokenizing the string properly.
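For what it's worth, the StringBuilder suggestion could look roughly like the sketch below. TokenJoiner and wrap are hypothetical names, and the sample uses placeholder Latin tokens instead of the Arabic ones. The characters are stored in logical order either way; how a console or editor displays mixed-direction text is a separate matter.

```java
import java.util.Arrays;
import java.util.List;

public class TokenJoiner {
    // Wraps every token as "(token *)" and concatenates the results in list order.
    static String wrap(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String tok : tokens) {
            sb.append('(').append(tok).append(" *)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("-LRB-", "word1", "word2", "-RRB-", ",");
        System.out.println(wrap(tokens)); // (-LRB- *)(word1 *)(word2 *)(-RRB- *)(, *)
    }
}
```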

Use RegEx in Java to extract parameters in between parentheses

I'm writing a utility to extract the names of header files from JSPs. I have no problem reading the JSPs line by line and finding the lines I need. I am having a problem extracting the specific text needed using regex. After looking at many similar questions I'm hitting a brick wall.
An example of the String I'll be matching from within is:
<jsp:include page="<%=Pages.getString(\"MY_HEADER\")%>" flush="true"></jsp:include>
All I need is MY_HEADER for this example. Any time I have this tag:
<%=Pages.getString
I need what comes between this:
<%=Pages.getString(\" and this: )%>
Here is what I have currently (which is not working, I might add) :
String currentLine;
while ((currentLine = fileReader.readLine()) != null)
{
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}}
I need to be able to use the Java RegEx API and regex to extract those header names.
Any help on this issue is greatly appreciated. Thanks!
EDIT:
Resolved this issue, thankfully. The tricky part was that, even with the right regex, I had to take into account that the String I was feeding to the regex would always contain two backslash characters ( (\"MY_HEADER\") ) that needed to be escaped in the pattern.
Here is what worked (thanks to the help ;-)):
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\\"]*)");
This should do the trick:
<%=Pages\\.getString\\(\\\\\"([^\\\\]*)
Yeah that's a scary number of back slashes. matcher.group(1) should return MY_HEADER. It starts at the \" and matches everything until the next \ (which I assume here will be at \")%>.)
Of course, if your target text contains a backslash (\), this will not work. But you didn't give an indication that you'd ever be looking for something like <%=Pages.getString(\"Fun!\Yay!\")%> -- where this regex would only return Fun! and ignore the rest.
EDIT
The reason your test case was failing is because you were using this test string:
String currentLine = "<%=Pages.getString(\"MY_HEADER\")%>";
This is the equivalent of reading it in from a file and seeing:
<%=Pages.getString("MY_HEADER")%>
Note the lack of any \. You need to use this instead:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Which is the equivalent of what you want.
This is test code that works:
String currentLine = "<%=Pages.getString(\\\"MY_HEADER\\\")%>";
Pattern pattern = Pattern.compile("<%=Pages\\.getString\\(\\\\\"([^\\\\]*)");
Matcher matcher = pattern.matcher(currentLine);
while(matcher.find()) {
System.out.println(matcher.group(1).toString());
}
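As a possible variation on the code above, the pattern can be compiled once and made to tolerate lines both with and without the backslash before the quote. This is only a sketch: HeaderExtractor and extract are invented names, and the optional-backslash pattern is my assumption about what the JSP lines look like, not something taken from the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderExtractor {
    // Regex: <%=Pages\.getString\(\\?"([^"\\]*)
    // The backslash before the opening quote is optional, and the captured
    // name stops at the next quote or backslash.
    private static final Pattern HEADER =
            Pattern.compile("<%=Pages\\.getString\\(\\\\?\"([^\"\\\\]*)");

    // Returns the first header name found in the line, or null if there is none.
    static String extract(String line) {
        Matcher m = HEADER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // As read from a file that contains backslashes before the quotes.
        System.out.println(extract("<%=Pages.getString(\\\"MY_HEADER\\\")%>")); // MY_HEADER
        // The same tag without backslashes.
        System.out.println(extract("<%=Pages.getString(\"MY_HEADER\")%>"));     // MY_HEADER
    }
}
```

Compiling the Pattern outside the read loop avoids recompiling it for every line.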

Cut <br/>-Tags from String end

I am currently developing a Web-Application using Java EE where I'm using a Rich-Javascript-Editor (http://www.primefaces.org/showcase/ui/editor.jsf).
As the user can easily add too many linebreaks that will be converted to linebreak-tags, I need to remove all these Tags from the end of a String.
Is there an elegant way of using Regex to accomplish this?
An example String would be:
"This is a test <b>bold</b><br/><br/>"
Where obviously the last two tags have to be removed.
Thank you in advance for any help
Best Regards,
Robert
Something like this:
String s = "This is a test <b>bold</b><br/><br/>";
String s2 = s.replaceAll("(\\s*<[Bb][Rr]\\s*/?>)+\\s*$", "");
// s2 = "This is a test <b>bold</b>";
Note that it will also remove trailing whitespace; you can delete the final \\s* if you don't want that.
Here is a simple line of Java code to remove all instances of the substring "<br/>" from the end of a string myString:
myString = myString.replaceAll("(<br/>)+$", "");
This should do the trick for most cases.
This regex covers most of the forms in which <br> occurs.
All of the following are valid <br> tags:
-<br>
-<br/>
-<br />
-<br / >
-<br/ >
-<br //// >
This regex identifies all such forms of <br>
string = string.replaceAll("<br[\\s/]*>", "");
Try this one. It works for me, but note that it removes every <br/> in the string, not only the trailing ones:
myString = myString.replaceAll("<br/>", "");
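Putting the answers above together, a trailing-only version might look like the sketch below. TrailingBreaks and trimTrailingBreaks are hypothetical names, and the exact set of <br> spellings to accept is an assumption. The $ anchor is what keeps <br> tags in the middle of the text untouched.

```java
public class TrailingBreaks {
    // Removes any run of <br>, <br/>, <br /> (case-insensitive) at the very end
    // of the string, together with surrounding whitespace.
    static String trimTrailingBreaks(String html) {
        return html.replaceAll("(?i)(\\s*<br\\s*/?\\s*>)+\\s*$", "");
    }

    public static void main(String[] args) {
        System.out.println(trimTrailingBreaks("This is a test <b>bold</b><br/><br/>"));
        // This is a test <b>bold</b>
        System.out.println(trimTrailingBreaks("line one<br/>line two<BR /> <br>"));
        // line one<br/>line two
    }
}
```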

replace \n and \r\n with <br /> in java

This has been asked several times for several languages but I can't get it to work.
I have a string like this
String str = "This is a string.\nThis is a long string.";
And I'm trying to replace the \n with <br /> using
str = str.replaceAll("(\r\n|\n)", "<br />");
but the \n is not getting replaced.
I tried to use this RegEx Tool to verify and I see the same result. The input string does not have a match for "(\r\n|\n)". What am I doing wrong?
It works for me.
public class Program
{
public static void main(String[] args) {
String str = "This is a string.\nThis is a long string.";
str = str.replaceAll("(\r\n|\n)", "<br />");
System.out.println(str);
}
}
Result:
This is a string.<br />This is a long string.
Your problem is somewhere else.
For me, this worked:
rawText.replaceAll("(\\\\r\\\\n|\\\\n)", "\\\n");
Tip: use regex tester for quick testing without compiling in your environment
A little more robust version of what you're attempting:
str = str.replaceAll("(\r\n|\n\r|\r|\n)", "<br />");
Since my account is new I can't up-vote Nino van Hooff's answer. If your strings are coming from a Windows based source such as an aspx based server, this solution does work:
rawText.replaceAll("(\\\\r\\\\n|\\\\n)", "<br />");
Seems to be a weird character set issue as the double back-slashes are being interpreted as single slash escape characters. Hence the need for the quadruple slashes above.
Again, under most circumstances "(\\r\\n|\\n)" should work, but if your strings are coming from a Windows based source try the above.
Just an FYI: I tried everything to correct the issue I was having replacing those line endings. At first I thought it was a failed conversion from Windows-1252 to UTF-8, but fixing that didn't work either. This solution is what finally did the trick. :)
It works for me. The Java code works exactly as you wrote it. In the tester, the input string should be:
This is a string.
This is a long string.
...with a real linefeed. You can't use:
This is a string.\nThis is a long string.
...because it treats \n as the literal sequence backslash 'n'.
That should work, but don't kill yourself trying to figure it out. Just use 2 passes.
str = str.replaceAll("(\r\n)", "<br />");
str = str.replaceAll("(\n)", "<br />");
Disclaimer: this is not very efficient.
This should work. You need to put in two slashes
str = str.replaceAll("(\\r\\n|\\n)", "<br />");
In this Reference, there is an example which shows
private final String REGEX = "\\d"; // a single digit
I have used two slashes in many of my projects and it seems to work fine!
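To sum up the thread, a single-pass version that handles Windows, old Mac and Unix line endings might look like this. LineBreaksToHtml and toHtmlBreaks are names invented for the sketch. The "\r\n" alternative comes first so that a Windows line ending becomes one <br /> rather than two.

```java
public class LineBreaksToHtml {
    // Replaces CRLF, lone CR and lone LF with an HTML line break.
    static String toHtmlBreaks(String text) {
        return text.replaceAll("\\r\\n|\\r|\\n", "<br />");
    }

    public static void main(String[] args) {
        System.out.println(toHtmlBreaks("This is a string.\nThis is a long string."));
        // This is a string.<br />This is a long string.
        System.out.println(toHtmlBreaks("one\r\ntwo\rthree"));
        // one<br />two<br />three
    }
}
```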

How to escape text for regular expression in Java?

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
Difference between Pattern.quote and Matcher.quoteReplacement was not clear to me before I saw following example
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
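Here is a small runnable sketch of that combination. QuoteDemo and replaceFirstLiteral are hypothetical names, and the price strings are made-up sample data. Pattern.quote protects the search text, Matcher.quoteReplacement protects the replacement text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteDemo {
    // Replaces the first literal occurrence of target with a literal replacement.
    static String replaceFirstLiteral(String s, String target, String replacement) {
        return s.replaceFirst(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstLiteral("Price: $5 (was $5)", "$5", "$3"));
        // Price: $3 (was $5)
        System.out.println("Price: $5".matches(".*" + Pattern.quote("$5")));
        // true
    }
}
```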
It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);
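A minimal sketch of the Pattern.LITERAL approach, with invented names (LiteralFlagDemo, containsLiteral) and made-up sample strings. With that flag the whole pattern is treated as plain text, so "$" and "." lose their special meaning.

```java
import java.util.regex.Pattern;

public class LiteralFlagDemo {
    // True if needle occurs in haystack, with needle taken literally.
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(needle, Pattern.LITERAL).matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("cost: $5", "$5")); // true
        System.out.println(containsLiteral("cost: 5", "$5"));  // false
        System.out.println(containsLiteral("axb", "a.b"));     // false
    }
}
```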
I think what you're after is \Q$5\E. Also see Pattern.quote(s), introduced in Java 5.
See Pattern javadoc for details.
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
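The failure and the fix described above can be reproduced with a sketch like the following. ReplacementDemo, fill and unsafeFillThrows are hypothetical names, and the message text is made up. The unquoted call is only there to show the exception.

```java
import java.util.regex.Matcher;

public class ReplacementDemo {
    // Safe version: the user input is inserted literally, dollar signs included.
    static String fill(String msg, String userInput) {
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    // Unsafe version: returns true if replaceAll fails on a group reference such as "$3".
    static boolean unsafeFillThrows(String msg, String userInput) {
        try {
            msg.replaceAll("<userInput />", userInput);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String msg = "You owe <userInput />.";
        System.out.println(unsafeFillThrows(msg, "$3.50")); // true
        System.out.println(fill(msg, "$3.50"));             // You owe $3.50.
    }
}
```

If the search text is also plain text, String.replace(CharSequence, CharSequence) avoids regex handling on both sides.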
To get a protected pattern, you can escape every character except digits and letters by prefixing it with a backslash ("\\\\" in the replacement string). After that, you can insert your own special symbols into the protected pattern, so that it works not like plain quoted text but like a real pattern of your own, without any special symbols supplied by the user.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}
Pattern.quote("blabla") works nicely.
Pattern.quote() works nicely. It encloses the text between "\Q" and "\E", and it also takes care of any "\E" that already appears in the text.
However, if you need real regular-expression escaping (or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: " + Pattern.quote(someText));
System.out.println("Full escape: " + someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
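A quick way to convince yourself that Pattern.quote copes with an embedded \E is to check that a quoted text still matches itself. QuoteWithEDemo and matchesItself are names made up for this sketch.

```java
import java.util.regex.Pattern;

public class QuoteWithEDemo {
    // True if the quoted form of text matches exactly that text.
    static boolean matchesItself(String text) {
        return Pattern.matches(Pattern.quote(text), text);
    }

    public static void main(String[] args) {
        System.out.println(matchesItself("Some\\E/s/wText*/,**")); // true
        System.out.println(matchesItself("\\Q$5\\E"));             // true
    }
}
```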
^(Negation) symbol is used to match something that is not in the character group.
This is the link to Regular Expressions