How to remove hard spaces with Jsoup? - java

I'm trying to remove hard spaces (from &nbsp; entities in the HTML). I can't remove them with .trim(), .replace(" ", ""), etc. I don't get it.
I even found a suggestion on Stack Overflow to try \\u00a0, but that didn't work either.
I tried this (since text() returns actual hard space characters, U+00A0):
System.out.println( "'"+fields.get(6).text().replace("\\u00a0", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().replace(" ", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().trim()+"'"); //'94,00 '
System.out.println( "'"+fields.get(6).html().replace("&nbsp;", "")+"'"); //'94,00' works
But I can't figure out why I can't remove the white space with .text().

Your first attempt was very nearly it; you're quite right that Jsoup maps &nbsp; to U+00A0. You just don't want the double backslash in your string:
System.out.println( "'"+fields.get(6).text().replace("\u00a0", "")+"'" ); //'94,00'
// Just one ------------------------------------------^
replace doesn't use regular expressions, so you aren't trying to pass a literal backslash through to the regex level. You just want to specify character U+00A0 in the string.
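As a rough illustration of the difference, here is a minimal sketch that needs no Jsoup: the sample string is built by hand to stand in for what text() returned in the question, and the class and method names are made up for this example.

```java
class HardSpaceDemo {
    // Removes every NO-BREAK SPACE (U+00A0) from the input.
    static String removeHardSpaces(String s) {
        return s.replace("\u00a0", "");
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the text() result: "94,00" followed by one hard space.
        String text = "94,00\u00a0";

        // With the doubled backslash, the search string is six literal characters
        // (backslash, u, 0, 0, a, 0), which never occur in the text, so nothing changes.
        System.out.println("'" + text.replace("\\u00a0", "") + "'");

        // With a single backslash, the search string is the hard space itself.
        System.out.println("'" + removeHardSpaces(text) + "'");

        // trim() only strips characters up to U+0020, so the hard space survives it.
        System.out.println(text.trim().length() == text.length());
    }
}
```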

The question has been edited to reflect the true problem.
New answer:
The hard space, i.e. the &nbsp; entity (Unicode character NO-BREAK SPACE, U+00A0), can be represented in Java by the character \u00a0. The code thus becomes the following, where str is the string returned by the text() method:
str.replaceAll ("\u00a0", "");
Old answer:
Using the Jsoup library:
import org.jsoup.parser.Parser;
String str1 = Parser.unescapeEntities("last week, Ovokerie Ogbeta", false);
String str2 = Parser.unescapeEntities("Entered » Here", false);
System.out.println(str1 + " " + str2);
Prints out:
last week, Ovokerie Ogbeta Entered » Here

Related

What does regex "\\p{Z}" mean?

I am working with some Java code that has a statement like
String tempAttribute = ((String) attributes.get(i)).replaceAll("\\p{Z}","")
I am not used to regex, so what does it mean? (If you could point me to a website for learning the basics of regex, that would be wonderful.) I've seen that a string like
ept as y gets transformed into eptasy, but this doesn't seem right. I believe whoever wrote this maybe wanted to trim leading and trailing spaces.
It removes all the whitespace (replaces all whitespace matches with empty strings).
A wonderful regex tutorial is available at regular-expressions.info.
A citation from this site:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
The OP stated that the code fragment was in Java. To comment on the statement:
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
the sample code below shows that this does not apply in Java.
public static void main(String[] args) {
    // some normal white space characters
    String str = "word1 \t \n \f \r " + '\u000B' + " word2";
    // various regex patterns meant to remove ALL white spaces
    String s = str.replaceAll("\\s", "");
    String p = str.replaceAll("\\p{Space}", "");
    String b = str.replaceAll("\\p{Blank}", "");
    String z = str.replaceAll("\\p{Z}", "");
    // \\s removed all white spaces
    System.out.println("s [" + s + "]\n");
    // \\p{Space} removed all white spaces
    System.out.println("p [" + p + "]\n");
    // \\p{Blank} removed only \t and spaces not \n\f\r
    System.out.println("b [" + b + "]\n");
    // \\p{Z} removed only spaces not \t\n\f\r
    System.out.println("z [" + z + "]\n");
    // NOTE: \p{Separator} throws a PatternSyntaxException
    try {
        String t = str.replaceAll("\\p{Separator}","");
        System.out.println("t [" + t + "]\n"); // N/A
    } catch ( Exception e ) {
        System.out.println("throws " + e.getClass().getName() +
                " with message\n" + e.getMessage());
    }
} // public static void main
The output for this is:
s [word1word2]
p [word1word2]
b [word1
word2]
z [word1
word2]
throws java.util.regex.PatternSyntaxException with message
Unknown character property name {Separator} near index 12
\p{Separator}
^
This shows that in Java \\p{Z} removes only spaces and not "any kind of whitespace or invisible separator".
These results also show that in Java \\p{Separator} throws a PatternSyntaxException.
First of all, \p means you are going to match a class (a collection of characters), not a single character. For reference, this is the Javadoc of the Pattern class: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Unicode scripts, blocks, categories and binary properties are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property.
Then Z is the name of a class (collection, set) of characters. In this case, it is the abbreviation of Separator. Separator contains three subclasses: Space_Separator (Zs), Line_Separator (Zl) and Paragraph_Separator (Zp).
For the characters each of those classes contains, see the Unicode Character Database or
Unicode Character Categories
More documentation: http://www.unicode.org/reports/tr18/#General_Category_Property
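To make the category idea concrete, the following sketch checks a few individual characters against \p{Z} and against \s. The class name, the helper method and the choice of sample characters are assumptions made for this illustration; \s is used here in its default (non-Unicode) mode.

```java
class SeparatorClassDemo {
    // True if the single character c is matched by the given regex.
    static boolean matches(String regex, char c) {
        return String.valueOf(c).matches(regex);
    }

    public static void main(String[] args) {
        // Space, tab, line feed, NO-BREAK SPACE (Zs) and LINE SEPARATOR (Zl).
        char[] samples = {' ', '\t', '\n', '\u00a0', '\u2028'};
        for (char c : samples) {
            System.out.printf("U+%04X  \\p{Z}=%b  \\s=%b%n",
                    (int) c, matches("\\p{Z}", c), matches("\\s", c));
        }
    }
}
```

Tab and line feed are control characters, not separators, so only \s matches them; the no-break space and the line separator belong to the Z categories, so only \p{Z} matches them.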

REGEX to format phone number in java

Given a phone number with spaces and + allowed, how would you write a regular expression to format it so that non-digits and extra spaces are removed?
I have this so far
String num = " Ken's Phone is + 123 2213 123 (night time)";
System.out.println(num.replaceAll("[^\\d|+|\\s]", "").replaceAll("\\s\\s+", " ").replaceAll("\\+ ", "\\+").trim());
Would you simplify it so that the same result is obtained?
Thank you
I would put trim() first, or at least before you replace the runs of multiple spaces.
Also keep in mind that \s means any whitespace character, [ \t\n\x0B\f\r]; if you only mean ' ', then use that instead.
A nicer way to express that you only want at least two spaces to be replaced would be
replaceAll("\\s{2,}", " ")
First extract the number-with-spaces part, then compress multiple spaces to single spaces, and finally remove all spaces that follow a plus sign:
String numberWithSpaces = str.replaceAll("^[^\\d+]*([+\\d\\s]+)[^\\d]*$", "$1").replaceAll("\\s+", " ").replaceAll("\\+\\s*", "+");
I tested this code and it works.
You can simplify it as:
num.replaceAll("[^\\d+\\s]", "") // [^\\d|+|\\s] => [^\\d+\\s]
.replaceAll("\\s{2,}", " ") // \\s\\s+ => \\s{2,}
.replaceAll("\\+\\s", "+") // \\+ => +
.trim()
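For reference, here is the simplified chain wrapped in a small runnable sketch. The class and method names are invented for this example, and the input literal is a hand-typed version of the sample sentence from the question.

```java
class PhoneFormatDemo {
    // Applies the simplified replaceAll chain from the answer above.
    static String format(String raw) {
        return raw.replaceAll("[^\\d+\\s]", "") // keep only digits, '+' and whitespace
                  .replaceAll("\\s{2,}", " ")   // collapse runs of whitespace to one space
                  .replaceAll("\\+\\s", "+")    // drop the single space left after '+'
                  .trim();                      // remove leading and trailing whitespace
    }

    public static void main(String[] args) {
        String num = " Ken's Phone is + 123 2213 123 (night time)";
        System.out.println(format(num)); // prints: +123 2213 123
    }
}
```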

removing white spaces from string value

I have a link http://localhost:8080/reporting/pvsUsageAction.do?form_action=inline_audit_view&days=7&projectStatus=scheduled&justificationId=5&justificationName= No Technicians in Area in my Struts-based web application.
The justificationName variable in the URL has some spaces before its value, as shown. When I get the value of justificationName using request.getParameter("justificationName"), it gives me the value with the spaces, as given in the URL. I want to remove those spaces. I tried trim() and I tried str = str.replace(" ", ""); but neither of them removed those spaces. Can anyone suggest some other way to remove the spaces?
One more thing I noticed: when I right-clicked the link and opened it in a new tab, the link looked like this:
http://localhost:8080/reporting/pvsUsageAction.do?form_action=inline_audit_view&days=7&projectStatus=scheduled&justificationId=5&justificationName=%A0%A0%A0%A0%A0%A0%A0%A0No%20Technicians%20in%20Area
The notable point is that the address bar shows %A0 for the leading white spaces and %20 for the ordinary spaces. Please look at the link and explain the difference, if anyone has an idea about it.
EDIT
Here is my code
String justificationCode = "";
if (request.getParameter("justificationName") != null) {
justificationCode = request.getParameter("justificationName");
}
justificationCode = justificationCode.replace(" ", "");
Note: the replace function removes the spaces from inside the string but does not remove the leading spaces.
E.g. if my string is " This is string", after using replace it becomes " Thisisstring".
Thanks in advance
Strings are immutable in Java, so the method doesn't change the string you pass but returns a new one. You must use the returned value :
str = str.replace(" ", "");
Manual trim
You need to remove the spaces from the string. This will remove any number of consecutive spaces.
String trimmed = str.replaceAll(" +", "");
If you want to replace all whitespace characters:
String trimmed = str.replaceAll("\\s+", "");
URL Encoding
You could also use a URLEncoder, which sounds like a more appropriate way to go:
import java.net.URLEncoder;
String url = "http://localhost:8080/reporting/" + URLEncoder.encode("pvsUsageAction.do?form_action=inline_audit_view&days=7&projectStatus=scheduled&justificationId=5&justificationName= No Technicians in Area", "ISO-8859-1");
You have to assign the result of the replaceAll(String regex, String replacement) operation to a variable. See the Javadoc for the replaceAll(String regex, String replacement) method. It returns a brand new String object, because Strings in Java are immutable. Note that replace(), unlike replaceAll(), treats its first argument as literal text rather than as a regular expression. In your case, you can simply do the following:
String noSpacesString = str.replaceAll("\\s+", "");
You can use replaceAll("\\s",""). It will remove all white space.
If you are trying to remove the leading and trailing white space, then
s = s.trim();
Or if you want to remove all the spaces, then use:
s = s.replace(" ","");
There are two ways of doing this: one is based on regular expressions, the other is implementing the logic yourself
replaceAll("\\s","")
or
if (text.contains(" ") || text.contains("\t") || text.contains("\r")
|| text.contains("\n"))
{
//code goes here
}
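The %A0 in the decoded link suggests that the leading characters are NO-BREAK SPACE (U+00A0) rather than ordinary spaces, which would explain why trim() and replace(" ", "") leave them in place. The sketch below assumes that is the case; the class name, the helper method and the hand-built sample value are made up for illustration, and no servlet request is involved.

```java
class NbspParamDemo {
    // Strips leading and trailing whitespace, treating U+00A0 as whitespace too.
    static String trimWithNbsp(String s) {
        return s.replaceAll("^[\\s\\u00A0]+|[\\s\\u00A0]+$", "");
    }

    public static void main(String[] args) {
        // Stand-in for request.getParameter("justificationName"):
        // three hard spaces followed by the visible text.
        String raw = "\u00A0\u00A0\u00A0No Technicians in Area";

        // trim() only removes characters up to U+0020, so nothing is stripped here.
        System.out.println(raw.trim().length() == raw.length());

        // replace(" ", "") removes the ordinary spaces but keeps the hard spaces.
        System.out.println(raw.replace(" ", "").length());

        // Handling U+00A0 explicitly removes the leading hard spaces only.
        System.out.println("[" + trimWithNbsp(raw) + "]");
    }
}
```

If the goal were to drop every hard space rather than only those at the edges, raw.replace("\u00A0", "") would be the simpler call.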

Java String Parsing Without Regular Expressions

From a server, I get strings of the following form:
String x = "fixedWord1:var1 data[[fixedWord2:var2 fixedWord3:var3 data[[fixedWord4] [fixedWord5=var5 fixedWord6=var6 fixedWord7=var7]]] , [fixedWord2:var2 fixedWord3:var3 data[[fixedWord4][fixedWord5=var5 fixedWord6=var6 fixedWord7=var7]]]] fixedWord8:fixedWord8";
(only spaces divide groups of word-var pairs)
Later, I want to store them in a Hashmap, like myHashMap.put(fixedWord1, var1); and so on.
Problem:
Inside the first "data[......]"-tag, the number of other "data[..........]"-tags is variable, and I don't know the length of the string in advance.
I don't know how to process such Strings without resorting to String.split(), which is discouraged by our assignment task givers (university).
I have searched the internet and couldn't find appropriate websites explaining such things.
It would be of great help, if experienced people could give me some links to websites or something like a "diagrammatic plan" so that I could code something.
EDIT:
got mistake in String (off-topic-begin "please don't lynch" off-topic-end), the right string is (changed fixedWord7=var7 ---to---> fixedWord7=[var7]):
String x = "fixedWord1:var1 data[[fixedWord2:var2 fixedWord3:var3 data[[fixedWord4] [fixedWord5=var5 fixedWord6=var6 fixedWord7=[var7]]]] , [fixedWord2:var2 fixedWord3:var3 data[[fixedWord4][fixedWord5=var5 fixedWord6=var6 fixedWord7=[var7]]]]] fixedWord8:fixedWord8";
I assume your string follows the same pattern throughout, with "data", "[" and "]" in it, and that the variable names and values never include these strings.
Remove the strings "data[", "[", "]" and "," from the original string. Use replace rather than replaceAll here, because replaceAll would treat "[" as the start of a regex character class and throw a PatternSyntaxException:
replace("data[", "")
replace("[", "")
etc.
Separate the string at each space (" ") by using StringTokenizer or by looping through the String char by char.
Then you will get an array of strings like
fixedWord1:var1
fixedWord2:var2
......
fixedWord4
fixedWord5=var5
......
Then separate the substrings again at ":" or "=", and put each name/value pair into the Map.
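The steps above can be sketched roughly as follows, using only String.replace, StringTokenizer and indexOf. Several things here are assumptions of this example and not part of the answer: the class and method names, replacing brackets with a space instead of nothing (so that adjacent groups such as "[fixedWord4][fixedWord5=..." do not run together), rewriting "=[" to "=" first so that a bracketed value stays attached to its name, and storing an empty string for a word that has no value. A repeated name overwrites the earlier entry, as with any Map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

class BracketParserDemo {
    static Map<String, String> parse(String input) {
        // String.replace works on literal text, so the brackets need no escaping.
        String flat = input.replace("=[", "=")    // keep a bracketed value next to its name
                           .replace("data[", " ")
                           .replace("[", " ")
                           .replace("]", " ")
                           .replace(",", " ");

        Map<String, String> result = new LinkedHashMap<>();
        StringTokenizer tokens = new StringTokenizer(flat); // default delimiters are whitespace
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            int sep = token.indexOf(':');
            if (sep < 0) {
                sep = token.indexOf('=');
            }
            if (sep < 0) {
                result.put(token, ""); // lone word without a value; "" is an arbitrary choice
            } else {
                result.put(token.substring(0, sep), token.substring(sep + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, made-up input in the same shape as the server string.
        String x = "a:1 data[[b:2 data[[c] [d=3 e=[4]]]] , [b:5]] f:g";
        System.out.println(parse(x));
    }
}
```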
The problem is not entirely clear, but maybe something like this will work for you:
Pattern p = Pattern.compile("\\b(\\w+)[:=]\\[?(\\w+)");
Matcher m = p.matcher( x );
while( m.find() ) {
System.out.println( "matched: " + m.group(1) + " - " + m.group(2) );
hashMap.put ( m.group(1), m.group(2) );
}

How to remove newlines from beginning and end of a string?

I have a string that contains some text followed by a blank line. What's the best way to keep the part with text, but remove the whitespace newline from the end?
Use String.trim() method to get rid of whitespaces (spaces, new lines etc.) from the beginning and end of the string.
String trimmedString = myString.trim();
String.replaceAll("[\n\r]", "");
This Java code does exactly what is asked in the title of the question, that is "remove newlines from beginning and end of a string-java":
String.replaceAll("^[\n\r]", "").replaceAll("[\n\r]$", "")
Remove newlines only from the end of the line:
String.replaceAll("[\n\r]$", "")
Remove newlines only from the beginning of the line:
String.replaceAll("^[\n\r]", "")
tl;dr
String cleanString = dirtyString.strip() ; // Call new `String::strip` method.
String::strip…
The old String::trim method has a strange definition of whitespace.
As discussed here, Java 11 adds new strip… methods to the String class. These use a more Unicode-savvy definition of whitespace. See the rules of this definition in the class JavaDoc for Character::isWhitespace.
Example code.
String input = " some Thing ";
System.out.println("before->>"+input+"<<-");
input = input.strip();
System.out.println("after->>"+input+"<<-");
Or you can strip just the leading or just the trailing whitespace.
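A short sketch of the one-sided variants, assuming Java 11 or later. The class name and sample strings are invented for this example; U+2003 EM SPACE is used only to show a character that strip() treats as whitespace and trim() does not.

```java
class StripDemo {
    public static void main(String[] args) {
        String input = "\n  some Thing \n";

        // Only the leading whitespace is removed; the trailing part stays.
        System.out.println("[" + input.stripLeading() + "]");

        // Only the trailing whitespace is removed; the leading part stays.
        System.out.println("[" + input.stripTrailing() + "]");

        // EM SPACE (U+2003) is above U+0020, so trim() keeps it,
        // while strip() removes it because Character.isWhitespace accepts it.
        String em = "\u2003x\u2003";
        System.out.println(em.trim().length());  // 3
        System.out.println(em.strip().length()); // 1
    }
}
```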
You do not mention exactly what code point(s) make up your newlines. I imagine your newline is likely included in this list of code points targeted by strip:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
It is '\t', U+0009 HORIZONTAL TABULATION.
It is '\n', U+000A LINE FEED.
It is '\u000B', U+000B VERTICAL TABULATION.
It is '\f', U+000C FORM FEED.
It is '\r', U+000D CARRIAGE RETURN.
It is '\u001C', U+001C FILE SEPARATOR.
It is '\u001D', U+001D GROUP SEPARATOR.
It is '\u001E', U+001E RECORD SEPARATOR.
It is '\u001F', U+0
If your string is potentially null, consider using StringUtils.trim() - the null-safe version of String.trim().
If you only want to remove line breaks (not spaces, tabs) at the beginning and end of a String (not inbetween), then you can use this approach:
Use a regular expression to remove carriage returns (\\r) and line feeds (\\n) from the beginning (^) and end ($) of a string:
s = s.replaceAll("(^[\\r\\n]+|[\\r\\n]+$)", "")
Complete Example:
public class RemoveLineBreaks {
    public static void main(String[] args) {
        var s = "\nHello world\nHello everyone\n";
        System.out.println("before: >"+s+"<");
        s = s.replaceAll("(^[\\r\\n]+|[\\r\\n]+$)", "");
        System.out.println("after: >"+s+"<");
    }
}
It outputs:
before: >
Hello world
Hello everyone
<
after: >Hello world
Hello everyone<
I'm going to add an answer to this as well because, while I had the same question, the provided answer did not suffice. Given some thought, I realized that this can be done very easily with a regular expression.
To remove newlines from the beginning:
// Trim left
String[] a = "\n\nfrom the beginning\n\n".split("^\\n+", 2);
System.out.println("-" + (a.length > 1 ? a[1] : a[0]) + "-");
and end of a string:
// Trim right
String z = "\n\nfrom the end\n\n";
System.out.println("-" + z.split("\\n+$", 2)[0] + "-");
I'm certain that this is not the most performance efficient way of trimming a string. But it does appear to be the cleanest and simplest way to inline such an operation.
Note that the same method can be done to trim any variation and combination of characters from either end as it's a simple regex.
Try this
function replaceNewLine(str) {
return str.replace(/[\n\r]/g, "");
}
String trimStartEnd = "\n TestString1 linebreak1\nlinebreak2\nlinebreak3\n TestString2 \n";
System.out.println("Original String : [" + trimStartEnd + "]");
System.out.println("-----------------------------");
System.out.println("Result String : [" + trimStartEnd.replaceAll("^(\\r\\n|[\\n\\x0B\\x0C\\r\\u0085\\u2028\\u2029])|(\\r\\n|[\\n\\x0B\\x0C\\r\\u0085\\u2028\\u2029])$", "") + "]");
Start of a string = ^ ,
End of a string = $ ,
regex combination = | ,
Linebreak = \r\n|[\n\x0B\x0C\r\u0085\u2028\u2029]
Another elegant solution.
String myString = "\nLogbasex\n";
myString = org.apache.commons.lang3.StringUtils.strip(myString, "\n");
For anyone else looking for an answer to this question when dealing with different line breaks:
string.replaceAll("(\n|\r|\r\n)$", ""); // Java 7
string.replaceAll("\\R$", ""); // Java 8
This should remove exactly the last line break, preserve all other whitespace in the string, and work with Unix (\n), Windows (\r\n) and old Mac (\r) line breaks: https://stackoverflow.com/a/20056634, https://stackoverflow.com/a/49791415. "\\R" is a linebreak matcher introduced in Java 8 in the Pattern class: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
This passes these tests:
// Windows:
value = "\r\n test \r\n value \r\n";
assertEquals("\r\n test \r\n value ", value.replaceAll("\\R$", ""));
// Unix:
value = "\n test \n value \n";
assertEquals("\n test \n value ", value.replaceAll("\\R$", ""));
// Old Mac:
value = "\r test \r value \r";
assertEquals("\r test \r value ", value.replaceAll("\\R$", ""));
String text = readFileAsString("textfile.txt");
text = text.replace("\n", "").replace("\r", "");
