Sentence split with<sup></sup>

I have the following sentence:
String str = " And God said, <sup>c</sup>“Let there be light,” and there was light.";
How do I retrieve all of the words in the sentence, expecting the following?
And
God
said
Let
there
be
light
and
there
was
light

First, get rid of any leading or trailing space:
.trim()
Then get rid of HTML entities (&...;):
.replaceAll("&.*?;", "")
& and ; are literal chars in Regex, and .*? is the non-greedy version of "any character, any number of times".
Next get rid of tags and their contents:
.replaceAll("<(.*?)>.*?</\\1>", "")
< and > are taken literally again, .*? is explained above, (...) defines a capturing group, and \\1 references that group.
And finally, split on any sequence of non-letters:
.split("[^a-zA-Z]+")
[a-zA-Z] means all characters from a to z and A to Z, ^ inverts the match, and + means "once or more".
So everything together would be:
String[] words = str.trim().replaceAll("&.*?;", "").replaceAll("<(.*?)>.*?</\\1>", "").split("[^a-zA-Z]+");
Note that this doesn't handle self-closing tags like <img src="a.png" />.
Also note that if you need full HTML parsing, you should think about letting a real engine parse it, as parsing HTML with Regex is a bad idea.
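For reference, the steps above can be put together into a small runnable sketch. The class name SentenceWords and the helper words() are made up for illustration; the regexes are the ones from this answer, and the curly quotes are written as Unicode escapes only to avoid source-encoding problems.

```java
import java.util.Arrays;

final class SentenceWords {
    // Hypothetical helper: trims, drops HTML entities and paired tags
    // (including their contents), then splits on runs of non-letters.
    static String[] words(String sentence) {
        return sentence.trim()
                .replaceAll("&.*?;", "")
                .replaceAll("<(.*?)>.*?</\\1>", "")
                .split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        // \u201C and \u201D are the opening and closing curly quotes.
        String str = " And God said, <sup>c</sup>\u201CLet there be light,\u201D and there was light.";
        // prints [And, God, said, Let, there, be, light, and, there, was, light]
        System.out.println(Arrays.toString(words(str)));
    }
}
```

One thing to watch for: if the cleaned-up sentence starts with a non-letter (a quote, for instance), split() returns an empty string as the first element, so you may want to skip empty entries.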

You can use String.replaceAll(regex, replacement) with the regex [^A-Za-z]+ to keep only letters. On its own, though, that would also keep the letters of the sup tag and the c inside it, which is why the first statement removes the tag together with everything between its opening and closing parts.
String str = " And God said, <sup>c</sup>“Let there be light,” and there was light.".replaceAll("<sup>[^<]*</sup>", "");
String newstr = str.replaceAll("[^A-Za-z]+", " ");

Related

How this Regex eliminates html?

I saw a code example and don't understand how it prints only Print.
I'd appreciate your help with this.
String str = "<a href=/utility/ReportResult.jsp?reportId=5>Print</a>";
System.out.println(str.replaceAll("\\<.*?\\>", ""));
Output: Print
How can I modify my regex to print Print<>Report instead of PrintReport? Below are my regex and statement.
String str = "Print<>Report";
System.out.println(str.replaceAll("<.*?>", ""));

In order to print Print<>Report instead of PrintReport, change the * to +:
System.out.println(str.replaceAll("<.+?>", ""));
// here __^
* means 0 or more of the preceding character
+ means 1 or more of the preceding character
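A quick way to see the difference between the two quantifiers is to run both patterns on the strings from the question. This is only a sketch; the class and method names are invented for the example.

```java
final class StarVersusPlus {
    // "<.*?>" also matches an empty pair "<>".
    static String stripStar(String s) {
        return s.replaceAll("<.*?>", "");
    }

    // "<.+?>" needs at least one character between "<" and ">".
    static String stripPlus(String s) {
        return s.replaceAll("<.+?>", "");
    }

    public static void main(String[] args) {
        String empty = "Print<>Report";
        String link = "<a href=/utility/ReportResult.jsp?reportId=5>Print</a>";
        System.out.println(stripStar(empty)); // PrintReport
        System.out.println(stripPlus(empty)); // Print<>Report
        System.out.println(stripPlus(link));  // Print
    }
}
```

Be aware that .+? can still run across a ">" when it has to: on an input like "<><b>" it removes everything. If that matters, a stricter pattern such as <[^<>]+> may be the safer choice.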

You don't have to escape the < and > (angle brackets). So in Java, str.replaceAll("<.*?>", "") is sufficient.
How it works:
<.*?> --> Search for the first <, then match everything up to the next >. Note that .*? is called a lazy (reluctant) quantifier.
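To see why the lazy form matters, compare it with the greedy <.*> on a string containing more than one tag. A minimal sketch, with an invented class name and sample string:

```java
final class LazyVersusGreedy {
    // Lazy: stops at the first ">" after each "<".
    static String lazy(String s) {
        return s.replaceAll("<.*?>", "");
    }

    // Greedy: runs from the first "<" to the last ">" on the line.
    static String greedy(String s) {
        return s.replaceAll("<.*>", "");
    }

    public static void main(String[] args) {
        String html = "<b>Print</b> and <i>Report</i>";
        System.out.println(lazy(html));   // Print and Report
        System.out.println(greedy(html)); // prints an empty line
    }
}
```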

The regex says that anything between "<" and ">" must be replaced by "" (an empty string).
So
<a href=/utility/ReportResult.jsp?reportId=5> ==> "" (empty)
</a> ==> "" (empty)
and only "Print" is left.

First, the doubled backslashes are escape sequences in the Java string literal, so the actual regular expression is \<.*?\>
The \< matches the < character (the backslash is again an escape, indicating that the following character should be interpreted literally and not as a regex operator). This is the beginning of an HTML tag.
The . token matches any character.
The *? is a reluctant quantifier: the preceding token (any character in this case) is matched zero or more times, but as few times as possible.
The \> matches the end of a tag. Because the quantifier is reluctant, the . stops at the first character that can be matched by this token.

Regex required to update a character

I have a String: testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing
I want to replace the character s with some other character sequence, say <b>X</b>, but an s that is immediately followed by "<" should remain intact, i.e. the regex should not replace it.
I used this Java code:
String str = "testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing";
str = str.replace("s[^<]", "<b>X</b>");
The problem is that the regex matches two characters, the s and the following character if it is not "<", and String.replace would replace both of them. I want only the s to be replaced, not the following character.
Any help would be appreciated. Since I have lots of such replacements, I don't want to loop over the characters and update them one at a time.

There are other ways, but you could, for example, capture the second character and put it back (in Java the replacement string refers to a group as $1, not \\1):
str = str.replaceAll("s([^<])", "<b>X</b>$1");

Looks like you want a negative lookahead:
s(?!<)
String str = "testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing;";
System.out.println(str.replaceAll("s(?!<)", "<b>X</b>"));
output:
te<b>X</b>ting<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing;

Use lookarounds to assert, but not capture, surrounding text:
str = str.replaceAll("s(?!<)", "whatever");
Or capture and put back using the back reference $1:
str = str.replaceAll("s([^<])", "whatever$1");
Note that you need to use replaceAll() (which uses regex) rather than replace() (which uses plain text).
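The lookahead and the capture-and-put-back variants are not quite equivalent, which the following sketch tries to show. The class and method names are made up; the patterns are the ones discussed in these answers.

```java
final class ReplaceUnlessFollowedBy {
    // Negative lookahead: matches only the "s", and only when no "<" follows.
    static String withLookahead(String s) {
        return s.replaceAll("s(?!<)", "<b>X</b>");
    }

    // Capture: consumes the "s" plus one following character, then restores it via $1.
    static String withCapture(String s) {
        return s.replaceAll("s([^<])", "<b>X</b>$1");
    }

    public static void main(String[] args) {
        String str = "testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing";
        // Both print te<b>X</b>ting<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing
        System.out.println(withLookahead(str));
        System.out.println(withCapture(str));

        // They differ for a doubled or trailing "s":
        System.out.println(withLookahead("class")); // cla<b>X</b><b>X</b>
        System.out.println(withCapture("class"));   // cla<b>X</b>s
    }
}
```

The capture version needs a character after the s, so it skips an s at the end of the input and the second s of a pair. The lookahead consumes nothing beyond the s itself, which is probably what you want here.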

Cutting String java

I need to cut certain strings for an algorithm I am making. I am using substring() but it gets too complicated with it and actually doesn't work correctly. I found this topic how to cut string with two regular expression "_" and "."
and decided to try with split() but it always gives me
java.util.regex.PatternSyntaxException: Dangling meta character '+' near index 0
+
^
So this is the code I have:
String[] result = "234*(4-5)+56".split("+");
/*for(int i=0; i<result.length; i++)
{
System.out.println(result[i]);
}*/
Arrays.toString(result);
Any ideas why I get this irritating exception ?
P.S. If I fix this I will post you the algorithm for cutting and then the algorithm for the whole calculator (because I am building a calculator). It is gonna be a really badass calculator, I promise :P

+ has a special meaning in regex. To be treated as a normal character, it must be escaped with a backslash.
String[] result = "234*(4-5)+56".split("\\+");
Below are the metacharacters in regex. To treat any of them as normal characters, escape them with a backslash:
<([{\^-=$!|]})?*+.>
Refer here for how characters work in regex.

The plus + symbol has a meaning in regular expressions, which is how split parses its parameter. You'll need to regex-escape the plus character.
.split("\\+");

You should split your string like this:
String[] result = "234*(4-5)+56".split("[+]");
String.split takes a regex as its delimiter, and + is a meta-character in regex meaning "match one or more repetitions", so it is an error to use it bare as the whole pattern.
You can put it in a character class to match a literal +, because inside a character class most meta-characters lose their special meaning. The main exceptions are the hyphen (-), which denotes a range, a leading ^, which negates the class, and the backslash and square brackets.

+ is a regex quantifier (meaning one or more of) so needs to be escaped in the split method:
String[] result = "234*(4-5)+56".split("\\+");
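The options suggested in these answers can be compared side by side in a short sketch. Pattern.quote is an additional standard-library way to treat a string literally that the answers above do not mention; the class and method names are invented.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitOnPlus {
    // Backslash-escaped plus.
    static String[] escaped(String s) {
        return s.split("\\+");
    }

    // Plus inside a character class.
    static String[] charClass(String s) {
        return s.split("[+]");
    }

    // Pattern.quote wraps the text so it is matched literally.
    static String[] quoted(String s) {
        return s.split(Pattern.quote("+"));
    }

    public static void main(String[] args) {
        String expr = "234*(4-5)+56";
        // Each line prints [234*(4-5), 56]
        System.out.println(Arrays.toString(escaped(expr)));
        System.out.println(Arrays.toString(charClass(expr)));
        System.out.println(Arrays.toString(quoted(expr)));
    }
}
```

Note that Arrays.toString only builds the string; in the question's code the result was never passed to System.out.println, so nothing would have been printed even without the exception.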

Need Regex Expression Advice

<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
I know this regex is used to retrieve the value of src. Can anyone explain how I should interpret it? I'm stuck on it.

Explanation:
<img matches exactly the string "<img"
[^>]+ matches one or more characters that are not >, so the match stays inside the tag
src matches exactly the string "src"
\\s* matches any number of whitespace characters
= matches exactly the string "="
\\s* matches any number of whitespace characters
['\"] matches either a single or a double quote. The double quote is escaped because otherwise it would terminate the Java string that holds the regex
([^'\"]+) matches one or more characters that are not quotes. The contents are wrapped in parentheses so that they form a group that can be retrieved later
['\"] matches the closing quote, single or double, escaped for the same reason
[^>]* matches the remaining non-">" characters
> matches exactly the string ">", the closing bracket of the tag.
I would not agree that this expression is crap; it is just a bit complex.
EDIT: Here is some example code:
String str = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";
String text = "<img alt=\"booo\" src=\"image.jpg\"/>";
Pattern pattern = Pattern.compile (str);
Matcher matcher = pattern.matcher (text);
if (matcher.matches ())
{
int n = matcher.groupCount ();
for (int i = 0; i <= n; ++i)
System.out.println (matcher.group (i));
}
The output is:
<img alt="booo" src="image.jpg"/>
image.jpg
So matcher.group(1) returns what you want. Note that matches() succeeds only when the whole input is the tag. Experiment a bit with this code.
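If the tag is embedded in a larger piece of HTML, find() in a loop is usually the better fit than matches(). A minimal sketch, assuming the same regex; the class name, the helper sources() and the sample HTML are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcFinder {
    // Same expression as in the question, compiled once.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");

    // Collects group 1 (the src value) of every img tag found in the input.
    static List<String> sources(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"a\" src=\"one.png\"><img src='two.jpg' width=\"10\"/></p>";
        System.out.println(sources(html)); // [one.png, two.jpg]
    }
}
```

For anything beyond simple, well-formed snippets, a real HTML parser is likely to be more robust than a regex.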

Hi, check one of the tutorials available on the net, e.g. http://www.vogella.com/articles/JavaRegularExpressions/article.html. Sections 3.1 and 3.2 (common matching symbols) briefly explain each symbol and what it stands for, as well as the metacharacters. Break what you have here into smaller chunks to understand it more easily. For example, you have \\s in two places; it is the metacharacter for a whitespace character. Backslash is an escape character in Java, thus you have \\s instead of \s. After each of them you have a *. Section 3.3 explains the quantifiers; this particular one means "occurs 0 or more times". Thus \\s* means "search for a whitespace character that occurs 0 or more times". You do the same with the other chunks.
Hope it helps.

java replaceAll and '+' match

I have some code setup to remove extra spaces in between the words of a title
String formattedString = unformattedString.replaceAll(" +"," ");
My understanding of this type of regex is that it will match as many spaces as possible before stopping. However, my strings that are coming out are not changing in any way. Is it possible that it's only matching one space at a time, and then replacing that with a space? Is there something to the replaceAll method, since it's doing multiple matches, that would alter the way this type of match would work here?

A better approach might be to use "\\s+" to match runs of all possible whitespace characters.
EDIT
Another approach might be to extract all matches for "\\b([A-Za-z0-9]+)\\b" and then join them using a space which would allow you to remove everything except for valid words and numbers.
If you need to preserve punctuation, use "(\\S+)" which will capture all runs of non-whitespace characters.
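The extract-and-join idea could look roughly like this. It is only a sketch: the class name, the helper normalize() and the sample title are invented, and the pattern is the one suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenJoiner {
    private static final Pattern WORD = Pattern.compile("\\b([A-Za-z0-9]+)\\b");

    // Keeps only runs of ASCII letters and digits and joins them with single spaces.
    static String normalize(String title) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(title);
        while (matcher.find()) {
            tokens.add(matcher.group(1));
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  The   quick -- brown\tfox 42! ")); // The quick brown fox 42
    }
}
```

Keep in mind that this drops punctuation entirely, so a word like "don't" ends up as two tokens.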

Are you sure your string contains spaces and not tabs? The following is a bit more "aggressive" on whitespace.
String formattedString = unformattedString.replaceAll("\\s+"," ");

All of the responses should work.
Either:
String formattedString = unformattedString.replaceAll(" +"," ");
or
String formattedString = unformattedString.replaceAll("\\s+"," ");
Maybe your unformattedString is a multiline expression. In that case you can instantiate a Pattern object:
String unformattedString = " Hello \n\r\n\r\n\r World";
Pattern manySpacesPattern = Pattern.compile("\\s+",Pattern.MULTILINE);
Matcher formatMatcher = manySpacesPattern.matcher(unformattedString);
String formattedString = formatMatcher.replaceAll(" ");
System.out.println(formattedString);
Or maybe unformattedString contains special characters; in that case you can experiment with the Pattern flags in the compile method. Keep in mind that MULTILINE only changes how ^ and $ behave, so \\s+ already matches line breaks without it.
Examples:
Pattern.compile("\\s+",Pattern.MULTILINE|Pattern.UNIX_LINES);
or
Pattern.compile("\\s+",Pattern.MULTILINE|Pattern.UNICODE_CASE);
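To check which case you are in, it may help to run both patterns on a string that mixes spaces and a tab. The sketch below uses invented names and a made-up sample title. It also shows that " +" does collapse runs of plain spaces, so if nothing changes at all, the usual suspects are non-space whitespace or a result that is never used (strings are immutable, so replaceAll returns a new string).

```java
final class CollapseWhitespace {
    // Collapses runs of the plain space character only.
    static String spacesOnly(String s) {
        return s.replaceAll(" +", " ");
    }

    // Collapses runs of any whitespace (spaces, tabs, line breaks).
    static String anyWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String title = "Hello \t  World";
        // The tab survives here: the result is "Hello", space, tab, space, "World".
        System.out.println(spacesOnly(title));
        System.out.println(anyWhitespace(title)); // Hello World
        System.out.println(spacesOnly("a    b")); // a b
    }
}
```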
