How does this regex eliminate HTML? - java

I saw a code example and don't understand how it prints only the word Print.
I'd appreciate your help with this.
String str = "<a href=/utility/ReportResult.jsp?reportId=5>Print</a>";
System.out.println(str.replaceAll("\\<.*?\\>", ""));
Output: Print
How do I modify my regex so that it prints Print<>Report instead of PrintReport? Below is my regex and statement.
String str = "Print<>Report";
System.out.println(str.replaceAll("<.*?>", ""));

To print Print<>Report instead of PrintReport, replace the * with +:
System.out.println(str.replaceAll("<.+?>", ""));
// here __^
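* means zero or more of the preceding token
+ means one or more of the preceding token
With +, at least one character is required between < and >, so the empty <> is no longer matched and stays in the output.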
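A small sketch of the difference, using the two strings from the question. The class name StarVersusPlus and the helper strip are made up for illustration.

```java
class StarVersusPlus {
    // Removes every match of the given tag regex from the input.
    static String strip(String input, String tagRegex) {
        return input.replaceAll(tagRegex, "");
    }

    public static void main(String[] args) {
        String empty = "Print<>Report";
        String link = "<a href=/utility/ReportResult.jsp?reportId=5>Print</a>";

        System.out.println(strip(empty, "<.*?>")); // PrintReport
        System.out.println(strip(empty, "<.+?>")); // Print<>Report
        System.out.println(strip(link, "<.+?>"));  // Print
    }
}
```

Real tags always have at least one character between the angle brackets, so they are still removed with +.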

You don't have to escape the < and > (angle brackets). So in Java str.replaceAll("<.*?>", "") is sufficient.
How it works:
<.*?> --> Search for the first <, then match everything up to the next >. Note that .*? is called a lazy (reluctant) quantifier.

This regex says that anything between "<" and ">" must be replaced by "" (an empty string).
So
<a href=/utility/ReportResult.jsp?reportId=5> ==> "" (empty)
</a> ==> "" (empty)
and only "Print" is left.

First, each doubled backslash is an escape sequence in the Java string literal, so the actual regular expression is \<.*?\>
The \< matches the < character (the backslash is again an escape, which indicates that the following character should be interpreted literally and not as a regex operator). This is the beginning of an HTML tag.
The . token matches any character (except line terminators by default).
The *? is a reluctant quantifier that indicates that the preceding token (any character in this case) should be matched zero or more times, but as few times as possible.
The \> matches the end of a tag. Because the *? is reluctant, the . stops at the first character that can be matched by this token instead of running on to the last > in the string.
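To see what the reluctant quantifier buys you, here is a minimal sketch that runs the escaped, unescaped and greedy variants on the string from the question. The class and method names are only for illustration.

```java
class GreedyVersusReluctant {
    // Replaces every match of the regex with the empty string.
    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String str = "<a href=/utility/ReportResult.jsp?reportId=5>Print</a>";

        // Escaped and unescaped angle brackets behave the same.
        System.out.println("[" + strip(str, "\\<.*?\\>") + "]"); // [Print]
        System.out.println("[" + strip(str, "<.*?>") + "]");     // [Print]

        // Greedy .* runs from the first < to the last >, so nothing is left.
        System.out.println("[" + strip(str, "<.*>") + "]");      // []
    }
}
```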

Related

Sentence split with <sup></sup>

I have the following sentence:
String str = " And God said, <sup>c</sup>“Let there be light,” and there was light.";
How do I retrieve all of the words in the sentence, expecting the following?
And
God
said
Let
there
be
light
and
there
was
light
First, get rid of any leading or trailing space:
.trim()
Then get rid of HTML entities (&...;):
.replaceAll("&.*?;", "")
& and ; are literal chars in Regex, and .*? is the non-greedy version of "any character, any number of times".
Next get rid of tags and their contents:
.replaceAll("<(.*?)>.*?</\\1>", "")
< and > are taken literally again, .*? is explained above, (...) defines a capturing group, and \\1 is a backreference to whatever that group matched (here the tag name).
And finally, split on any sequence of non-letters:
.split("[^a-zA-Z]+")
[a-zA-Z] means all characters from a to z and A to Z, ^ inverts the match, and + means "once or more".
So everything together would be:
String[] words = str.trim().replaceAll("&.*?;", "").replaceAll("<(.*?)>.*?</\\1>", "").split("[^a-zA-Z]+");
Note that this doesn't handle self-closing tags like <img src="a.png" />.
Also note that if you need full HTML parsing, you should think about letting a real engine parse it, as parsing HTML with Regex is a bad idea.
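Put together as a runnable sketch (the class name SentenceWords is made up, and the curly quotes are written as Unicode escapes so the file encoding does not matter):

```java
class SentenceWords {
    // Trim, drop HTML entities, drop tags with their contents, split on non-letters.
    static String[] words(String sentence) {
        return sentence.trim()
                .replaceAll("&.*?;", "")
                .replaceAll("<(.*?)>.*?</\\1>", "")
                .split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        String str = " And God said, <sup>c</sup>\u201CLet there be light,\u201D and there was light.";
        System.out.println(String.join("\n", words(str)));
    }
}
```

For the sentence in the question this prints the eleven words, one per line. The trim() matters: without it the leading space would make split return an empty first element.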
You can use String.replaceAll(regex, replacement) with the regex [^A-Za-z]+ to keep only letters. On its own that would also keep the letters of the sup tag and the c inside it, which is why the first statement removes the tag and everything between it before the second one runs.
String str = " And God said, <sup>c</sup>“Let there be light,” and there was light.".replaceAll("<sup>[^<]</sup>", "");
String newstr = str.replaceAll("[^A-Za-z]+", " ");
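As a sketch, those two statements can be followed by trim() and split(" ") to get the individual words. Two things here are assumptions on my part and not from the answer: the class name LettersOnly, and the * after [^<], added so that markers longer than one character are removed too.

```java
class LettersOnly {
    static String[] words(String sentence) {
        // Assumption: [^<]* rather than the single-character [^<] of the answer.
        String withoutSup = sentence.replaceAll("<sup>[^<]*</sup>", "");
        // Every run of non-letters collapses into a single space.
        String lettersAndSpaces = withoutSup.replaceAll("[^A-Za-z]+", " ");
        return lettersAndSpaces.trim().split(" ");
    }

    public static void main(String[] args) {
        String str = " And God said, <sup>c</sup>\u201CLet there be light,\u201D and there was light.";
        System.out.println(String.join("|", words(str)));
    }
}
```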

How to match ^(d+) in a particular text using regex

For example I have text like below :
case1:
(1) Hello, how are you?
case2:
Hi. (1) How're you doing?
Now I want to match the text which starts with (\d+).
I have tried the following regex but nothing is working.
^[\(\d+\)], ^\(\d+\).
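[] defines a character class: it matches any single one of the characters listed inside the brackets, and may be followed by a quantifier. So ^[\(\d+\)] matches one character that is a parenthesis, a digit or a +, not the sequence (1).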
The second regexp will work: ^\(\d+\), so check your code.
Also check that there's no space in front of the first parenthesis, or add \s* in front.
EDIT: Also, Java can be tricky with escapes, depending on whether the regexp you type is used directly as a regexp or is first a string literal. In a string literal you need to double the backslashes.
In Java you have to escape the parentheses, so "\\(\\d+\\)" should match (1) in both case1 and case2. Adding ^ as you did, "^\\(\\d+\\)", will match only case1.
You have to use double backslashes within a Java string. Consider this:
"\n" gives you [line break]
"\\n" gives you [backslash][n]
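A minimal sketch of the two patterns discussed above, with and without the ^ anchor, run against both cases from the question. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class LeadingNumber {
    // In the string literal every regex backslash is doubled.
    private static final Pattern AT_START = Pattern.compile("^\\(\\d+\\)");
    private static final Pattern ANYWHERE = Pattern.compile("\\(\\d+\\)");

    static boolean startsWithNumber(String text) {
        return AT_START.matcher(text).find();
    }

    static boolean containsNumber(String text) {
        return ANYWHERE.matcher(text).find();
    }

    public static void main(String[] args) {
        String case1 = "(1) Hello, how are you?";
        String case2 = "Hi. (1) How're you doing?";
        System.out.println(startsWithNumber(case1)); // true
        System.out.println(startsWithNumber(case2)); // false
        System.out.println(containsNumber(case2));   // true
    }
}
```

find() is used rather than matches(), because matches() would require the whole line to consist of the number in parentheses.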
Java's regex engine supports positive lookbehind as long as the lookbehind has a bounded maximum length (which is why {1,9999} is used below instead of +), so you can use the following regex:
(?<=[(][0-9]{1,9999}[)]\s?)\b.*$
Which matches:
The literal text (
Any digit [0-9], between 1 and 9999 times {1,9999}
The literal text )
A space, between 0 and 1 times \s?
A word boundary \b
Any character, between 0 and unlimited times .*
The end of a string $
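Here is how that lookbehind regex could be used from Java, as a sketch. The class name AfterNumber and the choice to return null when nothing matches are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterNumber {
    // Same regex as above, with backslashes doubled for the Java string literal.
    private static final Pattern REST =
            Pattern.compile("(?<=[(][0-9]{1,9999}[)]\\s?)\\b.*$");

    // Returns the text that follows the first "(digits)" marker, or null if there is none.
    static String textAfterNumber(String input) {
        Matcher m = REST.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(textAfterNumber("(1) Hello, how are you?"));
        System.out.println(textAfterNumber("Hi. (1) How're you doing?"));
    }
}
```

Note that the lookbehind only checks what precedes the match, so the (1) itself is not part of the result. It also matches in case2, since there is no ^ anchor.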

Need Regex Expression Advice

<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
I know this regex is used to retrieve the value of src. Can anyone teach me how I should interpret this expression? I'm stuck on it.
Explaining:
<img matches exactly the string "<img"
[^>]+ matches one or more characters other than >, so the match cannot run past the end of the tag
src matches exactly the string "src"
\\s* matches any number of whitespace characters
= matches exactly the string "="
\\s* matches any number of whitespace characters
['\"] matches either a single or a double quote. The double quote is escaped because otherwise it would terminate the Java string that holds the regex
([^'\"]+) matches one or more characters that are not quotes. The contents are wrapped in parentheses so that they form a capturing group and can be retrieved later
['\"] matches either a single or a double quote again, with the double quote escaped for the same reason
[^>]* matches the remaining non ">" characters
> matches exactly the string ">", the closing bracket of the tag.
I would not say this expression is bad, just a bit complex.
EDIT: Here is some example code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String str = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";
String text = "<img alt=\"booo\" src=\"image.jpg\"/>";
Pattern pattern = Pattern.compile(str);
Matcher matcher = pattern.matcher(text);
// matches() succeeds only if the whole text is a single img tag
if (matcher.matches()) {
    // group 0 is the whole match, group 1 is the src value
    int n = matcher.groupCount();
    for (int i = 0; i <= n; ++i)
        System.out.println(matcher.group(i));
}
The output is:
<img alt="booo" src="image.jpg"/>
image.jpg
So matcher.group(1) returns what you want. Experiment a bit with this code.
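If the img tags sit inside a larger piece of HTML, matches() will not succeed because it needs the whole input to match. A sketch using find() in a loop instead (the class name ImgSources and the sample HTML are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSources {
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");

    // Collects group 1 (the src value) of every img tag found in the input.
    static List<String> sources(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><img alt=\"a\" src=\"one.png\"/> text <img src='two.jpg'></p>";
        System.out.println(sources(html)); // [one.png, two.jpg]
    }
}
```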
Hi, check one of the tutorials available on the net, e.g. http://www.vogella.com/articles/JavaRegularExpressions/article.html. Sections 3.1 and 3.2 (common matching symbols) briefly explain each symbol and what it matches, as well as the metacharacters. Break what you have here into smaller chunks to make it easier to understand. For example, you have \\s in two places; \s is a metacharacter for a whitespace character. Backslash is an escape character in Java string literals, thus you have \\s instead of \s. After each of them you have a *. Section 3.3 explains the quantifiers; this particular one means the preceding element occurs 0 or more times. Thus \\s* means "search for a whitespace character that occurs 0 or more times". You do the same with the other chunks.
Hope it helps.

regex help in java

I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz then = sign then some chars then , then some chars then ' then some chars and finally ]. This means that if there are characters after xyz, they must be in the format ="XXX","XXX"'"XXX".
Or only #[xyz], with no characters after xyz.
I have tried the following regex, but it did not work:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quotations (in part after xyz) are optional and number of characters between quotes are also not fixed and there could also be some characters before and after this pattern like asdadad #[xyz] adadad.
You can use the regex:
#\[xyz(?:="[a-zA-Z0-9]+","[a-zA-Z0-9]+"'"[a-zA-Z0-9]+")?\]
Expressed as a Java string it'll be:
String regex = "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
What was wrong with your regex?
[...] defines a character class. When you want to match a literal [ or ] you need to escape it by preceding it with a \.
[a-zA-z][0-9] matches a single letter followed by a single digit. But you want one or more alphanumeric characters, so you need [a-zA-Z0-9]+. Note also that the range A-z is wider than A-Z: it additionally covers the characters [, \, ], ^, _ and `.
Use this:
String regex = "#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
When you write [a-zA-z][0-9] it expects one letter and then one digit after it. You also have to escape the first and last square brackets, because square brackets have a special meaning in regexes.
Explanation:
[a-zA-Z0-9]+ means an alphanumeric character (but not an underscore) one or more times.
(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? means that the expression in parentheses can occur once or not at all.
Square brackets have a special meaning in regex (you used them yourself): they define character classes. You need to escape them if you want to match them literally.
String regex = "#\\[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"\\]";
The next problem is with [a-zA-z][0-9]: it defines "first a letter, second a digit". You need to join those classes and add a quantifier (and write A-Z rather than A-z, which also covers some punctuation):
String regex = "#\\[xyz=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\"\\]";
there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
Regex should be:
String regex = "(.)*#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\](.)*";
The first and last (.)* allow any text before and after the pattern, as you mentioned in your edit. As @ademiban said, (=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? matches once or not at all. The other mistakes are well explained in the other answers, +1 to all of them.
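A sketch that checks the six sample strings from the question. It uses A-Z in the character classes (A-z would also accept characters such as _ and ^) and a non-capturing optional group; the class name XyzTag is made up. Using find() is an alternative to wrapping the pattern in (.)* when there is surrounding text.

```java
import java.util.regex.Pattern;

class XyzTag {
    private static final Pattern TAG = Pattern.compile(
            "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // True if the whole string is exactly one tag.
    static boolean isTag(String s) {
        return TAG.matcher(s).matches();
    }

    // True if a tag occurs anywhere in the string.
    static boolean containsTag(String s) {
        return TAG.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isTag("#[xyz=\"1\",\"2\"'\"4\"]"));      // true
        System.out.println(isTag("#[xyz]"));                        // true
        System.out.println(isTag("#[xyz=\"a5\",\"4r\"'\"8dsa\"]")); // true
        System.out.println(isTag("#[xyz=\"asd\"]"));                // false
        System.out.println(isTag("#[xyz\"asd\"]"));                 // false
        System.out.println(isTag("#[xyz=\"8s\"'\"4\"]"));           // false
        System.out.println(containsTag("asdadad #[xyz] adadad"));   // true
    }
}
```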

Regular expression to match strings enclosed in square brackets or double quotes

I need 2 simple reg exps that will:
Match if a string is contained within square brackets ([] e.g [word])
Match if string is contained within double quotes ("" e.g "word")
\[\w+\]
"\w+"
Explanation:
The \[ and \] escape the special bracket characters to match their literals.
The \w means "any word character", usually considered same as alphanumeric or underscore.
The + means one or more of the preceding item.
The " are literal characters.
NOTE: If you want to ensure the whole string matches (not just part of it), prefix with ^ and suffix with $.
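In Java, a quick sketch of both checks could look like this (the class and method names are hypothetical). String.matches already requires the whole string to match, so the ^ and $ are not needed there.

```java
class Delimited {
    // Whole string is [word]: backslashes are doubled in the Java literal.
    static boolean isBracketed(String s) {
        return s.matches("\\[\\w+\\]");
    }

    // Whole string is "word".
    static boolean isQuoted(String s) {
        return s.matches("\"\\w+\"");
    }

    public static void main(String[] args) {
        System.out.println(isBracketed("[word]"));      // true
        System.out.println(isBracketed("[two words]")); // false, a space is not a word character
        System.out.println(isQuoted("\"word\""));       // true
        System.out.println(isQuoted("word"));           // false
    }
}
```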
And next time, you should be able to answer this yourself, by reading regular-expressions.info
Update:
Ok, so based on your comment, what you appear to be wanting to know is if the first character is [ and the last ] or if the first and last are both " ?
If so, these will match those:
^\[.*\]$ (or ^\\[.*\\]$ in a Java String)
^".*"$ (or ^\".*\"$ in a Java String)
However, unless you need to do some special checking of the centre characters, you can simply do:
if ( MyString.startsWith("[") && MyString.endsWith("]") )
and
if ( MyString.startsWith("\"") && MyString.endsWith("\"") )
which I suspect would be faster than a regex.
Important issues that may make this hard/impossible in a regex:
Can [] be nested (e.g. [foo [bar]])? If so, then a traditional regex cannot help you. Perl's extended regexes can, but it is probably better to write a parser.
Can [, ], or " appear escaped (e.g. "foo said \"bar\"") in the string? If so, see How can I match double-quoted strings with escaped double-quote characters?
Is it possible for there to be more than one instance of these in the string you are matching? If so, you probably want to use the non-greedy quantifier modifier (i.e. ?) to get the smallest string that matches: /(".*?"|\[.*?\])/g
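For the last point, a Java sketch of the non-greedy alternation, collecting every quoted or bracketed piece from one string. The class name and the sample text are invented for the example, and it does not handle nesting or escaped delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedTokens {
    // Reluctant .*? stops at the nearest closing delimiter.
    private static final Pattern TOKEN = Pattern.compile("\".*?\"|\\[.*?\\]");

    static List<String> tokens(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(tokens("say \"hi\" and [bye] then \"x y\""));
        // ["hi", [bye], "x y"]
    }
}
```

With a greedy .* the first alternative would instead run from the first double quote to the last one and return a single match.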
Based on comments, you seem to want to match things like "this is a "long" word"
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'The non-string "this is a crazy "string"" is bad (has own delimiter)';
print $s =~ /^.*?(".*").*?$/, "\n";
Are they two separate expressions?
\[[A-Za-z]+\]
\"[A-Za-z]+\"
If they are in a single expression:
[\[\"]+[a-zA-Z]+[\]\"]+
Remember that in .NET, inside a verbatim @"..." string, you'll need to escape each double quote " by writing ""
