Regex not finding all matching strings - java

I am using this regex ("http:|"https:)\/\/.*\/content\/amc\/tdd\/.*?" to find all the URLs which start with http or https and contain /content/amc/tdd
But for the text
"<a id='cdq_element_175_link' href='http://google.com' data-href='edit' >
<img src=\"http://localhost:8080/content/amc/tdd/abc/download_1.jpeg?
ch_ck=1548843340209\" alt=\"\" id=\"element_175\" style=\"height: 135.575px; width: 135.575px;\" data-href=\"edit\">
<img src=\"http://localhost:8080/content/amc/tdd/abc/download_1.jpeg?ch_ck=1548843340209\" alt=\"\" id=\"element_175\" style=\"height: 135.575px; width: 135.575px;\" data-href=\"edit\">
</a>"
I am not getting the two strings that match the pattern; instead I get one match running from the first instance to the last.
What am I doing wrong?

The .* inside your regex is greedy, so it consumes as much of the string as it can.
You should change it to the lazy form .*?
Like this:
("http:|"https:)\/\/.*?\/content\/amc\/tdd\/.*?"

Try this Regex:
"https?:\/\/(?:[^\/]*\/)*?content\/amc\/tdd[^"]*"
Explanation:
"https?:\/\/ - matches "http:// or "https://
(?:[^\/]*\/)*? - matches 0+ occurrences of any character which is not a /, followed by a /. This subpattern is repeated 0 or more times, as few times as possible.
content\/amc\/tdd - matches content/amc/tdd
[^"]*" - matches 0+ occurrences of any character that is not a " followed by "
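A minimal Java sketch of collecting every match with Matcher.find(). It uses a slightly stricter variant of the patterns above (an assumption, not either answerer's exact regex): [^"]* cannot run past a closing quote, so each match stays inside one attribute value. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlFinder {
    // Group 1 is the URL without the surrounding double quotes.
    // [^"]* never crosses a quote, so a match cannot span two attributes.
    private static final Pattern URL =
            Pattern.compile("\"(https?://[^\"]*/content/amc/tdd/[^\"]*)\"");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {          // find() advances to the next match each call
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "<a href='http://google.com'>"
                + "<img src=\"http://localhost:8080/content/amc/tdd/abc/download_1.jpeg?ch_ck=1\">"
                + "<img src=\"https://localhost:8080/content/amc/tdd/abc/download_2.jpeg\">"
                + "</a>";
        for (String url : findUrls(text)) {
            System.out.println(url);
        }
    }
}
```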

Related

Java regex to put a string that doesn't start with # inside <p> </p> tag

Input: #This is a header\nSome line\nAnother line
Desired output:
#This is a header
<p>Some Line</p>
<p>Another line</p>
I've tried this:
s = s.replaceAll("(^|\\n)*(?!#)([^\\n]+)(\\n|$)", "$1<p>$2</p>$3");
But it won't work correctly.
So i'd appreciate your help.
a.replaceAll("(?m)^(#[^\r\n]+)\r?\n([^\r\n]+)\r?\n([^\r\n]+)$", "<p>$1</p><p>$2</p><p>$3</p>")
Regex: (?m)^(#[^\r\n]+)\r?\n([^\r\n]+)\r?\n([^\r\n]+)$
(?m) turns on the multiline flag, so ^ and $ match at the start and end of every line instead of only at the start and end of the input.
^ start of a line
(#[^\r\n]+) Group 1: a literal # followed by 1 or more characters that are neither newline nor carriage return
\r?\n an optional carriage return followed by a newline
([^\r\n]+) Group 2: 1 or more characters that are neither newline nor carriage return
\r?\n an optional carriage return followed by a newline
([^\r\n]+) Group 3: 1 or more characters that are neither newline nor carriage return
$ end of a line
Replacer: <p>$1</p><p>$2</p><p>$3</p>
$n inserts the text captured by group n. So $1 will become "#This is a header".
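Note that the replacement above also wraps the header line in <p> and only handles exactly three lines. A hedged alternative sketch, assuming the goal is the output shown in the question: match each non-empty line individually and use a negative lookahead to skip lines that start with #. The class and method names are illustrative.

```java
public class ParagraphWrapper {
    // (?m)   : ^ and $ match at every line boundary
    // (?!#)  : the line must not start with '#'
    // (.+)   : the rest of the line (the dot never matches a line terminator)
    static String wrap(String s) {
        return s.replaceAll("(?m)^(?!#)(.+)$", "<p>$1</p>");
    }

    public static void main(String[] args) {
        System.out.println(wrap("#This is a header\nSome line\nAnother line"));
    }
}
```

Empty lines are left alone because .+ needs at least one character.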

How to match the information with the regex expression inside the html tag if the tag is repeating?

Like if I have the tags
<td class="cit-borderleft cit-data">437</td>
<td class="cit-borderleft cit-data">394</td>
<td class="cit-borderleft cit-data">12</td>
<td class="cit-borderleft cit-data">**12**</td>
But I need to match number 12 in the last tag. I am using the regex expression "<td class=\"cit-borderleft cit-data\">(.*?)</td>" but it will match all four of the tags.
Don't use regex. Use a proper XML/HTML parser like jsoup. If you simply want the text of the last td element with the classes cit-borderleft and cit-data, you can use
String html =
"<table>" +
"<td class=\"cit-borderleft cit-data\">437</td>\r\n" +
"<td class=\"cit-borderleft cit-data\">394</td>\r\n" +
"<td class=\"cit-borderleft cit-data\">12</td>\r\n" +
"<td class=\"cit-borderleft cit-data\">**12**</td>" +
"</table>";
Document doc = Jsoup.parse(html);
Element last = doc.select("td.cit-borderleft.cit-data").last();
System.out.println(last.text());
Output: **12**
If you then want to remove the asterisks, call replace("*","") on that string; it returns a new string without them.
Try this:
<td class=\"cit-borderleft cit-data\">\*\*(.*?)\*\*<\/td>
or simple way, this:
\*\*(\d+)\*\*
Based on your attempt
<td class=\"cit-borderleft cit-data\">(.*?)<\/td>(?![\s\S]*<\/td>)
I added this part: (?![\s\S]*<\/td>)
(?! # Negative Look-Ahead
[\s\S] # Character in [\s\S] Character Class
* # (zero or more)(greedy)
< # "<"
\/ # "/"
td> # "td>"
) # End of Negative Look-Ahead
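A small Java sketch of the lookahead idea, under one stated change: the capture uses [^<]* instead of .*?. With .*?, if all the cells sit on a single line, backtracking can stretch the lazy group across several cells until the lookahead succeeds, so group 1 would contain markup. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCellDemo {
    // (?![\s\S]*</td>) succeeds only when no further </td> follows,
    // so only the last cell can match. [^<]* keeps the capture inside one cell.
    private static final Pattern LAST_CELL = Pattern.compile(
            "<td class=\"cit-borderleft cit-data\">([^<]*)</td>(?![\\s\\S]*</td>)");

    static String lastCell(String html) {
        Matcher m = LAST_CELL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td class=\"cit-borderleft cit-data\">437</td>\n"
                + "<td class=\"cit-borderleft cit-data\">394</td>\n"
                + "<td class=\"cit-borderleft cit-data\">12</td>";
        System.out.println(lastCell(html));
    }
}
```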
I don't see why you're using regex to parse an HTML tag, but here it is:
(?<=<td class=\"cit-borderleft cit-data\">\*\*)\d*(?=\*\*<\/td>)

java HTML regexp issue

I am trying to transform the following string:
<img src="image.jpg" ... />
with this one
<img src="cid:image" ... />
the "image" string needs to be maintained but the string itself could be different. In the html document there are different img tags each one with a different image file.
so for instance if I have:
<img src="mylogo.jpg" ... />
it should transform to:
<img src="cid:mylogo" ... />
The images could be jpg or gif.
Thanks for any help,
Note:
Regex is not the right tool for parsing HTML, as mentioned in the comments, and Java has many HTML parsing tools (take a look at jsoup, for example). Still, here is a solution that fits your need to use regex.
Solution:
You can use the following Regex:
src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"
This is the code you need:
String html = "<img src=\"folder1/mylogo.jpg\" ... />";
Pattern pattern = Pattern.compile("src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"");
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
    System.out.println("group 1: " + matcher.group(1));
    //This line will give you the wanted output.
    System.out.println("src=\"cid:"+matcher.group(1)+"\"");
    System.out.println("Final Result: "+html.replaceAll("src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"", "src=\"cid:$1\""));
}
Explanation:
src= matches the characters src= literally.
\" matches the character " literally.
([\\:\\w\\s\\/]+) is a capturing group that matches the wanted text: one or more colons, word characters, whitespace characters or slashes.
\\. matches the character . literally.
\\w{3} matches exactly three word characters [a-zA-Z0-9_] for the extension. You can use (?:jpg|gif) instead if you do not want to accept any other image extensions, or \\w{3,4} to also allow four-letter extensions.
\" matches the character " literally.
EDIT:
To get the desired output, call replaceAll() on your HTML with this regex, as follows:
html.replaceAll("src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"", "src=\"cid:$1\"");
We use $1 to point to the first capturing group.
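A compact sketch of the jpg|gif variant mentioned in the explanation, assuming src values are double-quoted and end in .jpg or .gif. The class and method names are illustrative, and [^"]+ is a simplification of the character class used above.

```java
public class CidRewriter {
    // Group 1 captures everything in the src value before the extension;
    // (?:jpg|gif) restricts the rewrite to the two image types in the question.
    static String toCid(String html) {
        return html.replaceAll("src=\"([^\"]+)\\.(?:jpg|gif)\"", "src=\"cid:$1\"");
    }

    public static void main(String[] args) {
        System.out.println(toCid("<img src=\"mylogo.jpg\" ... />"));
        System.out.println(toCid("<img src=\"banner.gif\" ... />"));
    }
}
```

A src ending in any other extension is left unchanged.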

Ignoring the line break in regex?

I have below content in text file
some texting content <img src="cid:part123" alt=""> <b> Test</b>
I read it from file and store it in String i.e inputString
expectedString = inputString.replaceAll("\\<img.*?cid:part123.*?>",
"NewContent");
I get expected output i.e
some texting content NewContent <b> Test</b>
Basically if there is end of line character in between img and src like below, it does not work for example below
<img
src="cid:part123" alt="">
Is there a way regex ignore end of line character in between while matching?
If you want your dot (.) to match newlines as well, you can use the Pattern.DOTALL flag. Alternatively, with String.replaceAll(), you can add (?s) at the start of the pattern, which is equivalent to this flag.
From the Pattern.DOTALL JavaDoc:
Dotall mode can also be enabled via the embedded flag expression (?s).
(The s is a mnemonic for "single-line" mode, which is what this is
called in Perl.)
So, you can modify your pattern like this:
expectedStr = inputString.replaceAll("(?s)<img.*?cid:part123.*?>", "Content");
NOTE: You don't need to escape the angle bracket (<).
By default, the . character will not match newline characters. You can enable this behavior by specifying the Pattern.DOTALL flag. In String.replaceAll(), you do this by attaching a (?s) to the front of your pattern:
expectedString = inputString.replaceAll("(?s)\\<img.*?cid:part123.*?>",
"NewContent");
See also Pattern.DOTALL with String.replaceAll
You need to use Pattern.DOTALL mode.
replaceAll() doesn't take mode flags as a separate argument, but you can enable them in the expression as follows:
expectedString = inputString.replaceAll("(?s)\\<img.*?cid:part123.*?>", ...);
Note, however, that it's not a good idea to parse HTML with regular expressions. It would be better to use an HTML parser instead.
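A short runnable sketch of the (?s) fix applied to the input from the question, with the line break between img and src. The class and method names are illustrative.

```java
public class DotallDemo {
    // (?s) enables DOTALL, so '.' also matches line terminators
    // and the lazy .*? can cross the break between "<img" and "src".
    static String replaceImg(String input) {
        return input.replaceAll("(?s)<img.*?cid:part123.*?>", "NewContent");
    }

    public static void main(String[] args) {
        String input = "some texting content <img\nsrc=\"cid:part123\" alt=\"\"> <b> Test</b>";
        System.out.println(replaceImg(input));
    }
}
```

One caveat: with DOTALL on, <img.*?cid:part123 can also start at an earlier, unrelated img tag and run forward to this one, which is one more reason to prefer an HTML parser for anything beyond simple input.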

how to match string using regular expression

I have a string which contains multiple occurrences of the "<p class=a> ... </p>" where ... is different text.
I am using the "<p class=a>(.*)</p>" regex pattern to split the text into chunks, but this is not working. What would be the correct regex for this?
P.S. The same regex pattern works on iOS using NSRegularExpression but not on Android using Pattern.
To explain my problem more : i am doing the following
Pattern regex3 = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", CASE_INSENSITIVE);
String[] result = p.split(str);
The result array contains only 1 item, and it is the whole string.
The following is a portion of the file that I am reading:
<BODY>
<SYNC Start=200>
<P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
</SYNC>
<SYNC Start=2440>
<P Class=ENCC> </P>
</SYNC>
<SYNC Start=2560>
<P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
</SYNC>
<SYNC Start=4560>
<P Class=ENCC> </P>
</SYNC>
<SYNC Start=66160>
<P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
</SYNC>
UPDATE:
Hi everybody, I found the problem. It was the encoding of the file I was reading: the file was UTF-16 (Little Endian) encoded, and that was what kept the regex from matching. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
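As an alternative to re-saving the file as UTF-8, the bytes can be decoded with the charset the file actually uses. A minimal sketch, assuming the raw bytes are already in memory; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    // Decodes UTF-16 little-endian bytes. The UTF_16LE decoder keeps a
    // leading byte order mark as the character U+FEFF, so drop it if present.
    static String decodeUtf16Le(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_16LE);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] raw = "<P Class=ENCC>Hai kawan-kawan.</P>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decodeUtf16Le(raw));
    }
}
```

If UTF-16 bytes are decoded as UTF-8 or a single-byte charset, every ASCII character ends up followed by a NUL character, so a literal pattern such as <P Class=ENCC> never matches.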
Parsing HTML with regular expressions is not really a good idea. What you should use is an HTML parser.
That being said, your issue is most likely the fact that the * operator is greedy. In your question you just say that it is not working, so I think that your problem is because it is matching the first <p class=a> and the very last </p>. Making the regular expression non greedy, like so: <p class=a>(.*?)</p> (notice the extra ? to make the * operator non greedy) should solve the problem (assuming that your problem is the one I have stated earlier).
That being said, I would really recommend you ditch the regular expression approach and use appropriate HTML Parsers.
EDIT:
Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:
You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive by default.
Then, . does not match newlines. And it's quite likely that your paragraphs do contain newlines.
Therefore, try "(?si)<p class=a>(.*?)</p>". The (?s) modifier allows the dot to match newlines, too, and the (?i) modifier makes the regex case-insensitive.
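A hedged sketch of that suggestion applied to the sample from the question. It uses Matcher.find() rather than split(), because split() returns the text between matches, not the captured paragraph contents. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphExtractor {
    // (?s): '.' matches newlines; (?i): case-insensitive; .*? stops at the first </p>.
    private static final Pattern PARAGRAPH =
            Pattern.compile("(?si)<p class=encc>(.*?)</p>");

    static List<String> extract(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            chunks.add(m.group(1));   // contents of one paragraph per match
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "<SYNC Start=200>\n<P Class=ENCC><i>Kami Tidak Berniat</i></P>\n"
                + "<SYNC Start=2440>\n<P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>";
        for (String chunk : extract(text)) {
            System.out.println(chunk);
        }
    }
}
```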
The .* may match <. You can try:
<p class=a>([^<]*)</p>
I guess the problem is that your pattern is greedy. You should use this instead.
"<p class=a>(.*?)</p>"
If you have this string:
"<p class=a>first</p><p class=a>second</p>"
Your pattern ("<p class=a>(.*)</p>") will match this
"<p class=a>first</p><p class=a>second</p>"
While "<p class=a>(.*?)</p>" only matches
"<p class=a>first</p>"
