I have a string that contains multiple occurrences of "<p class=a> ... </p>", where ... is different text each time.
I am using the regex pattern "<p class=a>(.*)</p>" to split the text into chunks, but this is not working. What would be the correct regex for this?
P.S. The same regex pattern works on iOS using NSRegularExpression but does not work on Android using Pattern.
To explain my problem further, I am doing the following:
Pattern regex3 = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", CASE_INSENSITIVE);
String[] result = regex3.split(str);
The result array contains only one item, which is the whole string.
The following is a portion of the file that I am reading:
<BODY>
<SYNC Start=200>
<P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
</SYNC>
<SYNC Start=2440>
<P Class=ENCC> </P>
</SYNC>
<SYNC Start=2560>
<P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
</SYNC>
<SYNC Start=4560>
<P Class=ENCC> </P>
</SYNC>
<SYNC Start=66160>
<P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
</SYNC>
UPDATE:
Hi everybody, I have found the problem. It was actually the encoding of the file that I was reading. The file was encoded as UTF-16 (Little Endian), and that was the whole reason the regex was not working. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
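The encoding problem described in the update can be reproduced without a file. The following is a minimal sketch, assuming the text was saved as UTF-16LE and then decoded with the wrong charset; the class name EncodingDemo and the helper extract are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingDemo {
    // Same idea as the pattern in the question: dot matches newlines, case is ignored.
    private static final Pattern P_TAG = Pattern.compile(
            "<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the text inside every matching paragraph.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = P_TAG.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Simulate the bytes of a UTF-16LE file (every ASCII character is followed by a 0 byte).
        byte[] raw = "<SYNC Start=66160>\n<P Class=ENCC>Hai kawan-kawan.</P>\n</SYNC>"
                .getBytes(StandardCharsets.UTF_16LE);

        // Decoded with the wrong charset, the string has NUL characters between the letters,
        // so the literal parts of the pattern never match.
        String wrong = new String(raw, StandardCharsets.UTF_8);
        System.out.println(extract(wrong)); // prints []

        // Decoded with the charset the file was written in, the pattern matches.
        String right = new String(raw, StandardCharsets.UTF_16LE);
        System.out.println(extract(right)); // prints [Hai kawan-kawan.]
    }
}
```

When reading from a real file, the same effect is obtained by passing the file's actual charset to the reader instead of relying on the platform default.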
Parsing HTML with regular expressions is not really a good idea (reason here). What you should use is an HTML parser such as this.
That said, your issue is most likely that the * operator is greedy. You only say that it is not working, so my guess is that the pattern matches from the first <p class=a> to the very last </p>. Making the quantifier non-greedy, as in <p class=a>(.*?)</p> (note the extra ? after the *), should solve the problem, assuming it is the one I described.
Even so, I would really recommend dropping the regular expression approach and using a proper HTML parser.
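The difference between the greedy and the non-greedy quantifier described above can be checked with a short sketch. The class name GreedyDemo, the helper allMatches, and the sample input are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: collects every full match of regex in input.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<p class=a>first</p><p class=a>second</p>";

        // Greedy: .* runs to the end of the input and backs off only as far as the last </p>.
        System.out.println(allMatches("<p class=a>(.*)</p>", input).size());  // prints 1

        // Non-greedy: .*? stops at the first </p> it can reach, so each paragraph matches on its own.
        System.out.println(allMatches("<p class=a>(.*?)</p>", input).size()); // prints 2
    }
}
```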
EDIT:
Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:
You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive by default.
Also, . does not match newlines by default, and it's quite likely that your paragraphs contain newlines.
Therefore, try "(?si)<p class=a>(.*?)</p>". The (?s) modifier allows the dot to match newlines, and the (?i) modifier makes the regex case-insensitive.
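As a rough sketch of the (?si) suggestion, adapted to the ENCC class from the posted sample: note that split() returns the text between matches, so this sketch uses find() and group(1) to collect the paragraph contents instead. The class name FlagsDemo and the helper paragraphs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // (?s) lets the dot match newlines, (?i) ignores case.
    private static final Pattern PARAGRAPH = Pattern.compile("(?si)<p class=encc>(.*?)</p>");

    // Hypothetical helper: returns the content of each matching paragraph.
    static List<String> paragraphs(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<P Class=ENCC>Kami Tidak Berniat\nMelukakan Hati Sesiapa.</P>\n"
                + "<P Class=ENCC> </P>";
        // Both paragraphs are found even though the tag case differs from the
        // pattern and the first paragraph spans two lines.
        System.out.println(paragraphs(text).size()); // prints 2
    }
}
```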
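The .* may match <. You can try the following, which works only when the paragraph text contains no nested tags such as <i> or <br/>: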
<p class=a>([^<]*)</p>
I guess the problem is that your pattern is greedy. You should use this instead.
"<p class=a>(.*?)</p>"
If you have this string:
"<p class=a>first</p><p class=a>second</p>"
Your pattern ("<p class=a>(.*)</p>") will match this:
"<p class=a>first</p><p class=a>second</p>"
While "<p class=a>(.*?)</p>" first matches only
"<p class=a>first</p>"
I am trying to convert markdown hyperlinks into html hyperlinks in the Apache Velocity template language (for Marketo). I am nearly there by splitting on ']' and then removing the first character from the remaining '[link text' piece, and the first and last characters from the remaining '(url)' piece.
It will let me remove the first character in each, but doesn't like my code for removing the last character. This is simple code so I don't know why it isn't working.
#set( $refArr = $reference.split(']',2) )
<li>
<a href=$refArr[1].substring(1,$refArr[1].length()-1)>$refArr[0].substring(1)</a>
</li>
It just doesn't like the '-1' part, see error. Velocity is supposed to have full Java method access, but it appears that it may be confusing Java for html.
Cannot get email content- <div>An error occurred when procesing the email Body! </div> <p>Encountered "-1" near</p>
I've also tried using regex with the replace method as well but that doesn't work either, whether with the '(' character escaped, double escaped, or not escaped.
Apparently you have to use the MathTool class for Velocity:
#set( $refArr = $reference.split(']',2) )
<li>
<a href=$refArr.get(1).substring(1,$math.sub($refArr.get(1).length(),1))>$refArr.get(0).substring(1)</a>
</li>
Your code should work in recent Velocity versions; otherwise, you can do it in two steps:
#set($len = $refArr.get(1).length() - 1)
<a href=$refArr[1].substring(1,$len)>$refArr[0].substring(1)</a>
I have a source code file which I am trying to read using an automatic regex processor class in Java.
However, I am unable to form a correct regex pattern to get the values when they appear multiple times in a line.
The input text is:
<input name="id" type="radio" bgcolor="<bean:write name='color'/>" value="<bean:write name='nameProp' property='nameVal'/>" <logic:equal name="checkedStatus" value="0">checked</logic:equal>>
I want matcher.find() to output the following terms:
<bean:write name='color'/>
<bean:write name='nameProp' property='nameVal'/>
Kindly help to form the regex pattern for this scenario.
Use this regex to find those terms:
<bean:write[^\/]*\/>
It matches the literal text <bean:write, then any characters other than /, up to the closing />. Note that it will not match a tag whose attribute values contain a slash.
Use it like this:
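import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// inputText holds the source line shown in the question
List<String> matches = new ArrayList<>();
Matcher m = Pattern.compile("<bean:write[^\\/]*\\/>")
        .matcher(inputText);
while (m.find()) {
    matches.add(m.group()); // the whole <bean:write ... /> tag
}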
Regex101 Tested
I would caution you against parsing HTML with regex, though. If you need anything more complicated than this, you should probably consider using an XML parser instead. See this famous answer.
I am trying to transform the following string:
<img src="image.jpg" ... />
into this one:
<img src="cid:image" ... />
The "image" part needs to be kept, but its value can differ. The HTML document contains several img tags, each with a different image file.
so for instance if I have:
<img src="mylogo.jpg" ... />
it should transform to:
<img src="cid:mylogo" ... />
The images could be jpg or gif.
Thanks for any help,
Note:
Regex is not the right tool for parsing HTML, as mentioned in the comments, and Java has many tools for that job (jsoup, for example). That said, here is a solution that meets your requirement of using regex.
Solution:
You can use the following Regex:
src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"
This is the code you need:
String html = "<img src=\"folder1/mylogo.jpg\" ... />";
Pattern pattern = Pattern.compile("src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"");
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
System.out.println("group 1: " + matcher.group(1));
//This line will give you the wanted output.
System.out.println("src=\"cid:"+matcher.group(1)+"\"");
System.out.println("Final Result: "+html.replaceAll("src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"", "src=\"cid:$1\""));
}
And this is a Working DEMO.
Explanation:
src= matches the characters src= literally.
\" matches the character " literally.
([\\:\\w\\s\\/]+) is a capturing group that matches one or more colons, word characters, whitespace characters, or slashes, i.e. the path and file name without the extension.
\\. matches the character . literally.
\\w{3} matches exactly three word characters [a-zA-Z0-9_] for the extension; you can use (?:jpg|gif) instead if you do not need to support any other image extensions.
\" matches the character " literally.
EDIT:
To produce the desired output, pass this regex to the replaceAll() method on your HTML, as follows:
html.replaceAll("src=\"([\\:\\w\\s\\/]+)\\.\\w{3}\"", "src=\"cid:$1\"");
We use $1 to point to the first capturing group.
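As a sketch of the jpg|gif variant mentioned in the explanation, the extension part can be restricted to those two types so that other src values are left alone. The class name CidRewriter and the method toCid are made up for this example, and the pattern is only checked against the simple inputs shown.

```java
class CidRewriter {
    // Same structure as the regex above, but the extension must be jpg or gif.
    private static final String SRC_REGEX = "src=\"([:\\w\\s/]+)\\.(?:jpg|gif)\"";

    // Hypothetical helper: rewrites src="name.jpg" or src="name.gif" to src="cid:name".
    static String toCid(String html) {
        return html.replaceAll(SRC_REGEX, "src=\"cid:$1\"");
    }

    public static void main(String[] args) {
        System.out.println(toCid("<img src=\"mylogo.jpg\" />"));
        // prints <img src="cid:mylogo" />
        System.out.println(toCid("<img src=\"photo.png\" />"));
        // prints <img src="photo.png" /> (unchanged, not a jpg or gif)
    }
}
```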
I have a web application running Java Tapestry, with a lot of user-entered content. The only formatting that users may enter is line breaks.
I read a text string from a database and output it in a template. The string contains line breaks as \r, which I replace with <br>. However, these are escaped on output, so the text looks like b<br>text text b<br> text. I think I can use outputRaw or writeRaw to fix this, but I can't find any information on how to add outputRaw or writeRaw to a Tapestry class or template.
The class is:
public String getText() {
KMedium textmedium = getTextmedium();
return (textmedium == null || textmedium.getTextcontent() == null) ? "" : textmedium.getTextcontent().replaceAll("\r", "<br>");
}
The tml is:
<p class="categorytext" id="${currentCategory.id}">
${getText()}
</p>
Where would I add the raw output handling to have my line breaks display properly?
To answer my own question, this is how to output the result of ${getText()} as raw HTML:
Change the tml from this:
<p class="categorytext" id="${currentCategory.id}">
${getText()}
</p>
To this:
<p class="categorytext" id="${currentCategory.id}">
<t:outputraw value="${getText()}"/>
</p>
Note that this is quite dangerous as you are likely opening your site to an XSS attack. You may need to use jsoup or similar to sanitize the input.
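One way to reduce that risk is to escape the user text before inserting the <br> tags, so that only the tags added by your own code reach the page as markup. The following is a minimal hand-written sketch, not a full sanitizer; the class name HtmlLineBreaks and the method toHtml are hypothetical.

```java
class HtmlLineBreaks {
    // Hypothetical helper: escapes HTML special characters, then turns CRs into <br>.
    static String toHtml(String text) {
        if (text == null) {
            return "";
        }
        String escaped = text
                .replace("&", "&amp;")   // must come first so later entities are not re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
        return escaped.replace("\r", "<br>");
    }

    public static void main(String[] args) {
        System.out.println(toHtml("a<b\rc&d")); // prints a&lt;b<br>c&amp;d
    }
}
```

In the page class, getText() could return toHtml(textmedium.getTextcontent()) instead of calling replaceAll directly.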
An alternative might be:
<p class="categorytext" id="${currentCategory.id}">
<t:loop source="textLines" value="singleLine">
${singleLine} <br/>
</t:loop>
</p>
This assumes a getTextLines() method that returns a List or array of Strings; it could use the same logic as your getText() but split the result on CRs. This would do a better job when the text lines contain unsafe characters such as & or <. With a little more work, you could add the <br> only between lines (not after each line) ... and this feels like it might be a nice component as well.
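A minimal sketch of the splitting logic such a getTextLines() method could delegate to, kept free of Tapestry types so it can be tried on its own. The class name TextLines and the method toLines are assumptions, and splitting on \r\n and \n in addition to \r is an extra not asked for in the question.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TextLines {
    // Hypothetical helper: splits stored text into lines; the template escapes each line itself.
    static List<String> toLines(String content) {
        if (content == null || content.isEmpty()) {
            return Collections.emptyList();
        }
        // \r\n is listed first so a Windows line break is not split into two empty-line breaks.
        return Arrays.asList(content.split("\\r\\n|\\r|\\n"));
    }

    public static void main(String[] args) {
        System.out.println(toLines("one\rtwo & <three>")); // prints [one, two & <three>]
    }
}
```

In the page class, getTextLines() would then return toLines(textmedium.getTextcontent()) after the same null checks that getText() performs.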
I have a large number (>1500) of JSP files that I am trying to convert to JSPX. I am using a tool that will parse well-formed JSPs and convert to JSPX, however, my JSPs are not all well-formed :)
My solution is to pre-process the JSPs and convert untidy code so the tool will parse them correctly. The main problem I am trying to resolve is that of unquoted attribute values. Examples:
<INPUT id="foo" size=1>
<input id=body size="2">
My current regex for finding these is (in Java string format):
"(\\w+)=([^\"' >]+)"
And my replacement string is (in Java string format):
"$1=\"$2\""
This works well, EXCEPT for a few patterns, both of which involve inline scriptlets. For example:
<INPUT id=foo value="<%= someBean.method("a=b") %>">
In this case, my pattern matches the string literal "a=b", which I don't want to do. What I'd like to have happen is that the regex would IGNORE anything between <% and %>. Is there a regular expression that will do what I am trying to do?
EDIT:
Changed the title to clarify that I am NOT trying to parse HTML / JSP with regexes... I am doing a simple syntactic transformation to prepare the input for parsing.
If a sentence contains an arbitrary number of matching tokens such as double quotes, then it belongs to a context-free language, which simply cannot be parsed with regexes, since those are designed to handle regular languages.
Either you make some simplifying assumptions (e.g. there are no unmatched double quotes and there is only a bounded number of them) that would permit the use of regex, or you need to think about using (or creating) a lexer/parser for a context-free language. ANTLR is a good tool for this.
Based on the assumption that there are NO unquoted attribute values inside the scriptlets, the following construct might work for you:
Note: this approach is fragile. Just for your reference.
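import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static void main(String[] args) {
        String s = "<INPUT id=foo abbr='ip ' name = bar color =\"blue\" value=\" <%= someBean.method(\" a = b \") %>\" nickname =box >";
        // name, optional whitespace, '=', optional whitespace, then an unquoted value of
        // at least two characters (\w+ followed by one non-quote, non-space character)
        Pattern p = Pattern.compile("(\\w+)\\s*=\\s*(\\w+[^\"'\\s])");
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println("Return Value:" + m.group(1) + "=" + m.group(2));
        }
    }
}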
Output:
Return Value:id=foo
Return Value:name=bar
Return Value:nickname=box
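Another way to make the regex ignore anything between <% and %> is to let the pattern consume scriptlets as a first alternative and copy them back unchanged, rewriting only the matches of the original attribute pattern. This is a sketch under the assumption that scriptlets never contain the sequence %> inside a string literal; the class name QuoteAttributes and the method quote are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAttributes {
    // Alternative 1 swallows a whole scriptlet; alternative 2 is the attribute
    // pattern from the question. Groups 1 and 2 are set only for alternative 2.
    private static final Pattern P = Pattern.compile(
            "<%.*?%>|(\\w+)=([^\"' >]+)", Pattern.DOTALL);

    // Hypothetical helper: quotes unquoted attribute values outside scriptlets.
    static String quote(String jsp) {
        Matcher m = P.matcher(jsp);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = (m.group(1) == null)
                    ? m.group()                                 // scriptlet: keep as is
                    : m.group(1) + "=\"" + m.group(2) + "\"";   // attribute: add quotes
            // quoteReplacement keeps $ and \ in the JSP text from being treated as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(quote("<INPUT id=foo value=\"<%= someBean.method(\"a=b\") %>\">"));
        // prints <INPUT id="foo" value="<%= someBean.method("a=b") %>">
    }
}
```

Like the original pattern, this still rewrites name=value text that appears inside an ordinary quoted attribute value, so it only addresses the scriptlet case.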