I need to apply a regular expression to a string from an RSS feed. This is the string:
private String text = "<![CDATA[<table border="0" cellspacing="0" cellpadding="6px"> <tr><td valign="top" align="center" width="150px"><img src="http://static.desktopnexus.com/thumbnails/978620-thumbnail.jpg" border="1" style="border: 1px solid #000000;"><br><strong>Princess,Aurora,Sleeping,Beauty</strong></td><td valign="top" style="font-size: 10pt;">A new wallpaper has been posted to the Entertainment gallery on Desktop Nexus.<br><br>Uploaded By: Jessowey<br>Category: Movies<br>Date Uploaded: 02/23/12<br>Native Resolution: 1024x768<br>Points: +1<br>Download Count: 0<br><br><b>View This Wallpaper Now</b></td></tr></table>]]>";
As you can see, there are " characters inside the string, so I can't use it as a string literal as written. I tried to use a raw string, but as far as I know that is not possible in Java.
How can I extract the img tag from the string above? I need to do it programmatically.
Thanks in advance!
It is very difficult, in general, to parse XML/HTML with regular expressions. There is another post that lists good XML parsers for Java and their strengths; I suggest you use one of those.
If you want to use " in a String literal in Java, you have to escape it with a backslash, like this:
String stringWithParenthesis = "text \"in parenthesis\" out ";
Use \" in a string to use the " inside it. For example:
String yourFeed = "My so \"called\" String";
The same works for other special characters, such as the backslash itself:
String antoherFeed = "Hello \"World\", what a nice \\ day \\ ";
You can parse your RSS-FEED for such special characters. Like in this question:
JAVA: check a string if there is a special character in it
and format them to valid string characters.
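For the original question (getting the img tag out of the feed text), here is a minimal sketch using java.util.regex. It assumes the feed always contains a simple <img ...> tag with a double-quoted src attribute and no > inside attribute values; for anything less predictable, an HTML parser is the safer choice. The class and method names (ImgTagExtractor, firstImgTag, firstImgSrc) are made up for this example. The quotes are escaped only because the sample is a literal in source code; a string read from the feed at runtime needs no escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagExtractor {
    // Matches one <img ...> tag; [^>]* assumes no '>' inside attribute values.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Captures the value of a double-quoted src attribute.
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the first img tag in the text, or null if none is found. */
    public static String firstImgTag(String text) {
        Matcher m = IMG_TAG.matcher(text);
        return m.find() ? m.group() : null;
    }

    /** Returns the src value of the first img tag, or null if absent. */
    public static String firstImgSrc(String text) {
        String tag = firstImgTag(text);
        if (tag == null) {
            return null;
        }
        Matcher m = SRC_ATTR.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened version of the feed text from the question.
        String text = "<![CDATA[<table border=\"0\"><tr><td>"
                + "<img src=\"http://static.desktopnexus.com/thumbnails/978620-thumbnail.jpg\""
                + " border=\"1\" style=\"border: 1px solid #000000;\">"
                + "<br><strong>Princess,Aurora,Sleeping,Beauty</strong></td></tr></table>]]>";
        System.out.println(firstImgTag(text));
        // prints http://static.desktopnexus.com/thumbnails/978620-thumbnail.jpg
        System.out.println(firstImgSrc(text));
    }
}
```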
I have problems concatenating strings and variables. I tried adding quotes and slashes and moving them back and forth, but I wasn't able to find a solution.
I have a class that writes a div. I wrote this:
String var = "width:100px";
String div ="<div class=\"divClass\" style="+var+">";
The code I wrote gives me
<div class="divClass" style=width:100px>
But to produce valid markup I need this
<div class="divClass" style="width:100px">
with the value of style between double quotes.
You need to escape the " symbol
String var = "\"width:100px\"";
String div ="<div class=\"divClass\" style="+var+">";
Then div would be
<div class="divClass" style="width:100px">
The reason we need to do this is that we need to tell the compiler that the quotes symbol " is a part of the String and we are not closing the String literal yet.
Example
System.out.println("hello"); => hello
System.out.println("\"hello\""); => "hello"
When the compiler sees \", the backslash tells it to treat the following " as a literal character inside the String rather than as the end of the String literal.
try
String var = "\"width:100px\"";
as you will need to escape your quotes
Just try like this.
String var = "width:100px";
String div ="<div class=\"divClass\" style=\""+var+"\">";
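If you build tags like this in several places, String.format can keep the escaped quotes in one spot. This is only a sketch: DivBuilder and openDiv are names invented for the example, and it assumes the class and style values do not contain double quotes themselves.

```java
public class DivBuilder {
    /** Builds an opening div tag with both attribute values in double quotes. */
    public static String openDiv(String cssClass, String style) {
        // The \" sequences are literal quote characters in the output.
        return String.format("<div class=\"%s\" style=\"%s\">", cssClass, style);
    }

    public static void main(String[] args) {
        String var = "width:100px";
        // prints <div class="divClass" style="width:100px">
        System.out.println(openDiv("divClass", var));
    }
}
```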
I'm working with JSP pages, and I need to append some HTML and Java code inside a DIV. I only remember that " should be escaped as \", but I don't know about the other characters, or whether all non-letter characters need escaping. Here is the String:
String s ="<% ResultSet joinedRooms = myJavaDB.updateJoinedRooms(loginBean.getId());
while(joinedRooms.next()){%> <div id="<%=joinedRooms.getString(1)%>" class="chatRoom">
<div class="chatRoomName"><%=myJavaDB.getRoomName(joinedRooms.getInt(1))%></div></div><% } %>"
No need to roll your own, take a look at Apache Commons StringEscapeUtils.
I have a string which contains multiple occurrences of the "<p class=a> ... </p>" where ... is different text.
I am using the regex pattern "<p class=a>(.*)</p>" to split the text into chunks, but this is not working. What would be the correct regex for this?
P.S. The same regex pattern works on iOS using NSRegularExpression but does not work on Android using Pattern.
To explain my problem in more detail, I am doing the following:
Pattern regex3 = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);
String[] result = regex3.split(str);
The result array contains only 1 item, and it is the whole string.
The following is a portion of the file that I am reading:
<BODY>
<SYNC Start=200>
<P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
</SYNC>
<SYNC Start=2440>
<P Class=ENCC> </P>
</SYNC>
<SYNC Start=2560>
<P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
</SYNC>
<SYNC Start=4560>
<P Class=ENCC> </P>
</SYNC>
<SYNC Start=66160>
<P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
</SYNC>
UPDATE:
Hi everybody, I found the problem. It was actually the encoding of the file I was reading. The file was UTF-16 (Little Endian) encoded, and that was what kept the regex from matching. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
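As an alternative to converting the file, the text can be decoded with an explicit charset when it is read. The sketch below assumes the file really is UTF-16 Little Endian, as described in the update; SubtitleReader, decode and readUtf16Le are names chosen for this example. A likely reason the pattern never matched is that UTF-16 bytes decoded with the wrong charset leave a NUL character between the letters, so literal text such as <P Class=ENCC> is no longer contiguous.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SubtitleReader {
    /** Decodes UTF-16LE bytes and drops a leading byte order mark, if any. */
    public static String decode(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_16LE);
        // The UTF-16LE charset keeps the BOM as U+FEFF, so remove it by hand.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text;
    }

    /** Reads the whole file and decodes it as UTF-16 little endian. */
    public static String readUtf16Le(Path file) throws IOException {
        return decode(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-16LE sample to a temporary file, then read it back.
        Path tmp = Files.createTempFile("sample", ".smi");
        Files.write(tmp, "<P Class=ENCC>Hai kawan-kawan.</P>".getBytes(StandardCharsets.UTF_16LE));
        System.out.println(readUtf16Le(tmp));
        Files.delete(tmp);
    }
}
```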
Parsing HTML with regular expressions is not really a good idea. What you should use instead is a proper HTML parser.
That being said, your issue is most likely the fact that the * operator is greedy. In your question you just say that it is not working, so I think that your problem is because it is matching the first <p class=a> and the very last </p>. Making the regular expression non greedy, like so: <p class=a>(.*?)</p> (notice the extra ? to make the * operator non greedy) should solve the problem (assuming that your problem is the one I have stated earlier).
That being said, I would really recommend you ditch the regular expression approach and use appropriate HTML Parsers.
EDIT:
Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:
You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive.
Then, . does not match newlines. And it's quite likely that your paragraphs do contain newlines.
Therefore, try "(?si)<p class=a>(.*?)</p>". The (?s) modifier allows the dot to match newlines, too, and the (?i) modifier makes the regex case-insensitive.
The .* may match <. You can try :
<p class=a>([^<]*)</p>
I guess the problem is that your pattern is greedy. You should use this instead.
"<p class=a>(.*?)</p>"
If you have this string:
"<p class=a>first</p><p class=a>second</p>"
Your pattern ("<p class=a>(.*)</p>") will match this
"<p class=a>first</p><p class=a>second</p>"
While "<p class=a>(.*?)</p>" only matches
"<p class=a>first</p>"
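One more thing worth noting: Pattern.split returns the text between the matches, not the matched paragraphs. To collect the paragraph contents, a find() loop over a Matcher is the usual approach. The sketch below combines the non-greedy quantifier with the case-insensitive and dot-matches-newline flags; it assumes the <P Class=ENCC> markup from the question, and ParagraphExtractor and paragraphs are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphExtractor {
    // Non-greedy body; DOTALL lets '.' cross line breaks inside a paragraph.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p class=encc>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the inner text of every matching paragraph, in document order. */
    public static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "<SYNC Start=200>\n<P Class=ENCC><i>Kami Tidak Berniat</i></P>\n</SYNC>\n"
                + "<SYNC Start=66160>\n<P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>\n</SYNC>";
        for (String p : paragraphs(str)) {
            System.out.println(p);
        }
    }
}
```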
I have a html file in which the html elements have name as follows :
<input type="text" name="HCFA_DETAIL_SUPPLEMENTAL" value="" size="64" />
My requirement is to rename the name attribute value in java naming convention as follows :
<input type="text" name="hcfaDetailSupplemental" value="" size="64" />
Since there is a large number of such elements, I want to accomplish this using regex. Can anyone suggest how to achieve that?
Do not use regular expressions to go over HTML (why here). Using an appropriate framework such as HTML Parser should do the trick.
A series of samples to get you started are available here.
Using jQuery to get the name, and then a regex to replace every _[a-z] occurrence with the uppercased letter:
$('input').each(function () {
  var s = $(this).attr('name').toLowerCase();
  // Drop each underscore and uppercase the letter that follows it
  s = s.replace(/_([a-z])/g, function (match, letter) {
    return letter.toUpperCase();
  });
  $(this).attr('name', s);
});
In most cases using regex on HTML is bad practice, but if you must, here is one possible solution.
First, find the text in the name="XXX" attribute. This can be done with the regex (?<=name=")[a-zA-Z_]+(?="). Once you have found it, replace "_" with "" and don't forget to lowercase the rest of the letters. You can then replace the old value with the new one using the same regex.
This should do the trick
String html="<input type=\"text\" name=\"HCFA_DETAIL_SUPPLEMENTAL\" value=\"\" size=\"64\"/>";
String reg="(?<=name=\")[a-zA-Z_]+(?=\")";
Pattern pattern=Pattern.compile(reg);
Matcher matcher=pattern.matcher(html);
if (matcher.find()){
String newName=matcher.group(0);
//System.out.println(newName);
newName=newName.toLowerCase().replaceAll("_", "");
//System.out.println(newName);
html=html.replaceFirst(reg, newName);
}
System.out.println(html);
//out -> <input type="text" name="hcfadetailsupplemental" value="" size="64"/>
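The snippet above lowercases the whole name and handles only the first match. If the camelCase form from the question (hcfaDetailSupplemental) is needed for every input, something like the following sketch might do. It reuses the same lookaround regex and assumes names consist only of letters and underscores; NameAttributeCamelCaser, toCamelCase and renameAll are names made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameAttributeCamelCaser {
    // Same idea as above: match only the value inside name="...".
    private static final Pattern NAME_VALUE =
            Pattern.compile("(?<=name=\")[a-zA-Z_]+(?=\")");

    /** Converts e.g. HCFA_DETAIL_SUPPLEMENTAL to hcfaDetailSupplemental. */
    public static String toCamelCase(String upperSnake) {
        StringBuilder out = new StringBuilder();
        boolean upperNext = false;
        for (char c : upperSnake.toLowerCase().toCharArray()) {
            if (c == '_') {
                upperNext = true;              // drop the underscore itself
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Rewrites every name attribute value in the given HTML text. */
    public static String renameAll(String html) {
        Matcher m = NAME_VALUE.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement guards against '$' or '\' in the replacement.
            m.appendReplacement(sb, Matcher.quoteReplacement(toCamelCase(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<input type=\"text\" name=\"HCFA_DETAIL_SUPPLEMENTAL\" value=\"\" size=\"64\"/>";
        // prints <input type="text" name="hcfaDetailSupplemental" value="" size="64"/>
        System.out.println(renameAll(html));
    }
}
```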
I have a large number (>1500) of JSP files that I am trying to convert to JSPX. I am using a tool that will parse well-formed JSPs and convert to JSPX, however, my JSPs are not all well-formed :)
My solution is to pre-process the JSPs and convert untidy code so the tool will parse them correctly. The main problem I am trying to resolve is that of unquoted attribute values. Examples:
<INPUT id="foo" size=1>
<input id=body size="2">
My current regex for finding these is (in Java string format):
"(\\w+)=([^\"' >]+)"
And my replacement string is (in Java string format):
"$1=\"$2\""
This works well, EXCEPT for a few patterns, both of which involve inline scriptlets. For example:
<INPUT id=foo value="<%= someBean.method("a=b") %>">
In this case, my pattern matches the string literal "a=b", which I don't want to do. What I'd like to have happen is that the regex would IGNORE anything between <% and %>. Is there a regular expression that will do what I am trying to do?
EDIT:
Changed the title to clarify that I am NOT trying to parse HTML / JSP with regexes... I am doing a simple syntactic transformation to prepare the input for parsing.
If a sentence contains an arbitrary number of matching tokens such as double quotes, then this sentence belongs to a context-free language, which simply cannot be parsed with Regex designed to handle regular languages.
Either you make some simplifying assumptions (e.g. there are no unmatched double quotes and there is only a certain number of them) that would permit the use of Regex, or you need to think about using (or creating) a lexer/parser for a context-free language. ANTLR is a good tool for this.
Based on the assumption that there are NO unquoted attribute values inside the scriptlets, the following construct might work for you:
Note: this approach is fragile. Just for your reference.
import java.util.regex.*;
public class test{
public static void main(String args[]){
String s = "<INPUT id=foo abbr='ip ' name = bar color =\"blue\" value=\" <%= someBean.method(\" a = b \") %>\" nickname =box >";
Pattern p = Pattern.compile("(\\w+)\\s*=\\s*(\\w+[^\"'\\s])");
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println("Return Value :"+m.group(1)+"="+m.group(2));
}
}
}
Output:
Return Value :id=foo
Return Value :name=bar
Return Value :nickname=box
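Another way to get the "ignore everything between <% and %>" behaviour from the question is to do it in two steps rather than one regex: find the scriptlets, copy them through untouched, and apply the original replacement only to the text between them. This is a sketch under the assumption that a scriptlet never contains %> inside a string literal; AttributeQuoter and quoteAttributes are names invented here, and the attribute pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeQuoter {
    // A scriptlet or expression: everything from <% to the next %>.
    private static final Pattern SCRIPTLET = Pattern.compile("<%.*?%>", Pattern.DOTALL);
    // The pattern from the question: name=value with an unquoted value.
    private static final Pattern UNQUOTED = Pattern.compile("(\\w+)=([^\"' >]+)");

    /** Quotes unquoted attribute values in a piece of text with no scriptlets. */
    private static String quoteSegment(String markup) {
        return UNQUOTED.matcher(markup).replaceAll("$1=\"$2\"");
    }

    /** Quotes attribute values everywhere except inside <% ... %> blocks. */
    public static String quoteAttributes(String jsp) {
        StringBuilder out = new StringBuilder();
        Matcher m = SCRIPTLET.matcher(jsp);
        int last = 0;
        while (m.find()) {
            out.append(quoteSegment(jsp.substring(last, m.start()))); // markup before
            out.append(m.group());                                    // scriptlet as is
            last = m.end();
        }
        out.append(quoteSegment(jsp.substring(last)));                // trailing markup
        return out.toString();
    }

    public static void main(String[] args) {
        String jsp = "<INPUT id=foo value=\"<%= someBean.method(\"a=b\") %>\">";
        // prints <INPUT id="foo" value="<%= someBean.method("a=b") %>">
        System.out.println(quoteAttributes(jsp));
    }
}
```

Like the original pattern, this still rewrites any word=value text that appears outside tags, so it is worth checking the output on a few real files first.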