Java (Android) regular expression to strip out HTML paragraph - java

I have an Android application which grabs some data from an external XML source. I've stripped out some HTML from one of the XML elements, but it's in the format:
<p class="x">Some text...</p>
<p>Some more text</p>
<p>Some final text</p>
I want to extract the text of the middle paragraph. How can I do this? Would a regular expression be the best way? I don't really want to start including external HTML parsing libraries.

RegEx match open tags except XHTML self-contained tags
So, I'll ask the question that wraps up the linked-to answer: have you tried using an XML parser instead?
You might get some ideas from some of the other answers there, too, but I'd try to avoid the regex path. As Macarse suggested, clean this up on the server if you can. If not, wrap those three <p> elements in a single root element and parse it using SAX or something, paying attention to the 2nd paragraph element.
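To illustrate the wrap-and-parse suggestion, here is a minimal sketch. It assumes the three paragraphs are well-formed XML once wrapped, uses the JDK's DOM parser rather than SAX purely for brevity, and the class name, method name, and <root> wrapper are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class MiddleParagraph {
    // Wraps the fragment in a made-up <root> element so it has a single root,
    // parses it, and returns the text of the second <p> (or null if absent).
    static String secondParagraph(String fragment) {
        String xml = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList paragraphs = doc.getElementsByTagName("p");
            return paragraphs.getLength() < 2 ? null : paragraphs.item(1).getTextContent();
        } catch (Exception e) {
            // Assumption: malformed input is reported as an unchecked exception.
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<p class=\"x\">Some text...</p>\n"
                + "<p>Some more text</p>\n"
                + "<p>Some final text</p>";
        System.out.println(secondParagraph(fragment));
    }
}
```

If the real markup contains HTML entities or unclosed tags, an XML parser will reject it, which is one reason to clean the data up on the server where possible.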

If it's simple, just do a regex.
If you are getting XML from an external source that you own, I would parse it there.

Just doing a split (http://developer.android.com/reference/java/lang/String.html#split(java.lang.String)) on "</p><p>" and taking the second entry in the returned array would actually do it pretty quickly.
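A rough sketch of the split approach follows. It assumes the second and later paragraphs open with a bare <p> (no attributes), as in the question, and it allows whitespace between the tags because the sample has line breaks there. The class and method names are invented for illustration.

```java
// Hypothetical helper illustrating the split idea.
class SplitParagraphs {
    // Splits on "</p>", optional whitespace, "<p>" and returns the second piece,
    // or null when there is no second piece.
    static String secondParagraph(String html) {
        String[] parts = html.split("</p>\\s*<p>");
        return parts.length > 1 ? parts[1] : null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Some text...</p>\n"
                + "<p>Some more text</p>\n"
                + "<p>Some final text</p>";
        System.out.println(secondParagraph(html));
    }
}
```

This is quick, but it is fragile: if the middle paragraph ever gains an attribute, the separator no longer matches.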

The regex would probably look something like: .*?>(.*?)<.*
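And you access the grouped content by calling group(1) on the Matcher object. Note that, as written, this captures the text of the first element it encounters, so it needs adapting to pick out the second paragraph.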
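For the regex route, one way to reach a specific paragraph is to match each <p>...</p> pair and count the matches. This is only a sketch under the assumption that paragraphs are not nested and are properly closed; the pattern and the names below are illustrative, and the pattern would also match other tags starting with "p" such as <pre>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not a general HTML parser.
class ParagraphRegex {
    // Lazily captures whatever sits between an opening <p ...> and the next </p>.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p[^>]*>(.*?)</p>", Pattern.DOTALL);

    // Returns the text of the paragraph at the given zero-based index, or null.
    static String paragraph(String html, int index) {
        Matcher m = PARAGRAPH.matcher(html);
        int i = 0;
        while (m.find()) {
            if (i++ == index) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Some text...</p>\n"
                + "<p>Some more text</p>\n"
                + "<p>Some final text</p>";
        System.out.println(paragraph(html, 1));
    }
}
```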

If you are going to parse an XML file downloaded from a website, then this has nothing to do with Android specifically.

Related

correcting parsed URLs in java

I am creating an HTML parser that gets the HTML from a given URL, finds the navigation menu HTML, and puts it into a String. The URLs in the HTML that are being copied into the String need part of the URL added (the "www.stackoverflow.com" part). How can I go about finding the existing URLs in the String and adding the missing part to them so that they work?
The URLs in the String are of the form:
<a href="/questions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">
Try using this regular expression with the replaceAll() method:
str = subString.replaceAll("<a href=\"/(.*?)\">", "<a href=\"http://www.stackoverflow.com/$1\">");
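As a self-contained sketch of that replacement: the version below assumes every link to fix is written exactly as <a href="/..."> with double quotes and no other attributes, and it skips protocol-relative links that start with "//". The class name, method name, and base parameter are made up for the example.

```java
import java.util.regex.Matcher;

// Hypothetical helper illustrating the replaceAll idea.
class HrefRewriter {
    // Prefixes root-relative hrefs (starting with a single "/") with the given base.
    static String absolutize(String html, String base) {
        // quoteReplacement guards against "$" or "\" in the base being read as group syntax.
        String replacement = "<a href=\"" + Matcher.quoteReplacement(base) + "/$1\">";
        return html.replaceAll("<a href=\"/(?!/)([^\"]*)\">", replacement);
    }

    public static void main(String[] args) {
        String menu = "<a href=\"/questions/11744851.cfm\">";
        System.out.println(absolutize(menu, "http://www.stackoverflow.com"));
    }
}
```

Using [^\"]* instead of .* keeps each match inside a single attribute value, so several links on one line are rewritten separately.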
If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /body/div/a/@href, where /body/div is the path to the menu section in the HTML).
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/), you may want to give it a try (according to the page, it has 'link extraction, for crawling through web pages or harvesting email addresses'; but I've never used it, so I can't help much).
If, on the other hand, the HTML is anything but valid, you may want to use http://ccil.org/~cowan/XML/tagsoup/. It might work or it might not; on the websites we've tried, it did pretty well.
Edit: adding the missing part can be done with simple concatenation once the relevant parts have been found.

how to capture content in a string using a regular expression in java

I would like to parse an HTML form and pull out the filenames of any embedded images.
So the string could look like:
{
...
random HTML content
image1.png
more random HTML content
image3.png
...
}
From the above I would like to write a function in Java that returns to me
{image1.png, image3.png}.
I have a regular expression that returns to me only the last image name (image3.png) but it disregards previous image names. How can I capture all of them using regex?
All / any help would be appreciated.
https://stackoverflow.com/a/2059614/684934 gives a good hint. More specifically, you're probably looking for something like [a-zA-Z0-9_\-]+\.(png|jpg|gif|jpeg|tif)
Note, however, that this is a regex and only looks for sequences of characters. If you are looking at a site that serves dynamic images through servlets, for example, and the resource URI doesn't end with a normal image file extension (ending in .jsp or .do instead, say), then the regex will fail completely. It will also pick up any "image names" in text that happens to match but does not actually represent an image on the page.
To do the job properly, you will need to use some sort of DOM and traverse the <img> elements. (And the <button> elements, which may be of type image... there are probably more tags that can have images.)
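For the simple regex route, getting every match rather than only the last one is usually a matter of looping over Matcher.find(). Here is a minimal sketch using the pattern suggested above; the class and method names are invented, and the caveats about false positives still apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating repeated matching.
class ImageNames {
    private static final Pattern IMAGE =
            Pattern.compile("[a-zA-Z0-9_\\-]+\\.(png|jpg|gif|jpeg|tif)");

    // Collects every substring that looks like an image filename, in document order.
    static List<String> findAll(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = IMAGE.matcher(html);
        while (m.find()) {       // each call advances to the next match
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "random HTML <img src=\"image1.png\"> more HTML <img src=\"image3.png\">";
        System.out.println(findAll(html));
    }
}
```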

Parsing HTML content with sibling tags in Java (or) Finding content between two <open> tags

Background: I'm writing a Java program to go through HTML files and replace all the content in tags that are not <script> or <style> with Lorem Ipsum. I originally did this with a regex just removing everything between a > and a <, which actually worked quite well (blasphemous I know), but I'm trying to turn this into a tool others may find useful so I wouldn't dare threaten the sanctity of the universe any more by trying to use regex on html.
I'm trying to use HtmlCleaner, a Java library that attracted me because it has no other dependencies. However, trying to implement it I've been unable to deal with html like this:
<div>
This text is in the div <span>but this is also in a span.</span>
</div>
The problem is simple. When the TagNodeVisitor reaches the div, if I replace its contents with the right amount of lipsum, it will eliminate the span tag. But if I drill down to only TagNodes with no other children, I would miss the first bit of text.
HtmlCleaner has a ContentNode object, but that object has no replace method. Anything I can think of to deal with this seems like it must be far too complicated. Is anyone familiar with a way to deal with this, with HtmlCleaner or some other parsing library you're more familiar with?
You can do pretty much anything you want with jsoup's setters.
Would that suit you?
Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
HtmlCleaner's ContentNode has a method getContent() that returns a java.lang.StringBuilder. This is mutable and can be changed to whatever value you want.

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h' then 't' then 't' then 'p', but I don't think this is a great solution.
Any good ideas?
Added: I'm looking for some kind of pseudocode but, just in case, I'm using Java for this project in particular
Try using an HTML parsing library, then search for <a> tags in the HTML document.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Not all URLs are in tags; some are in plain text, and some are in links or other tags.
You shouldn't scan the HTML source to achieve this.
You will end up with link elements that are not necessarily in the 'text' of the page, i.e. you could end up with 'links' from JS scripts in the page, for example.
The best way is still to use a tool made for the job.
You should grab HTML tags and cover the most likely ones to have 'links' inside them (say: <h1>, <p>, <div> etc) . HTML parsers provide regex-like functionalities to filter through the content of the tags, something similar to your logic of "starts with HTTP".
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. select("[href*=/path/]")
See: jsoup.
You may want to have a look at XPath or Regular Expressions.
Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
The best way should be to google for regexes. One example is this one:
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()+,;=:#]|%[0-9A-F]{2})))(?:\?((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})))?(?:#((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})*))?$/i
found in a hacker news article. As far as I can follow it, it looks good. But there is, as far as I know, no formal regex for this problem. So the best solution is to google for some and try which one matches most of what you want.
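Note that the pattern quoted above is anchored with ^ and $, so it checks a whole string rather than scanning a document. For scanning, a much looser pattern combined with Matcher.find() is a common starting point. The sketch below is an assumption-laden simplification: it treats any run of characters after http:// or https:// up to whitespace, a quote, or an angle bracket as a URL, and its names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; deliberately loose, not a URL validator.
class UrlFinder {
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Returns every http(s) URL-like substring found in the text, in order.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">x</a> see https://example.org/b?c=1 now";
        System.out.println(findUrls(html));
    }
}
```

This will also pick up URLs inside scripts and comments, and it may include trailing punctuation from running text.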

url rewriting with antlr

My Java program needs to rewrite URLs in HTML (just in time). I am looking for the right tool and wonder whether ANTLR would do the job for me.
For example:
<html><body> <img src="foo.jpg" /> </body></html>
should be rewritten as:
<html><body> <img src="http://foo.com/foo.jpg" /> </body></html>
I want to read/write from/to a stream (byte by byte).
As khmarbaise said, first make sure whether regular expressions can do it. But there are cases in which they can't [*], and then I think ANTLR might really be a legitimate choice.
[*] For the mathematical background on this, see http://en.wikipedia.org/wiki/Formal_grammar#The_Chomsky_hierarchy
Update
Now that you updated your question, I see what you really want to do: For modifying a complete HTML file, I'd use a parser like NekoHTML, or something similar: http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
Then you can use these to extract the URL. Then
parse only the URL itself, e.g. with regexes, Java's URL class (or sometimes better: URI), or maybe ANTLR
modify the parsed URL
and write out the HTML again, using NekoHTML/...
Do not use regular expressions to parse the entire HTML file! You could use ANTLR for that in theory, but it would be very hard to make that work reliably.
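For the "parse and modify the URL" step, java.net.URI can resolve a relative reference against a base. A small sketch, assuming the src values have already been extracted by an HTML parser and that the base ends with a slash (the helper name is made up):

```java
import java.net.URI;

// Hypothetical helper for the URL-rewriting step only.
class UrlResolver {
    // Resolves a possibly relative reference against the base URL.
    // Assumption: base has a path ending in "/", e.g. "http://foo.com/".
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://foo.com/", "foo.jpg"));
        // Absolute references are returned unchanged.
        System.out.println(resolve("http://foo.com/", "http://bar.com/x.png"));
    }
}
```

Passing a base without a trailing slash is worth testing separately, since URI.resolve has historically handled an empty base path in a surprising way.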
What about regular expressions?
