How to delete specific html class with content using Java Html Class - java

Recently I am working on a android project. I am parsing data from wordpress api. But detail post content are in html formet. I have to remove html tags. Using Html.fromHtml().toString() java method I deleted all tags. But there are some image caption which I have to delete. For delete the caption I have to find tag class. So how can I delete this content using Html Class?
<p class="wp-caption-text">android m marshmallow</
EDIT :
Using regular Expression I solved My problem.
Insert Your specific Html in Regex and you will get your Regular Expression.
yourHtml = yourHtml.replaceAll("Your_Regular_Expression","");
yourHtml = Html.fromHtml(yourHtml).toString();

If you want to get a match you can try this:
<(\w+).*?class="wp-caption-text".*?>[\s\S]*?<\/\1>
Regex101
I'd like to mention that this is not a perfect solution. Regular expressions are not very good at parsing html since the structures in that markup language are actually too complex to 100% be parseable by regular expressions. See here

Related

Java - Extract html information from string

All of the guides out there tell me on how to remove the HTML tags from the text to extract the text between them. What I am after is the extraction of the data that is within the HTML tags.
e.g.
If i have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware trying to parse it as "standard" XML as XML-parsing is strict by nature and will fail if the page does not conform to XML markup specs (which few HTML pages do).
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like jerichoHTML wich enables you to search for HTML tags as well as their attributes or you build some DOM on you own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.

correcting parsed URLs in java

I am creating a HTML parser that gets the HTML from a given URL, finds the navigation menu html, and puts it into a String. The URLs in the HTML that are being copied into the String need part of the URL added (the "www.stackoverflow.com" part). How can I go about finding the existing URLs in the String and adding the missing part to it so that they work.
The URLs in the String are of the form:
<a href="/qestions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">
Try using this regular expression with the ReplaceAll() method:
str = subString.replaceAll("<a href=\"(.*)\">", "<a href=\"http://www.stackoverflow/$1\">");
If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /body/div/a#href , where /body/div is path to menu section in HTML.
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/), you may want to give it a try (according to the page, it has 'link extraction, for crawling through web pages or harvesting email addresses'; but I've never used it, so I can't help much).
If on the other hand the HTML is anything but valid, you may want to use http://ccil.org/~cowan/XML/tagsoup/ - it might work, or it might not, on websites we've tried, it did pretty good.
Edit: adding missing part may be done using simple concatenation after finding interesting parts

regular expressions in java

how can use a regular expression to extract a links in a web page(suppose i get the html page as a text file) using java?
This previously posted question should help you
How to use regular expressions to parse HTML in Java?
Essentially you should really look at using a HTML parser
Agree that HTML parser will make your life easier if you can include it with your build - I've used Jericho HTML Parser for something similar in the past...

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h' then 't' then 't' then 'p', but I don't think is a great solution
Any good ideas?
Added: I'm looking for some kind of pseudocode but, just in case, I'm using Java for this project in particular
Try using a HTML parsing library then search for <a> tags in the HTML document.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
not all url are in tags, some are text
and some are in links or other tags
You shouldn't scan the HTML source to achieve this.
You will end up with link elements that are not necessarily in the 'text' of the page, i.e you could end up with 'links' of JS scripts in the page for example.
Best way is still that you use a tool made for the job.
You should grab HTML tags and cover the most likely ones to have 'links' inside them (say: <h1>, <p>, <div> etc) . HTML parsers provide regex-like functionalities to filter through the content of the tags, something similar to your logic of "starts with HTTP".
[attr^=value], [attr$=value],
[attr*=value]: elements with
attributes that start with, end with,
or contain the value, e.g.
select("[href*=/path/]")
See: jSoup.
You may want to have a look at XPath or Regular Expressions.
Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
The best way should be to google for regexes. One example is this one:
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()+,;=:#]|%[0-9A-F]{2})))(?:\?((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})))?(?:#((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})*))?$/i
found in a hacker news article. As far as I can follow it, it looks good. But there is, as far as I know, no formal regex for this problem. So the best solution is to google for some and try which one matches most of what you want.

Java (Android) regular expression to strip out HTML paragraph

I have an Android application which grabs some data from an external XML source. I've stripped out some HTML from one of the XML elements, but it's in the format:
<p class="x">Some text...</p>
<p>Some more text</p>
<p>Some final text</p>
I want to extract the middle paragraph text, how can I do this? Would a regular expression be the best way? I don't really want to start including external HTML parsing libraries.
RegEx match open tags except XHTML self-contained tags
So, I'll ask the question that wraps up the linked-to answer: have you tried using an XML parser instead?
You might get some ideas from some of the other answers there, too, but I'd try to avoid the regex path. As Macarse suggested, clean this up on the server if you can. If not, wrap those three <p> elements in a single root element and parse it using SAX or something, paying attention to the 2nd paragraph element.
If it's simple, just do a regex.
If you are getting XML from an external source that you own, I would parse it there.
just doing a split: http://developer.android.com/reference/java/lang/String.html#split(java.lang.String)
on "</p><p>" and taking the second entry in the returned array would actually do it pretty quickly
The regex would probably look something like: .*?>(.*?)<.*
And you access the grouped content by calling group(1) on the Matcher object.
If you are going to parse an XML file downloaded from website, then there is nothing to do with Android.

Categories