Correcting parsed URLs in Java

I am creating an HTML parser that gets the HTML from a given URL, finds the navigation menu HTML, and puts it into a String. The URLs in the copied HTML are relative, so they need the site part (the "www.stackoverflow.com" part) added. How can I find the existing URLs in the String and add the missing part so that they work?
The URLs in the String are of the form:
<a href="/questions/11744851.cfm">
and I need to make them in the following form:
<a href="www.stackoverflow.com/questions/11744851.cfm">

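Try using this regular expression with the replaceAll() method. The pattern consumes the leading slash so the result has no double slash, and [^\"]* stops at the closing quote so that two links on the same line are not merged:
str = subString.replaceAll("<a href=\"/([^\"]*)\">", "<a href=\"http://www.stackoverflow.com/$1\">");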
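A minimal sketch of that replacement as a self-contained class, assuming the links have the form shown in the question (a root-relative href as the first attribute of the tag). The class name LinkFixer and the constant BASE are made up for illustration.

```java
import java.util.regex.Matcher;

class LinkFixer {
    // Hypothetical base URL; substitute the site you are actually scraping.
    static final String BASE = "http://www.stackoverflow.com";

    // Prefixes BASE to every root-relative href ("/something") in an <a> tag.
    static String absolutize(String html) {
        // (?!/) skips protocol-relative URLs such as "//cdn.example.com/x".
        // [^\"]* stops at the closing quote, so each link is matched on its own.
        return html.replaceAll(
                "<a href=\"(/(?!/)[^\"]*)\"",
                "<a href=\"" + Matcher.quoteReplacement(BASE) + "$1\"");
    }

    public static void main(String[] args) {
        String menu = "<a href=\"/questions/11744851.cfm\">One</a> <a href=\"/tags\">Two</a>";
        System.out.println(absolutize(menu));
    }
}
```

Links that are already absolute are left alone because they do not start with a single slash.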

If the XHTML is valid XML, the easiest way is to parse it as XML and use XPath (for example /html/body/div/a/@href, where /html/body/div is the path to the menu section in the HTML).
There is also a project called HTMLParser (http://htmlparser.sourceforge.net/), you may want to give it a try (according to the page, it has 'link extraction, for crawling through web pages or harvesting email addresses'; but I've never used it, so I can't help much).
If, on the other hand, the HTML is anything but valid, you may want to use TagSoup (http://ccil.org/~cowan/XML/tagsoup/). It might work or it might not; on the websites we've tried, it did pretty well.
Edit: adding the missing part can be done with simple concatenation once you have found the relevant parts.
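A rough sketch of the XPath route using only the JDK's built-in XML APIs. It assumes the input is well-formed XML without a DOCTYPE or namespace declaration, and that the menu is a div with id "menu"; that id, the class name MenuLinks, and the base parameter are illustrative assumptions, not something taken from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MenuLinks {
    // Returns base + href for every link inside the (hypothetical) menu div.
    static List<String> absoluteLinks(String xhtml, String base) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Select the href attribute nodes of all <a> elements under the menu.
            NodeList hrefs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@id='menu']//a/@href", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                // Simple concatenation, as suggested in the edit above.
                result.add(base + hrefs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Real-world pages are rarely well-formed, so in practice this would likely sit behind a cleanup step such as TagSoup.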

Related

How to fix hanging html tags in HTML fragment?

I am getting a possibly ill-composed HTML fragment from an external source:
<p>Include all the information someone would need to answer your <i><i>question<p>
How can I make it safe for rendering within a bigger HTML document by closing all hanging HTML tags in Java?
You can try to parse the incoming string as XML; there are plenty of tools that do that. If parsing fails, the HTML is malformed (for instance, not all tags are correctly closed).
If you need stricter validation, you can additionally validate it against an XSD.
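A minimal sketch of the "try to parse it as XML" check with the JDK's DOM parser. The fragment is wrapped in a dummy root element because a fragment may have several top-level elements; the class and method names are illustrative. Note that this only detects the problem, it does not repair it, and HTML-only entities such as &nbsp; will also be reported as errors.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class FragmentCheck {
    // Returns true if the fragment parses as XML once wrapped in a root element.
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (SAXException e) {
            // Unclosed or mismatched tags end up here.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```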
You can achieve that by writing your own custom parser in Java and fixing the tags.
The idea is to collect all opening tags and look for the matching closing tag of each in the string.
If no closing tag is found, you can add one.
You need to handle duplicates and the valid tags that come before and after.
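A deliberately naive sketch of that idea: scan the tags with a regex, keep a stack of the ones still open, and append closing tags for whatever is left at the end. The class name, the small set of void elements, and the regex are assumptions for illustration. It only balances tags and does not make the nesting valid HTML, so a real cleaner library is likely the better choice.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCloser {
    // group 1: "/" for a closing tag, group 2: tag name, group 3: "/" for <br/> style
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");
    // Elements that never take a closing tag (incomplete list, for illustration).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "meta", "link"));

    static String closeOpenTags(String fragment) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(fragment);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (!m.group(3).isEmpty() || VOID.contains(name)) {
                continue; // self-closing or void element: nothing to balance
            }
            if (m.group(1).isEmpty()) {
                open.push(name); // opening tag
            } else {
                open.removeFirstOccurrence(name); // closes the most recent match
            }
        }
        StringBuilder out = new StringBuilder(fragment);
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>'); // innermost first
        }
        return out.toString();
    }
}
```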
Otherwise, you can try one of the handy open-source parsers that do this:
http://java-source.net/open-source/html-parsers
http://htmlcleaner.sourceforge.net/ looks like a good option.
Hope this helps.

How to delete specific html class with content using Java Html Class

Recently I have been working on an Android project. I am parsing data from the WordPress API, but the post content is in HTML format and I have to remove the HTML tags. Using the Html.fromHtml().toString() Java method I removed all the tags, but there are some image captions that I also have to delete. To delete a caption I have to find the tag by its class. So how can I delete this content using the Html class?
<p class="wp-caption-text">android m marshmallow</
EDIT:
I solved my problem using a regular expression.
Build a regular expression for your specific HTML and use it like this:
yourHtml = yourHtml.replaceAll("Your_Regular_Expression","");
yourHtml = Html.fromHtml(yourHtml).toString();
If you want to get a match you can try this:
<(\w+).*?class="wp-caption-text".*?>[\s\S]*?<\/\1>
Regex101
I'd like to mention that this is not a perfect solution. Regular expressions are not very good at parsing HTML, since the structures in that markup language are too complex to be fully parseable by regular expressions. See here
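A small sketch of removing the caption with plain java.util.regex, so that it can be tried outside Android before the result is passed to Html.fromHtml(). The class and method names are illustrative. The pattern here keeps the attribute search inside a single opening tag by using [^>]*, which is a tightening assumed for this sketch so that a match cannot begin at an earlier, unrelated tag.

```java
import java.util.regex.Pattern;

class CaptionStripper {
    // <tag ... class="wp-caption-text" ...> anything </tag>, with the same tag name.
    private static final Pattern CAPTION = Pattern.compile(
            "<(\\w+)[^>]*?class=\"wp-caption-text\"[^>]*>[\\s\\S]*?</\\1>");

    // Removes every element whose class attribute is exactly "wp-caption-text".
    static String strip(String html) {
        return CAPTION.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Intro</p><p class=\"wp-caption-text\">android m marshmallow</p><p>Body</p>";
        System.out.println(strip(html));
    }
}
```

Nested elements with the same tag name inside the caption would still confuse it, which is the usual limit of the regex approach.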

how to capture content in a string using a regular expression in java

I would like to parse an HTML form and pull out the filenames of any embedded images.
So the string could look like:
{
...
random HTML content
image1.png
more random HTML content
image3.png
...
}
From the above I would like to write a function in Java that returns to me
{image1.png, image3.png}.
I have a regular expression that returns to me only the last image name (image3.png) but it disregards previous image names. How can I capture all of them using regex?
All / any help would be appreciated.
https://stackoverflow.com/a/2059614/684934 gives a good hint. More specifically, you're probably looking for something like [a-zA-Z0-9_\-]+\.(png|jpg|gif|jpeg|tif)
Note, however, that this is a regex and only looks for sequences of characters. If you are looking at a site that serves up dynamic images using servlets, for example, and the resource URI ends with something like .jsp or .do instead of a normal image file extension, then the regex will completely fail. It will also pick up any "image names" from any text that happens to match but does not actually represent an image on the page.
To do the job properly, you will need to use some sort of DOM and traverse the <img> elements. (And the <button> elements, which may be of type image... there are probably more tags that can have images.)
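For the regex route, getting only the last match usually means the pattern was applied once to the whole string. A minimal sketch that collects every match by calling Matcher.find() in a loop, using the pattern suggested above; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageNames {
    private static final Pattern IMAGE =
            Pattern.compile("[a-zA-Z0-9_\\-]+\\.(png|jpg|gif|jpeg|tif)");

    // Returns every image-like file name in the order it appears.
    static List<String> find(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = IMAGE.matcher(html);
        while (m.find()) {       // each call advances to the next match
            names.add(m.group()); // group() is the whole matched name
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "random <img src=\"image1.png\"> more <img src=\"image3.png\">";
        System.out.println(find(html)); // prints [image1.png, image3.png]
    }
}
```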

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h' then 't' then 't' then 'p', but I don't think that is a great solution.
Any good ideas?
Added: I'm looking for some kind of pseudocode but, just in case, I'm using Java for this project in particular
Try using an HTML parsing library, then search for <a> tags in the HTML document.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Not all URLs are in tags; some are in plain text,
and some are in links or other tags.
You shouldn't scan the HTML source to achieve this.
You will end up with link elements that are not necessarily in the 'text' of the page, i.e. you could end up with 'links' to JS scripts in the page, for example.
Best way is still that you use a tool made for the job.
You should grab HTML tags and cover the most likely ones to have 'links' inside them (say: <h1>, <p>, <div> etc) . HTML parsers provide regex-like functionalities to filter through the content of the tags, something similar to your logic of "starts with HTTP".
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. select("[href*=/path/]")
See: jSoup.
You may want to have a look at XPath or Regular Expressions.
Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
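A rough sketch of the "scan the source for http://" part using only java.util.regex. The pattern is a loose assumption: it treats whitespace, quotes, and angle brackets as the end of a URL, so trailing punctuation in running text may be included. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlScanner {
    // http:// or https:// followed by anything up to whitespace, a quote, or < >.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Returns every URL-looking substring, whether in an attribute or plain text.
    static List<String> findUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

This replaces the character-by-character check for 'h', 't', 't', 'p' with a single pattern, but it still cannot tell a visible link from a URL inside a script.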
The best way should be to google for regexes. One example is this one:
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()+,;=:#]|%[0-9A-F]{2})))(?:\?((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})))?(?:#((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})*))?$/i
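found in a Hacker News article. As far as I can follow it, it looks good. As far as I know, though, there is no canonical regex for this problem, so the best approach is to search for a few and try which one matches most of what you want. Note that this one is anchored with ^ and $, so it tests whether a whole string is a URL; to search inside HTML you would have to drop the anchors.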

Java (Android) regular expression to strip out HTML paragraph

I have an Android application which grabs some data from an external XML source. I've stripped out some HTML from one of the XML elements, but it's in the format:
<p class="x">Some text...</p>
<p>Some more text</p>
<p>Some final text</p>
I want to extract the middle paragraph text. How can I do this? Would a regular expression be the best way? I don't really want to start including external HTML parsing libraries.
RegEx match open tags except XHTML self-contained tags
So, I'll ask the question that wraps up the linked-to answer: have you tried using an XML parser instead?
You might get some ideas from some of the other answers there, too, but I'd try to avoid the regex path. As Macarse suggested, clean this up on the server if you can. If not, wrap those three <p> elements in a single root element and parse it using SAX or something, paying attention to the 2nd paragraph element.
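A minimal sketch of the "wrap in a single root element and parse" suggestion. It uses the JDK's DOM parser rather than SAX to keep the example short, which is a substitution on my part; the class and method names are illustrative, and it assumes the three paragraphs are well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MiddleParagraph {
    // Returns the text of the second <p> element, or null if there is none.
    static String secondParagraph(String fragment) {
        try {
            // The dummy <root> makes the fragment a single well-formed document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            NodeList paragraphs = doc.getElementsByTagName("p"); // document order
            return paragraphs.getLength() >= 2 ? paragraphs.item(1).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }
}
```

These classes ship with the Java platform, so no external parsing library is needed.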
If it's simple, just do a regex.
If you are getting XML from an external source that you own, I would parse it there.
Just doing a split (http://developer.android.com/reference/java/lang/String.html#split(java.lang.String))
on "</p><p>" and taking the second entry in the returned array would actually do it pretty quickly.
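A tiny sketch of the split approach. It assumes exactly the layout shown in the question and allows optional whitespace between the paragraphs, since the sample has line breaks there; the class and method names are illustrative.

```java
class SplitParagraphs {
    // Splits between paragraphs and returns the middle piece, or null if the
    // input does not contain three paragraphs in the expected layout.
    static String middle(String html) {
        // \s* tolerates the newline between "</p>" and "<p>".
        String[] parts = html.split("</p>\\s*<p>");
        return parts.length >= 3 ? parts[1] : null;
    }
}
```

This breaks as soon as the second or third opening tag carries an attribute, which is the price of skipping a parser.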
The regex would probably look something like: .*?>(.*?)<.*
And you access the grouped content by calling group(1) on the Matcher object.
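A sketch of the Matcher and group(1) mechanics. It uses a more specific pattern than the one quoted above, because applied to the whole fragment that one would capture the first paragraph rather than the middle one; the pattern, class name, and method name here are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphRegex {
    // A <p> tag with optional attributes, then the shortest possible body.
    // DOTALL lets the body span line breaks.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL);

    // Returns the body of the n-th paragraph (1-based), or null if there are fewer.
    static String nth(String html, int n) {
        Matcher m = PARAGRAPH.matcher(html);
        for (int i = 1; m.find(); i++) {
            if (i == n) {
                return m.group(1); // group(1) is the text between the tags
            }
        }
        return null;
    }
}
```

Calling nth(html, 2) on the three paragraphs from the question would return the middle one.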
If you are going to parse an XML file downloaded from a website, then this has nothing to do with Android.
