How can I use a regular expression to extract the links from a web page in Java (assuming I have the HTML page as a text file)?
This previously posted question should help you
How to use regular expressions to parse HTML in Java?
Essentially, you should really look at using an HTML parser.
Agreed, an HTML parser will make your life easier if you can include it in your build. I've used Jericho HTML Parser for something similar in the past.
I am working on an Android project that parses data from the WordPress API. The post content arrives as HTML, so I have to remove the HTML tags. Calling Html.fromHtml().toString() removes all the tags, but the text of some image captions is left behind, and I have to delete that too. To delete a caption I have to find its tag by class. How can I delete this content using the Html class?
<p class="wp-caption-text">android m marshmallow</
EDIT:
I solved my problem using a regular expression.
Write a regular expression that matches your specific HTML, then use it like this:
yourHtml = yourHtml.replaceAll("Your_Regular_Expression","");
yourHtml = Html.fromHtml(yourHtml).toString();
If you want to get a match you can try this:
<(\w+).*?class="wp-caption-text".*?>[\s\S]*?<\/\1>
Regex101
I'd like to mention that this is not a perfect solution. Regular expressions are not very good at parsing HTML, because the structures in that markup language are too complex to be parsed 100% reliably by regular expressions. See here
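For the Java side, the pattern above could be applied roughly as in the sketch below. The class and method names (CaptionStripper, stripCaptions) are made up for illustration, and the pattern is a slightly tightened variant rather than the exact one quoted: it uses [^>]* instead of .*? so the class attribute has to sit inside a single opening tag. It is still subject to the caveats about regexes and HTML mentioned above.

```java
import java.util.regex.Pattern;

class CaptionStripper {
    // Matches one element whose opening tag carries class="wp-caption-text",
    // up to the first closing tag with the same name (back-reference \1).
    // Assumption: [^>]* replaces the original .*? so the match cannot start
    // at an earlier, unrelated tag on the same line.
    private static final Pattern CAPTION = Pattern.compile(
            "<(\\w+)[^>]*?class=\"wp-caption-text\"[^>]*>[\\s\\S]*?</\\1>");

    // Returns the HTML with every caption element removed.
    static String stripCaptions(String html) {
        return CAPTION.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Intro</p>"
                + "<p class=\"wp-caption-text\">android m marshmallow</p>"
                + "<p>Body</p>";
        System.out.println(stripCaptions(html)); // prints <p>Intro</p><p>Body</p>
    }
}
```

On Android, the cleaned string could then be passed to Html.fromHtml(...).toString() as in the EDIT above.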
All of the guides out there tell me how to remove the HTML tags from a text to extract the text between them. What I am after is extracting the data that is inside the HTML tags themselves.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware trying to parse it as "standard" XML as XML-parsing is strict by nature and will fail if the page does not conform to XML markup specs (which few HTML pages do).
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like Jericho HTML Parser, which enables you to search for HTML tags as well as their attributes, or you can build a DOM on your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
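As a minimal sketch of the DOM route, the JDK's built-in JAXP parser can read the FONT example from the question, because that particular snippet happens to be well-formed XML. The class and method names (FontSizeReader, rootAttribute) are hypothetical. As noted in another answer, an XML parser is strict, so this will throw on typical real-world HTML; a lenient HTML parser is the safer choice there.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FontSizeReader {
    // Parses the markup as XML and returns the named attribute of the root
    // element (an empty string if the attribute is absent).
    // Assumption: the input is well-formed XML; anything else is rejected.
    static String rootAttribute(String markup, String attributeName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            Element root = doc.getDocumentElement();
            // XML names are case-sensitive, so "SIZE" must match the source.
            return root.getAttribute(attributeName);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<FONT SIZE=\"5\">Hello World</FONT>";
        System.out.println(rootAttribute(markup, "SIZE")); // prints 5
    }
}
```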
I have learnt how to make an HTTP GET request to retrieve data from a URL, but I would like to filter the response so that it only gives me a list of the links on the web page.
For example, if the HTML contained the following text:
<link href="http://www.thompsons.co.uk">
then it should print out:
http://www.thompsons.co.uk
I would strongly recommend that you DO NOT use regexes to "parse" HTML. Unless you have control over the formatting of the web pages you are processing, a solution based on regexes is liable to be fragile and buggy.
Instead, use a permissive HTML parser. This Question gives a number of alternatives: HTML/XML Parser for Java
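To illustrate the parser-based approach without adding a dependency, here is a rough sketch that uses the lenient HTML parser shipped with the JDK (javax.swing.text.html) as a stand-in for the third-party libraries recommended in this thread. The class and method names (LinkExtractor, extractLinks) are hypothetical, and the bundled parser only understands an old HTML dialect, so a dedicated library is likely the better choice for real pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkExtractor {
    // Collects the href attribute of every tag that has one (<a>, <link>, ...),
    // in document order.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // tags with content, e.g. <a href="...">text</a>
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs); // empty tags, e.g. <link href="...">
            }

            private void collect(MutableAttributeSet attrs) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><head><link href=\"http://www.thompsons.co.uk\"></head>"
                + "<body><a href=\"/about\">About</a></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The body of the HTTP response would be passed to extractLinks as a string; relative links such as /about would still need to be resolved against the page URL.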
You can use jsoup:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
Read the whole response in, then use a regular expression to extract the links. Read more here: http://www.mkyong.com/regular-expressions/how-to-extract-html-links-with-regular-expression/
Possible Duplicate:
What are the pros and cons of the leading Java HTML parsers?
What HTML parser would you recommend for parsing HTML?
I need the HTML parser to have one feature:
It should return only the useful text: no menu, no footer, no header information, only the text of the normal content.
I have tried Jericho HTML Parser and HtmlCleaner, but they do not seem to work the way I need.
Thanks in advance.
I'm not really sure what you're asking; an HTML parser parses HTML--what you extract out of it is up to you. I like jsoup and tagsoup.
If you want something that pulls "normal" content out of HTML, you could look at how Apache Tika handles HTML. All HTML is written differently--you have to be able to define what "normal" content is, and where it is.
look at the Google map
Is there a parser that can get the link (www.google.com/map) from the <a> tag?
Or is the best way just to write a custom one?
jQuery, for instance:
var href = $('a.more-link').attr('href');
There are many third-party solutions, but I am not sure which of them exist for Java; maybe there is a Java version of the HTML Agility Pack.
Another solution would be to use a regex:
/<a\s+[^<]*?href\s*=\s*(?:(['"])(.+?)\1.*?|(.+?))>/
Fixed the regex to handle problems suggested in comments.
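In Java, the regex above might be used along these lines. This is only a sketch: the class and method names (AnchorHrefRegex, hrefs) are invented for the example, the pattern is the one above with Java string escaping, and it inherits all the usual fragility of matching HTML with a regex (for instance, it is case-sensitive and ignores comments).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorHrefRegex {
    // Same pattern as above, escaped for a Java string literal.
    // Group 1: the opening quote, group 2: a quoted href value,
    // group 3: an unquoted href value.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\s+[^<]*?href\\s*=\\s*(?:(['\"])(.+?)\\1.*?|(.+?))>");

    // Returns the href value of every <a> tag the pattern finds.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a class=\"more-link\" href=\"http://www.google.com/map\">"
                + "look at the Google map</a>";
        System.out.println(hrefs(html)); // prints [http://www.google.com/map]
    }
}
```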
I looked up some real HTML parsers for Java, in case you find you need more than the regex approach:
http://htmlparser.sourceforge.net/
http://jericho.htmlparser.net/docs/index.html
http://jsoup.org/