Java Regular Expression: href without hash - java

I'm trying to build a sitemap by parsing the HTML bodies for hrefs that don't contain # (those with hashes are just sub-chapter links within some content pages).
My current regexp: <a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>
I guess I should use [^#] or !# to exclude the # from hrefs, but I could not solve it by trial and error or by googling. Thanks in advance for helping me out!

Solved it. I just added # to the [^\"] character class. :D
<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
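For reference, here is a minimal sketch of how the pattern above might be applied with java.util.regex. The class name HashlessHrefs, the method name and the sample HTML are made up for illustration; only the regex comes from the post. Note that [^\"#]* rejects every href that contains #, so page.html#section is skipped as well, not only pure fragment links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; only the regex itself is taken from the post above.
final class HashlessHrefs {
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the href values of all anchors whose href contains no '#'. */
    static List<String> extract(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group(2) would be the link text
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"index.html\">Home</a>"
                + "<a href=\"#top\">Top</a>"
                + "<a class=\"x\" href=\"page.html#sec\">Section</a>"
                + "<a href=\"about.html\">About</a>";
        System.out.println(extract(html));
    }
}
```

The pattern only sees double-quoted href values; single-quoted or unquoted attributes would need a different pattern or a real parser.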

You should not use regex to parse HTML.
It is better to use an HTML parser such as jsoup (http://jsoup.org), and then:
Document doc = Jsoup.parse(input);
Elements links = doc.select("a[href]");
for (Element each : links) {
    if (each.attr("href").startsWith("#")) continue; // skip same-page fragment links
    ...
}
So much more painless than using regex, eh!
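One difference between the two answers is worth spelling out: the regex fix drops every href that contains a #, while startsWith("#") drops only same-page links and keeps something like page.html#section. For a sitemap it may make sense to keep such links but strip the fragment. The following is only a sketch of that idea using plain string operations; the class and method names are hypothetical and not part of jsoup.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of jsoup.
final class SitemapLinks {
    /** Drops same-page links ("#...") and strips fragments from the rest, keeping first-seen order. */
    static Set<String> normalize(List<String> hrefs) {
        Set<String> pages = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href.startsWith("#")) continue;          // same-page sub-chapter link
            int hash = href.indexOf('#');
            String page = hash >= 0 ? href.substring(0, hash) : href;
            if (!page.isEmpty()) pages.add(page);        // the set removes duplicates
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("index.html", "#top", "page.html#sec", "page.html")));
    }
}
```

The href strings could come from each.attr("href") in the loop above.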

Related

Modifying a regex for a better matching

Considering we have this regex to match URLs in a page:
(https?):\\/\\/(www\\.)?[a-z0-9\\.:].*?(?=\\s)
My question is how we can improve it so that it matches, for example:
http://stackoverflow.com/questions/ask
instead of :
http://stackoverflow.com/questions/ask">home</a></div><div>
In short, I want it to filter out any of ;:'".,<>?«»“”‘’ that usually come after URLs in a page's HTML code.
Since you already are using JSOUP, I think the best way to get all links is using this library. It is not losing originality, it is a question of your code safety, readability and maintainability.
Here is an example from Jsoup Cookbook how you can collect all links from href and src attributes (basically, from your regex I see that you only want to match those):
List<String> links = new ArrayList<String>();
Document doc = Jsoup.connect(url).get();
Elements anchors = doc.select("a[href]");
Elements media = doc.select("[src]");
Elements imports = doc.select("link[href]");
// "abs:" makes jsoup return the attribute as an absolute URL
for (Element src : media) {
    links.add(src.attr("abs:src"));
}
for (Element link : imports) {
    links.add(link.attr("abs:href"));
}
for (Element anchor : anchors) {
    links.add(anchor.attr("abs:href"));
}
Note that this method returns every link as an absolute URL, even those that were relative in the page.
If you do not need that and want to show off to your boss, try the following regex:
https?://(?:www\\.)?[a-z0-9.:][^<>"]+
See the regex demo
Here, [^<>"]+ matches 1 or more symbols other than <, > and ". Thus, it keeps the match from crossing the attribute boundary. You might want to add ' there, too.
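A small sketch of how this regex could be tried out in Java, with ' added to the excluded characters as suggested. The class name UrlFinder and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the regex is the one from the answer plus ' in the excluded set.
final class UrlFinder {
    private static final Pattern URL = Pattern.compile(
            "https?://(?:www\\.)?[a-z0-9.:][^<>\"']+", Pattern.CASE_INSENSITIVE);

    /** Returns every substring of the input that the URL pattern matches. */
    static List<String> find(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"http://stackoverflow.com/questions/ask\">home</a></div><div>";
        System.out.println(find(html));
    }
}
```

The character class still allows whitespace, so in running text a match continues until the next <, > or quote character. Adding \\s to the excluded set would stop it at the first space.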

How to get specific HTML items into an ArrayList of Strings

I'm trying to understand how to make use of the HTML data from the APOD archive. Preferably my end goal is to end up with an ArrayList of Strings like so:
From this url view-source:http://apod.nasa.gov/apod/archivepix.html
get each of these 2015 February 26: Love and War by Moonlight<br>
and put them into an ArrayList
I'm more used to JSON or even XML from REST APIs, and parsing through HTML just seems crazy hard, so it'd be really helpful if someone could point me in the right direction on this.
Thanks!
Take a look at the HTML parser called jsoup.
It will make your task easy.
This link would be helpful for extracting the values from HTML.
For example:
Document doc = Jsoup.connect("http://apod.nasa.gov/apod/archivepix.html").get();
Elements links = doc.getElementsByTag("a"); // anchors carry both the href and the title text
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
Parse as you need it.
Maybe try using JAXP because you know what element it is that contains the data you want. http://docs.oracle.com/javase/tutorial/jaxp/

Using Jsoup to find elements troubles

I don't really know much about HTML parsers (I'm currently using Jsoup). I have tried many times and cannot get it to work due to my poor understanding of it, so please bear that in mind.
Anyway, I am trying to grab certain parts of an HTML document. This is what I want to extract:
<div class="detName">
<a class="detLink" title="Details for Hock part3">Hock part3</a></div>
Obviously the HTML document has multiple [div class="detName"] elements, and I want to extract all the text in each detName div. I would greatly appreciate any help.
Thank you for your time.
You can use a selector for this:
Document doc = Jsoup.parse(html); // html: your document as a String; or use Jsoup.connect(url).get()
for (Element element : doc.select("div.detName")) {
    System.out.println(element.text()); // Print the text of that element
}

Java Jericho hyperlink parsing

I'm trying to figure out a way to get all hyperlinks in a webpage, except if they are in an anchor tag (<a>).
For this I'm using the Jericho parser.
My initial approach was to take the difference between
List<Element> elementList = source.getAllElements(); and
getAllElements(HTMLElementName.A), but other elements might also contain an anchor link within them, so I don't think that's the right approach.
I recommend Jsoup for HTML processing.
Here's an example of how you can get all links (= a-tags with an href attribute):
Document doc = Jsoup.connect("http:// - link here -").get(); // Connect to the website and parse its HTML
Elements links = doc.select("a[href]"); // Select all 'a' tags with an 'href' attribute
for (Element element : links) { // Iterate over all links (example)
    // process element
}
Documentation:
Selector API (DOM API is available too)
Cookbook (Examples)
list links (Example)
JavaDoc
Btw, can you explain this part a bit more: "except if they are in an anchor tag"?

Read href inside anchor tag using Java

I have an HTML snippet like this :
<a href="XXXXXXXXXX">View or apply to job</a>
I want to read href value XXXXXXXXXX using Java.
Point to note: I am reading the HTML file from a URL using new InputStreamReader(url.openStream()).
I am getting a complete HTML file, and above snippet is a part of that file.
How can I do this?
Thanks
Karunjay Anand
Use an HTML parser like Jsoup. The API is easy to learn, and for your case the following code snippet will do:
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000); // second argument is the timeout in milliseconds
Elements links = doc.select("a[href]"); // a with href
for (Element link : links) {
    System.out.println("Href = " + link.attr("abs:href"));
}
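The abs:href key asks jsoup for the link resolved against the page's base URL. As a rough illustration of what that resolution means, here is the same idea with only java.net.URI. The class name AbsoluteLinks and the example URLs are hypothetical, and jsoup's own handling may differ in detail, especially for malformed hrefs.

```java
import java.net.URI;

// Hypothetical helper showing relative-to-absolute resolution with the JDK only.
final class AbsoluteLinks {
    /**
     * Resolves href against baseUri. URI.create throws IllegalArgumentException
     * for syntactically invalid input (for example an href containing spaces).
     */
    static String toAbsolute(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/index.html";
        System.out.println(toAbsolute(base, "page2.html"));
        System.out.println(toAbsolute(base, "../img/a.png"));
        System.out.println(toAbsolute(base, "https://other.org/x"));
    }
}
```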
Use an HTML parser like TagSoup or something similar.
You can use Java's own HTMLEditorKit for parsing HTML. This way you won't need to depend on any third-party HTML parser. Here is an example of how to use it.
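A minimal sketch of that approach, assuming the Swing HTML classes (javax.swing.text.html) are available in your JDK. The class name SwingHrefReader is made up for illustration. The callback collects the href of every a tag the parser reports; keep in mind that HTMLEditorKit is documented as supporting HTML 3.2, so it may cope less well with modern pages than a dedicated parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the JDK's Swing HTML parser.
final class SwingHrefReader {
    /** Returns the href attribute of every <a> start tag found in the given HTML. */
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        result.add(href.toString());
                    }
                }
            }
        };
        try {
            // 'true' tells the parser to ignore charset directives in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"XXXXXXXXXX\">View or apply to job</a></body></html>";
        System.out.println(hrefs(html));
    }
}
```

When reading from a URL as in the question, the StringReader could be replaced by the InputStreamReader that wraps url.openStream().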
