HTML type string parsing question!


Given an <a> tag whose link text is "look at the Google map":
Is there any parser that can get the link (www.google.com/map) from the <a> tag?
Or is the best way just to write a custom one?

jQuery, for instance:
var href = $('a.more-link').attr('href');

There are many third-party solutions, but I am not sure which ones exist for Java; maybe the HTML Agility Pack (a .NET library) exists in a version for Java.
Another solution would be to use a regex:
/<a\s+[^<]*?href\s*=\s*(?:(['"])(.+?)\1.*?|(.+?))>/
I fixed the regex to handle the problems suggested in the comments.
I also looked up some real HTML parsers for Java, in case you find you need more than the regex approach:
http://htmlparser.sourceforge.net/
http://jericho.htmlparser.net/docs/index.html
http://jsoup.org/
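As a rough illustration of the regex approach above in Java, here is a minimal sketch using java.util.regex with the same pattern. The class and method names (LinkExtractor, extractHrefs) are made up for this example. The pattern is only suitable for small, predictable snippets; for arbitrary pages, one of the parsers listed above is the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class LinkExtractor {
    // The regex from the answer, written as a Java string literal.
    // Group 2 holds a quoted href value, group 3 an unquoted one.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\s+[^<]*?href\\s*=\\s*(?:(['\"])(.+?)\\1.*?|(.+?))>",
            Pattern.CASE_INSENSITIVE);

    // Returns the href value of every <a> tag the pattern matches, in order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.google.com/map\">look at the Google map</a>";
        System.out.println(extractHrefs(html)); // prints [www.google.com/map]
    }
}
```

Note that "." does not match line breaks by default, so a tag split across lines would be missed by this sketch.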

Related

HTML-parsing via Java

How do you parse html-docs in Java? I've read a lot of articles about parsing, but haven't found the best way to do it.

Try using Jsoup https://jsoup.org/. It is one of the most widely used HTML parsers.

Safe html in java

I have some input containing HTML like <br> <b> <i> etc. I need a way to escape only the "bad" HTML that exposes my site to XSS etc.
After hours of Googling I found GWT, which looks kind of promising.
What is the recommended way to escape bad HTML?
Edit:
Let me clear things up.
I am using a JavaScript text editor which outputs HTML. Wouldn't it be much easier if I used something like BBCode?

OWASP AntiSamy is a project for just that. If you need users to be able to submit structured text, look at markdown (imho a lot better than BBCode).

Google caja is a tool for making third party HTML, CSS and JavaScript safe to embed in your website.

Play Framework 2 already offers a solution.
Its templates escape dynamic content by default, which is really nice; note that the @Html() function does the opposite and outputs raw, unescaped HTML.
I really love Play 2.

You might want to just escape all HTML. If you want users to be able to use basic HTML tags like <b> or <i>, you could replace them with [b] and [i] (if your forum, or whatever you're creating, can use BBCode), then replace all "<" and ">" with "&lt;" and "&gt;".
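A minimal sketch of that idea in plain Java, assuming only bare <b> and <i> tags (no attributes) are allowed through. The class and method names are hypothetical, and for real user input a vetted sanitizer such as OWASP AntiSamy remains the safer option.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class BasicHtmlEscaper {
    // Matches only attribute-free <b>, </b>, <i> and </i> tags.
    private static final Pattern ALLOWED_TAG =
            Pattern.compile("<(/?)([bi])>", Pattern.CASE_INSENSITIVE);

    // Converts the allowed tags to BBCode, then escapes everything else.
    static String toBbcodeAndEscape(String input) {
        String bbcode = ALLOWED_TAG.matcher(input).replaceAll("[$1$2]");
        return bbcode
                .replace("&", "&amp;")   // must run first so later entities stay intact
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(toBbcodeAndEscape("<b>hi</b> <script>alert(1)</script>"));
        // prints [b]hi[/b] &lt;script&gt;alert(1)&lt;/script&gt;
    }
}
```

The BBCode is then turned back into <b> and <i> only at render time, so nothing typed by the user ever reaches the page as a raw tag.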

get part with regex

I need to get everything between
onmouseout="this.style.backgroundColor='#fff'">
and the following <
in this case:
onmouseout="this.style.backgroundColor='#fff'">example<
I would like to get the word example.
Here is a more complicated example of where it should work as well:
onmouseout="this.style.backgroundColor='#fff'">going to drink?<br></span><span title="Juist!" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'">Exactly!</span></span></div></div>
So here I need two of them back (and not joined).
Could someone help? I suck at regex.
Someone edited my tag to javascript.
I need a solution to use in Java; I just get a file as plain text. So JavaScript or HTML solutions are not really helpful.

Regex with HTML? Well, if you only have to parse a few lines, then OK. But in general it is better to use an HTML parser (because HTML is not a regular language).
This is pure gold: https://stackoverflow.com/a/1732454/434171
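For the narrow "few lines" case described in the question, a minimal Java sketch could look like the following. It assumes the marker text is always exactly the onmouseout attribute quoted in the question; the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class TextAfterMarker {
    // The literal text that precedes each value we want (taken from the question).
    private static final String MARKER =
            "onmouseout=\"this.style.backgroundColor='#fff'\">";

    // Pattern.quote treats the marker as plain text; group 1 captures
    // everything up to (but not including) the next '<'.
    private static final Pattern AFTER_MARKER =
            Pattern.compile(Pattern.quote(MARKER) + "([^<]*)<");

    // Returns each captured piece separately, in document order.
    static List<String> extract(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = AFTER_MARKER.matcher(text);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(extract(MARKER + "example<")); // prints [example]
    }
}
```

Each match is returned as its own list entry, so the two pieces in the longer example come back separately rather than joined.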

Java library using css selectors to parse XML

Is there a jQuery-like Java/Android library that uses CSS selectors to parse XML?
Like:
String desc = myXML.find("bloc[type=pro]").get(0).attr("description");
Chainability is also what I'm looking for, in the same way as jQuery...
I hope this exists!

While initially designed as an HTML parser with CSS selector support, Jsoup works fine for XML documents as well if your sole intent is to extract data, not to manipulate it.
Document document = Jsoup.parse(xmlString);
String desc = document.select("bloc[type=pro]").get(0).attr("description");
// ...
As you can see, the syntax is almost identical to what you have in the question.

Jericho is what you are looking for.
Your example would look like
String desc = source.getFirstElement( "type", "pro" ).getAttributeValue( "description" );
Parsing HTML with Jericho works like a charm, so I guess it's even easier for well-structured XML.

Since there are some bugs in other libraries like Jsoup, and Jericho is different from what I was expecting,
I wrote a class extending org.xml.sax.helpers.DefaultHandler which parses the XML. I then wrote two other classes that look like Element and Elements from Jsoup, containing two functions: find, which handles the CSS3 selector, and attr, which returns the attribute value.
I'm now cleaning up and commenting that code... I'll post the library later for anyone who is interested.
xmlDoc.find("bloc[type=Pro]>act").attr("label");
is now possible, like in jQuery!
Edit:
Here is the link to the code for anyone who is interested: Google Code Project
Moved to GitHub: https://github.com/ChristopheCVB/JavaXMLQuery
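To give an idea of the DefaultHandler approach, here is a minimal sketch using only the JDK's SAX parser. It is not the code from the library linked above and does not implement CSS selectors; it just collects one attribute from elements that match a tag name and an attribute value. All names in it (AttributeCollector, collect) are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; not the JavaXMLQuery implementation.
class AttributeCollector extends DefaultHandler {
    private final String tag;
    private final String filterAttr;
    private final String filterValue;
    private final String wantedAttr;
    private final List<String> values = new ArrayList<>();

    AttributeCollector(String tag, String filterAttr, String filterValue, String wantedAttr) {
        this.tag = tag;
        this.filterAttr = filterAttr;
        this.filterValue = filterValue;
        this.wantedAttr = wantedAttr;
    }

    // Called by the SAX parser for every opening tag.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals(tag) && filterValue.equals(attrs.getValue(filterAttr))) {
            String value = attrs.getValue(wantedAttr);
            if (value != null) {
                values.add(value);
            }
        }
    }

    // Roughly what a selector like tag[filterAttr=filterValue] plus attr(wantedAttr) would do.
    static List<String> collect(String xml, String tag, String filterAttr,
                                String filterValue, String wantedAttr) {
        AttributeCollector handler =
                new AttributeCollector(tag, filterAttr, filterValue, wantedAttr);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.values;
    }
}
```

With this sketch, a call such as AttributeCollector.collect(xml, "bloc", "type", "pro", "description") returns the description of every matching bloc element in document order.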

I use XPath to solve that issue. An XML parsing library like JDOM works fine with XPath. Maybe jQuery looked at how XPath works :p
//bloc[@type="pro"][1]/@description
XPath indexes start from 1, not 0.
https://www.w3schools.com/xml/xpath_syntax.asp
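An XPath expression of this kind can also be evaluated with the javax.xml.xpath API that ships with the JDK, without JDOM. The sketch below is only an illustration: the class and method names are made up, and the expression is wrapped in parentheses so that [1] picks the first match in the whole document.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any library.
class XPathAttributeLookup {
    // Returns the description attribute of the first bloc whose type is "pro",
    // or an empty string if nothing matches.
    static String firstProDescription(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(
                    "(//bloc[@type='pro'])[1]/@description",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<bloc type=\"free\" description=\"a\"/>"
                + "<bloc type=\"pro\" description=\"b\"/>"
                + "</root>";
        System.out.println(firstProDescription(xml)); // prints b
    }
}
```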

The droidQuery library can do many of the things you are looking for. Although the syntax is a little different, you can:
Get view attributes using chained, jQuery-style commands, such as:
CharSequence text = $.with(this, R.id.myTextView).attr("text");
Parse XML:
Document dom = $.parseXML(myXMLString);
If you are a fan of jQuery, you will be pleased to see that nearly all of the features it provides are included in droidQuery, and although the syntax may differ at times, its major goal is to be as syntactically close to jQuery as possible.

regular expressions in java

How can I use a regular expression to extract the links in a web page (suppose I get the HTML page as a text file) using Java?

This previously posted question should help you:
How to use regular expressions to parse HTML in Java?
Essentially, you should really look at using an HTML parser.

Agreed, an HTML parser will make your life easier if you can include it in your build. I've used Jericho HTML Parser for something similar in the past...
