I am planning to write a Java program that reads some exchange rates from a web site (http://www.doviz.com), and I was wondering about the best approach: should I read only the content I need, or read the whole page and strip out the parts that I need?
Any help is appreciated.
My advice is to use the jsoup library.
It is very easy to parse external content with its CSS/jQuery-like selector syntax:
// One line is enough to fetch and parse an external page
Document doc = Jsoup.connect("http://jsoup.org").get();
// DOM-like ("JavaScript-like") syntax
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
// jQuery/CSS-like selector syntax
Elements resultLinks = doc.select("h3.r > a");
Elements pngs = doc.select("img[src$=.png]");
Just add the jsoup.jar library to your classpath and enjoy!
It is open source and free to use, of course.
I'd suggest implementing an RSS-reading mechanism for the webpage (programmatically) and extracting the content of the RSS XML using standard parsers.
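As a rough sketch of that "standard parsers" approach, the JDK's built-in DOM parser can pull item titles out of an RSS document. The feed content below is purely hypothetical (whether the site actually offers an RSS feed, and what its items contain, is an assumption); a literal string stands in for the fetched feed to keep the sketch self-contained:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RssSketch {
    // Collects the text of every <title> element in an RSS document.
    public static List<String> extractTitles(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));
        NodeList titles = doc.getElementsByTagName("title");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < titles.getLength(); i++) {
            result.add(titles.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // In practice this XML would be fetched from the feed URL;
        // the rates shown here are made up for illustration.
        String rss = "<rss version=\"2.0\"><channel>"
                + "<item><title>USD 8.45</title></item>"
                + "<item><title>EUR 10.12</title></item>"
                + "</channel></rss>";
        System.out.println(extractTitles(rss));
        // [USD 8.45, EUR 10.12]
    }
}
```

The same pattern works for `<description>` or any other element the feed exposes.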
I am going to create a desktop client for my university's parent web interface. When logged in, a webpage displays the student details in a table. I want to retrieve those details using Java.
A short Google search brought me to this library: https://jsoup.org
It seems it can send HTTP requests (to fetch the data from your university's website) and also parse the returned HTML, so you can simply search for the tables you need.
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
If you don't know how HTML is structured, you should take a short tutorial on writing simple HTML so you understand what is going on and what you are looking for.
I need to include HTML code into my RSS feed. I use Java ROME RSS library:
SyndFeed feed = new SyndFeedImpl();
feed.setFeedType("rss_2.0");
feed.setTitle("Title");
feed.setLink("example.com");
feed.setDescription("Description");
List<SyndEntry> entries = new ArrayList<>();
SyndEntryImpl entry = new SyndEntryImpl();
entry.setTitle("Name");
SyndContent syndContent = new SyndContentImpl();
syndContent.setType("text/html");
syndContent.setValue("<p>Hello, World !</p>");
entry.setDescription(syndContent);
entries.add(entry);
feed.setEntries(entries);
Writer writer = new FileWriter("rss.xml");
SyndFeedOutput output = new SyndFeedOutput();
output.output(feed, writer);
writer.close();
but the output XML contains an encoded description:
<description>&lt;p&gt;Hello, World !&lt;/p&gt;</description>
How do I properly include unencoded HTML code with ROME?
Analysis
According to the RSS Best Practices Profile: 4.1.1.20.4 description:
The description must be suitable for presentation as HTML. HTML markup must be encoded as character data either by employing the HTML entities &lt; ("<") and &gt; (">") or a CDATA section.
Therefore, the current output is correct.
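To see why the escaping is unavoidable, note that any conforming XML serializer must encode < and > inside character data; this is not ROME-specific behavior. A quick check with the JDK's own DOM and Transformer APIs (no ROME involved) shows the same escaping:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapeDemo {
    // Serializes a <description> element whose *text content* is HTML markup.
    public static String serializeDescription(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element description = doc.createElement("description");
        // The HTML is set as text content, so the serializer must escape it.
        description.setTextContent(html);
        doc.appendChild(description);
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeDescription("<p>Hello, World !</p>"));
        // The markup comes out as &lt;p&gt;Hello, World !&lt;/p&gt;
    }
}
```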
CDATA encoding
If it is desirable to have CDATA section (CDATA encoding), the following piece of code may be used:
final List<String> contents = new ArrayList<>();
contents.add("<p>HTML content is here!</p>");
// The content module writes the HTML into a <content:encoded> element
final ContentModule module = new ContentModuleImpl();
module.setEncodeds(contents);
entry.getModules().add(module);
Additional references
RSS Best Practices Profile.
Putting content:encoded in RSS feed using ROME - Stack Overflow.
Re: CDATA Support - Mark Woodman - net.java.dev.rome.dev - MarkMail.
rome-modules/ContentModuleImplTest.java at master · rometools/rome-modules · GitHub.
description versus content:encoded
Should I use both the description and content:encoded nodes, or only one of them, in my RSS feed item?
And how about the following?
An item may also be complete in itself, if so, the description contains the text (entity-encoded HTML is allowed; see examples), <…>
According to the RSS 2.0 specification, using the description element is enough, exactly as you have quoted. Here are the examples: Encoding & item-level descriptions (RSS 2.0 at Harvard Law).
For additional details please refer to the question: Difference between description and content:encoded tags in RSS2 - Stack Overflow.
I'm parsing data from a JSON file. Now I have data like this:
String Content = <p><img class="alignleft size-full wp-image-56999" alt="abdullah" src="http://www.some.com/wp-content/uploads/2013/12/imageName.jpg" width="348" height="239" />Text</p>
<p>Text</p> <p>Text</p><p>The post Some Text appeared first on Some Webiste</p>
Now, I want to divide this string into two pieces. First, I want to get this URL from the src attribute:
http://www.some.com/wp-content/uploads/2013/12/imageName.jpg
and store it in a variable. Also, I want to remove the last line (The post ... appeared first on ...) and store the remaining text in another variable.
So, the questions are:
Is it possible to do that?
If so, how can I achieve it?
In Java
Get a Document object (SAXReader is from the dom4j library):
Document originalDoc = new SAXReader().read(new StringReader("<div>data</div>"));
Then you can parse it (read this tutorial):
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
In JavaScript
to get attribute
var url = document.getElementsByTagName('img')[0].getAttribute('src');
If you have a string and you want a document object, use jQuery:
var stringValue = '<div>data</div>';
var myObject = $(stringValue);
Use String.substring(beginIndex, endIndex) to get the link from the src attribute.
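A rough sketch of that indexOf/substring approach, assuming the src attribute value is double-quoted as in the sample (class and method names here are made up for illustration):

```java
public class SrcExtractor {
    // Returns the value of the first src="..." attribute, or null if absent.
    public static String extractSrc(String html) {
        String marker = "src=\"";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = html.indexOf('"', start);
        return end < 0 ? null : html.substring(start, end);
    }

    public static void main(String[] args) {
        String content = "<p><img class=\"alignleft\" "
                + "src=\"http://www.some.com/wp-content/uploads/2013/12/imageName.jpg\" "
                + "width=\"348\" />Text</p>";
        System.out.println(extractSrc(content));
        // http://www.some.com/wp-content/uploads/2013/12/imageName.jpg
    }
}
```

This only works while the markup keeps that exact shape; an HTML parser is the robust option.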
Learn to use an HTML parser like jsoup; it will be useful in the near future.
If it's a well-structured string, you can parse it using any DOM parser and extract data from it.
What I'm trying to do is parse XML through Java, and I only want a snippet of text from each tag. For example, here is the XML:
<data>\nSome Text :\n\MY Spectre around me night and day. Some More: Like a wild beast
guards my way.</data>
<data>\nSome Text :\n\Cruelty has a human heart. Some More: And Jealousy a human face
</data>
So far I have this:
NodeList ageList = firstItemElement.getElementsByTagName("data");
Element ageElement =(Element)ageList.item(0);
NodeList textAgeList = ageElement.getChildNodes();
out.write("Data : " + ((Node)textAgeList.item(0)).getNodeValue().trim());
I'm trying to get just the "Some More: ..." part; I don't want the whole tag.
I'm also trying to get rid of all the \n characters.
If you're not restricted to the standard DOM API, you could try to use jOOX, which wraps standard DOM. Your example would then translate to:
// Use jOOX's jquery-like API to find elements and their text content
for (String string : $(firstItemElement).find("data").texts()) {
// Use standard String methods to replace content
System.out.println(string.replace("\\n", ""));
}
I would take all of the element text and use regular expressions to capture the relevant parts.
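A sketch of that regex idea, assuming the wanted part always follows the literal "Some More:" label as in the sample data (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnippetExtractor {
    // Captures everything after "Some More:" and strips literal \n sequences.
    public static String someMorePart(String data) {
        Matcher m = Pattern.compile("Some More:\\s*(.*)", Pattern.DOTALL)
                .matcher(data.replace("\\n", " "));
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String text = "\\nSome Text :\\nMY Spectre around me night and day. "
                + "Some More: Like a wild beast guards my way.";
        System.out.println(someMorePart(text));
        // Like a wild beast guards my way.
    }
}
```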
I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it and that one should load the page into a DOM Document to navigate it.
I've tried reading the documentation but it seems too extensive and I don't know where to begin.
Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
1. Transform the web page you are trying to scrape into an XHTML document. There are several options to do this with Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API.
2. Extract the required information using XPath expressions.
Here is a working example using JTidy and the Web Page you provided, used to extract all file names from the table.
public static void main(String[] args) throws Exception {
// Create a new JTidy instance and set options
Tidy tidy = new Tidy();
tidy.setXHTML(true);
// Parse an HTML page into a DOM document
URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
Document doc = tidy.parseDOM(url.openStream(), System.out);
// Use XPath to obtain whatever you want from the (X)HTML
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
List<String> filenames = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
filenames.add(nodes.item(i).getNodeValue());
}
System.out.println(filenames);
}
The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:] as expected.
Another cool tool that you can use is Web Harvest. It basically does everything I did above but using an XML file to configure the extraction pipeline.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
If all you are doing is scraping a table into a data file, regex will be just fine, and may even be better than using a DOM document. DOM documents use up a lot of memory (especially for really large data tables), so you probably want a SAX parser for large documents.
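For the SAX route, here is a minimal sketch that collects element text without ever building a tree. The `<data>` element name is borrowed from the earlier XML example purely for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {
    // Collects the text content of every <data> element; memory use stays
    // proportional to one element, not to the whole document.
    public static List<String> dataTexts(String xml) throws Exception {
        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("data".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("data".equals(qName)) {
                    texts.add(current.toString().trim());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><data>first</data><data>second</data></root>";
        System.out.println(dataTexts(xml));
        // [first, second]
    }
}
```

Note that SAX only works on well-formed XML; for real-world HTML you would still want a tolerant parser such as jsoup or JTidy in front of it.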