Getting info from a webpage in Java

Sorry if it's kind of a big question, but I'm just looking for someone to point me in the right direction, since I have no clue where to start. I have very basic knowledge of HTML and Java.
Someone in my family has to copy every product from a supplier into his own webshop.
The problem is that he has to enter all the articles one by one by hand; I'm looking for a way to replace that work with a program.
I already have the price calculation partly working; all I need now is the product info.
http://pastebin.com/WVCy55Dj
From line 1009 to around 1030.
I need three separate strings from the three spans with the class "CatalogusListDetailTest".
From line 987 to around 1000.
I need a way to get all these images; they're on the website at www.flamingo.be/Images/Products/Large/"productID" (our first string).jpg
Sometimes there's an _A or _B suffix, as you can see in this example, so I'm looking for a way to check whether those variants exist and get those images as well.
If I could get this far I'd be very thankful! I'll figure the rest out myself. Sorry for the long post; I wanted to give as much info as possible.

You can look at the HTML parser library Jsoup; documentation reference: http://jsoup.org/cookbook/
EDIT: Code to get the product code:
Elements classElements = document.getElementsByClass("CatalogusListDetailTextTitel");
for (Element classElement : classElements) {
    if (classElement.text().contains("Productcode :")) {
        System.out.println(classElement.parent().ownText());
    }
}
Instead of document you may have to use a more specific element to get consistent results; the above code will print all the product codes.
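For the three "CatalogusListDetailTest" spans mentioned in the question, a minimal Jsoup sketch could look like the following; the product page URL is a placeholder and the class name is taken from the question, so adjust both to the real page:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ProductInfoScraper {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; replace with the actual product page from the question
        Document document = Jsoup.connect("http://www.flamingo.be/some-product-page").get();
        // The question mentions three spans with this class; print each one's text separately
        Elements spans = document.getElementsByClass("CatalogusListDetailTest");
        for (Element span : spans) {
            System.out.println(span.text());
        }
    }
}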

You can use JTidy for what you need.
Code Example:
public void downloadSinglePage(String pageLink, String targetDir) throws XPathExpressionException, IOException {
    URL url = new URL(pageLink);
    BufferedInputStream page = new BufferedInputStream(url.openStream());

    // Clean up the HTML so it can be parsed as a DOM document
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    Document response = tidy.parseDOM(page, null);

    // Evaluate the XPath expression against the cleaned document
    XPathFactory factory = XPathFactory.newInstance();
    XPath xPath = factory.newXPath();
    NodeList nodes = (NodeList) xPath.evaluate(IMAGE_PATTERN, response, XPathConstants.NODESET);

    String imageURL = nodes.item(0).getNodeValue();
    saveImageNIO(imageURL, targetDir, "image"); // pick whatever file name suits you
}
where
IMAGE_PATTERN = "//a/img/@src";
but the pattern depends on how the image is nested in the page's HTML code.
Method for saving Image using NIO:
public void saveImageNIO(String imageURL, String targetDir, String imageName) throws IOException {
    URL url = new URL(imageURL);
    // Stream the image bytes straight from the URL into the target file
    try (ReadableByteChannel rbc = Channels.newChannel(url.openStream());
         FileOutputStream fos = new FileOutputStream(targetDir + "/" + imageName + ".jpg")) {
        fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
    }
}
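The question also mentions optional _A/_B image variants. One way to handle that (a sketch, not part of the original answer) is to probe each candidate URL with an HTTP HEAD request and only download it if the server answers 200 OK; the URL pattern is taken from the question, and the helper names below are made up:
// Hypothetical helper: returns true if the server reports the image exists (HTTP 200)
public boolean imageExists(String imageURL) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(imageURL).openConnection();
    conn.setRequestMethod("HEAD");
    int code = conn.getResponseCode();
    conn.disconnect();
    return code == HttpURLConnection.HTTP_OK;
}

// Hypothetical helper: downloads the base image plus any _A/_B variants that exist
public void downloadProductImages(String productId, String targetDir) throws IOException {
    String base = "http://www.flamingo.be/Images/Products/Large/" + productId;
    saveImageNIO(base + ".jpg", targetDir, productId);
    for (String suffix : new String[] { "_A", "_B" }) {
        if (imageExists(base + suffix + ".jpg")) {
            saveImageNIO(base + suffix + ".jpg", targetDir, productId + suffix);
        }
    }
}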

Related

Jsoup not seeing some text on website

Currently I am making a program (in Java) that grabs all the streamers on Twitch (a video game streaming site) from a given URL and lists them in a text file using Jsoup.
However, no matter what I try, it seems like I can't get the streamers' names. After a while I discovered that the page source for some reason does not contain the streamers' names, which I think could be the problem.
Here is my current code.
public static void main(String[] args) throws IOException {
    int i = 0;
    PrintWriter streamerwriter = new PrintWriter("streamer.txt", "UTF-8");
    Document doc = Jsoup.connect("https://www.twitch.tv/directory/game/Hearthstone%3A%20Heroes%20of%20Warcraft").get();
    Elements streamers = doc.getElementsByClass("js-profile-link");
    for (Element streamer : streamers) {
        i++;
        System.out.println(i + "." + streamer.text());
        streamerwriter.println(i + "." + streamer.text());
    }
    streamerwriter.close();
}
Any help would be greatly appreciated.
You don't need to parse the webpage, because Twitch has an API for listing streamers:
https://streams.twitch.tv/kraken/streams?limit=20&offset=0&game=Hearthstone%3A+Heroes+of+Warcraft&broadcaster_language=&on_site=1
So you should parse the JSON data instead.
If you're wondering why you don't see the streamers with Jsoup: it's because of lazy loading. The part of the page you want to parse is loaded lazily, so you need to find that lazy request and parse its URL instead, which is the Twitch API URL above.
Please check this question out:
how to use Jsoup in site that has lazyload scrollLoader.js
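For completeness, here is a rough sketch of reading that API URL and printing the streamer names, assuming the org.json library is on the classpath; the field names ("streams", "channel", "display_name") are assumptions about the Kraken response format, so inspect the actual JSON before relying on them:
import java.net.URL;
import java.util.Scanner;
import org.json.JSONArray;
import org.json.JSONObject;

public class TwitchStreamers {
    public static void main(String[] args) throws Exception {
        String api = "https://streams.twitch.tv/kraken/streams?limit=20&offset=0"
                + "&game=Hearthstone%3A+Heroes+of+Warcraft&broadcaster_language=&on_site=1";

        // Read the whole response body into a string
        Scanner scanner = new Scanner(new URL(api).openStream(), "UTF-8").useDelimiter("\\A");
        String body = scanner.hasNext() ? scanner.next() : "";
        scanner.close();

        // Field names are assumptions about the Kraken response; check the actual JSON first
        JSONArray streams = new JSONObject(body).getJSONArray("streams");
        for (int i = 0; i < streams.length(); i++) {
            JSONObject channel = streams.getJSONObject(i).getJSONObject("channel");
            System.out.println((i + 1) + "." + channel.getString("display_name"));
        }
    }
}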

How can I use an assertion to test this java code?

My homework assignment is to read a URL and print all hyperlinks at that URL to a file. I also need to submit a JUnit test case with at least one assertion. I have looked at the different forms of Assert, but I just can't come up with any use of them that applies to my code. Any help steering me in the right direction would be great.
(I'm not looking for anyone to write the test case for me, just a little guidance on what direction I should be looking in)
public void saveHyperLinkToFile(String url, String fileName)
    throws IOException
{
    URL pageLocation = new URL(url);
    Scanner in = new Scanner(pageLocation.openStream());
    PrintWriter out = new PrintWriter(fileName);
    while (in.hasNext())
    {
        String line = in.next();
        if (line.contains("href=\"http://"))
        {
            int from = line.indexOf("\"");
            int to = line.lastIndexOf("\"");
            out.println(line.substring(from + 1, to));
        }
    }
    in.close();
    out.close();
}
Try to decompose your method into simpler ones:
List<URL> readHyperlinksFromUrl(URL url);
void writeUrlsToFile(List<URL> urls, String fileName);
You could already test your first method by saving a sample document as a resource and running it against that resource, comparing the result with the known list of URLs.
You can also test the second method by re-reading that file.
But you can decompose things even further:
void writeUrlsToWriter(List<URL> urls, Writer writer);
Writer createFileWriter(String fileName);
Now you can test the first of these methods by writing to a StringWriter and checking what was written there, asserting the equality of writer.toString() with the expected value. Note that the methods are becoming simpler and simpler.
It would actually be a very good exercise to write the whole thing test-first, or even to play ping-pong with yourself.
Good luck and happy coding.
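As a concrete illustration of the StringWriter approach (a sketch only: the LinkSaver class and the one-URL-per-line output format are assumptions, not code from the question), a JUnit test could look like this:
import static org.junit.Assert.assertEquals;

import java.io.StringWriter;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class LinkSaverTest {

    @Test
    public void writesEachUrlOnItsOwnLine() throws Exception {
        List<URL> urls = Arrays.asList(new URL("http://example.com"), new URL("http://example.org"));
        StringWriter writer = new StringWriter();

        // LinkSaver is a hypothetical class holding the decomposed methods suggested above
        new LinkSaver().writeUrlsToWriter(urls, writer);

        // Assumes the writer prints one URL per line
        String expected = "http://example.com" + System.lineSeparator()
                + "http://example.org" + System.lineSeparator();
        assertEquals(expected, writer.toString());
    }
}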

document.newPage(); not working inside for each loop in iText

I tried to create content on a new page using doc.newPage(). It worked fine, but when I tried to use the same code inside a for loop it doesn't work. Below is my code:
String first="First Page";
String second="second Page";
ByteArrayInputStream is = new ByteArrayInputStream(first.getBytes());
worker.parseXHtml(pdfWriter, doc, is);
doc.newPage();
is=new ByteArrayInputStream(second.getBytes());
worker.parseXHtml(pdfWriter, doc, is);
This code works fine, but when I put it in a for loop it doesn't create the content on two pages.
ByteArrayInputStream is = null;
List<String> strList=new ArrayList<String>();
String first="First Page";
String second="second Page";
strList.add(first);
strList.add(second);
for (String string : strList)
{
    doc.newPage();
    is = new ByteArrayInputStream(second.getBytes());
    worker.parseXHtml(pdfWriter, doc, is);
}
How to overcome this issue?
Apparently adding text without HTML tags fails when you use XMLWorker. This is a known bug. For now a workaround would be to check whether a string has HTML elements or not and if it doesn't, you could wrap it in a paragraph tag.
I adjusted your sample to make it work:
List<String> strList=new ArrayList<String>();
String first="<p>First Page</p>";
String second="<p>second Page</p>";
strList.add(first);
strList.add(second);
for (String string : strList)
{
    doc.newPage();
    // I just prefer StringReaders over ByteArrayInputStreams
    worker.parseXHtml(pdfWriter, doc, new StringReader(string));
}
Also, for future reference: your code threw an exception, and it would be best if you included any thrown exceptions in future questions.

xpath: write to a file

I'm developing Java code to get data from a website and store it in a file. I want to store the result of an XPath query in a file. Is there any way to save the output of the XPath? Please forgive any mistakes; this is my first question.
public class TestScrapping {

    public static void main(String[] args) throws MalformedURLException, IOException, XPatherException {
        // URL to be fetched; in the URL you can replace s=cantabil with the company of your choice
        String url_fetch = "http://www.yahoo.com";
        // TagNode object used to traverse the document with XPath
        TagNode node;
        // XPath of the data to be fetched; use Firefox's FirePath add-on or Firebug to find the required XPath.
        // The XPath below selects the title of the company you queried for
        String name_xpath = "//div[1]/div[2]/div[2]/div[1]/div/div/div/div/table/tbody/tr[1]/td[2]/text()";

        // HtmlCleaner setup
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = new CleanerProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // open the connection and clean the page
        URL url = new URL(url_fetch);
        URLConnection conn = url.openConnection();
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // nodes matching the given XPath; if the XPath matches nothing, info_nodes.length == 0
        Object[] info_nodes = node.evaluateXPath(name_xpath);
        if (info_nodes.length > 0) {
            for (int i = 0; i < info_nodes.length; i++) {
                System.out.println(info_nodes[i]);
            }
        }
    }
}
You can "simply" print the nodes as strings, to console/or a file --
example in Perl:
my $all = $XML_OBJ->find('/'); # selecting all nodes from root
foreach my $node ($all->get_nodelist()) {
print XML::XPath::XMLParser::as_string($node);
}
note: this output however may not be nicely xml-formatted/indented
The output of an XPath query in Java is a node set, so yes, once you have a node set you can do anything you want with it: save it to a file, process it further.
Saving it to a file involves the same steps in Java that saving anything else to a file involves; there is no difference between that and any other data. Select the node set, iterate through it, get the parts you want from it, and write them to some kind of file stream.
However, if you mean is there a NodeSet.saveToFile(), then no.
I would recommend taking the NodeSet, which is a collection of Nodes, iterating over it, and adding the nodes to a newly created DOM Document object.
After this, you can use TransformerFactory to get a Transformer object and use its transform method. You should transform from a DOMSource to a StreamResult, which can be created on top of a FileOutputStream.
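A minimal sketch of that last step, assuming you already have an org.w3c.dom Node (for HtmlCleaner output you could first convert a TagNode with its DomSerializer); the class and method names here are made up:
import java.io.FileOutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;

public class NodeWriter {

    // Writes a DOM node and its subtree to the given file
    public static void writeNodeToFile(Node node, String fileName) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        try (FileOutputStream out = new FileOutputStream(fileName)) {
            transformer.transform(new DOMSource(node), new StreamResult(out));
        }
    }
}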

Android: Extracting the text between two HTML tags

I need to extract the text between two HTML tags and store it in a string. An example of the HTML I want to parse is as follows:
<div id=\"swiki.2.1\"> THE TEXT I NEED </div>
I have done this in Java using the pattern (swiki\.2\.1\\\")(.*)(\/div) and getting the string I want from group $2. However, this will not work on Android. When I go to print the contents of $2, nothing appears, because the match fails.
Has anyone had a similar problem using regex on Android, or is there a better (non-regex) way to parse the HTML page in the first place? Again, this works fine in a standard Java test program. Any help would be greatly appreciated!
For HTML parsing I always use HtmlCleaner: http://htmlcleaner.sourceforge.net/
An awesome lib that works great with XPath and, of course, Android. :-)
This shows how you can download a page from a URL and parse it to get a certain value from an attribute (also shown in the docs):
public static String snapFromHtmlWithCookies(Context context, String xPath, String attrToSnap, String urlString,
        String cookies) throws IOException, XPatherException {
    String snap = "";

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();
    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    props.setAllowHtmlInsideAttributes(true);
    props.setAllowMultiWordAttributes(true);
    props.setRecognizeUnicodeChars(true);
    props.setOmitComments(true);

    URL url = new URL(urlString);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);
    // optional cookies
    connection.setRequestProperty(context.getString(R.string.cookie_prefix), cookies);
    connection.connect();

    // use the cleaner to "clean" the HTML and return it as a TagNode object
    TagNode root = cleaner.clean(new InputStreamReader(connection.getInputStream()));

    Object[] foundNodes = root.evaluateXPath(xPath);
    if (foundNodes.length > 0) {
        TagNode foundNode = (TagNode) foundNodes[0];
        snap = foundNode.getAttributeByName(attrToSnap);
    }
    return snap;
}
Just edit it for your needs. :-)
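Since the original question asks for the text between the div tags rather than an attribute, a variation of the above could look like this (a sketch: the XPath is based on the question's snippet and the method name is made up):
public static String extractDivText(String urlString) throws IOException, XPatherException {
    // Same HtmlCleaner setup as above; the XPath targets the div from the question
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode root = cleaner.clean(new URL(urlString));
    Object[] nodes = root.evaluateXPath("//div[@id='swiki.2.1']");
    if (nodes.length > 0) {
        return ((TagNode) nodes[0]).getText().toString().trim();
    }
    return "";
}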
