Extract URL locations from an XML link in Java

I'm new to Java. I have a link, "https://moz.com/blog-sitemap.xml", that contains URLs, and I want to get them and save them in a String vector/array.
I tried this first to see how I would get the links:
URL robotFile = new URL("https://moz.com/blog-sitemap.xml");
// read the sitemap line by line
Scanner robotScanner = new Scanner(robotFile.openStream());
while (robotScanner.hasNextLine()) {
System.out.println(robotScanner.nextLine());
}
this is the sample output
My question is: is there a simpler way to get these links than looping over each line and checking whether it contains "https" so I can extract the link from it?

You can use Jsoup to do this more easily:
List<String> urlList = new ArrayList<>();
Document doc = Jsoup.connect("https://moz.com/blog-sitemap.xml").get();
Elements urls = doc.getElementsByTag("loc");
for (Element url : urls) {
urlList.add(url.text());
}
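If adding Jsoup is not an option, the same idea (collect the text of every loc element) can be sketched with the XML parser that ships with the JDK. This is only a minimal sketch: the class name SitemapLocs, the method extractLocs, and the inline sample sitemap are made up for illustration; for the real sitemap the stream would come from new URL(...).openStream() as in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class SitemapLocs {
    // Collects the trimmed text of every <loc> element found in the given XML stream.
    static List<String> extractLocs(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample standing in for the downloaded sitemap
        String sample = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.com/a</loc></url>"
                + "<url><loc>https://example.com/b</loc></url>"
                + "</urlset>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractLocs(in));
    }
}
```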

Related

How to save a jsoup document as text file

I am trying to save all of the readable words on a web page into one text document while ignoring html markup.
I am using JSoup to parse all of the words on a web page; my only guess at how to separate the real words from the code is through elements.
Is it possible to convert multiple elements of the jsoup document into a text file?
i.e.:
Elements titles = doc.select("title");
Elements paragraphs = doc.select("p");
Elements links = doc.select("a[href]");
Elements smallText = doc.select("a");
Currently saving the parse as a document with:
Document doc = Jsoup.connect("https:// (enter a url)").get();
Here is a simple way:
Document doc = Jsoup.connect("https:// (enter a url)").get();
BufferedWriter writer = null;
try
{
writer = new BufferedWriter( new FileWriter("d://test.txt"));
writer.write(doc.toString());
}
catch ( IOException e)
{
}
Adding an answer because I am unable to comment above.
Replace writer.write(doc.toString()); with writer.write(doc.select("html").text()); in the above code.
It will give you the text on the page.
Instead of "html" in doc.select("html").text(), other tags can be used to extract the text enclosed in those tags.
Edit: you can also use writer.write(doc.body().text());
After writing the text with writer.write(doc.text());, call writer.close(); on the very next line. This will fix the problem, because a BufferedWriter may not flush its contents to the file until it is closed.
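The missing close() can also be avoided with try-with-resources, which closes (and therefore flushes) the writer automatically. A minimal sketch, assuming the page text is already available as a String; the class name SaveText, the method save, and the temp-file location are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SaveText {
    // Writes the text to the target file; try-with-resources closes the writer,
    // which also flushes any buffered characters to disk.
    static void save(String text, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location; the question writes to a fixed path instead
        Path target = Files.createTempFile("page-text", ".txt");
        // Stand-in for the String returned by doc.body().text()
        save("Readable words from the page", target);
        System.out.println(Files.readString(target));
    }
}
```

With Jsoup in place, the call would look like save(doc.body().text(), target).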

How to display multiple Hyperlinks in Java

I want to display multiple hyperlinks using JEditorPane. To be more specific, I have a HashSet named urlList:
static Set<String> urlList = new HashSet<>();
and inside it I store urls like
www.google.com
www.facebook.com
etc.
As I said, I am using a JEditorPane, and I set it up like this:
static final JEditorPane ResultsArea = new JEditorPane();
ResultsArea.setEditorKit(JEditorPane.createEditorKitForContentType("text/html"));
ResultsArea.setEditable(false);
At some point I want to display all these links as hyperlinks in the JEditorPane,
so I do this:
for(String s : urlList)
{
s=(""+s+""+"\n");
ResultsArea.setText(ResultsArea.getText()+s+"\n");
}
but it doesn't display anything.
When I change it to
ResultsArea.setText(s);
it displays only one of them. However, I want to display all of them one after the other,
like
www.example.com
www.stackoverflow.com
etc.
Does anyone know how to do that?
Use a StringBuilder to build the list of URLs first.
StringBuilder sb = new StringBuilder();
for (String s : urlList) {
sb.append("").append(s).append("\n");
}
ResultsArea.setText(sb.toString()); // then set the complete URL list once
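As a rough sketch of what the built string could look like for a text/html JEditorPane: each URL is wrapped in an anchor element and followed by a br element, since a plain "\n" does not produce a line break in rendered HTML. The class name LinkHtml, the method buildLinks, and the "http://" prefix (the stored URLs have no scheme) are assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class LinkHtml {
    // Builds one HTML document with an <a> element per URL, each followed by <br>.
    static String buildLinks(Set<String> urls) {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (String url : urls) {
            // Assumption: stored URLs lack a scheme, so "http://" is prepended in the href
            sb.append("<a href=\"http://").append(url).append("\">")
              .append(url).append("</a><br>");
        }
        return sb.append("</body></html>").toString();
    }

    public static void main(String[] args) {
        Set<String> urlList = new LinkedHashSet<>();
        urlList.add("www.google.com");
        urlList.add("www.facebook.com");
        System.out.println(buildLinks(urlList));
    }
}
```

The result would then be set once, for example with ResultsArea.setText(LinkHtml.buildLinks(urlList)).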

Getting info from a webpage in java

Sorry if this is a big question; I'm just looking for someone to tell me in which direction to learn more, since I have no clue. I have very basic knowledge of HTML and Java.
Someone in my family has to copy every product from a supplier into his own webshop.
The problem is that he needs to enter all the articles one by one by hand, so I'm looking for a way to replace that work with a program.
I already have a bit of the price calculation working; all I need now is the product info.
http://pastebin.com/WVCy55Dj
From line 1009 to around 1030.
I need 3 separate strings from the three spans with the class "CatalogusListDetailTest".
From line 987 to around 1000.
I need a way to get all these images. They are on the website at www.flamingo.be/Images/Products/Large/"productID"(our first string).jpg
Sometimes there's an _A or _B suffix, as you can see in this example, so I'm looking for a way to check whether those exist and get those images as well.
If I could get this far I'd be very thankful! I'll figure the rest out myself. Sorry for the long post; I wanted to give as much info as possible.
You can look at the HTML parser library Jsoup; documentation: http://jsoup.org/cookbook/
EDIT: Code to get the product code:
Elements classElements = document.getElementsByClass("CatalogusListDetailTextTitel");
for (Element classElement : classElements) {
if (classElement.text().contains("Productcode :")) {
System.out.println(classElement.parent().ownText());
}
}
Instead of document, you may have to use a specific element to get consistent results; the code above prints all the product codes.
You can use JTidy for what you need.
Code Example:
public void downloadSinglePage(String pageLink, String targetDir) throws XPathExpressionException, IOException {
URL url = new URL(pageLink);
BufferedInputStream page = new BufferedInputStream(url.openStream());
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document response = tidy.parseDOM(page, null);
XPathFactory factory = XPathFactory.newInstance();
XPath xPath=factory.newXPath();
NodeList nodes = (NodeList)xPath.evaluate(IMAGE_PATTERN, response, XPathConstants.NODESET);
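// getNodeValue() already returns a String; this assumes at least one node matched
String imageURL = nodes.item(0).getNodeValue();
// saveImageNIO takes three arguments; "image" is a placeholder file name
saveImageNIO(imageURL, targetDir, "image");
}
where
IMAGE_PATTERN = "//a/img/@src";
but the pattern depends on how the image is nested in the page's HTML code.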
Method for saving Image using NIO:
public void saveImageNIO(String imageURL, String targetDir, String imageName) throws IOException {
URL url = new URL(imageURL);
// try-with-resources closes both the channel and the file stream
try (ReadableByteChannel rbc = Channels.newChannel(url.openStream());
FileOutputStream fos = new FileOutputStream(targetDir + "/" + imageName + ".jpg")) {
// copies at most 1 << 24 bytes (16 MiB)
fos.getChannel().transferFrom(rbc, 0, 1 << 24);
}
}
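For the _A / _B variants mentioned in the question, one possible approach is to build the candidate URLs from the product ID and probe each with an HTTP HEAD request. This is only a sketch: the class name ImageCandidates, both method names, the "http://" scheme, and the product ID 12345 are hypothetical, and the existence check assumes the server answers HEAD requests with status 200 for existing images.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class ImageCandidates {
    // Returns the plain image URL followed by one URL per suffix (e.g. "_A", "_B").
    static List<String> candidateUrls(String baseUrl, String productId, List<String> suffixes) {
        List<String> urls = new ArrayList<>();
        urls.add(baseUrl + productId + ".jpg");
        for (String suffix : suffixes) {
            urls.add(baseUrl + productId + suffix + ".jpg");
        }
        return urls;
    }

    // Probes a URL with a HEAD request; true only if the server answers 200 OK.
    static boolean exists(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Assumed scheme and hypothetical product ID; no network access happens here
        String base = "http://www.flamingo.be/Images/Products/Large/";
        for (String url : candidateUrls(base, "12345", List.of("_A", "_B"))) {
            System.out.println(url);
        }
    }
}
```

Each candidate for which exists(url) returns true could then be passed to saveImageNIO.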

java data structure to replace file io

My program goes to my university's results page, finds all the links, and saves them to a file. Then I read the file, copy only the lines that contain the required links, and save them to another file. Then I parse that file again to extract the required data.
public class net {
public static void main(String[] args) throws Exception {
Document doc = Jsoup.connect("http://jntuconnect.net/results_archive/").get();
Elements links = doc.select("a");
File f1 = new File("flink.txt");
File f2 = new File("rlink.txt");
//write extracted links to f1 file
FileUtils.writeLines(f1, links);
// store each link from f1 file in string list
List<String> linklist = FileUtils.readLines(f1);
// second string list to store only required link elements
List<String> rlinklist = new ArrayList<String>();
// loop which finds required links and stores in rlinklist
for(String elem : linklist){
if(elem.contains("B.Tech") && (elem.contains("R07")||elem.contains("R09"))){
rlinklist.add(elem);
}
}
//store required links in f2 file
FileUtils.writeLines(f2, rlinklist);
// parse links from f2 file
Document rdoc = Jsoup.parse(f2, null);
Elements rlinks = rdoc.select("a");
// for storing hrefs and link text
List<String> rhref = new ArrayList<String>();
List<String> rtext = new ArrayList<String>();
for(Element rlink : rlinks){
rhref.add(rlink.attr("href"));
rtext.add(rlink.text());
}
}// end main
}
I don't want to create files to do this. Is there a better way to get the hrefs and link texts of only specific URLs without creating files?
It uses Apache Commons IO FileUtils and jsoup.
Here's how you can get rid of the first file write/read:
Elements links = doc.select("a");
List<String> linklist = new ArrayList<String>();
for (Element elt : links) {
linklist.add(elt.toString());
}
The second round trip, if I understand the code, is intended to extract the links that meet a certain test. You can just do that in memory using the same technique.
I see you're relying on Jsoup.parse to extract the href and link text from the selected links. You can do that in memory by writing the selected nodes to a StringBuffer, converting it to a String by calling its toString() method, and then using one of the Jsoup.parse methods that takes a String instead of a File argument.
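The in-memory filtering can be sketched without any file round trip. To keep the example free of external libraries, a map from href to link text stands in for the values that rlink.attr("href") and rlink.text() return in the question; the class name LinkFilter, the method requiredLinks, and the sample entries are hypothetical. Note that this sketch tests the link text only, whereas the question's loop tests the whole element string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LinkFilter {
    // Keeps only links whose text mentions B.Tech and either R07 or R09.
    static Map<String, String> requiredLinks(Map<String, String> hrefToText) {
        Map<String, String> required = new LinkedHashMap<>();
        for (Map.Entry<String, String> link : hrefToText.entrySet()) {
            String text = link.getValue();
            if (text.contains("B.Tech") && (text.contains("R07") || text.contains("R09"))) {
                required.put(link.getKey(), text);
            }
        }
        return required;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in place of the parsed results page
        Map<String, String> links = new LinkedHashMap<>();
        links.put("/results/1", "B.Tech R07 results");
        links.put("/results/2", "M.Tech R09 results");
        links.put("/results/3", "B.Tech R09 results");
        requiredLinks(links).forEach((href, text) -> System.out.println(href + " -> " + text));
    }
}
```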

xpath: write to a file

I'm developing Java code to get data from a website and store it in a file. I want to store the result of an XPath query in a file. Is there any way to save the output of the XPath? Please forgive any mistakes; this is my first question.
public class TestScrapping {
public static void main(String[] args) throws MalformedURLException, IOException, XPatherException {
// URL to be fetched in the below url u can replace s=cantabil with company of ur choice
String url_fetch = "http://www.yahoo.com";
//create tagnode object to traverse XML using xpath
TagNode node;
String info = null;
//XPath of the data to be fetched.....use firefox's firepath addon or use firebug to fetch the required XPath.
//the below XPath will display the title of the company u have queried for
String name_xpath = "//div[1]/div[2]/div[2]/div[1]/div/div/div/div/table/tbody/tr[1]/td[2]/text()";
// declarations related to the api
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = new CleanerProperties();
props.setAllowHtmlInsideAttributes(true);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);
//creating url object
URL url = new URL(url_fetch);
URLConnection conn = url.openConnection(); //opening connection
node = cleaner.clean(new InputStreamReader(conn.getInputStream()));//reading input stream
//storing the nodes belonging to the given xpath
Object[] info_nodes = node.evaluateXPath(name_xpath);
// String li= node.getAttributeByName(name_xpath);
//checking if something returned or not....if XPath invalid info_nodes.length=0
if (info_nodes.length > 0) {
//info_nodes[0] will return string buffer
StringBuffer str = new StringBuffer();
{
for(int i=0;i<info_nodes.length;i++)
System.out.println(info_nodes[i]);
}
/*str.append(info_nodes[0]);
System.out.println(str);
*/
}
}
}
You can "simply" print the nodes as strings, to console/or a file --
example in Perl:
my $all = $XML_OBJ->find('/'); # selecting all nodes from root
foreach my $node ($all->get_nodelist()) {
print XML::XPath::XMLParser::as_string($node);
}
note: this output however may not be nicely xml-formatted/indented
The output of an XPath in Java is a nodeset, so yes, once you have a nodeset you can do anything you want with it, save it to a file, process it some more.
Saving it to a file involves the same steps in Java as saving anything else to a file; there is no difference between that and any other data. Select the nodeset, iterate through it, get the parts you want from it, and write them to some kind of file stream.
However, if you mean is there a Nodeset.SaveToFile(), then no.
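The iterate-and-write steps described above might look roughly like this with the JDK's own XPath API (not the HtmlCleaner API used in the question). The class name XPathToWriter, the method writeMatches, and the sample XML are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathToWriter {
    // Evaluates the expression and writes the text of each matching node on its own line.
    static void writeMatches(String xml, String expression, Writer out)
            throws XPathExpressionException, IOException {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            out.write(nodes.item(i).getTextContent());
            out.write(System.lineSeparator());
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document and output location
        String xml = "<table><tr><td>Name</td><td>Cantabil</td></tr></table>";
        Path target = Files.createTempFile("xpath-output", ".txt");
        try (Writer out = Files.newBufferedWriter(target)) {
            writeMatches(xml, "//td", out);
        }
        System.out.println(Files.readString(target));
    }
}
```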
I would recommend taking the NodeSet, which is a collection of Nodes, iterating over it, and adding each node to a newly created DOM Document object.
After this, you can use the TransformerFactory to get a Transformer object and call its transform method. You should transform from a DOMSource to a StreamResult, which can be created from a FileOutputStream.
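A minimal sketch of that DOMSource-to-StreamResult route, using only JDK classes. The class name NodeSetToXml, the method matchesAsXml, and the wrapper element "results" are hypothetical, and the sketch assumes the XPath selects element nodes (attribute nodes cannot be appended as children). It serializes to a String so it can run anywhere; for a file, the StreamResult would wrap a FileOutputStream instead of the StringWriter.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeSetToXml {
    // Copies every node matched by the expression into a new document and serializes it.
    static String matchesAsXml(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, source, XPathConstants.NODESET);

        Document result = builder.newDocument();
        Element root = result.createElement("results"); // hypothetical wrapper element
        result.appendChild(root);
        for (int i = 0; i < nodes.getLength(); i++) {
            // importNode makes a deep copy owned by the new document
            root.appendChild(result.importNode(nodes.item(i), true));
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(result), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(matchesAsXml("<r><v>a</v><v>b</v><w>c</w></r>", "//v"));
    }
}
```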
