Extracting atom feeds from URL sets - java

I have a huge list of URLs, and my task is to feed them to Java code that should spit out the Atom contents. Is there an API or library for this, and how can I access the feeds? I tried the code below, but it does not show any output and I don't know what went wrong.
try {
    // the query URL must be a single string literal, not split across lines
    URL url = new URL("https://www.google.com/search?hl=en&q=robbery&tbm=blg&output=atom");
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new XmlReader(url));
    System.out.println("Feed Title: " + feed.getTitle());
    for (SyndEntry entry : (List<SyndEntry>) feed.getEntries()) {
        System.out.println("Title: " + entry.getTitle());
        System.out.println("Unique Identifier: " + entry.getUri());
        System.out.println("Updated Date: " + entry.getUpdatedDate());
        for (SyndLinkImpl link : (List<SyndLinkImpl>) entry.getLinks()) {
            System.out.println("Link: " + link.getHref());
        }
        for (SyndContentImpl content : (List<SyndContentImpl>) entry.getContents()) {
            System.out.println("Content: " + content.getValue());
        }
        for (SyndCategoryImpl category : (List<SyndCategoryImpl>) entry.getCategories()) {
            System.out.println("Category: " + category.getName());
        }
    }
} catch (Exception ex) {
    // an empty catch block swallows every error; print the exception
    // so you can see why no output appears
    ex.printStackTrace();
}

You can use Rome (http://rometools.org) to process Atom feeds.
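Since you have a whole list of URLs, you can wrap the same Rome calls in a loop. A minimal sketch, assuming a recent Rome version (package com.rometools.rome; older releases used com.sun.syndication) and that your URLs are already collected in a List<String> (the class and variable names here are illustrative):
import java.net.URL;
import java.util.List;

import com.rometools.rome.feed.synd.SyndEntry;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;

public class FeedDumper {
    public static void dumpAll(List<String> urls) {
        SyndFeedInput input = new SyndFeedInput();
        for (String u : urls) {
            try {
                // build a feed object from each URL in turn
                SyndFeed feed = input.build(new XmlReader(new URL(u)));
                System.out.println("Feed: " + feed.getTitle());
                for (SyndEntry entry : feed.getEntries()) {
                    System.out.println("  " + entry.getTitle());
                }
            } catch (Exception ex) {
                // keep going: one bad URL should not stop the whole batch
                System.err.println("Failed on " + u + ": " + ex);
            }
        }
    }
}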

Every Atom feed has a "feed" root element, so you can fetch the URL and check whether the document contains a feed tag or not.
In Java you can use the built-in XML parser library to do it:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
// DocumentBuilder.parse has no overload for java.net.URL,
// so pass the URL as a string (a system id) instead
Document doc = builder.parse(url.toString());
doc.getDocumentElement().normalize();
if (doc.getElementsByTagName("feed").getLength() > 0) {
    // it is an Atom feed; hand it to Rome
}

Related

How to scrape a specific URL from multiple URLs in a webpage in Java

I am doing data scraping for the first time. My assignment is to get a specific URL from a webpage that contains multiple links (help, click here, etc.). How can I get the specific URL and ignore the random links? On the page in question I only want the link "The SEC adopted changes to the exempt offering framework" and want to ignore the other links. How do I do that in Java? I was able to extract all the URLs, but I am not sure how to get the specific one. Below is my code
while (rs.next()) {
    String content = rs.getString("Content");
    doc = Jsoup.parse(content);
    // email extraction: @ separates the local part from the domain
    Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
    Matcher matcher = p.matcher(doc.text());
    Set<String> emails = new HashSet<String>();
    while (matcher.find()) {
        emails.add(matcher.group());
    }
    System.out.println(emails);
    // title extraction
    String title = doc.title();
    System.out.println("Title: " + title);
}
// note: the code below runs after the loop, so it only sees the last parsed document
Elements links = doc.select("a");
for (Element link : links) {
    String url = link.attr("href");
    System.out.println("\nlink :" + url);
    System.out.println("text: " + link.text());
}
System.out.println("Getting all the images");
Elements images = doc.getElementsByTag("img");
for (Element src : images) {
    System.out.println("src " + src.attr("abs:src"));
}
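One way to narrow this down (a sketch, not from the original thread) is to select only the anchors whose text contains the phrase you care about; Jsoup's :contains selector matches element text case-insensitively:
// hypothetical filter: keep only links whose anchor text mentions the
// phrase from the question; adjust the phrase to your actual target
Elements wanted = doc.select("a:contains(exempt offering framework)");
for (Element link : wanted) {
    System.out.println(link.text() + " -> " + link.attr("href"));
}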

Crawling a URL in order to extract all the other URLs in that page

I am trying to crawl URLs in order to extract the other URLs inside each page. To do so, I read the HTML of the page line by line, match each line against a pattern, and then extract the needed part, as shown below:
public class SimpleCrawler {
    static String pattern = "https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
    static Pattern UrlPattern = Pattern.compile(pattern);
    static Matcher UrlMatcher;

    public static void main(String[] args) {
        try {
            URL url = new URL("https://stackoverflow.com/");
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            // declare the variable outside the loop condition; a declaration
            // inside while(...) does not compile
            String line;
            while ((line = br.readLine()) != null) {
                UrlMatcher = UrlPattern.matcher(line);
                if (UrlMatcher.find()) {
                    String extractedPath = UrlMatcher.group(1);
                    String extractedPath2 = UrlMatcher.group(2);
                    System.out.println("http://www." + extractedPath + ".com" + extractedPath2);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
However, there are some issues with it which I would like to address:
1. How is it possible to make http and www, or even both of them, optional? I have encountered many cases where links lack either or both parts, so the regex will not match them.
2. In my code I capture two groups, one from http up to the domain extension and the second for whatever comes after it. This, however, causes two sub-problems:
2.1 Since the input is HTML, any HTML tags that come after the URL will be extracted too.
2.2 In the System.out.println("http://www."+extractedPath+".com"+extractedPath2); line I cannot be sure it prints the right URL (regardless of the previous issues), because I do not know which domain extension was matched.
3. Last but not least, I wonder how to match both http and https.
How about:
try {
boolean foundMatch = subjectString.matches(
"(?imx)^\n" +
"(# Scheme\n" +
" [a-z][a-z0-9+\\-.]*:\n" +
" (# Authority & path\n" +
" //\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=]+#)? # User\n" +
" ([a-z0-9\\-._~%]+ # Named host\n" +
" |\\[[a-f0-9:.]+\\] # IPv6 host\n" +
" |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\]) # IPvFuture host\n" +
" (:[0-9]+)? # Port\n" +
" (/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Path\n" +
" |# Path without authority\n" +
" (/?[a-z0-9\\-._~%!$&'()*+,;=:#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/?)?\n" +
" )\n" +
"|# Relative URL (no scheme or authority)\n" +
" ([a-z0-9\\-._~%!$&'()*+,;=#]+(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)*/? # Relative path\n" +
" |(/[a-z0-9\\-._~%!$&'()*+,;=:#]+)+/?) # Absolute path\n" +
")\n" +
"# Query\n" +
"(\\?[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"# Fragment\n" +
"(\\#[a-z0-9\\-._~%!$&'()*+,;=:#/?]*)?\n" +
"$");
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
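Note that matches() only succeeds when the entire input is a URL, which suits validation. To extract URLs from arbitrary text you would drop the ^ and $ anchors and loop with find() instead. A minimal sketch, where fullUriRegex stands for the pattern above without its anchors and htmlLine is whatever text you are scanning (both placeholders):
// compile once and reuse; the flags replace the inline (?imx) modifiers
Pattern uriPattern = Pattern.compile(fullUriRegex,
        Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
Matcher m = uriPattern.matcher(htmlLine);
while (m.find()) {
    System.out.println("Found URL: " + m.group());
}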
With one library: I used HtmlCleaner, and it does the job.
You can find it at:
http://htmlcleaner.sourceforge.net/javause.php
Another example (not tested) uses jsoup:
http://jsoup.org/cookbook/extracting-data/example-list-links
It is rather readable. You can enhance it to pick <a> tags or others, the href attribute, and so on,
or be more precise about case (HreF, HRef, ...) as an exercise:
import java.util.Map;
import java.util.Vector;

import org.htmlcleaner.*;

public static Vector<String> HTML2URLS(String _source) {
    Vector<String> result = new Vector<String>();
    HtmlCleaner cleaner = new HtmlCleaner();
    // root node of the cleaned document
    TagNode node = cleaner.clean(_source);
    // all nodes, in document order
    TagNode[] myNodes = node.getAllElements(true);
    for (TagNode tn : myNodes) {
        // all attributes of this tag
        Map<String, String> mss = tn.getAttributes();
        // name of the tag
        String name = tn.getName();
        // is there an href, in either case?
        String href = "";
        if (mss.containsKey("href")) href = mss.get("href");
        if (mss.containsKey("HREF")) href = mss.get("HREF");
        if (name.equals("a")) result.add(href);
        if (name.equals("A")) result.add(href);
    }
    return result;
}
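For comparison, the jsoup cookbook example linked above boils down to something like this (a sketch in the same spirit, not the verbatim cookbook code); jsoup selectors are case-insensitive, so "a[href]" also matches <A HREF=...>:
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static List<String> html2UrlsJsoup(String source) {
    List<String> result = new ArrayList<String>();
    Document doc = Jsoup.parse(source);
    // select every anchor that carries an href attribute
    for (Element link : doc.select("a[href]")) {
        result.add(link.attr("href"));
    }
    return result;
}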

java metadata-extractor tag description

I am using the Java library metadata-extractor and cannot extract the tag description correctly using the getUserCommentDescription method in the code below, although tag.getDescription does work:
String exif = "File: " + file;
File jpgFile = new File(file);
Metadata metadata = ImageMetadataReader.readMetadata(jpgFile);
for (Directory directory : metadata.getDirectories()) {
    String directoryName = directory.getName();
    for (Tag tag : directory.getTags()) {
        String tagName = tag.getTagName();
        String description = tag.getDescription();
        if (tagName.toLowerCase().contains("comment")) {
            Log.d("DEBUG", description);
        }
        exif += "\n " + tagName + ": " + description; // Returns the correct values.
        Log.d("DEBUG", directoryName + " " + tagName + " " + description);
    }
    if (directoryName.equals("Exif IFD0")) {
        // create a descriptor
        ExifSubIFDDirectory exifDirectory = metadata.getDirectory(ExifSubIFDDirectory.class);
        ExifSubIFDDescriptor descriptor = new ExifSubIFDDescriptor(exifDirectory);
        Log.d("DEBUG", "Comments: " + descriptor.getUserCommentDescription()); // Always null.
    }
}
Am I missing something here?
You are checking for the directory name Exif IFD0 and then accessing the ExifSubIFDDirectory.
Try this code outside the loop:
Metadata metadata = ImageMetadataReader.readMetadata(jpgFile);
ExifSubIFDDirectory exifDirectory = metadata.getDirectory(ExifSubIFDDirectory.class);
ExifSubIFDDescriptor descriptor = new ExifSubIFDDescriptor(exifDirectory);
String comment = descriptor.getUserCommentDescription();
If this returns null then it may be an encoding issue or bug. If you run this code:
byte[] commentBytes = exifDirectory.getByteArray(ExifSubIFDDirectory.TAG_USER_COMMENT);
Do you have bytes in the array?
If so then please open an issue in the issue tracker and include a sample image that can be used to reproduce the problem. You must authorise any image you provide for use in the public domain.
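If the bytes are there but the description is null, you can decode them yourself as a stopgap. In EXIF, UserComment begins with an 8-byte character-code header (for example ASCII followed by three NUL bytes, or UNICODE plus a NUL); the sketch below assumes the plain-ASCII case and is not an official metadata-extractor API:
byte[] commentBytes = exifDirectory.getByteArray(ExifSubIFDDirectory.TAG_USER_COMMENT);
if (commentBytes != null && commentBytes.length > 8) {
    // skip the 8-byte character-code header and decode the payload;
    // assumes an ASCII header, a UNICODE header would need UTF-16 instead
    String comment = new String(commentBytes, 8, commentBytes.length - 8,
            java.nio.charset.StandardCharsets.US_ASCII).trim();
    Log.d("DEBUG", "Raw comment: " + comment);
}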

How to parse XML file with multi level tags and CDATA by using java?

I want to read an XML file with multi-level tags and CDATA using Java.
The sample XML is:
<?xml version="1.0" encoding="UTF-8"?>
<Result>
<ResultDetails>
<SearchFilmResult ItemType="film">
<FilmDetails>
<FilmDetail>
<Film Code="INCEPTION"><![CDATA[INCEPTION 2010]]></Film>
<Imdb>8.8</Imdb>
<FilmInformation>
<Director><![CDATA[Christopher Nolan]]></Director>
<Actors>
<Actor1><![CDATA[Leonardo DiCaprio]]></Actor1>
<Actor2><![CDATA[Joseph Gordon-Levitt]]></Actor2>
<Actor3><![CDATA[Ellen Page]]></Actor3>
</Actors>
</FilmInformation>
</FilmDetail>
</FilmDetails>
</SearchFilmResult>
</ResultDetails>
</Result>
The expected result is:
Film Code = INCEPTION
Film Name = INCEPTION 2010
IMDB = 8.8
Director = Christopher Nolan
Actors = Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page
Can anyone guide me on how to do this? Many thanks.
Have you looked at XPath?
Here's a very simple example that will parse this sample XML, but I'd think it would be up to you to explore the possibilities that are out there and determine what will work well for you.
Give this a try:
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class Test {
    public static void main(String[] args) throws Exception {
        // sample xml
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<Result>\n" +
            " <ResultDetails>\n" +
            "  <SearchFilmResult ItemType=\"film\">\n" +
            "   <FilmDetails>\n" +
            "    <FilmDetail>\n" +
            "     <Film Code=\"INCEPTION\"><![CDATA[INCEPTION 2010]]></Film>\n" +
            "     <Imdb>8.8</Imdb>\n" +
            "     <FilmInformation>\n" +
            "      <Director><![CDATA[Christopher Nolan]]></Director>\n" +
            "      <Actors>\n" +
            "       <Actor1><![CDATA[Leonardo DiCaprio]]></Actor1>\n" +
            "       <Actor2><![CDATA[Joseph Gordon-Levitt]]></Actor2>\n" +
            "       <Actor3><![CDATA[Ellen Page]]></Actor3>\n" +
            "      </Actors>\n" +
            "     </FilmInformation>\n" +
            "    </FilmDetail>\n" +
            "   </FilmDetails>\n" +
            "  </SearchFilmResult>\n" +
            " </ResultDetails>\n" +
            "</Result>";
        // read the xml
        InputSource source = new InputSource(new StringReader(xml));
        // build a document model
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document document = db.parse(source);
        // create an xpath interpreter
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();
        // evaluate nodes; attributes are addressed with the @ axis, e.g. Film/@Code
        String filmCode = xpath.evaluate("Result/ResultDetails/SearchFilmResult/FilmDetails/FilmDetail/Film/@Code", document);
        String filmName = xpath.evaluate("Result/ResultDetails/SearchFilmResult/FilmDetails/FilmDetail/Film", document);
        String imdb = xpath.evaluate("Result/ResultDetails/SearchFilmResult/FilmDetails/FilmDetail/Imdb", document);
        String director = xpath.evaluate("Result/ResultDetails/SearchFilmResult/FilmDetails/FilmDetail/FilmInformation/Director", document);
        // get actor data
        XPathExpression expr = xpath.compile("Result/ResultDetails/SearchFilmResult/FilmDetails/FilmDetail/FilmInformation/Actors/child::*");
        NodeList actors = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
        StringBuilder sb = new StringBuilder();
        // compile actor list
        for (int i = 0; i < actors.getLength(); i++) {
            String actorName = actors.item(i).getFirstChild().getNodeValue();
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(actorName);
        }
        // print output
        System.out.println("Film Code = " + filmCode);
        System.out.println("Film Name = " + filmName);
        System.out.println("IMDB = " + imdb);
        System.out.println("Director = " + director);
        System.out.println("Actors = " + sb.toString());
    }
}
Output:
Film Code = INCEPTION
Film Name = INCEPTION 2010
IMDB = 8.8
Director = Christopher Nolan
Actors = Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page

JSoup Parse text and links in sequence from html file

I am trying to extract the text and links from an HTML file. At the moment I can extract both easily using JSoup, but I can only do it separately.
Here is my code:
try {
    doc = (Document) Jsoup.parse(new File(input), "UTF-8");
    Elements paragraphs = ((Element) doc).select("td.text");
    for (Element p : paragraphs) {
        // System.out.println(p.text() + "\r\n" + "***********************************************************" + "\r\n");
        getGui().setTextVers(p.text() + "\r\n" + "***********************************************************" + "\r\n");
    }
    Elements links = doc.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        getGui().setTextVers("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
I have placed a .text class on the outermost td where there is text. What I would like to achieve is: when the program finds a td with the .text class, it checks it for any links and extracts them from that section in order, so you would have:
Text
Link
Text
Link
I tried putting an inner for-each loop inside the first for-each loop, but this only printed the full list of links for the page. Can anyone help?
Try
Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");
for (Element p : paragraphs) {
    System.out.println(p.text());
    // look for links only inside this td, so text and links stay in sequence
    Elements links = p.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
    }
}
