I played around with jsoup a little, but I can't extract the information I want from a website with it, and I need some help.
For example, given this website, I would like to extract some info like this:
-ROCK
--ACID ROCK
--PSYCHEDELIC ROCK
--BLUES ROCK
-----Aerosmith
----------One Way Street
-----AC/DC
----------Ain't No Fun (Waiting Round to Be a Millionaire)
In other words, I want a list of genres, each containing a list of artists, each containing a list of songs:
-Genre1
--Artist1
---Song1
---Song2
---Song3
--Artist2
---Song1
-Genre2
...
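Independently of the scraping, the nested structure above can be modeled as a map of maps; here is a minimal sketch (the MusicTree class and its method names are my own invention, not part of any library):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MusicTree {
    // genre -> artist -> list of songs, in insertion order
    private final Map<String, Map<String, List<String>>> tree = new LinkedHashMap<>();

    public void addSong(String genre, String artist, String song) {
        tree.computeIfAbsent(genre, g -> new LinkedHashMap<>())
            .computeIfAbsent(artist, a -> new ArrayList<>())
            .add(song);
    }

    public Map<String, Map<String, List<String>>> asMap() {
        return tree;
    }

    public static void main(String[] args) {
        MusicTree t = new MusicTree();
        t.addSong("Rock", "Aerosmith", "One Way Street");
        t.addSong("Rock", "AC/DC", "Ain't No Fun (Waiting Round to Be a Millionaire)");
        // print the hierarchy with the same indentation as the outline above
        t.asMap().forEach((genre, artists) -> {
            System.out.println("-" + genre);
            artists.forEach((artist, songs) -> {
                System.out.println("--" + artist);
                songs.forEach(s -> System.out.println("---" + s));
            });
        });
    }
}
```

Once the scraping works, each extracted (genre, artist, song) triple would be fed into addSong.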
This is what I have so far (sorry for the messy code):
package parser;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParser {

    public static void main(String[] args) {
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head><title>Music</title></head>"
                + "<body>"
                + "<table><tr><td><h1>Artists</h1></td></tr></table>"
                + "</body>"
                + "</html>";
        Document html = Jsoup.parse(htmlString);

        String genre = "genres";
        List<String> genreList = new ArrayList<>();

        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();
        String h2 = html.body().getElementsByClass(genre).text();
        String gen = html.body().getAllElements().text();

        try {
            Document doc = Jsoup.connect("http://allmusic.com/").get();
            title = doc.title();
            h1 = doc.text();
            h2 = doc.text();
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println("Title : " + title);
        System.out.println("gen : all elements : " + gen);
    }
}
And this is my output:
Title : AllMusic | Record Reviews, Streaming Songs, Genres & Bands
gen : all elements : Artists Artists Artists Artists Artists Artists
I haven't gotten very far, and I don't know how to extract the information I need (e.g. the genre types, artist names, ...).
I have some e-books in .txt format whose first page usually lists the author and the title in the form (author: blah blah, title: blah blah, etc.).
I want Lucene to locate this information automatically and then display it on the screen together with the offset of the found term. More specifically, I want to display:
BOOK1, (1)title: blah blah(15) offset 1->15, (16)author: blah blah(32) offset 16->32, release date: 14/12/1923 32->50
However, in the code below, where I pass in the file I want to search, displayTokensWithFullDetails only shows me where it finds the word "author" in the text, along with that token's offset, while I want it to show me the whole title line with cumulative offsets, i.e. from the first character of the line to the last. Is there any way to display the whole line and the offsets corresponding to it?
I would describe it as a command that says "find a token in the text that matches 'title'/'author' etc., and display the whole line with its offset."
Unless there is a completely different way to do it than through the attributes of StandardAnalyzer.
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class Indexer {

    public static void main(String[] args) throws Exception {
        File f = new File("C:/test1.txt");
        displayTokensWithFullDetails(f);
    }

    public static void displayTokensWithFullDetails(File f) throws IOException {
        TokenStream stream = new StandardAnalyzer().tokenStream("contents", new FileReader(f));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            String text = term.toString();
            if (text.equals("title") || text.equals("author") || text.equals("release")) {
                System.out.print(text + ":" + offset.startOffset() + "->" + offset.endOffset() + " ");
            }
        }
        stream.close();
        System.out.println();
    }
}
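Since what you ultimately want is whole lines plus cumulative character offsets, one completely different way, without StandardAnalyzer at all, is to scan the text line by line in plain Java while keeping a running offset. A sketch (class and method names are mine; it assumes one record per line):

```java
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {

    // Collects every line containing one of the markers, together with the
    // cumulative offsets of the line's first and last characters.
    static List<String> findLines(String text, String... markers) {
        List<String> hits = new ArrayList<>();
        int offset = 0;
        for (String line : text.split("\n", -1)) {
            for (String m : markers) {
                if (line.contains(m)) {
                    hits.add(line.trim() + " offset " + offset + "->" + (offset + line.length()));
                    break;
                }
            }
            offset += line.length() + 1; // +1 for the newline character
        }
        return hits;
    }

    public static void main(String[] args) {
        String book = "title: blah blah\nauthor: blah blah\nrelease date: 14/12/1923";
        findLines(book, "title", "author", "release").forEach(System.out::println);
    }
}
```

To use this on your e-books, read the file contents into a string first (e.g. with Files.readString) instead of handing a FileReader to an analyzer.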
Assume I have the following HTML, to be parsed using jsoup:
<body>
<div id="myDiv" class="simple" >
<p>
<img class="alignleft" src="myimage.jpg" alt="myimage" />
I just passed out of UC Berkeley
</p>
</div>
</body>
The question is: given just the keyword "Berkeley", is there a better way to find the element/XPath (or a list of them, if the keyword occurs multiple times) in the HTML that has this keyword as part of its text?
I don't get to see the HTML beforehand; it is only available at runtime.
My current implementation, using Java and jsoup: iterate through the children of body, get the ownText and text of each child, and then drill down into their children to narrow down the HTML element. I feel this is very slow.
An inelegant but simple way could look like this:
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {

    public static void main(String argv[]) {
        String html = "<body> \n" +
                "    <div id=\"myDiv\" class=\"simple\" >\n" +
                "       <p>\n" +
                "          <img class=\"alignleft\" src=\"myimage.jpg\" alt=\"myimage\" />\n" +
                "          I just passed out of UC Berkeley\n" +
                "       </p>\n" +
                "       <ol>\n" +
                "         <li>Berkeley</li>\n" +
                "         <li>Berkeley</li>\n" +
                "       </ol>\n" +
                "    </div> \n" +
                "</body>";

        Document doc = Jsoup.parse(html); // parse once and reuse
        Elements eles = doc.getAllElements(); // all elements that appear in your html
        Set<String> set = new HashSet<>();
        for (Element e : eles) {
            set.add(e.tag().getName()); // collect the distinct tag names
        }
        // remove some unimportant tags
        set.remove("head"); set.remove("html"); set.remove("body"); set.remove("#root"); set.remove("img");

        for (String s : set) {
            System.out.println(s);
            Elements hits = doc.select(s + ":contains(Berkeley)"); // tags containing your keyword
            if (!hits.isEmpty()) {
                System.out.println(hits.get(0).toString()); // print the first match, or do something else
            }
            System.out.println("---------------------");
            System.out.println();
        }
    }
}
Try this XPath.
For the first ancestor element with a class:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@class]'
For the first ancestor element with an id:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@id]'
See the documentation for normalize-space().
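To illustrate, the same XPath can be evaluated with Java's built-in XPath engine, assuming the markup is well-formed XML like the snippet above (class and method names are mine; with jsoup itself you would reach for a :contains selector instead):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {

    // Returns the attribute value of the first ancestor (in document order)
    // carrying the given attribute and enclosing the keyword, or null.
    static String firstAncestorWithAttr(String xml, String keyword, String attr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String expr = "//*[contains(normalize-space(), '" + keyword
                    + "')]/ancestor::*[@" + attr + "]";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                return null;
            }
            return nodes.item(0).getAttributes().getNamedItem(attr).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><div id=\"myDiv\" class=\"simple\">"
                + "<p>I just passed out of UC Berkeley</p></div></body>";
        System.out.println(firstAncestorWithAttr(xml, "Berkeley", "id"));    // myDiv
        System.out.println(firstAncestorWithAttr(xml, "Berkeley", "class")); // simple
    }
}
```

Note that real-world HTML is often not well-formed XML, which is exactly why jsoup is usually the safer tool for scraping.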
I want to get the link shown in this picture using Java (the image is below; there are a few more links on that webpage). I found this code on Stack Overflow, but I don't understand how to use it.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class weber{
public static void main(String[] args)throws Exception{
String url = "http://www.skyovnis.com/category/ufology/";
Document doc = Jsoup.connect(url).get();
/*String question = doc.select("#site-inner").text();
System.out.println("Question: " + question);*/
Elements anser = doc.select("#container .entry-title a");
for (Element anse : anser){
System.out.println("Answer: " + anse.text());
}
}
}
The code is edited from the original I found, though. Please help.
For your URL, the following code works fine:
public static void main(String[] args) {
Document doc;
try {
// need http protocol
doc = Jsoup.connect("http://www.skyovnis.com/category/ufology/").userAgent("Mozilla").get();
// get page title
String title = doc.title();
System.out.println("title : " + title);
// get all links (this is what you want)
Elements links = doc.select("a[href]");
for (Element link : links) {
// get the value from href attribute
System.out.println("\nlink : " + link.attr("href"));
System.out.println("text : " + link.text());
}
} catch (IOException e) {
e.printStackTrace();
}
}
The output was:
title : Ufology
link : http://www.shop.skyovnis.com/
text : Shop
link : http://www.shop.skyovnis.com/product-category/books/
text : Books
The following code filters the links by their text:
for (Element link : links) {
if(link.text().contains("Arecibo Message"))//find the link with some texts
{
System.out.println("here is the element you need");
System.out.println("\nlink : " + link.attr("href"));
System.out.println("text : " + link.text());
}
}
It’s recommended to specify a userAgent in jsoup, to avoid HTTP 403 errors.
Document doc = Jsoup.connect("http://anyurl.com").userAgent("Mozilla").get();
There you go, brother; I've done my duty.
Reference:
https://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
I am using the jsoup package to run a Google search and collect the result titles, such as Facebook titles. Here is my code, which prints the titles; from those titles I want to select the Facebook URL.
PROGRAM :
package googlesearch;
import java.io.IOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class SearchRegexDiv {
private static String REGEX = ".?[facebook]";
public static void main(String[] args) throws IOException {
Pattern p = Pattern.compile(REGEX);
String google = "http://www.google.com/search?q=";
//String search = "stackoverflow";
String search = "hortonworks";
String charset = "UTF-8";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
for (Element link: links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
//.?facebook
if (title.matches(REGEX)) {
System.out.println("Done");
title.substring(title.lastIndexOf(" ") + 1); //split the String
//(example.substring(example.lastIndexOf(" ") + 1));
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
}
OUTPUT :
Title: Hortonworks - Facebook logo
URL: https://www.facebook.com/hortonworks/
From the output I get the list of URLs and titles in the above format.
I am trying to match titles containing the word "Facebook", and I want to split such a title into two strings, like:
String social_media = "facebook";
String org = "hortonworks";
Use this code to split your string on multiple characters. Here is a demo that splits on multiple delimiters:
String word = "https://www.facebook.com/hortonworks/";
String [] array = word.split("[/.]");
for (String each1 : array)
System.out.println(each1);
The output is (each split piece on its own line):
https:
www
facebook
com
hortonworks
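For the title itself, rather than the URL, here is a small sketch that checks for "facebook" case-insensitively and pulls the organization out of the "Org - Facebook ..." format seen in your output (the class name and the format assumption are mine):

```java
public class TitleSplit {

    // Returns {socialMedia, org}, or null when the title doesn't mention Facebook.
    static String[] splitTitle(String title) {
        if (!title.toLowerCase().contains("facebook")) {
            // Note: title.matches(".?[facebook]") would not work here, because
            // [facebook] is a character class matching a single letter.
            return null;
        }
        // take everything before the first " - " as the organization name
        String org = title.split(" - ")[0].toLowerCase();
        return new String[] { "facebook", org };
    }

    public static void main(String[] args) {
        String[] parts = splitTitle("Hortonworks - Facebook logo");
        System.out.println("social media = " + parts[0] + ", org = " + parts[1]);
    }
}
```
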
I use JWI in Eclipse to access WordNet on a Mac. I followed this example:
package tutorial;

import java.io.IOException;
import java.net.URL;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.IWord;
import edu.mit.jwi.item.IWordID;
import edu.mit.jwi.item.POS;

public class ExampleWordnet {

    public static void main(String[] args) throws IOException {
        // construct URL to the WordNet dictionary directory on the computer
        String wordNetDirectory = "/Users/Laura/Documents/WordNet-3.0";
        //String path = wordNetDirectory + File.separator + "dict";
        URL url = new URL("file", null, wordNetDirectory);

        // construct the Dictionary object and open it
        IDictionary dict = new Dictionary(url);
        dict.open();

        // look up the first sense of the word "dog"
        IIndexWord idxWord = dict.getIndexWord("dog", POS.NOUN);
        IWordID wordID = idxWord.getWordIDs().get(0);
        IWord word = dict.getWord(wordID);
        System.out.println("Id = " + wordID);
        System.out.println("Lemma = " + word.getLemma());
        System.out.println("Gloss = " + word.getSynset().getGloss());
    }
}
But I cannot run it. I always get this error:
Exception in thread "main" edu.mit.jwi.data.IHasLifecycle$ObjectClosedException
at edu.mit.jwi.CachingDictionary.checkOpen(CachingDictionary.java:111)
at edu.mit.jwi.CachingDictionary.getIndexWord(CachingDictionary.java:172)
at tutorial.ExampleWordnet.main(ExampleWordnet.java:28)
Could anybody please help me?
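One likely cause, judging from the stack trace alone (so this is an assumption): the URL points at the WordNet root rather than its dict subfolder, as the commented-out path line hints, and the boolean returned by dict.open() is never checked, so the lookup runs against a dictionary that silently failed to open. A sketch of building the URL against the dict directory (paths are examples):

```java
import java.io.File;
import java.net.URL;

public class DictUrl {

    // Builds a file: URL pointing at the "dict" folder under the WordNet
    // install directory, which is what JWI's Dictionary expects.
    static URL dictUrl(String wordNetDirectory) {
        try {
            return new File(wordNetDirectory, "dict").toURI().toURL();
        } catch (java.net.MalformedURLException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        URL url = dictUrl("/Users/Laura/Documents/WordNet-3.0");
        System.out.println(url);
        // then: IDictionary dict = new Dictionary(url);
        //       if (!dict.open()) { /* report the bad path instead of continuing */ }
    }
}
```

Checking the return value of dict.open() would have pointed at the bad path immediately instead of failing later inside getIndexWord.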