jsoup : How to search for date text from a webpage - java

Simply put, this is what I am trying to do (I want to use jsoup):
1. Pass only one URL to parse.
2. Search for date(s) mentioned inside the contents of the web page.
3. Extract at least one date from each page's contents.
4. Convert that date into a standard format.
So, for Point #1, what I have now:
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();
Now, here I want to understand what exactly a "Document" is. Has the HTML already been parsed at this point, and in what form does the object hold the page (plain text, a DOM tree, or something else)?
Then, for Point #2, what I have now:
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);
Here, I am trying to match a date regex against the page text to find dates and store them in a string for later use (Point #3), but I am sure I am nowhere near it; I need help here.
I have done Point #4.
So please, can anyone help me understand this and point me in the right direction on how to achieve the four points mentioned above?
Thanks in advance!
Updated:
So here is how I want it to work:
public static void main(String[] args) {
    try {
        // Use a USER AGENT to tell the server that I am a browser, not a bot
        final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        // The only URL I want to parse
        String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
        // Creating a jsoup Connection to the url with the USER AGENT
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        // Retrieving the parsed document
        Document htmlDocument = connection.get();
        /* Up to this point, I have a parsed document of the url page. Is it in plain-text format?
         * If not, in which type or format is it stored in the variable 'htmlDocument'?
         */
        /* Now, if 'htmlDocument' holds the text of the web page,
         * why do I need elements to find dates? Dates can be plain text anywhere in the page,
         * so how am I going to find an element tag for that?
         * As an example, if I wanted to collect text from <p> paragraph tags,
         * I would use this:
         */
        // I am not sure whether this is the right approach
        //***************************************************/
        Elements paragraphs = htmlDocument.getElementsByTag("p");
        for (Element src : paragraphs) {
            System.out.println("text: " + src.text());
        }
        //***************************************************//
        /* But I do not want to find any elements to gather the dates on the page.
         * I just want to search the whole text of the document for a date.
         * So, I need a regex date string that is passed as input to a search method,
         * and that search should run over the text of the parsed document in 'htmlDocument'.
         */
        // At the end we will use only one date from the search result and format it in a standard form.
        // That is it.
        // I was trying something like this:
        //final Elements elements = htmlDocument.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
        Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Elements elements = htmlDocument.getElementsMatchingOwnText(p);
        for (Element e : elements) {
            System.out.println("element = [" + e + "]");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
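If the goal is to search the whole page text rather than individual elements (the question raised in the comments above), one way is to run the regex over the combined page text that jsoup exposes via Document.text(). A minimal sketch, not from the original post, that could slot into the main method above (it additionally needs java.util.regex.Matcher imported):
// Sketch: search the page's combined, normalized text instead of matching elements.
String pageText = htmlDocument.text();
Pattern datePattern = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d");
Matcher matcher = datePattern.matcher(pageText);
if (matcher.find()) {
    String firstDate = matcher.group(); // e.g. "2015-01-26"
    System.out.println("First date found: " + firstDate);
}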

Here is one possible solution I found:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
/**
 * Created by ruben.alfarodiaz on 21/12/2016.
 */
@RunWith(JUnit4.class)
public class StackTest {

    @Test
    public void findDates() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        try {
            String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            // With this pattern we can find all dates in dd/mm/yyyy format;
            // to cover extra formats we would have to create N more patterns.
            Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");
            // Find all document elements whose text contains the searched pattern
            Elements elements = htmlDocument.getElementsMatchingText(pattern);
            // Filter the original elements down to only the leaf elements
            List<Element> finalElements = elements.stream()
                    .filter(elem -> isLastElem(elem, pattern))
                    .collect(Collectors.toList());
            finalElements.forEach(elem ->
                    System.out.println("Node: " + elem.html())
            );
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    // Decides whether the current element is a leaf or contains other dates inside
    private boolean isLastElem(Element elem, Pattern pattern) {
        return elem.getElementsMatchingText(pattern).size() <= 1;
    }
}
The point is to add as many patterns as needed, because I think it would be complex to find a single pattern that matches all possibilities.
Edit: The most important thing is that the library gives you a hierarchy of elements, so you need to iterate over them to find the final leaf. For instance:
<html>
  <body>
    <div>
      20/11/2017
    </div>
  </body>
</html>
If we search for the pattern dd/mm/yyyy, the library will return three elements (html, body, and div), but we are only interested in div.
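To cover several formats and also normalize the first match into a standard form (Point #4 of the original question), one option is to pair each regex with a matching java.time formatter. A minimal sketch, assuming only the two formats discussed so far; real pages will likely need more entries:
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateNormalizer {
    // Each regex is paired with the formatter that can parse its matches.
    private static final Map<Pattern, DateTimeFormatter> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put(Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d"),
                DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        FORMATS.put(Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)"),
                DateTimeFormatter.ofPattern("d/M/yyyy"));
    }

    // Returns the first date found in the text, normalized to ISO-8601, or null.
    // NB: the regexes can still admit impossible dates (e.g. "2015-19-39" passes
    // the first pattern), so a production version should catch
    // DateTimeParseException around parse().
    public static String firstIsoDate(String text) {
        for (Map.Entry<Pattern, DateTimeFormatter> entry : FORMATS.entrySet()) {
            Matcher m = entry.getKey().matcher(text);
            if (m.find()) {
                return LocalDate.parse(m.group(), entry.getValue())
                        .format(DateTimeFormatter.ISO_LOCAL_DATE);
            }
        }
        return null;
    }
}
Calling firstIsoDate(htmlDocument.text()) would then yield the first date found on the page in ISO-8601 form.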

Related

Getting information whether a google search results exists or not (JAVA)

I am trying to parse Google for search results. What I need are not the search results themselves; instead, I need the information whether a search result exists or not!
My problem is that I want to search for quoted, combined strings, e.g. "Max Testperson".
Now Google is really nice and tells me:
We could not find search results for "Max Testperson", but instead for Max Testperson. But!!! I do not need Max Testperson, I need "Max Testperson".
So basically I am not interested in the search results themselves, but in the part before the search results (whether a search string can be found or not!).
I used the following tutorial in Java:
http://mph-web.de/web-scraping-with-java-top-10-google-search-results/
With this I can parse the search results. But like I said, no need for that! I just want to know if my search string exists or not. Since Google removes the ->" "<-, I get search results anyway.
Can anyone help me out with this?
Try adding the GET parameter nfpr=1 to your search to disable the auto-correction feature:
final Document doc = Jsoup.connect("https://google.com/search?q=test"+"&nfpr=1").userAgent(USER_AGENT).get();
Update:
You could parse for the message regarding no result:
public class App {

    public static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";

    public static void main(String[] args) throws Exception {
        String searchTerm = "\"daniel+nasseh\"+\"26.02.1987\"";
        boolean hasExactResults = true;
        final Document doc = Jsoup.connect("https://google.com/search?q=" + searchTerm + "&nfpr=1")
                .userAgent(USER_AGENT).get();
        Elements noResultMessage = doc.select("div.e.obp div.med:first-child");
        if (!noResultMessage.isEmpty()) {
            hasExactResults = false;
            for (Element result : noResultMessage) {
                System.out.println(result.text());
            }
        }
        if (hasExactResults) {
            // Traverse the results
            for (Element result : doc.select("h3.r a")) {
                final String title = result.text();
                final String url = result.attr("href");
                System.out.println(title + " -> " + url);
            }
        }
    }
}
Update 2: The best solution, as presented by Donselm himself in the comments, is to add &tbs=li:1 to force the search for the exact search term:
String searchTerm = "\"daniel+nasseh\"+\"26.02.1987\"";
final Document doc = Jsoup.connect("https://google.com/search?q=" + searchTerm + "&tbs=li:1").userAgent(USER_AGENT).get();

How to check if html document contains string

What would be a fast way to check if the page behind a URL contains a given string? I tried jsoup and pattern matching, but is there a faster way?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupTest {
    public static void main(String[] args) throws Exception {
        String url = "https://en.wikipedia.org/wiki/Hawaii";
        Document doc = Jsoup.connect(url).get();
        String html = doc.html();
        Pattern pattern = Pattern.compile("<h2>Contents</h2>");
        Matcher matcher = pattern.matcher(html);
        if (matcher.find()) {
            System.out.println("Found it");
        }
    }
}
It depends. If your pattern is really only a simple substring to be found exactly in the page content, then both methods you suggest are overkill. If that is indeed the case, you should get the page without parsing it. You can still use jsoup to fetch the page; just don't start the parser:
Connection con = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii");
Connection.Response res = con.execute();
String rawPageStr = res.body();
if (rawPageStr.contains("<h2>Contents</h2>")) {
    // do whatever you need to do
}
If the pattern is indeed a regular expression, use this:
Pattern pattern = Pattern.compile("<h2>\\s*Contents\\s*</h2>");
Matcher matcher = pattern.matcher(rawPageStr);
This only makes sense if you do not need to parse much more of the page. However, if you actually want to perform a structured search of the DOM via CSS selectors, jsoup is not a bad choice, although a SAX-based approach like TagSoup could probably be a bit faster.
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Hawaii").get();
Elements h2s = doc.select("h2");
for (Element h2 : h2s) {
    if (h2.text().equals("Contents")) {
        // do whatever & more
    }
}

jsoup: How to extract correct data from this website

I am trying to extract data from a Spanish dictionary using jsoup. Essentially, the user will input words he wants to define as command line arguments and the program will return a formatted list of definitions. Here is what I have done so far:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        String[] urls = new String[args.length];
        for (int i = 0; i < args.length; i++) {
            urls[i] = "http://www.diccionarios.com/detalle.php?palabra="
                    + args[i]
                    + "&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on";
            try {
                Document doc = Jsoup.connect(urls[i]).get();
                Elements htmly = doc.getElementsByTag("html");
                String untokenized = htmly.text();
                System.out.println(untokenized);
            } catch (Exception e) {
                System.out.println("EXCEPTION: Word is probably not in this dictionary.");
            }
        }
    }
}
That url array gives the correct URLs where the information for the definition is.
Now, what I'm expecting to be returned is what you would get if you went to the try.jsoup website and used (for example) this: http://www.diccionarios.com/detalle.php?palabra=libro&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on
as the link and typed html as the CSS query. I need that data so I can tokenize the definition out of it.
So I guess my question is: what method would I use to obtain the same data that you can see on the try.jsoup website? Thanks a lot!
Edit: This is about interpreting the data from the URL. The end result I want (in this example) is "Conjunto de hojas escritas unidas o cosidas por uno de sus lados y cubiertas por tapas de cartón u otro material." That is the definition on the website. However, I noticed on the try.jsoup website that if I put html in the CSS query box, the result was a huge bunch of text. My assumption was that the following two lines of code would capture this huge bunch of text and save it as a string:
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
However, the output when I print untokenized is instead this: "Usuario Clave ¿Olvidaste tu clave? Condiciones Privacidad Versión completa © 2011 Larousse Editorial, SL." So my question is: how do I obtain the string data for that huge bunch of text found on the try.jsoup website?
EDIT: I followed the advice in this question: Jsoup - CSS Query selector issue (?) and it worked great.
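In short, the fix was to select the specific container that wraps the definition instead of grabbing the whole <html> element. A minimal sketch of that idea (the selector below is hypothetical; inspect the page source to find the element that really wraps the definition):
Document doc = Jsoup.connect(urls[i]).get();
// Hypothetical selector: replace "div.definicion" with whatever element
// actually wraps the definition text on diccionarios.com.
Elements definition = doc.select("div.definicion");
if (!definition.isEmpty()) {
    System.out.println(definition.first().text());
}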

I want to pull Facebook posts from a public page to a Java application

I am creating an app in Java that will take all the information from a public website and load it into the app for people to read, using jsoup. I was trying the same kind of function with Facebook, but it wasn't working the same way. Does anyone have a good idea about how I should go about this?
Thanks,
Calland
public String scrapeEvents(String... args) throws Exception {
    Document doc = Jsoup.connect("http://www.facebook.com/cedarstreettimes?fref=ts").get();
    Elements elements = doc.select("div._wk");
    String s = elements.toString();
    return s;
}
edit: I found this link with information, but I'm a little confused about how to use it to get only the content that the specific user posts on their wall: http://developers.facebook.com/docs/getting-started/graphapi/
I had a look at the source of that page -- the thing that is tripping up the parse is that all the real content is wrapped in comments, like this:
<code class="hidden_elem" id="u_0_42"><!-- <div class="fbTimelineSection ...> --></code>
There is JS on the page that lifts that data into the real DOM, but as jsoup doesn't execute JS it stays as comments. So before extracting the content, we need to emulate that JS and "un-hide" those elements. Here's an example to get you started:
String url = "https://www.facebook.com/cedarstreettimes?fref=ts";
String ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1438.7 Safari/537.33";
Document doc = Jsoup.connect(url).userAgent(ua).timeout(10 * 1000).get();

// move the hidden, commented-out html into the DOM proper:
Elements hiddenElements = doc.select("code.hidden_elem");
for (Element hidden : hiddenElements) {
    for (Node child : hidden.childNodesCopy()) {
        if (child instanceof Comment) {
            hidden.append(((Comment) child).getData()); // comment data parsed as html
        }
    }
}

Elements articles = doc.select("div[role=article]");
for (Element article : articles) {
    if (article.select("span.userContent").size() > 0) {
        String text = article.select("span.userContent").text();
        String imgUrl = article.select("div.photo img").attr("abs:src");
        System.out.println(String.format("%s\n%s\n\n", text, imgUrl));
    }
}
That example pulls out the article text and any photo that is associated with it.
(It's possibly better to use the FB API than this method; I wanted to show how you can emulate little bits of JS to make a scrape work properly.)

Jsoup Html parsing problem finding internal links data

Usually we have many internal links in a file. I want to parse an HTML file such that I get the headings of a page and their corresponding data in a map.
Steps I did:
1) Got all the internal reference elements.
2) Parsed the document for the id = XXX, where XXX == (element <a href="#XXX">).
3) That takes me to the <span id="XXX">little text here </span> <some tags here too ><p> actual text here </p> <p> here too </p>
4) How do I go from the <span> to the <p>?
5) I tried going to the parent of the span, thinking that one of its children is a <p> too... that's true, but it also picks up the <p> elements of other internal links.
EDIT: added a sample html file portion:
<li class="toclevel-1 tocsection-1"><a href="#Enforcing_mutual_exclusion">
<span class="tocnumber">1</span> <span class="toctext">Enforcing mutual exclusion</span> </a><ul>
<li class="toclevel-2 tocsection-2"><a href="#Hardware_solutions">
<span class="tocnumber">1.1</span> <span class="toctext">Hardware solutions</span>
</a></li>
<li class="toclevel-2 tocsection-3"><a href="#Software_solutions">
<h2><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&
amp;action=edit&section=1" title="Edit section: Enforcing mutual exclusion">
edit</a>]</span> <span class="mw-headline" id="Enforcing_mutual_exclusion">
<!-- **see the id above = Enforcing_mutual_exclusion**, which is the same as the
first internal link. Jsoup takes me to this span element. I want to access every
<p> element after this <span> tag and before another <span> tag whose id is any
of the internal links. -->
Enforcing mutual exclusion</span></h2>
<p>There are both software and hardware solutions for enforcing mutual exclusion.
The different solutions are shown below.</p>
<h3><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&
amp;action=edit&section=2" title="Edit section: Hardware solutions">
edit</a>]</span> <span class="mw-headline" id="Hardware_solutions">Hardware
solutions</span></h3>
<p>On a <a href="/wiki/Uniprocessor" title="Uniprocessor" class="mw-
redirect">uniprocessor</a> system a common way to achieve mutual exclusion inside
kernels is
disable <a href="/wiki/Interrupt" title="Interrupt">
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public final class Website {
    private URL websiteURL;
    private Document httpDoc;
    LinkedHashMap<String, ArrayList<String>> internalLinks =
            new LinkedHashMap<String, ArrayList<String>>();

    public Website(URL __websiteURL) throws MalformedURLException, IOException, Exception {
        if (__websiteURL == null)
            throw new Exception();
        websiteURL = __websiteURL;
        httpDoc = Jsoup.parse(connect());
        System.out.println("Parsed the http file to Document");
    }

    /* Here is my function: I first get all the internal links in internalLinksElements.
       I then get the href name of the <a ..> tag so that I can search for it in the document. */
    public void getDataWithHeadingsTogether() {
        Elements internalLinksElements;
        internalLinksElements = httpDoc.select("a[href^=#]");
        for (Element element : internalLinksElements) {
            // Some inline links were bad; I only keep those having a span as their child.
            Elements spanElements = element.select("span");
            if (!spanElements.isEmpty()) {
                System.out.println("Text(): " + element.text()); // this cannot give what I want
                // OK, I get the href tag name, which would be the id
                String href = element.attr("href");
                href = href.replace("#", "");
                System.out.println(href);
                // Selecting the element that has that id.
                Element data = httpDoc.getElementById(href);
                // Got the span
                if (data == null)
                    continue;
                Elements children = new Elements();
                // The problem is here.
                while (children.isEmpty()) {
                    // Going up to the parent element until we get some data.
                    data = data.parent();
                    System.out.println(data);
                    children = data.select("p");
                }
                // It's giving me all the data of the file. That's bad.
                System.out.println(children.text());
            }
        }
    }

    /**
     * @return String the raw html of the document.
     * @throws MalformedURLException
     * @throws IOException
     */
    @SuppressWarnings("CallToThreadDumpStack")
    public String connect() throws MalformedURLException, IOException {
        // Is this thread safe? url.openStream();
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(websiteURL.openStream()));
            System.out.println("Got the reader");
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Bye");
            String html = "<html><h1>Heading 1</h1><body><h2>Heading 2</h2><p>hello</p></body></html>";
            return html;
        }
        String inputLine, result = new String();
        while ((inputLine = reader.readLine()) != null) {
            result += inputLine;
        }
        reader.close();
        System.out.println("Made the html file");
        return result;
    }

    /**
     * @param argv all the command line parameters.
     * @throws MalformedURLException
     * @throws IOException
     */
    public static void main(String[] argv) throws MalformedURLException, IOException, Exception {
        System.setProperty("proxyHost", "172.16.0.3");
        System.setProperty("proxyPort", "8383");
        System.out.println("Sending url");
        // an html file or any url goes here ------------------------------------
        URL url = new URL("put a html file here ");
        Website website = new Website(url);
        System.out.println(url.toString());
        System.out.println("++++++++++++++++++++++++++++++++++++++++++++++++");
        website.getDataWithHeadingsTogether();
    }
}
I think you need to understand that the <span>s you are locating are children of heading elements, and that the data you want to store is made up of siblings of that heading.
Therefore, you need to grab the <span>'s parent and then use nextSibling to collect the nodes that are the data for that <span>. You stop collecting data when you run out of siblings or when you encounter another heading element, because another heading indicates the start of the next item's data.
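A minimal sketch of that sibling walk (assuming, as in the Wikipedia markup above, that the heading wrapping the <span> is an <h2> or <h3>, and using jsoup's Element.nextElementSibling()); this could replace the while loop inside getDataWithHeadingsTogether():
// Sketch: starting from the heading that contains the <span id="XXX">,
// collect the following <p> siblings until the next heading begins.
Element span = httpDoc.getElementById(href); // the <span id="XXX"> inside the heading
if (span != null) {
    Element heading = span.parent(); // e.g. the <h2> that owns the span
    ArrayList<String> paragraphs = new ArrayList<String>();
    Element sibling = heading.nextElementSibling();
    while (sibling != null && !sibling.tagName().matches("h[1-6]")) {
        if (sibling.tagName().equals("p")) {
            paragraphs.add(sibling.text());
        }
        sibling = sibling.nextElementSibling();
    }
    internalLinks.put(span.text(), paragraphs); // heading text -> its paragraphs
}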
