Jsoup parsing html tag from page - java

I am trying to parse pages (a dynamic parser for any page). The code is:
Elements title = doc.select("title");
Elements metades = doc.select("meta[name=description]");
As you can see, I want to extract the title tag. It works fine on almost every website (for example hinddroid.com), but it is unable to parse the title from google.com and youtube.com. I think this is because there is no whitespace between tags; most big websites strip whitespace from their HTML to save bandwidth. Please suggest how I can parse the HTML from these websites.
Full code:
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import java.sql.*;
import java.util.regex.*;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class post_link extends HttpServlet
{
@Override
public void doGet(HttpServletRequest request, HttpServletResponse response)
throws IOException, ServletException
{
response.setContentType("text/html");
PrintWriter out = response.getWriter();
try
{
//out.println("<link rel=\"stylesheet\" type=\"text/css\" href=\"style.css\" /><script src=\"http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.6.3.min.js\"></script><script src=\"jquery-social.js\"></script>");
String linktopro = "http://"+request.getParameter("link_topro");
//String linktopro = "http://hinddroid.com";
Document doc = Jsoup.connect(linktopro).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6").timeout(3000).get();
Elements png = doc.select("img[src]");
Elements title = doc.select("title:first-child");
//Elements title = doc.title();
Elements metades = doc.select("meta[name=description]");
Pattern p1 = Pattern.compile("http://.*|.com*?.(com)");
out.println("<script> var myCars=new Array(");
for(Element pngs : png)
{
Matcher m1 = p1.matcher(pngs.attr("src"));
boolean url = m1.matches();
String baseurl = "";
//out.println(url+"");
if(url)
{ baseurl = ""; }
else
{ baseurl = linktopro; }
out.println("\""+baseurl+""+pngs.attr("src")+"\",");
}
out.println("\"\"");
out.println(");</script>");
String outlink = "<div class=\"linkembox\">"+
"<div class=\"linkembox-img\">"+
"<img src=\"http://hinddroid.com/img/logo.gif\" width=\"150\" height=\"120\" />"+
"<br/><div id=\"linkimg-left\"><</div><div id=\"linkimg-right\">></div>"+
"</div>"+
"<div class=\"linkembox-text\">"+
"<div class=\"h\">"+title.html()+"</div><br/>"+
"<div class=\"h1\">"+metades.attr("content")+"</div>"+
"</div>"+
"</div>";
out.println(outlink);
out.print("<script> left(myCars); </script>");
}
catch(Exception ex)
{
out.print(ex);
}
finally
{
out.close();
}
}
}

I executed your selectors and they work fine. No problem at all:
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("http://facebook.com").get();
    System.out.println("Title: " + doc.title());
    System.out.println("Meta Description: " + doc.select("meta[name=description]").first().attr("content"));
}
With google.com you can only get the <title>, not the <meta name=description ...> tag, because it is not present in the HTML source.
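If the page may not have a description tag at all (as with google.com), it is safer to guard against a null result before reading the attribute. A minimal sketch, using the same Jsoup calls as above:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class TitleAndDescription {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://google.com")
                .userAgent("Mozilla/5.0") // some sites serve different HTML to unknown user agents
                .get();
        System.out.println("Title: " + doc.title());
        // select() returns an empty list when the tag is missing, so check before reading attributes
        Element metaDes = doc.select("meta[name=description]").first();
        System.out.println("Meta Description: " + (metaDes != null ? metaDes.attr("content") : "(none)"));
    }
}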

Related

How to handle URL Encoded Characters in Jsoup

How to handle URL Encoded Characters like colon (%3A) in JSoup connect function?
What you could basically do is encode the URL before you use it in JSOUP.
I believe what you are trying to do here is pass some parameters to the host in the URL itself.
To encode the URL, use the below code:
String url = "https://google.com?q=i wish to search something";
String encodeURL = URLEncoder.encode(url, "UTF8");
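Note, though, that URLEncoder.encode escapes reserved characters such as : and /, so applying it to a full URL will also mangle the scheme; in practice you would usually encode only the query value and append it to the base URL. A minimal sketch (the URL and query are placeholders):
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
public class EncodeQueryValue {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode only the query value, then append it to the base URL
        String query = URLEncoder.encode("i wish to search something", "UTF-8");
        String url = "https://google.com/search?q=" + query;
        System.out.println(url); // https://google.com/search?q=i+wish+to+search+something
    }
}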
Here's the answer to your comment:
package com.abk;
import java.io.IOException;
import java.net.URLDecoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupTest {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(URLDecoder.decode("https://siccode.com/en/business-list/sic%3A2211", "UTF8")).get();
        String title = doc.title();
        System.out.println("title is: " + title);
    }
}
This should work like a charm :)
Use
String decodedString1 = URLDecoder.decode("siccode.com/en/business-list/sic%3A2211", "UTF-8");
As it is URL-encoded, you need to decode it before using it.
A sample for JavaScript:
var str = decodeURIComponent("siccode.com/en/business-list/sic%3A2211");
console.log(str);

Using Jsoup to parse search descriptions in SERPS (Google Results)

I keep running into a problem whenever I try to scrape searches from Google Search results. I am using Jsoup to pull out the HTML, but I am unable to extract the information I need from the page: I am trying to reach the descriptions shown under the result titles. Here is my code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
public class internetSearch {
public void retrieveFileInfo(String pulling) {
Document doc;
try {
String proxyAdress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAdress, proxyPort));
doc = Jsoup
.connect(pulling)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.header("Content-Language", "en-US")
.timeout(0)
.get();
System.out.println(doc.toString());
Elements links = doc.select("div[class=g]");
for (Element link : links) {
Elements titles = link.select("h3[class=r]");
String title = titles.text();
Elements bodies = link.select("span[class=st]");
String body = bodies.text();
System.out.println("Title: " + title);
System.out.println("Body: " + body + "\n");
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
I've used many sources across the web to put this code together, and in the past I tried Selenium as well, but to no avail.
I keep searching through the output for the ".st" class it should be under (in h3, span, .st), and I never find it.
Is Google simply jumbling up the code, or am I missing something vital?
Here is a solution with estivate (a Java DOM parser with annotations, compatible with Jsoup):
Document doc = // here your Jsoup document grabbing
EstivateMapper2 mapper = new EstivateMapper2();
List<GoogleResult> results = mapper.mapToList(doc, GoogleResult.class);
with the definition of GoogleResult as follows:
@Select("div.g")
public class GoogleResult {
    @Text(select = "h3.r")
    public String title;
    @Text(select = "div.s cite")
    public String link;
    @Text(select = "span.st")
    public String body;
}

Jsoup reddit scraper 429 error

So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits such as /r/wallpaper, I get a 429 error and am wondering how to fix this. Totally understand that this code is horrible and this is a pretty noob question, but I'm completely new to this. Anyways:
import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class javascraper{
public static void main (String[]args) throws MalformedURLException
{
Scanner scan = new Scanner (System.in);
System.out.println("Where do you want to store the files?");
String folderpath = scan.next();
System.out.println("What subreddit do you want to scrape?");
String subreddit = scan.next();
subreddit = ("http://reddit.com/r/" + subreddit);
new File(folderpath + "/" + subreddit).mkdir();
//test
try{
//gets http protocol
Document doc = Jsoup.connect(subreddit).timeout(0).get();
//get page title
String title = doc.title();
System.out.println("title : " + title);
//get all links
Elements links = doc.select("a[href]");
for(Element link : links){
//get value from href attribute
String checkLink = link.attr("href");
Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
if (imgCheck(checkLink)){ // checks to see if img link j
System.out.println("link : " + link.attr("href"));
downloadImages(checkLink, folderpath);
}
}
}
catch (IOException e){
e.printStackTrace();
}
}
public static boolean imgCheck(String http){
String png = ".png";
String jpg = ".jpg";
String jpeg = "jpeg"; // no period so checker will only check last four characters
String gif = ".gif";
int length = http.length();
if (http.contains(png)|| http.contains("gfycat") || http.contains(jpg)|| http.contains(jpeg) || http.contains(gif)){
return true;
}
else{
return false;
}
}
private static void downloadImages(String src, String folderpath) throws IOException{
String folder = null;
//Exctract the name of the image from the src attribute
int indexname = src.lastIndexOf("/");
if (indexname == src.length()) {
src = src.substring(1, indexname);
}
indexname = src.lastIndexOf("/");
String name = src.substring(indexname, src.length());
System.out.println(name);
//Open a URL Stream
URL url = new URL(src);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream( folderpath+ name));
for (int b; (b = in.read()) != -1;) {
out.write(b);
}
out.close();
in.close();
}
}
Your issue is caused by the fact that your scraper is violating reddit's API rules. Error 429 means "Too many requests" – you're requesting too many pages too fast.
You can make one request every 2 seconds, and you also need to set a proper user agent (the format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>)). The way it currently looks, your code is running too fast and doesn't specify one, so it's going to be severely rate-limited.
To fix it, first off, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent).
Then, change this (in downloadImages)
URL url = new URL(src);
InputStream in = url.openStream();
to this:
URLConnection connection = (new URL(src)).openConnection();
try {
    Thread.sleep(2000); // delay to comply with rate limiting
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
You'll also want to change this (in main)
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Then your code should stop running into that error.
Note that using reddit's API (i.e. /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.
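For reference, a minimal sketch of fetching that JSON listing through Jsoup (the subreddit URL and user agent are placeholders; ignoreContentType is needed because the response is not HTML):
import java.io.IOException;
import org.jsoup.Jsoup;
public class RedditJsonFetch {
    public static void main(String[] args) throws IOException {
        // Fetch the JSON listing for a subreddit instead of scraping the HTML page
        String json = Jsoup.connect("https://www.reddit.com/r/wallpaper.json")
                .userAgent("bot:my-scraper:v0.1 (by /u/yourname)") // placeholder user agent
                .ignoreContentType(true) // allow a non-HTML (application/json) response
                .execute()
                .body();
        System.out.println(json.substring(0, Math.min(200, json.length())));
    }
}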
As you can look up on Wikipedia, the 429 status code tells you that you have sent too many requests:
The user has sent too many requests in a given amount of time. Intended for use with rate limiting schemes.
A solution would be to slow down your scraper. There are several ways to do this; one would be to sleep between requests.
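For instance, a minimal sketch (the page list and user agent are placeholders) that pauses about two seconds between Jsoup requests:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class SlowScraper {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical list of pages; the point is the pause between requests
        String[] pages = {"https://www.reddit.com/r/wallpaper/", "https://www.reddit.com/r/pics/"};
        for (String page : pages) {
            Document doc = Jsoup.connect(page)
                    .userAgent("bot:my-scraper:v0.1 (by /u/yourname)") // descriptive user agent
                    .get();
            System.out.println(doc.title());
            Thread.sleep(2000); // wait ~2 seconds before the next request
        }
    }
}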

JSOUP not downloading complete html if the webpage is big in size. Any alternatives to this or any workarounds?

I was trying to get an HTML page and parse information from it. I just found out that some of the pages were not completely downloaded using Jsoup. When I checked with the curl command on the command line, the complete page was downloaded. Initially I thought it was site-specific, but then I randomly tried to parse other big web pages with Jsoup and found that it didn't download the complete page either. I tried specifying the user agent and timeout properties, but it still failed to download the whole page. Here is the code I tried:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String[] args) throws MalformedURLException, UnsupportedEncodingException, IOException {
String urlStr = "http://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States";
URL url = new URL(urlStr);
String content = "";
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
for (String line; (line = reader.readLine()) != null;) {
content += line;
}
}
String article1 = Jsoup.connect(urlStr).get().text();
String article2 = Jsoup.connect(urlStr).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6").referrer("http://www.google.com").timeout(30000).execute().parse().text();
String article3 = Jsoup.parse(content).text();
System.out.println("ARTICLE 1 : "+article1);
System.out.println("ARTICLE 2 : "+article2);
System.out.println("ARTICLE 3 : "+article3);
}
}
In Article 1 and Article 2, where I use Jsoup to connect to the website, I do not get the complete content, but when connecting via URL I get the complete page. So basically only Article 3, which was fetched through URL, is complete. I have tried with Jsoup 1.8.1 and Jsoup 1.7.2.
Use the maxBodySize method:
String article = Jsoup.connect(urlStr).maxBodySize(Integer.MAX_VALUE).get().text();
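Jsoup caps how much of the response body it buffers (roughly 1 MB by default in the versions mentioned), which is why large pages come back truncated; maxBodySize(0) removes the limit entirely. A minimal sketch:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class FullPageFetch {
    public static void main(String[] args) throws IOException {
        String urlStr = "http://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States";
        Document doc = Jsoup.connect(urlStr)
                .maxBodySize(0) // 0 = no limit on the downloaded body
                .timeout(30000) // allow extra time for big pages
                .get();
        System.out.println("Characters parsed: " + doc.text().length());
    }
}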

uploading files using jsp and servlet

I am developing a web application using Java. From my index page I need to upload a file along with some other fields, such as text and numbers, using input tags.
This is my JSP file:
<select name="category">
<option value="">-Select-</option>
<option value="Mobile Phones">Mobile Phones</option>
<option value="Automobile">Automobile</option>
<option value="Computers">Computers</option>
</select><br/><br/>
<label>Title: </label><input type="text" name="Title"/><br/><br/>
<label>Photo: </label><input type="file" name="photo"/><br/><br/>
<label>Description: </label><input type="text" name="description"/><br/><br/>
<label>Price: </label><input type="text" name="price"/><br/><br/>
<input type="submit" value="Post">
I found some articles which use Apache Commons, but with all of them I can only get the image; all the other values are set to null. The article I followed is this.
I need to know how to get the other values as well (in this case category, title, photo, etc.).
How can I do that?
Thank you!
EDIT:
This is my servlet:
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import javax.servlet.ServletException;
import javax.servlet.annotation.MultipartConfig;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import com.im.dao.PostAdDao;
import com.im.dao.PostAdDaoImpl;
import com.im.entities.Advertiesment;
@WebServlet("/postAd")
@MultipartConfig
public class PostAdServlet extends HttpServlet {
private static final long serialVersionUID = 1L;
private final String UPLOAD_DIRECTORY = "C:/uploadss";
protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
Advertiesment ad = new Advertiesment();
PostAdDao pad = new PostAdDaoImpl();
PrintWriter out = response.getWriter();
String name = null;
if(ServletFileUpload.isMultipartContent(request)){
try {
List<FileItem> multiparts = new ServletFileUpload(new DiskFileItemFactory()).parseRequest(request);
for(FileItem item : multiparts){
if(item.isFormField()){
String cat = request.getParameter("category");
System.out.println("INFO: Category : "+cat);
if( cat != null ){
ad.setCategory(cat);
}
String title = request.getParameter("adTitle");
if( title != null ){
ad.setTitle(title);
System.out.println("INFO: Title : "+title);
}
String des = request.getParameter("description");
if(des != null){
ad.setDescription(des);
System.out.println("INFO: Description : "+des);
}
try{
Double price = Double.parseDouble(request.getParameter("price"));
if(price != null){
ad.setPrice(price);
System.out.println("INFO: Price : "+price);
}
}catch(Exception e){
System.out.println("ERROR: Occured while setting price in servlet");
}
}else{
name = new File(item.getName()).getName();
item.write( new File(UPLOAD_DIRECTORY + File.separator + name));
}
}
//File uploaded successfully
request.setAttribute("message", "Advertiesment Posted Successfully");
System.out.println("INFO: Advertiesment Posted Successfully");
System.out.println("INFO: File name : "+name);
ad.setPhoto(name);
} catch (Exception ex) {
request.setAttribute("message", "File Upload Failed due to " + ex);
System.out.println("\nERROR: Occured while posting the advertiesment! "+ex );
}
}else{
//request.setAttribute("message","Sorry this Servlet only handles file upload request");
}
//request.getRequestDispatcher("/result.jsp").forward(request, response);
String msg = pad.postAd(ad);
}
}
I found some articles which use Apache commons, but in all of that, I
can get only the image
No. You can get the other items from the request as well.
DiskFileUpload upload = new DiskFileUpload();
List<FileItem> items = upload.parseRequest(request);
for (FileItem item : items) {
    if (item.isFormField()) {
        // get form fields here
    } else {
        // process file upload here
    }
}
Read the documentation here to understand more about this, and also see this thread: values of input text fields in a html multipart form.
Update:
Use the FileItem object instead of request.getParameter(...). Because you are parsing the request yourself, the form fields are only available through the FileItem objects, for example:
if ("category".equals(item.getFieldName())) {
    String cat = item.getString();
}
Do the same for the other fields.
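Putting it together, a minimal sketch of the loop (assuming Commons FileUpload and the same upload directory as above; the field names follow the JSP form):
import java.io.File;
import java.util.List;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
public class MultipartHelper {
    private static final String UPLOAD_DIRECTORY = "C:/uploadss"; // same directory as the servlet above
    // Reads every part of the multipart request: form fields via getString(), the file part via write()
    public static void handle(HttpServletRequest request) throws Exception {
        List<FileItem> items = new ServletFileUpload(new DiskFileItemFactory()).parseRequest(request);
        for (FileItem item : items) {
            if (item.isFormField()) {
                String field = item.getFieldName(); // e.g. "category", "Title", "description", "price"
                String value = item.getString("UTF-8"); // the submitted value
                System.out.println(field + " = " + value);
            } else {
                String name = new File(item.getName()).getName();
                item.write(new File(UPLOAD_DIRECTORY, name)); // save the uploaded photo
            }
        }
    }
}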
