So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits such as /r/wallpaper, I get a 429 error and am wondering how to fix this. Totally understand that this code is horrible and this is a pretty noob question, but I'm completely new to this. Anyways:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.io.*;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.io.*;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public class javascraper{
public static void main (String[]args) throws MalformedURLException
{
Scanner scan = new Scanner (System.in);
System.out.println("Where do you want to store the files?");
String folderpath = scan.next();
System.out.println("What subreddit do you want to scrape?");
String subreddit = scan.next();
subreddit = ("http://reddit.com/r/" + subreddit);
new File(folderpath + "/" + subreddit).mkdir();
//test
try{
//gets http protocol
Document doc = Jsoup.connect(subreddit).timeout(0).get();
//get page title
String title = doc.title();
System.out.println("title : " + title);
//get all links
Elements links = doc.select("a[href]");
for(Element link : links){
//get value from href attribute
String checkLink = link.attr("href");
Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
if (imgCheck(checkLink)){ // checks to see if img link j
System.out.println("link : " + link.attr("href"));
downloadImages(checkLink, folderpath);
}
}
}
catch (IOException e){
e.printStackTrace();
}
}
public static boolean imgCheck(String http){
String png = ".png";
String jpg = ".jpg";
String jpeg = "jpeg"; // no period so checker will only check last four characaters
String gif = ".gif";
int length = http.length();
if (http.contains(png)|| http.contains("gfycat") || http.contains(jpg)|| http.contains(jpeg) || http.contains(gif)){
return true;
}
else{
return false;
}
}
private static void downloadImages(String src, String folderpath) throws IOException{
String folder = null;
//Exctract the name of the image from the src attribute
int indexname = src.lastIndexOf("/");
if (indexname == src.length()) {
src = src.substring(1, indexname);
}
indexname = src.lastIndexOf("/");
String name = src.substring(indexname, src.length());
System.out.println(name);
//Open a URL Stream
URL url = new URL(src);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream( folderpath+ name));
for (int b; (b = in.read()) != -1;) {
out.write(b);
}
out.close();
in.close();
}
}
Your issue is caused by the fact that your scraper is violating reddit's API rules. Error 429 means "Too many requests" – you're requesting too many pages too fast.
You can make one request every 2 seconds, and you also need to set a proper user agent (they format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>)). The way it currently looks, your code is running too fast and doesn't specify one, so it's going to be severely rate-limited.
To fix it, first off, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent).
Then, change this (in downloadImages)
URL url = new URL(src);
InputStream in = url.openStream();
to this:
URLConnection connection = (new URL(src)).openConnection();
Thread.sleep(2000); //Delay to comply with rate limiting
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
You'll also want to change this (in main)
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Then your code should stop running into that error.
Note that using reddit's API (IE, /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.
As you can look up at Wikipedia the 429 status code tells you that you have too many requests:
The user has sent too many requests in a given amount of time. Intended for use with rate limiting schemes.
A solution would be to slow down your scraper. There are some options how to do this, one would be to use sleep.
Related
im developing my Java bot for discord. And I want to send an image. I tried using TextChannel.sendFile(File, Message), but it`s not that result that I want to get. I want this file to be displayed like a normal image.
The imports:
import java.io.File;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import javax.xml.namespace.QName;
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.apache.commons.io.FileUtils;
import net.dv8tion.jda.core.MessageBuilder;
import net.dv8tion.jda.core.entities.Message;
import net.dv8tion.jda.core.entities.TextChannel;
import net.dv8tion.jda.core.events.message.MessageReceivedEvent;
And the other code:
URL url = new URL(s.toString());
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(conn.getInputStream());
final List<String> files = new ArrayList<>();
while (reader.hasNext()) {
XMLEvent e = reader.nextEvent();
if (e.isStartElement()) {
StartElement se = e.asStartElement();
if (se.getName().getLocalPart().equals("post")) {
Attribute purl = se.getAttributeByName(new QName("file_url"));
files.add(purl.getValue());
}
}
}
int rid = ThreadLocalRandom.current().nextInt(files.size() - 1);
String p = files.get(rid);
files.clear();
URL u = new URL(p);
final String[] dots = p.split("\\.");
final String format = dots[dots.length - 1];
File f = new File("its not a porn." + format);
FileUtils.copyURLToFile(url, f);
Message m = new MessageBuilder().append("okay :)").build();
c.sendFile(f, m).queue();
}
I tried to find a solution somewhere but i haven found any info that could help.
At JDA 4.2.0_168
the message on sendFile() is the name of the file that you are sending to the discord servers, so it needs an extension
example:
File f = new File("image.png");
TextChannel.sendFile(f, "image.png").queue();
if you want comments in the message
File f = new File("image.png");
//the name doesn't need to be the same, just the same extension
TextChannel.sendFile(f, "another_name.png").append("okay :)").queue();
Result of last code
Reading through the docs, you need to create a MessageEmbed and add it to the message using m.setEmbed(..)
I used Eclipse Luna 64bit, Maven, docx4j API for PDF conversion, template letter format on which I want my HTML code. This template is saved in my database.
I want to include a hyperlink in the PDF, so my users can on click this link and open it in their browser.
This is my main class:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Properties;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.faces.bean.ManagedBean;
import javax.faces.bean.ManagedProperty;
import javax.faces.bean.ViewScoped;
import javax.faces.model.SelectItem;
import org.apache.commons.lang.StringUtils;
import org.docx4j.Docx4J;
import org.docx4j.XmlUtils;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.relationships.Namespaces;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Body;
import org.docx4j.wml.BooleanDefaultTrue;
import org.docx4j.wml.Document;
import org.docx4j.wml.P;
import org.docx4j.wml.PPrBase;
import org.docx4j.wml.R;
import org.docx4j.wml.Text;
import org.primefaces.context.RequestContext;
import org.primefaces.model.DefaultStreamedContent;
import org.primefaces.model.StreamedContent;
import org.primefaces.model.UploadedFile;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class LetterMaintenanceBean extends BaseManagedBean implements
Serializable {
public StreamedContent previewLetter() {
String content = this.letter.getHtmlContent();
String regex = "<a href=(\"[^\"]*\")[^<]*</a>"; //Digvijay
Pattern p = Pattern.compile(regex); //Digvijay
System.out.println("p: "+p);
Matcher m = p.matcher(content); //Digvijay
System.out.println("m: "+m);
content = m.replaceAll("<strong><u><span style=\"color:#0099cc\">$1</span></u></strong>"); //Digvijay
System.out.println("regex1: "+regex); //Digvijay
Map<String, String> previewExamples = this.getPreviewExamples(this.letter.getMessageTypeCode());
for (Entry<String, String> example : previewExamples.entrySet()) {
if (StringUtils.isNotBlank(example.getKey()) && StringUtils.isNotBlank(example.getValue())) {
content = content.replace(example.getKey(), example.getValue());
System.out.println("content after map date");
}
}
System.out.println("content1:: "+content);
if (!content.startsWith("<div>")) {
content = "<div>" + content + "</div>";
}
// Docx4j does not understand HTML codes for special characters. So replacing with Unicode values.
content = content.replace(" ", " ");
content = content.replace("’", "’");
content = content.replaceAll("</p>", "</p><br/>");
content = content.replaceAll("\"</span>", "</span>");
InputStream stream = null;
try {
System.out.println("content:"+content);
if (this.letter.getHtmlContent().getBytes() != null && this.letter.getWfTemplateId() != null) {
stream = new ByteArrayInputStream(this.HTMLToPDF(content.getBytes(), this.letter.getWfTemplateId()));
} else {
stream = new ByteArrayInputStream(this.HTMLToPDFWithoutTemplate(content.getBytes()));
}
StreamedContent file = new DefaultStreamedContent(stream, "application/pdf", this.letter.getLetterName() + ".pdf");
return file;
} catch (LetterMaintenanceException e) {
this.processServiceException(e);
StreamedContent file = new DefaultStreamedContent(
new ByteArrayInputStream(
"Unable to process your request. If the problem persists, please contact application support."
.getBytes()), "application/pdf", "error" + ".pdf");
return file;
} catch (Exception e) {
this.processGenericException(e);
StreamedContent file = new DefaultStreamedContent(
new ByteArrayInputStream(
"Unable to process your request. If the problem persists, please contact application support."
.getBytes()), "application/pdf", "error" + ".pdf");
return file;
}
}
This is my HTMLToPDF() method:
private byte[] HTMLToPDF(final byte[] htmlContent, final String templateId)
throws Docx4JException, LetterMaintenanceException {
LetterMaintenanceDelegate letterMaintenanceDelegate = new LetterMaintenanceDelegate();
Template template = letterMaintenanceDelegate.retrieveTemplateById(templateId);
if (template == null || template.getContent() == null) {
throw new LetterMaintenanceException("Could not retrieve template");
}
InputStream is = new ByteArrayInputStream(template.getContent());
WordprocessingMLPackage templatePackage = WordprocessingMLPackage.load(is);
// Convert HTML to docx
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(templatePackage);
XHTMLImporter.setHyperlinkStyle("Hyperlink");
templatePackage
.getMainDocumentPart()
.getContent()
.addAll(XHTMLImporter.convert(new ByteArrayInputStream(htmlContent), null));
// Add content of content docx to template
templatePackage.getMainDocumentPart().getContent().addAll(templatePackage.getMainDocumentPart().getContent());
// Handle page breaks
templatePackage = this.handlePagebreaksInDocx(templatePackage);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Docx4J.toPDF(templatePackage, baos);
return baos.toByteArray();
}
}
In this code I am trying to convert HTML (with href tag) to PDF file and in the PDF output the hyperlink must work.
The current output of this program is a PDF but there are no working links in it.
How can I activate my links?
The split method returns spaces and i need to return all elements of the string so i can pick the values i want. it works fine on Android but not on app engine. please help i need the html as an array of strings with no spaces. no this is not a duplicate of the other question. look am using the right regex "\s+"
import com.google.api.server.spi.config.Api;
import com.google.api.server.spi.config.ApiMethod;
import com.google.api.server.spi.config.ApiNamespace;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;
import javax.inject.Named;
/**
* An endpoint class we are exposing
*/
#Api(name = "myApi", version = "v1", namespace = #ApiNamespace(ownerDomain = "backend.abokiforex.greatcallie.com", ownerName = "backend.abokiforex.greatcallie.com", packagePath = ""))
public class RateEndPoint {
String[] html;
private static final Logger LOG = Logger.getLogger(RateEndPoint.class.getName());
/**
* A simple endpoint method that takes a name and says Hi back
*/
#ApiMethod(name = "getRates")
public MyRates getRates() {
MyRates response = new MyRates();
try {
Document site = Jsoup.connect("http://abokifx.com/").timeout(0).get();
Elements tags = site.select("p");
String txt = tags.text();
html = txt.split("\\s+");
} catch (Exception e) {
e.printStackTrace();
}
for (int i = 0; i < html.length; i++){
LOG.info(html[i] +"\n");
}
response.setValue1(html[18]);
response.setValue2(html[20]);
response.setValue3(html[21]);
response.setValue4(html[23]);
response.setValue5(html[24]);
response.setValue6(html[26]);
response.setValue7(html[97]);
response.setValue8(html[98]);
response.setValue9(html[99]);
response.setValue10(html[70]);
response.setValue11(html[71]);
response.setValue12(html[72]);
return response;
}
}
okay here is the answer. the app engine wasnt using utf-8(unicode) and so it wasnt getting the regex right. i defined it in the appengine-web xml file for it to work
I am using DOCX4J to convert the DOCX to HTML .I have successfully done the conversion and got the html format.I will be using the html format to embed it as EMAIL body to send an email.But I have some issues which are listed below....
Unable to display images in email body
Losing the spaces and bullets
Please find the code which I have written,
WordprocessingMLPackage wordMLPackage;
wordMLPackage = Docx4J.load(new java.io.File(resourcePath2));
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath(imageFolder + resourcePath2 + "_files");
htmlSettings.setImageTargetUri(imageFolder +resourcePath2.substring(resourcePath2.lastIndexOf("/")+1) + "_files");
htmlSettings.setWmlPackage(wordMLPackage);
OutputStream os;
os = new ByteArrayOutputStream();
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_SAVE_FLAT_XML);
DOCX = ((ByteArrayOutputStream)os).toString();
You may add like this in your code
package tcg.doc.web.managedBeans;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;
#Component
#Scope("session")
#Qualifier("ConvertWord")
public class ConvertWord {
private static final String docName = "TestDocx.docx";
private static final String outputlFolderPath = "d:/";
String htmlNamePath = "docHtml.html";
String zipName="_tmp.zip";
File docFile = new File(outputlFolderPath+docName);
File zipFile = new File(zipName);
public void ConvertWordToHtml() {
try {
// 1) Load DOCX into XWPFDocument
InputStream doc = new FileInputStream(new File(outputlFolderPath+docName));
System.out.println("InputStream"+doc);
XWPFDocument document = new XWPFDocument(doc);
// 2) Prepare XHTML options (here we set the IURIResolver to load images from a "word/media" folder)
XHTMLOptions options = XHTMLOptions.create(); //.URIResolver(new FileURIResolver(new File("word/media")));;
// Extract image
String root = "target";
File imageFolder = new File( root + "/images/" + doc );
options.setExtractor( new FileImageExtractor( imageFolder ) );
// URI resolver
options.URIResolver( new FileURIResolver( imageFolder ) );
OutputStream out = new FileOutputStream(new File(htmlPath()));
XHTMLConverter.getInstance().convert(document, out, options);
System.out.println("OutputStream "+out.toString());
} catch (FileNotFoundException ex) {
} catch (IOException ex) {
}
}
public static void main(String[] args) {
ConvertWord cwoWord=new ConvertWord();
cwoWord.ConvertWordToHtml();
System.out.println();
}
public String htmlPath(){
// d:/docHtml.html
return outputlFolderPath+htmlNamePath;
}
public String zipPath(){
// d:/_tmp.zip
return outputlFolderPath+zipName;
}
}
For maven Dependency on pom.xml
<dependency>
<groupId>fr.opensagres.xdocreport</groupId>
<artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
<version>1.0.4</version>
</dependency>
or download it from Here
For images to work in an email body, I guess you need to use either a data URI or publish them to a web-reachable location.
In either case, you'll need to write an implementation of:
public interface ConversionImageHandler {
/**
* #param picture
* #param relationship of the image
* #param part of the image, if it is an internal image, otherwise null
* #return uri for the image we've saved, or null
* #throws Docx4JException this exception will be logged, but not propagated
*/
public String handleImage(AbstractWordXmlPicture picture, Relationship relationship, BinaryPart part) throws Docx4JException;
}
and configure docx4j to use it with htmlSettings.setImageHandler.
You can look at some of the existing implementations in the docx4j source code, and take advantage of the helper methods in AbstractConversionImageHandler (eg createEncodedImage if you want data URIs).
I have written a code which will fetch me a html contents of the page as response , I am using HTML Unit to do so . But I am getting error's for some specific urls like
[https://communities.netapp.com/welcome][1]
For first page i am able to retrieve the contents . But when i dont the content which we get using load more button .
Here's my code:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class Sample {
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
String url = "https://communities.netapp.com/welcome";
WebClient client = new WebClient(BrowserVersion.INTERNET_EXPLORER_9);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setRedirectEnabled(true);
client.getOptions().setThrowExceptionOnScriptError(true);
client.getOptions().setCssEnabled(true);
client.getOptions().setUseInsecureSSL(true);
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.setAjaxController(new NicelyResynchronizingAjaxController());
HtmlPage page = client.getPage(url);
Writer output = null;
String text = page.asText();
File file = new File("D://write6.txt");
output = new BufferedWriter(new FileWriter(file));
output.write(text);
output.close();
System.out.println("Your file has been written");
// System.out.println("as Text ==" +page.asText());
// System.out.println("asXML == " +page.asXml());
// System.out.println("text content ==" +page.getTextContent());
// System.out.println(page.getWebResponse().getContentAsString());
}
}
Any suggestion ?
As i understand from your question you have a button on which you have to press.
Please look at: http://htmlunit.sourceforge.net/gettingStarted.html
You have there an example of submitting a form.
This should be very similar here