Document doc = Jsoup.connect("https://www.amazon.com/gp/product/B01MXLQ5TM/").get();
String title = doc.title();
System.out.println("TITLE "+title);
Element reviews = doc.getElementById("reviewsMedley");
System.out.println(" " + reviews.text());
Hey, I am working on data extraction using jsoup and extracting reviews from Amazon. This is my code, it gives me reviews from first page. How can I transform it to get reviews from all pages.
Here is my simple implementation of Amazon review crawler.
package com.mycompany.amazon.crawler;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class AmazonCrawler {
private static final Logger LOG = LogManager.getLogger(AmazonCrawler.class);
public static void main(String[] args) throws IOException {
List<Review> reviews = new ArrayList<>();
int pageNumber = 1;
while (true) {
/*
URL is changed after saving answer, change it to this:
https://www.amazon.com/Dell-Inspiron-Touchscreen-Performance-Bluetooth/product-reviews/B01MXLQ5TM/ref=cm_cr_getr_d_paging_btm_ + pageNumber + ?reviewerType=all_reviews&pageNumber= + pageNumber
*/
String url = "https://www.amazon.com/Dell-Inspiron-Touchscreen-Performance-Bluetooth/product-reviews/B01MXLQ5TM/ref=cm_cr_getr_d_paging_btm_" + pageNumber + "?reviewerType=all_reviews&pageNumber=" + pageNumber;
LOG.info("Crawling URL: {}", url);
Document doc = Jsoup.connect(url).get();
Elements reviewElements = doc.select(".review");
if (reviewElements == null || reviewElements.isEmpty()) {
break;
}
for (Element reviewElement : reviewElements) {
Element titleElement = reviewElement.select(".review-title").first();
if (titleElement == null) {
LOG.error("Title element is null");
continue;
}
String title = titleElement.text();
Element textElement = reviewElement.select(".review-text").first();
if (textElement == null) {
LOG.error("Text element is null");
continue;
}
String text = textElement.text();
reviews.add(new Review(title, text));
}
pageNumber++;
}
LOG.info("Number of reviews: {}", reviews.size());
for (Review review : reviews) {
System.out.println(review.getTitle());
System.out.println(review.getText());
System.out.println("\n");
}
}
static class Review {
private final String title;
private final String text;
public Review(String title, String text) {
this.title = title;
this.text = text;
}
public String getTitle() {
return title;
}
public String getText() {
return text;
}
}
}
I know this is flagged for JSoup, but wouldn't it be more reliable to simply use Amazon's API for retrieving this data?
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/EX_RetrievingCustomerReviews.html
I have created a web scraper which brings the market data of share rates from the website of stock exchange. www.psx.com.pk in that site there is a hyperlink of Market Summary. From that link I have to scrap the data. I have created a program which is as follows.
package com.market_summary;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ComMarket_summary {
boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ",";
public static String line = "";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();
public static void createConnection() throws IOException {
System.setProperty("http.proxyHost", "191.1.1.202");
System.setProperty("http.proxyPort", "8080");
String tempUrl = "http://www.psx.com.pk/index.php";
doc = Jsoup.connect(tempUrl).get();
System.out.println("Successfully Connected");
}
public static void parsingHTML() throws Exception {
File fold = new File("C:\\market_smry.csv");
fold.delete();
File fnew = new File("C:\\market_smry.csv");
for (Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
trElement2 = trElement.getElementsByTag("td");
tdElements = trElement.getElementsByTag("td");
FileWriter sb = new FileWriter(fnew, true);
if (trElement.hasClass("marketData")) {
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.append("\r\n");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement2 = it.next();
final String content = tdElement2.text();
if (it2.hasNext()) {
sb.append(formatData(content));
sb.append(" | ");
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
}
System.out.println(sampleList.add(tdElements));
}
}
}
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-YYYY", Locale.US);
public static String formatData(String text) {
String tmp = null;
try {
Date d = FORMATTER_MMM_d_yyyy.parse(text);
tmp = FORMATTER_dd_MMM_yyyy.format(d);
} catch (ParseException pe) {
tmp = text;
}
return tmp;
}
public static void main(String[] args) throws IOException, Exception {
createConnection();
parsingHTML();
}
}
Now, the problem is when I execute this program it should create a .csv file but what actually happens is it's not creating any file. When I debug this code I found that program is not entering in the loop. I don't understand that why it is doing so. While when I run the same program on the other website which have slightly different page structure it is running fine.
What I understand that this data is present in the #document which is a virtual element and doesn't mean anything that's why program can't read it while there is no such thing in other website. Kindly, help me out to read the data inside the #document element.
Long Story Short
Change your temp url to http://www.psx.com.pk/phps/index1.php
Explanation
There is no table in the document of http://www.psx.com.pk/index.php.
Instead it is showing it's content in two frameset.
One is dummy with url http://www.psx.com.pk/phps/blank.php.
Another one is the real page which is showing actual data and it's url is
http://www.psx.com.pk/phps/index1.php
I am trying to parse through a XML format String for example;
<params city="SANTA ANA" dateOfBirth="1970-01-01"/>
My goal is to add attributes name in an array list such as {city,dateOfBirth} and values of the attributes in another array list such as {Santa Ana, 1970-01-01}
any advice, please help!
Create SAXParserFactory.
Create SAXParser.
Create YourHandler, which extends DefaultHandler.
Parse your file using SAXParser and YourHandler.
For example:
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
parser.parse(yourFile, new YourHandler());
} catch (ParserConfigurationException e) {
System.err.println(e.getMessage());
}
where, yourFile - object of the File class.
In YourHandler class:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class YourHandler extends DefaultHandler {
String tag = "params"; // needed tag
String city = "city"; // name of the attribute
String value; // your value of the city
#Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if(localName.equals(tag)) {
value = attributes.getValue(city);
}
}
public String getValue() {
return value;
}
}`
More information for SAX parser and DefaultHandler here and here respectively.
Using JDOM (http://www.jdom.org/docs/apidocs/):
String myString = "<params city='SANTA ANA' dateOfBirth='1970-01-01'/>";
SAXBuilder builder = new SAXBuilder();
Document myStringAsXML = builder.build(new StringReader(myString));
Element rootElement = myStringAsXML.getRootElement();
ArrayList<String> attributeNames = new ArrayList<String>();
ArrayList<String> values = new ArrayList<String>();
List<Attribute> attributes = new ArrayList<Attribute>();
attributes.addAll(rootElement.getAttributes());
Iterator<Element> childIterator = rootElement.getDescendants();
while (childIterator.hasNext()) {
Element childElement = childIterator.next();
attributes.addAll(childElement.getAttributes());
}
for (Attribute attribute: attributes) {
attributeNames.add(attribute.getName());
values.add(attribute.getValue());
}
System.out.println("Attribute names: " + attributeNames);
System.out.println("Values: " + values);
I need to extract just a single value from a web page. This value is a random number which is generated each time the page is visited. I won't post the full page source but the string that contains the value is:
<span class="label label-info pull-right">Expecting 937117</span>
The "937117" is the value I'm after here. Thanks
Update
Here is what I've got so far:
Document doc = Jsoup.connect("www.mywebsite.com).get();
Elements value = doc.select("*what do I put in here?*");
System.out.println(value);
Everything is described clearly in following snippet. I had created a HTML file with a similar SPAN tag inside. Use Document.select() to select elements with specific class name that you want.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.select.Elements;
public static void main(String[] args) {
String sourceDir = "C:/Users/admin/Desktop/test.html";
test(sourceDir);
}
private static void test(String htmlFile) {
File input = null;
Document doc = null;
Elements classEles = null;
try {
input = new File(htmlFile);
doc = Jsoup.parse(input, "ASCII", "");
doc.outputSettings().charset("ASCII");
doc.outputSettings().escapeMode(EscapeMode.base);
/** Find all SPAN element with matched CLASS name **/
classEles = doc.select("span.label.label-info.pull-right");
if (classEles.size() > 0) {
String number = classEles.get(0).text();
System.out.println("number: " + number);
}
else {
System.out.println("No SPAN element found with class label label-info pull-right.");
}
} catch (Exception e) {
e.printStackTrace();
}
}
can you not use javascript regular expression syntax? If you know the element you are interested in, extract it as a string $stuff from jsoup, then just do
$stuff.match( /Expecting (\d*)/ )[1]
public void yourMethod() {
try {
Document doc = connect("http://google.com").userAgent("Mozilla").get();
Elements value = doc.select("span.label label-info pull-right");
} catch (IOException e) {
e.printStackTrace();
}
}
I'm using docx4j in an XPages application to create Word documents containing content from an XPage. The Word document (in .docx format) is created based on a template (also in docx.format). This all works fine. However, when I change the template from a .docx to a .dotx format, the Word document (.docx) which is generated cannot be opened. On trying to open the document, I get an error saying that the content causes problems.
Can anyone tell me how to convert a .dotx file to a .docx file using docx4j?
The code I am currently using is:
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.JAXBException;
import org.docx4j.XmlUtils;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.ContentAccessor;
import org.slf4j.impl.*;
import java.io.FileInputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.docx4j.wml.*;
import org.apache.commons.lang3.StringUtils;
import java.util.Enumeration;
import java.util.Map;
import java.util.Iterator;
import java.util.Vector;
import lotus.domino.Document;
import lotus.domino.*;
import org.docx4j.openpackaging.parts.WordprocessingML.DocumentSettingsPart;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.parts.relationships.RelationshipsPart;
import org.docx4j.openpackaging.parts.relationships.Namespaces;
public class JavaTemplateDocument {
public void mainCode(Session session, Document currDoc, String empLang, String templateType, String sArt) throws Exception {
Database dbCurr = session.getCurrentDatabase();
String viewName = "vieTemplateLookup";
View tview = dbCurr.getView(viewName);
Vector viewKey = new Vector();
viewKey.addElement(empLang);
viewKey.addElement(templateType);
Document templateDoc = tview.getDocumentByKey(viewKey);
if (tview.getDocumentByKey(viewKey) == null ) System.out.println("templateDoc is NULL");
Item itmNotesFields = templateDoc.getFirstItem("NotesFieldList");
Item itmWordFields = templateDoc.getFirstItem("WordFieldList");
Vector<String[]> notesFields = itmNotesFields.getValues();
Vector<String[]> wordFields = itmWordFields.getValues();
int z = notesFields.size();
int x = wordFields.size();
Enumeration e1 = notesFields.elements();
Enumeration e2 = wordFields.elements();
WordprocessingMLPackage template = getTemplate("C:\\Temp\\AZG Sample Template.dotx","C:\\Temp\\AZG Sample Template.docx");
for (int y = 0; y < x; y++) {
if (currDoc.hasItem(String.valueOf(notesFields.elementAt(y)))) {
Item itmNotesName = currDoc.getFirstItem(String.valueOf(notesFields.elementAt(y)));
replacePlaceholder(template, itmNotesName.getText(), String.valueOf(wordFields.elementAt(y))); }
else {
replacePlaceholder(template, "", String.valueOf(wordFields.elementAt(y)));
}
}
writeDocxToStream(template, "C:\\Temp\\AZG Sample Document.docx");
createResponseDocument(dbCurr, currDoc, templateDoc, sArt);
}
private void createResponseDocument(Database dbCurr, Document currDoc, Document templateDoc, String sArt) throws NotesException{
Document respDoc = dbCurr.createDocument(); // create the response document
String refVal = currDoc.getUniversalID();
respDoc.appendItemValue("IsDocTemplate", "1");
if (currDoc.hasItem("Name")) {
respDoc.appendItemValue("Name", currDoc.getItemValue("Name"));}
else {System.out.println("Name is not available"); }
if (currDoc.hasItem("Firstname")) {
respDoc.appendItemValue("Firstname", currDoc.getItemValue("Firstname"));}
else {System.out.println("Firstname is not available"); }
if (currDoc.hasItem("ReferenceTypeTexts")) {
respDoc.appendItemValue("ReferenceTypeTexts", currDoc.getItemValue("ReferenceTypeTexts"));}
else {System.out.println("ReferenceTypeTexts is not available"); }
if (currDoc.hasItem("ReferenceType")) {
respDoc.appendItemValue("ReferenceType", currDoc.getItemValue("ReferenceType"));}
else {System.out.println("ReferenceType is not available"); }
System.out.println("Append Form value");
respDoc.appendItemValue("Form", "frmRespTempl");
respDoc.makeResponse(currDoc);
RichTextItem body = respDoc.createRichTextItem("Body");
body.embedObject(1454, "", "C:\\Temp\\AZG Sample Document.docx", null);
respDoc.save();
}
/*
* Create a simple word document that we can use as a template.
* For this just open Word, create a new document and save it as template.docx.
* This is the word template we'll use to add content to.
* The first thing we need to do is load this document with docx4j.
*/
private WordprocessingMLPackage getTemplate(String source, String target) throws Docx4JException, FileNotFoundException, IOException {
String WORDPROCESSINGML_DOCUMENT = "application/vnd.openxmlformats- officedocument.wordprocessingml.document.main+xml";
final ContentType contentType = new ContentType(WORDPROCESSINGML_DOCUMENT);
String templatePath = source;
File sourceFile = new File(source);
File targetFile = new File(target);
copyFileUsingFileChannels(sourceFile, targetFile);
WordprocessingMLPackage template = WordprocessingMLPackage.load(new FileInputStream(targetFile));
ContentTypeManager ctm = wordMLPackage.getContentTypeManager();
ctm.addOverrideContentType(new URI("/word/document.xml"),WORDPROCESSINGML_DOCUMENT);
DocumentSettingsPart dsp = new DocumentSettingsPart();
CTSettings settings = Context.getWmlObjectFactory().createCTSettings();
dsp.setJaxbElement(settings);
wordMLPackage.getMainDocumentPart().addTargetPart(dsp);
// Create external rel
RelationshipsPart rp = RelationshipsPart.createRelationshipsPartForPart(dsp);
org.docx4j.relationships.Relationship rel = new org.docx4j.relationships.ObjectFactory().createRelationship();
rel.setType( Namespaces.ATTACHED_TEMPLATE );
rel.setTarget(templatePath);
rel.setTargetMode("External");
rp.addRelationship(rel); // addRelationship sets the rel's #Id
settings.setAttachedTemplate(
(CTRel)XmlUtils.unmarshalString("<w:attachedTemplate xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" xmlns:r=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships\" r:id=\"" + rel.getId() + "\"/>", Context.jc, CTRel.class)
);
return template;
}
private static List<Object> getAllElementFromObject(Object obj, Class<?> toSearch) {
List<Object> result = new ArrayList<Object>();
if (obj instanceof JAXBElement) obj = ((JAXBElement<?>) obj).getValue();
if (obj.getClass().equals(toSearch))
result.add(obj);
else if (obj instanceof ContentAccessor) {
List<?> children = ((ContentAccessor) obj).getContent();
for (Object child : children) {
result.addAll(getAllElementFromObject(child, toSearch));
}
}
return result;
}
/*
* This will look for all the Text elements in the document, and those that match are replaced with the value we specify.
*/
private void replacePlaceholder(WordprocessingMLPackage template, String name, String placeholder ) {
List<Object> texts = getAllElementFromObject(template.getMainDocumentPart(), Text.class);
for (Object text : texts) {
Text textElement = (Text) text;
if (textElement.getValue().equals(placeholder)) {
textElement.setValue(name);
}
}
}
/*
* write the document back to a file
*/
private void writeDocxToStream(WordprocessingMLPackage template, String target) throws IOException, Docx4JException {
File f = new File(target);
template.save(f);
}
/*
* Example code for replaceParagraph
*
String placeholder = "SJ_EX1";
String toAdd = "jos\ndirksen";
replaceParagraph(placeholder, toAdd, template, template.getMainDocumentPart());
*/
private void replaceParagraph(String placeholder, String textToAdd, WordprocessingMLPackage template, ContentAccessor addTo) {
// 1. get the paragraph
List<Object> paragraphs = getAllElementFromObject(template.getMainDocumentPart(), P.class);
P toReplace = null;
for (Object p : paragraphs) {
List<Object> texts = getAllElementFromObject(p, Text.class);
for (Object t : texts) {
Text content = (Text) t;
if (content.getValue().equals(placeholder)) {
toReplace = (P) p;
break;
}
}
}
// we now have the paragraph that contains our placeholder: toReplace
// 2. split into seperate lines
String as[] = StringUtils.splitPreserveAllTokens(textToAdd, '\n');
for (int i = 0; i < as.length; i++) {
String ptext = as[i];
// 3. copy the found paragraph to keep styling correct
P copy = (P) XmlUtils.deepCopy(toReplace);
// replace the text elements from the copy
List<?> texts = getAllElementFromObject(copy, Text.class);
if (texts.size() > 0) {
Text textToReplace = (Text) texts.get(0);
textToReplace.setValue(ptext);
}
// add the paragraph to the document
addTo.getContent().add(copy);
}
// 4. remove the original one
((ContentAccessor)toReplace.getParent()).getContent().remove(toReplace);
}
/*
* A set of hashmaps that contain the name of the placeholder to replace and the value to replace it with.
*
* Map<String,String> repl1 = new HashMap<String, String>();
repl1.put("SJ_FUNCTION", "function1");
repl1.put("SJ_DESC", "desc1");
repl1.put("SJ_PERIOD", "period1");
Map<String,String> repl2 = new HashMap<String, String>();
repl2.put("SJ_FUNCTION", "function2");
repl2.put("SJ_DESC", "desc2");
repl2.put("SJ_PERIOD", "period2");
Map<String,String> repl3 = new HashMap<String, String>();
repl3.put("SJ_FUNCTION", "function3");
repl3.put("SJ_DESC", "desc3");
repl3.put("SJ_PERIOD", "period3");
replaceTable(new String[]{"SJ_FUNCTION","SJ_DESC","SJ_PERIOD"}, Arrays.asList(repl1,repl2,repl3), template);
*/
private void replaceTable(String[] placeholders, List<Map<String, String>> textToAdd,
WordprocessingMLPackage template) throws Docx4JException, JAXBException {
List<Object> tables = getAllElementFromObject(template.getMainDocumentPart(), Tbl.class);
// 1. find the table
Tbl tempTable = getTemplateTable(tables, placeholders[0]);
List<Object> rows = getAllElementFromObject(tempTable, Tr.class);
// first row is header, second row is content
if (rows.size() == 2) {
// this is our template row
Tr templateRow = (Tr) rows.get(1);
for (Map<String, String> replacements : textToAdd) {
// 2 and 3 are done in this method
addRowToTable(tempTable, templateRow, replacements);
}
// 4. remove the template row
tempTable.getContent().remove(templateRow);
}
}
private Tbl getTemplateTable(List<Object> tables, String templateKey) throws Docx4JException, JAXBException {
for (Iterator<Object> iterator = tables.iterator(); iterator.hasNext();) {
Object tbl = iterator.next();
List<?> textElements = getAllElementFromObject(tbl, Text.class);
for (Object text : textElements) {
Text textElement = (Text) text;
if (textElement.getValue() != null && textElement.getValue().equals(templateKey))
return (Tbl) tbl;
}
}
return null;
}
private static void addRowToTable(Tbl reviewtable, Tr templateRow, Map<String, String> replacements) {
Tr workingRow = (Tr) XmlUtils.deepCopy(templateRow);
List<?> textElements = getAllElementFromObject(workingRow, Text.class);
for (Object object : textElements) {
Text text = (Text) object;
String replacementValue = (String) replacements.get(text.getValue());
if (replacementValue != null)
text.setValue(replacementValue);
}
reviewtable.getContent().add(workingRow);
}
private static void copyFileUsingFileChannels(File source, File dest)
throws IOException {
FileChannel inputChannel = null;
FileChannel outputChannel = null;
try {
inputChannel = new FileInputStream(source).getChannel();
outputChannel = new FileOutputStream(dest).getChannel();
outputChannel.transferFrom(inputChannel, 0, inputChannel.size());
} finally {
inputChannel.close();
outputChannel.close();
}
}
}
Broadly, there are a few things that comprise the difference between a template (.dotx) and a document (.docx). This means you have a few things that you need to do -- it's not as simple as just changing the file extension, whether you're saving a doc as a template, or attempting to create a document from a template.
Hopefully this outline will assist:
First do what you've already done: your new document should be a file copy of the template
Change your new WordprocessingMLPackage's document type as appropriate (see WORDPROCESSINGML_TEMPLATE in the ContentTypes class)
Create an attached template and attach it to your document: see the sample code on Github for more detail on that (TemplateAttach.java sample).
Good luck!
Let's hack it.
New office formats are just ZIPs with many XML configurations and data. Try to save identical document as template and document in MS Word. IMHO the core of your problem is in (packed) file [Content_Types].xml.
They differ in the property:
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
I would expect #benpoole's advice should work (it should alter the content of said file). If that is not the case, simply hack the content of it inside the file (it is just ordinary ZIP archive, remember).
Disclaimer: there IS difference in few more files, that might need tweaking to make it work.
I would say that you need to change the returning filename to a dotx from docx
do a filecopy from docx to dotx and change this row
body.embedObject(1454, "", "C:\\Temp\\AZG Sample Document.dotx", null);