pull data with Jsoup and write into excel file

I am attempting to use Jsoup in Eclipse to pull data from an HTML table and write that data into an Excel file. I am able to pull the table data out of the HTML, but it all comes back as one string, which is difficult to write to an Excel file. I am not sure if this is the correct way to pull the information, but this is my current code:
public static void main(String args[]) {
    Document document;
    try {
        document = Jsoup.connect("https://companiesmarketcap.com/usa/largest-companies-in-the-usa-by-market-cap/?page=1/").get();
        Elements trs = document.select("td");
        for (Element NEW : trs) {
            Elements table = NEW.getElementsByTag("td");
            Element td = table.first();
            String TA = td.text();
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
This gets me my table data, but I am not able to restructure it to put it into an Excel document. Any help is much appreciated!

You have to select the individual elements more specifically so that you can format them the way you want. Keep studying the Jsoup documentation and CSS selectors. I kept it simple: the code below prints tab-separated rows that you can copy and paste into Excel.
public static void main(String[] args) {
    try {
        StringBuilder sb = new StringBuilder();
        Document document = Jsoup.connect("https://companiesmarketcap.com/usa/largest-companies-in-the-usa-by-market-cap/?page=1/").get();
        Elements thList = document.select("table > thead > tr > th"); // Get table column headers
        for (Element th : thList) {
            sb.append(th.text()).append("\t"); // Append table header's text content + tab separator
        }
        if (!thList.isEmpty()) {
            sb.setLength(sb.length() - 1); // Delete the last tab (only if a header was appended)
        }
        sb.append("\n"); // Append new line to end of header's content
        Elements trList = document.select("table > tbody > tr"); // Get all rows in the table body
        for (Element tr : trList) {
            Elements tdList = tr.select("td"); // Get column details
            for (Element td : tdList) {
                sb.append(td.text()).append("\t"); // Append table details' text content + tab separator
            }
            if (!tdList.isEmpty()) {
                sb.setLength(sb.length() - 1); // Delete the last tab (only if a cell was appended)
            }
            sb.append("\n"); // Append new line to end of row's content
        }
        System.out.println(sb);
    } catch (IOException e) {
        System.out.println(e.getMessage());
    }
}
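If the goal is a file that Excel can open directly rather than text to paste, one option is to write the rows as CSV. The following is a minimal sketch under assumptions: the header and cell text have already been collected into lists of strings (for example from th.text() and td.text() above), and the class name CsvSketch, its methods, and the sample values are made up for illustration. It uses only the JDK; producing a real .xlsx file would need a library such as Apache POI.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Jsoup: turns rows of cell text into CSV.
class CsvSketch {

    // Quote a cell if it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Join one row's cells with commas.
    static String toCsvLine(List<String> cells) {
        List<String> escaped = new ArrayList<>();
        for (String cell : cells) {
            escaped.add(escape(cell));
        }
        return String.join(",", escaped);
    }

    // Write all rows to a file, one CSV line per row.
    static void write(Path target, List<List<String>> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(toCsvLine(row));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample values stand in for the scraped header and cell text.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Rank", "Name", "Market Cap"),
                Arrays.asList("1", "Example, Inc.", "$1.00 T"));
        for (List<String> row : rows) {
            System.out.println(toCsvLine(row));
        }
    }
}
```

Calling CsvSketch.write with a target path such as Paths.get("companies.csv") would save the same lines to disk. Quoting matters here because company names and formatted numbers can contain commas.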

Related

Trying to iterate over very similar Elements in an XML file. NOTE XML file is attribute less

Hello, I have a question about XML and Java. I have an unusual XML file with no attributes, only elements. I'm trying to zero in on a specific element block and then iterate over all of the similar element blocks.
<InstrumentData>
<Action>Entire Plot</Action>
<AppStamp>Vectorworks</AppStamp>
<VWVersion>2502</VWVersion>
<VWBuild>523565</VWBuild>
<AutoRot2D>false</AutoRot2D>
<UID_1505_1_1_0_0> <!-- This is the part I care about; there are 1000+ of these, and they all vary slightly after the "UID_" -->
<Action>Update</Action>
<TimeStamp>20200427192323</TimeStamp>
<AppStamp>Vectorworks</AppStamp>
<UID>1505.1.1.0.0</UID>
</UID_1505_1_1_0_0>
I am using dom4j as the XML parser, and I don't have any issues printing all of the data; I just want to zero in on the right XML path.
This is the code so far:
public class Unmarshal {
    public Unmarshal() {
        File file = new File("/Users/michaelaboah/Desktop/LIHN 1.11.18 v2020.xml");
        SAXReader reader = new SAXReader();
        try {
            Document doc = reader.read(file);
            Element ele = doc.getRootElement();
            Iterator<Element> it = ele.elementIterator();
            Iterator<Node> nodeIt = ele.nodeIterator();
            while (it.hasNext()) {
                Element test2 = (Element) it.next();
                List<Element> eleList = ele.elements();
                for (Element elementsIt : eleList) {
                    System.out.println(elementsIt.selectSingleNode("/SLData/InstrumentData").getStringValue());
                    // This prints everything under the InstrumentData branch
                    // All of that data is very large
                    System.out.println(elementsIt.selectSingleNode("/SLData/InstrumentData/UID_1505_1_1_0_0").getStringValue());
                    // This prints everything under the UID branch
                }
            }
        } catch (DocumentException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Also, I know there are some unused types and variables; there was a lot of testing.

I think your answer is:
elementsIt.selectSingleNode("/SLData/InstrumentData/*[starts-with(local-name(), 'UID_')]").getStringValue()
I used this post to find this XPath, and it works with the few XML lines you gave.
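Note that selectSingleNode returns only the first match. To visit every UID_ element, the same XPath can be evaluated as a list of nodes (in dom4j, via selectNodes). Below is a minimal sketch of that idea using the JDK's built-in javax.xml.xpath API instead of dom4j, so that it runs without extra dependencies. The class name UidXPathSketch, the method uidValues, and the second sample element are assumptions for illustration; the SLData root is taken from the XPath in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; uses the JDK XPath API rather than dom4j.
class UidXPathSketch {

    // Returns the text of the <UID> child of every element whose name starts with "UID_".
    static List<String> uidValues(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so that local-name() sees proper local names
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(
                    "/SLData/InstrumentData/*[starts-with(local-name(), 'UID_')]",
                    doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                Element uidElement = (Element) matches.item(i);
                // Assumes each UID_ element has a <UID> child, as in the sample.
                result.add(uidElement.getElementsByTagName("UID").item(0).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<SLData><InstrumentData>"
                + "<Action>Entire Plot</Action>"
                + "<UID_1505_1_1_0_0><Action>Update</Action><UID>1505.1.1.0.0</UID></UID_1505_1_1_0_0>"
                + "<UID_1506_1_1_0_0><Action>Update</Action><UID>1506.1.1.0.0</UID></UID_1506_1_1_0_0>"
                + "</InstrumentData></SLData>";
        for (String uid : uidValues(xml)) {
            System.out.println(uid);
        }
    }
}
```

The wildcard step with starts-with avoids hard-coding any single UID_ name, which is what makes the loop cover all of the similar elements.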

How to add a paragraph or text between Tables in .docx file with apache POI

I'm trying to add paragraphs or text to a .docx file using Apache POI. I read a .docx file, used as a template, from the WEB-INF/resources/templates folder inside my WAR file. Once it is read, I want to dynamically create more tables after the 9th table of the template. I am able to add more tables, but other types of content (paragraphs) end up in a different section of the document. Is there a required way to do this?
XWPFDocument doc = null;
try {
doc = new XWPFDocument(OPCPackage.open(request.getSession().getServletContext().getResourceAsStream("/resources/templates/twd.docx")));
} catch (Exception e) {
e.printStackTrace();
}
XWPFParagraph parrafo = null;
XWPFTable table=null;
org.apache.xmlbeans.XmlCursor cursor = null;
XWPFParagraph newParagraph = null;
XWPFRun run = null;
for(int j=0; j < 3; j++) { //create 3 more tables
table = doc.getTables().get(9);
cursor = table.getCTTbl().newCursor();
cursor.toEndToken();
if (cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START);
{
table = doc.insertNewTbl(cursor);
table.getRow(0).getCell(0).addParagraph().createRun()
.setText("Name");
table.createRow().getCell(0).addParagraph().createRun().setText("Version");
table.createRow().getCell(0).addParagraph().createRun().setText("Description");
table.createRow().getCell(0).addParagraph().createRun().setText("Comments");
table.createRow().getCell(0).addParagraph().createRun().addCarriageReturn();
table.getRow(0).createCell().addParagraph().createRun().setText("some text");
table.getRow(1).createCell().addParagraph().createRun().setText("some text");
table.getRow(2).createCell().addParagraph().createRun().setText("some text");
table.getRow(3).createCell().addParagraph().createRun().setText("some text");
table.getRows().get(0).getCell(0).setColor("183154");
table.getRows().get(1).getCell(0).setColor("183154");
table.getRows().get(2).getCell(0).setColor("183154");
table.getRows().get(3).getCell(0).setColor("183154");
table.getCTTbl().addNewTblGrid().addNewGridCol().setW(BigInteger.valueOf(4000));
table.getCTTbl().getTblGrid().addNewGridCol().setW(BigInteger.valueOf(4000));
}
//OTHER CONTENT BETWEEN CREATED TABLES (PARAGRAPHS, BREAK LINES,ETC)
doc.createParagraph().createRun().setText("text after table");
}

Once you use a cursor, you must keep using that cursor for inserting further content if you want that content to land in the part of the document where the cursor is. Do not assume that the document will automatically take note of the cursor you created.
So for example:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.*;

public class WordTextAfterTable {
    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument(new FileInputStream("WordTextAfterTable.docx"));
        XWPFTable table = document.getTables().get(9);
        org.apache.xmlbeans.XmlCursor cursor = table.getCTTbl().newCursor();
        cursor.toEndToken(); // now we are at the end of the CTTbl
        // There must always be a next start token after the table: either a p or at least sectPr.
        // Loop over the tokens until the next TokenType.START.
        while (cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START);
        // Now we are at the next TokenType.START and insert the new table.
        // Note: this is immediately after the table, so both tables touch each other.
        table = document.insertNewTbl(cursor);
        table.getRow(0).getCell(0).addParagraph().createRun().setText("Name");
        table.createRow().getCell(0).addParagraph().createRun().setText("Version");
        table.createRow().getCell(0).addParagraph().createRun().setText("Description");
        table.createRow().getCell(0).addParagraph().createRun().setText("Comments");
        table.createRow().getCell(0).addParagraph().createRun().addCarriageReturn();
        //...
        System.out.println(cursor.isEnd()); // cursor is now at the end of the new table
        // Again, move to the next start token after the new table (a p or at least sectPr).
        while (cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START);
        // Insert the paragraph through the same cursor so that it lands right after the new table.
        XWPFParagraph newParagraph = document.insertNewParagraph(cursor);
        XWPFRun run = newParagraph.createRun();
        run.setText("text after table");
        try (FileOutputStream out = new FileOutputStream("WordTextAfterTableNew.docx")) {
            document.write(out);
        }
        document.close();
    }
}

How to create bookmarks -table of contents- from headings with iText?

I create a PDF from a list of HTML strings, and it generates the PDF very well. But I want to add a table of contents to the PDF, built from the h1, h2, etc. headings. How can I do it? Below is my function to create the PDF. I looked at the existing examples on the iText site, but I couldn't make it work the way I want.
public static void createMultiplePagedPdf(String destinationFile, ArrayList<String> htmlStrings,
String cssLocation, HeaderFooter headerFooter, boolean tableOfContents) {
String css = null;
ElementList list = null;
Document document = new Document();
try {
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(destinationFile));
if(headerFooter!=null)
writer.setPageEvent(headerFooter);
TableOfContent tocEvent = new TableOfContent();
writer.setPageEvent(tocEvent);
document.open();
if(tableOfContents){
tocEvent.setRoot(writer.getRootOutline());
for (TOCEntry entry : tocEvent.getToc()) {
Chunk c = new Chunk(entry.title);
c.setAction(entry.action);
document.add(new Paragraph(c));
}
}
if (cssLocation != null)
css = readCSS(cssLocation);
for (String htmlfile : htmlStrings) {
if (css != null)
list = XMLWorkerHelper.parseToElementList(htmlfile, css);
else
list = XMLWorkerHelper.parseToElementList(htmlfile, null);
for (Element e : list) {
document.add(e);
}
}
System.out.println("Pdf Created successfully");
document.close();
} catch ...
}
tocEvent.getToc() returns an empty list. Moving that if block to the end of the code makes no difference.
My TOCEntry and TableOfContent classes are the same as written in the first example of Creating Table of Contents using events Using iText5.
Thanks in advance!

How to extract headline titles followed by respective text from Wikipedia

I am trying to use Jsoup in order to extract text from Wikipedia articles.
My idea is to simply extract every headline, and their respective text paragraphs.
I am having some trouble understanding how I can take only the specific text of each section, here's what I have:
public static void main(String[] args) {
    String url = "http://en.wikipedia.org/wiki/Albert_Einstein";
    Document doc;
    try {
        doc = Jsoup.connect(url).get();
        doc = Jsoup.parse(doc.toString());
        Elements titles = doc.select(".mw-headline");
        PrintStream out = new PrintStream(new FileOutputStream("output.txt"));
        System.setOut(out);
        for (Element h3 : doc.select(".mw-headline")) {
            String title = h3.text();
            String titleID = h3.id();
            Elements paragraphs = doc.select("p#" + titleID);
            //Element nextEle=h3.nextElementSibling();
            System.out.println(title);
            System.out.println("----------------------------------------");
            System.out.println(titleID);
            System.out.print("\n");
            System.out.println(paragraphs.text());
            System.out.print("\n");
        }
    } catch (IOException e) {
        System.out.println("something went wrong");
        e.printStackTrace();
    }
}
With this I can extract every headline, but I can't figure out how to get the text of each section and print it under its headline. I was thinking maybe with the headline's ID, but no dice.
Thank you for any help!

Depending on the tag structure of the page (if any), that could be complicated. A better alternative could be to iterate over all the elements, detecting headlines. Every time you detect a new headline (or reach the end of the elements), the previous section is complete: all elements collected since the previous headline belong to it (or to the introduction of the article if there is no previous headline).
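The approach described above can be sketched as a single pass that starts a new section each time a headline is seen. This is a minimal sketch, not Jsoup code: it assumes the page's elements have already been flattened into (tag, text) pairs in document order, for example by walking the article's content elements. The Item class, the groupByHeadline method, the empty-string key for the introduction, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class illustrating the grouping logic only.
class HeadlineGroupingSketch {

    // Stand-in for a parsed element: its tag name and its text.
    static class Item {
        final String tag;
        final String text;

        Item(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    // Treat h1..h6 as headlines.
    static boolean isHeadline(String tag) {
        return tag.matches("h[1-6]");
    }

    // Maps each headline to the text items that follow it, in document order.
    // Text before the first headline is stored under the empty-string key.
    // Headlines with identical text share one entry in this simple version.
    static Map<String, List<String>> groupByHeadline(List<Item> items) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = "";
        sections.put(current, new ArrayList<>());
        for (Item item : items) {
            if (isHeadline(item.tag)) {
                // A new headline closes the previous section and opens a new one.
                current = item.text;
                sections.putIfAbsent(current, new ArrayList<>());
            } else {
                sections.get(current).add(item.text);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
                new Item("p", "Intro paragraph"),
                new Item("h2", "First section"),
                new Item("p", "Paragraph one"),
                new Item("p", "Paragraph two"),
                new Item("h2", "Second section"),
                new Item("p", "Paragraph three"));
        for (Map.Entry<String, List<String>> section : groupByHeadline(items).entrySet()) {
            System.out.println(section.getKey());
            System.out.println("----------------------------------------");
            for (String text : section.getValue()) {
                System.out.println(text);
            }
            System.out.println();
        }
    }
}
```

A LinkedHashMap keeps the sections in the order they appear on the page, which matches the output format in the question.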

Read table from docx file using Apache POI

I am able to read tables from a .doc file (see the following code):
public String readDocFile(String filename, String str) {
try {
InputStream fis = new FileInputStream(filename);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
boolean intable = false;
boolean inrow = false;
for (int i = 0; i < range.numParagraphs(); i++) {
Paragraph par = range.getParagraph(i);
//System.out.println("paragraph "+(i+1));
//System.out.println("is in table: "+par.isInTable());
//System.out.println("is table row end: "+par.isTableRowEnd());
//System.out.println(par.text());
if (par.isInTable()) {
if (!intable) {//System.out.println("New table creating"+intable);
str += "<table border='1'>";
intable = true;
}
if (!inrow) {//System.out.println("New row creating"+inrow);
str += "<tr>";
inrow = true;
}
if (par.isTableRowEnd()) {
inrow = false;
} else {
//System.out.println("New text adding"+par.text());
str += "<td>" + par.text() + "</td>";
}
} else {
if (inrow) {//System.out.println("Closing Row");
str += "</tr>";
inrow = false;
}
if (intable) {//System.out.println("Closing Table");
str += "</table>";
intable = false;
}
str += par.text() + "<br/>";
}
}
} catch (Exception e) {
System.out.println("Exception: " + e);
}
return str;
}
Can anyone suggest how I can do the same with a .docx file?
I tried to do that, but could not find a replacement for the 'Range' class.
Please help.

By popular request, promoting a comment to an answer...
In the Apache POI code examples, you can find the XWPF SimpleTable example.
This shows how to create a simple table, and how to create one with lots of fancy styling.
Assuming you just want a simple table from scratch, in a brand new document, then the code you need goes along the lines of:
// Start with a new document
XWPFDocument doc = new XWPFDocument();
// Add a 3 column, 3 row table
XWPFTable table = doc.createTable(3, 3);
// Set some text in the middle
table.getRow(1).getCell(1).setText("EXAMPLE OF TABLE");
// table cells have a list of paragraphs; there is an initial
// paragraph created when the cell is created. If you create a
// paragraph in the document to put in the cell, it will also
// appear in the document following the table, which is probably
// not the desired result.
XWPFParagraph p1 = table.getRow(0).getCell(0).getParagraphs().get(0);
XWPFRun r1 = p1.createRun();
r1.setBold(true);
r1.setText("The quick brown fox");
r1.setItalic(true);
r1.setFontFamily("Courier");
r1.setUnderline(UnderlinePatterns.DOT_DOT_DASH);
r1.setTextPosition(100);
// And at the end
table.getRow(2).getCell(2).setText("only text");
// Save it out, to view in word
FileOutputStream out = new FileOutputStream("simpleTable.docx");
doc.write(out);
out.close();

The following snippet uses Apache POI 5.0.0, and it works well for reading docx table data:
public void readDocxTables(String docxFilePath) throws IOException {
    // try-with-resources closes the document (and its underlying package) when done
    try (XWPFDocument doc = new XWPFDocument(new FileInputStream(docxFilePath))) {
        for (XWPFTable table : doc.getTables()) {
            for (XWPFTableRow row : table.getRows()) {
                for (XWPFTableCell cell : row.getTableCells()) {
                    System.out.println("cell text: " + cell.getText());
                }
            }
        }
    }
}

This is not Apache POI, but I found it much easier with a third-party component.
An example of how to get tables from a docx file.
Of course, this is just an idea in case you do not find a solution with POI.
