How to save a jsoup document as text file - java

I am trying to save all of the readable words on a web page into one text document while ignoring HTML markup.
I am using jsoup to parse all of the words on the page, and my only guess for how to separate the real words from the code is through elements.
Is it possible to convert multiple elements of the jsoup document into a text file?
i.e.:
Elements titles = doc.select("title");
Elements paragraphs = doc.select("p");
Elements links = doc.select("a[href]");
Elements smallText = doc.select("a");
Currently I am loading the parsed page as a Document with:
Document doc = Jsoup.connect("https:// (enter a url)").get();

Here is a simple way:
Document doc = Jsoup.connect("https:// (enter a url)").get();
BufferedWriter writer = null;
try
{
    writer = new BufferedWriter(new FileWriter("d://test.txt"));
    writer.write(doc.toString());
}
catch (IOException e)
{
    e.printStackTrace(); // don't swallow the exception silently
}

Adding an answer because I am unable to comment above.
Replace writer.write(doc.toString()); with writer.write(doc.select("html").text()); in the above code.
It will give you the text on the page.
Instead of "html" in doc.select("html").text(), other tags can be used to extract only the text enclosed in those tags.
Edit: you can also use writer.write(doc.body().text());

After writing the text with writer.write(doc.text());, you need to call writer.close(); on the very next line; this will fix the problem.
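Pulling the answers above together, here is a minimal sketch (the URL and output file name are placeholders); a Java 7+ try-with-resources block closes the writer automatically, so the explicit close() cannot be forgotten:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SavePageText {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://example.com/").get(); // placeholder URL
        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("test.txt"))) {
            writer.write(doc.body().text()); // visible text only, no markup
        }
    }
}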

Related

How to get the content of a page of a .docx file using Apache POI?

I'm trying to read .docx files with styling information using Apache POI, which I have done by looping through each XWPFParagraph and working with all the XWPFRun objects inside the paragraphs. Now I want to get the contents of each page. So is there a way to get the contents of each page, or is it possible to know which page a paragraph is currently on?
This is a function that takes the absolute path of a .docx file and returns an array of strings:
FileInputStream fis = new FileInputStream(absolutePath);
XWPFDocument document = new XWPFDocument(fis);
List<IBodyElement> bodyElements = document.getBodyElements();
List<String> textList = new ArrayList<>();
/* I want to add some kind of outer loop here for each page,
   and at the end of that loop I want to add a "<hr/>" tag to textList */
for (IBodyElement bodyElement : bodyElements) { // looping through body elements
    if (bodyElement.getElementType() == BodyElementType.PARAGRAPH) {
        XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
        String textToAdd = parseParagraph(paragraph); // custom function to handle paragraphs
        textList.add(textToAdd);
    }
}
document.close();
return textList.toArray(new String[0]);
As you can see my goal here is to add a <hr/> tag after each page. So, if somehow I can get the page number of a paragraph or loop through pages, I will be able to do that.
Please also mention if you know of any other approach that may help.
To get the page count from an XWPFDocument (for your outer loop), you can do something like this:
XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(YOUR_FILE_PATH));
int numOfPages = docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
For your paragraph text:
for (XWPFParagraph p : document.getParagraphs()) {
    System.out.println(p.getParagraphText()); // your paragraph text
}
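A caveat worth knowing: the Pages value above comes from the document's extended properties, which are written by whichever application last saved the file, and a .docx stores no pagination at all, so neither approach can tell you which page an arbitrary paragraph falls on. Only explicit page breaks survive in the file. As a sketch (assuming the document uses manual breaks, and reusing the bodyElements/textList/parseParagraph names from the question), you could insert the marker whenever a paragraph declares a page break before it:

// Sketch: emit the "<hr/>" marker only at explicit page breaks.
// Automatic page breaks are computed by the rendering application
// and are not stored in the .docx, so they cannot be recovered here.
for (IBodyElement bodyElement : bodyElements) {
    if (bodyElement.getElementType() == BodyElementType.PARAGRAPH) {
        XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
        if (paragraph.isPageBreak()) { // w:pageBreakBefore is set on this paragraph
            textList.add("<hr/>");
        }
        textList.add(parseParagraph(paragraph));
    }
}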

Replace PDF page using PDFBox

I have two PDF files (named A1.pdf and B1.pdf). Now I want to replace some pages of the second PDF file (B1.pdf) with pages from the first one (A1.pdf) programmatically. In this case I am using the PDFBox library.
Here is my sample code:
try {
    File file = new File("/Users/test/Desktop/A1.pdf");
    PDDocument pdDoc = PDDocument.load(file);
    PDDocument document = PDDocument.load(new File("/Users/test/Desktop/B1.pdf"));
    document.removePage(3);
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0));
    document.save("/Users/test/Desktop/" + "generatedPDFBox" + ".pdf");
    document.close();
} catch (Exception e) {
}
The idea is to replace the 3rd page. In this implementation the page is appended after the last page of the output PDF. Can anyone help me implement this? If not with PDFBox, could you please suggest some other libraries in Java?
This solution creates a third PDF file with the contents you asked for. Note that pages are zero-based, so the "3" in your question must be a "2".
PDDocument a1doc = PDDocument.load(file1);
PDDocument b1doc = PDDocument.load(file2);
PDDocument resDoc = new PDDocument();
List<PDPage> a1Pages = a1doc.getDocumentCatalog().getAllPages();
List<PDPage> b1Pages = b1doc.getDocumentCatalog().getAllPages();
// replace the 3rd page of the 2nd file with the 1st page of the 1st one
for (int p = 0; p < b1Pages.size(); ++p)
{
    if (p == 2)
        resDoc.addPage(a1Pages.get(0));
    else
        resDoc.addPage(b1Pages.get(p));
}
resDoc.save(file3);
a1doc.close();
b1doc.close();
resDoc.close();
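Note that getDocumentCatalog().getAllPages() is the old PDFBox 1.x API. If you can use PDFBox 2.x (an assumption about your setup), the page tree of B1.pdf can be edited in place instead of copying everything into a third document; a minimal sketch:

// Sketch for PDFBox 2.x: replace the 3rd page of B1.pdf in place.
PDDocument a1doc = PDDocument.load(new File("/Users/test/Desktop/A1.pdf"));
PDDocument b1doc = PDDocument.load(new File("/Users/test/Desktop/B1.pdf"));
PDPage replacement = a1doc.getPage(0); // 1st page of A1.pdf (zero-based)
PDPage oldPage = b1doc.getPage(2);     // 3rd page of B1.pdf (zero-based)
b1doc.getPages().insertBefore(replacement, oldPage);
b1doc.removePage(oldPage);
b1doc.save("/Users/test/Desktop/generatedPDFBox.pdf");
a1doc.close(); // close the source only after saving b1doc
b1doc.close();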
If you want to work from the command line instead, look here:
https://pdfbox.apache.org/commandline/
Then use PDFSplit and PDFMerge.
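For example, with the standalone app jar (the tool names are current, but the exact flags are from memory of the 2.x tools, so verify them against the page above; this assumes B1.pdf has four pages):

java -jar pdfbox-app-2.0.x.jar PDFSplit -split 1 B1.pdf
java -jar pdfbox-app-2.0.x.jar PDFMerger B1-1.pdf B1-2.pdf A1-1.pdf B1-4.pdf merged.pdf

This splits both files into single pages, then merges them back with the replacement page in the 3rd slot.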
I am not too familiar with how PDFBox works, but to answer your follow-up: I know you can accomplish what you want in a fairly simple manner with the Datalogics APDFL SDK. A free trial exists in case you want to look into it. Here is a code snippet so you can see how it would be done:
Document Doc1 = new Document("/Users/test/Desktop/A1.pdf");
Document Doc2 = new Document("/Users/test/Desktop/B1.pdf");
/* Delete the pages in the page range 3-3 */
Doc2.deletePages(3, 3);
/* LastPage is where in Doc2 you want to insert the page, Doc1 is the document
 * the page is coming from, 0 is the page number in Doc1 that will be inserted
 * first, 1 is the number of pages to insert (starting from the page number in
 * the previous parameter), and PageInsertFlags lets you customize what does
 * and does not get copied. */
Doc2.insertPages(Document.LastPage, Doc1, 0, 1, PageInsertFlags.All);
Doc2.save(EnumSet.of(SaveFlags.FULL), "out.pdf");
Alternatively, there is another method called replacePages which makes the deletion unnecessary. It all depends on what your end goal is, of course.

How to extract headline titles followed by their respective text from Wikipedia

I am trying to use Jsoup in order to extract text from Wikipedia articles.
My idea is to simply extract every headline, and their respective text paragraphs.
I am having some trouble understanding how I can take only the specific text of each section; here's what I have:
public static void main(String[] args) {
    String url = "http://en.wikipedia.org/wiki/Albert_Einstein";
    Document doc;
    try {
        doc = Jsoup.connect(url).get();
        doc = Jsoup.parse(doc.toString());
        Elements titles = doc.select(".mw-headline");
        PrintStream out = new PrintStream(new FileOutputStream("output.txt"));
        System.setOut(out);
        for (Element h3 : doc.select(".mw-headline")) {
            String title = h3.text();
            String titleID = h3.id();
            Elements paragraphs = doc.select("p#" + titleID);
            //Element nextEle = h3.nextElementSibling();
            System.out.println(title);
            System.out.println("----------------------------------------");
            System.out.println(titleID);
            System.out.print("\n");
            System.out.println(paragraphs.text());
            System.out.print("\n");
        }
    } catch (IOException e) {
        System.out.println("something went wrong");
        e.printStackTrace();
    }
}
With this I can extract every headline, but I can't work out how to get the text of each section so I can print it accordingly. I was thinking maybe the headline's ID would help, but no dice.
Thank you for any help!
Depending on the tag structure of the page (if any), that could be complicated. A better alternative is to iterate over all the elements in order, detecting headlines as you go. Every time you hit a new headline (or run out of elements), the section that just ended belongs to the previous headline (or to the "header" of the article if there was no previous headline).
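A minimal sketch of that idea, assuming the article body sits in div.mw-parser-output (current Wikipedia markup; the selector may differ for older page versions):

Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Albert_Einstein").get();
Element content = doc.selectFirst("div.mw-parser-output");
String currentTitle = "(intro)";
StringBuilder section = new StringBuilder();
for (Element child : content.children()) {
    Element headline = child.selectFirst(".mw-headline");
    if (headline != null) {
        // a new headline: everything gathered so far belongs to the previous one
        System.out.println(currentTitle + "\n" + section + "\n");
        currentTitle = headline.text();
        section.setLength(0);
    } else if (child.tagName().equals("p")) {
        section.append(child.text()).append('\n');
    }
}
System.out.println(currentTitle + "\n" + section); // flush the last section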

Convert HTML to PDF and add it to a paragraph

I want to add a paragraph, containing HTML, to a document.
As far as I know, iText only supports adding HTML to a document directly via XMLWorkerHelper.
Furthermore I want to change the font of the HTML, but this can be done with a CSS file.
My approach is similar to this code:
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
worker.parseXHtml(pdfWriter, document, fis);
But that solution writes to the document directly. I want to add the HTML to a Paragraph, so I can apply some additional formatting to that section.
String html = "<p>Html code here</p>";
Paragraph comb = new Paragraph();
StringBuilder sb = new StringBuilder();
sb.append(html);
ElementList list = XMLWorkerHelper.parseToElementList(sb.toString(), null);
for (Element element : list) {
    comb.add(element);
}
Paragraph para = new Paragraph(comb);
PdfPCell cell = new PdfPCell(para);
cell.setHorizontalAlignment(Element.ALIGN_LEFT);
cell.setBorder(Rectangle.NO_BORDER);
cell.setPaddingTop(0);
cell.setPaddingBottom(15f);
cell.setLeading(3f, 1.2f);
table.addCell(cell);
Look at the "parsing HTML step by step" example. In that example, the final pipeline is a PdfWriterPipeline, which isn't what you want (because that pipeline writes straight to the document). You want to replace this final pipeline with an ElementHandlerPipeline, which converts all the HTML tags encountered into an ElementList.
Once you have this list of Element instances, it's up to you to decide what to do with it (adding them to a Paragraph is one option).
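A sketch of such a pipeline, pieced together from the iText 5 XML Worker classes (they live under com.itextpdf.tool.xml; treat the exact setup as an assumption and check it against your iText version):

// Build a pipeline whose final stage collects Elements instead of writing them.
ElementList elements = new ElementList();
CSSResolver cssResolver = XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
Pipeline<?> pipeline = new CssResolverPipeline(cssResolver,
        new HtmlPipeline(htmlContext,
                new ElementHandlerPipeline(elements, null))); // no PdfWriterPipeline
XMLWorker worker = new XMLWorker(pipeline, true);
new XMLParser(worker).parse(new StringReader("<p>Html code here</p>"));
// 'elements' now holds the parsed iText Elements, ready to add to a Paragraph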

Extracting URLs with elem.absUrl

I have a program; all I need it to do is extract URLs from a text file and save them into another text file. The code calls ExtractHTML2.getURL2(url, input);, which simply extracts the HTML code for a given link (it works correctly and there is no need to include its code here).
EDIT: The code parses a number of pages; for each page, it saves the page's HTML code in a text file, then parses that text file to extract 10 links.
Now, the following code is supposed to parse the extracted HTML code and extract the URLs. This does not work for me. It does not extract anything.
CODE EDITED:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.*;

public class ExtractLinks2 {
    public static void getLinks2(String url, int pages) throws IOException {
        Document doc;
        Element link;
        String elementLink = null;
        int linkId = 1; // the id of the href tag inside the HTML code
        // The file that contains the extracted HTML code for the web page.
        File input = new File("extracted.txt");
        // To write the extracted links
        FileWriter fstream = new FileWriter("links.txt");
        BufferedWriter out = new BufferedWriter(fstream);
        // Loop to traverse the pages
        for (int z = 1; z <= pages; z++) {
            /* get the HTML code for that page and save
               it in input (extracted.txt) */
            ExtractHTML2.getURL2(url, input);
            // Using the parse function from the jsoup library
            doc = Jsoup.parse(input, "UTF-8");
            // Loop 10 times to extract 10 links per page
            for (int e = 1; e <= 10; e++) {
                link = doc.getElementById("link-" + linkId); // the href tag id
                System.out.println("This is link no." + linkId);
                elementLink = link.absUrl("href");
                // write the extracted link to the text file
                out.write(elementLink);
                out.write(","); // add a comma
                linkId++;
            }
            linkId = 1; // reset the linkId
        }
        out.close();
    }
}
As I said, my program does not extract the URLs. I have doubts about my syntax for Jsoup.parse. Referring to http://jsoup.org/cookbook/input/load-document-from-file, there is an optional third argument that I ignored, as I think it is not needed in my case; I need to extract from a text file, not an HTML page.
My program is able to extract the href tag text if I type eURL = elem.text();, but I don't need the text, I need the URL itself. E.g., if I have the following:
<a id="link-1" class="yschttl spt" href="/r/_ylt=A7x9QXi_UOlPrmgAYKpLBQx.;
_ylu=X3oDMTBzcG12Mm9lBHNlYwNzcgRwb3MDMTEEY29sbwNpcmQEdnRpZAM-/SIG=1329l4otf/
EXP=1340719423/http%3a//www.which.co.uk/technology/computing/guides/how-to-buy
-the-best-laptop/" data-bk="5040.1">How to <b>buy</b> the best <b>laptop</b>
- <b>Laptop</b> <wbr />reviews - Computing ...</a>
I only need "www.which.co.uk" or even better "which.co.uk" if there is a way to do that.
Why does the above program not extract the URLs, and how can I correct the problem?
The problem was in this line:
link = doc.getElementById("link-"+linkId);
It should be:
link = doc.getElementById("link-" + Integer.toString(linkId));
Since linkId is an integer and getElementById takes a String as its parameter, I had to build the id as a String first, so that the input to getElementById takes the form link-1, link-2, etc.
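Two things worth adding here. First, absUrl("href") can only produce an absolute URL if the document has a base URI; when parsing from a file, pass the page's original URL as the optional third argument (Jsoup.parse(input, "UTF-8", url)), otherwise relative hrefs resolve to an empty string. Second, to reduce an absolute URL to just its host, java.net.URI can be used; a minimal sketch (for a redirect-wrapped link like the Yahoo example above you would first have to decode the embedded target URL):

// Sketch: extract "which.co.uk" from an absolute URL.
String absolute = "http://www.which.co.uk/technology/computing/guides/how-to-buy-the-best-laptop/";
try {
    String host = new java.net.URI(absolute).getHost(); // "www.which.co.uk"
    if (host != null && host.startsWith("www.")) {
        host = host.substring(4); // "which.co.uk"
    }
    System.out.println(host);
} catch (java.net.URISyntaxException ex) {
    ex.printStackTrace();
}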
