I am writing a utility to collapse the bookmarks in an existing PDF and save the file as a new PDF. The platform is Java using the iText API.
The following code retrieves the existing bookmarks and recursively calls a method to close them.
/* Retrieve the bookmarks: */
List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(originalPdf);
/* Now we recursively close all the bookmarks.*/
if (bookmarks != null ) {
for (HashMap temp : bookmarks) {
closeBookmark(originalPdf, temp, "false");
}
}
private static void closeBookmark(PdfReader originalPdf, HashMap localBookmark, String bookmarkState) {
localBookmark.put("Open", bookmarkState);
/* If the current bookmark has kids (lower-level bookmarks), then recursively close them as well. */
if (localBookmark.containsKey("Kids")) {
ArrayList<HashMap> kidMap = (ArrayList) localBookmark.get("Kids");
for (HashMap temp : kidMap) {
closeBookmark(originalPdf, temp, bookmarkState);
}
}
}
This works on PDFs created by Word, OpenOffice, and XSL-FO, but not FrameMaker or InDesign. In the latter cases, I get the collapsed bookmarks, but clicking on the bookmarks does not scroll the PDF to the destination. It seems as if the destinations are not present in either the bookmarks or the PDF body. Any suggestions?
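One way to check the suspicion that the destinations are missing is to walk the bookmark list before modifying it and report every entry that carries no destination information. The following is only a minimal sketch on plain maps, without any iText types. The key names "Page", "Named" and "NamedN" are an assumption about how SimpleBookmark lays out its maps, and BookmarkAudit is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BookmarkAudit {

    /** Collects the titles of all entries that carry none of the assumed destination keys. */
    @SuppressWarnings("unchecked")
    static void collectWithoutDestination(List<Map<String, Object>> bookmarks, List<String> out) {
        if (bookmarks == null) {
            return;
        }
        for (Map<String, Object> bookmark : bookmarks) {
            // Assumed key names; adjust them if the real maps use different ones.
            boolean hasDestination = bookmark.containsKey("Page")
                    || bookmark.containsKey("Named")
                    || bookmark.containsKey("NamedN");
            if (!hasDestination) {
                out.add(String.valueOf(bookmark.get("Title")));
            }
            // Descend into lower-level bookmarks, if any.
            collectWithoutDestination((List<Map<String, Object>>) bookmark.get("Kids"), out);
        }
    }

    public static void main(String[] args) {
        // Hand-built sample data standing in for the list returned by the PDF library.
        Map<String, Object> child = new HashMap<>();
        child.put("Title", "1.1 Details");
        Map<String, Object> parent = new HashMap<>();
        parent.put("Title", "1 Overview");
        parent.put("Page", "1 Fit");
        parent.put("Kids", new ArrayList<>(Arrays.asList(child)));

        List<String> missing = new ArrayList<>();
        collectWithoutDestination(new ArrayList<>(Arrays.asList(parent)), missing);
        System.out.println("Bookmarks without destination: " + missing);
    }
}
```

If the FrameMaker and InDesign files show up in this report while the Word files do not, the destination data is already absent when the bookmarks are read, not lost during collapsing.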
I am trying to populate a table within a docx file with data from Java objects. More precisely, each row represents an object, and my template starts with one row. I want to find out how I can add a new row when I have more than one object in my list. See the example below:
Docx table looks like this:
I successfully mapped the fields, but for ONLY one object. How can I add another row (from Java) to make room for another object? For this implementation I am using org.apache.poi.xwpf.usermodel.XWPFDocument;
public class DocMagic {
public static XWPFDocument replaceTextFor(XWPFDocument doc, String findText, String replaceText) {
replaceTextFor(doc.getParagraphs(),findText,replaceText);
doc.getTables().forEach(p -> {
p.getRows().forEach(row -> {
row.getTableCells().forEach(cell -> {
replaceTextFor(cell.getParagraphs(), findText, replaceText);
});
});
});
return doc;
}
private static void replaceTextFor(List<XWPFParagraph> paragraphs, String findText, String replaceText) {
paragraphs.forEach(p -> {
p.getRuns().forEach(run -> {
String text = run.text();
if (text.contains(findText)) {
run.setText(text.replace(findText, replaceText), 0);
}
});
});
}
public static void saveWord(String filePath, XWPFDocument doc) throws IOException {
// try-with-resources closes the stream even if write() fails and avoids calling close() on null
try (FileOutputStream out = new FileOutputStream(filePath)) {
doc.write(out);
}
}
}
EDIT: using addNewTableCell().setText() places the values on the right side of the table
Normally you use the steps below to add a row to a table:
XWPFTableRow row =tbl.createRow();
row.addNewTableCell().setText("whatever you want");
tbl.addRow(row, y);
But in your case it seems you want to add rows on the fly while iterating over the docx table together with your Java list of objects.
In Java it is not safe to modify a collection while you are looping over it,
so you might need to do it in two steps:
first, count the objects in your Java list and add the required number of rows to the docx table;
then, once the rows exist, iterate over them and populate them.
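The two steps can be sketched without any POI types. In the following minimal example a list of string lists stands in for the docx table, and TwoStepFill, ensureRowCount and populate are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in for a table: each row is a list of cell texts. No POI types are used here. */
final class TwoStepFill {

    /** Step 1: append empty rows until there is one row per object. */
    static void ensureRowCount(List<List<String>> table, int wanted, int columns) {
        while (table.size() < wanted) {
            List<String> row = new ArrayList<>();
            for (int c = 0; c < columns; c++) {
                row.add("");
            }
            table.add(row);
        }
    }

    /** Step 2: the row count no longer changes, so index-based iteration is safe. */
    static void populate(List<List<String>> table, List<String[]> objects) {
        for (int r = 0; r < objects.size(); r++) {
            String[] values = objects.get(r);
            for (int c = 0; c < values.length; c++) {
                table.get(r).set(c, values[c]);
            }
        }
    }

    public static void main(String[] args) {
        List<List<String>> table = new ArrayList<>();
        // The single template row that exists in the document.
        table.add(new ArrayList<>(Arrays.asList("name placeholder", "age placeholder")));

        List<String[]> objects = Arrays.asList(
                new String[] {"Ann", "31"},
                new String[] {"Bob", "27"},
                new String[] {"Cid", "45"});

        ensureRowCount(table, objects.size(), 2);
        populate(table, objects);
        System.out.println(table);
    }
}
```

With POI, step 1 would presumably map to calling tbl.createRow() once per missing row. As far as I can tell, createRow() already creates the cells for the table's current columns, so writing into the existing cells via row.getCell(i) instead of row.addNewTableCell() may avoid the extra cells on the right side mentioned in the EDIT.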
I have a word document which is used as a template. Inside this template I have some tables that contain predefined bullet points. Now I'm trying to replace the placeholder string with a set of strings.
I'm totally stuck on this. My simplified methods look like this.
replaceKeyValue.put("[DescriptionOfItem]", new HashSet<>(Collections.singletonList("This is the description")));
replaceKeyValue.put("[AllowedEntities]", new HashSet<>(Arrays.asList("a", "b")));
replaceKeyValue.put("[OptionalEntities]", new HashSet<>(Arrays.asList("c", "d")));
replaceKeyValue.put("[NotAllowedEntities]", new HashSet<>(Arrays.asList("e", "f")));
try (XWPFDocument template = new XWPFDocument(OPCPackage.open(file))) {
template.getTables().forEach(
xwpfTable -> xwpfTable.getRows().forEach(
xwpfTableRow -> xwpfTableRow.getTableCells().forEach(
xwpfTableCell -> replaceInCell(replaceKeyValue, xwpfTableCell)
)
));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
template.write(baos);
return new ByteArrayResource(baos.toByteArray());
} finally {
if (file.exists()) {
file.delete();
}
}
private void replaceInCell(Map<String, Set<String>> replacementsKeyValuePairs, XWPFTableCell xwpfTableCell) {
for (XWPFParagraph xwpfParagraph : xwpfTableCell.getParagraphs()) {
for (Map.Entry<String, Set<String>> replPair : replacementsKeyValuePairs.entrySet()) {
String keyToFind = replPair.getKey();
Set<String> replacementStrings = replacementsKeyValuePairs.get(keyToFind);
if (xwpfParagraph.getText().contains(keyToFind)) {
replacementStrings.forEach(replacementString -> {
XWPFParagraph paragraph = xwpfTableCell.addParagraph();
XWPFRun run = paragraph.createRun();
run.setText(replacementString);
});
}
}
}
}
I was expecting that some more bullet points will be added to the current cell. Am I missing something? The paragraph is the one containing the placeholder string and format.
Thanks for any help!
UPDATE: This is what part of the template looks like. I would like to automatically search for the terms and replace them. Searching works so far. But trying to replace the bullet points ends in a NullPointerException that I cannot locate.
Would it be easier to use fields? I need to keep the bullet point style though.
UPDATE 2: added download link and updated the code. Seems I can't alter the paragraphs if I'm iterating through them. I get a null-pointer.
Download link: WordTemplate
Since Microsoft Word is very, very "strange" in how it divides text into different runs in its storage, such questions cannot be answered without a complete example including all code and the Word documents in question. Generally usable code for adding content to Word documents does not seem to be possible unless all adding or replacement happens only in fields (form fields, content controls or mail merge fields).
So I downloaded your WordTemplate.docx which looks like so:
Then I ran the following code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.apache.xmlbeans.XmlCursor;
import java.util.*;
import java.math.BigInteger;
public class WordReadAndRewrite {
static void addItems(XWPFTableCell cell, XWPFParagraph paragraph, Set<String> items) {
XmlCursor cursor = null;
XWPFRun run = null;
CTR cTR = null; // for a deep copy of the run's low level object
BigInteger numID = paragraph.getNumID();
int indentationLeft = paragraph.getIndentationLeft();
int indentationHanging = paragraph.getIndentationHanging();
boolean first = true;
for (String item : items) {
if (first) {
for (int r = paragraph.getRuns().size()-1; r > 0; r--) {
paragraph.removeRun(r);
}
run = (paragraph.getRuns().size() > 0)?paragraph.getRuns().get(0):null;
if (run == null) run = paragraph.createRun();
run.setText(item, 0);
cTR = (CTR)run.getCTR().copy(); // take a deep copy of the run's low level object
first = false;
} else {
cursor = paragraph.getCTP().newCursor();
boolean thereWasParagraphAfter = cursor.toNextSibling(); // move cursor to next paragraph
// because the new paragraph shall be **after** that paragraph
// thereWasParagraphAfter is true if there is a next paragraph, else false
if (thereWasParagraphAfter) {
paragraph = cell.insertNewParagraph(cursor); // insert new paragraph if there are next paragraphs in cell
} else {
paragraph = cell.addParagraph(); // add new paragraph if there are no other paragraphs present in cell
}
paragraph.setNumID(numID); // set template paragraph's numbering Id
paragraph.setIndentationLeft(indentationLeft); // set template paragraph's indenting from left
if (indentationHanging != -1) paragraph.setIndentationHanging(indentationHanging); // set template paragraph's hanging indenting
run = paragraph.createRun();
if (cTR != null) run.getCTR().set(cTR); // set template paragraph's run formatting
run.setText(item, 0);
}
}
}
public static void main(String[] args) throws Exception {
Map<String, Set<String>> replaceKeyValue = new HashMap<String, Set<String>>();
replaceKeyValue.put("[AllowedEntities]", new HashSet<>(Arrays.asList("allowed 1", "allowed 2", "allowed 3")));
replaceKeyValue.put("[OptionalEntities]", new HashSet<>(Arrays.asList("optional 1", "optional 2", "optional 3")));
replaceKeyValue.put("[NotAllowedEntities]", new HashSet<>(Arrays.asList("not allowed 1", "not allowed 2", "not allowed 3")));
XWPFDocument document = new XWPFDocument(new FileInputStream("WordTemplate.docx"));
List<XWPFTable> tables = document.getTables();
for (XWPFTable table : tables) {
List<XWPFTableRow> rows = table.getRows();
for (XWPFTableRow row : rows) {
List<XWPFTableCell> cells = row.getTableCells();
for (XWPFTableCell cell : cells) {
int countParagraphs = cell.getParagraphs().size();
for (int p = 0; p < countParagraphs; p++) { // do not for each since new paragraphs were added
XWPFParagraph paragraph = cell.getParagraphArray(p);
String placeholder = paragraph.getText();
placeholder = placeholder.trim(); // this is the tricky part to get really the correct placeholder
Set<String> items = replaceKeyValue.get(placeholder);
if (items != null) {
addItems(cell, paragraph, items);
}
}
}
}
}
FileOutputStream out = new FileOutputStream("Result.docx");
document.write(out);
out.close();
document.close();
}
}
The Result.docx looks like so:
The code loops through the table cells in the Word document and looks for a paragraph which contains exactly the placeholder. This may even be the tricky part, since Word might split that placeholder into different text runs. If one is found, it runs a method addItems which takes the found paragraph as a template for numbering and indentation (this might be incomplete though). Then it sets the first new item in the first text run of the found paragraph and removes all other text runs which may be there. Then it determines whether new paragraphs must be inserted or added to the cell. An XmlCursor is used for this. The other items are placed in the newly inserted or added paragraphs, and the numbering and indentation settings are taken from the placeholder's paragraph.
As said, this is code for showing the principle. It would have to be extended considerably to be generally usable. In my opinion, using text placeholders in Word documents for text replacement is not a good approach. Placeholders for variable text in Word documents should be fields: form fields, content controls or mail merge fields. The advantage of fields over text placeholders is that Word knows the fields are entities for variable text. It will not split them into multiple text runs for various strange reasons, as it often does with normal text.
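The effect of split runs can be illustrated without Word or POI. In this minimal sketch a list of strings stands in for the text runs of one paragraph. The particular split of the placeholder is hypothetical, since the real run boundaries depend on the document, and SplitRunDemo is a name chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class SplitRunDemo {

    /** True if any single run contains the whole placeholder. */
    static boolean foundInSingleRun(List<String> runs, String placeholder) {
        for (String run : runs) {
            if (run.contains(placeholder)) {
                return true;
            }
        }
        return false;
    }

    /** True if the concatenated paragraph text contains the placeholder. */
    static boolean foundInParagraph(List<String> runs, String placeholder) {
        return String.join("", runs).contains(placeholder);
    }

    public static void main(String[] args) {
        // Hypothetical split of one placeholder over three runs.
        List<String> runs = Arrays.asList("[Allowed", "Entities", "]");
        String placeholder = "[AllowedEntities]";
        System.out.println("per-run search finds it:   " + foundInSingleRun(runs, placeholder));
        System.out.println("paragraph search finds it: " + foundInParagraph(runs, placeholder));
    }
}
```

This is why the code above compares the text of the whole paragraph with the placeholder rather than searching run by run.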
So I've been browsing around the source code / documentation for POI (specifically XWPF) and I can't seem to find anything that relates to editing a hyperlink in a .docx. I only see functionality to get the information for the currently set hyperlink. My goal is to change the hyperlink in a .docx to link to "http://yahoo.com" from "http://google.com" as an example. Any help would be greatly appreciated. Thanks!
I found a way to edit the URL of the link in an "indirect way" (copy the previous hyperlink, modify the URL, delete the previous hyperlink and add the new one to the paragraph).
Code is shown below:
private void editLinksOfParagraph(XWPFParagraph paragraph, XWPFDocument document) {
for (int rIndex = 0; rIndex < paragraph.getRuns().size(); rIndex++) {
XWPFRun run = paragraph.getRuns().get(rIndex);
if (run instanceof XWPFHyperlinkRun) {
// get the url of the link to edit it
XWPFHyperlink link = ((XWPFHyperlinkRun) run).getHyperlink(document);
String linkURL = link.getURL();
//get the xml representation of the hyperlink that includes all the information
XmlObject xmlObject = run.getCTR().copy();
linkURL += "-edited-link"; //edited url of the link, f.e add a '-edited-link' suffix
//remove the previous link from the paragraph
paragraph.removeRun(rIndex);
//add the new hyperlinked with updated url in the paragraph, in place of the previous deleted
XWPFHyperlinkRun hyperlinkRun = paragraph.insertNewHyperlinkRun(rIndex, linkURL);
hyperlinkRun.getCTR().set(xmlObject);
}
}
}
This requirement needs knowledge about how hyperlinks referring to an external reference are stored in Microsoft Word documents and how this is represented in XWPF of Apache POI.
The XWPFHyperlinkRun is the representation of a linked text run in an IRunBody. This text run, or even multiple text runs, is wrapped in an XML object of type CTHyperlink. This contains a relation ID which points to a relation in the package relations part. This package relation contains the URI which is the hyperlink's target.
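The indirection described here can be modelled with plain collections. This is only a sketch of the storage idea, not of POI's classes: RelationModel, its fields and the relation id "rId5" are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RelationModel {
    /** Relation id -> target URI; stands in for the package relations part. */
    final Map<String, String> relations = new HashMap<>();
    /** Each hyperlink run stores only a relation id, never the URI itself. */
    final List<String> runRelationIds = new ArrayList<>();

    /** Resolves the target of a run through its relation id. */
    String targetOfRun(int runIndex) {
        return relations.get(runRelationIds.get(runIndex));
    }

    /** Changes the target stored under an existing relation id. */
    void retarget(String relationId, String newUri) {
        if (!relations.containsKey(relationId)) {
            throw new IllegalArgumentException("unknown relation id: " + relationId);
        }
        relations.put(relationId, newUri);
    }

    public static void main(String[] args) {
        RelationModel doc = new RelationModel();
        doc.relations.put("rId5", "http://google.com");
        // Two runs wrapped by the same hyperlink share one relation id.
        doc.runRelationIds.add("rId5");
        doc.runRelationIds.add("rId5");

        doc.retarget("rId5", "http://yahoo.com");
        System.out.println(doc.targetOfRun(0));
        System.out.println(doc.targetOfRun(1));
    }
}
```

Because the runs hold only the id, changing the relation's target changes the link for every run that refers to it, and the runs themselves, including their formatting, stay untouched.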
Currently (Apache POI 5.2.2) XWPFHyperlinkRun provides access to an XWPFHyperlink, but this is very rudimentary. It only has getters for the Id and the URI. It provides neither access to its XWPFHyperlinkRun and its IRunBody nor a setter for the target URI in the package relations part. It does not even have internal access to its package relations part.
So, using only Apache POI classes, the only possibility currently is to delete the old XWPFHyperlinkRun and create a new one pointing to the new URI. But as the text runs also contain the text formatting, deleting them also deletes the text formatting. It would have to be copied from the old XWPFHyperlinkRun to the new one before deleting the old one. That's inconvenient.
So the rudimentary XWPFHyperlink should be extended to provide a setter for the target URI in the package relations part. A new class XWPFHyperlinkExtended could look like so:
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.openxml4j.opc.PackageRelationship;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
/**
* Extended XWPF hyperlink class
* Provides access to its Id, URI, XWPFHyperlinkRun, IRunBody.
* Provides setting target URI in PackageRelationship.
*/
public class XWPFHyperlinkExtended {
private String id;
private String uri;
private XWPFHyperlinkRun hyperlinkRun;
private IRunBody runBody;
private PackageRelationship rel;
public XWPFHyperlinkExtended(XWPFHyperlinkRun hyperlinkRun, PackageRelationship rel) {
this.id = rel.getId();
this.uri = rel.getTargetURI().toString();
this.hyperlinkRun = hyperlinkRun;
this.runBody = hyperlinkRun.getParent();
this.rel = rel;
}
public String getId() {
return this.id;
}
public String getURI() {
return this.uri;
}
public IRunBody getIRunBody() {
return this.runBody;
}
public XWPFHyperlinkRun getHyperlinkRun() {
return this.hyperlinkRun;
}
/**
* Provides setting target URI in PackageRelationship.
* The old PackageRelationship gets removed.
* A new PackageRelationship gets added using the same Id.
*/
public void setTargetURI(String uri) {
this.runBody.getPart().getPackagePart().removeRelationship(this.getId());
this.uri = uri;
PackageRelationship rel = this.runBody.getPart().getPackagePart().addExternalRelationship(uri, XWPFRelation.HYPERLINK.getRelation(), this.getId());
this.rel = rel;
}
}
It does not extend XWPFHyperlink as this is so rudimentary it's not worth it. Furthermore after setTargetURI the String uri needs to be updated. But there is no setter in XWPFHyperlink and the field is only accessible from inside the package.
The new XWPFHyperlinkExtended can be obtained from XWPFHyperlinkRun like so:
/**
* If this HyperlinkRun refers to an external reference hyperlink,
* return the XWPFHyperlinkExtended object for it.
* May return null if no PackageRelationship found.
*/
/*modifiers*/ XWPFHyperlinkExtended getHyperlink(XWPFHyperlinkRun hyperlinkRun) {
try {
for (org.apache.poi.openxml4j.opc.PackageRelationship rel : hyperlinkRun.getParent().getPart().getPackagePart().getRelationshipsByType(XWPFRelation.HYPERLINK.getRelation())) {
if (rel.getId().equals(hyperlinkRun.getHyperlinkId())) {
return new XWPFHyperlinkExtended(hyperlinkRun, rel);
}
}
} catch (org.apache.poi.openxml4j.exceptions.InvalidFormatException ifex) {
// do nothing, simply do not return something
}
return null;
}
Once we have that XWPFHyperlinkExtended, we can set a new target URI using its method setTargetURI.
A further problem results from the fact that the XML object of type CTHyperlink can wrap around multiple text runs, not only one. Then multiple XWPFHyperlinkRun objects are in one CTHyperlink and point to one target URI. For example this could look like:
... [this is a link to example.com]->https://example.com ...
This results in 6 XWPFHyperlinkRuns in one CTHyperlink linking to https://example.com.
This leads to problems when link text needs to be changed when the link target changes. The text of all the 6 text runs is the link text. So which text run shall be changed?
The best I have found is a method which sets the text of the first text run in the CTHyperlink.
/**
* Sets the text of the first text run in the CTHyperlink of this XWPFHyperlinkRun.
* Tries solving the problem when a CTHyperlink contains multiple text runs.
* Then the String value is set in first text run only. All other text runs are set empty.
*/
/*modifiers*/ void setTextInFirstRun(XWPFHyperlinkRun hyperlinkRun, String value) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHyperlink ctHyperlink = hyperlinkRun.getCTHyperlink();
for (int r = 0; r < ctHyperlink.getRList().size(); r++) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR ctR = ctHyperlink.getRList().get(r);
for (int t = 0; t < ctR.getTList().size(); t++) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText ctText = ctR.getTList().get(t);
if (r == 0 && t == 0) {
ctText.setStringValue(value);
} else {
ctText.setStringValue("");
}
}
}
}
There the String value is set in first text run only. All other text runs are set empty. The text formatting of the first text run remains.
This works, but it needs some more steps to get the text formatting right:
try (var fis = new FileInputStream(fileName);
var doc = new XWPFDocument(fis)) {
var pList = doc.getParagraphs();
for (var p : pList) {
var runs = p.getRuns();
for (int i = 0; i < runs.size(); i++) {
var r = runs.get(i);
if (r instanceof XWPFHyperlinkRun) {
var run = (XWPFHyperlinkRun) r;
var link = run.getHyperlink(doc);
// To get text: link for checking
System.out.println(run.getText(0) + ": " + link.getURL());
// how i replace it
var run1 = p.insertNewHyperlinkRun(i, "http://google.com");
run1.setText(run.getText(0));
// remove the old link
p.removeRun(i + 1);
}
}
}
try (var fos = new FileOutputStream(outFileName)) {
doc.write(fos);
}
}
I'm using these libraries:
implementation 'org.apache.poi:poi:5.2.2'
implementation 'org.apache.poi:poi-ooxml:5.2.2'
I am trying to run the Java code written by Stefano Chizzolini (awesome guy, creator of PDFClown) to parse a PDF using the PDF Clown library. I am getting this error and I don't know what I can do to fix it.
Exception in thread "main" org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
at org.pdfclown.documents.contents.fonts.OpenFontParser.getName(OpenFontParser.java:570)
at org.pdfclown.documents.contents.fonts.OpenFontParser.load(OpenFontParser.java:221)
at org.pdfclown.documents.contents.fonts.OpenFontParser.<init>(OpenFontParser.java:205)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:626)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
at PDFReader.FullExtract.run(FullExtract.java:71)
at PDFReader.FullExtract.main(FullExtract.java:142)
I know the class OpenFontParser in the library package is throwing this error. Is there anything I can do to fix this?
This code works for most PDFs. I have a PDF that it does not parse. I am guessing it is because of this symbol below in the PDF.
public class PDFReader extends Sample {
@Override
public void run()
{
String filePath = new String("C:\\Users\\XYZ\\Desktop\\SomeSamplePDF.pdf");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
// 3. Extracting text from the document pages...
for(Page page : document.getPages())
{
extract(new ContentScanner(page)); // Wraps the page contents into a scanner.
}
close(file);
}
private void close(File file) {
// TODO Auto-generated method stub
}
/**
Scans a content level looking for text.
*/
/*
NOTE: Page contents are represented by a sequence of content objects,
possibly nested into multiple levels.
*/
private void extract(
ContentScanner level
)
{
if(level == null)
return;
while(level.moveNext())
{
ContentObject content = level.getCurrent();
if(content instanceof ShowText)
{
Font font = level.getState().getFont();
// Extract the current text chunk, decoding it!
System.out.println(font.decode(((ShowText)content).getText()));
}
else if(content instanceof Text
|| content instanceof ContainerObject)
{
// Scan the inner level!
extract(level.getChildLevel());
}
}
}
private boolean prompt(Page page)
{
int pageIndex = page.getIndex();
if(pageIndex > 0)
{
Map<String,String> options = new HashMap<String,String>();
options.put("", "Scan next page");
options.put("Q", "End scanning");
if(!promptChoice(options).equals(""))
return false;
}
System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
return true;
}
public static void main(String args[])
{
new PDFReader().run();
}
}
The issue
As the stacktrace indicates, the problem is that some TrueType font embedded in the PDF does not contain a name table even though it is a required table:
org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
...
at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)
Thus, strictly speaking, that embedded font is invalid and consequently the embedding PDF is, too. And PDFClown runs into an exception due to this validity issue.
Some background
A TrueType font file consists of a sequence of concatenated tables. ...
The first of the tables is the font directory, a special table that facilitates access to the other tables in the font. The directory is followed by a sequence of tables containing the font data. These tables can appear in any order. Certain tables are required for all fonts. Others are optional depending upon the functionality expected of a particular font.
Tables that are required must appear in any valid TrueType font file. The required tables and their tag names are shown in Table 2.
Table 2: The required tables
Tag Table
'cmap' character to glyph mapping
'glyf' glyph data
'head' font header
'hhea' horizontal header
'hmtx' horizontal metrics
'loca' index to location
'maxp' maximum profile
'name' naming
'post' PostScript
(Section TrueType Font files: an overview in chapter 6 The TrueType Font File in the TrueType Reference Manual)
On the other hand, though, there are a number of PDF generators cutting down embedded TrueType fonts to the bare essentials required by PDF viewers (foremost Adobe Reader), and the name table does not seem to be strictly required.
Furthermore, the name table is only used for one purpose in PDFClown: to determine the name of the font in question, even though the font name could be determined from the BaseFont entry of the associated font dictionary, too. Actually the latter entry is required by the PDF specification while the PostScript name of the font entry in the name table is optional according to the TTF manual.
Thus, using the BaseFont entry in the PDF font dictionary would be a better alternative to this name table access.
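To see which of the required tables listed above a given font actually carries, one can read the font directory at the start of the font data. The following is a minimal sketch assuming the usual layout of a 12-byte header followed by one 16-byte entry per table (tag, checksum, offset, length). TtfTableCheck and fakeDirectory are hypothetical helpers, and extracting the embedded font bytes from the PDF is outside the scope of the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TtfTableCheck {
    /** The required tables as quoted from the TrueType Reference Manual above. */
    static final List<String> REQUIRED = Arrays.asList(
            "cmap", "glyf", "head", "hhea", "hmtx", "loca", "maxp", "name", "post");

    /** Reads the table tags from the font directory at the start of the font data. */
    static Set<String> readTags(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font);   // ByteBuffer is big-endian by default
        buf.getInt();                             // scaler type / version
        int numTables = buf.getShort() & 0xFFFF;  // unsigned 16-bit table count
        buf.position(12);                         // skip searchRange, entrySelector, rangeShift
        Set<String> tags = new LinkedHashSet<>();
        for (int i = 0; i < numTables; i++) {
            byte[] tag = new byte[4];
            buf.get(tag);
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12);    // skip checksum, offset, length
        }
        return tags;
    }

    /** Returns the required tables that the directory does not list. */
    static List<String> missingRequired(byte[] font) {
        Set<String> present = readTags(font);
        List<String> missing = new ArrayList<>();
        for (String tag : REQUIRED) {
            if (!present.contains(tag)) {
                missing.add(tag);
            }
        }
        return missing;
    }

    /** Builds a directory-only byte array for demonstration; this is not a usable font. */
    static byte[] fakeDirectory(String... tags) {
        ByteBuffer buf = ByteBuffer.allocate(12 + 16 * tags.length);
        buf.putInt(0x00010000);
        buf.putShort((short) tags.length);
        buf.putShort((short) 0).putShort((short) 0).putShort((short) 0);
        for (String tag : tags) {
            buf.put(tag.getBytes(StandardCharsets.US_ASCII));
            buf.putInt(0).putInt(0).putInt(0);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // A stripped-down directory without 'name' and 'post', similar to the case discussed here.
        byte[] stripped = fakeDirectory("cmap", "glyf", "head", "hhea", "hmtx", "loca", "maxp");
        System.out.println("Missing required tables: " + missingRequired(stripped));
    }
}
```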
Fixing it
Is there anything I can do to fix this?
You can either fix the not entirely valid PDF by adding a name table to the embedded TTF in question, or you can patch PDFClown to ignore the missing table: in the class org.pdfclown.documents.contents.fonts.OpenFontParser edit the method getName:
private String getName(
int id
) throws EOFException, UnsupportedEncodingException
{
// Naming Table ('name' table).
Integer tableOffset = tableOffsets.get("name");
if(tableOffset == null)
throw new ParseException("'name' table does NOT exist.");
Replace that throw new ParseException("'name' table does NOT exist.") with return null.
PS
While the problem could be analyzed using merely the information given by the OP, the sample file provided by @akarshad in his now deleted answer gave more motivation to start the analysis at all.
I am trying to scrape the Top Stories section in google news for all the titles. In order to only get the titles in the Top Stories section, I must narrow into this tag:
<div class="section top-stories-section" id=":2r">..</div>
This is the code I use (in Eclipse):
public static void main(String[] args) throws IOException {
// fetches & parses HTML
String url = "http://news.google.com";
Document document = Jsoup.connect(url).get();
// Extract data
Element topStories = document.getElementById(":2r");
Elements titles = topStories.select("span.titletext");
// Output data
for (Element title : titles) {
System.out.println("Title: " + title.text());
}
}
I always seem to be getting a NullPointerException. It doesn't work either, when I try to reach the Top Stories like this:
Element topStories = document.select("#:2r").first();
Am I missing something? Shouldn't this be working? I am relatively new to this, please help and thank you!
Judging from the error message (and from actually looking at the page), that div tag doesn't contain an id attribute. Instead you could select based on the CSS class:
Element topStories = document.select("div.section.top-stories-section").first();