I am using the Flying Saucer library and trying to add a custom font with a specific encoding so that I can support Unicode characters.
Here is the solution I am following, taken from the official Flying Saucer user's guide: http://flyingsaucerproject.github.io/flyingsaucer/r8/guide/users-guide-R8.html#xil_33
Below is the code:
public void convertHtmlToPdf(String html, String css, OutputStream out) {
try {
html = correctHtml(html);
html = getFormedHTMLWithCSS(html, css);
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(html);
CleanerProperties cleanerProperties = cleaner.getProperties();
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String cleanedHtml = xmlSerializer.getAsString(rootTagNode);
File fontFile = new File("/Verdana.ttf");
FontFactory.register(fontFile.getAbsolutePath());
ITextRenderer r = new ITextRenderer();
r.getFontResolver().addFont(fontFile.getAbsolutePath(), BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
r.setDocumentFromString(cleanedHtml);
r.layout();
r.createPDF(out);
r.finishPDF();
} catch (Exception e) {
e.printStackTrace();
logger.error(e.getMessage(), e);
}
}
But I am still unable to encode certain characters, such as:
'■' : '■',
'▲' : '▲',
For '■' I am getting &x25a0; in the generated PDF, and likewise for the other characters that I try to encode.
I am passing HTML code to a variable in Java. Using the Aspose library, the HTML should be rendered into a PPT (the HTML also references a CSS file).
It would be appreciated if the resulting PPT is editable.
Please use the following equivalent Java code on your end:
public static void main(String[] args) throws Exception {
// The path to the documents directory.
String dataDir ="C:\\html\\";
// Create Empty presentation instance
Presentation pres = new Presentation();
// Access the default first slide of presentation
ISlide slide = pres.getSlides().get_Item(0);
// Adding the AutoShape to accommodate the HTML content
IAutoShape ashape = slide.getShapes().addAutoShape(ShapeType.Rectangle, 10, 10, (float) pres.getSlideSize().getSize().getWidth(), (float) pres.getSlideSize().getSize().getHeight());
ashape.getFillFormat().setFillType(FillType.NoFill);
// Adding text frame to the shape
ashape.addTextFrame("");
// Clearing all paragraphs in added text frame
ashape.getTextFrame().getParagraphs().clear();
// Loading the HTML file content into a string
String content = ReadFile(dataDir + "file.html");
// Adding text from HTML stream reader in text frame
ashape.getTextFrame().getParagraphs().addFromHtml(content);
// Saving Presentation
pres.save(dataDir + "output.pptx", SaveFormat.Pptx);
}
public static String ReadFile(String FileName) throws Exception {
File file = new File(FileName);
StringBuilder contents = new StringBuilder();
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
String text = null;
// repeat until all lines are read
while ((text = reader.readLine()) != null) {
contents.append(text).append(System.getProperty("line.separator"));
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (reader != null) {
reader.close();
}
} catch (IOException e) {
e.printStackTrace();
return null;
}
}
return contents.toString();
}
@Balchandar Reddy,
I have looked at your comments. ImportingHTMLTextInParagraphs.class refers to the path of the file; I have updated the code accordingly.
Secondly, you need to add import com.aspose.slides.IAutoShape on your end to resolve the issue.
I have reviewed your requirements and regret to say that Aspose.Slides, which is an API for managing PowerPoint slides, does not support converting HTML to PPT/PPTX. However, it does support importing HTML text into slide text frames, which you may use instead.
// Create Empty presentation instance
using (Presentation pres = new Presentation())
{
// Access the default first slide of presentation
ISlide slide = pres.Slides[0];
// Adding the AutoShape to accommodate the HTML content
IAutoShape ashape = slide.Shapes.AddAutoShape(ShapeType.Rectangle, 10, 10, pres.SlideSize.Size.Width - 20, pres.SlideSize.Size.Height - 10);
ashape.FillFormat.FillType = FillType.NoFill;
// Adding text frame to the shape
ashape.AddTextFrame("");
// Clearing all paragraphs in added text frame
ashape.TextFrame.Paragraphs.Clear();
// Loading the HTML file using stream reader
TextReader tr = new StreamReader(dataDir + "file.html");
// Adding text from HTML stream reader in text frame
ashape.TextFrame.Paragraphs.AddFromHtml(tr.ReadToEnd());
// Saving Presentation
pres.Save("output_out.pptx", Aspose.Slides.Export.SaveFormat.Pptx);
}
I am working as Support developer/ Evangelist at Aspose.
I have an input stream of a PDF document available to me. I would like to add subject metadata to the document and then save it. I'm not sure how to do this.
I came across a sample recipe here: https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html
However, it is still fuzzy to me. Below is what I am trying, with comments where I have questions:
PDDocument doc = PDDocument.load(myInputStream);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
InputStream newXMPData = ...; //what goes here? How can I add subject tag?
PDMetadata newMetadata = new PDMetadata(doc, newXMPData, false );
catalog.setMetadata( newMetadata );
//does anything else need to happen to save the document??
//I would like an outputstream of the document (with metadata) so that I can save it to an S3 bucket
The following code sets the title of a PDF document, but it should be adaptable to work with other properties as well:
public static byte[] insertTitlePdf(byte[] documentBytes, String title) {
// try-with-resources closes the document even if saving fails
try (PDDocument document = PDDocument.load(documentBytes)) {
PDDocumentInformation info = document.getDocumentInformation();
info.setTitle(title);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
return baos.toByteArray();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
Apache PDFBox is required, so add it to your project, e.g. with Maven:
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.6</version>
</dependency>
Add a title with:
byte[] documentBytesWithTitle = insertTitlePdf(documentBytes, "Some fancy title");
Display it in the browser with (JSF example):
<object class="pdf" data="data:application/pdf;base64,#{myBean.getDocumentBytesWithTitleAsBase64()}" type="application/pdf">Document could not be loaded</object>
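The bean method referenced in the JSF markup is not shown in the answer. A minimal sketch of what it might look like, assuming the bean simply holds the bytes returned by insertTitlePdf; the class name MyBean and its constructor are hypothetical, and only java.util.Base64 from the standard library is used:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical backing bean; in a real JSF application this would be a
// public managed bean wired up by the framework.
final class MyBean {
    private final byte[] documentBytesWithTitle;

    MyBean(byte[] documentBytesWithTitle) {
        // Defensive copy so later changes to the caller's array do not leak in.
        this.documentBytesWithTitle = documentBytesWithTitle.clone();
    }

    // The basic encoder emits no line breaks, which is what a data: URI needs.
    public String getDocumentBytesWithTitleAsBase64() {
        return Base64.getEncoder().encodeToString(documentBytesWithTitle);
    }

    public static void main(String[] args) {
        // Stand-in for real PDF bytes: just the four-byte "%PDF" signature.
        byte[] sample = "%PDF".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new MyBean(sample).getDocumentBytesWithTitleAsBase64());
    }
}
```

Note that data: URIs can become very long for large documents, and some browsers limit their size, so this approach is probably best suited to small PDFs.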
Another much easier way to do this would be to use the built-in Document Information object:
PDDocument inputDoc = // your doc
inputDoc.getDocumentInformation().setCreator("Some meta");
inputDoc.getDocumentInformation().setCustomMetadataValue("fieldName", "fieldValue");
This also has the benefit of not requiring the xmpbox library.
This answer uses xmpbox and comes from the AddMetadataFromDocInfo example in the source code download:
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
dc.setDescription("descr");
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
doc.getDocumentCatalog().setMetadata(metadata);
The following code is generating special characters instead of spaces for one PDF but not another:
String fullText;
BodyContentHandler handler = null;
try {
// size limit is 100M
handler = new BodyContentHandler(100 * 1024 * 1024);
Metadata meta = new Metadata();
PDFParser parser = new PDFParser();
parser.setEnableAutoSpace(false);
parser.parse(new FileInputStream(this.pdf /*always a valid pdf file*/), handler, meta, new ParseContext());
}
catch (SAXException e) {
throw new IOException(e);
} catch (TikaException e) {
throw new IOException(e);
}
fullText = handler.toString();
Depending on the PDF a substring of fullText will look like:
will*continue*to*be*used*in*support*of*the
when it should look like this:
will continue to be used in support of the
In other places, '%' is substituted for '-' and '!' is substituted for spaces in bolded text.
This issue occurs only when processing one PDF but not the other. According to pdfinfo, both PDFs were generated by Quartz PDFContext.
The Linux command pdftotext produces the same results.
Is this a problem with how the original PDF is generated? Why is this happening?
I am using PDFParser to convert a PDF to text. Below is my Java code that converts the PDF to a text file.
My PDF file contains the following data:
Data Sheet(Header)
PHP Courses for PHP Professionals(Header)
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
All copyrights reserved.(Footer).
Sample code:
public class PDF_TEST {
PDFParser parser;
String parsedText;
PDFTextStripper pdfStripper;
PDDocument pdDoc;
COSDocument cosDoc;
PDDocumentInformation pdDocInfo;
// PDF_TEST constructor
public PDF_TEST() {
}
// Extract text from PDF Document
String pdftoText(String fileName) {
File f = new File(fileName);
if (!f.isFile()) {
return null;
}
try {
parser = new PDFParser(new FileInputStream(f));
} catch (Exception e) {
return null;
}
try {
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
parsedText = pdfStripper.getText(pdDoc);
} catch (Exception e) {
e.printStackTrace();
try {
if (cosDoc != null) cosDoc.close();
if (pdDoc != null) pdDoc.close();
} catch (Exception e1) {
e1.printStackTrace();
}
return null;
}
return parsedText;
}
// Write the parsed text from PDF to a file
void writeTexttoFile(String pdfText, String fileName) {
try {
PrintWriter pw = new PrintWriter(fileName);
pw.print(pdfText);
pw.close();
} catch (Exception e) {
e.printStackTrace();
}
}
//Extracts text from a PDF Document and writes it to a text file
public static void test() {
String args[]={"C://Sample.pdf","C://Sample.txt"};
if (args.length != 2) {
System.exit(1);
}
// The class declared above is PDF_TEST, so instantiate that type
PDF_TEST pdfTextParserObj = new PDF_TEST();
String pdfToText = pdfTextParserObj.pdftoText(args[0]);
if (pdfToText != null) {
pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
}
}
public static void main(String args[]) throws IOException
{
test();
}
}
The above code works for extracting the PDF to text. But my requirement is to ignore the header and footer and extract only the content from the PDF file.
Required output:
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
Please suggest how to do this.
Thanks.
In general there is nothing special about header or footer texts in PDFs. It is possible to tag that material differently, but tagging is optional and the OP did not provide a sample PDF to check.
Thus, some manual work (or somewhat failure-prone image analysis) is generally necessary to find the regions on the pages for header, content, and footer material.
As soon as you have the coordinates for these regions, though, you can use PDFTextStripperByArea, which extends PDFTextStripper, to collect text by region. Simply define a region for the page content using the largest rectangle that includes the content but excludes headers and footers, and after pdfStripper.getText(pdDoc) call getTextForRegion for the defined region.
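Computing that content rectangle can be sketched as follows, assuming the header and footer heights have already been measured by hand. The class name, method name, and sample measurements are illustrative assumptions. The sketch uses only java.awt.geom and assumes that PDFTextStripperByArea regions are expressed with the origin at the top-left of the page and y growing downwards, which is worth verifying against your PDFBox version.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: derives the content region from page size and
// manually measured header/footer heights (all values in PDF points).
final class ContentRegion {

    static Rectangle2D contentRegion(double pageWidth, double pageHeight,
                                     double headerHeight, double footerHeight) {
        double contentHeight = pageHeight - headerHeight - footerHeight;
        if (contentHeight <= 0) {
            throw new IllegalArgumentException("header and footer leave no room for content");
        }
        // Assumed top-left origin: the region starts just below the header
        // and spans the full page width.
        return new Rectangle2D.Double(0, headerHeight, pageWidth, contentHeight);
    }

    public static void main(String[] args) {
        // Assumed example: US Letter page (612 x 792 pt) with 72 pt header and footer.
        Rectangle2D region = contentRegion(612, 792, 72, 72);
        System.out.println(region);
    }
}
```

The resulting rectangle is the kind of value that would be passed to addRegion on the stripper.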
You can use PDFTextStripperByArea to exclude the header and footer of a PDF file.
Java code using PDFBox:
public String fetchTextByRegion(String path, String filename, int pageNumber) throws IOException {
File file = new File(path + filename);
PDDocument document = PDDocument.load(file);
//Rectangle2D region = new Rectangle2D.Double(x,y,width,height);
Rectangle2D region = new Rectangle2D.Double(0, 100, 550, 700);
String regionName = "region";
PDFTextStripperByArea stripper;
// Note: PDDocument.getPage takes a zero-based page index
PDPage page = document.getPage(pageNumber + 1);
stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
String text = stripper.getTextForRegion(regionName);
return text;
}
(I'm looking for an open source library)
I believe iText has RTF capability as well as PDF.
http://itextpdf.com/
http://www.java-tips.org/other-api-tips/itext/creating-pdf-rtf-or-document-from-a-java-class-at-ru-2.html
You can convert HTML to RTF using the basic Java APIs RTFEditorKit and HTMLEditorKit.
However, this approach does not convert line-break tags such as <br/> and <p> to the equivalent new-line sequence in RTF. I have applied a workaround for that, as shown in the following Java code.
private static String convertToRTF(String htmlStr) {
OutputStream os = new ByteArrayOutputStream();
HTMLEditorKit htmlEditorKit = new HTMLEditorKit();
RTFEditorKit rtfEditorKit = new RTFEditorKit();
String rtfStr = null;
htmlStr = htmlStr.replaceAll("<br.*?>","#NEW_LINE#");
htmlStr = htmlStr.replaceAll("</p>","#NEW_LINE#");
htmlStr = htmlStr.replaceAll("<p.*?>","");
InputStream is = new ByteArrayInputStream(htmlStr.getBytes());
try {
Document doc = htmlEditorKit.createDefaultDocument();
htmlEditorKit.read(is, doc, 0);
rtfEditorKit.write(os, doc, 0, doc.getLength());
rtfStr = os.toString();
rtfStr = rtfStr.replaceAll("#NEW_LINE#","\\\\par ");
} catch (IOException e) {
e.printStackTrace();
} catch (BadLocationException e) {
e.printStackTrace();
}
return rtfStr;
}
Here, I am replacing the HTML tags that are equivalent to a new line with a special marker string, and then replacing that marker in the output with the RTF new-line control sequence \par.
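The four backslashes in the replacement string can look surprising. Here is a minimal sketch of why they are needed, using only the standard String API (the class and method names are illustrative, not part of the original answer). String.replaceAll treats a backslash in the replacement as an escape character, so a literal backslash has to be doubled for the regex engine and doubled again in Java source. String.replace does no regex processing, so the usual Java escaping is enough.

```java
import java.util.regex.Matcher;

// Illustrative helper (hypothetical name): three equivalent ways to turn
// the #NEW_LINE# marker into the RTF control word \par followed by a space.
final class RtfNewLineDemo {
    static final String MARKER = "#NEW_LINE#";

    // replaceAll parses the replacement for escapes: the Java literal
    // "\\\\par " is the string \\par, which the regex engine emits as \par.
    static String withReplaceAll(String s) {
        return s.replaceAll(MARKER, "\\\\par ");
    }

    // quoteReplacement adds the regex-level escaping for us.
    static String withQuoteReplacement(String s) {
        return s.replaceAll(MARKER, Matcher.quoteReplacement("\\par "));
    }

    // replace(CharSequence, CharSequence) is literal on both sides.
    static String withReplace(String s) {
        return s.replace(MARKER, "\\par ");
    }

    public static void main(String[] args) {
        String input = "first#NEW_LINE#second";
        System.out.println(withReplaceAll(input));
        System.out.println(withQuoteReplacement(input));
        System.out.println(withReplace(input));
    }
}
```

All three variants should produce the same string; the literal replace form is arguably the easiest to read when no pattern matching is needed.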
If you want to use more capable APIs and you have valid HTML, you should explore Apache FOP.
Apache FOP can be used to convert to RTF. The following links may be useful:
http://www.torsten-horn.de/techdocs/java-xsl.htm#XSL-FO-Java
http://html2fo.sourceforge.net/index.html