I have a Word document (a .docx file). As you can see in the document, there are a number of questions with bullet points. Right now I am trying to extract each paragraph from the file using Apache POI. Here is my current code:
public static String readDocxFile(String fileName) {
try {
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
String whole = "";
for (XWPFParagraph para : paragraphs) {
System.out.println(para.getText());
whole += "\n" + para.getText();
}
fis.close();
document.close();
return whole;
} catch (Exception e) {
e.printStackTrace();
return "";
}
}
The problem with the above method is that it prints each line instead of each paragraph. The bullet points are also gone from the extracted string; whole is returned as plain text.
Can anyone explain what I am doing wrong? Please also suggest a better approach if you have one.
The above code is correct; I ran it on my system and it printed each and every paragraph. I think the problem is with how the content was written in the docx file: whenever you write content in a bullet point and press 'Enter', that breaks the current bullet point, and the above code turns each broken line into a separate paragraph.
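If that diagnosis is right, here is a minimal sketch of a workaround, assuming a POI version where XWPFParagraph exposes getNumFmt() (3.17, as a later answer notes): treat a paragraph with no bullet numbering as the continuation of the previous bullet and merge it back. Note this naively folds every non-bulleted paragraph into the preceding bullet, so adapt it to your document's structure.
// Sketch: merge paragraphs broken by the Enter key back into the preceding bullet.
public static List<String> extractBullets(XWPFDocument document) {
    List<String> bullets = new ArrayList<>();
    for (XWPFParagraph para : document.getParagraphs()) {
        String text = para.getText();
        if (text.trim().isEmpty()) continue;
        if ("bullet".equals(para.getNumFmt()) || bullets.isEmpty()) {
            bullets.add(text); // a genuine bullet point (or the very first paragraph)
        } else {
            // No numbering: assume a continuation line split off by Enter.
            int last = bullets.size() - 1;
            bullets.set(last, bullets.get(last) + " " + text);
        }
    }
    return bullets;
}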
I am writing a code sample below that may be useful for you; take a look. I am using a Set data structure to ignore duplicate questions from the docx.
The Apache POI dependency is below:
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.7</version>
</dependency>
Code sample:
package com;
import java.io.File;
import java.io.FileInputStream;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.springframework.util.ObjectUtils;
public class App {
public static void main(String...strings) throws Exception{
Set<String> bulletPoints = fileExtractor();
bulletPoints.forEach(point -> {
System.out.println(point);
});
}
public static Set<String> fileExtractor() throws Exception{
FileInputStream fis = null;
try {
Set<String> bulletPoints = new HashSet<>();
File file = new File("/home/deskuser/Documents/query.docx");
fis = new FileInputStream(file.getAbsolutePath());
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
paragraphs.forEach(para -> {
System.out.println(para.getText());
if(!ObjectUtils.isEmpty(para.getText())){
bulletPoints.add(para.getText());
}
});
document.close(); // fis is closed in the finally block below
return bulletPoints;
} catch (Exception e) {
e.printStackTrace();
throw new Exception("error while extracting file.", e);
}finally{
if(!ObjectUtils.isEmpty(fis)){
fis.close();
}
}
}
}
I couldn't find which version of Apache POI you are using. If it's the latest version (3.17), the XWPFParagraph object used in your code has a getNumFmt() method. According to the Apache POI documentation (https://poi.apache.org/apidocs/org/apache/poi/xwpf/usermodel/XWPFParagraph.html), this method returns the string "bullet" if the paragraph starts with a bullet. So, regarding the second point of your question (what happens to the bullets), you can resolve it with something like the following:
public class TestPoi {
private static final String BULLET = "•";
private static final String NEWLINE = "\n";
public static void main(String...args) {
String test = readDocxFile("/home/william/Downloads/anesthesia.docx");
System.out.println(test);
}
public static String readDocxFile(String fileName) {
try {
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
StringBuilder whole = new StringBuilder();
for (XWPFParagraph para : paragraphs) {
if ("bullet".equals(para.getNumFmt())) {
whole.append(BULLET);
}
whole.append(para.getText());
whole.append(NEWLINE);
}
fis.close();
document.close();
return whole.toString();
} catch (Exception e) {
e.printStackTrace();
return "";
}
}
}
Regarding your first point, what is the expected output? I ran your code with the provided docx and, apart from the missing bullets you mentioned, it looked okay when stepping through with the debugger.
Related
I am trying to write code for a word guessing game. It works well when I use BufferedReader and InputStream combined, but when I try it using Scanner, it cannot find the file, even though in both cases the file is in the same place: a folder called res under the src folder in my project (I am coding in Eclipse).
import java.util.ArrayList;
import java.util.Scanner;
import java.io.File;
public class WordGen {
private final String filename = "/res/words.txt";
File file = new File(filename);
Scanner input = null;
private ArrayList<String> list = new ArrayList<>();
public WordGen() {
try {
input = new Scanner(file);
while (input.hasNextLine()) {
String w = input.nextLine();
list.add(w);
}
} catch (Exception ex) {
System.out.println("File not found.");
}
}
public String getword() {
if (list.isEmpty()) {
return "NOTHING";
}
return list.get((int) (Math.random() * list.size()));
}
}
public class test {
public static void main(String[] args) {
WordGen wordgen = new WordGen();
System.out.println(wordgen.getword());
}
}
I tried searching for this problem but couldn't find it here. I am guessing it's a very small error which I cannot figure out. Thanks and regards.
EDIT: Here's the other code that worked(Everything else same as before):
public WordGenerator()
{
try(InputStream input = getClass().getResourceAsStream(fileName);
BufferedReader bfreader = new BufferedReader(new InputStreamReader(input)))
{
String line = "";
while ((line = bfreader.readLine()) != null)
words.add(line);
}
catch (Exception e)
{
System.out.println("Couldn't find file");
}
}
Scanner is trying to load a file - and you're providing an absolute filename, /res/words.txt.
In order to create an InputStream, you're loading a resource, giving it an absolute resource name, even though you've called the variable fileName:
getClass().getResourceAsStream(fileName)
That works because it can load a resource called /res/words.txt from the classpath, but it's not loading a file with a filename of /res/words.txt.
You could use a filename of res/words.txt, if you run the code from the src directory... or you could just stick to using getResourceAsStream, which is probably a better idea as it doesn't rely on your working directory, and will continue to work even if your code and resources are packaged up into a jar file.
If you really want to use Scanner, you could always use new Scanner(input) - there's a Scanner constructor accepting an InputStream.
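If you go the Scanner route, here is a minimal sketch combining the two approaches (assuming the word list is on the classpath at /res/words.txt, as in your working version):
// Load the word list as a classpath resource and wrap the stream in a Scanner.
try (Scanner input = new Scanner(getClass().getResourceAsStream("/res/words.txt"))) {
    while (input.hasNextLine()) {
        list.add(input.nextLine());
    }
}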
I copied the code below from a website. It reads .doc/.docx files in Java using the Apache POI package.
WordExtractor we = new WordExtractor(doc); gives the following error:
reference to WordExtractor is ambiguous. Both constructor WordExtractor(POIFSFileSystem) in WordExtractor and WordExtractor(HWPFDocument) in WordExtractor match.
Sorry for any silly mistakes; I am doing this .doc reading for the first time.
Thank you all! :)
Code:
package testdeployment;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
/**
*
* @author Aishwarya
*/
public class MsFileReader {
public static void readDocFile(String fileName) {
try {
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument doc = new HWPFDocument(fis);
WordExtractor we = new WordExtractor(doc);
String[] paragraphs = we.getParagraphText();
System.out.println("Total no of paragraph "+paragraphs.length);
for (String para : paragraphs) {
System.out.println(para.toString());
}
fis.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void readDocxFile(String fileName) {
try {
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
System.out.println("Total no of paragraph "+paragraphs.size());
for (XWPFParagraph para : paragraphs) {
System.out.println(para.getText());
}
fis.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
readDocxFile("C:\\Test.docx");
readDocFile("C:\\Test.doc");
}
}
Hi, this code is working; you can write this one:
public static void readDocxFile(String fileName) {
try {
File file = new File(fileName);
POIFSFileSystem fs = null;
fs = new POIFSFileSystem(new FileInputStream(file.getAbsolutePath()));
HWPFDocument doc = new HWPFDocument(fs);
readParagraphs(doc);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void readParagraphs(HWPFDocument doc) throws Exception{
WordExtractor we = new WordExtractor(doc);
/**Get the total number of paragraphs**/
String[] paragraphs = we.getParagraphText();
System.out.println("Total Paragraphs: "+paragraphs.length);
for (int i = 0; i < paragraphs.length; i++) {
System.out.println("Length of paragraph "+(i +1)+": "+ paragraphs[i].length());
System.out.println(paragraphs[i].toString());
}
}
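If you would rather keep the original one-argument construction, an explicit cast is the usual way to resolve an ambiguous-constructor error; a hedged sketch (the root cause in your project may differ, e.g. mismatched POI jars on the classpath):
WordExtractor we = new WordExtractor((HWPFDocument) doc); // select the HWPFDocument overload explicitly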
Anyone with a similar problem can use the following jar files for the code above.
poi-3.11-20141221.jar
poi-ooxml-3.11-20141221.jar
poi-ooxml-schemas-3.11-20141221.jar
poi-scratchpad-3.11-20141221.jar
xmlbeans-2.6.0.jar
from: http://poi.apache.org/download.html
dom4j-1.6.jar from: http://www.java2s.com/Code/Jar/d/Downloaddom4j16jar.htm
package healthbuddy;
/**
*
* @author tpzap_000
*/
import java.io.*;
import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.io.xml.StaxDriver;
import com.thoughtworks.xstream.persistence.FilePersistenceStrategy;
import com.thoughtworks.xstream.persistence.PersistenceStrategy;
import com.thoughtworks.xstream.persistence.XmlArrayList;
import java.util.List;
import java.util.Scanner;
public class PersistentDataModelCntl implements Serializable{
private File theFile = new File("PDM.txt");
private XStream xstream = new XStream(new StaxDriver());
public static PersistentDataModelCntl thePDMCntl;
private PersistentDataModel thePDM;
public PersistentDataModelCntl(){
this.readPDMFile();
}
public static PersistentDataModelCntl getPDMCntl(){
if(thePDMCntl == null){
thePDMCntl = new PersistentDataModelCntl();
}
return thePDMCntl;
}
public void readPDMFile(){
try
{
System.out.println("in read file");
StringBuilder fileContents = new StringBuilder();
Scanner in = new Scanner(theFile);
String tempXML;
boolean test = in.hasNextLine();
System.out.println(test);
while(in.hasNextLine()){
fileContents.append(in.nextLine());
System.out.println("reading file contents");
}
tempXML = fileContents.toString();
thePDM = (PersistentDataModel)xstream.fromXML(tempXML);
}
//If the file does not exist, thePDM is instantiated to be a new, empty, PDM file. The file is then written to disk, and then read from disk
// using some recursive stuff. Also creates a test UserList so that I don't get a NullPointerException in the LoginCntl.
catch(FileNotFoundException ex){
System.out.println("FileNotFound");
thePDM = new PersistentDataModel();
thePDM.thePDMFoodList = new FoodList();
thePDM.thePDMMealList = new MealList();
thePDM.thePDMDietList = new DietList();
thePDM.thePDMDiet = new Diet();
//Creates new attributes if things are null.
this.writePDMFile();
this.readPDMFile();
System.out.println("FileNotFound Exception");
}
catch(IOException ex){
System.out.println("IO Exception");
ex.printStackTrace();
}
}
//Problem Code is here:
public void writePDMFile(){
try{
String xml = xstream.toXML(thePDM);
PrintWriter writer = new PrintWriter(theFile);
System.out.println(xml);
writer.println(xml);
}
catch(Exception ex){
System.out.println("There was a problem writing the file.");
}
}
public PersistentDataModel getPDM(){
return thePDM;
}
}
Above is my code. I currently have an app that uses object serialization for its data persistence, but I'm in the process of converting it to XML. I'm using the XStream library to create the XML, but I'm having some trouble writing it to disk. XStream gives me the XML as a String, which I then attempt to write to a text file using PrintWriter. However, the text file ends up empty, even though the String I'm attempting to write to it is not. My understanding of PrintWriter is that you supply the file name it should write to, it attempts to write to that file (creating it if it does not exist), and then it writes the contents of the String to the file.
Any assistance would be greatly appreciated. Not sure where I'm going wrong.
You need to add
writer.close()
to the end of your code. A PrintWriter buffers its output, and the buffer is only written out to the file when the writer is flushed or closed.
You need to call PrintWriter::flush() or PrintWriter::close().
Try closing the PrintWriter after you have written the XML to the file.
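For completeness, a minimal sketch of writePDMFile using try-with-resources, which closes (and therefore flushes) the writer automatically:
public void writePDMFile() {
    // try-with-resources closes the writer on exit, flushing the buffered XML to disk
    try (PrintWriter writer = new PrintWriter(theFile)) {
        writer.println(xstream.toXML(thePDM));
    } catch (FileNotFoundException ex) {
        System.out.println("There was a problem writing the file.");
    }
}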
I'm an idiot. I didn't call close on my PrintWriter.
I currently have this code to open an xlsx file using apache POI
File existingXlsx = new File("/app/app.xlsx");
System.out.println("File Exists: " + existingXlsx.exists());
Workbook workbook = WorkbookFactory.create(existingXlsx);
When I try to execute this, I get the following output
File Exists: true
java.lang.NullPointerException
at org.apache.poi.xssf.usermodel.XSSFWorkbook.onDocumentRead(XSSFWorkbook.java:270)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:159)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:186)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:91)
The file I am trying to open can be opened in Excel and shows the data correctly. What can I do to get POI to read the XLSX file?
Here is the file that breaks:
https://mega.co.nz/#!FJMWjQKI!CzihQgMVpxOQDTXzSnb3UFYSKbx4yFTb03-LI3iLmkE
Edit
I have also tried the following, which results in the same error:
Workbook workbook = new XSSFWorkbook(new FileInputStream(existingXlsx));
Edit
I found the line that throws the exception:
WorkbookDocument doc = WorkbookDocument.Factory.parse(getPackagePart().getInputStream());
this.workbook = doc.getWorkbook();
Map<String, XSSFSheet> shIdMap = new HashMap<String, XSSFSheet>();
for(POIXMLDocumentPart p : getRelations())
{
if(p instanceof SharedStringsTable) sharedStringSource = (SharedStringsTable)p;
else if(p instanceof StylesTable) stylesSource = (StylesTable)p;
else if(p instanceof ThemesTable) theme = (ThemesTable)p;
else if(p instanceof CalculationChain) calcChain = (CalculationChain)p;
else if(p instanceof MapInfo) mapInfo = (MapInfo)p;
else if (p instanceof XSSFSheet) {
shIdMap.put(p.getPackageRelationship().getId(), (XSSFSheet)p);
}
}
stylesSource.setTheme(theme); <== BREAKS HERE
Edit
After some research, POI seems unable to find the styles.xml and the workbook.xml. I find this strange, because a simple tool like TextWrangler, which shows the structure of the archive, shows me the styles XML.
How do I fix this? Is there a default styles.xml and workbook.xml which I can insert into the archive?
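As a quick diagnostic, here is a hedged sketch (using the same OPCPackage API that appears in the answers below) that lists every part POI can see inside the archive, so you can check whether /xl/styles.xml and /xl/workbook.xml are actually visible to it:
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
public class ListXlsxParts {
    public static void main(String[] args) throws Exception {
        // Print the name of every package part POI can see inside the .xlsx archive.
        OPCPackage pkg = OPCPackage.open("/app/app.xlsx");
        for (PackagePart part : pkg.getParts()) {
            System.out.println(part.getPartName());
        }
        pkg.close();
    }
}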
Now I've downloaded the latest packages:
poi-src-3.9-20121203.zip (As source)
xmlbeans-2.6.0.zip
jsr173_1.0_api.jar
resolver.jar
xbean.jar
xbean_xpath.jar
xmlbeans-qname.jar
xmlpublic.jar
ooxml-schemas-1.1.jar
dom4j-1.6.1.jar
commons-codec-1.8.jar
commons-logging-1.1.3.jar
ant.jar (ant 1.7)
And your test2.xlsx was read without problems:
public static void main(String arg []){
try {
//File existingXlsx = new File("/app/app.xlsx");
File existingXlsx = new File("c:/Java/poi-3.9/test-data/__theproblem/test2.xlsx");
System.out.println("File Exists: " + existingXlsx.exists());
Workbook workbook = WorkbookFactory.create(existingXlsx);
} catch (Exception e) {
e.printStackTrace();
}
}
Are you sure you're using ooxml-schemas-1.1.jar as the POI documentation recommends?
EDIT
Hmm. It works for me from the jar too.
I have downloaded poi-bin-3.9-20121203.tar.gz from
http://poi.apache.org/download.html
Made a new project in Eclipse, extracted all the jars from the zip:
lib/commons-codec-1.5.jar
lib/commons-logging-1.1.jar
lib/dom4j-1.6.1.jar
lib/junit-3.8.1.jar
lib/log4j-1.2.13.jar
lib/poi-3.9-20121203.jar
lib/poi-examples-3.9-20121203.jar
lib/poi-excelant-3.9-20121203.jar
lib/poi-ooxml-3.9-20121203.jar
lib/poi-ooxml-schemas-3.9-20121203.jar
lib/poi-scratchpad-3.9-20121203.jar
lib/stax-api-1.0.1.jar
lib/xmlbeans-2.3.0.jar
Add the test xlsx:
test-data/test2.xlsx
The test Java:
src/XlsxReadTest1.java
Source:
import java.io.File;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
public class XlsxReadTest1 {
public static void main(String arg []){
try {
File existingXlsx = new File("c:/Java/__Work/apache_POI/poi-3.9-bin/test-data/test2.xlsx");
System.out.println("File Exists: " + existingXlsx.exists());
Workbook workbook = WorkbookFactory.create(existingXlsx);
System.out.println("A1: " + workbook.getSheetAt(0).getRow(0).getCell(0).getStringCellValue());
} catch (Exception e) {
e.printStackTrace();
}
}
}
Run. (Tried with jdk1.7.0_07, jdk1.6.0_31)
Result:
File Exists: true
A1: Testing Edit
"Testing Edit" is the content of the first cell on the first sheet of your file.
I think you may want to try this from scratch.
(Maybe you are using other jars in your project which interfere with these jars in the class loader? The class loader is a cunning guy...)
I guess you just used the wrong POI package.
Try downloading the following, or check the page for the newest version.
I tested the following in my Eclipse setup:
http://www.apache.org/dyn/closer.cgi/poi/release/bin/poi-bin-3.9-20121203.zip
Extract it and include all the jars in your Eclipse lib.
I combined user1234's answer and my own approach; both work on your test2.xlsx:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.xmlbeans.XmlException;
public class Main {
/**
* @param args
*/
public static void main(String[] args) {
// File existingXlsx = new File("app.xlsx");
File file = new File("test2.xlsx");
FileInputStream fs;
try {
fs = new FileInputStream(file);
OPCPackage xlsx = OPCPackage.open(fs);
XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);
System.out.println(xe.getText());
} catch (FileNotFoundException e1) {
e1.printStackTrace();
} catch (XmlException e) {
e.printStackTrace();
} catch (OpenXML4JException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
/// -------------- Another approach
File existingXlsx = new File("test2.xlsx");
System.out.println("File Exists: " + existingXlsx.exists());
try {
Workbook workbook = new XSSFWorkbook(new FileInputStream(
existingXlsx));
Sheet worksheet = workbook.getSheet("Filter criteria");
Row row1 = worksheet.getRow(0);
Cell cellA1 = row1.getCell((short) 0);
String a1Val = cellA1.getStringCellValue();
System.out.println("A1: " + a1Val);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Finally, I got the result.
If you want to read .xlsx files, please try this code (it uses Apache POI 3.9):
File file = new File("/app/app.xlsx");
FileInputStream fs = new FileInputStream(file);
OPCPackage xlsx = OPCPackage.open(fs);
XSSFExcelExtractor xe = new XSSFExcelExtractor(xlsx);
System.out.println(xe.getText());
The above code should display the content of the file app.xlsx.
I have some PDF files. Using PDFBox I have converted them into text and stored the output in text files. Now, from the text files, I want to remove:
Hyperlinks
All special characters
Blank lines
headers footers of pdf files
“1)”,“2)”, “a)”, “bullets”, etc.
I want to get valid text line by line, like this:
We propose OntoGain, a method for ontology learning from multi-word concept terms extracted from plain text. OntoGain follows an ontology learning process defined by distinct processing layers. Building upon plain term extraction a concept hierarchy is formed by clustering the extracted concepts. The derived term taxonomy is then enriched with non-taxonomic relations. Several different state-of-the-art methods have been examined for implementing each layer. OntoGain is based upon multi-word term concepts, as multi-word or compound terms are vested with more solid and distinctive semantics than plain single word terms. We opted for a hierarchical clustering method and Formal Concept Analysis (FCA) algorithm for building the term taxonomy. Furthermore an association rule algorithm is applied for revealing non-taxonomic relations. A method which tries to carry out the most appropriate generalization level between a relation's concepts is also implemented. To show proof of concept, a system prototype is implemented. The OntoGain allows transformation of the derived ontology into OWL using Jena Semantic Web Framework1. OntoGain is applied on two separate data sources a medical and computer corpus and its results are compared with similar results obtained by Text2Onto, a state-of-the-art-ontology learning method. The analysis of 11.5 CCD1.1 results indicates that OntoGain performs better than Text2Onto in terms of precision extracts more correct concepts while being more selective extracts fewer but more reasonable concepts.
How can I achieve this?
Using PDFBox we can achieve this.
Example:
public static void main(String args[]) {
    PDFParser parser = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    PDFTextStripper pdfStripper;
    String parsedText;
    String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
    File file = new File(fileName);
    try {
        parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close the documents on both the success and failure paths.
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }
}
We can extract text from PDF files using Apache Tika.
Example:
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
public class WebPagePdfExtractor {
public Map<String, Object> processRecord(String url) {
DefaultHttpClient httpclient = new DefaultHttpClient();
Map<String, Object> map = new HashMap<String, Object>();
try {
HttpGet httpGet = new HttpGet(url);
HttpResponse response = httpclient.execute(httpGet);
HttpEntity entity = response.getEntity();
InputStream input = null;
if (entity != null) {
try {
input = entity.getContent();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext parseContext = new ParseContext();
parser.parse(input, handler, metadata, parseContext);
map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
map.put("title", metadata.get(TikaCoreProperties.TITLE));
map.put("pageCount", metadata.get("xmpTPg:NPages"));
map.put("status_code", response.getStatusLine().getStatusCode() + "");
} catch (Exception e) {
e.printStackTrace();
} finally {
if (input != null) {
try {
input.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
} catch (Exception exception) {
exception.printStackTrace();
}
return map;
}
public static void main(String arg[]) {
WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://math.about.com/library/q20.pdf");
System.out.println(extractedMap.get("text"));
}
}
You can use iText to do such things.
//iText imports
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
for example:
try {
PdfReader reader = new PdfReader(INPUTFILE);
int n = reader.getNumberOfPages();
String str=PdfTextExtractor.getTextFromPage(reader, 2); //Extracting the content from a particular page.
System.out.println(str);
reader.close();
} catch (Exception e) {
System.out.println(e);
}
Another example:
try {
PdfReader reader = new PdfReader("c:/temp/test.pdf");
System.out.println("This PDF has "+reader.getNumberOfPages()+" pages.");
String page = PdfTextExtractor.getTextFromPage(reader, 2);
System.out.println("Page Content:\n\n"+page+"\n\n");
System.out.println("Is this document tampered: "+reader.isTampered());
System.out.println("Is this document encrypted: "+reader.isEncrypted());
} catch (IOException e) {
e.printStackTrace();
}
The above examples can only extract the text; you need to do some more work to remove hyperlinks, bullets, headings and numbers, as sketched below.
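Here is a minimal post-processing sketch; the helper name and the regexes are illustrative only and will need tuning for your documents. It strips hyperlinks, leading list markers like "1)" or "a)", bullet characters, remaining special characters, and blank lines. Headers and footers are position-dependent, so they are better handled at extraction time (see the ExtractTextByArea example mentioned in the PDFBox answer below).
// Hypothetical helper: clean extracted text line by line.
static String cleanExtractedText(String raw) {
    StringBuilder out = new StringBuilder();
    for (String line : raw.split("\\r?\\n")) {
        line = line.replaceAll("https?://\\S+", "")               // hyperlinks
                   .replaceAll("^\\s*(\\d+|[A-Za-z])\\)\\s*", "") // "1)", "a)" markers
                   .replaceAll("[\u2022\u25E6\u25AA]", "")        // bullet characters
                   .replaceAll("[^A-Za-z0-9.,;:()'\" -]", "")     // other special characters
                   .trim();
        if (!line.isEmpty()) {                                    // drop blank lines
            out.append(line).append('\n');
        }
    }
    return out.toString();
}
Usage: pass the string returned by pdfStripper.getText(pdDoc) through cleanExtractedText before writing it out.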
For newer versions of Apache PDFBox, here is the example from the original source:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.pdfbox.examples.util;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.text.PDFTextStripper;
/**
* This is a simple text extraction example to get started. For more advance usage, see the
* ExtractTextByArea and the DrawPrintTextLocations examples in this subproject, as well as the
* ExtractText tool in the tools subproject.
*
* @author Tilman Hausherr
*/
public class ExtractTextSimple
{
private ExtractTextSimple()
{
// example class should not be instantiated
}
/**
* This will print the documents text page by page.
*
* @param args The command line arguments.
*
* @throws IOException If there is an error parsing or extracting the document.
*/
public static void main(String[] args) throws IOException
{
if (args.length != 1)
{
usage();
}
try (PDDocument document = PDDocument.load(new File(args[0])))
{
AccessPermission ap = document.getCurrentAccessPermission();
if (!ap.canExtractContent())
{
throw new IOException("You do not have permission to extract text");
}
PDFTextStripper stripper = new PDFTextStripper();
// This example uses sorting, but in some cases it is more useful to switch it off,
// e.g. in some files with columns where the PDF content stream respects the
// column order.
stripper.setSortByPosition(true);
for (int p = 1; p <= document.getNumberOfPages(); ++p)
{
// Set the page interval to extract. If you don't, then all pages would be extracted.
stripper.setStartPage(p);
stripper.setEndPage(p);
// let the magic happen
String text = stripper.getText(document);
// do some nice output with a header
String pageStr = String.format("page %d:", p);
System.out.println(pageStr);
for (int i = 0; i < pageStr.length(); ++i)
{
System.out.print("-");
}
System.out.println();
System.out.println(text.trim());
System.out.println();
// If the extracted text is empty or gibberish, please try extracting text
// with Adobe Reader first before asking for help. Also read the FAQ
// on the website:
// https://pdfbox.apache.org/2.0/faq.html#text-extraction
}
}
}
/**
* This will print the usage for this document.
*/
private static void usage()
{
System.err.println("Usage: java " + ExtractTextSimple.class.getName() + " <input-pdf>");
System.exit(-1);
}
}
Extracting all the text from a PDF file on your local machine or from a Base64-encoded string:
import org.apache.commons.codec.binary.Base64;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
public class WebPagePdfExtractor {
public static void main(String arg[]) {
WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
System.out.println("From file: " + webPagePdfExtractor.processRecord(createByteArray()).get("text"));
System.out.println("From string: " + webPagePdfExtractor.processRecord(getArrayFromBase64EncodedString()).get("text"));
}
public Map<String, Object> processRecord(byte[] byteArray) {
Map<String, Object> map = new HashMap<>();
try {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(false);
stripper.setShouldSeparateByBeads(true);
PDDocument document = PDDocument.load(byteArray);
String text = stripper.getText(document);
document.close(); // release the parsed document once its text has been extracted
map.put("text", text.replaceAll("\n|\r|\t", " "));
} catch (Exception exception) {
exception.printStackTrace();
}
return map;
}
private static byte[] getArrayFromBase64EncodedString() {
String encodedContent = "data:application/pdf;base64,JVBERi0xLjMKJcTl8uXrp/Og0MTGCjQgMCBvYmoKPDwgL0xlbmd0aCA1IDAgUiAvRmlsdGVyIC9GbGF0ZURlY29kZSA+PgpzdHJlYW0KeAGF0E0OgjAQBeA9p3hL3UCHlha2Gg9A0sS1AepPxIDl/rFFErVESDddvPlm8nqU6EFpzARjBCVkLHNkipBzPBsc8UCyt4TKgmCr/9HI+GDqg2x8Luzk8UtfYwX5DVWLnQaLmd+qHTsF3V5QEekWidZuDNpgc7L1FvqGg35fOzPlqslFYJrzZdnkq6YI77TXtrs3GBo7oKvNss9mfhT0IAV+e6CUL5pSTWb0t1tVBKbI5McsXxNmciYKZW5kc3RyZWFtCmVuZG9iago1IDAgb2JqCjE4NQplbmRvYmoKMiAwIG9iago8PCAvVHlwZSAvUGFnZSAvUGFyZW50IDMgMCBSIC9SZXNvdXJjZXMgNiAwIFIgL0NvbnRlbnRzIDQgMCBSIC9NZWRpYUJveCBbMCAwIDU5NSA4NDJdC" +
"j4+CmVuZG9iago2IDAgb2JqCjw8IC9Qcm9jU2V0IFsgL1BERiAvVGV4dCBdIC9Db2xvclNwYWNlIDw8IC9DczEgNyAwIFIgL0NzMiA4IDAgUiA+PiAvRm9udCA8PAovVFQxIDkgMCBSID4+ID4+CmVuZG9iagoxMCAwIG9iago8PCAvTGVuZ3RoIDExIDAgUiAvTiAxIC9BbHRlcm5hdGUgL0RldmljZUdyYXkgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCngBhVVdaBxVFD67c2cDEgcftA0ttIM/bQnpMolWE4u12026SRO362ZTmyrKdHY2O81kZpyZ3SahT6XgmxYE6augPsaCCLYqNi/2paXFkko1DwoRWowgKH1S8Dsz22R2QTLDnfnuueeee8537rmXqOtv3fPstEo054R+oZybPjl9Su26TWlSqJvw6Ebg5UqlCcaO65j8b38e3qUUS+7sZ1vtY1v25KoZGNC6huZWA2OOKKURZWqG54dEXZcgHzwbeoxvAz85WynngdeAldZcQHqqYDqmbxlqwdcX1JLv1i" +
"w76etW42xjy2fObrCv/OxG6w5mJ8fx74XPF0xnahJ4H/CSoY8w7gO+27ROFGOcTnvhkXKsn842ZqdyLfnJmn90qiW/UG+MMs4SpZcW65U3gJ8AXnVOF4+39Ndn3XG200Mk9RhB/hTws8Ba3RzjPKnAFd8tsz7Lw6o5PAL8MvAlKxyrAMO+9EPQnGQ5sKDFep79xFoie0Y/VgLeBnzItAu8FuyIiheW2OYg8LxjF3ktxC4um0EUL2IXP4X1ymisL6dDv8JznyaS99Sso2PA4EQerfujLIc/cujZ0d56EXjJb5Q59j3Aa7o/UgCGzcxjVX2YeX4BeIBOpHQyyaXT+Brk0L+INyCLmhHyyMdYDX2bCtBw0Hz0DGgVgHRaAColtEz0WCeeo1IVPZVmollBhNjK/ahvUH7Xp9SAtE7rkNaBXqNfIsk8/Upz6OchbWBspsNuHl44tAgP2BO2+aBl0xXbhSaeRzsoJsQrYlAMkSpeFYfFITEM6ZA4GM2JvU/6zn4+2LD0LtZN+r4MDkKsZ8MzB6xwNAE8+AfrzkaaCbYu7mjs87yP3j/vv2MZtz74s429APoxJ7/BogtrJiXmXj/3TU/CQ3VFfPXWne7r5+h4MktR3qqdWZLX5PvyCr735NWkDflneRXvvbZcPcoL/5O5zSFGO5LNQc48m1G0ccYbwCG4qUVz9rdZTLLptmK0YMlClJ2ruP/LCfPDPLexUnMu7vC8tz9jNs33ig+LdL5Pu6y" +
"ta59oP2p/aCvax0C/Sx9KX0rfSlekq9INUqVr0rL0nfS99Ln0NXpfQLosXenYSXHsG7sHfsZ71mjtMGaGsxQQ88LazApLH/F3BmOb+TOh1V4Dnbt/Yy3liLJTeUYZVnYrzykTSq9yQDmsbFcG0PqVUWUvRnZusGRjPc6AhX+SZ4umI67iPLFXdbDnw0sd76ZfXMPWhjXYST0Ontnapg6vEVe/FVVjvDtdnAY6TSFii84ich86nB8nqv7O2VyTODVSb+KUsMQu0S/GWjWYEwdQheNt9TjIVZoZyQxncqRmejNDmf7MMcZRrNH5ktmL0SF8RxLeM8sx/5s1xGcY7x3mqAlso4dbKzTncd8R5V1vwbdm6qE6oGkvqTlcr6Y65hjZPlW3bTUaClTfDEy/aVazxHc3zyP66/XoTk5tu2E0/GYso1TqJtF/t4+TNAplbmRzdHJlYW0KZW5kb2JqCjExIDAgb2JqCjExMTYKZW5kb2JqCjcgMCBvYmoKWyAvSUNDQmFzZWQgMTAgMCBSIF0KZW5kb2JqCjEyIDAgb2JqCjw8IC9MZW5ndGgg" +
"MTMgMCBSIC9OIDMgL0FsdGVybmF0ZSAvRGV2aWNlUkdCIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4AYVVW4gbVRj+kznJCrvO09rVLaRDvXQpu0u2Fd2ltJpbk7RrGrLZ1RZBs5OTZMzsJM5M0gt9KoLii6u+SUG8vS0IgtJ6wdYH+1KpUFZ36yIoPrR4QSj0RbfxO5NkJllqm2XPfPP93/lv558ZooG1Qr2u+xWiJcM2c8mo8tzRY8rAOvnpIRqkURosqFY9ks3OEn5CK679v1s/kE8wVyfubO9Xb7kbLHJLJfLdB75WtNQl4BNEgbNq3bSJBobBTx+36wKLHIZNJAj8osDlNoaNhhfb+DVHk8/FoDkLLKuVQhF4BXh8sYcv9+B2DlDAT5Ib3NRURfQia9ZKms4dQ3u5h7lHeTe4pDdQs/PbgXXIqs4dxnUMtb9SLMQFngReUQuJOeBHgK81tYVMB9+u29Ec8GNE/p2N6nwEeDdwqmQenAeGH79ZaaS6+J1Tlfyz4LeB/8ZYzBzp7F1TrRh6STvB367wtOhviEhSN" +
"DudB4Yf6YBZywk9cpBKRR5PAI8Dv16tHRY5wKf0mdWcE7zIZ+1UJSbyFPzllwqHssCjwL9yPSn0iCX9W7eznRxYyNAzIi5isTi3nHrhh4XsSj4FHnGZbpv5zl62XNIOpjv6TypmSvBi77W67swocgv4zUZO1I5YgcmCmUgCw2cgy4150U+Bm7TgKxCnGi1iVcmgTVIoR0mK4lonE5YSaaSD4bByMBx3Xc2Es8+iKniNmo7Nwpp1lO2dXa1CZbAGXXe0KsVCH1EDnir0B9iK61OhGO4a4Mr/46edy42OnxobYWG2F//72Czbz6bZDCnsKfY0O8DiYGfYPtd3Fnu6FYl8biBK28/LiMgd3QJqv4gabSpg/QWKGlmuh76uLI82xjzLGfMFTb3yxt89vdKws+oqJvo6euRePQ/8FrgeWMW6HthwfSiBnwIb+FtHb7xaap6902VxUhpOtNan23oWXVUElerOziV0QUPNvKfmiV4fl05/+aAXbZWde/7q0KXTJWN51GNFF/irmVsZOjPuseEfw3+GV8PvhT8M/y69LX0qfSWdlz6XLpMiXZ" +
"AuSl9L30ofS1+4+rvNkHv2JDIXcyXyFtPVrbC315hYOSpvlx+W4/IO+VF51lUp8og8JafkXbBsd8/Nm2+lt3L05Siidftz51jiWdFcTzgD3/2YAM2L2DcD88hYo+PwaaLfYt4MOglt75PXqYiF2BRLb5nuaTHzXd/BRDAejJAS3B2cCU4FDwncfZaDu2CbwZrozQ3z4Sr6KuU2PyG+JxSr1U+aWrliK3vC4SeVCD59XEkb6uS4UtB1xTFZisktbjZ5cZLEd1PsI7qZc76Hvm1XPM5+hmj/X3j3fe9xxxpEKxbRyOMeN4Z35QPvEp17Qm2YzbY/8vm+I7JKe/c4976hKN5fP7daN/EeG3iLaPPNVuuf91utzQ/gf4Pogv4foJ98VQplbmRzdHJlYW0KZW5kb2JqCjEzIDAgb2JqCjEwNzkKZW5kb2JqCjggMCBvYmoKWyAvSUNDQmFzZWQgMTIgMCBSIF0KZW5kb2JqCjMgMCBvYmoKPDwgL1R5cGUgL1BhZ2VzIC9NZWRpYUJveCBbMCAwIDU5NSA4NDJdIC9Db3VudCAxIC9LaWR" +
"zIFsgMiAwIFIgXSA+PgplbmRvYmoKMTQgMCBvYmoKPDwgL1R5cGUgL0NhdGFsb2cgL1BhZ2VzIDMgMCBSID4+CmVuZG9iago5IDAgb2JqCjw8IC9UeXBlIC9Gb250IC9TdWJ0eXBlIC9UcnVlVHlwZSAvQmFzZUZvbnQgL0NOVFpYVStNZW5" +
"sby1SZWd1bGFyIC9Gb250RGVzY3JpcHRvcgoxNSAwIFIgL0VuY29kaW5nIC9NYWNSb21hbkVuY29kaW5nIC9GaXJzdENoYXIgMzIgL0xhc3RDaGFyIDExNiAvV2lkdGhzIFsgNjAyCjAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgNjAyIDYwMiA2MDIgNjAyIDYwMiA2MDIgMCAwIDAgMCAwIDAgMCAwIDAKMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgNjAyIDAgMAo2MDIgNjAyIDYwMiA2MDIgNjAyIDYwMiAwIDAgNjAyIDYwMiAwIDAgNjAyIDAgMCA2MDIgNjAyIF0gPj4KZW5kb2JqCjE1IDAgb2JqCjw8IC9UeXBlIC9Gb250RGVzY3JpcHRvciAvRm9udE5hbWUgL0NOVFpYVStNZW5sby1SZWd1bGFyIC9GbGFncyAzMyAvRm9udEJCb3gKWy01NTggLTM3NSA3MTggMTA0MV0g" +
"L0l0YWxpY0FuZ2xlIDAgL0FzY2VudCA5MjggL0Rlc2NlbnQgLTIzNiAvQ2FwSGVpZ2h0IDcyOQovU3RlbVYgOTkgL1hIZWlnaHQgNTQ3…/ZfICj5JcLdi/ATmQZKogDPg0lIDBunI0ZGOB1OB/Lpyce1TbJqCpBThycVs3GyQPZSLKexbMGyFss8LF4sNb2lElu5HPlJ2439G1jKsbRh6cTyPNpx8I6AFxa8P+xD2E4e/G+5PqJ/8aDzERFvGBJR/WLkfwcM3kRCiZpokDMdxhn5MeD9Rn5MSm0mYUpLSF98J5HXaQgtpJvoDWGesEe4C4NgK3woWsQ88RgzszXsMM4WyALeIC5gO5B/FYk/pNxVCJGoZT8NYc8LIknrONeVQYznus51pYeZHCaXw+RYIJLAEogJfMEbVPrvv31S6icvTMlp1EQhO41cOuXb0EEkSYkmGaMXSzuIfhCKAA4Y/YScTs9ASizblWVyWB1UT4fwNfSp9+mgwLFd4oI3D++9++kuheYWpOnEeBhLJrv7kVg" +
"Xk1hkVDRExLgkieUZTTt1jZYGkTTiXU8tULUtIsEIfeKMgY5AV3u7yZyTQdK6Mm923fwgHe1GZWTfmCJy5CYi05PgwqWzB5HBw2n2wL7OBEmVPZxmZYpWi6TSU7pM2BNY1kojs0sLN1bPOLZ4/nuzL1CNp/S+zt27dx+lA4Y/2/hA1fq8kR9kZF57u6R96YgvZRmsvfe5OBj5TSKjkN+wRqu6LrRF1yjF19lbYhudDVKTdVe/8DAClihbX6MNEuItofH9kF9k+FwXMofC7rqCDHcZu293G7tz0qmNWi2iM6FvYrYN2RuEvCbT7GDnZ0xDyMZt/Otb8z+aP+/dOS17927ZurVu24ZVnrYFz7w95jxlayE+8b3Nf/m6b5/j2QMb1v26qeXZ8iWVSUkH7fYLb1bKBw7aA57LYgVGYAGtLM8dT3WgIwC6PAIaVSOjsCaUatXEFiJKBm0fvTEQODe0K9Mki/mK3DPnBOUsHkchH/ckhFIHZJmyrE6T0+TIFi7xfvRjx/X33jves5rFBb6Gk4GsHXwbLX1Hlp0XZZeKa8eRYe4EURUX3agy" +
"1RnXWxp1QiNZo2tS7baBjUTYqDqBGONtspI7UEwosSsoMUVevAM5CEO9mmRVEquF/ExwsrxOCTd7OpKnpXxFjfzzO8uPTnz44Oydb7bunLy1kHXu5huMBt59vYvfsNtPZmb4tjfvdblQGjXI23jUayTpg9w5VfFRjer4RqP6DRGPs/ViY3iDscmVYCN9dQkqKZaGxbuMga6uwBXZeYLq/MKI6jShPq0DqDNBUBg0Wy2C0y6YjMSRGU4TJKslPKhYuJS7fkL7u+m7F33yzc3PeOBb6qSWsZv4Zys3bVq5as0atu+gK5Ff4ldLH+d3vvuW36bL6Ab6LF0X37Pw4I4dB//4+z0+RZ8y306xEuNGP7LI3V+tItF2baRBRfZHqurNjjr7O3H1fdrMTZE6GilG6dWSNt8uStbh/Y03u9AkM1G3snI7rtwMyCKWd2DKMeegZ6W749Lj0+3pjvSEZtJMm4VmdbNme3hzRHNkc1RztH4m7rJ3Q4OzB5uc2XpE9M0eOOh+mi1LoNfdwlGfQtuwV159duGWPfTAgfv/VP3GBz98d4eu2jirfca81uK6o8P62oWsJxaXLT57sN/4npUtpY/8eXvr4bhVzwwa6E9MnDIlc2PQditxr2bMIowYLdLd0YxYouv1lvqQJn0bfREiRCIJo0xmzeg43Ju8NTk2KIaDe0ynpqxeHlEd5ixUx790gazCdL9/QFPpiWvX3y/byg1ramr" +
"q6mpq1sAZYeQ/utYVTaP3Uys10cHTuOaj8xfPdV44L/uSzE8Jyt6K/GQFIyKGKSUIChgEY09jScPcUUJf02C4DCNRShuL32qS0zOo1WGVZIMYbEXZ2QmaSVamWRUUnlgS+DzknT3F7eWPHpnBf+Dnqf3GR3d82g1ran4XItRPl744dl/OW8nJNIeGUS118782Lt3lWyT72RH08USUUxgZiFIyUm3IfonWkxf10mG1EKYioUzSGTQW47mhHYGhHZmKAVzJDKD60b9lA0aJxFHZyWSndqBKs8TEM3Mn0JV8hZ930uRdf5IsTZPnz/UG0uCMd6JfTq9lefDRornXFke7E6O0tpjEUDDXhYWH1tvC6w2AlmgzHEk63D8xikjaUZLZ7BiNhtjRqy10846gERo7u9EC0RKRGyV0Bx0nDL3pxzg5TJANrleZEdlZMH31ytXrvWtWrPZ3Xx3fUjSneeTmNSlbyjuuX+9Y2JDmF3JOffzxqVOfnuefBXggNmb/gJTtvpCqWQ/TIVSFZ+mQh6ZvkPcRlF+MIr8Ud2SoHvC3OKne1KZ9UU0FiYzVhUqaQgvaGIoMTWwoxnKM67KFOU1BZrGTpfh/uBhz4LEnVtb5/RmvL3ljl7C/Z6ywv3H9W2/0rJYsPTtK5l6W5daN+qqWDHh+6oicYqKpyD9kyocpRTtSXcR8Em1J7muwlW1LexrtSs5JZLuScwa5pXgGyx/hdZxIeA" +
"JTPMvxBMSaYoSmQ+nHtDywiJbzyzTe7xcfDmR5vTBcyPsKv7yBPEyXopGDxCAHOoUoppRILPQiribnP/IqOjlLka3XEo5eIbu8bCTCqRmej6+99ib/lF6im3/13ItnD8OtF4LyxN+YxQq0iwTyqjsx0mwIFVUkLkZSWbX1dmiLORxlVBGTIWSC" +
"NNE0wTAxNnJCdIHTeHOcTzt1nM80dUbxARJ9r/0+T2BoAA0leOYPHXrlpnIwoYmg8NPdo9LFdJYupavSQ9JD09Xpmtzw3IjcyNyo3OjcmNzY3LhcWzVUi9WsWqpWVYdUh1arqzXecG+EN9Ib5Y32xnhjvXFem5POpPLJEh5Ff6LMf2vVqgwKOx" +
"IeHbu64vXswkn3v54zdkzOzp2Oubnjy6B7dMEZfqlnubDymyWVX/SsEFbeWCy3YknJ0NxCWddt/CFxKspCjmFZ7tgfY1ibvokegcNxGL9GKZGsUI5imYqJoV/8GMZcsm0pXJhNRtkbfuofdPmBA3IYu/rV+/Oa6I3VNatqa1fVrF7Xc1xSe4um" +
"8Xf5df53fnwavfXR+Qud5y5iFJPtvRPjmIQ8JZJlbrdOK+g1EfG2kFBBpY6wxdvy4myRao0tXrSSOtouWuqs7ZH1JrHe1WZqSopTa+JjVOSBGEk/RiVZEgqSgu58RXZfObDIh6OR3+o23uo2RyjHipKn6ZU8Tak9CcETUz5L4pVkSPq3k2cPTBMGYPo2CFUCJx9oLqqqfPitsWvXdX1YtP+x+YemPrvqVkjBy789//70FjFn34ABk4vGjXXqo7dVtbQ6nW3Z2XM91RmCPn7jilf+4FD2incAMYS9hLExwx2pZyEG2E9M9HDIfnWIJhTzYclo1v88MnbdHNohp0Cyg8sx8Wdmb8Lr7nY+a9ayU5dP7ZZDI3uJH/b2NP9qzsaWE0KJlw5HnSvPvefwlwRZ2r988CqMnmzBUyScRGD+EUXyySgymowhY8k4Mp48gLl+EXmITFM+pPjvhSANSb5Ej5w4dXrxg8kTyhYtrEidUjZ/2cLZTxLyT2S78dEKZW5kc3RyZWFtCmVuZG9iagoxNyAwIG9iago0ODAxCmVuZG9iagoxOCAwIG9iagooKQplbmRvYmoKMTkgMCBvYmoKKE1hYyBPUyBYIDEwLjEyLjYgUXVhcnR6IFBERkNvbnRleHQpCmVuZG9iagoyMCAwIG9iagooKQplbmRvYmoKMjEgMCBvYmoKKCkKZW5kb2JqCjIyIDAgb2JqCihUZXh0TWF0ZSkKZW5kb2JqCjIzIDAgb2JqCihEOjIwMTcxMjEyMTMwMzQ4WjAwJzAwJykKZW5kb2JqCjI0IDAgb2JqCigpCmVuZG9iagoyNSAwIG9iagpbICgpIF0KZW5kb2JqCjEgMCBvYmoKPDwgL1RpdGxlIDE4IDAgUiAvQXV0aG9yIDIwIDAgUiAvU3ViamVjdCAyMSAwIFIgL1Byb2R1Y2VyIDE5IDAgUiAvQ3JlYXRvcgoyMiAwIFIgL0NyZWF0aW9uRGF0ZSAyMyAwIFIgL01vZERhdGUgMjMgMCBSIC9LZXl3b3JkcyAyNCAwIFIgL0FBUEw6S2V5d29yZHMKMjUgMCBSID4+CmVuZG9iagp4cmVmCjAgMjYKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDA4OTI5IDAwMDAwIG4gCjAwMDAwMDAzMDAgMDAwMDAgbiAKMDAwMDAwMzAyOCAwMDAwMCBuIAowMDAwMDAwMDIyIDAwMDAwIG4gCjAwMDAwMDAyODEgMDAwMDAgbiAKMDAwMDAwMDQwNCAwMDAwMCBuIAowMDAwMDAxNzUzIDAwMDAwIG4gCjAwMDAwMDI5OTIgMDAwMDAgbiAKMDAwMDAwMzE2MSAwMDAwMCBuIAowMDAwMDAwNTEyIDAwMDAwIG4gCjAwMDAwMDE3MzIgMDAwMDAgbiAKMDAwMDAwMTc4OSAwMDAwMCBuIAowMDAwMDAyOTcxIDAwMDAwIG4gCjAwMDAwMDMxMTEgMDAwMDAgbiAKMDAwMDAwMzU0NCAwMDAwMCB" +
"uIAowMDAwMDAzNzk2IDAwMDAwIG4gCjAwMDAwMDg2ODcgMDAwMDAgbiAKMDAwMDAwODcwOCAwMDAwMCBuIAowMDAwMDA4NzI3IDAwMDAwIG4gCjAwMDAwMDg3ODAgMDAwMDAgbiAKMDAwMDAwODc5OSAwMDAwMCBuIAowMDAwMDA4ODE4IDAwMDAwIG4gCjAwMDAwMDg4NDUgMDAwMDAgbiAKMDAwMDAwODg4NyAwMDAwMCBuIAowMDAwMDA4OTA2IDAwMDAwIG4gCnRyYWlsZXIKPDwgL1NpemUgMjYgL1Jvb3QgMTQgMCBSIC9JbmZvIDEgMCBSIC9JRCBbIDxkYjc4M2NhNDM2Mzg4YzI5ZDc5MDQ2NzY3NjUxNjE3OT4KPGRiNzgzY2E0MzYzODhjMjlkNzkwNDY3Njc2NTE2MTc5PiBdID4+CnN0YXJ0eHJlZgo5MTA0CiUlRU9GCg==";
String content = encodedContent.substring("data:application/pdf;base64," .length());
return Base64.decodeBase64(content);
}
public static byte[] createByteArray() {
    String pathToBinaryData = "/bla-bla/src/main/resources/small.pdf";
    File file = new File(pathToBinaryData);
    if (!file.exists()) {
        System.out.println(pathToBinaryData + " could not be found");
        return null;
    }
    byte[] fileContent = new byte[(int) file.length()];
    try (FileInputStream fin = new FileInputStream(file)) {
        // Read the whole file into the buffer (fine for a small sample PDF).
        fin.read(fileContent);
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
    return fileContent;
}
}