I have a project that splits a PDF file uploaded by a user, groups the resulting pages by their content, and merges each group back together. I use PDDocument (PDFBox) for the split and PDFMergerUtility for the merge, and after merging I save the merged PDF to the database as a byte array.
After saving to the DB, the user can also download the split-and-merged PDF and re-upload it when needed.
The problem I have found is that after the merge, the PDF is larger than the original PDF before the split.
I have tried to find a solution, but nothing I found works for my problem, for example:
Android PdfDocument file size
Is there a way to compress PDF to small size using Java?
and other similar solutions.
Is there any solution to my problem?
I would be glad for any help.
Here is my code:
// file: MultipartFile -> file is sent from the front end via the API
val inpStream: InputStream = file.getInputStream()
val pdfDocument = PDDocument.load(inpStream)
val n = pdfDocument.numberOfPages
val batchSize = 200
val finalBatchSize: Int = n % batchSize
val numOfBatch: Int = (n - finalBatchSize) / batchSize
val batchFinal: Int = if (finalBatchSize == 0) numOfBatch else numOfBatch + 1
var batchNo = 1
var startPage: Int
var endPage = 0
while (batchNo <= batchFinal) {
    startPage = endPage + 1
    endPage += if (batchNo > numOfBatch) finalBatchSize else batchSize
    val splitter = Splitter()
    splitter.setStartPage(startPage)
    splitter.setEndPage(endPage)
    // splitting the pages of the PDF document
    val pagesPdf = splitter.split(pdfDocument)
    batchNo++
    var groupPage: Int = startPage
    var pageNo = 0
    var pdfMerger = PDFMergerUtility()
    var mergedFileByteArrOut = ByteArrayOutputStream()
    pdfMerger.setDestinationStream(mergedFileByteArrOut)
    var fileObj: ByteArray? = null
    for (pd in pagesPdf) {
        pageNo++
        if (!pd.isEncrypted) {
            val stripper = PDFTextStripper()
            // CODE TO GET CONTENT
            if (condition1) {
                val fileByteArrOut = ByteArrayOutputStream()
                pd.save(fileByteArrOut)
                pd.close()
                val fileByteArrIn = ByteArrayInputStream(fileByteArrOut.toByteArray())
                pdfMerger.addSource(fileByteArrIn)
                fileObj = fileByteArrOut.toByteArray()
            }
            if (condition2) {
                // I want to compress fileObj before saving it to the DB
                // code to save to DB
                fileObj = null
                pdfMerger = PDFMergerUtility()
                mergedFileByteArrOut = ByteArrayOutputStream()
                pdfMerger.setDestinationStream(mergedFileByteArrOut)
            }
        }
    }
}
You can use cpdf https://community.coherentpdf.com to losslessly squeeze the PDF files afterward. This reconciles identical objects and shared parts, and removes anything unneeded.
From the command line:
cpdf -squeeze in.pdf -o out.pdf
Or, from Java:
jcpdf.squeezeInMemory(pdf);
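For the flow in the question (byte arrays held in memory before saving to the DB), the same squeeze can be done in memory. The sketch below is only an outline: squeezeInMemory is taken from the answer above, while fromMemory/toMemory and their parameters are assumptions about the jcpdf binding, so check the jcpdf documentation for the exact signatures.

import com.coherentpdf.Jcpdf;

public class SqueezeBeforeSave {
    // Assumed helper: squeeze a merged PDF held as a byte array before it is
    // persisted. Method names other than squeezeInMemory are assumptions
    // about the jcpdf API and may need adjusting.
    static byte[] squeeze(byte[] mergedPdf) throws Exception {
        Jcpdf jcpdf = new Jcpdf();
        Jcpdf.Pdf pdf = jcpdf.fromMemory(mergedPdf, ""); // "" = no user password
        jcpdf.squeezeInMemory(pdf);                      // lossless squeeze
        return jcpdf.toMemory(pdf, false, false);        // no linearization, no new /ID
    }
}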
I created a Scala application that, as part of its functionality, uploads any type of file (BZ2, CSV, TXT, etc.) to Google Cloud Storage.
Direct upload works fine for small files, but for large files Google recommends using a signed URL ("signUrl"). With that approach the file is never created (nor updated if I create it beforehand), and no exception is thrown.
Works with small files:
val storage: Storage = StorageOptions.getDefaultInstance.getService
val fileContent = Files.readAllBytes(file.toPath)
val fileId: BlobId = BlobId.of(bucketName, s"$folderName/$fileName")
val fileInfo: BlobInfo = BlobInfo.newBuilder(fileId).build()
storage.create(fileInfo, fileContent)
Doesn't work:
val storage: Storage = StorageOptions.getDefaultInstance.getService
val outputPath = s"$folderName/$fileName"
val fileInfo: BlobInfo = BlobInfo.newBuilder(bucketName, outputPath).build()
val optionWrite: SignUrlOption = Storage.SignUrlOption.httpMethod(HttpMethod.PUT)
val signedUrl: URL = storage.signUrl(fileInfo, 30, TimeUnit.MINUTES, optionWrite)
val connection = signedUrl.openConnection
connection.setDoOutput(true)
val out = connection.getOutputStream
val inputStream = Files.newInputStream(file.toPath)
var nextByte = inputStream.read()
while (nextByte != -1) {
  out.write(nextByte)
  nextByte = inputStream.read()
}
out.flush()
inputStream.close()
out.close()
I tried reading/writing byte by byte, using an array of bytes, and using an OutputStreamWriter, but none of them work.
The library I'm using is:
"com.google.cloud" % "google-cloud-storage" % "1.12.0"
Does anyone know why this is not working?
Thanks
I found an easier way to write big files to Google Storage, without using the signed URL.
I will add the code in case someone finds it useful.
storage.create(fileInfo)
val writer: WriteChannel = storage.get(fileInfo.getBlobId).writer()
val inputStream: InputStream = Files.newInputStream(file.toPath)
val packetToRead: Array[Byte] = new Array[Byte](sizeBlockDefault.toInt)
while (inputStream.available() > 0) {
  val numBytesReaded = inputStream.read(packetToRead, 0, sizeBlockDefault.toInt)
  writer.write(ByteBuffer.wrap(packetToRead, 0, numBytesReaded))
}
inputStream.close()
writer.close()
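For reference, one likely reason the signed-URL attempt in the question appears to do nothing: with setDoOutput(true) the connection's default request method is POST (not the PUT the URL was signed for), and HttpURLConnection only sends the request once the response is read, so errors pass silently if the response code is never checked. A minimal Java sketch of the usual pattern, assuming a signedUrl obtained as in the question:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SignedUrlPut {
    // Hypothetical example: 'signedUrl' is assumed to come from
    // storage.signUrl(...) with SignUrlOption.httpMethod(HttpMethod.PUT).
    static void upload(URL signedUrl, String path) throws Exception {
        HttpURLConnection connection = (HttpURLConnection) signedUrl.openConnection();
        connection.setDoOutput(true);
        connection.setRequestMethod("PUT"); // must match the method the URL was signed for
        try (OutputStream out = connection.getOutputStream()) {
            Files.copy(Paths.get(path), out); // stream the file body
        }
        // The request is only committed once the response is consumed.
        int code = connection.getResponseCode();
        System.out.println("Upload finished with HTTP status " + code);
    }
}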
Is it possible to use the AngelList API to get the entire list of companies (startups) on the AngelList website without knowing all of their IDs?
Or, is there a way to get all of the company IDs?
I'm trying a JSON parser against generated URLs, because the AngelList tag URLs cover locations, markets, and other categories.
I would like to obtain all the AngelList startups and companies in a txt file.
for (int h = 1612; h <= 1885; h++) {
    int i = 1;
    int pagenumber = 1;
    do {
        // change the URL as per the requirement and paginate it
        String libURL = "https://api.angel.co/1/tags/" + h + "/startups?page=" + i;
        InputStream in = URI.create(libURL).toURL().openStream();
        // writing each page into a separate file
        FileOutputStream fout = new FileOutputStream(
                "/Users/Fabio/Desktop/FilesAngellist/file" + i + ".txt");
        byte[] data = new byte[1024];
        int count;
        while ((count = in.read(data, 0, 1024)) != -1) {
            fout.write(data, 0, count);
        }
        fout.close();
        in.close();
        if (i == 1) {
            // code to pull the "last_page" value from the JSON file
            DownloadJobJson db = new DownloadJobJson();
            pagenumber = db.DownloadJobJson1();
        }
        i = i + 1;
    } while (i <= pagenumber);
}
This is the code of the JSON downloader for the AngelList URLs.
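For completeness, here is a hypothetical sketch of the "last_page" lookup that DownloadJobJson performs above (its code is not shown); the use of org.json and the exact response layout are assumptions based on the comment in the code.

import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.json.JSONObject;

public class LastPageProbe {
    // Hypothetical replacement for DownloadJobJson: fetch page 1 of a tag and
    // read the "last_page" field from the AngelList JSON response.
    static int readLastPage(int tagId) throws Exception {
        String libURL = "https://api.angel.co/1/tags/" + tagId + "/startups?page=1";
        try (InputStream in = URI.create(libURL).toURL().openStream()) {
            String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return new JSONObject(body).getInt("last_page");
        }
    }
}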
Solved using the linkedin-j project.
I have (in most cases) 2 PDF files: one main PDF containing around 30,000 pages, and another containing a single page that I want to insert after every X pages of the main one (depending on my separate index file), while also adding barcodes to each page.
The problem I have is that the resulting PDF becomes very large (10 GB+), while the input files are 350 MB and the small one I want to insert is under 50 kB.
What's a good way to optimize the size of the PDF I'm creating?
Here are the relevant parts of the code handling the PDF merge.
PdfImportedPage page;
PdfSmartCopy outPdf;
PdfSmartCopy.PageStamp stamp;
PdfReader pdfReader, pdfInsertReader;
...
outDoc = new Document();
outPdf = new PdfSmartCopy(outDoc, new FileOutputStream(outFile));
pdfToolset = new PDFToolset();
outDoc.open();
...
// loop over the pages in my index file
for (IndexPage indexPage : item.pages) {
    if (indexPage.insertPage) {
        currentDoc = indexPage.source_file;
        outPdf.freeReader(pdfReader);
        outPdf.flush();
        if (!currentDoc.equals(insertDoc)) {
            insertDoc = currentDoc;
            pdfInsertReader = new PdfReader(currentDoc);
        }
    } else if (!currentDoc.equals(indexPage.source_file)) {
        currentDoc = indexPage.source_file;
        outPdf.freeReader(pdfInsertReader);
        outPdf.flush();
        if (!mainDoc.equals(currentDoc)) {
            mainDoc = currentDoc;
            pdfReader = new PdfReader(mainDoc);
        }
    }
    if (indexPage.insertPage)
        page = outPdf.getImportedPage(pdfInsertReader, indexPage.source_page);
    else
        page = outPdf.getImportedPage(pdfReader, indexPage.source_page);
    if (!duplex || (duplex && indexPage.nr % 2 == 1)) {
        stamp = outPdf.createPageStamp(page);
        stamp = pdfToolset.applyBarcode(stamp, indexPage.omr, indexPage.nr);
        stamp.alterContents();
    }
    outPdf.addPage(page);
}
...
Using rjb with Ruby 1.9.3 and iText 4.2.0:
What I'm trying to do is merge TIFF files into PDFs. However, I would like the merge to happen in memory, rather than saving the TIFF-to-PDF conversion to a file and then combining the files. Rather than supplying PdfReader with a filename, I have seen examples saying it can also take a byte array as input. I can get the file version working, but I get an error when I use a ByteArrayOutputStream, and I'm not sure why.
The following seems to work fine when using a filestream:
def tiff_to_pdf_by_file(image_file_name)
  @document = Rjb::import('com.lowagie.text.Document')
  @tifreader = Rjb::import('com.lowagie.text.pdf.codec.TiffImage')
  @randomaccess = Rjb::import('com.lowagie.text.pdf.RandomAccessFileOrArray')
  @pdfwriter = Rjb::import('com.lowagie.text.pdf.PdfWriter')
  @filestream = Rjb::import('java.io.FileOutputStream')
  pdf = @document.new
  @pdfwriter.getInstance(pdf, @filestream.new('test_temp.pdf'))
  pdf.open()
  ra = @randomaccess.new(image_file_name)
  pages = @tifreader.getNumberOfPages(ra)
  (1..pages).each do |i|
    image = @tifreader.getTiffImage(ra, i)
    scaler = ((pdf.getPageSize().getWidth() - pdf.leftMargin() - pdf.rightMargin()) / image.getWidth() * 100)
    image.scalePercent(scaler)
    pdf.add(image)
  end
  pdf.close()
  return 'test_temp.pdf'
end

@pdfreader = Rjb::import('com.lowagie.text.pdf.PdfReader')
@pdfcopyfields = Rjb::import('com.lowagie.text.pdf.PdfCopyFields')
@filestream = Rjb::import('java.io.FileOutputStream')
filestream = @filestream.new('new_combined_pdf.pdf')
copy = @pdfcopyfields.new(filestream)
copy.addDocument(@pdfreader.new(tiff_to_pdf_by_file('test_image.tif')))
copy.addDocument(@pdfreader.new('test_template.pdf'))
copy.close()
But when I try to use a byte array, as in the code below, I get the error "Not found as a file or resource" on the line where PdfReader reads the byte array.
def tiff_to_pdf_by_file(image_file_name)
  @document = Rjb::import('com.lowagie.text.Document')
  @tifreader = Rjb::import('com.lowagie.text.pdf.codec.TiffImage')
  @randomaccess = Rjb::import('com.lowagie.text.pdf.RandomAccessFileOrArray')
  @pdfwriter = Rjb::import('com.lowagie.text.pdf.PdfWriter')
  @bytestream = Rjb::import('java.io.ByteArrayOutputStream')
  pdf = @document.new
  outstream = @bytestream.new
  @pdfwriter.getInstance(pdf, outstream)
  pdf.open()
  ra = @randomaccess.new(image_file_name)
  pages = @tifreader.getNumberOfPages(ra)
  (1..pages).each do |i|
    image = @tifreader.getTiffImage(ra, i)
    scaler = ((pdf.getPageSize().getWidth() - pdf.leftMargin() - pdf.rightMargin()) / image.getWidth() * 100)
    image.scalePercent(scaler)
    pdf.add(image)
  end
  pdf.close()
  outstream.flush()
  return outstream.toByteArray()
end

@pdfreader = Rjb::import('com.lowagie.text.pdf.PdfReader')
@pdfcopyfields = Rjb::import('com.lowagie.text.pdf.PdfCopyFields')
@filestream = Rjb::import('java.io.FileOutputStream')
filestream = @filestream.new('new_combined_pdf.pdf')
copy = @pdfcopyfields.new(filestream)
copy.addDocument(@pdfreader.new(tiff_to_pdf_by_file('test_image.tif')))
copy.addDocument(@pdfreader.new('test_template.pdf'))
copy.close()
I have been trying to split one big PDF file into multiple PDF files based on size. I was able to split it, but it only creates one single file and the rest of the data is lost; it never creates more than one output file. Can anyone please help? Here is my code:
public static void main(String[] args) {
    try {
        PdfReader Split_PDF_By_Size = new PdfReader("C:\\Temp_Workspace\\TestZip\\input1.pdf");
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("C:\\Temp_Workspace\\TestZip\\File1.pdf"));
        document.open();
        int number_of_pages = Split_PDF_By_Size.getNumberOfPages();
        int pagenumber = 1;     /* To generate file name dynamically */
        // int Find_PDF_Size;   /* To get PDF size in bytes */
        float combinedsize = 0; /* To convert this to kilobytes and estimate new PDF size */
        for (int i = 1; i < number_of_pages; i++) {
            float Find_PDF_Size;
            if (combinedsize == 0 && i != 1) {
                document = new Document();
                pagenumber++;
                String FileName = "File" + pagenumber + ".pdf";
                copy = new PdfCopy(document, new FileOutputStream(FileName));
                document.open();
            }
            copy.addPage(copy.getImportedPage(Split_PDF_By_Size, i));
            Find_PDF_Size = copy.getCurrentDocumentSize();
            combinedsize = (float) Find_PDF_Size / 1024;
            if (combinedsize > 496 || i == number_of_pages) {
                document.close();
                combinedsize = 0;
            }
        }
        System.out.println("PDF Split By Size Completed. Number of Documents Created: " + pagenumber);
    }
    catch (Exception i) {
        System.out.println(i);
    }
}
}
(BTW, it would have been great if you had tagged your question with itext, too.)
PdfCopy used to close the PdfReaders it imported pages from whenever the source PdfReader for page imports switched or the PdfCopy was closed. This was due to the original intended use case to create one target PDF from multiple source PDFs in combination with the fact that many users forget to close their PdfReaders.
Thus, after you close the first target PdfCopy, the PdfReader is closed, too, and no further pages are extracted.
If I interpret the most recent checkins into the iText SVN repository correctly, this implicit closing of PdfReaders is in the process of being removed from the code. Therefore, with one of the next iText versions, your code may work as intended.
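In the meantime, a possible workaround is to open a fresh PdfReader for every output file, so closing one PdfCopy cannot invalidate the reader used for the next part. A minimal sketch under that assumption follows; the paths and the 496 KB threshold are taken from the question, and the com.itextpdf.text package names are an assumption (older iText versions use com.lowagie.text instead).

import java.io.FileOutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;

public class SplitBySizeWorkaround {
    public static void main(String[] args) throws Exception {
        String source = "C:\\Temp_Workspace\\TestZip\\input1.pdf";
        int part = 1;
        // One reader per output file, so the implicit close cannot hurt us.
        PdfReader reader = new PdfReader(source);
        int pages = reader.getNumberOfPages();
        int page = 1;
        while (page <= pages) {
            Document document = new Document();
            PdfCopy copy = new PdfCopy(document,
                    new FileOutputStream("C:\\Temp_Workspace\\TestZip\\File" + part + ".pdf"));
            document.open();
            // Copy pages until the running size estimate exceeds ~496 KB.
            while (page <= pages) {
                copy.addPage(copy.getImportedPage(reader, page));
                page++;
                if (copy.getCurrentDocumentSize() / 1024f > 496) {
                    break;
                }
            }
            document.close(); // this may also close 'reader' in affected iText versions
            part++;
            if (page <= pages) {
                reader = new PdfReader(source); // reopen for the next part
            }
        }
        System.out.println("Created " + (part - 1) + " files.");
    }
}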