Extract unselectable content from PDF

I'm using Apache PDFBox to extract pages from PDF files, and I can't find a way to extract content that is unselectable (either text or images). Content that is selectable within the PDF files causes no problem.
Note that the PDFs in question don't have any restrictions on copying content, at least according to each file's "Document Restrictions Summary": they all have "Content Copying" and "Content Copying for Accessibility" allowed! The same PDF file contains content that is selectable and other parts that aren't. As a result, the extracted pages come out with "holes", i.e., they only contain the selectable parts of the PDF. In MS Word, though, if I add the PDFs as objects, the whole content of the PDF pages appears! So I was hoping to do the same with the PDFBox library, or any other Java library for that matter!
Here is the code I'm using to convert PDF pages to images:
private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
for (PDPage pdPage : pdPages) {
BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
}
document.close();
}
Is there a way to extract unselectable content from a PDF with Apache PDFBox (or with any similar library)? Or is this not possible at all? And if it isn't, why not?
Any help is much appreciated!
EDIT: I'm using Adobe Reader as PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf

The two images in question, the fischer logo in the upper right and the small sketch a bit down, are each drawn by filling a region on the page with a tiling pattern which in turn in its content stream draws the respective image.
Adobe Reader does not allow selecting the contents of patterns, and automatic image extractors often do not walk the Pattern resource tree either.
PDFBox 1.8.10
You can use PDFBox to fairly easily build a pattern image extractor, e.g. for PDFBox 1.8.10:
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
List<PDPage> pages = document.getDocumentCatalog().getAllPages();
if (pages == null)
return;
for (int i = 0; i < pages.size(); i++)
{
String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
extractPatternImages(pages.get(i), pageFormat);
}
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
PDResources resources = page.getResources();
if (resources == null)
return;
Map<String, PDPatternResources> patterns = resources.getPatterns();
for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
{
String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
extractPatternImages(patternEntry.getValue(), patternFormat);
}
}
public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
{
COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
if (resourcesDict == null)
return;
PDResources resources = new PDResources(resourcesDict);
Map<String, PDXObject> xObjects = resources.getXObjects();
if (xObjects == null)
return;
for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
{
PDXObject xObject = entry.getValue();
String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
if (xObject instanceof PDXObjectForm)
extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
else if (xObject instanceof PDXObjectImage)
extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
}
}
public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
{
PDResources resources = form.getResources();
if (resources == null)
return;
Map<String, PDXObject> xObjects = resources.getXObjects();
if (xObjects == null)
return;
for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
{
PDXObject xObject = entry.getValue();
String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
if (xObject instanceof PDXObjectForm)
extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
else if (xObject instanceof PDXObjectImage)
extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
}
Map<String, PDPatternResources> patterns = resources.getPatterns();
for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
{
String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
extractPatternImages(patternEntry.getValue(), patternFormat);
}
}
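public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
{
// Close the stream once the image is written so the file handle is not leaked (needs java.io.OutputStream).
try (OutputStream out = new FileOutputStream(String.format(imageFormat, "", image.getSuffix())))
{
image.write2OutputStream(out);
}
}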
(ExtractPatternImages.java)
I applied it to your sample PDF like this
public void testDrJorge() throws IOException
{
try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
{
PDDocument document = PDDocument.load(resource);
try
{
extractPatternImages(document, "testDrJorge%s.%s");
}
finally
{
// Release the document's resources even if extraction fails.
document.close();
}
}
}
(ExtractPatternImages.java)
and got two images:
testDrJorge-0-R15-R14.png
testDrJorge-0-R38-R37.png
The images have lost their red parts. This is most likely due to the fact that PDFBox 1.x.x versions do not properly support extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are in CMYK.
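The file names listed above follow from the way each extractPatternImages level feeds the format string back into String.format, appending its own key and re-inserting the %s placeholders for the next level. The following stand-alone sketch only illustrates that naming scheme; the class and helper names (FormatChainDemo, nest, finish) are made up for this illustration and are not part of PDFBox or of the code above.

```java
public final class FormatChainDemo {
    /** Appends "-part" to the name and keeps both placeholders alive for the next level. */
    public static String nest(String format, String part) {
        return String.format(format, "-" + part + "%s", "%s");
    }

    /** Closes the chain: no further name part, and the real file suffix. */
    public static String finish(String format, String suffix) {
        return String.format(format, "", suffix);
    }

    public static void main(String[] args) {
        String pageFormat = nest("testDrJorge%s.%s", "0");   // page index
        String patternFormat = nest(pageFormat, "R15");      // pattern resource key
        String xObjectFormat = nest(patternFormat, "R14");   // image XObject key
        System.out.println(finish(xObjectFormat, "png"));    // prints testDrJorge-0-R15-R14.png
    }
}
```

One consequence of this design is that a resource key containing a literal % character would be interpreted as a format specifier, so such keys would need escaping before being passed through the chain.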
PDFBox 2.0.0 release candidate
I updated the code to PDFBox 2.0.0 (currently available as release candidate only):
public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
{
PDPageTree pages = document.getDocumentCatalog().getPages();
if (pages == null)
return;
for (int i = 0; i < pages.getCount(); i++)
{
String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
extractPatternImages(pages.get(i), pageFormat);
}
}
public void extractPatternImages(PDPage page, String pageFormat) throws IOException
{
PDResources resources = page.getResources();
if (resources == null)
return;
Iterable<COSName> patternNames = resources.getPatternNames();
for (COSName patternName : patternNames)
{
String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
extractPatternImages(resources.getPattern(patternName), patternFormat);
}
}
public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
{
COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
if (resourcesDict == null)
return;
PDResources resources = new PDResources(resourcesDict);
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames == null)
return;
for (COSName xObjectName : xObjectNames)
{
PDXObject xObject = resources.getXObject(xObjectName);
String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
if (xObject instanceof PDFormXObject)
extractPatternImages((PDFormXObject)xObject, xObjectFormat);
else if (xObject instanceof PDImageXObject)
extractPatternImages((PDImageXObject)xObject, xObjectFormat);
}
}
public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
{
PDResources resources = form.getResources();
if (resources == null)
return;
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames == null)
return;
for (COSName xObjectName : xObjectNames)
{
PDXObject xObject = resources.getXObject(xObjectName);
String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
if (xObject instanceof PDFormXObject)
extractPatternImages((PDFormXObject)xObject, xObjectFormat);
else if (xObject instanceof PDImageXObject)
extractPatternImages((PDImageXObject)xObject, xObjectFormat);
}
Iterable<COSName> patternNames = resources.getPatternNames();
for (COSName patternName : patternNames)
{
String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
extractPatternImages(resources.getPattern(patternName), patternFormat);
}
}
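public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
{
String filename = String.format(imageFormat, "", image.getSuffix());
// Close the stream once the image is written so the file handle is not leaked (needs java.io.OutputStream).
try (OutputStream out = new FileOutputStream(filename))
{
ImageIOUtil.writeImage(image.getOpaqueImage(), "png", out);
}
}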
and get
testDrJorge-0-COSName{R15}-COSName{R14}.png
testDrJorge-0-COSName{R38}-COSName{R37}.png
Looks like an improvement... ;)

Related

PDFBOX extract image with Color space Indexed

I'm trying to extract all the images from a PDF using the code below. It works fine for all images except those with an Indexed color space.
try (final PDDocument document = PDDocument.load(new File("./pdfs/22.pdf"))){
PDPageTree list = document.getPages();
for (PDPage page : list) {
PDResources pdResources = page.getResources();
int i = 1;
for (COSName name : pdResources.getXObjectNames()) {
PDXObject o = pdResources.getXObject(name);
if (o instanceof PDImageXObject) {
PDImageXObject image = (PDImageXObject)o;
String filename = OUTPUT_DIR + "extracted-image-" + i + ".png";
ImageIO.write(image.getImage(), "png", new File(filename));
i++;
}
}
}
} catch (IOException e){
System.err.println("Exception while trying to create pdf document - " + e);
}
Am I missing something? How can I extract this type of image?

Insert image with apache-poi in a .word file, increase the image size

I am new to Apache POI. When I insert an image with addPicture, it ends up resized in the Word document. I am using the example from the Apache documentation, only slightly modified. When the generated .docx is opened, the picture is shown considerably larger than its original size, and I cannot find an explanation, since I am explicitly passing the size the picture should have.
Below is the code used:
public class SimpleImages {
public static void main(String[] args) throws IOException, InvalidFormatException {
try (XWPFDocument doc = new XWPFDocument()) {
XWPFParagraph p = doc.createParagraph();
XWPFRun r = p.createRun();
for (String imgFile : args) {
int format;
if (imgFile.endsWith(".emf")) {
format = XWPFDocument.PICTURE_TYPE_EMF;
} else if (imgFile.endsWith(".wmf")) {
format = XWPFDocument.PICTURE_TYPE_WMF;
} else if (imgFile.endsWith(".pict")) {
format = XWPFDocument.PICTURE_TYPE_PICT;
} else if (imgFile.endsWith(".jpeg") || imgFile.endsWith(".jpg")) {
format = XWPFDocument.PICTURE_TYPE_JPEG;
} else if (imgFile.endsWith(".png")) {
format = XWPFDocument.PICTURE_TYPE_PNG;
} else if (imgFile.endsWith(".dib")) {
format = XWPFDocument.PICTURE_TYPE_DIB;
} else if (imgFile.endsWith(".gif")) {
format = XWPFDocument.PICTURE_TYPE_GIF;
} else if (imgFile.endsWith(".tiff")) {
format = XWPFDocument.PICTURE_TYPE_TIFF;
} else if (imgFile.endsWith(".eps")) {
format = XWPFDocument.PICTURE_TYPE_EPS;
} else if (imgFile.endsWith(".bmp")) {
format = XWPFDocument.PICTURE_TYPE_BMP;
} else if (imgFile.endsWith(".wpg")) {
format = XWPFDocument.PICTURE_TYPE_WPG;
} else {
System.err.println("Unsupported picture: " + imgFile +
". Expected emf|wmf|pict|jpeg|png|dib|gif|tiff|eps|bmp|wpg");
continue;
}
r.setText(imgFile);
r.addBreak();
try (FileInputStream is = new FileInputStream(imgFile)) {
BufferedImage bimg = ImageIO.read(new File(imgFile));
int anchoImagen = bimg.getWidth();
int altoImagen = bimg.getHeight();
System.out.println("anchoImagen: " + anchoImagen);
System.out.println("altoImagen: " + altoImagen);
r.addPicture(is, format, imgFile, Units.toEMU(anchoImagen), Units.toEMU(altoImagen));
}
r.addBreak(BreakType.PAGE);
}
try (FileOutputStream out = new FileOutputStream("C:\\W_Ejm_Jasper\\example-poi-img\\src\\main\\java\\es\\eve\\example_poi_img\\images.docx")) {
doc.write(out);
System.out.println(" FIN " );
}
}
}
}
the image inside the word
the original image is (131 * 216 pixels):
the image is scaled in the word
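One plausible explanation, offered here as an assumption rather than a diagnosis confirmed in the thread: Units.toEMU in Apache POI converts points to EMU, while the width and height read via ImageIO are pixels. POI also provides Units.pixelToEMU for pixel values. The sketch below does not call POI; it reproduces the arithmetic with locally defined constants (the class and method names are illustrative, and 96 pixels per inch is an assumed screen resolution) to show that treating pixels as points makes the picture one third larger.

```java
public final class EmuMath {
    // 914400 EMU per inch divided by 72 points per inch.
    static final int EMU_PER_POINT = 12700;
    // 914400 EMU per inch divided by 96 pixels per inch (assumed resolution).
    static final int EMU_PER_PIXEL = 9525;

    /** What a points-based conversion yields when handed a raw number. */
    public static int pointsToEmu(double points) {
        return (int) Math.rint(points * EMU_PER_POINT);
    }

    /** What a pixel-based conversion yields for the same number. */
    public static int pixelsToEmu(int pixels) {
        return pixels * EMU_PER_PIXEL;
    }

    public static void main(String[] args) {
        int widthPx = 131; // width of the original image from the question
        System.out.println("as points: " + pointsToEmu(widthPx)); // 1663700 EMU
        System.out.println("as pixels: " + pixelsToEmu(widthPx)); // 1247775 EMU
    }
}
```

If this is indeed the cause, passing Units.pixelToEMU(anchoImagen) and Units.pixelToEMU(altoImagen) to addPicture would be the change to try.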

find and replace a text in different header for each section in docx using java

I am trying to find and replace text in the headers of the different sections of a document using Apache POI, but I only get null data, even though the docx has several different header and footer sections.
package com.concretepage;
import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
public class ReadDOCXHeaderFooter {
public static void main(String[] args) {
try {
FileInputStream fis = new FileInputStream("D:/docx/read-test.docx");
XWPFDocument xdoc=new XWPFDocument(OPCPackage.open(fis));
XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);
//read header
for(int i=0;i<90;i++)
{
XWPFHeader header = policy.getHeader(i);
List<XWPFRun> runs = header.getRuns();
if (runs != null) {
for (XWPFRun r : runs) {
String text = r.getText(0);
if (text != null && text.contains("$$key$$")) {
text = text.replace("$$key$$", "ABCD");//your content
r.setText(text, 0);
}
}
System.out.println(header.getText());
//read footer
XWPFFooter footer = policy.getFooter(i);
System.out.println(footer.getText());
}
} catch(Exception ex) {
ex.printStackTrace();
}
}
}
1.Screen shot of Docx header sections.
2.Screen shot of Docx header another section.
3.Screen shot of Docx header another section.
4.Screen Shot

In a *.docx document that contains multiple sections, each section is marked by a paragraph which has section properties set. To get the headers and footers out of those section properties, use the public XWPFHeaderFooterPolicy(XWPFDocument doc, org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr sectPr) constructor.
Only the section properties for the last section are set in document's body.
So the following code should get all headers and footers out of all sections in the document.
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
public class ReadWordAllHeaderFooters {
static void getAllHeaderFooterFromPolicy(XWPFHeaderFooterPolicy headerFooterPolicy) {
XWPFHeader header;
XWPFFooter footer;
header = headerFooterPolicy.getDefaultHeader();
if (header != null) System.out.println("DefaultHeader: " + header.getText());
header = headerFooterPolicy.getFirstPageHeader();
if (header != null) System.out.println("FirstPageHeader: " + header.getText());
header = headerFooterPolicy.getEvenPageHeader();
if (header != null) System.out.println("EvenPageHeader: " + header.getText());
header = headerFooterPolicy.getOddPageHeader();
if (header != null) System.out.println("OddPageHeader: " + header.getText());
footer = headerFooterPolicy.getDefaultFooter();
if (footer != null) System.out.println("DefaultFooter: " + footer.getText());
footer = headerFooterPolicy.getFirstPageFooter();
if (footer != null) System.out.println("FirstPageFooter: " + footer.getText());
footer = headerFooterPolicy.getEvenPageFooter();
if (footer != null) System.out.println("EvenPageFooter: " + footer.getText());
footer = headerFooterPolicy.getOddPageFooter();
if (footer != null) System.out.println("OddPageFooter: " + footer.getText());
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("MultipleHeaderFooters.docx"));
XWPFHeaderFooterPolicy headerFooterPolicy;
//are there paragraphs to start sections?
int section = 1;
for (XWPFParagraph paragraph : document.getParagraphs()) {
if (paragraph.getCTP().isSetPPr()) { //paragraph has paragraph properties set
if (paragraph.getCTP().getPPr().isSetSectPr()) { //paragraph property has section properties set
//headers and footers in paragraphs section properties:
headerFooterPolicy = new XWPFHeaderFooterPolicy(document, paragraph.getCTP().getPPr().getSectPr());
System.out.println("headers and footers in section properties of section " + section++ + ":");
getAllHeaderFooterFromPolicy(headerFooterPolicy);
}
}
}
//headers and footers in documents body = headers and footers of last section:
headerFooterPolicy = new XWPFHeaderFooterPolicy(document);
System.out.println("headers and footers in documents body = headers and footers of last section " + section + ":");
getAllHeaderFooterFromPolicy(headerFooterPolicy);
}
}

This function should do the job
static void replaceHeaderText(XWPFDocument document, String searchValue, String replacement)
{
List<XWPFHeader> headers = document.getHeaderList();
for(XWPFHeader h : headers)
{
for (XWPFParagraph p : h.getParagraphs()) {
List<XWPFRun> runs = p.getRuns();
if (runs != null) {
for (XWPFRun r : runs) {
String text = r.getText(0);
if (text != null && text.contains(searchValue)) {
text = text.replace(searchValue, replacement);
r.setText(text, 0);
}
}
}
}
for (XWPFTable tbl : h.getTables()) {
for (XWPFTableRow row : tbl.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph p : cell.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains(searchValue)) {
text = text.replace(searchValue, replacement);
r.setText(text,0);
}
}
}
}
}
}
}
}

Here is a code snippet that replaces all the occurrences of a key like ${firstName} with its String value in an XWPFDocument. All the parameters to be replaced will be stored in a Map<String, String>.
private void replaceParameters(XWPFDocument document, Map<String, String> replacements) {
for (Map.Entry<String, String> parameter : replacements.entrySet()) {
// replaces all occurrences in the headers
replaceHeadersParams(document.getHeaderList(), parameter);
// replaces all occurrences in the document's body
replaceParagraphsParams(document.getParagraphs(), parameter);
replaceTablesParams(document.getTables(), parameter);
// replaces all occurrences in the footers
replaceFootersParams(document.getFooterList(), parameter);
}
}
private void replaceHeadersParams(List<XWPFHeader> headers, Map.Entry<String, String> paramToReplace) {
for (XWPFHeader header : headers) {
replaceParagraphsParams(header.getParagraphs(), paramToReplace);
replaceTablesParams(header.getTables(), paramToReplace);
}
}
private void replaceFootersParams(List<XWPFFooter> footers, Map.Entry<String, String> parameter) {
for (XWPFFooter footer : footers) {
replaceParagraphsParams(footer.getParagraphs(), parameter);
replaceTablesParams(footer.getTables(), parameter);
}
}
private void replaceTablesParams(List<XWPFTable> tables, Map.Entry<String, String> paramToReplace) {
for (XWPFTable tbl : tables) {
for (XWPFTableRow row : tbl.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
replaceParagraphsParams(cell.getParagraphs(), paramToReplace);
}
}
}
}
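To use a different delimiter, such as the $$ markers from the question, adjust the startIndex and endIndex lookups accordingly. Keep in mind that this implementation matches the key ignoring case, and that within a matching run it replaces the first ${...} span it finds, so it assumes at most one placeholder per run.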
private void replaceParagraphsParams(List<XWPFParagraph> paragraphs, Map.Entry<String, String> paramToReplace) throws POIXMLException {
for (XWPFParagraph p : paragraphs) {
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.toLowerCase().contains(paramToReplace.getKey().toLowerCase())) {
int startIndex = text.indexOf("${");
int endIndex = text.indexOf("}");
String toBeReplaced = text.substring(startIndex, endIndex + 1);
text = text.replace(toBeReplaced, paramToReplace.getValue());
r.setText(text, 0);
}
}
}
}
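As a hedged alternative for the per-run replacement step, the following sketch replaces only the occurrences of the given key, still ignoring case, instead of the first ${...} span in the run. It uses only java.util.regex and does not depend on POI; the class and method names are illustrative, not part of the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class PlaceholderReplace {
    /** Replaces every case-insensitive occurrence of the literal key with the literal value. */
    public static String replaceIgnoreCase(String text, String key, String value) {
        // Pattern.quote keeps "${" and "}" from being read as regex syntax.
        Pattern literalKey = Pattern.compile(Pattern.quote(key), Pattern.CASE_INSENSITIVE);
        // Matcher.quoteReplacement keeps "$" in the value from being read as a group reference.
        return literalKey.matcher(text).replaceAll(Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        String run = "Dear ${FIRSTNAME} ${lastName}";
        System.out.println(replaceIgnoreCase(run, "${firstName}", "Ann")); // Dear Ann ${lastName}
    }
}
```

Inside the run loop this would be used as r.setText(replaceIgnoreCase(text, key, value), 0). Note that Word can split a single placeholder across several runs, in which case no run-by-run approach will find it.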

Jpg with black background (or completely white) when extracted via PDFBox from a PDF file

I need to extract images from a PDF and I am doing it via PDFBox (v 1.8.9).
It works well in 90% of cases, but some images are saved with a black background (or come out completely white) when extracted, even though they look perfectly fine in the original PDF. I imagine it has something to do with those JPG files. What should I check in the JPGs?
I am trying to see if I can upload an example PDF.
This is the relevant (quite standard) piece of code...
String pdfFile = promptForPDFFile(jf, "Select PDF file");
// Load pdf file
PDDocument document=PDDocument.load(pdfFile);
//Get the pdf pages
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
int pagetot = pages.size();
int pagenum = 1;
while( iter.hasNext() )
{
// Cycle on the pages for the images
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
PDFTextStripper textStripper=new PDFTextStripper();
textStripper.setStartPage(pagenum);
textStripper.setEndPage(pagenum);
Map images = resources.getImages();
// Get page text content and use it as file name
String pagecontent= textStripper.getText(document);
pagecontent = pagecontent.replaceAll("\n", "");
pagecontent = pagecontent.replaceAll("\r", "");
if( images != null )
{
Iterator imageIter = images.keySet().iterator();
while( imageIter.hasNext() )
{
String key = (String)imageIter.next();
PDXObjectImage image = (PDXObjectImage)images.get( key );
File tempdir = new File(tempPath+"/temp/");
tempdir.mkdirs();
String name = tempPath+"/temp/"+pagecontent;
//System.out.println( "Writing image:" + name );
//Write the image to file
image.write2file( name );
}
}
pagenum ++;
if (pagenum % 10 ==0)
{
System.out.print("\n--- "+ pagenum +"/"+pagetot);
}
}
Thanks in advance

I ran ExtractImages.java against the two files you sent me. The problem file has CMYK images, as can be seen with this screenshot from PDFDebugger:
The problem is that the 1.8 version doesn't handle CMYK images properly.
But there's a trick:
The images are encoded with the DCTDecode filter, which is JPEG. You have "real JPEGs" in the PDF.
I am able to extract your images properly by using the "-directJPEG" option of that tool, which bypasses the decoding mechanism of PDFBox, and just saves the JPEG files "as is".
Note that while this works nicely with your files, it doesn't work properly if the images have an external colorspace specified in the PDF.
Here's the full source code. See writeJpeg2file() for the raw extraction details.
public class ExtractImages
{
private int imageCounter = 1;
private static final String PASSWORD = "-password";
private static final String PREFIX = "-prefix";
private static final String ADDKEY = "-addkey";
private static final String NONSEQ = "-nonSeq";
private static final String DIRECTJPEG = "-directJPEG";
private static final List<String> DCT_FILTERS = new ArrayList<String>();
static
{
DCT_FILTERS.add( COSName.DCT_DECODE.getName() );
DCT_FILTERS.add( COSName.DCT_DECODE_ABBREVIATION.getName() );
}
private ExtractImages()
{
}
/**
* This is the entry point for the application.
*
* @param args The command-line arguments.
*
* @throws Exception If there is an error decrypting the document.
*/
public static void main( String[] args ) throws Exception
{
ExtractImages extractor = new ExtractImages();
extractor.extractImages( args );
}
private void extractImages( String[] args ) throws Exception
{
if( args.length < 1 || args.length > 4 )
{
usage();
}
else
{
String pdfFile = null;
String password = "";
String prefix = null;
boolean addKey = false;
boolean useNonSeqParser = false;
boolean directJPEG = false;
for( int i=0; i<args.length; i++ )
{
if( args[i].equals( PASSWORD ) )
{
i++;
if( i >= args.length )
{
usage();
}
password = args[i];
}
else if( args[i].equals( PREFIX ) )
{
i++;
if( i >= args.length )
{
usage();
}
prefix = args[i];
}
else if( args[i].equals( ADDKEY ) )
{
addKey = true;
}
else if( args[i].equals( NONSEQ ) )
{
useNonSeqParser = true;
}
else if( args[i].equals( DIRECTJPEG ) )
{
directJPEG = true;
}
else
{
if( pdfFile == null )
{
pdfFile = args[i];
}
}
}
if(pdfFile == null)
{
usage();
}
else
{
if( prefix == null && pdfFile.length() >4 )
{
prefix = pdfFile.substring( 0, pdfFile.length() -4 );
}
PDDocument document = null;
try
{
if (useNonSeqParser)
{
document = PDDocument.loadNonSeq(new File(pdfFile), null, password);
}
else
{
document = PDDocument.load( pdfFile );
if( document.isEncrypted() )
{
StandardDecryptionMaterial spm = new StandardDecryptionMaterial(password);
document.openProtection(spm);
}
}
AccessPermission ap = document.getCurrentAccessPermission();
if( ! ap.canExtractContent() )
{
throw new IOException(
"Error: You do not have permission to extract images." );
}
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while( iter.hasNext() )
{
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
// extract all XObjectImages which are part of the page resources
processResources(resources, prefix, addKey, directJPEG);
}
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
}
public void writeJpeg2file(PDJpeg image, String filename) throws IOException
{
FileOutputStream out = null;
try
{
out = new FileOutputStream(filename + ".jpg");
InputStream data = image.getPDStream().getPartiallyFilteredStream(DCT_FILTERS);
byte[] buf = new byte[1024];
int amountRead;
while ((amountRead = data.read(buf)) != -1)
{
out.write(buf, 0, amountRead);
}
IOUtils.closeQuietly(data);
out.flush();
}
finally
{
if (out != null)
{
out.close();
}
}
}
private void processResources(PDResources resources, String prefix,
boolean addKey, boolean directJPEG) throws IOException
{
if (resources == null)
{
return;
}
Map<String, PDXObject> xobjects = resources.getXObjects();
if( xobjects != null )
{
Iterator<String> xobjectIter = xobjects.keySet().iterator();
while( xobjectIter.hasNext() )
{
String key = xobjectIter.next();
PDXObject xobject = xobjects.get( key );
// write the images
if (xobject instanceof PDXObjectImage)
{
PDXObjectImage image = (PDXObjectImage)xobject;
String name = null;
if (addKey)
{
name = getUniqueFileName( prefix + "_" + key, image.getSuffix() );
}
else
{
name = getUniqueFileName( prefix, image.getSuffix() );
}
System.out.println( "Writing image:" + name );
if (directJPEG && "jpg".equals(image.getSuffix()))
{
writeJpeg2file((PDJpeg) image, name);
}
else
{
image.write2file(name);
}
image.clear(); // PDFBOX-2101 get rid of cache ASAP
}
// maybe there are more images embedded in a form object
else if (xobject instanceof PDXObjectForm)
{
PDXObjectForm xObjectForm = (PDXObjectForm)xobject;
PDResources formResources = xObjectForm.getResources();
processResources(formResources, prefix, addKey, directJPEG);
}
}
}
resources.clear();
}
private String getUniqueFileName( String prefix, String suffix )
{
String uniqueName = null;
File f = null;
while( f == null || f.exists() )
{
uniqueName = prefix + "-" + imageCounter;
f = new File( uniqueName + "." + suffix );
imageCounter++;
}
return uniqueName;
}
/**
* This will print the usage requirements and exit.
*/
private static void usage()
{
System.err.println( "Usage: java org.apache.pdfbox.ExtractImages [OPTIONS] <PDF file>\n" +
" -password <password> Password to decrypt document\n" +
" -prefix <image-prefix> Image prefix(default to pdf name)\n" +
" -addkey add the internal image key to the file name\n" +
" -nonSeq Enables the new non-sequential parser\n" +
" -directJPEG Forces the direct extraction of JPEG images regardless of colorspace\n" +
" <PDF file> The PDF document to use\n"
);
System.exit( 1 );
}
}
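As for what to check in the JPEGs themselves: one indicator is the number of colour components declared in the JPEG's start-of-frame (SOF) segment. Three components is the usual YCbCr/RGB case, while four typically means CMYK or YCCK, which is the case described above. The following is a minimal sketch using only the JDK; the class and method names are illustrative, and it reads nothing beyond the first SOF header.

```java
public final class JpegComponents {
    /** Returns the component count from the first SOF segment, or -1 if none is found. */
    public static int componentCount(byte[] jpeg) {
        // A JPEG stream starts with the SOI marker FF D8.
        if (jpeg.length < 4 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return -1;
        }
        int pos = 2;
        while (pos + 3 < jpeg.length) {
            if ((jpeg[pos] & 0xFF) != 0xFF) {
                return -1; // not positioned on a marker
            }
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xFF) { pos++; continue; }           // fill byte before a marker
            if (marker == 0xD9 || marker == 0xDA) return -1;   // EOI or SOS reached before any SOF
            if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) {
                pos += 2;                                       // markers without a length field
                continue;
            }
            int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            // SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
            boolean sof = marker >= 0xC0 && marker <= 0xCF
                    && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (sof) {
                // marker(2) + length(2) + precision(1) + height(2) + width(2), then the count.
                int index = pos + 9;
                return index < jpeg.length ? jpeg[index] & 0xFF : -1;
            }
            pos += 2 + length; // skip this segment
        }
        return -1;
    }
}
```

A file read with java.nio.file.Files.readAllBytes can be passed in directly; a result of 4 suggests the image needs CMYK-aware handling or the direct JPEG extraction route.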

Extract measures on each line from sheet music [closed]

I would like to know a way to extract individual lines of measures. I am not sure whether an algorithm for this already exists, so I have thought of scanning the sheet music from left to right and cutting at the white space above and below each line of measures.
I am not looking for a way to convert the sheet music into MusicXML or to extract other useful information. Essentially, what I am dealing with is a regular document in which I need to separate the paragraphs. I am not interested in the information conveyed by a paragraph, only in chunking the paragraphs out of the document as separate regions. In this case a paragraph is one line of measures. I don't need individual measures, but all the measures on each line of sheet music.
This is one of the outputs I would like from the full sheet music, but without the title, composer, etc.

Supposing you have the sheet music in a PDF file, I would use Apache PDFBox to get the images from the input PDF, locate the coordinates of the whole line of bars you need, and then use those coordinates to crop the selected image and manipulate it until you get the desired result.
PDDocument document = null;
document = PDDocument.load(inFile);
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while (iter.hasNext()) {
PDPage page = (PDPage) iter.next();
PDResources resources = page.getResources();
Map pageImages = resources.getImages();
if (pageImages != null) {
Iterator imageIter = pageImages.keySet().iterator();
while (imageIter.hasNext()) {
String key = (String) imageIter.next();
PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
image.write2OutputStream(/* some output stream */);
}
}
}
Here is a sample code available in Apache PDFBox.
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
/**
* This will read a PDF and extract images. <br/><br/>
*
* usage: java org.apache.pdfbox.ExtractImages <pdffile> <password> [imageprefix]
*
* @author Ben Litchfield
* @version $Revision: 1.7 $
*/
public class ExtractImages
{
private int imageCounter = 1;
private static final String PASSWORD = "-password";
private static final String PREFIX = "-prefix";
private static final String ADDKEY = "-addkey";
private static final String NONSEQ = "-nonSeq";
private ExtractImages()
{
}
/**
* This is the entry point for the application.
*
* @param args The command-line arguments.
*
* @throws Exception If there is an error decrypting the document.
*/
public static void main( String[] args ) throws Exception
{
ExtractImages extractor = new ExtractImages();
extractor.extractImages( args );
}
private void extractImages( String[] args ) throws Exception
{
if( args.length < 1 || args.length > 4 )
{
usage();
}
else
{
String pdfFile = null;
String password = "";
String prefix = null;
boolean addKey = false;
boolean useNonSeqParser = false;
for( int i=0; i<args.length; i++ )
{
if( args[i].equals( PASSWORD ) )
{
i++;
if( i >= args.length )
{
usage();
}
password = args[i];
}
else if( args[i].equals( PREFIX ) )
{
i++;
if( i >= args.length )
{
usage();
}
prefix = args[i];
}
else if( args[i].equals( ADDKEY ) )
{
addKey = true;
}
else if( args[i].equals( NONSEQ ) )
{
useNonSeqParser = true;
}
else
{
if( pdfFile == null )
{
pdfFile = args[i];
}
}
}
if(pdfFile == null)
{
usage();
}
else
{
if( prefix == null && pdfFile.length() >4 )
{
prefix = pdfFile.substring( 0, pdfFile.length() -4 );
}
PDDocument document = null;
try
{
if (useNonSeqParser)
{
document = PDDocument.loadNonSeq(new File(pdfFile), null, password);
}
else
{
document = PDDocument.load( pdfFile );
if( document.isEncrypted() )
{
StandardDecryptionMaterial spm = new StandardDecryptionMaterial(password);
document.openProtection(spm);
}
}
AccessPermission ap = document.getCurrentAccessPermission();
if( ! ap.canExtractContent() )
{
throw new IOException(
"Error: You do not have permission to extract images." );
}
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while( iter.hasNext() )
{
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
// extract all XObjectImages which are part of the page resources
processResources(resources, prefix, addKey);
}
}
finally
{
if( document != null )
{
document.close();
}
}
}
}
}
private void processResources(PDResources resources, String prefix, boolean addKey) throws IOException
{
if (resources == null)
{
return;
}
Map<String, PDXObject> xobjects = resources.getXObjects();
if( xobjects != null )
{
Iterator<String> xobjectIter = xobjects.keySet().iterator();
while( xobjectIter.hasNext() )
{
String key = xobjectIter.next();
PDXObject xobject = xobjects.get( key );
// write the images
if (xobject instanceof PDXObjectImage)
{
PDXObjectImage image = (PDXObjectImage)xobject;
String name = null;
if (addKey)
{
name = getUniqueFileName( prefix + "_" + key, image.getSuffix() );
}
else
{
name = getUniqueFileName( prefix, image.getSuffix() );
}
System.out.println( "Writing image:" + name );
image.write2file( name );
}
// maybe there are more images embedded in a form object
else if (xobject instanceof PDXObjectForm)
{
PDXObjectForm xObjectForm = (PDXObjectForm)xobject;
PDResources formResources = xObjectForm.getResources();
processResources(formResources, prefix, addKey);
}
}
}
}
private String getUniqueFileName( String prefix, String suffix )
{
String uniqueName = null;
File f = null;
while( f == null || f.exists() )
{
uniqueName = prefix + "-" + imageCounter;
f = new File( uniqueName + "." + suffix );
imageCounter++;
}
return uniqueName;
}
/**
* This will print the usage requirements and exit.
*/
private static void usage()
{
System.err.println( "Usage: java org.apache.pdfbox.ExtractImages [OPTIONS] <PDF file>\n" +
" -password <password> Password to decrypt document\n" +
" -prefix <image-prefix> Image prefix(default to pdf name)\n" +
" -addkey add the internal image key to the file name\n" +
" -nonSeq Enables the new non-sequential parser\n" +
" <PDF file> The PDF document to use\n"
);
System.exit( 1 );
}
}
Now to crop image you can use:
/**
* Crop the main image according to this rectangle, and scale it to the
* correct size for a thumbnail.
*/
public InputStream cropAndScale(InputStream mainImageStream,
CropRectangle crop) {
try {
RenderedOp mainImage = loadImage(mainImageStream);
RenderedOp opaqueImage = makeImageOpaque(mainImage);
RenderedOp croppedImage = cropImage(opaqueImage, crop);
RenderedOp scaledImage = scaleImage(croppedImage);
byte[] jpegBytes = encodeAsJpeg(scaledImage);
return new ByteArrayInputStream(jpegBytes);
} catch (Exception e) {
throw new IllegalStateException("Failed to scale the image", e);
}
}
which is available on this page and in the project.
There is another option for parsing images inside a PDF file: take a look at this code, especially this.
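The "locate the coordinates" step can be approximated with the row-scanning idea from the question: walk down the page image and treat runs of blank rows as separators between lines of measures. The following is a rough sketch using only java.awt.image.BufferedImage; the class name, the darkness threshold and the minimum gap are assumptions to tune per scan, and it presumes a reasonably clean, unskewed page image.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

public final class StaffBands {
    /** Returns {top, bottomExclusive} row ranges separated by at least minGap blank rows. */
    public static List<int[]> inkBands(BufferedImage img, int darkThreshold, int minGap) {
        List<int[]> bands = new ArrayList<>();
        int start = -1;   // first row of the band being collected, -1 if none
        int blankRun = 0; // consecutive blank rows seen inside the current band
        for (int y = 0; y < img.getHeight(); y++) {
            if (rowHasInk(img, y, darkThreshold)) {
                if (start < 0) start = y;
                blankRun = 0;
            } else if (start >= 0 && ++blankRun >= minGap) {
                bands.add(new int[] {start, y - blankRun + 1});
                start = -1;
                blankRun = 0;
            }
        }
        if (start >= 0) {
            bands.add(new int[] {start, img.getHeight() - blankRun});
        }
        return bands;
    }

    /** A row has ink if any pixel's average of R, G and B is below the threshold. */
    private static boolean rowHasInk(BufferedImage img, int y, int darkThreshold) {
        for (int x = 0; x < img.getWidth(); x++) {
            int rgb = img.getRGB(x, y);
            int brightness = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
            if (brightness < darkThreshold) return true;
        }
        return false;
    }
}
```

Each band can then be cut out with img.getSubimage(0, band[0], img.getWidth(), band[1] - band[0]). The title and composer lines would show up as bands too, so they would need to be filtered out, for example by band height.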
