Is image description the same as metadata extraction? - java

I am trying to understand my project description: One of the techniques used to search for images is the analysis of its content through the tags by first extracting “image descriptors” such as MPEG-7 descriptors, shown in the Figure below. These tags are snippets of text that describe an image’s content; the tags don’t appear on the image itself, but only in the extracted descriptors. The tags will help and tell the search engines what an image content is about.
But I have never worked with images.
I just want to know whether I need to extract the metadata from the image to get the tags,
and do the tags contain the image content?

Related

Split PDF Into Separate Files By Child Bookmarks

I am trying to split a PDF file (a book) into multiple files by child bookmarks, in code.
Use case: the table of contents of a book is available to the user. The user can select up to n sections (not necessarily sequential) to preview. The application needs to extract these sections and merge them into a single preview PDF.
While looking for a solution on the internet I found a few tools: Aspose, Spire (E-IceBlue), etc.
All of them can split a PDF by pages (top-level bookmarks), but I need to split the PDF by child bookmarks. This means that the area to extract can start and/or end in the middle of a page.
Ideally I would be able to do this in Java code, but if someone knows a solution in any other programming language or a CLI program, that would also be great.
It depends whether you insist that the non-chosen content on a page be redacted or not. For example, if section 6.3.2 takes up the middle half of a page, do you care if the end of 6.3.1 and the beginning of 6.3.3 are shown in the output on the same page?
If you don't care, cpdf can do this easily. Just output the bookmark data as JSON:
cpdf -list-bookmarks-json -utf8 in.pdf > marks.json
Then you can parse this JSON to show the list of bookmarks, and choose which pages to extract based on child bookmark page numbers.
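The page-selection step could look roughly like the sketch below. It assumes the JSON has already been parsed (with whatever JSON library you prefer) into a list of entries, each carrying a nesting level, a title and a 1-based target page; the Bookmark and BookmarkRanges classes are hypothetical names for illustration and not part of cpdf. A section is taken to end where the next bookmark of the same or a shallower level begins, and that page is included because the section may end in the middle of it.

```java
import java.util.ArrayList;
import java.util.List;

/** One bookmark entry. The fields are an assumption about what marks.json provides. */
final class Bookmark {
    final int level;    // nesting depth, 0 = top-level bookmark
    final String title;
    final int page;     // 1-based page the bookmark points to

    Bookmark(int level, String title, int page) {
        this.level = level;
        this.title = title;
        this.page = page;
    }
}

final class BookmarkRanges {
    /**
     * Returns the inclusive page range {first, last} of the section at index i.
     * The section ends at the next bookmark of the same or a shallower level;
     * if there is none, it runs to the last page of the document.
     */
    static int[] pagesFor(List<Bookmark> marks, int i, int lastPageOfDocument) {
        Bookmark start = marks.get(i);
        for (int j = i + 1; j < marks.size(); j++) {
            Bookmark next = marks.get(j);
            if (next.level <= start.level) {
                return new int[] { start.page, Math.max(start.page, next.page) };
            }
        }
        return new int[] { start.page, lastPageOfDocument };
    }

    /** Joins the ranges of the selected sections into a comma-separated page specification. */
    static String rangeSpec(List<Bookmark> marks, List<Integer> selected, int lastPageOfDocument) {
        List<String> parts = new ArrayList<>();
        for (int i : selected) {
            int[] r = pagesFor(marks, i, lastPageOfDocument);
            parts.add(r[0] + "-" + r[1]);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // Made-up bookmark data for illustration only.
        List<Bookmark> marks = new ArrayList<>();
        marks.add(new Bookmark(0, "6 Chapter", 40));
        marks.add(new Bookmark(1, "6.3 Section", 44));
        marks.add(new Bookmark(2, "6.3.1", 44));
        marks.add(new Bookmark(2, "6.3.2", 45));
        marks.add(new Bookmark(2, "6.3.3", 47));
        marks.add(new Bookmark(0, "7 Chapter", 50));

        List<Integer> selected = new ArrayList<>();
        selected.add(2); // 6.3.1
        selected.add(4); // 6.3.3
        System.out.println(rangeSpec(marks, selected, 60)); // prints 44-45,47-50
    }
}
```

The resulting string is meant to be passed as a page range when extracting pages; check the exact range syntax against the cpdf manual before relying on it. Adjacent selections will share their boundary page, so you may want to merge overlapping ranges before building the preview.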
As for redaction, you could use -add-rectangle or -hard-box to clean up the output based on the coordinates from the JSON bookmarks file, but that's not real redaction -- it just removes the content from view.

How can I overlay / insert custom text into javax.print documents (pdf, doc, docx...)?

I print documents (pdf, doc, docx...) via javax.print and it works fine, but I need to add custom overlay text at an arbitrary position in the document; the text is needed to insert a printed reference from my code into the document.
How do I do this?
I do not believe adding text (watermarking) is an aspect of javax.print.
You can pre-process each page of each document via Graphics2D and add in the text/graphic individually based on that document's type, then send that document to the printer.
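For the simplest case, a page that is already available as an image, the Graphics2D step might look like the sketch below. It uses only JDK classes; the TextOverlay class, the font, the colour and the "REF-0001" text are assumptions made for illustration. PDF, DOC and DOCX pages would first have to be rendered or edited with a library for that format, which is not shown here.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class TextOverlay {
    /**
     * Returns a copy of the page image with the text drawn on top of it.
     * x is the left edge of the text, y is its baseline, both in pixels.
     */
    static BufferedImage overlay(BufferedImage page, String text, int x, int y) {
        BufferedImage out = new BufferedImage(
                page.getWidth(), page.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.drawImage(page, 0, 0, null); // copy the original page first
            g.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING,
                    RenderingHints.VALUE_TEXT_ANTIALIAS_ON);
            g.setFont(new Font(Font.SANS_SERIF, Font.BOLD, 18)); // assumed styling
            g.setColor(Color.RED);
            g.drawString(text, x, y);      // then draw the overlay text
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        // A blank white "page" stands in for a real scanned or rendered page.
        BufferedImage page = new BufferedImage(400, 200, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 400, 200);
        g.dispose();

        BufferedImage stamped = overlay(page, "REF-0001", 20, 40);
        System.out.println(stamped.getWidth() + "x" + stamped.getHeight()); // prints 400x200
    }
}
```

The stamped image can then be handed to the printing code in place of the original page.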
There are several threads here and elsewhere on "watermarking", which is essentially what you're looking to do. The process is different based on the type of document you're working with. Single-page images, multi-page images, PDFs, DOCs, and DOCXs all require different ways to be processed.
Here are some threads to get you started:
GIF/JPEG/TIFF: How can I watermark an image in Java?
GIF/JPEG/TIFF: Adding a watermark over an image programmatically using Java
DOCX: Creating watermark programmatically
PDF: Watermarking PDFs via iText

how to extract PDF watermark content using iText apis

I was going through the iText API docs and I was able to create a PDF with a watermark image or text, but I did not find a method to get/extract watermark content from a PDF.
So I have a PDF document containing watermarked text/image and I want to extract that text or image and validate it, which I am not able to do.
How to extract watermark content using iText apis? Or is there any other way to validate watermark content?
By validate I mean if I have an existing pdf/image with some watermarked text [as done in 2nd link in above ref], I want to check whether it has expected text/image.
References:
http://itextpdf.com/themes/keyword.php?id=226
http://www.java-connect.com/itext/add-watermark-in-PDF-document-using-java-iText-library.html
How to extract watermark content using iText apis? Or is there any other way to validate watermark content?
Extracting watermark content?
There is nothing special about watermarks in PDFs in contrast to regular page content. They merely
appear pretty early in the content stream and other content later in the stream, therefore, is drawn above it; or they
appear pretty late in the content stream but have some kind of transparency applied.
Actually there is another type of watermark which is special, the so-called Watermark Annotations. As these annotations can easily be lost when documents are merged or otherwise manipulated, though, they are hardly ever used.
Furthermore different PDF generating software suites offering a way to add watermarks do so in their respective individual way. Thus, you cannot even recognize watermarks by some special operations done in some specific unique pattern.
Already the iText examples you referred to apply different kinds of watermarks:
MovieCountries2 simply draws some gray large Text using an angled base line.
StampStationery copies a complete page from some PDF (which itself may visually have foreground and background material) into a separate object inside the target PDF and adds a reference to this object at the beginning of every page of the target.
InsertPages similarly references a page from some PDF on every newly generated target document page.
Thus, blind watermark extraction is virtually impossible.
Validating watermark content!
You might try some validation, though, if you know what you are searching for. You just do not search some fixed watermark stream (no such thing exists in PDF) but the whole page content instead.
iText offers the classes of the parser package which allow extraction of text and/or bitmap images from content streams. Look at the samples referenced from the keywords PARSING PDF > EXTRACTING IMAGES and PARSING PDF > EXTRACTING TEXT.
You merely have to check whether the image or text which you expect can be found by these classes positioned and styled as you expect.
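Once the text of each page has been extracted with the parser classes, the comparison itself could be sketched as below. This deliberately uses no iText calls: it assumes the extraction has already produced one string per page, and WatermarkCheck is a hypothetical helper name. It checks only for the presence of the text, not for position or styling, which would need the extra information the parser classes can report.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class WatermarkCheck {
    /**
     * Removes all whitespace and lower-cases the text. Extracted watermark text
     * is often broken up by spaces or line breaks, so "C O N F I D E N T I A L"
     * should still match "Confidential". This is an assumed, fairly lenient rule
     * and can produce false positives across word boundaries.
     */
    static String normalize(String s) {
        return s.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    /** True if there is at least one page and every page contains the expected text. */
    static boolean allPagesContain(List<String> pageTexts, String expected) {
        String needle = normalize(expected);
        if (needle.isEmpty() || pageTexts.isEmpty()) {
            return false;
        }
        for (String page : pageTexts) {
            if (!normalize(page).contains(needle)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up page texts standing in for the output of a text extraction step.
        List<String> pages = Arrays.asList(
                "Invoice 42\nC O N F I D E N T I A L",
                "Page 2 of the invoice. confidential");
        System.out.println(allPagesContain(pages, "Confidential")); // prints true
        System.out.println(allPagesContain(pages, "Draft"));        // prints false
    }
}
```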

Replace text with an image docx4j

I have a Word template. It contains the word "photo", which has to be replaced with an image. This has to be done with docx4j.
How do I do this?
If you are specifically looking to replace text with an image (which is not possible using docx4j as answered above), you can replace a bookmark with an image as an alternative.
Just open your Word template, position the cursor at the desired location, choose Insert -> Bookmark and name your bookmark.
I followed the instructions here to replace this bookmark with an image.
Disclosure: I manage the docx4j project
The VariableReplace code doesn't handle images.
The best way to do this would be to use data bound content controls, specifically a picture content control pointing via XPath at a base-64 encoded image in an XML document (see Getting Started for details).
However, if you want to replace a word with an image, you can do so, but you'll have to write a bit of glue code. It is pretty straightforward.
First, find the word. You can do this using XPath or TraversalUtil (again, see Getting Started for details).
Hopefully it is in a run (w:r/w:t) by itself. If not, you'll need to split the run up so you don't replace adjacent text.
Then, add the image. See the sample ImageAdd.
I suggest you have a look at the XML created when you add an image in Word (ie save and unzip your docx, then look at document.xml). Take care that the XML representing the image is at the correct level (eg child of w:p).
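The run-splitting step mentioned above can be sketched on plain strings, independent of docx4j. RunSplit is a hypothetical helper: it only works out which text belongs before and after the word, so that each part can go into its own run and the image can take the place of the middle part. Creating the runs and adding the image is left to docx4j code such as the ImageAdd sample.

```java
final class RunSplit {
    /**
     * Splits the text of a run around the first occurrence of the placeholder.
     * Returns {before, placeholder, after}, or null if the placeholder is empty
     * or does not occur in the text.
     */
    static String[] splitAround(String runText, String placeholder) {
        if (placeholder.isEmpty()) {
            return null;
        }
        int at = runText.indexOf(placeholder);
        if (at < 0) {
            return null;
        }
        return new String[] {
            runText.substring(0, at),
            placeholder,
            runText.substring(at + placeholder.length())
        };
    }

    public static void main(String[] args) {
        String[] parts = splitAround("Employee photo goes here", "photo");
        // parts[0] and parts[2] keep their own runs; parts[1] is replaced by the image.
        System.out.println("[" + parts[0] + "][" + parts[1] + "][" + parts[2] + "]");
        // prints [Employee ][photo][ goes here]
    }
}
```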

How to add a hidden image to a PDF document?

I have a program. It outputs to PDF, but that is close to impossible to read back in. So I need an additional file attached to my PDF in order to make it editable in my program. Attaching a file to the PDF is a good idea, but the attachment is visible to the user, which I don't want.
An alternative is to hide my readable file format inside an image which would be added to the PDF somewhere at the top of the first page, before everything else... Or even in the metadata, if that's possible...
That way I can extract the image from the PDF using a PDF library (iText) and read from it.
My question is how to add an image to a PDF so that it is as well hidden as possible (visually and in terms of accessibility). It also has to be in a place that would be the same for any created document (somewhere at the top, or at the very bottom of the document, or in a part of the document which isn't displayed at all... I'm really guessing here, I'm not really familiar with the PDF file format)...
Any ideas?
P.S. It's not really important which image it is; it could be e.g. a completely transparent image, 1x1 pixels.
I'm not sure what you mean by Image, but you can "extend" the PDF reference.
A PDF consists of objects: PDF numbers, PDF names, PDF strings, PDF arrays, PDF dictionaries, PDF streams. What you probably want, is to add an entry to a dictionary (pick one: the root dictionary, the info dictionary, the root of the page tree,...) that isn't defined in the PDF reference, so that it isn't rendered in a PDF viewer.
The key of such an entry must be a PDF name. To avoid clashes with existing names (names that are part of a current PDF spec, or will be part of a future spec), it is advised to register a four-letter key with ISO. For instance, Adobe registered adbe, and iText registered ITXT and uses that name with an underscore. For instance, ITXT_OriginalData would be a good name if we needed the functionality you describe.
The value of such an entry will be a PDF stream. In iText, you need the PdfStream class for this.
