I have tagged an image with a few keywords in Photoshop, and I'm trying to read these keywords using Java. So far, with the help of other questions asked here, I'm only able to read the image's format. I have gone through the ImageIO class documentation and could not find any function that extracts keywords from an image. Is there another way to extract keywords and other information, such as the date and time the image was created?
Basically, assume I have a link to a Google Slides presentation (as provided by a user), for example:
https://docs.google.com/presentation/d/1c3TbLKMVwOqgP70l0ph2jSvIHAaZnZoSMnvW8cxs8Ik/edit?usp=sharing
Ultimately, I want to answer the following:
Is there a way to get the slides into an array of some sort of image file?
Then, is there a way to get the speaker notes into an array of Strings?
Since we have the link (assuming it is set as "Public-anyone can view"), we theoretically have access to all this information, as we can see it on the Google Slides page. However, how do we extract it programmatically? Is there a specific place on Google's servers where this information is stored and from which I can just retrieve it?
You can download the whole presentation in various formats:
https://docs.google.com/presentation/d/<PRESENTATION_ID>/export/<FORMAT>
Possible formats are: pdf, pptx, odp, txt (strings only)
e.g.
https://docs.google.com/presentation/d/1c3TbLKMVwOqgP70l0ph2jSvIHAaZnZoSMnvW8cxs8Ik/export/pptx
To download slides as images, you have to specify a particular slide (page):
https://docs.google.com/presentation/d/<PRESENTATION_ID>/export/<FORMAT>?pageid=<PAGE_ID>
Possible formats are: jpeg, svg, png
e.g.
https://docs.google.com/presentation/d/1c3TbLKMVwOqgP70l0ph2jSvIHAaZnZoSMnvW8cxs8Ik/export/jpeg?pageid=g1f5653c4cc_0_437
You can find the page id at the end of a URL (http://docs.google...slide=id.<PAGE_ID>).
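The URL patterns above can be assembled in Java along these lines. This is only a sketch: the class and method names are made up for illustration, and the regular expression assumes the share link has the `/presentation/d/<PRESENTATION_ID>/` shape shown in the question. It builds the URLs but does not download anything.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: builds the export URLs described above from a share link.
class SlidesExportUrls {
    // Assumes the ID is the path segment right after "/presentation/d/".
    private static final Pattern ID = Pattern.compile("/presentation/d/([^/]+)");
    private static final String BASE = "https://docs.google.com/presentation/d/";

    static String presentationId(String shareLink) {
        Matcher m = ID.matcher(shareLink);
        if (!m.find()) {
            throw new IllegalArgumentException("No presentation ID in: " + shareLink);
        }
        return m.group(1);
    }

    // Whole-presentation export: format is one of pdf, pptx, odp, txt.
    static String exportUrl(String presentationId, String format) {
        return BASE + presentationId + "/export/" + format;
    }

    // Single-slide export: format is one of jpeg, svg, png.
    static String slideImageUrl(String presentationId, String format, String pageId) {
        return exportUrl(presentationId, format) + "?pageid=" + pageId;
    }

    public static void main(String[] args) {
        String id = presentationId(
            "https://docs.google.com/presentation/d/1c3TbLKMVwOqgP70l0ph2jSvIHAaZnZoSMnvW8cxs8Ik/edit?usp=sharing");
        System.out.println(exportUrl(id, "pptx"));
        System.out.println(slideImageUrl(id, "jpeg", "g1f5653c4cc_0_437"));
    }
}
```

To get an array of images you would still need the list of page ids, which this sketch does not try to discover.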
I am trying to read a table from a PDF file, row by row and cell by cell, so that I get the records in a systematic order.
I have searched Google a lot, but I could not find a good way to do this.
So I want to ask two questions about it:
Q 1- Can we read data from a PDF file?
Q 2- Can we read data from any cell of a PDF table?
I am using iText for Java to do this.
Please give me an example of how to do this.
Thanks
The answer to both your questions is: It depends.
Suppose that you have a ZUGFeRD invoice. In that case, the invoice is a PDF/A-3 document that has an embedded file in the CII XML format. It is very easy to extract this XML and read it to get all the necessary information about the invoice. Embedding or attaching files that contain the source of the data used to create the PDF, or the data in a form other than PDF, is a technique that makes possible exactly what you need.
You can extract text from a PDF. This is explained in questions such as PDF text extraction using iText but you only get the raw text without formatting. In many cases, a PDF consists of a bunch of text and lines put on a canvas at absolute positions. A word on the page does not know if it's part of a sentence, part of a cell, etc. Unless:
If the PDF is a Tagged PDF, then the PDF also contains information about the structure of the content. For instance: the content will contain tags that indicate structures such as tables, table headers, table rows, table cells. If you are talking about Tagged PDFs, then it's possible to extract the text in a structured way.
In the past, we did a project where we received credit card statements from VISA, MasterCard, AmEx,... We had to extract all the expenses and store them as records in a database. We were able to achieve this because the format of the statements was predictable: all VISA statements are created alike, hence we were able to find the pattern that allowed us to extract the data.
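To illustrate the general idea of pattern-based extraction (this is not the code from that project), here is a minimal sketch that works on text already extracted from a PDF. The line layout `date description amount`, the class names, and the regular expression are all assumptions invented for the example; a real statement would need its own pattern.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical record type for one expense line.
final class Expense {
    final String date;
    final String description;
    final BigDecimal amount;

    Expense(String date, String description, BigDecimal amount) {
        this.date = date;
        this.description = description;
        this.amount = amount;
    }
}

class StatementParser {
    // Assumed layout: "2015-03-14 ACME STORE 42.50", one expense per line.
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2})[ \\t]+(.+?)[ \\t]+(-?\\d+\\.\\d{2})$",
        Pattern.MULTILINE);

    // Returns every line that matches the assumed layout; other lines are skipped.
    static List<Expense> parse(String extractedText) {
        List<Expense> expenses = new ArrayList<>();
        Matcher m = LINE.matcher(extractedText);
        while (m.find()) {
            expenses.add(new Expense(m.group(1), m.group(2), new BigDecimal(m.group(3))));
        }
        return expenses;
    }
}
```

This only works as long as every document follows the same layout, which is the point made above about predictable statements.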
It goes without saying that we do not share the code we used to do this. The company that paid us for doing that project would not be pleased.
I'm trying to create a Java program that will OCR images in many formats. The images cannot be read directly from a file, because their bytes are sent over the network.
I'm currently able to read the raw bytes of the image pixels using ImageIO. However, I would like to support all the formats that ImageMagick supports, so I want to read the image using JMagick and then pass the raw bytes to Tess4J. I'm not sure how I should approach this. I found this function, which might give me the bytes:
PixelPacket[] MagickImage.getColormap();
But I would have to write a special method for transforming the obtained PixelPacket objects into consecutive bytes. I can do that, but maybe there's a better way? For example, maybe there's some extremely raw file format (even more so than http://en.wikipedia.org/wiki/BMP_file_format#mediaviewer/File:BMPfileFormat.png) that I could use, for example, with this method:
byte[] imageToBlob(ImageInfo imageInfo) ?
The imageInfo object would have to point to this raw format, and then I could cut the pixel information out of the byte array.
Is this the proper way, or should I use something simpler (faster/more robust)?
Edit
I found the format I had in mind is called PNM.
If you are using JMagick, I think the dispatchImage method is what you are looking for. It gives you direct access to the raw pixels of the image. No file format required.
See my MagickUtil class for examples, or just use that class if you like.
I've also written pure Java ImageIO plugins for many of the same formats that JMagick supports, which might be of use. You'll find them in my GitHub repository.
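For the plain ImageIO route, a minimal sketch of turning received bytes into consecutive raw RGB bytes could look like the following. It uses only the standard library, not JMagick, and the binary PPM (P6) layout is borrowed from the PNM family mentioned in the question's edit. The class and method names are hypothetical, and whether your OCR step accepts this layout is an assumption you would need to check.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.imageio.ImageIO;

// Hypothetical helper: encoded image bytes in, binary PPM (P6) bytes out.
class RawPixels {
    // Decodes bytes received over the network with whatever ImageIO readers are installed.
    static byte[] decodeToPpm(byte[] encodedImage) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(encodedImage));
        if (image == null) {
            throw new IOException("No ImageIO reader for this format");
        }
        return toPpm(image);
    }

    // Writes a short ASCII header followed by width*height RGB triples, row by row.
    static byte[] toPpm(BufferedImage image) {
        int w = image.getWidth();
        int h = image.getHeight();
        byte[] header = ("P6\n" + w + " " + h + "\n255\n").getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[header.length + w * h * 3];
        System.arraycopy(header, 0, out, 0, header.length);
        int i = header.length;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = image.getRGB(x, y); // packed ARGB; alpha is dropped
                out[i++] = (byte) (rgb >> 16);
                out[i++] = (byte) (rgb >> 8);
                out[i++] = (byte) rgb;
            }
        }
        return out;
    }
}
```

Skipping the header bytes leaves just the pixel data. Formats that ImageIO cannot decode out of the box would need additional plugins on the classpath.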
I have created a program that should one day become a PDF editor.
Its purpose will be saving the GUI's textual content to a PDF and loading it back from it. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page; it can have many more, and the upper and lower margins are cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found that iText could suit my PDF-creating needs. However, if I understood correctly, it retrieves the text of a whole page as a single string, which won't work for me, because I need to detect the different fields/paragraphs/or something similar to be able to load them back into the program.
Now, I'm a bit lazy, and I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend me one according to my needs.
Or maybe recommend how to add invisible parts to the PDF that would help my program determine where exactly each piece of text is situated inside the PDF file...
Just to be clear (I phrased my question badly before), the only thing I need to put in my PDF is text, and that's all I need to be able to get out later. My program should be able to read the PDFs that it created itself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner that they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
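As a rough illustration of the intermediate-format idea, the text of each field can be written to XML and read back with nothing but the standard library. The sketch below uses `java.util.Properties` for brevity. The class name and the field names are made up, and a real editor with repeated or nested fields would probably want its own XML or JSON schema instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical store: one XML entry per GUI field, keyed by field name.
class FieldStore {
    // Serializes the field texts to a small XML document.
    static byte[] save(Map<String, String> fields) throws IOException {
        Properties props = new Properties();
        props.putAll(fields);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        props.storeToXML(out, "editor fields", "UTF-8");
        return out.toByteArray();
    }

    // Reads the XML back into a map sorted by field name.
    static Map<String, String> load(byte[] xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(new ByteArrayInputStream(xml));
        Map<String, String> fields = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            fields.put(name, props.getProperty(name));
        }
        return fields;
    }
}
```

The PDF can then be generated from the same map, so the XML remains the source of truth and the PDF is just one output.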
I found this link and another link that provide examples showing how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.
I have a Java application that creates a BufferedImage and saves it to disk as a JPEG. I'd really like to add a caption to the image. To prevent the image from getting crowded out by text on the image itself, it'd be great if I could write the caption to the JPEG's metadata.
I've been searching all over the place for a solution, but haven't found anything satisfactory. Sanselan comes up a lot, but I haven't figured out how to use it properly. I found examples that modify existing metadata, but my files don't contain metadata as they are simply created from ImageIO.write() or Sanselan.writeImage().
I found another post that does what I'm looking for, but it's in C# and I need Java.
Any help would be greatly appreciated.
The package you want to look at is javax.imageio.metadata.
The IIOMetadata class (which has a concrete subclass for JPEG) contains methods to get metadata information in various formats, including as an XML DOM tree root node.
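A minimal sketch of how this could be used for the caption: the JDK's native JPEG metadata format (`javax_imageio_jpeg_image_1.0`) has a `markerSequence` node that can hold `com` (comment) entries. The class and method names below are hypothetical, and the caption is stored as a plain JPEG comment, not as EXIF or IPTC data, so whether a given viewer displays it is not guaranteed.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.metadata.IIOMetadataNode;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: writes a caption as a JPEG comment and reads it back.
class JpegCaption {
    private static final String JPEG_FORMAT = "javax_imageio_jpeg_image_1.0";

    static byte[] write(BufferedImage image, String caption) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        try {
            ImageWriteParam param = writer.getDefaultWriteParam();
            IIOMetadata metadata = writer.getDefaultImageMetadata(
                ImageTypeSpecifier.createFromRenderedImage(image), param);
            Element root = (Element) metadata.getAsTree(JPEG_FORMAT);
            // Add a "com" node at the start of the marker sequence.
            Node markers = root.getElementsByTagName("markerSequence").item(0);
            IIOMetadataNode com = new IIOMetadataNode("com");
            com.setUserObject(caption.getBytes(StandardCharsets.ISO_8859_1));
            markers.insertBefore(com, markers.getFirstChild());
            metadata.setFromTree(JPEG_FORMAT, root);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ImageOutputStream ios = ImageIO.createImageOutputStream(out)) {
                writer.setOutput(ios);
                writer.write(null, new IIOImage(image, null, metadata), param);
            }
            return out.toByteArray();
        } finally {
            writer.dispose();
        }
    }

    // Returns the first comment found, or null if the file has none.
    static String read(byte[] jpeg) throws IOException {
        ImageReader reader = ImageIO.getImageReadersByFormatName("jpeg").next();
        try (ImageInputStream in = ImageIO.createImageInputStream(new ByteArrayInputStream(jpeg))) {
            reader.setInput(in);
            Element root = (Element) reader.getImageMetadata(0).getAsTree(JPEG_FORMAT);
            NodeList comments = root.getElementsByTagName("com");
            return comments.getLength() == 0
                ? null
                : ((Element) comments.item(0)).getAttribute("comment");
        } finally {
            reader.dispose();
        }
    }
}
```

The comment bytes are encoded as ISO-8859-1 here, so captions with characters outside that range would need a different approach.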