Formatting substrings in SpreadsheetML cells using Xelem - java

When looking for a Java library which can produce (and read) SpreadsheetML 2003 files, I came across the Xelem library.
It almost perfectly fits all my needs; however, it does not seem to provide the possibility to format parts of strings in cells individually.
I.e. I need to be able to write some words in cells in bold text while others are not bold, which is why I cannot use styles (as they format the whole cell uniformly):
"non-bold-text and bold-text"
When creating and saving files containing such cells with Excel, Excel uses bold tags of the HTML namespace like this:
...
<Cell><ss:Data ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40">non-bold-text and <B>bold-text</B></ss:Data></Cell>
...
However, I cannot find any possibility to get Xelem to generate such code. It does not seem to offer the ability to specify additional namespaces for a cell.
Am I missing something here, and if not, is there a (simple) workaround for this?
Hints for alternative libraries are also appreciated, but the file format needs to be SpreadsheetML 2003.

Related

Java: Print Text With Strikethrough

I'm printing to a file. Is there a way to print the text with strikethrough through it? I have done some googling, but did not find any applicable answers.
You would have to save the file in a PDF, HTML or create some kind of word processor document. Simple text (or more correctly plaintext) does not have formatting ... in any language ...
I'd recommend HTML. It is simple to create (PDF is a pain), gives you the option of other formatting (people always end up asking for a heading), allows you to format as tables (managers love tables), and will open anywhere (could even be served on a web-server, eliminating printing and tree-killing altogether).
If you want to force it, you can use the unicode index of those letters, like this:
"\u03C0" //π
http://unicode-table.com/de/0268/
This, as an example is the ɨ

Is there a clean way to to transform text files that are not the same into a standard format

I'm pretty sure the answer i'm going to get is: "why don't you just have the text files all be the same or follow some set format". Unfortunately i do not have this option but, i was wondering if there is a way to take any text file and translate it over to another text or xml file that will always look the same?
The text files pretty much have the same data just arranged differently.
The closest i can come up with is to have an XSLT sheet for each text file but, then i have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data off text files that essentially have the same data just stored differently; and store this data in an object that i could then re-use later on in some process?
If it was up to me, i would push for every text file to follow some predefined format since they all pretty much contain the same data but, it's not up to me.
Odd question... You say they are text files yet mention XSLT as a possible solution. XSLT will only work if the source is XML, if that is so, please redefine the question. If you say text files I assume delimiter separated (e.g. csv), fixed length,...
There are some parsers (like smooks) out there that allow you to parse multiple formats, but it will still require you to perform the "mapping" yourself of course.
This is a typical problem in the integration world so any integration tool should offer you a solution (e.g. wso2, fuse,...).

PDF Handling in Java

I have created a program that should one day become a PDF editor
It's purpose will be saving GUI's textual content to the PDF, and loading it from it. GUI resembles text editor, but it only has certain fields(JTextAreas, actually).
It can look like this (this is only one page, it can have many more, also upper and lower margins are cut out of the picture) It should actually resemble A4 in pixel size.
I have looked around for a bit for PDF libraries and found out that iText could suit my PDF creating needs, however, if I understood it correct, it retirevs text from a whole page as a string which won't work for me, because I will need to detect diferent fields/paragaphs/orsomething to be able to load them back into the program.
Now, I'm a bit lazy, but I don't want to spend hours going trough numerus PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend me one according to my needs.
Or maybe recommend me how to add invisible parts to PDF which will help my program to determine where is it exactly situated insied a PDF file...
Just to be clear (I formed my question wrong before), only thing I need to put in my PDF is text, and that's all I need to later be able to get out. My program should be able to read PDF's which he created himself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDF's in a manner that they wasn't designed for. PDF's were not designed to store data; they were designed to present and format data in an portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.

How to generate a printable output for a phonebook

I'm developing a desktop software to manage people and telephones, and also to generate (export) a list of telephones (also with a summary of the cities) that can be printed (like pdf). The part of telephones management is ready and was made with java and swt/jface. Exporting the list in a print friendly format is what has become an issue.
I tried exporting the list in HTML with CSS, but the result is not the same in different browsers.
I was thinking about generating it in LaTeX, but creating an style is getting too complicated (need an A7 page size, smaller fonts...).
What file format can be used to export this list? Is there an easy way to generate printable stuff?
Edit: forgot to mention that the file will be sent to a company to be printed.
Thanks!
Generate a pdf, it will look the same no matter what browser they use. You can use iText to create the pdf, it is fairly straight forward for a simple pdf.
You could just draw an image, it will stay the same on different systems and its easy to print. by drawing it, you can style it like you imagine, without learning any document format. It should be easy to draw a simple table.
Plain text is a very friendly format for me. Altough, this could be done with HTML and CSS, if you keep the style complexity level to a minimum. Try reading:
http://www.smashingmagazine.com/2010/06/07/the-principles-of-cross-browser-css-coding/
And be careful when choosing your properties!

rendering pdf on webside ala google documents

In a current project i need to display PDFs in a webpage. Right now we are embedding them with the Adobe PDF Reader but i would rather have something more elegant (the reader does not integrate well, it can not be overlaid with transparent regions, ...).
I envision something close google documents, where they display PDFs as image but also allow text to be selected and copied out of the PDF (an requirement we have).
Does anybody know how they do this? Or of any library we could use to obtain a comparable result?
I know we could split the PDFs into images on server side, but this would not allow for the selection of text ...
Thanks in advance for any help
PS: Java based project, using wicket.
I have some suggestions, but it'll be definitely hard to implement this stuff. Good luck!
First approach:
First, use a library like pdf-renderer (https://pdf-renderer.dev.java.net/) to convert the PDF into an image. Store these images on your server or use a caching-technique. Converting PDF into an image is not hard.
Then, use the Type Select JavaScript library (http://www.typeselect.org/) to overlay textual data over your text. This text is selectable, while the real text is still in the original image. To get the original text, see the next approach, or do it yourself, see the conclusion.
The original text then must be overlaid on the image, which is a pain.
Second approach:
The PDF specifications allow textual information to be linked to a Font. Most documents use a subset of Type-3 or Type-1 fonts which (often) use a standard character set (I thought it was Unicode, but not sure). If your PDF document does not contain a standard character set, (i.e. it has defined it's own) it's impossible to know what characters are which glyphs (symbols) and thus are you unable to convert to a textual representation.
Read the PDF document, read the graphics-objects, parse the instructions (use the PDF specification for more insight in this process) for rendering text, converting them to HTML. The HTML conversion can select appropriate tags (like <H1> and <p>, but also <b> and <i>) based on the parameters of the fonts (their names and attributes) used and the instructions (letter spacing, line spacing, size, face) in the graphics-objects.
You can use the pdf-renderer library for reading and parsing the PDF files and then code a HTML translator yourself. This is not easy, and it does not cover all cases of PDF documents.
In this approach you will lose the original look of the document. There are some PDF generation libraries which do not use the Adobe Font techniques. This also is a problem with the first approach, even you can see it you can not select it (but equal behavior with the official Adobe Reader, thus not a big deal you'd might say).
Conclusion:
You can choose the first approach, the second approach or both.
I wouldn't go in the direction of Optical Character Recognition (OCR) since it's really overkill in such a problem, since it also has several drawbacks. This approach is Google using. If there are characters which are unrecognized, a human being does the processing.
If you are into the human-processing thing; you can only use the Type Select library and PDF to Image conversion and do the OCR yourself, which is probably the easiest (human as a machine = intelligently cheap, lol) way to solve the problem.

Categories