iText - Pass Objects - java

I am providing the scenario as an image, which I guess would help in better understandability.
Here I have to pass object from different Swing forms and accumulate it to create a new document which will consist the concatenated texts, tables etc.
My question, is it possible to do the same.
N.B :- I am able to do simple tasks in iText - reading tables from Swing forms, etc.
Thanks, suggestions are very much appreciated.

Yes it is possible to do the same with itext. May be you can do it this way
Create document
Get Form 1 Values. Create a Chunk using the Form 1 Text.
Add the Chunk into the Document
Get Form 2 Values. Create a Chunk using Form 2 text
Add the Chunk into the Document
Like this do for all the forms.
The above is solution for adding all the text from different forms into document.
Refer Link for itext references.
If in case all the texts has to concatenated do it well ahead and add it to the Chunk.
P.S: Am not familiar with Swing framework.

Related

Split docx to multiple docx using Java

I have a requirement to split 1 docx to multiple docx based on subheadings.
where input document have TOC, graphs, paragraphs, tables , images and drawing tools .
I have a write a app to get a docx and generate multiple docx based on subheading.
I could see few resource for paragraph read and write but couldn't find for others. any suggestions to clone the doc and write as is in order to maintain the same style and format.
Thanks in advance
There are at least 2 ways to do this. The first is to use a clone of the entire document, but only including the relevant portion of the main document part. This is fairly easy to do, but the output documents might be large (since they contain unused images etc), unless you open/save in Word.
The second would be to use our commercial Docx4j Enterprise. You still have to identify where each chunk starts and finishes, but it will take just the objects referenced in that chunk (so you get small output documents).

Plausibility of plan to use DOM4J with JTables & Action Listeners

I am preparing to embark on a large solo project at my place of employment. First let me describe the project. I have been asked to create a Java program that can take a CamT54 file (which is just a xml file) and have java display the information in table form. Then users should be given the ability to remove certain components from the table and have it go back to xml format with the changes.
I'm not well versed in dealing with XML in Java so this is going to be a learn and work task. Before I begin investing time I would like to know that my approach is the best approach.
My plan is to use DOM4J to do the parsing and handling of the xml. I will use a JTable to display the data and incorporate some buttons to the GUI that allow the modifications of the data through the use of some action listeners.
Would this be a plausible plan? Can DOM4J effectively allow xml data to be displayed in a table format and furthermore could that data be easily modified or deleted then resaved to a new xml?
I thought I would go ahead and answer this as I finished the program and wanted to post what I thought was the easiest solution in case anyone else needed help.
It turned out the easiest approach (for me at least) was to use the standard DOM parser, here are the steps I took.
Parsed the entire XML into String array lists. XPath was required for this, I also had to convert the elements into Strings and remove the extra tag information from the string using substrings since I only wanted the actual value.
I populated a JTable with these arrays.
Once users finished editing and clicked a save button then another Dom parser would take the original XML and change each and every attribute using the values from the Arrays (that were deleted and repopulated with the JTable cell values when the user clicked "save").

how to create a new word from template with docx4j

I have the following scenario, and need some advice:
The user will input a word document as a template, and provide some parameters in runtime so i can query my database and get data to fill the document.
So, there are two basic things i need to do:
Replace every key in the document with it´s respective result from the current query line.
"Merge" (copy? duplicate?) the existing document unchanged into itself (append) depending on how many rows i got from the query, and replacing the keys from this new copy with the next row values.
What´s is the best aprroach to do this? I´ve managed to do the replace part for now, by using the unmarshallfromtemplate providing it a hashmap.
But this way is a little bit tricky, because i need to add "${variable_name}" in the document, and sometimes word separates "${" and "}" in different tags, causing issues.
I´ve read about the custom xml binding, but didn´t understand it completely. I need to generate a custom XML, inject it in the document (all of this un runtime) and call the applybindings?? If this is true, how would i bind the fields in the document to the xml ? By name?
docx4j includes VariablePrepare, which can tidy up your input docx so that your keys are not split across separate runs.
But, you would still be better off switching to content control data binding, particularly if you have repeated data (think for example of line items in an invoice). Disclosure: I champion this approach in docx4j.
To adopt the content control data binding approach:
dream up an XML format which makes sense for your data, and write some code to convert the results of your database query into that format.
modify your template, so that the content controls are bound to elements in your XML document. ordinarily you'd use an authoring add-in for Word to help with this. (The technology Microsoft uses for binding is XPath, so how you bind depends on your XML structure, but, yes, you'd typically bind to the element name or ID).
now you have your XML file and a suitable intput docx, ContentControlsMergeXML contains the code you need to create an instance document at run time. There's also a version of this for a servlet environment at https://github.com/plutext/OpenDoPE-WAR
As an alternative to 1 & 2, there is also org.docx4j.model.datastorage.migration.FromVariableReplacement in current nightlies, which can convert your existing "${" document. Only to a standardised target XML format though.
If you have further questions, there is a forum devoted to this topic at http://www.docx4java.org/forums/data-binding-java-f16/

PDF Handling in Java

I have created a program that should one day become a PDF editor
It's purpose will be saving GUI's textual content to the PDF, and loading it from it. GUI resembles text editor, but it only has certain fields(JTextAreas, actually).
It can look like this (this is only one page, it can have many more, also upper and lower margins are cut out of the picture) It should actually resemble A4 in pixel size.
I have looked around for a bit for PDF libraries and found out that iText could suit my PDF creating needs, however, if I understood it correct, it retirevs text from a whole page as a string which won't work for me, because I will need to detect diferent fields/paragaphs/orsomething to be able to load them back into the program.
Now, I'm a bit lazy, but I don't want to spend hours going trough numerus PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend me one according to my needs.
Or maybe recommend me how to add invisible parts to PDF which will help my program to determine where is it exactly situated insied a PDF file...
Just to be clear (I formed my question wrong before), only thing I need to put in my PDF is text, and that's all I need to later be able to get out. My program should be able to read PDF's which he created himself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDF's in a manner that they wasn't designed for. PDF's were not designed to store data; they were designed to present and format data in an portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.

How can I extract only the main textual content from an HTML page?

Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with some short description to the entire texts (this is common in news portals) and I don't want to discard these shorts text.
So if an API does this, get the different textual parts/the blocks splitting each one in some manner that differ from a single text (all in only one text is not useful), please report.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page have a lot of content like menus, publicity, banners, etc.
I want to try to exclude all that is not related with the content of the page.
Taking this page as example, I don't want the menus above neither the links in the footer.
Important: All pages are HTML and are pages from various differents sites. I need suggestion of how to exclude these contents.
At moment, I think in excluding content inside "menu" and "banner" classes from the HTML and consecutive words that looks like a proper name (first capital letter).
The solutions can be based in the the text content(without HTML tags) or in the HTML content (with the HTML tags)
Edit: I want to do this inside my Java code, not an external application (if this can be possible).
I tried a way parsing the HTML content described in this question : https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what your looking for, remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract HTML.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : getTextBlocks()) {
// block.isContent() tells you if it's likely to be content or not
// block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web with multiple articles in it, such as the front page of the New York times
a web page that really doesn't have any article in it, but has some content in respect to links, but may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case#3 is a problem as well - it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use some libs like goose. It works best on articles/news.
You can also check javascript code that does similar extraction as goose with the readability bookmarklet
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p"); and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reasoning for straying away from Jsoup? If so, couldn't you just tweak the number of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. do not retrieve div elements)
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.
You use a provided application to design xml files read by the roboserver api to parse webpages. The xml files are built by you analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can setup the scheduling, running, and db integration using the provided Java API.
If you're against using software and doing it yourself, I'd suggest not trying to apply 1 rule to all sites. Find a way to separate tags and then build per-site
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter the html junk and then parse the required details or use the apis of the existing site.
Refer the below link to filter the html, i hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto api, it extracts the main 'article' text and there is also the opportunity to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.

Categories