Internal linking of texts from .csv files (in Java)

I have a .csv file containing text. I am supposed to parse the data and, based on specific keywords, wrap those words in the HTML tags needed to link them to a website.
So far, I have written a .csv parser and writer that reads the required columns from the first file and prints them to a newly created .csv file (e.g. text id in one cell, text title in the next cell, and the actual text in the cell after that).
I am still waiting for the list of keywords, as well as the website hierarchy and the links to put in, but to be honest I have no idea how to continue working on this. Somehow I'll have to walk down the website hierarchy to where the text title is present, consider only the elements beneath it, and link them to keywords in my text. How can this be done? Is there special software, or are there extensions, libraries, or packages for Java, that do something like this?
Any help would be appreciated, I'm running on a deadline here...
THX!
P.S.: I am coding all of it in java

I'm not sure, but it sounds like you want to create an href column in your output:
Visit W3Schools
You could do this most simply by concatenating the strings:
String makeHref(String title, String id, String link) {
return "<a href=" + ... etc. }
before you write out the second csv. You'll need to escape the "s, of course.
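A minimal sketch of that idea, under some assumptions: the id parameter from the signature above is left out because its role isn't specified, the helper names (escapeHtml, csvCell, linkKeyword) and the example.com URL are made up for illustration, and the CSV quoting follows the common convention of doubling embedded quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefSketch {
    // Escapes characters that are unsafe inside HTML text or attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds an anchor tag; the quotes around the attribute value are escaped for Java with \".
    static String makeHref(String title, String link) {
        return "<a href=\"" + escapeHtml(link) + "\">" + escapeHtml(title) + "</a>";
    }

    // Replaces whole-word occurrences of a keyword with a link to the given URL.
    static String linkKeyword(String text, String keyword, String link) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(makeHref(keyword, link)));
    }

    // Quotes a value for a CSV cell: wrap it in quotes and double any embedded quotes.
    static String csvCell(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String linked = linkKeyword("Learn Java today", "Java", "https://example.com/java");
        System.out.println(csvCell(linked));
    }
}
```

Note that running linkKeyword twice over the same text would also match the keyword inside the anchor it just inserted, so each keyword should be applied to the original text only once.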
It's also entirely possible that I didn't understand the question. You may want to try to be more specific if that's the case.

Related

Extracting data from a PDF

I have a system that ultimately creates a PDF file from an HTML file. It works much like a mail merge: it grabs data from a database, merges the data into placeholders in the HTML document, and then converts the HTML file to a PDF.
When I am unit testing the HTML file I can look at the values in my placeholders. For example, if I had a John Smith and I wanted to validate that the name is "John Smith", I would simply look at the value of the div after the merge.
I need to do something similar with validating the data in the pdf. Using pdfbox and itext I was able to extract text from a location as well as text from the document but I can't find anything that would let me create a "tag/placeholder/..." and extract information from it similar to what I do with the html file.
Is this possible with a pdf?
That's perfectly possible using pdf2Data, which is a solution from the iText suite.
You can find the demo here
http://pdf2data.online/
It essentially does exactly what you described, you are given a viewer and some tools that allow you to define areas of interest (what you called 'placeholders').
Areas of interest can be defined using:
coordinates
relative to other areas of interest
relative to text or regular expressions
matching a certain regular expression
matching a table
etc
The tool then stores your template as an XML file, and you can use java or .NET code to extract information from a PDF that matches the template.
You are given either a JSON-like data structure or an XML file.
That should make it relatively straightforward to test whether a given area of interest contains a piece of text.

Java: Print Text With Strikethrough

I'm printing to a file. Is there a way to print the text with strikethrough through it? I have done some googling, but did not find any applicable answers.
You would have to save the file in a PDF, HTML or create some kind of word processor document. Simple text (or more correctly plaintext) does not have formatting ... in any language ...
I'd recommend HTML. It is simple to create (PDF is a pain), gives you the option of other formatting (people always end up asking for a heading), allows you to format as tables (managers love tables), and will open anywhere (could even be served on a web-server, eliminating printing and tree-killing altogether).
If you want to force it, you can use the unicode index of those letters, like this:
"\u03C0" //π
http://unicode-table.com/de/0268/
This, as an example is the ɨ
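If the Unicode route is good enough, one way it is sometimes done is to append the combining character U+0336 (COMBINING LONG STROKE OVERLAY) after every character. The sketch below is only an illustration: the class and method names are made up, and whether the result actually shows as struck-through depends on the font and viewer that open the file. The file also has to be written in an encoding that can hold the character, such as UTF-8.

```java
final class StrikethroughSketch {
    // U+0336 COMBINING LONG STROKE OVERLAY draws a stroke over the preceding character.
    private static final char STROKE = '\u0336';

    // Appends the combining stroke after every code point of the input.
    static String strike(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 2);
        text.codePoints().forEach(cp -> sb.appendCodePoint(cp).append(STROKE));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strike("struck out"));
    }
}
```

The output is still plain text, so anything that strips or ignores combining characters will show the original letters without the stroke.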

Is there a clean way to transform text files that are not the same into a standard format

I'm pretty sure the answer I'm going to get is: "why don't you just have the text files all be the same or follow some set format". Unfortunately I do not have this option, but I was wondering if there is a way to take any text file and translate it into another text or XML file that will always look the same?
The text files pretty much have the same data, just arranged differently.
The closest I can come up with is to have an XSLT sheet for each text file, but then I have to turn around and read the file that was just created, delete it, and repeat for each text file.
So, is there a way to grab the data from text files that essentially hold the same data, just stored differently, and store this data in an object that I could then reuse later on in some process?
If it were up to me, I would push for every text file to follow some predefined format, since they all pretty much contain the same data, but it's not up to me.
Odd question... You say they are text files yet mention XSLT as a possible solution. XSLT will only work if the source is XML; if that is so, please redefine the question. If you say text files, I assume delimiter-separated (e.g. csv), fixed-length, ...
There are some parsers (like smooks) out there that allow you to parse multiple formats, but it will still require you to perform the "mapping" yourself of course.
This is a typical problem in the integration world so any integration tool should offer you a solution (e.g. wso2, fuse,...).
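Doing the mapping by hand could look roughly like the sketch below: one small parser per source layout, all producing the same field names, so the rest of the program only ever sees one shape. Everything here is an assumption for illustration, since the question doesn't show the real formats: the two layouts, the field names name and city, and the LineParser interface are all made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NormalizeSketch {
    // One parser per source layout; each maps a raw line to the same field names.
    interface LineParser {
        Map<String, String> parse(String line);
    }

    // Hypothetical layout A: comma-separated "name,city".
    static final LineParser CSV = line -> {
        String[] parts = line.split(",", -1);
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("name", parts[0].trim());
        rec.put("city", parts[1].trim());
        return rec;
    };

    // Hypothetical layout B: "key=value" pairs separated by ';', in any order.
    static final LineParser KEY_VALUE = line -> {
        Map<String, String> rec = new LinkedHashMap<>();
        for (String pair : line.split(";")) {
            String[] kv = pair.split("=", 2);
            rec.put(kv[0].trim(), kv[1].trim());
        }
        return rec;
    };

    public static void main(String[] args) {
        // Both lines describe the same record and end up with the same fields.
        System.out.println(CSV.parse("Ann, Oslo"));
        System.out.println(KEY_VALUE.parse("city=Oslo;name=Ann"));
    }
}
```

A real version would need error handling for malformed lines and probably a proper record class instead of a map, but the intermediate files from the XSLT approach are no longer needed.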

parsing html page in java without using external library

I know it's an old question and has been asked many times. Note: I cannot use external libraries.
Given a function with label as argument, my function should return list of all the tags that contain that label.
I thought of saving my html as tree and then I can find the label and return list of all the tags. But I am not able to code it in java. How to completely parse and store html as tree structure and search on it?
Please help.
Thanks
Ravi

Reading PDF in java as a file and making "PDF" editable

I have a program which will be used for building a question database. I'm making it for a site that wants users to know that the content was downloaded from that site. That's why I want the output to be PDF: almost everyone can view it, and almost nobody can edit it (and remove e.g. a footer or watermark, unlike in some simpler file types). That explains why it HAS to be PDF.
This program will be used by numerous users who will create new databases or expand existing ones. That's why producing the output as multiple files would be an extremely sloppy and inefficient way of achieving what I want (it would complicate things for the user).
And what I want to do is to create PDF files which are still editable with my program once created.
I want to achieve this by implementing my custom file type readable with my program into the output PDF.
I came up with three ways of doing that:
1. Attach the file to the PDF and then corrupt the part of the PDF that contains it, in a way that just makes the PDF unaware that it contains the file, making it impossible for the user to notice it (easily). Upon reading the document I'd revert the corruption and extract the file using one of the many PDF libraries.
2. Hide the file inside an image which would be added to the PDF somewhere on the first or last page, somehow (I still need to work that out) hidden from the public eye. Knowing its location, it should be relatively easy to retrieve it using a PDF library.
3. I have learned that if you add a "%" sign as the first character of a line inside a PDF, the whole line will be ignored (similar to "//" in Java) by the PDF reader (at least Adobe Reader), making it possible for me to add as many lines as I want to the PDF (if I know where, and I do) without the end user being aware of it. I could embed my whole custom file in the PDF that way. The problem here is that I actually have to read the PDF using one of Java's input readers, but I'm not sure which one. I understand that a PDF can't be read like a text file since it's a binary file (right?).
In the end, I decided to go with method number 3.
Unless someone has any better ideas, and the conditions are:
1. One file only. And that file is PDF.
2. User must not be aware of the addition.
The problem is that I don't know how to read the PDF as a file (I'm not trying to read it as a PDF, which I would do using a PDF library).
So, does anyone have a better idea?
If not, how do I read a PDF as a FILE, so that the output is an array of characters (with newline detection), and then rewrite the whole file with my added content?
In Java, there is no real difference between text and binary files; you can read both as an InputStream. The difference is that for binary files you can't really create a Reader, because that assumes there's a way to convert the byte stream to Unicode characters, and that won't work for PDF files.
So in your case, you'd need to read the files in byte buffers and possibly loop over them to scan for bytes representing the '%' and end-of-line character in PDF.
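A rough sketch of that byte-level scan, assuming the whole file fits in memory and that CR, LF, and CRLF all count as line endings. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class PdfCommentScan {
    // Returns every line whose first byte is '%'. Lines end at CR, LF, or CRLF.
    static List<String> commentLines(byte[] data) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = i == data.length;
            if (atEnd || data[i] == '\n' || data[i] == '\r') {
                if (i > start && data[start] == '%') {
                    // ISO-8859-1 maps each byte to exactly one char, so no bytes are lost.
                    result.add(new String(data, start, i - start, StandardCharsets.ISO_8859_1));
                }
                start = i + 1;
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            // Read the PDF as raw bytes, not through a Reader.
            byte[] data = Files.readAllBytes(Paths.get(args[0]));
            commentLines(data).forEach(System.out::println);
        }
    }
}
```

Because compressed stream data can contain a '%' right after a newline byte purely by chance, a scan like this can report false positives. Giving your own lines a recognizable prefix after the '%' makes them easier to pick out.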
A better way is to use another existing way of encoding data in a PDF: XMP tags. This allows any sort of complex key-value pairs to be encoded in XML and embedded in PDFs, JPEGs, etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf.
There's an open source library in Java that allows you to manipulate that: http://pdfbox.apache.org/userguide/metadata.html. See also a related question from another guy who succeeded in it: custom schema to XMP metadata or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html
It's all just 1's and 0's - just use RandomAccessFile and start reading. The PDF specification defines what a valid newline character(s) is/are (there are several). Grab a hex editor and open a PDF and you can at least start getting a feel for things. Be careful of where you insert your lines though - you'll need to add them towards the end of the file where they won't screw up the xref table offsets to the obj entries.
Here's a related question that may be of interest: PDF parsing file trailer
I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.
So a simple algorithm for inserting your special comment will be:
Go to the end of the file
Search backwards for startxref
Insert your special comment immediately before startxref - be sure to insert a newline character at the end of your special comment
Save the PDF
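The steps above could be sketched in Java roughly as follows. This is an illustration under assumptions, not tested against real viewers: the class and method names are made up, the whole file is held in memory, and the payload is assumed to be a single line of Latin-1 text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class StartxrefInsert {
    private static final byte[] MARKER = "startxref".getBytes(StandardCharsets.ISO_8859_1);

    // Searches backwards from the end of the file for the last "startxref" keyword.
    static int lastIndexOfMarker(byte[] data) {
        for (int i = data.length - MARKER.length; i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // Returns a copy of the PDF bytes with "%payload\n" inserted right before startxref.
    static byte[] insertComment(byte[] pdf, String payload) {
        if (payload.indexOf('\n') >= 0 || payload.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("payload must be a single line");
        }
        int pos = lastIndexOfMarker(pdf);
        if (pos < 0) {
            throw new IllegalArgumentException("no startxref keyword found");
        }
        byte[] comment = ("%" + payload + "\n").getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream(pdf.length + comment.length);
        out.write(pdf, 0, pos);                 // everything before startxref, unchanged
        out.write(comment, 0, comment.length);  // the comment line plus its newline
        out.write(pdf, pos, pdf.length - pos);  // startxref, the offset, and %%EOF
        return out.toByteArray();
    }
}
```

The bytes would come from Files.readAllBytes and go back out with Files.write. Nothing before the insertion point moves, so the object offsets in the xref table and the number after startxref still point at the same places.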
You can (and should) do this manually in a hex editor.
Really important: are your users going to be saving changes to these files? i.e. if they fill in the form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).
XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.
I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217
