Compare XML through Java and present it like a diff tool - java

I need to write logic in Java that:
Takes two versions of an XML file, e.g. v1.xml and v2.xml
Outputs the differences between the two XMLs
Displays them on a webpage just like any diff tool (such as WinMerge) would:
Removed lines - highlighted in a unique color
Added lines - highlighted in a unique color
Changed lines - highlighted in a unique color
What is the best way to achieve this?
Thanks!

You can use XMLUnit to achieve most of your requirements.
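For instance, here is a minimal sketch against the classic XMLUnit 1.x API (class names moved around in XMLUnit 2.x, so treat it as an outline rather than exact code); each reported difference carries an XPath location plus the expected and actual values, which you could render as colored rows on your page:

import java.io.FileReader;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.Diff;
import org.custommonkey.xmlunit.XMLUnit;

public class XmlDiffDemo {
    public static void main(String[] args) throws Exception {
        XMLUnit.setIgnoreWhitespace(true);                  // ignore formatting-only changes
        Diff diff = new Diff(new FileReader("v1.xml"), new FileReader("v2.xml"));
        DetailedDiff detailed = new DetailedDiff(diff);     // collects every difference, not just the first
        for (Object d : detailed.getAllDifferences()) {
            System.out.println(d);                          // description, XPath, expected vs. actual value
        }
    }
}

Note that XMLUnit compares the documents node by node rather than line by line, so you would map each difference back to its position yourself when building the WinMerge-style view.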

Writing an XML parser from scratch is a bad idea if that is what you mean. It sounds really easy at first, but then quickly becomes a nightmare, trust me. I highly recommend taking advantage of existing tools.
http://www.roseindia.net/opensource/xmldiff.php lists several tools, including 3DM, diffmk, diffxml, VMTools, X-Diff, and XMLUnit. If you do have to write your own parser, you might want to at least look at the code from these projects for ideas. However, it takes much less time and effort to just give them credit and use their tools than to rewrite them yourself. I haven't used any of these tools, so buyer beware.
See also Tool or library for comparing xml files

Related

Identify an English word as a thing or product?

Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. As a contrast, the following text talks about a process instead of a thing/product -> "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, manually doing it is not feasible. So far, using NLTK + Python, I have been able to identify some specific cases which use very similar keywords. But I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, evaluate and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with Python, NLTK and WordNet (an interface is already available); you might be able to use synset hypernyms for your problem.
This task is called the named entity recognition (NER) problem.
EDIT: There is no clean definition of NER in the NLP community, so one could say this is not an NER task but an instance of the more general sequence labeling problem. Either way, there is still no tool that can do this out of the box.
Out of the box, Stanford NLP can only recognize the following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.
Of course, you can create your own model by hand-annotating about 1000 or so product-name-containing sentences and training a classifier such as a Conditional Random Field (CRF) with some basic features (here is a documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect of course (no system will be perfect, but some solutions are better than others).
EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here are example step-by-step instructions for doing this with an open source tool:
Download CRF++ and look at the provided examples; they are in a simple text format
Annotate your data in a similar manner:
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Split your annotated data into two files: train (80%) and dev (20%)
Use the following baseline template features (paste them into the template file):
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
Run:
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt. One column will contain your hand-labeled data and another the machine-predicted labels. You can then compare these, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
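To make that last step concrete, here is a small Java sketch of my own (it assumes crf_test appended the predicted tag as the final whitespace-separated column, after the hand-labeled tag) that computes token-level accuracy from result.txt:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CrfAccuracy {
    public static void main(String[] args) throws IOException {
        int total = 0, correct = 0;
        for (String line : Files.readAllLines(Paths.get("result.txt"))) {
            if (line.trim().isEmpty()) continue;              // blank lines separate sentences
            String[] cols = line.trim().split("\\s+");
            String gold = cols[cols.length - 2];              // your hand-labeled tag
            String predicted = cols[cols.length - 1];         // tag appended by crf_test
            total++;
            if (gold.equals(predicted)) correct++;
        }
        System.out.printf("token-level accuracy: %.2f%% (%d/%d)%n",
                100.0 * correct / total, correct, total);
    }
}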
As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it is certainly better than just using a few keywords/templates.
ENDNOTE: this ignores many things and some best practices in solving such tasks, won't be good for academic research, and isn't 100% guaranteed to work, but it is still useful for this and many similar problems as a relatively quick solution.

Convert asterisks to bold or italic tags

I want to convert text between markdown style bold/italics to html bold/italics. Here's an example:
**Bold text** is bold, *italic* text is italicized.
Should go to:
<b>Bold text</b> is bold, <i>italic</i> text is italicized.
I looked elsewhere on SO, but most questions recommended a parsing library. However, I think using a library would be unsuitable for the following reasons:
I'm trying to keep the code base as small as possible
A parser will have too many features!
I want to make it as fast & lightweight as possible
How should I go about converting these tags then?
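For reference, here is a minimal regex-based sketch (Java) of the hand-rolled approach described above; it covers only the simple, non-nested case in the example and ignores escaping and unbalanced asterisks, which is exactly where the answer below warns it gets complicated:

import java.util.regex.Pattern;

public class MarkdownEmphasis {
    // Convert ** first so it is not consumed by the single-asterisk rule.
    private static final Pattern BOLD   = Pattern.compile("\\*\\*(.+?)\\*\\*");
    private static final Pattern ITALIC = Pattern.compile("\\*(.+?)\\*");

    public static String convert(String input) {
        String out = BOLD.matcher(input).replaceAll("<b>$1</b>");
        return ITALIC.matcher(out).replaceAll("<i>$1</i>");
    }

    public static void main(String[] args) {
        System.out.println(convert("**Bold text** is bold, *italic* text is italicized."));
        // prints: <b>Bold text</b> is bold, <i>italic</i> text is italicized.
    }
}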
I have tried to do this myself in the past, thinking exactly as you are and trying to hand-bake the solution. The number of exceptions you have to cater for once you add one or two more markups becomes very complex. I ended up re-inventing the wheel in a much less efficient manner. I opted to adopt one of the parsing libraries and never looked back.
A parser will have too many features!
You can get some parsers that let you define your own markup language. This is what I opted for. I did it in .Net so I can't suggest a Java version.
I want to make it as fast & lightweight as possible
Any parsing library will be more efficient than your own, and unless you're parsing many MBs of data I don't think you'll notice much difference. Their authors have usually spent much more time on making it efficient than you or I would perhaps be willing to.
I know this isn't an "answer" as such, but I hope I save you some time (and delay the onset of gray hair) or point you in the right direction.

Alternative to XSLT?

On my project I have a huge XSLT used to convert some XML files to HTML.
The problem is that this file is growing day by day, and it's hard to read, debug and test.
So I was thinking about moving all the parsing process to Java.
Do you think this is a good idea? If so, what libraries for parsing XML and generating HTML (XML) do you suggest? Will performance be better or worse?
If it's not a good idea, is there any alternative?
Thanks
Randomize
Take a look at CDuce - it is a strictly typed, statically compiled XML processing language.
I once had a client with a similar problem - thousands of lines of XSLT, growing all the time. I spent an hour reading it with increasing incredulity, then rewrote it in 20 lines of XSLT.
Refactoring is often a good idea, and the worse the code is, the more worthwhile refactoring is. But there's no reason to believe that just because the code is bad and in need of refactoring, you need to change to a different programming language. XSLT is actually very good at handling variety and complexity if you know how to use it properly.
It's possible that the code is an accumulation of special handling of special cases, and each new special case discovered results in more rules being added. That's a tough problem to tackle in any language, but XSLT can deal with it better than most, provided you apply your mind all the time to finding abstract general rules that encompass all the special rules, so you only need to code the special rules as exceptions.
I'd consider Velocity as an alternative. I prefer it to XSL-T. The transforms are harder to write than templates, because the latter look exactly like the XML I wish to produce. It's a simple thing to add in the markup to map in the data.
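For what it's worth, here is a minimal Java sketch of the usual Velocity flow (the template name report.vm and the context keys are made up for illustration); the template file looks like the HTML you want, which is what keeps it readable as the project grows:

import java.io.StringWriter;
import java.util.Arrays;
import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class VelocityHtmlDemo {
    public static void main(String[] args) {
        VelocityEngine engine = new VelocityEngine();
        engine.init();

        // report.vm is a hypothetical template, e.g.:
        // <h1>$title</h1> <ul>#foreach($item in $items)<li>$item</li>#end</ul>
        Template template = engine.getTemplate("report.vm");

        VelocityContext context = new VelocityContext();       // the data extracted from your XML
        context.put("title", "Converted report");
        context.put("items", Arrays.asList("first", "second"));

        StringWriter html = new StringWriter();
        template.merge(context, html);                         // render the template with the data
        System.out.println(html);
    }
}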

Algorithm for searching for an image in another image. (Collage)

Is this even possible? I have one huge image, 80 MB, with a lot of tiny pictures. They are tilted and turned around as well. How can I search for an image programmatically? I know how to use Java and C++. How would you go about this?
You might want to look up the Scale Invariant Feature Transform (SIFT) algorithm. Just for example, it's used in a fair number of programs for automatically generating panoramas, to recognize the parts of pictures that match up, despite differences in scaling, tilting, panning, and so on.
Edit: Quite true -- it is patented, and I probably should have mentioned that to start with. In case anybody cares, it's US patent # 6,711,293.
One algorithm I've used before is SIFT. If you're interested in implementing the algorithm for yourself, you can see course notes for CPSC 425 at UBC, which describes in gentle detail how to implement SIFT in MATLAB. If you just want code that does this, take a look at VLFeat, a C library that does SIFT and a number of other algorithms.
Quotation from Jerry Coffin:
Edit: Quite true -- it is patented, and I probably should have mentioned that to start with. In case anybody cares, it's US patent # 6,711,293.
How much do you know about the image? Exactly what it looks like? Do you have a copy of the image and you just need to figure out where in the large image it is?
Anyway, the branch of CS that deals with these kinds of questions is called Computer Vision.
OpenCV and TINA are two open source libraries you might be able to use.
You should probably start out with the simplest ideas and see if they are sufficient for your needs. In the field of pattern matching, the simplest idea is that of template matching. There is an efficient implementation of template matching in OpenCV.
Note that template matching is rotation-variant, meaning that if the template you are trying to match can be rotated in the image you are searching, it won't work unless you pre-rotate the templates; a rough sketch follows below.
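For reference, a rough sketch of template matching with the OpenCV Java bindings (3.x-style package names; collage.png and tile.png are placeholder file names):

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class TemplateMatchDemo {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);          // load the native OpenCV library

        Mat scene = Imgcodecs.imread("collage.png");           // the big collage image
        Mat templ = Imgcodecs.imread("tile.png");              // the small picture to find

        Mat result = new Mat();                                // similarity score for every position
        Imgproc.matchTemplate(scene, templ, result, Imgproc.TM_CCOEFF_NORMED);

        Core.MinMaxLocResult best = Core.minMaxLoc(result);
        System.out.println("best match at " + best.maxLoc + ", score " + best.maxVal);
        // Rotation-variant: to handle tilted tiles, repeat this over pre-rotated
        // copies of the template, or switch to a feature-based method such as SIFT.
    }
}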

How can I index a lot of txt files? (Java/C/C++)

I need to index a lot of text. The search results must give me the name of the files containing the query and all of the positions where the query matched in each file - so, I don't have to load the whole file to find the matching portion. What libraries can you recommend for doing this?
Update: Lucene has been suggested. Can you give me some info on how I should use Lucene to achieve this? (I have seen examples where the search query returned only the matching files.)
For Java, try Lucene.
I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store word position information in order to get the snippets you are looking for. The Token API may help.
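To make the highlighting idea concrete, here is a rough sketch against the older Lucene 3.x contrib-highlighter API (package names and constructors have shifted between Lucene versions, so treat it as an outline rather than copy-paste code):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.Version;

public class SnippetDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", analyzer);
        Query query = parser.parse("yourSearchTerm");

        // Wraps each hit in <b>...</b>; swap in whatever markup your UI needs.
        Highlighter highlighter =
                new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));

        String fileText = "the full text of one matching file, re-read or stored in the index";
        String snippet = highlighter.getBestFragment(analyzer, "contents", fileText);
        System.out.println(snippet);   // null if the query does not occur in this text
    }
}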
It all depends on how you are going to access it. And of course, how many are going to access it. Read up on MapReduce.
If you are going to roll your own, you will need to create an index file which is sort of a map between unique words and a tuple like (file, line, offset); a toy sketch follows below. Of course, you can think of other in-memory data structures like a trie (prefix tree), a Judy array and the like...
Some 3rd party solutions are listed here.
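To illustrate the roll-your-own idea above, here is a toy sketch of such an index in plain Java (naive tokenization; the class and method names are mine, purely for illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyInvertedIndex {

    static class Posting {
        final String file;
        final int line;      // 1-based line number
        final int offset;    // character offset of the word within that line
        Posting(String file, int line, int offset) {
            this.file = file; this.line = line; this.offset = offset;
        }
        @Override public String toString() { return file + ":" + line + ":" + offset; }
    }

    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");
    private final Map<String, List<Posting>> index = new HashMap<>();

    public void addFile(Path path) throws IOException {
        List<String> lines = Files.readAllLines(path);
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = WORD.matcher(lines.get(i));
            while (m.find()) {
                index.computeIfAbsent(m.group().toLowerCase(), k -> new ArrayList<>())
                     .add(new Posting(path.toString(), i + 1, m.start()));
            }
        }
    }

    public List<Posting> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) throws IOException {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        for (String file : args) {                    // pass the .txt files on the command line
            idx.addFile(Paths.get(file));
        }
        System.out.println(idx.lookup("lucene"));     // every (file, line, offset) where the word occurs
    }
}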
Have a look at http://www.compass-project.org/. It can be looked on as a wrapper on top of Lucene; Compass simplifies common usage patterns of Lucene such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub-indexes). Compass also uses built-in optimizations for concurrent commits and merges.
The Overview can give you more info
http://www.compass-project.org/overview.html
I integrated this into a Spring project in no time. It is really easy to use and gives what your users will see as Google-like results.
Lucene - Java
It's open source as well, so you are free to use and deploy it in your application.
As far as I know, the Eclipse IDE help system is powered by Lucene - it is tested by millions.
Also take a look at Lemur Toolkit.
Why don't you try to construct a state machine by reading all files? Transitions between states will be letters, and states will be either final (some files contain the considered word, in which case the list is available there) or intermediate.
As for multiple-word lookups, you'll have to deal with them independently before intersecting the results.
I believe the Boost::Statechart library may be of some help for that matter.
I'm aware you asked for a library, just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).
