I have a Java string like the one below, which has multiple lines and blank spaces. I need to remove all of them so that everything ends up on one line.
These are XML tags; the editor was not allowing me to include the less-than symbol.
<paymentAction>
Authorization
</paymentAction>
Should become
<paymentAction>AUTHORIZATION</paymentAction>
Thanks in advance
Calling theString.replaceAll("\\s+", "") will replace every whitespace sequence with the empty string. Just make sure the text between the tags doesn't contain spaces too, otherwise those will be removed as well.
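A minimal sketch of that call on the snippet from the question (the class and method names are illustrative):

```java
public class StripWhitespace {
    // Collapse every run of whitespace (spaces, tabs, newlines) to nothing.
    static String flatten(String xml) {
        return xml.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String xml = "<paymentAction>\n    Authorization\n</paymentAction>";
        System.out.println(flatten(xml)); // <paymentAction>Authorization</paymentAction>
    }
}
```

Note the caveat above: a value like "Credit Card" would become "CreditCard" with this approach.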
You essentially want to convert the XML you have to Canonical Form. Below is one way of doing it, but it requires you to use that library. If you don't want to depend on external libraries, then another option is to use XSLT.
The Canonicalizer class from the Apache XML Security project:
NOTE: Dealing with non-XML-aware APIs (String.replaceAll()) is generally not recommended, as you end up handling special/exception cases yourself.
This is a start. Probably not enough, but should be in the right direction.
xml.replaceAll(">\\s*", ">").replaceAll("\\s*<", "<");
However, I'm tempted to say there has to be a way to create a document from the XML and then serialize it in canonical form as Pangea suggested.
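A JDK-only sketch of that parse-and-reserialize idea. This is not a full canonicalizer: it just trims whitespace from text nodes and re-serializes, and it ignores namespaces, comments, and attribute ordering.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Reserialize {
    // Trim every text node so surrounding indentation disappears.
    static void trimText(Node n) {
        NodeList kids = n.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.TEXT_NODE) {
                kid.setTextContent(kid.getTextContent().trim());
            } else {
                trimText(kid);
            }
        }
    }

    static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            trimText(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Unlike the blanket replaceAll, this keeps interior spaces in text content intact.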
Related
I am using the OpenPDF library (a fork of iText) to replace placeholders like #{Address_name_1} with real values. My PDF file is not simple, so I use a regular expression to find this placeholder: [{].*?[A].*?[d].*?[d].*?[r].*?[e].*?[s].*?[s].*?[L].*?[i].*?[n].*?[e].*?[1].*?[}]
and do something like
content = MY_REGEXP.replace(content, "Saint-P, Nevskiy pr.");
obj.setData(content.toByteArray(CHARSET)).
The problem is that when my replacement string is too long, it gets cut off at the right edge. Can I somehow make it wrap onto the next line? A naive \n does not work.
PDFs store strings in a different way: there are no line breaks, only individually positioned lines.
So you will need to add several placeholders to your template for replacements that can get long enough, like:
#{Address_name_1_line1}
#{Address_name_1_line2}
#{Address_name_1_line3}
Place them on different lines in your template. The unused placeholders (when the replacement is not long enough) should be replaced with empty strings.
For longer replacements you will need several placeholders. The number of placeholders to use, and how to split the replacement across them, should be determined in code.
If your PDF is too complex to place separate placeholders, then you will need to placeholder everything: all your text content would have to be injected through placeholders, at least if you want to use this approach.
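The splitting step described above could be sketched like this (the class and method names are illustrative; wiring each chunk into the corresponding #{Address_name_1_lineN} replacement is left to the surrounding OpenPDF code):

```java
import java.util.ArrayList;
import java.util.List;

public class PlaceholderSplit {
    // Cut 'value' into at most 'slots' chunks of up to 'width' characters each;
    // unused slots become empty strings so leftover placeholders vanish.
    static List<String> split(String value, int slots, int width) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < slots; i++) {
            int start = i * width;
            parts.add(start < value.length()
                    ? value.substring(start, Math.min(start + width, value.length()))
                    : "");
        }
        return parts;
    }
}
```

A real implementation would rather break at word boundaries than at a fixed width, but the padding-with-empty-strings idea is the same.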
PDF files are NOT text files. Each line is an object with an x/y offset, so placing something on the next line requires a new object at new x/y coordinates. You would need an advanced PDF editing toolkit.
I've got a requirement to take an XML file and replace an existing value with one I generate from user input. It needs to replace only the existing value in the document.
I was looking at the simplest option, SAX (https://docs.oracle.com/javase/tutorial/jaxp/sax/index.html), which is now in the standard Java JDK, but since this is an old project I was wondering if I should use something else, like XSLT (https://docs.oracle.com/javase/tutorial/jaxp/xslt/transformingXML.html).
Can someone please advise the best (easiest) approach for this simple case?
The fact that it's old does not change that it's XML. Use the library best suited to your needs; the standard SAX parser should be fine.
Also, if it's just a matter of replacing text content, why can't you just do a simple textual replace?
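That textual replace could be sketched with the JDK's regex support alone (a sketch, assuming a simple tag with no attributes, no nesting, and a value containing no '<'; the class and method names are illustrative):

```java
import java.util.regex.Matcher;

public class TagValueReplace {
    // Replace the text between <tag> and </tag> with newValue.
    static String replaceTagValue(String xml, String tag, String newValue) {
        // quoteReplacement guards against '$' and '\' in the new value.
        return xml.replaceAll("(<" + tag + ">)[^<]*(</" + tag + ">)",
                "$1" + Matcher.quoteReplacement(newValue) + "$2");
    }
}
```

Anything fancier (attributes, namespaces, repeated tags) is where a real parser earns its keep.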
Thanks for reading :)
I'm trying to search for words like "UTTD_Equip_City_TE" across RTF documents using Lucene. This word appears in two different forms:
«UTTD_Equip_City_TE»,
«UTTD_Equip_City_TE»
I first tried with StandardAnalyzer, but it seems to break down the word into "UTTD", "Equip", "City", and "TE".
Then I tried again using WhitespaceAnalyzer, but it doesn't seem to work (I don't know why).
Could you advise how I should approach this problem? By the way, editing the Lucene source and recompiling it with Ant is not an option :(
Thanks.
EDIT: there are other texts in this document, too. For example:
SHIP TO LESSEE (EQUIPMENT location address): «UTTD_Equip_StreetAddress_TE», «UTTD_Equip_City_TE», «UTTD_Equip_State_MC»
Basically, I'm trying to index RTF files, and inside each RTF file are tables with variables. Variables are wrapped in « and ». I'm trying to search for those variables in the documents. I've tried searching for "«" + string + "»", but it hasn't worked...
This example could give a better picture: http://i.imgur.com/SwlO1.png
Please help.
KeywordAnalyzer tokenizes the entire field as a single string. It sounds like this might be what you're looking for, if the substrings are in different fields within your document.
See: KeywordAnalyzer
Instead, if you are adding the entire content of the document within a single field, and you want to search for a substring with embedded '_' characters within it, then I would think that WhitespaceAnalyzer would work. You stated that it didn't work, though. Can you tell us what the results were when you tried using WhitespaceAnalyzer? And did you use it for both Indexing and Querying?
I see two options here. In both cases you have to build a custom analyzer.
Option 1
Start with StandardTokenizer's grammar file and customize it so that it emits text separated by '_' as a single token (refer to Generating a custom Tokenizer for the new TokenStream API using JFlex/JavaCC). Build your Analyzer using this new Tokenizer along with LowerCaseFilter.
Option 2
Write a custom Analyzer made up of WhitespaceTokenizer and custom TokenFilters. In these TokenFilters you decide how to act on the tokens returned by WhitespaceTokenizer.
Refer to http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/analysis/package-summary.html for more details on analysis
The specification requires validating a simplified XML syntax, primarily the order of tags, with a stack. While the use of standard classes is allowed, I don't think XML-specific tools would be. Should I use String.split, a tokenizer, or something else? The goal is to extract the text within <>, push it if there is no leading /, otherwise try to pop.
Yes you have the right idea, use the stack.
You can write a simple parser using a stack to keep track of tags. Worst case you can use regular expressions.
The basic idea behind parsing simple, well-formed tags is pretty straightforward. You have a stack, and you split the text (a tokenizer sounds good), comparing each token to a list of tags. Every time you encounter an opening tag, you push it. Keep reading until you get to a closing tag, make sure it matches the one on the top of the stack, pop it, and do whatever you want with the content.
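That push/pop loop can be sketched as follows (a sketch assuming simplified tags with no attributes or self-closing forms; the class and method names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagChecker {
    // Captures an optional leading '/' and the tag name between < and >.
    private static final Pattern TAG = Pattern.compile("<(/?)([^<>]+)>");

    // Returns true if every <tag> is closed by a matching </tag> in LIFO order.
    static boolean isBalanced(String doc) {
        Deque<String> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(doc);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (!closing) {
                stack.push(name);          // opening tag: push
            } else if (stack.isEmpty() || !stack.pop().equals(name)) {
                return false;              // mismatched or unexpected close
            }
        }
        return stack.isEmpty();            // leftovers mean unclosed tags
    }
}
```

A regex extracts the tags here; splitting on '<' and '>' by hand would work just as well for the assignment.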
I have an application where I need to parse or tokenize XML and preserve the raw text (e.g. don't parse entities, don't convert whitespace in attributes, keep attribute order, etc.) in a Java program.
I've spent several hours today trying to use StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time attacking this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?
edit: why am I doing this? -- I have a large XML file to which I want to make a small number of localized changes programmatically, and those changes need to be reviewed. It is highly valuable to be able to use a diff tool; if the parser/filter normalizes the XML, then all I see is "red ink" in the diff. The application that produces the XML in the first place isn't something I can easily have changed to produce "canonical XML", if there is such a thing.
I think you might have to generate your own grammar.
Some links:
Parsing XML with ANTLR Tutorial
ANTXR
XPA
http://www.google.com/search?q=antlr+xml
I don't think any XML parser will do what you want. Why? For one thing, the XML spec doesn't enforce attribute ordering. I think you're going to have to parse it yourself, and that is non-trivial.
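As a starting point for "parse it yourself", a bare-bones raw tokenizer can be sketched with the JDK alone. This only splits the input into raw tag and text tokens, leaving entities, attribute order, and whitespace byte-for-byte untouched; it does not handle comments, CDATA, or '>' inside attribute values (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawXmlLexer {
    // A token is either a whole tag <...> or a maximal run of non-tag text.
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]*>|[^<]+", Pattern.DOTALL);

    // Split markup into raw tokens without normalizing anything.
    static List<String> tokens(String xml) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(xml);
        while (m.find()) out.add(m.group());
        return out;
    }
}
```

Because no normalization happens, re-joining the tokens reproduces the original input exactly, which is what keeps the diff clean.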
Why do you have to do this? I'm guessing you have some client 'XML' that enforces or relies on non-standard construction. In that case I'd push back and get that fixed, rather than jump through numerous hoops to try to accommodate it.
I'm not entirely sure that I understand what it is you are trying to do. Have you tried using CDATA regions for the parts of the document you don't want the parser to touch?
Also, relying on attribute order is not a good idea; if I remember the XML standard correctly, order is never to be expected.
It sounds like you are dealing with some malformed XML and that it would be easier to first turn it into proper XML.