I need to translate some of the content of HTML pages. I have a lot of HTML documents as a list of files, and a map with translations, like this:
List<File> files
Map<String, String> translations
Only strings in specific tags (p, h1..h6, li) have to be translated. I want to end up with the same files as at the beginning, but with the strings replaced.
Two solutions that don't work:
Plain string replacement - because I don't want to translate strings inside comments or in JavaScript; another problem is that one original string can be a substring of another original string.
Parsing libraries like Jsoup - because they clean up and fix the DOM structure, and I want the HTML structure left unmodified.
Any solutions?
You pretty much have to use a proper html parser (which fixes the dom structure), because otherwise there's no way to tell where an element starts and where it ends. There are all sorts of special cases and different types of broken html and if you want to handle them all, you are basically implementing a full html parser.
The only other way I can think of (and which is often used) is to use placeholders in the original files, such as <h1>${title}</h1> <p>${introduction}</p> etc, and find and replace them directly, but I guess that would require a lot of work to change the files if you don't already have them in this form.
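To illustrate the placeholder approach, here is a minimal sketch, assuming the files already contain `${key}` markers and the translations map is keyed by those names. The class name `PlaceholderFiller` and the set of characters allowed in a key are assumptions made for this example only.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderFiller {
    // Matches ${name}; the name is captured in group 1 (allowed characters are an assumption).
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z0-9_.]+)\\}");

    static String fill(String template, Map<String, String> translations) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unknown keys are left untouched so that missing translations stay visible.
            String value = translations.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<h1>${title}</h1> <p>${introduction}</p>";
        System.out.println(fill(html, Map.of("title", "Willkommen", "introduction", "Hallo")));
    }
}
```

Everything outside the placeholders is copied through byte for byte, so the HTML structure is not touched. Values are inserted verbatim, so they would need HTML escaping if they can contain markup characters.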
I am having a problem with my logic. Here's the setup.
input (span objects) ---> pre-processing (my function)
---> generic spanToHtml converter --> final html
My pre-processing function essentially looks for my custom spans, converts them into html and inserts them into the original span so the spanToHtml converter will work as usual. The final html code will then contain the html I generated in my pre-processing function.
However, there is one problem. Like any good web code, spanToHtml sanitizes the input and escapes it. Therefore, any html tags I added in my pre-processing function get escaped and rendered incorrectly.
I don't know how to approach this problem. I temporarily have a hacky solution where instead of using html tags, my pre-processing function will insert arbitrary "tag" strings. Then after I call spanToHtml, I do a post-processing replacing all those tags with actual html tags. However, that's a terrible solution.
Any suggestions are appreciated!
In my Java webapp, I create summary text from long HTML text. In the process of truncation, the HTML fragments in the string often break, producing an HTML string with invalid & broken fragments. Like this example HTML string:
Visit this link <img src="htt
Is there any Java library to deal with this better so that such broken fragments as above are avoided ?
Or could I let this be included in the HTML pages & somehow deal with this using client side code ?
Since browsers will usually be able to deal with almost any garbage you feed into it (if it ain't XHTML...), if the only thing that actually happens with the input (assuming it's valid HTML of any kind) is being sliced, then the only thing you have to worry about is to actually get rid of invalid opening tags; you won't be able to distinguish broken 'endings' of tags, since they, in themselves, ain't special in any way. I'd just take a slice I've generated and parse it from the end; if I encounter a stray '<', I'd get rid of everything after it. Likewise, I'd keep track of the last opened tag - if the next close after it wasn't closing that exact tag, it's likely the closing tag got out, so I'd insert it.
This would still generate a lot of garbage, but would at least fix some rudimentary problems.
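The first step described above (scanning from the end and dropping everything after a stray '<') could be sketched like this. The class and method names are made up for the example, and it assumes a '>' never appears inside an attribute value.

```java
final class TruncationCleanup {
    /** Removes a tag that was cut off in the middle at the end of the fragment. */
    static String dropDanglingTag(String fragment) {
        int lastOpen = fragment.lastIndexOf('<');
        int lastClose = fragment.lastIndexOf('>');
        // A '<' with no '>' after it means the cut happened inside a tag.
        return lastOpen > lastClose ? fragment.substring(0, lastOpen) : fragment;
    }

    public static void main(String[] args) {
        System.out.println(dropDanglingTag("Visit this link <img src=\"htt"));
    }
}
```

This only handles the half-written tag at the end; elements that were opened earlier and never closed are still left open.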
A better way would be to manage a stack of opened/closed tags and generate/remove the needed/broken/unnecessary ones as they emerge. A stack is a proper solution since HTML tags mustn't 'cross' [by the spec, AFAIR it's this way from HTML 4], i.e. <span><div></span></div> isn't valid.
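A rough sketch of the stack idea, under several assumptions: tags are found with a simple regex rather than a real tokenizer, the list of void elements is only a subset, and stray closing tags are left in place instead of being removed. `TagBalancer` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagBalancer {
    // Group 1: "/" for a closing tag; group 2: tag name; group 3: "/" for self-closing syntax.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>");
    // Void elements never take a closing tag (a subset, for illustration only).
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input", "meta", "link");

    /** Appends closing tags for every element that is still open at the end of the fragment. */
    static String closeOpenTags(String fragment) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(fragment);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (VOID.contains(name) || selfClosing) {
                continue;
            }
            if (!closing) {
                open.push(name);
            } else if (name.equals(open.peek())) {
                open.pop();
            }
            // Any other closing tag is a stray one; this sketch leaves it in place.
        }
        StringBuilder out = new StringBuilder(fragment);
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(closeOpenTags("<div><p>Hello <b>wor"));
    }
}
```

Because the regex knows nothing about comments or scripts, tags that appear inside them are counted as well, which is one reason the parse-first approach is more reliable.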
A much better way would be to splice the document after first parsing it as SGML/HTML/XML (depends on the exact HTML doctype) - then you could just remove the nodes, without damaging the structure.
Note that you can't actually know if a tag is correct without providing an exact algorithm you use to generate this 'garbled' content.
I used owasp-java-html-sanitizer to fix those broken fragments to generate safe HTML markup from Java.
PolicyFactory html_sanitize_policy = Sanitizers.LINKS.and(Sanitizers.IMAGES);
String safeHTML = html_sanitize_policy.sanitize(htmlString);
This seemed to be easiest of all solutions I came across.
I have a Java string like the one below, which has multiple lines and blank spaces. I need to remove all of them so that it becomes one line.
These are xml tags and the editor is not allowing to include less than symbol
<paymentAction>
Authorization
</paymentAction>
Should become
<paymentAction>AUTHORIZATION</paymentAction>
Thanks in advance
Calling theString.replaceAll("\\s+","") will replace all whitespace sequences with the empty string. Just be sure that the text between the tags doesn't contain spaces, otherwise they'll get removed too.
You essentially want to convert the XML you have to Canonical Form. Below is one way of doing it but it requires you to use that library. If you don't want to depend upon external libraries then another option for you is to use XSLT.
The Canonicalizer class at Apache XML Security project:
NOTE: Dealing with non-XML-aware APIs (String.replaceAll()) is not generally recommended, as you end up dealing with special/exception cases.
This is a start. Probably not enough, but should be in the right direction.
xml.replaceAll(">\\s*", ">").replaceAll("\\s*<", "<");
However, I'm tempted to say there has to be a way to create a document from the XML and then serialize it in canonical form as Pangea suggested.
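One way to go through a document instead of string replacement, using only the JDK's built-in XML APIs, might look like the sketch below. Note that this is not real canonicalization (C14N): it simply trims every text node and drops the ones that are only whitespace, which is an assumption about what "one line" should mean here. `XmlCompactor` is a made-up name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlCompactor {
    /** Parses the XML, trims all text nodes, and serializes it back without a declaration. */
    static String compact(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            trimText(doc.getDocumentElement());
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Trims each text node in place and removes the ones that become empty.
    private static void trimText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE) {
                String trimmed = child.getNodeValue().trim();
                if (trimmed.isEmpty()) {
                    node.removeChild(child);
                } else {
                    child.setNodeValue(trimmed);
                }
            } else {
                trimText(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(compact("<paymentAction>\n   Authorization\n</paymentAction>"));
    }
}
```

Spaces inside the text itself (for example "Some Value") survive, which is the main advantage over replaceAll("\\s+",""). Mixed content, where text and child elements are interleaved, would lose the spaces next to the child elements, so this is only suitable for data-style XML.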
I would like to parse an HTML form and pull out the filenames of any embedded images.
So the string could look like:
{
...
random HTML content
image1.png
more random HTML content
image3.png
...
}
From the above I would like to write a function in Java that returns to me
{image1.png, image3.png}.
I have a regular expression that returns to me only the last image name (image3.png) but it disregards previous image names. How can I capture all of them using regex?
All / any help would be appreciated.
https://stackoverflow.com/a/2059614/684934 gives a good hint. More specifically, you're probably looking for something like [a-zA-Z0-9_\-]+\.(png|jpg|gif|jpeg|tif)
Note, however, that this is regex and is only looking for sequences of characters. If you are looking at a site that serves up dynamic images using servlets, for example, and the resource URI doesn't end with a normal image file extension (ending in .jsp or .do instead, say), then the regex will completely fail. It will also pick up any "image names" from any sort of text that happens to match, even if it does not actually represent an image on the page.
To do the job properly, you will need to use some sort of DOM and traverse the <img> elements. (And the <button> elements, which may be of type image... there are probably more tags that can have images.)
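For the regex route, getting only the last name usually means the pattern was matched once against the whole string. A minimal sketch that collects every match by calling find() in a loop, using the pattern suggested above (the class name `ImageNameFinder` is made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImageNameFinder {
    private static final Pattern IMAGE_NAME =
            Pattern.compile("[a-zA-Z0-9_\\-]+\\.(png|jpg|gif|jpeg|tif)");

    /** Returns every substring that looks like an image file name, in order of appearance. */
    static List<String> findAll(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = IMAGE_NAME.matcher(html);
        // Each find() call continues the search after the previous match.
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findAll("<img src=\"image1.png\"> text <img src='img/image3.png'>"));
    }
}
```

The same caveats apply: this matches character sequences, not actual <img> elements.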
I have an Android application which grabs some data from an external XML source. I've stripped out some HTML from one of the XML elements, but it's in the format:
<p class="x">Some text...</p>
<p>Some more text</p>
<p>Some final text</p>
I want to extract the middle paragraph text, how can I do this? Would a regular expression be the best way? I don't really want to start including external HTML parsing libraries.
RegEx match open tags except XHTML self-contained tags
So, I'll ask the question that wraps up the linked-to answer: have you tried using an XML parser instead?
You might get some ideas from some of the other answers there, too, but I'd try to avoid the regex path. As Macarse suggested, clean this up on the server if you can. If not, wrap those three <p> elements in a single root element and parse it using SAX or something, paying attention to the 2nd paragraph element.
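A sketch of the wrap-and-parse suggestion, assuming the fragment is well-formed XML once wrapped. The wrapper element name `root` and the class name `SecondParagraph` are arbitrary choices for this example. It uses the SAX classes from javax.xml.parsers and org.xml.sax, so no external library is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SecondParagraph {
    /** Returns the text of the second <p> element in the fragment, or "" if there is none. */
    static String extract(String fragment) {
        // A single root element is required for the fragment to parse as XML.
        String wrapped = "<root>" + fragment + "</root>";
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int paragraph = 0;
            private boolean inSecond = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("p") && ++paragraph == 2) {
                    inSecond = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("p")) {
                    inSecond = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inSecond) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wrapped)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<p class=\"x\">Some text...</p>\n<p>Some more text</p>\n<p>Some final text</p>"));
    }
}
```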
If it's simple, just do a regex.
If you are getting XML from an external source that you own, I would parse it there.
just doing a split: http://developer.android.com/reference/java/lang/String.html#split(java.lang.String)
on "</p><p>" and taking the second entry in the returned array would actually do it pretty quickly
The regex would probably look something like: .*?>(.*?)<.*
And you access the grouped content by calling group(1) on the Matcher object.
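As a sketch of the group(1) usage: the pattern below is a more specific one than the one above, swapped in for illustration so that the n-th paragraph can be picked out with repeated find() calls. The class name `ParagraphRegex` is made up, and the approach assumes the paragraphs are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphRegex {
    // Group 1 captures the content between <p ...> and the next </p>.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL);

    /** Returns the content of the n-th paragraph (1-based), or null if there are fewer. */
    static String nth(String html, int n) {
        Matcher m = PARAGRAPH.matcher(html);
        for (int i = 1; m.find(); i++) {
            if (i == n) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Some text...</p>\n<p>Some more text</p>\n<p>Some final text</p>";
        System.out.println(nth(html, 2));
    }
}
```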
If you are going to parse an XML file downloaded from a website, then this has nothing to do with Android.