Avoid leading numbers in xml element name - java

I would like to have a Java method for replacing leading numbers in an XML element name. For example, <1396-tt5m>25K</1396-tt5m> needs to be transformed to <a-tt5m>25K</a-tt5m>. Please take a look at my method for this:
public static String removeLeadNumbersFromXMLTagElements(String xml) throws TransformerException {
Pattern p = Pattern.compile("(<[^>]*?[^[0-9]][^>]*?>)");
Matcher m = p.matcher(xml);
StringBuffer result = new StringBuffer();
while (m.find()) {
String replace = m.group().replaceAll("[^[0-9]]+", "a");
m.appendReplacement(result, replace);
}
m.appendTail(result);
return result.toString();
}
But the result of my method is: <a-ttam>25K</a-ttam>. Could you please help with the correct regex? Thank you in advance.

Try using this:
public static String removeLeadNumbersFromXMLTagElements(String xml) throws TransformerException {
// Digits must directly follow "<" or "</"; [^>]* keeps the match inside a single tag.
Pattern p = Pattern.compile("(</?)[0-9]+([^>]*>)");
Matcher m = p.matcher(xml);
StringBuffer result = new StringBuffer();
while (m.find()) {
String replace = m.group(1) + "a" + m.group(2);
m.appendReplacement(result, replace);
}
m.appendTail(result);
return result.toString();
}

So this is not exactly what you wanted, but it should solve the problem. It will get the tag and then remove any leading digits, but nothing else. This code replaces your while loop. Your code is fine for identifying tags, but (as you noted) it is replacing all digits, not just the leading ones.
while (m.find()) {
//System.out.println(m.group());
String work = m.group();
String replace = m.group();
if (work.substring(0, 2).equals("</")) {
//System.out.println("end tag");
if (work.length() > 2 && Character.isDigit(work.charAt(2))) {
replace = "</a";
int i = 3;
while (i < work.length() && Character.isDigit(work.charAt(i))) {
i++;
}
replace += work.substring(i);
}
} else if (work.substring(0, 1).equals("<")) {
//System.out.println("begin tag");
if (work.length() > 1 && Character.isDigit(work.charAt(1))) {
replace = "<a";
int i = 2;
while (i < work.length() && Character.isDigit(work.charAt(i))) {
i++;
}
replace += work.substring(i);
}
}
m.appendReplacement(result, replace);
}

My solution: I finally found that I can use String replace = m.group().replaceFirst("[^[0-9]]+", "a") instead of replaceAll. That also works!
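For comparison, the same renaming can be sketched as a single replaceAll call. This is only an illustration, not any poster's method: the class and method names are made up, and it assumes every "<" followed by digits starts a tag, so it would also rewrite such a sequence inside a comment or CDATA section.

```java
class LeadingDigitRenamer {
    // Replaces a run of digits that directly follows "<" or "</" with "a".
    // Digits later in the tag name or in the element text are left alone.
    static String renameLeadingDigits(String xml) {
        return xml.replaceAll("(</?)[0-9]+", "$1a");
    }

    public static void main(String[] args) {
        // prints <a-tt5m>25K</a-tt5m>
        System.out.println(renameLeadingDigits("<1396-tt5m>25K</1396-tt5m>"));
    }
}
```

Because the digits are anchored to the opening "<" or "</", the 5 in tt5m is never touched.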

Related

How to get String between last two underscore

I have a string "abcde-abc-db-tada_x12.12_999ZZZ_121121.333"
The result I want should be 999ZZZ
I have tried using:
private static String getValue(String myString) {
Pattern p = Pattern.compile("_(\\d+)_1");
Matcher m = p.matcher(myString);
if (m.matches()) {
System.out.println(m.group(1)); // Should print 999ZZZ
}
else {
System.out.println("not found");
}
}
If you want to continue with a regex based approach, then use the following pattern:
.*_([^_]+)_.*
This will greedily consume up to and including the second-to-last underscore. Then it will consume and capture 999ZZZ.
Code sample:
String name = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
Pattern p = Pattern.compile(".*_([^_]+)_.*");
Matcher m = p.matcher(name);
if (m.matches()) {
System.out.println(m.group(1)); // Should print 999ZZZ
} else {
System.out.println("not found");
}
Using String.split?
String given = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String [] splitted = given.split("_");
String result = splitted[splitted.length-2];
System.out.println(result);
Apart from split you can use substring as well:
String s = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String ss = (s.substring(0,s.lastIndexOf("_"))).substring((s.substring(0,s.lastIndexOf("_"))).lastIndexOf("_")+1);
System.out.println(ss);
OR,
String s = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String arr[] = s.split("_");
System.out.println(arr[arr.length-2]);
To get the text between the last two underscore characters, you first need to find the indexes of those two underscores, which is easy using lastIndexOf:
String s = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String r = null;
int idx1 = s.lastIndexOf('_');
if (idx1 != -1) {
int idx2 = s.lastIndexOf('_', idx1 - 1);
if (idx2 != -1)
r = s.substring(idx2 + 1, idx1);
}
System.out.println(r); // prints: 999ZZZ
This avoids regex matching and array allocation altogether, so it should be at least as fast as the regex and split solutions.
I misread the logic of the code in the question at first, and in the meantime some great regex answers appeared, so here is my attempt using only methods of the String class. It introduces a few variables for readability; it could of course be written more compactly:
String s = "abcde-abc-db-ta__dax12.12_999ZZZ_121121.333";
int indexOfLastUnderscore = s.lastIndexOf("_");
int indexOfOneBeforeLastUnderscore = s.lastIndexOf("_", indexOfLastUnderscore - 1);
if(indexOfLastUnderscore != -1 && indexOfOneBeforeLastUnderscore != -1) {
String sub = s.substring(indexOfOneBeforeLastUnderscore + 1, indexOfLastUnderscore);
System.out.println(sub);
}

Java Regex: How to detect the index of the unmatched char in a complex regex

I'm using regex to control an input and I want to get the exact index of the wrong char.
My regex is :
^[A-Z]{1,4}(/[1-2][0-9][0-9][0-9][0-1][0-9])?
If I type the following input :
DATE/201A08
Then matcher.group() (using the lookingAt() method) returns "DATE" instead of "DATE/201", so I can't tell that the wrong character is at position 9.
If I read this right, you can't do this using only one regex.
^[A-Z]{1,4}(/[1-2][0-9][0-9][0-9][0-1][0-9])? assumes either a String starting with 1 to 4 characters followed by nothing, or followed by / and exactly 6 digits. So it correctly parses your input as "DATE" as it is valid according to your regex.
Try to split this into two checks. First check if it's a valid DATE
Then, if there's an actual / part, check this against the non-optional pattern.
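The two checks could be sketched as follows. The class and method names are hypothetical, and the sketch assumes the whole input has to be valid (the question itself used lookingAt(), which tolerates trailing text). It only answers whether the input is valid, not where it goes wrong.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepCheck {
    private static final Pattern PREFIX = Pattern.compile("[A-Z]{1,4}");
    // Same date part as in the question, but no longer optional.
    private static final Pattern DATE_PART = Pattern.compile("/[1-2][0-9]{3}[0-1][0-9]");

    static boolean isValid(String input) {
        // First check: the input must start with 1 to 4 capital letters.
        Matcher prefix = PREFIX.matcher(input);
        if (!prefix.lookingAt()) {
            return false;
        }
        String rest = input.substring(prefix.end());
        if (rest.isEmpty()) {
            return true; // nothing follows the letters
        }
        // Second check: whatever follows must be a complete "/" part.
        return DATE_PART.matcher(rest).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("DATE/201A08")); // prints false
    }
}
```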
You want to know whether the entire pattern matched, and when not, how far it matched.
This is where a single regex falls short. A match must succeed for group() to return anything, and when it succeeds on only part of the input, you cannot tell whether the whole pattern was matched.
The sensible thing to do is split the matching.
public class ProgressiveMatch {
private final String[] regexParts;
private String group;
ProgressiveMatch(String... regexParts) {
this.regexParts = regexParts;
}
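// Variant 1: a single pattern of the form (...)?(...)?...
// (named differently from variant 2 below so that both can live in one class)
public boolean lookingAtWithOptionalGroups(String text) {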
StringBuilder sb = new StringBuilder();
sb.append('^');
for (int i = 0; i < regexParts.length; ++i) {
String part = regexParts[i];
sb.append("(");
sb.append(part);
sb.append(")?");
}
Pattern pattern = Pattern.compile(sb.toString());
Matcher m = pattern.matcher(text);
if (m.lookingAt()) {
boolean all = true;
group = "";
for (int i = 1; i <= regexParts.length; ++i) {
if (m.group(i) == null) {
all = false;
break;
}
group += m.group(i);
}
return all;
}
group = null;
return false;
}
// lookingAt with multiple patterns
public boolean lookingAt(String text) {
for (int n = regexParts.length; n > 0; --n) {
// Match for n parts:
StringBuilder sb = new StringBuilder();
sb.append('^');
for (int i = 0; i < n; ++i) {
String part = regexParts[i];
sb.append(part);
}
Pattern pattern = Pattern.compile(sb.toString());
Matcher m = pattern.matcher(text);
if (m.lookingAt()) {
group = m.group();
return n == regexParts.length;
}
}
group = null;
return false;
}
public String group() {
return group;
}
}
public static void main(String[] args) {
// ^[A-Z]{1,4}(/[1-2][0-9][0-9][0-9][0-1][0-9])?
ProgressiveMatch match = new ProgressiveMatch("[A-Z]{1,4}", "/",
"[1-2]", "[0-9]", "[0-9]", "[0-9]", "[0-1]", "[0-9]");
boolean matched = match.lookingAt("DATE/201A08");
System.out.println("Matched: " + matched);
System.out.println("Up to: " + match.group());
}
One could make a small DSL in java, like:
ProgressiveMatch match = ProgressiveMatchBuilder
.range("A", "Z", 1, 4)
.literal("/")
.range("1", "2")
.range("0", "9", 3, 3)
.range("0", "1")
.range("0", "9")
.match();
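A different technique, not taken from the answers above, is to feed the matcher growing prefixes of the input and ask Matcher.hitEnd() whether the engine ran out of input. The first prefix that neither matches nor hits the end ends in the offending character. The sketch below is an assumption-laden illustration: the class and method names are made up, it requires the whole input to match, and it reports a 0-based index.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstBadChar {
    // Returns the 0-based index of the first character that rules out a full
    // match, input.length() if the input is valid so far but incomplete,
    // or -1 if the whole input matches.
    static int firstBadIndex(Pattern pattern, String input) {
        Matcher m = pattern.matcher("");
        for (int len = 1; len <= input.length(); len++) {
            m.reset(input.substring(0, len));
            // matches(): this prefix is already a complete match.
            // hitEnd(): the engine reached the end of the prefix while trying,
            // so more input could still turn it into a match.
            if (!m.matches() && !m.hitEnd()) {
                return len - 1;
            }
        }
        return pattern.matcher(input).matches() ? -1 : input.length();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("[A-Z]{1,4}(/[1-2][0-9][0-9][0-9][0-1][0-9])?");
        // prints 8, the 0-based index of 'A' (the ninth character)
        System.out.println(firstBadIndex(p, "DATE/201A08"));
    }
}
```

This re-runs the match once per prefix, which is fine for short inputs such as form fields but would be wasteful on long text.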

Change text between tags to uppercase java

I need to change text between tags to upper case in a string and then print the whole string with the changed letters. So
"asdasd <upcase>something</upcase> dfldkflskdf <upcase>stuff</upcase>skdlskd" would become:
"asdasd SOMETHING dfldkflskdf STUFF skdlskd"
So far I got this, but it returns the text only from the first occurrence of the tags.
static String tags (String word)
{
String changed = word;
while (changed.indexOf("<upcase>" ) >= 0)
{
changed = (changed.substring(changed.indexOf("<upcase>")+"<upcase>".length(),changed.indexOf("</upcase>")));
}
return changed.toUpperCase();
}
First, since you can have more than one string between the upcase tags, you need to return a collection of Strings. I chose a List.
Second, you need to use the String indexOf(string, pos) method, so you don't lose your place in the XML string.
Here's one way to do what you want.
public List<String> tags(String xml) {
List<String> list = new ArrayList<String>();
String startTag = "<upcase>";
String endTag = "</upcase>";
int sPos = xml.indexOf(startTag);
while (sPos >= 0) {
int ePos = xml.indexOf(endTag, sPos + startTag.length());
if (ePos < 0) {
break; // unmatched <upcase>: stop instead of searching again from a bogus offset
}
list.add(xml.substring(sPos + startTag.length(), ePos)
.toUpperCase());
sPos = xml.indexOf(startTag, ePos + endTag.length());
}
return list;
}
All right, here is an updated version:
public static String tags(String xml) {
final String upcaseTagRegex = "<upcase>(.*?)</upcase>";
String result = xml;
Pattern pattern = Pattern.compile(upcaseTagRegex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher matcher = pattern.matcher(xml);
while(matcher.find()) {
result = result.replaceFirst(upcaseTagRegex, matcher.group(1).toUpperCase());
}
return result;
}
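The regex version can also be written as a single pass with appendReplacement, which avoids re-scanning the string on every iteration and keeps the CASE_INSENSITIVE and DOTALL flags in effect for the replacement step as well. This is a sketch under my own naming (UpcaseTags is made up); Locale.ROOT is an added assumption to keep the uppercasing independent of the default locale.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UpcaseTags {
    private static final Pattern UPCASE =
            Pattern.compile("<upcase>(.*?)</upcase>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String tags(String text) {
        Matcher m = UPCASE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String upper = m.group(1).toUpperCase(Locale.ROOT);
            // quoteReplacement keeps any "$" or "\" in the tagged text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(upper));
        }
        m.appendTail(out); // copy whatever follows the last tag
        return out.toString();
    }

    public static void main(String[] args) {
        // prints asdasd SOMETHING dfldkflskdf STUFFskdlskd
        System.out.println(tags("asdasd <upcase>something</upcase> dfldkflskdf <upcase>stuff</upcase>skdlskd"));
    }
}
```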

How to get with JAVA a specific value for one substring from string?

I have ONE string field which is in format:
"TransactionID=30000001197169 ExecutionStatus=6
additionalCurrency=KMK
pin= 0000"
So they are not separated with ; or , and not even with a single blank space.
I want to get value for Execution Status and put it in some field?
How to achieve this?
Thanks for help
This works, but I am not sure it is optimal. It just solves your problem.
String s = "TransactionID=30000001197169ExecutionStatus=6additionalCurrency=KMKpin=0000";
if(s!=null && s.contains("ExecutionStatus="))
{
String s1[] = s.split("ExecutionStatus=");
if(s1!=null && s1.length>1)
{
String line = s1[1];
String pattern = "[0-9]+";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find( )) {
System.out.println("Match");
System.out.println("Found value: " + m.group(0) );
} else {
System.out.println("NO MATCH");
}
}
}
In your example they are indeed separated by blanks, but the following should work without blanks, too. Assuming your String is stored in String arguments:
String executionStatus = null;
String[] anArray = arguments.split("=");
// stop one short of the end so that anArray[i + 1] always exists
for (int i = 0; i < anArray.length - 1; i++) {
if (anArray[i].contains("ExecutionStatus")) {
executionStatus = anArray[++i].replace("additionalCurrency", "").trim();
}
}
Check if it contains() ExecutionStatus=
If yes, then split the string on ExecutionStatus=
Now take the second string from the array, find the first occurrence of a non-digit char, and use substring()
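The three steps above could be sketched like this. The class and method names are hypothetical, and returning null when no digits follow the key is my own choice rather than something the answer specifies.

```java
import java.util.regex.Pattern;

class ExecutionStatusParser {
    static String extractStatus(String input) {
        String key = "ExecutionStatus=";
        // Step 1: make sure the key is present at all.
        if (input == null || !input.contains(key)) {
            return null;
        }
        // Step 2: split on the key; index 1 is everything after its first occurrence.
        String rest = input.split(Pattern.quote(key), 2)[1];
        // Step 3: find the first non-digit character and cut there.
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        return end == 0 ? null : rest.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "TransactionID=30000001197169ExecutionStatus=6additionalCurrency=KMKpin=0000";
        System.out.println(extractStatus(s)); // prints 6
    }
}
```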
Assuming all that white space is present in your string, this works.
String str = "\"TransactionID=30000001197169 ExecutionStatus=6\n" +
" additionalCurrency=\"KMK\"\n" +
" pin= \"0000\"\"";
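int keyPos = str.indexOf("ExecutionStatus=");
int status = 0;
if (keyPos >= 0) { // test the raw index; adding the key length first would hide a -1
int start = keyPos + "ExecutionStatus=".length();
// trim() below removes the whitespace between the value and the next key
String strStatus = str.substring(start, str.indexOf("additionalCurrency="));
try {
status = Integer.parseInt(strStatus.trim());
} catch (NumberFormatException e) {
// value was not numeric; status stays 0
}
}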
At the risk of attracting "... and now you have two problems!" comments, this is probably easiest done with regexes (str is the String defined above):
Pattern p = Pattern.compile("ExecutionStatus\\s*=\\s*(\\d+)"); // allow whitespace around the equals sign for safety; the capturing group holds the digits of the status
Matcher m = p.matcher(str);
String status = m.find() ? m.group(1) : null;

How to truncate a HTML fragment to a given length(for preview) in Java? [duplicate]

Is there any utility (or sample source code) that truncates HTML (for preview) in Java? I want to do the truncation on the server and not on the client.
I'm using HTMLUnit to parse HTML.
UPDATE:
I want to be able to preview the HTML, so the truncator would maintain the HTML structure while stripping out the elements after the desired output length.
I've written another java version of truncateHTML. This function truncates a string up to a number of characters while preserving whole words and HTML tags.
public static String truncateHTML(String text, int length, String suffix) {
// if the plain text is shorter than the maximum length, return the whole text
if (text.replaceAll("<.*?>", "").length() <= length) {
return text;
}
String result = "";
boolean trimmed = false;
if (suffix == null) {
suffix = "...";
}
/*
* This pattern creates tokens, where each line starts with the tag.
* For example, "One, <b>Two</b>, Three" produces the following:
* One,
* <b>Two
* </b>, Three
*/
Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");
/*
* Checks for an empty tag, for example img, br, etc.
*/
Pattern emptyTagPattern = Pattern.compile("^<\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param).*>$");
/*
* Modified the pattern to also include H1-H6 tags
* Checks for closing tags, allowing leading and ending space inside the brackets
*/
Pattern closingTagPattern = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");
/*
* Modified the pattern to also include H1-H6 tags
* Checks for opening tags, allowing leading and ending space inside the brackets
*/
Pattern openingTagPattern = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");
/*
* Find entities such as &gt; ...
*/
Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};)");
// splits all html-tags to scanable lines
Matcher tagMatcher = tagPattern.matcher(text);
int numTags = tagMatcher.groupCount();
int totalLength = suffix.length();
List<String> openTags = new ArrayList<String>();
boolean proposingChop = false;
while (tagMatcher.find()) {
String tagText = tagMatcher.group(1);
String plainText = tagMatcher.group(2);
if (proposingChop &&
tagText != null && tagText.length() != 0 &&
plainText != null && plainText.length() != 0) {
trimmed = true;
break;
}
// if there is any html-tag in this line, handle it and add it (uncounted) to the output
if (tagText != null && tagText.length() > 0) {
boolean foundMatch = false;
// if it's an "empty element" with or without xhtml-conform closing slash
Matcher matcher = emptyTagPattern.matcher(tagText);
if (matcher.find()) {
foundMatch = true;
// do nothing
}
// closing tag?
if (!foundMatch) {
matcher = closingTagPattern.matcher(tagText);
if (matcher.find()) {
foundMatch = true;
// delete tag from openTags list
String tagName = matcher.group(1);
openTags.remove(tagName.toLowerCase());
}
}
// opening tag?
if (!foundMatch) {
matcher = openingTagPattern.matcher(tagText);
if (matcher.find()) {
// add tag to the beginning of openTags list
String tagName = matcher.group(1);
openTags.add(0, tagName.toLowerCase());
}
}
// add html-tag to result
result += tagText;
}
// calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
int contentLength = plainText.replaceAll("&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};", " ").length();
if (totalLength + contentLength > length) {
// the number of characters which are left
int numCharsRemaining = length - totalLength;
int entitiesLength = 0;
Matcher entityMatcher = entityPattern.matcher(plainText);
while (entityMatcher.find()) {
String entity = entityMatcher.group(1);
if (numCharsRemaining > 0) {
numCharsRemaining--;
entitiesLength += entity.length();
} else {
// no more characters left
break;
}
}
// keep us from chopping words in half
int proposedChopPosition = numCharsRemaining + entitiesLength;
int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition-1);
if (endOfWordPosition == -1) {
endOfWordPosition = plainText.length();
}
int endOfWordOffset = endOfWordPosition - proposedChopPosition;
if (endOfWordOffset > 6) { // chop the word if it's extra long
endOfWordOffset = 0;
}
proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
if (plainText.length() >= proposedChopPosition) {
result += plainText.substring(0, proposedChopPosition);
proposingChop = true;
if (proposedChopPosition < plainText.length()) {
trimmed = true;
break; // maximum length is reached, so get off the loop
}
} else {
result += plainText;
}
} else {
result += plainText;
totalLength += contentLength;
}
// if the maximum length is reached, get off the loop
if(totalLength >= length) {
trimmed = true;
break;
}
}
for (String openTag : openTags) {
result += "</" + openTag + ">";
}
if (trimmed) {
result += suffix;
}
return result;
}
I think you're going to need to write your own XML parser to accomplish this. Pull out the body node, add nodes until binary length < some fixed size, and then rebuild the document. If HTMLUnit doesn't create semantic XHTML, I'd recommend tagsoup.
If you need an XML parser/handler, I'd recommend XOM.
There is a PHP function that does it here: http://snippets.dzone.com/posts/show/7125
I've made a quick and dirty Java port of the initial version, but there are subsequent improved versions in the comments that could be worth considering (especially one that deals with whole words):
public static String truncateHtml(String s, int l) {
Pattern p = Pattern.compile("<[^>]+>([^<]*)");
int i = 0;
List<String> tags = new ArrayList<String>();
Matcher m = p.matcher(s);
while(m.find()) {
if (m.start(0) - i >= l) {
break;
}
String t = StringUtils.split(m.group(0), " \t\n\r\0\u000B>")[0].substring(1);
if (t.charAt(0) != '/') {
tags.add(t);
} else if ( tags.get(tags.size()-1).equals(t.substring(1))) {
tags.remove(tags.size()-1);
}
i += m.start(1) - m.start(0);
}
Collections.reverse(tags);
return s.substring(0, Math.min(s.length(), l+i))
+ ((tags.size() > 0) ? "</"+StringUtils.join(tags, "></")+">" : "")
+ ((s.length() > l) ? "\u2026" : "");
}
Note: You'll need Apache Commons Lang for the StringUtils.join().
I can offer you a Python script I wrote to do this: http://www.ellipsix.net/ext-tmp/summarize.txt. Unfortunately I don't have a Java version, but feel free to translate it yourself and modify it to suit your needs if you want. It's not very complicated, just something I hacked together for my website, but I've been using it for a little more than a year and it generally seems to work pretty well.
If you want something robust, an XML (or SGML) parser is almost certainly a better idea than what I did.
I found this blog post: dencat: Truncating HTML in Java
It contains a Java port of Python's Django template function truncate_html_words:
public class SimpleHtmlTruncator {
public static String truncateHtmlWords(String text, int max_length) {
String input = text.trim();
if (max_length > input.length()) {
return input;
}
if (max_length < 0) {
return new String();
}
StringBuilder output = new StringBuilder();
/**
* Pattern pattern_opentag = Pattern.compile("(<[^/].*?[^/]>).*");
* Pattern pattern_closetag = Pattern.compile("(</.*?[^/]>).*"); Pattern
* pattern_selfclosetag = Pattern.compile("(<.*?/>).*");*
*/
String HTML_TAG_PATTERN = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>";
Pattern pattern_overall = Pattern.compile(HTML_TAG_PATTERN + "|" + "\\s*\\w*\\s*");
Pattern pattern_html = Pattern.compile("(" + HTML_TAG_PATTERN + ")" + ".*");
Pattern pattern_words = Pattern.compile("(\\s*\\w*\\s*).*");
int characters = 0;
Matcher all = pattern_overall.matcher(input);
while (all.find()) {
String matched = all.group();
Matcher html_matcher = pattern_html.matcher(matched);
Matcher word_matcher = pattern_words.matcher(matched);
if (html_matcher.matches()) {
output.append(html_matcher.group());
} else if (word_matcher.matches()) {
if (characters < max_length) {
String word = word_matcher.group();
if (characters + word.length() < max_length) {
output.append(word);
} else {
output.append(word.substring(0,
(max_length - characters) > word.length()
? word.length() : (max_length - characters)));
}
characters += word.length();
}
}
}
return output.toString();
}
public static void main(String[] args) {
String text = SimpleHtmlTruncator.truncateHtmlWords("<html><body><br/><p>abc</p><p>defghij</p><p>ghi</p></body></html>", 4);
System.out.println(text);
}
}
