I'm working on a project in which we create hundreds of XML files in Informatica every day, and all the data in the XML should be filtered, for example by removing all kinds of special characters such as * and +. You get the idea.
Adding regular expressions for every port is too complicated and not feasible given the large number of mappings we have.
I've added the custom property XMLAnyTypeToString=Yes; to the session, and now I get some of the characters in their usual presentation (" + , ..) instead of &abcd.
I'm hoping for some custom property or change in the XML target to remove these characters completely.
Any ideas?
You can make use of the String.replaceAll method:
String replaceAll(String regex, String replacement)
Replaces each substring of this string that matches the given regular expression with the given replacement.
You can create a set of symbols you want to remove using regex, for example:
[*+"#$%&\(",.\)]
and then apply it to your string:
String myString = "this contains **symbols** like these $#%#$%";
String cleanedString = myString.replaceAll("[*+\"#$%&]", "");
Now your "cleanedString" is free of the symbols you've chosen.
By the way, you can test your regex on this excellent site:
http://www.regexplanet.com/advanced/java/index.html
I have this example:
String str = "HellMCo I fiCZMnd thBVMis site intZereVCsting";
String tags = "BCMVZ";
I need a regular expression that helps me find every combination of the tags. As you can see, in str we find four variations. I don't know much about regular expressions.
I'm starting to test with this pattern:
(\d{,1}[BCMVZ])
PS: I'm testing at http://regexpal.com/ but my pattern doesn't work there.
So my real question is, how can I detect any variation of any character from another string?
Maybe try something like:
[BCMVZ]+
It finds any run made up of the characters BCMVZ.
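A minimal sketch of how that pattern could be applied with java.util.regex, collecting every run it finds. The class name TagRuns and the method findRuns are made up for this illustration. Note that a single tag character on its own (such as the Z in "intZereVCsting") also counts as a run; using [BCMVZ]{2,} instead would keep only runs of two or more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TagRuns {
    // Returns every maximal run of the characters B, C, M, V, Z in the input.
    static List<String> findRuns(String input) {
        Matcher m = Pattern.compile("[BCMVZ]+").matcher(input);
        List<String> runs = new ArrayList<>();
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        String str = "HellMCo I fiCZMnd thBVMis site intZereVCsting";
        System.out.println(findRuns(str));
    }
}
```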
I am working with a database whose entries contain automatically generated html links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with the replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern as a raw pattern, so don't forget to escape the double quotes and double the backslashes before using it in a Java string.
The main point of this pattern is that the URLs are compared without the leading http://, so that more cases match.
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
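A minimal sketch of the back-reference idea, assuming the stored links have exactly the shape <a href="URL">URL</a> with double quotes and no other attributes (anything looser needs a more tolerant pattern or a real parser). The class name Unlinkify and the method unlink are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class Unlinkify {
    // Group 1 captures the href value; \1 requires the link text to be identical to it.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"([^\"]*)\">\\1</a>");

    // Replaces each link whose text equals its href with the bare URL.
    static String unlink(String html) {
        return LINK.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String entry = "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book.";
        System.out.println(unlink(entry));
    }
}
```

Links whose text differs from the href do not match the pattern, so they are left untouched.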
I need to put multiple regular expressions in a single string and then parse it back into separate regexes, like below:
regex1<!>regex2<!>regex3
The problem is that I am not sure which delimiter would be best to separate the expressions, in place of the <!> shown in the example, so that I can safely split the string when parsing it back.
The constraints are that I cannot spread the string over multiple lines or use an XML or JSON string, because this string of expressions should be easily configurable.
Looking forward to any suggestions.
Edited:
Q: Why does it have to be a single string?
A: The system has a configuration manager that loads its config from a properties file, and the properties contain lines like
com.some.package.Class1.Field1: value
com.some.package.Class1.Expressions: exp1<!>exp2<!>exp3
There is no way to write the value in multiple lines in the properties file. That's why.
The best way would be to use an invalid regex such as ** as the delimiter, because it cannot appear in a normal regex: it would not compile and would throw an exception (note: ++ is valid).
regex1+"**"+regex2
Now you can split it with this regex:
(?<!\\\\)[*][*](?![*])
---------      -------
    |             |-> avoids matching inside a sequence like "A*"+"**"+"n+"
    |-> checks that the * is not escaped
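The following fragments are invalid as standalone regexes in Java. Some of them, such as [+, *+ or ??, can still occur inside a valid pattern (for example [+], a*+ or a??), so they are less safe as delimiters than **: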
[+
(+
[*
(*
[?
*+
** (delimiter would be (?<!\\\\)[*][*](?![*]))
?? (delimiter would be (?<!\\\\)[?][?](?![?]))
While splitting, you need to check that the delimiter is not escaped:
(?<!\\\\)delimiter
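A minimal sketch of splitting a configured value on the ** delimiter, using the split regex given above. The class name RegexListConfig, the method splitExpressions and the sample expressions are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RegexListConfig {
    // "**" that is not preceded by a backslash and not followed by another '*'.
    private static final Pattern DELIMITER =
            Pattern.compile("(?<!\\\\)[*][*](?![*])");

    // Splits a single configuration value back into its individual expressions.
    static String[] splitExpressions(String configValue) {
        return DELIMITER.split(configValue);
    }

    public static void main(String[] args) {
        // Three sample expressions joined with "**": A*, n+ and \d{2}
        String configValue = "A*" + "**" + "n+" + "**" + "\\d{2}";
        for (String expr : splitExpressions(configValue)) {
            System.out.println(expr);
        }
    }
}
```

The lookahead is what lets the first expression keep its trailing *: in A***n+ only the last two stars are treated as the delimiter.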
The best delimiter depends on your requirements. As a best practice, use a sequence of special characters, so that the chance of this sequence occurring in an expression is minimal,
like
$$**##$$
#$%&&%$#
I think this might be helpful for you.
First replace the delimiter tag with a single special character, and then split on that character:
String inputString = "regex1<!>regex2<!>regex3";
// Replace each <...> delimiter with "-" (assumes the expressions contain neither "<...>" nor "-")
String noHTMLString = inputString.replaceAll("\\<.*?>", "-");
String[] splitString1 = noHTMLString.split("[-]+");
for (String string : splitString1) {
    System.out.println(string);
}
I need to strip all xml tags from an xml document, but keep the space the tags occupy, so that the textual content stays at the same offsets as in the xml. This needs to be done in Java, and I thought RegExp would be the way to go, but I have found no simple way to get the length of the tags that match my regular expression.
Basically what I want is this:
Pattern p = Pattern.compile("<[^>]+>[^<]*]+>");
Matcher m = p.matcher(stringWithXMLContent);
String strippedContent = m.replaceAll("THIS IS A STRING OF WHITESPACES IN THE LENGTH OF THE MATCHED TAG");
Hope somebody can help me to do this in a simple way!
Since < and > characters always surround starting and ending tags in XML, this may be simpler with a straightforward state machine. Simply loop over all characters (in some writable form, not stored in a string), and if you encounter a <, switch on the "replacement mode" and start replacing all characters with spaces until you encounter a >. (Be sure to replace both the initial < and the closing >.)
If you care about layout, you may wish to avoid replacing tab characters and/or newline characters. If all you care about is overall string length, that obviously won't matter.
Edit: If you want to support comments, processing instructions and/or CDATA sections, you'll need to explicitly recognize these too; also, attribute values unfortunately can include > as well. All this means a full-fledged implementation will be more complex than you'd like.
A regular transducer would be perfect for this task; but unfortunately those aren't exactly commonly found in class libraries...
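A minimal sketch of the state machine described above, under the simplifying assumption that every < opens a tag and the next > closes it (so comments, CDATA sections and > inside attribute values are not handled). The class name TagBlanker and the method blankTags are invented for this example.

```java
// Hypothetical helper class, for illustration only.
class TagBlanker {
    // Replaces every character from '<' through the next '>' with a space,
    // leaving line breaks and tabs alone so the layout is preserved.
    static String blankTags(String xml) {
        char[] chars = xml.toCharArray();
        boolean inTag = false;
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c == '<') {
                inTag = true;           // enter "replacement mode"
            }
            if (inTag) {
                if (c == '>') {
                    inTag = false;      // the closing '>' is still blanked below
                }
                if (c != '\n' && c != '\r' && c != '\t') {
                    chars[i] = ' ';
                }
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println("[" + blankTags("<a>hi</a>") + "]");
    }
}
```

Because each character is replaced in place, the result always has the same length as the input and the text content keeps its offsets.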
Pattern p = Pattern.compile("<[^>]+>[^<]*]+>");
In the spirit of You Can't Parse XML With Regexp, you do know that's not an adequate pattern for arbitrary XML, right? (It's perfectly valid to have a > character in an attribute value, for example, not to mention other non-tag constructs.)
I have found no simple way to get the length of the tags that match my regular expression.
Instead of using replaceAll, repeatedly call find on the Matcher. You can then read start/end to get the indexes to replace, or use the appendReplacement method on a buffer, e.g.:
StringBuffer b = new StringBuffer();
while (m.find()) {
    // One space for every character of the matched tag
    String spaces = StringUtils.repeat(" ", m.end() - m.start());
    m.appendReplacement(b, spaces);
}
m.appendTail(b);
stringWithXMLContent = b.toString();
(StringUtils comes from Apache Commons. For more background and library-free alternatives see this question.)
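For a library-free variant of the same find/appendReplacement loop, a sketch along these lines might do. It assumes Java 11 or later for String.repeat and uses the deliberately simplistic pattern <[^>]+>, which shares the limitations mentioned above. The names TagPadder and padTags are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TagPadder {
    // Simplistic tag pattern, used only for illustration.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replaces each matched tag with the same number of spaces.
    static String padTags(String xml) {
        Matcher m = TAG.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // String.repeat requires Java 11 or later.
            m.appendReplacement(out, " ".repeat(m.end() - m.start()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + padTags("<b>x</b>!") + "]");
    }
}
```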
Why not use an XML pull parser and simply echo everything that you want to keep as you encounter it, e.g. character content? Whenever you reach a start or end tag, work out its length from the name of the element plus any attributes that it has, and write the appropriate number of spaces.
The SAX API also has callbacks for ignorable whitespace, so you can also echo all whitespace that occurs in your document.
Maybe m.start() and m.end() can help.
m.start() => "The index of the first character matched"
m.end() => "The offset after the last character matched"
(m.end() - m.start())-2 and you know how many /s you need.
string.replaceAll("(</?[a-zA-Z]{1}>)*", "")
You can also try this. It searches for <, then zero or one occurrence of /, then exactly one letter (lowercase or uppercase), then >, and the trailing * allows multiple occurrences of this pattern.
:)
I'm trying to write a regex to remove all but a handful of closing xml tags.
The code seems simple enough:
String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");
However, when this runs, it skips the "xml" closing tag. It seems to skip any tag where there is a matching character in the compiled group (a|em|li), i.e. if I remove the "l" from "li", it works.
I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).
You probably shouldn't use regex for this task, but let's see what happens...
Your problem is that you are using a negative character class, and inside character classes you can't write complex expressions - only characters. You could try a negative lookahead instead:
"</(?!a|em|li).*?>"
But this won't handle a number of cases correctly:
Comments containing things that look like tags.
Tags as strings in attributes.
Tags that start with a, em, or li but are actually other tags.
Capital letters.
etc...
You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
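As one example of such a fix, the sketch below is a variation on the lookahead above, not the exact pattern from this answer: it puts the closing > inside the lookahead so that only the exact tag names a, em and li are kept, and a tag such as </label> is still removed. It does nothing about comments, attributes or capital letters. The names ClosingTagStripper and strip are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ClosingTagStripper {
    // The lookahead includes '>' so that only the exact names a, em and li are protected.
    private static final Pattern CLOSING =
            Pattern.compile("</(?!(?:a|em|li)>)[^>]*>");

    // Removes every closing tag except </a>, </em> and </li>.
    static String strip(String input) {
        return CLOSING.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<body><xml>some stuff</xml></body>"));
    }
}
```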
I would really use a proper parser for this (e.g. JTidy). You can't parse XML/HTML using regular expressions as it's not regular, and no end of edge cases abound. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable 3rd party library (see above) and configure your output accordingly.
See this answer for more passionate info re. parsing XML/HTML via regexps.
You cannot use an alternation inside a character class. A character class always matches a single character.
You likely want to use a negative lookahead or lookbehind instead:
"</(?!a|em|li).*?>"