Regular Expression Split XML in Java

I want to split some XML text into parts:
xmlcontent = "<tagA>text1<tagB>text2</tagB></tagA>";
In C# I use
string[] splitedTexts = Regex.Split(xmlcontent, "(<.*?>)|(.+?(?=<|$))");
The result is
splitedTexts = ["<tagA>", "text1", "<tagB>", "text2", "</tagB>", "</tagA>"]
How can I do this in Java?
I have tried
String[] splitedTexts = xmlcontent.split("(<.*?>)");
but the result is not what I expect.

The parameter to split defines the delimiter to split at, and the delimiter itself is discarded. You want to split before each < and after each > without consuming any characters, so use zero-width lookarounds:
String[] splitedTexts = xmlcontent.split("(?=<)|(?<=>)");
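As a quick check of the lookaround split, here is a minimal sketch. It assumes Java 8 or later, where a zero-width match at index 0 does not produce a leading empty string; the class and method names are made up for illustration.

```java
import java.util.Arrays;

final class LookaroundSplitDemo {
    // Splits at every position that is followed by '<' or preceded by '>'.
    // Both alternatives are zero-width, so no characters are consumed.
    static String[] splitMarkup(String text) {
        return text.split("(?=<)|(?<=>)");
    }

    public static void main(String[] args) {
        String xmlContent = "<tagA>text1<tagB>text2</tagB></tagA>";
        // On Java 8+ this prints:
        // [<tagA>, text1, <tagB>, text2, </tagB>, </tagA>]
        System.out.println(Arrays.toString(splitMarkup(xmlContent)));
    }
}
```

The trailing empty string after the final > is dropped because split with the default limit removes trailing empty strings.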

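If you want to keep your original regex, use Pattern and Matcher (from java.util.regex) and collect the matches instead of splitting: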
public static void main(String[] args) {
    String xmlContent = "<xml><tagA>text1</tagA><tagB>text2</tagB></xml>";
    Pattern pattern = Pattern.compile("(<.*?>)|(.+?(?=<|$))");
    Matcher matcher = pattern.matcher(xmlContent);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }
}
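If the tokens are needed as a collection rather than printed, the same find loop can fill a list. This is only a sketch: the pattern below is a simplified substitute for the one in the answer (a tag, or a run of characters up to the next <), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkupTokenizer {
    // Either a whole tag ("<...>") or a run of text up to the next '<'.
    // This is a simplified pattern, not the one from the answer above.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [<xml>, <tagA>, text1, </tagA>, <tagB>, text2, </tagB>, </xml>]
        System.out.println(tokenize("<xml><tagA>text1</tagA><tagB>text2</tagB></xml>"));
    }
}
```

Neither pattern understands comments, CDATA sections, or a > inside an attribute value, so for real XML a parser is the safer choice.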

Related

How to extract number suffix from a filename

In Java, I have a filename such as ABC.12.txt.gz and I want to extract the number 12 from it. Currently I am calling lastIndexOf and substring several times.
You could try using pattern matching:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
// ... Other features
String fileName = "..."; // Filename with number extension
Pattern pattern = Pattern.compile("^.*?(\\d+).*$"); // Reluctant .*? so group 1 captures the whole first number, not just its last digit
// Then try matching
Matcher matcher = pattern.matcher(fileName);
String numberExt = "";
if (matcher.matches()) {
    numberExt = matcher.group(1);
} else {
    // The filename has no numeric value in it.
}
// Use your numberExt here.
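Whether the leading .* is greedy or reluctant decides what group 1 captures in a pattern of this shape. The sketch below compares the two variants on the example filename; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstNumberDemo {
    // Returns group 1 if the whole input matches the regex, otherwise "".
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String fileName = "ABC.12.txt.gz";
        // Greedy .* swallows as much as it can and leaves one digit for the group.
        System.out.println(capture("^.*(\\d+).*$", fileName));  // prints 2
        // Reluctant .*? stops at the first digit, so the group gets the full number.
        System.out.println(capture("^.*?(\\d+).*$", fileName)); // prints 12
    }
}
```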
You can separate every numeric part from the non-numeric ones by splitting at each digit/non-digit boundary with a regular expression:
public static void main(String args[]) {
    String str = "ABC.12.txt.gz";
    String[] parts = str.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");
    // view the resulting parts
    for (String s : parts) {
        System.out.println(s);
    }
    // do what you want with those values...
}
This will output
ABC.
12
.txt.gz
Then take the parts you need and do what you have to do with them.
We can use something like this to extract the number from a string
String fileName="ABC.12.txt.gz";
String numberOnly= fileName.replaceAll("[^0-9]", "");
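One thing to keep in mind with the replaceAll approach is that it removes every non-digit, so digits from separate parts of the name end up joined together. A small sketch; the second filename is a made-up example, as are the class and method names.

```java
final class DigitsOnlyDemo {
    // Deletes every character that is not an ASCII digit.
    static String digitsOnly(String fileName) {
        return fileName.replaceAll("[^0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("ABC.12.txt.gz")); // prints 12
        // Hypothetical name with two numeric parts: the digits are concatenated.
        System.out.println(digitsOnly("ABC.12.v3.gz"));  // prints 123
    }
}
```

If filenames can contain more than one number, a capturing group or the boundary split is the safer option.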

method to take string inside curly braces using split or tokenizer

String s = "author= {insert text here},";
I'm trying to get the text inside the braces. I've looked around but couldn't find a solution that uses just split or a tokenizer.
So far I'm doing this:
arraySplitBracket = s.trim().split("\\{", 0);
which gives me insert text here},
at array[1], but I'd like a way to not have the } attached.
I also tried
StringTokenizer st = new StringTokenizer(s, "\\{,\\},");
But it gave me author= as output.
public static void main(String[] args) {
    String input = "{a c df sdf TDUS^&%^7 }";
    String regEx = "(.*[{]{1})(.*)([}]{1})";
    Matcher matcher = Pattern.compile(regEx).matcher(input);
    if (matcher.matches()) {
        System.out.println(matcher.group(2));
    }
}
You can use the regex \\{([^}]*)\\} to get the string between curly braces.
Code snippet:
String str = "{insert text here}";
Pattern p = Pattern.compile("\\{([^}]*)\\}");
Matcher m = p.matcher(str);
while (m.find()) {
    System.out.println(m.group(1));
}
Output:
insert text here
String s = "auther ={some text here},";
s = s.substring(s.indexOf("{") + 1); //some text here},
s = s.substring(0, s.indexOf("}"));//some text here
System.out.println(s);
How about taking a substring of arraySplitBracket[1] that stops at the closing brace? (Dropping only the last character would remove the trailing comma, not the }.)
Something like
arraySplitBracket[1] = arraySplitBracket[1].substring(0, arraySplitBracket[1].indexOf('}'));
Or use the String class's replace method to remove the } ?
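Since the question asks specifically for split or a tokenizer, here is a sketch of both. It assumes the input has some text before the opening brace, as in the question's example; the class and method names are hypothetical. Note that split takes a regex (so a character class works), while StringTokenizer treats its second argument as a plain set of delimiter characters.

```java
import java.util.StringTokenizer;

final class BraceContentDemo {
    // split takes a regex: the character class [{}] splits on either brace.
    static String viaSplit(String s) {
        String[] parts = s.split("[{}]");
        return parts.length > 1 ? parts[1] : "";
    }

    // StringTokenizer takes a set of delimiter characters, not a regex.
    // Assumes there is text before '{', so the content is the second token.
    static String viaTokenizer(String s) {
        StringTokenizer st = new StringTokenizer(s, "{}");
        if (st.countTokens() < 2) {
            return "";
        }
        st.nextToken();        // skip the part before '{'
        return st.nextToken(); // the part between '{' and '}'
    }

    public static void main(String[] args) {
        String s = "author= {insert text here},";
        System.out.println(viaSplit(s));     // prints insert text here
        System.out.println(viaTokenizer(s)); // prints insert text here
    }
}
```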

How to extract the ID from a Google Sheets URL?

I have the following URLs:
https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY/edit#gid=1842172258
https://docs.google.com/a/example.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6PTKTzY0xOM5c6TXY/edit#gid=1842172258
https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY
For each URL, I need to extract the sheet ID (for example 1mrsetjgfZI2BIypz7SGHMOfHGv6PTKTzY0xOM5c6TXY) into a Java String.
I am thinking of using split, but it doesn't work for all test cases:
String string = "https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY/edit#gid=1842172258";
String[] parts = string.split("/");
String res = parts[parts.length-2];
Log.d("hello res",res );
How can I do that?
You can use the regex \/d\/(.*?)(\/|$) to solve your problem. If you look closely, the ID always sits between d/ and either the next / or the end of the line, so you can capture everything in between:
String[] urls = new String[]{
"https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY/edit#gid=1842172258",
"https://docs.google.com/a/example.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6PTKTzY0xOM5c6TXY/edit#gid=1842172258",
"https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY"
};
String regex = "\\/d\\/(.*?)(\\/|$)";
Pattern pattern = Pattern.compile(regex);
for (String url : urls) {
    Matcher matcher = pattern.matcher(url);
    while (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}
Outputs
1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY
1mrsetjgfZI2BIypz7SGHMOfHGv6PTKTzY0xOM5c6TXY
1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY
It looks like the ID you are looking for always follows "/spreadsheets/d/". If that is the case, you can update your code like this:
String string = "https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY/edit#gid=1842172258";
String[] parts = string.split("spreadsheets/d/");
String result;
if (parts[1].contains("/")) {
    String[] parts2 = parts[1].split("/");
    result = parts2[0];
} else {
    result = parts[1];
}
System.out.println("hello "+ result);
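Using regex
Pattern pattern = Pattern.compile("(?<=/d/)[^/]*");
Matcher matcher = pattern.matcher(url);
if (matcher.find()) { // find() must succeed before group() can be called
    System.out.println(matcher.group()); // the pattern has no capturing group, so use group()
}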
Using Java
String result = url.substring(url.indexOf("/d/") + 3);
int slash = result.indexOf("/");
result = slash == -1 ? result : result.substring(0, slash);
System.out.println(result);
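The regex and substring ideas above can be wrapped in a small helper that also handles URLs without an ID. This is a sketch under the assumption that the ID is whatever follows the first "/d/" up to the next /, ? or #; the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SheetIdExtractor {
    // Assumption: the ID is the path segment right after "/d/".
    private static final Pattern ID = Pattern.compile("/d/([^/?#]+)");

    // Returns the ID, or an empty Optional if the URL has no "/d/<id>" part.
    static Optional<String> extractId(String url) {
        Matcher matcher = ID.matcher(url);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY/edit#gid=1842172258",
            "https://docs.google.com/a/example.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6PTKTzY0xOM5c6TXY/edit#gid=1842172258",
            "https://docs.google.com/spreadsheets/d/1mrsetjgfZI2BIypz7SGHMOfHGv6kTKTzY0xOM5c6TXY"
        };
        for (String url : urls) {
            System.out.println(extractId(url).orElse("(no id found)"));
        }
    }
}
```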
Google appears to use fixed-length IDs. In your case they are 44 characters long and consist of alphanumerics, - and _, so you can match them with this regex (shown here in Python):
import re
regex = r"[\w-]{44}"
match = re.search(regex, url)

Replace a set of substring in a string in more efficient way?

I have to replace a set of substrings in a String with other substrings, for example:
"^t" with "\t"
"^=" with "\u2014"
"^+" with "\u2013"
"^s" with "\u00A0"
"^?" with "."
"^#" with "\\d"
"^$" with "[a-zA-Z]"
So, I've tried with:
String oppip = "pippo^t^# p^+alt^shefhjkhfjkdgfkagfafdjgbcnbch^";
Map<String,String> tokens = new HashMap<String,String>();
tokens.put("^t", "\t");
tokens.put("^=", "\u2014");
tokens.put("^+", "\u2013");
tokens.put("^s", "\u00A0");
tokens.put("^?", ".");
tokens.put("^#", "\\d");
tokens.put("^$", "[a-zA-Z]");
String regexp = "^t|^=|^+|^s|^?|^#|^$";
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(oppip);
while (m.find())
m.appendReplacement(sb, tokens.get(m.group()));
m.appendTail(sb);
System.out.println(sb.toString());
But it doesn't work. tokens.get(m.group()) throws an exception.
Any idea why?
You don't have to use a HashMap. Consider using simple arrays, and a loop:
String oppip = "pippo^t^# p^+alt^shefhjkhfjkdgfkagfafdjgbcnbch^";
String[] searchFor =
{"^t", "^=", "^+", "^s", "^?", "^#", "^$"},
replacement =
{"\\t", "\\u2014", "\\u2013", "\\u00A0", ".", "\\d", "[a-zA-Z]"};
for (int i = 0; i < searchFor.length; i++)
oppip = oppip.replace(searchFor[i], replacement[i]);
// Print the result.
System.out.println(oppip);
For completeness, you can use a two-dimensional array for a similar approach:
String oppip = "pippo^t^# p^+alt^shefhjkhfjkdgfkagfafdjgbcnbch^";
String[][] tasks =
{
{"^t", "\\t"},
{"^=", "\\u2014"},
{"^+", "\\u2013"},
{"^s", "\\u00A0"},
{"^?", "."},
{"^#", "\\d"},
{"^$", "[a-zA-Z]"}
};
for (String[] replacement : tasks)
oppip = oppip.replace(replacement[0], replacement[1]);
// Print the result.
System.out.println(oppip);
In regex, ^ means "beginning of text" (or negation when it is the first character inside a character class). To match a literal ^ you have to put a backslash before it, which becomes two backslashes in a Java string literal.
String regexp = "\\^[t=+s?#$]";
I have also condensed the alternation into a single character class.
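Putting the escaped pattern together with the map from the question, a minimal sketch could look like the following. One extra detail matters here: appendReplacement treats backslashes and $ in the replacement string specially, so the replacement is passed through Matcher.quoteReplacement to keep values like "\\d" intact. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenReplacer {
    private static final Map<String, String> TOKENS = new HashMap<>();
    static {
        TOKENS.put("^t", "\t");
        TOKENS.put("^=", "\u2014");
        TOKENS.put("^+", "\u2013");
        TOKENS.put("^s", "\u00A0");
        TOKENS.put("^?", ".");
        TOKENS.put("^#", "\\d");
        TOKENS.put("^$", "[a-zA-Z]");
    }

    // A literal '^' followed by one of the token characters.
    private static final Pattern TOKEN_PATTERN = Pattern.compile("\\^[t=+s?#$]");

    static String expand(String input) {
        Matcher m = TOKEN_PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement stops '\' and '$' from being interpreted.
            m.appendReplacement(sb, Matcher.quoteReplacement(TOKENS.get(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("pippo^t^# p^+alt^shefhjkhfjkdgfkagfafdjgbcnbch^"));
    }
}
```

Unlike chained String.replace calls, this makes a single pass over the input, so replacement text is never scanned again for tokens. A lone ^ that is not followed by a token character is left unchanged.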

Extract content after "=" and before "&", Regex expression in java

I want to extract part of a string: the content after the "=" and before the "&", as in this example:
asdfaf=afl10109&adsfjkl
I want to extract "afl10109" from the string. Can anyone show me how to do this? I am very new to regular expressions.
Use replaceAll() to replace the whole input with just what you want:
String target = str.replaceAll(".*=(.*)&.*", "$1");
The target is captured in a group (group number 1), which is then referenced in the replacement string.
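Two properties of this replaceAll call are worth checking before relying on it: the greedy .* picks the last "=" that still leaves a "&" after it, and an input that does not match is returned unchanged. A small sketch; the extra inputs are made-up examples and the class and method names are hypothetical.

```java
final class ReplaceAllCaptureDemo {
    // Replaces the whole input with whatever group 1 captured.
    static String between(String str) {
        return str.replaceAll(".*=(.*)&.*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(between("asdfaf=afl10109&adsfjkl")); // prints afl10109
        // With several pairs, the greedy leading .* skips ahead to the last '='.
        System.out.println(between("a=1&b=2&c"));               // prints 2
        // No '=' ... '&' in the input: nothing matches, so the input comes back as-is.
        System.out.println(between("no delimiters"));           // prints no delimiters
    }
}
```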
Try:
public static void main(String args[]) {
    String input = "asdfaf=afl10109&adsfjkl";
    Pattern pattern = Pattern.compile("=[^&]*&");
    Matcher m = pattern.matcher(input);
    while (m.find()) {
        String str = m.group();
        // strip the leading '=' and the trailing '&'
        System.out.println(str.substring(1, str.length() - 1));
    }
}
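You can also do it without Pattern and Matcher by calling split() twice (split() still takes a regex, but "=" and "&" have no special meaning in one):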
String str = "asdfaf=afl10109&adsfjkl";
System.out.println(str.split("=")[1].split("&")[0]);
Output:
afl10109
Using good old String#substring()
String str = "foo=bar&baz";
int begin = str.indexOf('=');
if (begin != -1) {
    int end = str.indexOf('&', begin);
    if (end != -1) {
        System.out.println(str.substring(begin + 1, end)); // bar
    }
}
