Best way to get Tokens in java - java

I have files with some naming conventions -
Ex 1 - filename1.en.html.xslt
Ex 2 - filename2.de.text.xslt
where en/de - language, html/text - output
I need to read individual files and populate the java object accordingly.
Also, en should be converted to en-US etc, while populating the language field.
Format.java
private String language;
private String output;
What is the best way to do this? I know it can be done through plain indexOf or using string tokenizer or parsing thru regex.
If regex is better any code samples please?

It really doesn't matter how you parse the filename as long as it works for you. If you want to take the regex route, a Pattern like this will work:
Pattern p = Pattern.compile("([^.]+)\\.([^.]+)\\.([^.]+)\\.xslt");
The first capture group is the filename, the second is the language, and the third is the output.
That said, a regex does seem like overkill, so what's wrong with using String#split()?
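A minimal sketch of the regex route, assuming every file name follows the name.language.output.xslt convention from the question. The class name XsltNameRegex and the parse helper are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XsltNameRegex {
    // name.language.output.xslt, where no part contains a dot
    private static final Pattern NAME =
            Pattern.compile("([^.]+)\\.([^.]+)\\.([^.]+)\\.xslt");

    // Returns {name, language, output}, or null when the name does not fit the convention.
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("filename1.en.html.xslt");
        System.out.println(parts[1] + " / " + parts[2]);
    }
}
```

Using matches() rather than find() means a name with extra or missing parts is rejected instead of being partially parsed.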

You could do it with StringTokenizer, but String.split() should mostly do the trick.
String foo = "filename1.en.html.xslt";
String[] parts = foo.split("\\."); // regex: need to escape dot
System.out.println(parts[1]); // outputs "en"
With StringTokenizer you could do:
String foo = "filename1.en.html.xslt";
StringTokenizer tokenizer = new StringTokenizer(foo, ".");
List<String> parts = new ArrayList<String>();
while(tokenizer.hasMoreTokens()) {
String part = tokenizer.nextToken();
parts.add(part);
}
System.out.println(parts.get(1)); // "en"
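None of the answers shows the en to en-US conversion the question asks for, so here is a rough sketch of how split could feed the Format object. The lookup table, the fromFileName factory and the getters are assumptions; only the en to en-US pair comes from the question, and de-DE is a guess:

```java
import java.util.HashMap;
import java.util.Map;

final class Format {
    // Hypothetical language-code table; extend it with whatever locales you need.
    private static final Map<String, String> LOCALES = new HashMap<>();
    static {
        LOCALES.put("en", "en-US"); // from the question
        LOCALES.put("de", "de-DE"); // assumed
    }

    private final String language;
    private final String output;

    private Format(String language, String output) {
        this.language = language;
        this.output = output;
    }

    // Expects name.language.output.xslt and throws on anything else.
    static Format fromFileName(String fileName) {
        String[] parts = fileName.split("\\.");
        if (parts.length != 4 || !"xslt".equals(parts[3])) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        String code = parts[1];
        // Fall back to the raw code when no mapping is known.
        String language = LOCALES.containsKey(code) ? LOCALES.get(code) : code;
        return new Format(language, parts[2]);
    }

    String getLanguage() { return language; }

    String getOutput() { return output; }
}
```

Usage would be something like Format.fromFileName("filename1.en.html.xslt"), after which getLanguage() returns the mapped value and getOutput() returns "html".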

Related

Using StringTokenizer with Comma delimiter while at the same time keeping commas preceded by a backslash [duplicate]

I'm trying to perform some super simple parsing of log files, so I'm using the String.split method like this:
String [] parts = input.split(",");
And works great for input like:
a,b,c
Or
type=simple, output=Hello, repeat=true
Just to say something.
How can I escape the comma, so it doesn't match intermediate commas?
For instance, if I want to include a comma in one of the parts:
type=simple, output=Hello, world, repeate=true
I was thinking of something like:
type=simple, output=Hello\, world, repeate=true
But I don't know how to create the split to avoid matching the comma.
I've tried:
String [] parts = input.split("[^\,],");
But, well, it's not working.
You can solve it using a negative look behind.
String[] parts = str.split("(?<!\\\\), ");
Basically it says: split on each ", " that is not preceded by a backslash.
String str = "type=simple, output=Hello\\, world, repeate=true";
String[] parts = str.split("(?<!\\\\), ");
for (String s : parts)
System.out.println(s);
Output:
type=simple
output=Hello\, world
repeate=true
(ideone.com link)
If you happen to be stuck with the non-escaped comma-separated values, you could do the following (similar) hack:
String[] parts = str.split(", (?=\\w+=)");
Which says split on each ", " which is followed by some word-characters and an =
(ideone.com link)
I'm afraid, there's no perfect solution for String.split. Using a matcher for the three parts would work. In case the number of parts is not constant, I'd recommend a loop with matcher.find. Something like this maybe
final String s = "type=simple, output=Hello, world, repeat=true";
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,|$)");
final Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group(1));
You'll probably want to skip the spaces after the comma as well:
final Pattern p = Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,\\s*|$)");
It's not really complicated, just note that you need four backslashes in order to match one.
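Here is a sketch of that find loop collecting the fields into a list, assuming commas are escaped with a backslash. The class and method names are invented. One thing to watch: after the last field, find() can report one more empty match at the very end of the input, so this version stops as soon as a match reaches the end, which also means a trailing empty field after a final comma is dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedCommaFields {
    // A field is a run of ordinary characters or backslash-escaped characters,
    // terminated by a comma (plus optional whitespace) or by the end of the input.
    private static final Pattern FIELD =
            Pattern.compile("((?:[^\\\\,]|\\\\.)*)(?:,\\s*|$)");

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            // Turn the escaped "\," back into a plain comma.
            result.add(m.group(1).replace("\\,", ","));
            if (m.end() == input.length()) {
                break; // avoid the extra empty match at the end of the input
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("type=simple, output=Hello\\, world, repeat=true"));
    }
}
```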
Escaping works with the opposite of aioobe's answer (updated: aioobe now uses the same construct but I didn't know that when I wrote this), negative lookbehind
final String s = "type=simple, output=Hello\\, world, repeate=true";
final String[] tokens = s.split("(?<!\\\\),\\s*");
for(final String item : tokens){
System.out.println("'" + item.replace("\\,", ",") + "'");
}
Output:
'type=simple'
'output=Hello, world'
'repeate=true'
Reference:
Pattern: Special Constructs
I think
input.split("[^\\\\],");
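should work. It splits at every comma that is not preceded by a backslash, but note that the character class also consumes the character in front of each comma, so that character is lost from the result. A negative lookbehind avoids this.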
BTW if you are working with Eclipse, I can recommend the QuickRex Plugin to test and debug Regexes.
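To see the difference between a character class and a lookbehind in front of the comma, a small comparison may help. This is only an illustration with made-up helper names, not code from any of the answers:

```java
import java.util.Arrays;

final class SplitComparison {
    // The class [^\\] is part of the delimiter, so it eats the character before each comma.
    static String[] withCharClass(String input) {
        return input.split("[^\\\\],");
    }

    // The lookbehind (?<!\\) only checks the preceding character without consuming it.
    static String[] withLookbehind(String input) {
        return input.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(withCharClass("a,b,c")));
        System.out.println(Arrays.toString(withLookbehind("a,b,c")));
    }
}
```

For "a,b,c" the first variant loses "a" and "b" because they become part of the delimiter, while the second keeps all three parts.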

Change tags in symbol Pattern/Matcher

This code works fine :
final String result = myString.replaceAll("<tag1>", "{").replaceAll("<tag2>", "}");
but I have to parse big files, so I'm wondering whether I can call Pattern.compile("REGEX"); before the while loop:
Pattern p = Pattern.compile("REGEX");
while(scan.hasNextLine()){
final String myWorkLine = scan.nextLine();
p.matcher(myWorkLine).replaceAll("$1"); // or other value
..;
}
I expect a faster result because the regex is compiled once and only once.
EDIT
I want to put (if it is possible) the replaceAll(..).replaceAll(..) model in a Pattern, and have tag1==>{, and tag2==>}.
Question : is outside loop Pattern model faster than inside loop replaceAll.replaceAll model?
To answer your original question: yes, you could do that, and indeed it would be faster than your original code, if you apply the same regular expression(s) multiple times in a loop. Your loop should be rewritten like this:
Pattern p1 = Pattern.compile("REGEX1");
Pattern p2 = Pattern.compile("REGEX2");
while (scan.hasNextLine()) {
String myWorkLine = scan.nextLine();
myWorkLine = p1.matcher(myWorkLine).replaceAll("replacement1");
myWorkLine = p2.matcher(myWorkLine).replaceAll("replacement2");
...;
}
But if you're not using regular expressions, as your first example suggests ("<tag1>"), then don't use String.replaceAll(String regex, String replacement), which is slower because it goes through the regex engine. Instead use String.replace(CharSequence target, CharSequence replacement), which treats its arguments as literal text and is much faster.
Example:
"ABAP is fun! ABAP ABAP ABAP".replace("ABAP", "Java");
See: Java Docs for String.replace
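Applied to the tags from the question, that could look like the sketch below. The class and method names are placeholders:

```java
final class TagReplace {
    // replace(CharSequence, CharSequence) treats both arguments as literal text,
    // so no regex syntax is interpreted in "<tag1>" or "<tag2>".
    static String convert(String line) {
        return line.replace("<tag1>", "{").replace("<tag2>", "}");
    }

    public static void main(String[] args) {
        System.out.println(convert("a <tag1>b<tag2> c"));
    }
}
```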
It's not nice changing your question that radically, but ok, here again an answer for your regular expression:
String s1
= "You can <bold>have nice weather</bold>, but <bold>not</bold> always!";
//EDIT: the regex was 'overengineered', and .?? should have been .*?
//String s2 = s1.replaceAll("(.*?)<bold>(.*?)</bold>(.??)", "$1{$2}$3");
String s2 = s1.replaceAll("<bold>(.*?)</bold>", "{$1}");
System.out.println(s2);
Output: You can {have nice weather}, but {not} always!
Here is the loop with this new regex, and yes, this would be faster than the original loop:
//EDIT: the regex was 'overengineered'
Pattern p = Pattern.compile("<bold>(.*?)</bold>");
while (scan.hasNextLine()) {
String myWorkLine = scan.nextLine();
myWorkLine = p.matcher(myWorkLine).replaceAll("{$1}");
...;
}
EDIT:
Here the description of Java RegEx syntax constructs
replaceAll uses regex Patterns. From the java.lang.String source code:
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
Edit1: Please stop changing what you're asking. Pick a question and stick with it.
Edit2:
If you're really sure you want to do it this way, compiling a regex outside of the loop, in the simplest case you'd need two different patterns:
Pattern tag1Pattern = Pattern.compile("<tag1>");
Pattern tag2Pattern = Pattern.compile("<tag2>");
while( scan.hasNextLine() ) {
String line = scan.nextLine();
String modifiedLine = tag1Pattern.matcher(line).replaceAll("{");
modifiedLine = tag2Pattern.matcher(modifiedLine).replaceAll("}"); // work on the result of the first replacement
...
}
You're still applying a pattern matcher twice per line, so if there is a performance hit, that's why.
Without knowing what your data looks like, it's hard to give you a more precise answer or better regex. Unless you've edited your question (again) while I was writing this.
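If you want both replacements handled by one precompiled pattern, as the EDIT in the question asks, one possible approach is an alternation combined with Matcher.appendReplacement. This is a sketch under the assumption that the tags are the literal strings <tag1> and <tag2>; the class name is invented, and whether it is actually faster on your files is something you would have to measure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagSinglePass {
    // Compiled once; matches either tag.
    private static final Pattern TAGS = Pattern.compile("<tag1>|<tag2>");

    static String convert(String line) {
        Matcher m = TAGS.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Pick the replacement based on which tag was matched.
            String replacement = "<tag1>".equals(m.group()) ? "{" : "}";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("x <tag1>y<tag2> z"));
    }
}
```

Each line is scanned once instead of twice, at the cost of slightly more code than two chained replace calls.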

Regex Pattern to avoid : and , in the strings

I have a string which comes from the DB.
the string is something like this:-
ABC:def,ghi:jkl,hfh:fhgh,ahf:jasg
In short String:String, and it repeats for large values.
I need to parse this string to get only the words without any : or , and store each word in ArrayList
I can do it using split function(twice) but I figured out that using regex I can do it one go and get the arraylist..
String strLine="category:hello,good:bye,wel:come";
Pattern titlePattern = Pattern.compile("[a-z]");
Matcher titleMatcher = titlePattern.matcher(strLine);
int i=0;
while(titleMatcher.find())
{
i=titleMatcher.start();
System.out.println(strLine.charAt(i));
}
However it is not giving me proper results..It ends up giving me index of match found and then I need to append it which is not so logical and efficient,.
Is there any way around..
String strLine="category:hello,good:bye,wel:come";
String a[] = strLine.split("[,:]");
for(String s :a)
System.out.println(s);
Use java StringTokenizer
Sample:
StringTokenizer st = new StringTokenizer(in, ":,");
while(st.hasMoreTokens())
System.out.println(st.nextToken());
Even if you can use a regular expression to parse the entire string at once, I think it would be less readable than splitting it with multiple steps.
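For comparison, here is a sketch of the one-pass matcher approach the question was reaching for. Instead of matching single letters with [a-z], it matches whole runs of characters that are neither ':' nor ',' and reads them with group(). The class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordExtractor {
    // One or more characters that are neither ':' nor ','
    private static final Pattern WORD = Pattern.compile("[^:,]+");

    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            result.add(m.group()); // the whole matched word, not just its start index
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("category:hello,good:bye,wel:come"));
    }
}
```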

regular expression to split the string in java

I want to split the string say [AO_12345678, Real Estate] into AO_12345678 and Real Estate
how can I do this in Java using regex?
The main issue I'm facing is avoiding "[" and "]".
please help
Does it really have to be regex?
if not:
String s = "[AO_12345678, Real Estate]";
String[] split = s.substring(1, s.length()-1).split(", ");
I'd go the pragmatic way:
String org = "[AO_12345678, Real Estate]";
String plain = org;
if (org.startsWith("[")) {
if (org.endsWith("]")) {
plain = org.substring(1, org.length() - 1); // drop both brackets
} else {
plain = org.substring(1); // drop only the leading bracket
}
}
String[] result = plain.split(",");
If the string is always surrounded with '[]' you can just substring it without checking.
One easy way, assuming the format of all your inputs is consistent, is to ignore regex altogether and just split it. Something like the following would work:
String[] parts = input.split(","); // parts is ["[AO_12345678", "Real Estate]"]
String firstWithoutBrace = parts[0].substring(1);
String secondWithoutBrace = parts[1].substring(0, parts[1].length() - 1);
String first = firstWithoutBrace.trim();
String second = secondWithoutBrace.trim();
Of course you can tailor this as you wish - you might want to check whether the braces are present before removing them, for example. Or you might want to keep any spaces before the comma as part of the first string. This should give you a basis to modify to your specific requirements however.
And in a simple case like this I'd much prefer code like the above to a regex that extracted the two strings - I consider the former much clearer!
You can also use StringTokenizer. Here is the code:
String str = "[AO_12345678, Real Estate]";
StringTokenizer st = new StringTokenizer(str, "[],", false);
String s1 = st.nextToken();
String s2 = st.nextToken().trim(); // the second token keeps the space after the comma
s1=AO_12345678
s2=Real Estate
Refer to javadocs for reading about StringTokenizer
http://download.oracle.com/javase/1.4.2/docs/api/java/util/StringTokenizer.html
Another option using regular expressions (RE) capturing groups:
private static void extract(String text) {
Pattern pattern = Pattern.compile("\\[(.*),\\s*(.*)\\]");
Matcher matcher = pattern.matcher(text);
if (matcher.find()) { // or .matches for matching the whole text
String id = matcher.group(1);
String name = matcher.group(2);
// do something with id and name
System.out.printf("ID: %s%nName: %s%n", id, name);
}
}
If speed/memory is a concern, the RE can be optimized to (using Possessive quantifiers instead of Greedy ones)
"\\[([^,]*+),\\s*+([^\\]]*+)\\]"

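A small sketch that wraps the possessive pattern in a helper, assuming the input always has the form [id, name] with no comma inside the id. The class name and the null return for non-matching input are choices made for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BracketPair {
    // Possessive quantifiers (*+) never give characters back, so a failed match fails fast.
    private static final Pattern PAIR =
            Pattern.compile("\\[([^,]*+),\\s*+([^\\]]*+)\\]");

    // Returns {id, name}, or null when the text is not of the form [id, name].
    static String[] parse(String text) {
        Matcher m = PAIR.matcher(text);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = parse("[AO_12345678, Real Estate]");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```
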
Regex in java for the string

I am very new to Java regular expressions. I do not want to use Java's split with delimiters to get the individual tokens; I don't feel it's a neat way. I have the following string
"Some String lang:c,cpp,java file:build.java"
I want to break up this into three parts
1 part containing "Some String"
2 part containing "c,cpp,java"
3 String containing "build.java"
The lang: and file: can be placed any where and they are optional.
Try the following expressions to get the language list and the file:
String input = "Some String lang:c,cpp,java file:build.java";
String langExpression = "lang:([\\w,]*)";
String fileExpression = "file:([\\w.]*)";
Pattern langPattern = Pattern.compile(langExpression);
Matcher langMatcher = langPattern.matcher(input);
if (langMatcher.find()) { // find(), not matches(): the pattern covers only part of the input
String languageList = langMatcher.group(1);
}
Pattern filePattern = Pattern.compile(fileExpression);
Matcher fileMatcher = filePattern.matcher(input);
if (fileMatcher.find()) {
String filename = fileMatcher.group(1);
}
This should work with lang:xxx file:xxx as well as file:xxx lang:xxx as long as the language list or the filename don't contain whitespaces. This would also work if lang: and/or file: was missing.
Would you also expect a string like this: file:build.java Some String lang:c,cpp,java?
What is so "unmaintainable" about using split?
String str = "Some String lang:c,cpp,java file:build.java";
String[] s = str.split("(lang|file):");
While split can do what you want to achieve, you can also write your own code using the substring and indexOf methods. That avoids regular expression matching and may be faster than split, although the difference only matters if this runs very often.
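A rough sketch of the substring/indexOf route, under the assumptions that a value never contains a space and that "lang:" or "file:" never appears inside another word. The class and method names are invented for this example:

```java
final class QueryParser {
    // Returns the text after key (e.g. "lang:") up to the next space, or null if key is absent.
    static String valueOf(String input, String key) {
        int start = input.indexOf(key);
        if (start < 0) {
            return null;
        }
        int valueStart = start + key.length();
        int end = input.indexOf(' ', valueStart);
        return end < 0 ? input.substring(valueStart) : input.substring(valueStart, end);
    }

    // Returns the input with the key:value token removed and outer whitespace trimmed.
    static String without(String input, String key) {
        int start = input.indexOf(key);
        if (start < 0) {
            return input;
        }
        int end = input.indexOf(' ', start);
        String rest = input.substring(0, start) + (end < 0 ? "" : input.substring(end + 1));
        return rest.trim();
    }

    public static void main(String[] args) {
        String input = "Some String lang:c,cpp,java file:build.java";
        System.out.println(valueOf(input, "lang:"));
        System.out.println(valueOf(input, "file:"));
        System.out.println(without(without(input, "lang:"), "file:"));
    }
}
```

Because each key is looked up independently, the order of lang: and file: does not matter, and a missing key simply yields null.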
