Removing specific part of string - java

I'm parsing every line of a file (XML file) and I need to find path="this_is_my_path". After this, I need to extract what's inside the quotes, i.e. I need to get this_is_my_path.
This is what I'm doing:
String pattern = ".*path=\"(.*?)\"";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(the_text_file);
while (m.find()) {
System.out.println(m.group().trim());
}
After running this, I'm getting this:
path="path_to_file"
test="ui_test" path="path_to_other_file"
.....
I should be printing this:
path_to_file
path_to_other_file
path_to_other_fileX
path_to_other_fileW

If you need to use regex, try with this:
(?<=path=\")(.*?)(?=\")
Or you can use your regex, but without .* at the beginning, because it also matches any content before path= on the same line. Then get the value from group 1.
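A minimal sketch of the second suggestion, reading group 1 instead of the whole match. The class name, method name and sample text are made up for illustration; only the pattern comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathAttributeRegex {
    // The asker's pattern without the leading ".*", so each match starts at "path=".
    private static final Pattern PATH = Pattern.compile("path=\"(.*?)\"");

    // Collects the text between the quotes of every path="..." occurrence.
    static List<String> extractPaths(String text) {
        List<String> paths = new ArrayList<>();
        Matcher m = PATH.matcher(text);
        while (m.find()) {
            // group() is the whole match; group(1) is only the captured value.
            paths.add(m.group(1));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Hypothetical input resembling the lines shown in the question.
        String sample = "path=\"path_to_file\"\ntest=\"ui_test\" path=\"path_to_other_file\"";
        System.out.println(extractPaths(sample)); // [path_to_file, path_to_other_file]
    }
}
```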

Why reinvent the wheel? Unless this is a challenge or something?
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
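For comparison, here is a rough sketch of the parser route using the DOM API that ships with the JDK. The element names in the sample document are invented, since the question does not show the real file; the code simply collects the path attribute from every element that has one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PathAttributeDom {
    // Returns the value of the "path" attribute of every element that carries one.
    static List<String> extractPaths(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*"); // every element, in document order
            List<String> paths = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if (element.hasAttribute("path")) {
                    paths.add(element.getAttribute("path"));
                }
            }
            return paths;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document: note the apostrophes and the spaces around '='.
        String xml = "<suite><case path='path_to_file'/>"
                + "<case test=\"ui_test\" path = \"path_to_other_file\"/></suite>";
        System.out.println(extractPaths(xml)); // [path_to_file, path_to_other_file]
    }
}
```

The parser deals with quoting style, whitespace and entities, which is where a hand-written pattern tends to break.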

One should really try and collect the many reasons why using a regular expression is insufficient for getting anything out reliably from an XML file, even if that "anything" is just a measly attribute, e.g. path and its (string) value. A simple pattern such as "path=\"(.*?)\"" is doomed to fail due to the tiniest amount of freedom the XML spec leaves for writing legal XML, and more.
White space, including line breaks, may occur before and after the equal sign.
Apostrophes can be used instead of quotes.
Any character can be written as a numeric or named entity.
The string could be part of an element or attribute value.
The string could occur in an XML comment.
The XML file may be written in an encoding which naive reading as a vanilla text file fails to take into account; hence data may be garbage.
So, just for the record: I strongly suggest using an XSLT transformation to extract the desired attribute values. This requires just a very simple template. Using an XML parser requires more lines of code, but it is equally reliable.
And here is the Java code I strongly advocate not to use - it just covers two out of the points mentioned above.
String theText = ...;
String pattern = "\\bpath\\s*=\\s*(\"(.*?)\"|'(.*?)')";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(theText);
while (m.find()) {
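// group(1) still includes the surrounding quotes; groups 2 and 3 hold the bare value
System.out.println((m.group(2) != null ? m.group(2) : m.group(3)).trim());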
}
(And did you notice the word boundary preceding path? Just another chance to go wrong with this approach.)


How to retrieve portion of number that's within parenthesis in Java?

For part of my Java assignment I'm required to select all records that have a certain area code. I have custom objects within an ArrayList, like ArrayList<Foo>.
Each object has a String phoneNumber variable. They are formatted like "(555) 555-5555"
My goal is to search through each custom object in the ArrayList<Foo> (call it listOfFoos) and place the objects with area code "616" in a temporaryListOfFoos ArrayList<Foo>.
I have looked into tokenizers, but was unable to get the syntax correct. I feel like what I need to do is similar to the post "Ignore parentheses with string tokenizer?", but since I'm only trying to retrieve the first 3 digits (and I don't care about the remaining 7), it didn't give me exactly what I was looking for.
What I did as a temporary work-around, was...
for (int i = 0; i<listOfFoos.size();i++){
if (listOfFoos.get(i).getPhoneNumber().contains("616")){
tempListOfFoos.add(listOfFoos.get(i));
}
}
This worked for our current dataset, however, if there was a 616 anywhere else in the phone numbers [like "(555) 616-5555"] it obviously wouldn't work properly.
If anyone could give me advice on how to retrieve only the first 3 digits, while ignoring the parentheses, I would greatly appreciate it.
You have two options:
Use value.startsWith("(616)") or,
Use regular expressions with this pattern (written as a Java string literal): "^\\(616\\).*"
The first option will be a lot quicker.
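A small sketch of the first option applied to the asker's loop. The Foo class here is a hypothetical stand-in with only the getPhoneNumber() accessor mentioned in the question, and the helper name is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AreaCodeFilter {
    // Hypothetical stand-in for the asker's Foo class.
    static class Foo {
        private final String phoneNumber;
        Foo(String phoneNumber) { this.phoneNumber = phoneNumber; }
        String getPhoneNumber() { return phoneNumber; }
    }

    // Keeps only the records whose number starts with "(areaCode)".
    static List<Foo> withAreaCode(List<Foo> listOfFoos, String areaCode) {
        String prefix = "(" + areaCode + ")";
        List<Foo> tempListOfFoos = new ArrayList<>();
        for (Foo foo : listOfFoos) {
            if (foo.getPhoneNumber().startsWith(prefix)) {
                tempListOfFoos.add(foo);
            }
        }
        return tempListOfFoos;
    }

    public static void main(String[] args) {
        List<Foo> listOfFoos = Arrays.asList(new Foo("(616) 555-5555"), new Foo("(555) 616-5555"));
        for (Foo foo : withAreaCode(listOfFoos, "616")) {
            System.out.println(foo.getPhoneNumber()); // (616) 555-5555
        }
    }
}
```

Unlike contains("616"), this does not pick up "(555) 616-5555".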
areaCode = number.substring(number.indexOf('(') + 1, number.indexOf(')')).trim() should do the job for you, given the formatting of phone numbers you have.
Or if you don't have any extraneous spaces, just use areaCode = number.substring(1, 4).
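The substring approach could be wrapped in a helper along these lines. The class and method names are invented, and the check for missing parentheses is an addition that the one-liner above does not have.

```java
class AreaCodeExtractor {
    // Returns the text between the first '(' and the following ')', trimmed.
    static String areaCode(String number) {
        int open = number.indexOf('(');
        int close = number.indexOf(')', open + 1);
        if (open < 0 || close < 0) {
            throw new IllegalArgumentException("No parenthesised area code in: " + number);
        }
        return number.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        System.out.println(areaCode("(616) 555-5555"));   // 616
        System.out.println(areaCode("( 616 ) 555-5555")); // 616
    }
}
```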
I think what you need is a capturing group. Have a look at the Groups and capturing section in this document.
Once you have matched the input against a pattern (for example "\\((\\d+)\\) \\d+-\\d+"), you can get the number in the parentheses from the matcher (an object of java.util.regex.Matcher) with matcher.group(1).
You could use a regular expression as shown below. The pattern will ensure the entire phone number conforms to your pattern ((XXX) XXX-XXXX) plus grabs the number within the parentheses.
int areaCodeToSearch = 555;
String pattern = String.format("\\((%d)\\) \\d{3}-\\d{4}", areaCodeToSearch);
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(phoneNumber);
if (m.matches()) {
String areaCode = m.group(1);
// ...
}
Whether you choose to use a regular expression versus a simple String lookup (as mentioned in other answers) will depend on how bothered you are about the format of the entire string.

Detect when file can't be created because of bad characters in name

Can anyone tell me how to cope with illegal file names in java? When I run the following on Windows:
File badname = new File("C:\\Temp\\a:b");
System.out.println(badname.getAbsolutePath()+" length="+badname.length());
FileWriter w = new FileWriter(badname);
w.write("hello world");
w.close();
System.out.println(badname.getAbsolutePath()+" length="+badname.length());
The output shows that the file has been created and has the expected length, but in C:\Temp all I can see is a file called "a" with 0 length. Where is java putting the file?
What I'm looking for is a reliable way to throw an error when the file can't be created. I can't use exists() or length() - what other options are there?
In that particular example, the data is being written to a named stream. You can see the data you've written from the command line as follows:
more < .\a:b
For information about valid file names, look here.
To answer your specific question: exists() should be sufficient. Even in this case, after all, the data is being written to the designated location - it just wasn't where you expected it to be! If you think this case will cause problems for your users, check for the presence of a colon in the file name.
I would suggest looking at Regular Expressions. They allow you to break apart a string and see if certain characteristics apply. The other method that would work is splitting the String into a char[], and then processing each point to see what's in it, and if it's legal... but I think RegEx would work much better.
You should take a look at Regular Expressions and create a pattern which will match any illegal character, something like this:
String fileName = "...";
Pattern pattern = Pattern.compile("[:;!?]");
Matcher matcher = pattern.matcher(fileName);
if (matcher.find())
{
//Do something when the file name has an illegal character.
}
Note: I have not tested this code, but it should be enough to get you on the right track. The above code will match any string which contains a :, ;, ! or ?. Feel free to add or remove characters as you see fit.
You can use File.renameTo(File dest);.
Get the file name first:
String fileName = fullPath.substring(fullPath.lastIndexOf('\\') + 1);
Create an array of all special chars not allowed in file names.
For each char in the array, check whether fileName contains it. I guess Java has a pre-built API for it.
Check this.
Note: This solution assumes that parent directory exists
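The character-by-character check described above might look like the following sketch. The character set is the one Windows reserves in file names (plus control characters); the class and method names are made up, and the check applies to a bare file name, not to a full path.

```java
class FileNameCheck {
    // Characters Windows does not allow in a file name: < > : " / \ | ? *
    private static final String RESERVED = "<>:\"/\\|?*";

    // True if the name contains a reserved or control character.
    static boolean hasReservedChar(String fileName) {
        for (int i = 0; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (c < 32 || RESERVED.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasReservedChar("a:b"));   // true
        System.out.println(hasReservedChar("a.txt")); // false
    }
}
```

Running this before opening the FileWriter lets the program reject "a:b" up front instead of silently writing to a named stream.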

Pattern match numbers/operators

Hey, I've been trying to figure out why this regular expression isn't matching correctly.
List l_operators = Arrays.asList(Pattern.compile("(\\d+)").split(rtString.trim()));
The input string is "12+22+3"
The output I get is -- [,+,+]
There's a match at the beginning of the list which shouldn't be there? I really can't see it and I could use some insight. Thanks.
Well, technically, there is an empty string in front of the first delimiter (first sequence of digits). If you had, say a line of CSV, such as abc,def,ghi and another one ,jkl,mno you would clearly want to know that the first value in the second string was the empty string. Thus the behaviour is desirable in most cases.
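The following short demonstration shows the behaviour described here on both the CSV line and the asker's input. The class name is arbitrary.

```java
import java.util.Arrays;

class SplitLeadingEmpty {
    public static void main(String[] args) {
        // The empty first field is kept, which is what one wants for CSV.
        System.out.println(Arrays.toString(",jkl,mno".split(",")));      // [, jkl, mno]
        // Same rule: the input starts with a delimiter (the digits "12"),
        // so the first element is the empty string in front of it.
        System.out.println(Arrays.toString("12+22+3".split("\\d+")));   // [, +, +]
    }
}
```

Trailing empty strings are dropped by split, which is why there is no empty element after the final "3".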
For your particular case, you need to deal with it manually, or refine your regular expression somehow. Like this for instance:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(rtString);
if (m.find()) {
List l_operators = Arrays.asList(p.split(rtString.substring(m.end()).trim()));
// ...
}
Ideally, however, you should be using a parser for these types of strings. You can't, for instance, deal with parentheses in expressions using just regular expressions.
That's the behavior of split in Java. You just have to take it (and deal with it) or use other library to split the string. I personally try to avoid split from Java.
An example of one alternative is to look at Splitter from Google Guava.
Try Guava's Splitter.
Splitter.onPattern("\\d+").omitEmptyStrings().split(rtString)
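If adding Guava is not an option, roughly the same effect can be had with the standard library by dropping empty tokens after the split. This is a sketch with invented names, not a replacement for Splitter's full feature set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class OperatorSplitter {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Splits on runs of digits and skips the empty strings split() may produce.
    static List<String> operators(String expression) {
        List<String> result = new ArrayList<>();
        for (String token : DIGITS.split(expression.trim())) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(operators("12+22+3")); // [+, +]
    }
}
```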

Java Regex - exclude empty tags from xml

Let's say I have three XML strings:
String logToSearch = "<abc><number>123456789012</number></abc>";
String logToSearch2 = "<abc><number xsi:type=\"soapenc:string\" /></abc>";
String logToSearch3 = "<abc><number /></abc>";
I need a pattern which finds the number tag only if the tag contains a value, i.e. a match should be found only in logToSearch.
I'm not saying I'm looking for the number itself, but rather that the matcher.find method should return true only for the first string.
For now I have this:
Pattern pattern = Pattern.compile("<(" + patternString + ").*?>",
Pattern.CASE_INSENSITIVE);
where patternString is simply "number". I tried "<(" + patternString + ")[^/>].*?>" instead, but it didn't work because in [^/>] each character is treated separately.
Thanks
This is absolutely the wrong way to parse XML. In fact, if you need more than just the basic example given here, there's provably no way to solve the more complex cases with regex.
Use an easy XML parser, like XOM. Now, using xpath, query for the elements and filter those without data. I can only imagine that this question is a precursor to future headaches unless you modify your approach right now.
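As a rough illustration of the parser approach, the sketch below uses the DOM parser bundled with the JDK rather than XOM, and a plain loop rather than XPath. The class and method names are made up; the three inputs are the strings from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NonEmptyTagCheck {
    // True if at least one element named tagName has non-blank text content.
    static boolean hasNonEmptyElement(String xml, String tagName) {
        try {
            // The default factory is not namespace-aware, so the undeclared
            // xsi: prefix in logToSearch2 is treated as part of the attribute name.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            for (int i = 0; i < nodes.getLength(); i++) {
                if (!nodes.item(i).getTextContent().trim().isEmpty()) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String logToSearch = "<abc><number>123456789012</number></abc>";
        String logToSearch2 = "<abc><number xsi:type=\"soapenc:string\" /></abc>";
        String logToSearch3 = "<abc><number /></abc>";
        System.out.println(hasNonEmptyElement(logToSearch, "number"));  // true
        System.out.println(hasNonEmptyElement(logToSearch2, "number")); // false
        System.out.println(hasNonEmptyElement(logToSearch3, "number")); // false
    }
}
```

Note that, unlike the CASE_INSENSITIVE pattern in the question, element names are matched case-sensitively here.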
So a search for "<number[^/>]*>" would find the opening tag. If you want to be sure it isn't empty, try "<number[^/>]*>[^<]" or "<number[^/>]*>[0-9]"
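To check the suggested pattern against the three strings from the question, something like this can be used. The class and method names are invented, and CASE_INSENSITIVE is carried over from the asker's code.

```java
import java.util.regex.Pattern;

class NonEmptyTagRegex {
    // Opening <number ...> tag that is not self-closing, followed by a non-'<' character.
    private static final Pattern NON_EMPTY_NUMBER =
            Pattern.compile("<number[^/>]*>[^<]", Pattern.CASE_INSENSITIVE);

    static boolean hasValue(String log) {
        return NON_EMPTY_NUMBER.matcher(log).find();
    }

    public static void main(String[] args) {
        System.out.println(hasValue("<abc><number>123456789012</number></abc>"));          // true
        System.out.println(hasValue("<abc><number xsi:type=\"soapenc:string\" /></abc>")); // false
        System.out.println(hasValue("<abc><number /></abc>"));                             // false
    }
}
```

Because [^/>]* cannot cross a slash, an opening tag whose attribute value contains "/" would be missed, and a tag such as <numbers> would also match; both are limits of the regex approach.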

Java regex to retain specific closing tags

I'm trying to write a regex to remove all but a handful of closing xml tags.
The code seems simple enough:
String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");
However, when this runs, it skips the "xml" closing tag. It seems to skip any tag where there is a matching character in the compiled group (a|em|li), i.e. if I remove the "l" from "li", it works.
I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).
You probably shouldn't use regex for this task, but let's see what happens...
Your problem is that you are using a negative character class, and inside character classes you can't write complex expressions - only characters. You could try a negative lookahead instead:
"</(?!a|em|li).*?>"
But this won't handle a number of cases correctly:
Comments containing things that look like tags.
Tags as strings in attributes.
Tags that start with a, em, or li but are actually other tags.
Capital letters.
etc...
You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
I would really use a proper parser for this (e.g. JTidy). You can't parse XML/HTML using regular expressions as it's not regular, and no end of edge cases abound. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable 3rd party library (see above) and configure your output accordingly.
See this answer for more passionate info re. parsing XML/HTML via regexps.
You cannot use an alternation inside a character class. A character class always matches a single character.
You likely want to use a negative lookahead or lookbehind instead:
"</(?!a|em|li).*?>"
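Here is a small sketch of the lookahead pattern applied to the asker's input. The second method is a stricter variation that is not taken from the answers: it only spares a tag when the name is exactly a, em or li, which addresses the "tags that start with a" caveat. Class and method names are made up.

```java
class ClosingTagStripper {
    // Pattern from the answers: skips any closing tag whose name starts with a, em or li.
    static String stripLoose(String s) {
        return s.replaceAll("</(?!a|em|li).*?>", "");
    }

    // Variation (an assumption, not from the answers): keeps only </a>, </em> and </li>.
    static String stripStrict(String s) {
        return s.replaceAll("</(?!(?:a|em|li)\\s*>)[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLoose("<body><xml>some stuff</xml></body>"));
        // <body><xml>some stuff
        System.out.println(stripLoose("<a>x</a><address>y</address>"));
        // <a>x</a><address>y</address>   (</address> survives because it starts with "a")
        System.out.println(stripStrict("<a>x</a><address>y</address>"));
        // <a>x</a><address>y
    }
}
```

Neither version copes with comments, attribute values or upper-case tag names, so the advice to use a parser still applies.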
