trim all 'spaces' from String - java

I am parsing a PDF and getting a lot of Strings containing \t, \r, \n, \s... They appear on both ends of the String, in no particular order. So I can have, for example:
"\t\s\t\nSome important data I need surrounded by useless data \r\t\s\s\r\t\t"
Is there an efficient way to trim these Strings?
Here is what I have so far, which isn't good enough because I want to keep some symbols:
public static String trimToLetters(String sourceString) {
int beginIndex = 0;
int endIndex = sourceString.length() - 1;
Pattern p = Pattern.compile("[A-Z_a-z\\;\\.\\(\\)\\*\\?\\:\\\"\\']");
Matcher matcher = p.matcher(sourceString);
if (matcher.find()) {
if (matcher.start() >= 0) {
beginIndex = matcher.start();
StringBuilder sb = new StringBuilder(sourceString);
String sourceReverse = sb.reverse().toString();
matcher = p.matcher(sourceReverse);
if (matcher.find()) {
endIndex = sourceString.length() - matcher.start();
}
}
}
return sourceString.substring(beginIndex, endIndex);
}

The trim method of the String should be able to remove all whitespace from both ends of the string:
trim: Returns a copy of the string, with leading and trailing whitespace omitted.
P.S. \s is not a valid escape sequence in Java string literals before Java 15 (from Java 15 on it denotes a single space).
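A minimal sketch of the trim approach, assuming the unwanted characters are ordinary tabs, carriage returns, newlines, and spaces. The class name TrimDemo and the helper clean are made up for illustration:

```java
final class TrimDemo {
    // String.trim() drops every leading and trailing character whose
    // code point is <= U+0020, which covers space, \t, \r and \n.
    static String clean(String raw) {
        return raw.trim();
    }

    public static void main(String[] args) {
        String raw = "\t \t\nSome important data I need surrounded by useless data \r\t  \r\t\t";
        System.out.println("[" + clean(raw) + "]");
    }
}
```

Note that trim() does not remove a non-breaking space (U+00A0), which PDF extraction sometimes produces; if those show up, a regex or a manual scan would be needed instead.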

Related

regex doesn't find last word in my string

I have a regular expression [a-z]\d to unpack text which is compressed by a simple rule:
hellowoooorld -> hel2owo4rld
So now I have to unpack my text, and it doesn't work correctly. It can't find the last word in my String;
it always skips gu4ys
StringBuilder text = new StringBuilder("Hel2o peo7ple it is ou6r wo3rld gu4ys");
Pattern pattern = Pattern.compile("[a-z]\\d");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group());
int startWord = matcher.start();
int numLetters = Integer.parseInt(text.substring(startWord + 1, startWord + 2));
text.deleteCharAt(startWord + 1);
for (int i = 0; i < numLetters - 1; ++i) {
text.insert(startWord + 1, text.charAt(startWord));
}
}
System.out.println(text);
Result is : Hello peooooooople it is ouuuuuur wooorld gu4ys
I expect this : Hello peooooooople it is ouuuuuur wooorld guuuuys
I can't understand why it doesn't work; it all seems simple.
It seems like Java's Matcher records the length of your input when it is initialized, and doesn't search past that. You are inserting into the StringBuilder, which makes it longer, so the matcher never reaches the part that was pushed beyond the original length.
A quick, though slow, fix is to re-initialize the matcher every time.
StringBuilder text = new StringBuilder("Hel2o peo7ple it is ou6r wo3rld gu4ys");
Pattern pattern = Pattern.compile("[a-z]\\d");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
int startWord = matcher.start();
int numLetters = Integer.parseInt(text.substring(startWord + 1, startWord + 2));
text.deleteCharAt(startWord + 1);
for (int i = 0; i < numLetters - 1; ++i) {
text.insert(startWord + 1, text.charAt(startWord));
}
matcher = pattern.matcher(text);
}
System.out.println(text);
A faster approach would find the numbers, calculate the string length and then manually construct the string using the found numbers.
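One way to sketch that faster idea is to leave the input untouched and build the result in a separate StringBuilder in a single pass, so the matcher never sees a changing input. The class name Unpacker and method unpack are hypothetical, and the sketch assumes a single digit after each letter, as in the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Unpacker {
    private static final Pattern LETTER_DIGIT = Pattern.compile("[a-z]\\d");

    // Expands each "letter + digit" pair into the letter repeated <digit> times.
    static String unpack(String packed) {
        Matcher m = LETTER_DIGIT.matcher(packed);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previous match in the input
        while (m.find()) {
            out.append(packed, last, m.start()); // copy text between matches unchanged
            char letter = packed.charAt(m.start());
            int count = packed.charAt(m.start() + 1) - '0';
            for (int i = 0; i < count; i++) {
                out.append(letter);
            }
            last = m.end();
        }
        out.append(packed, last, packed.length()); // copy the tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unpack("Hel2o peo7ple it is ou6r wo3rld gu4ys"));
    }
}
```

Because the input String is never modified, the matcher's view of it stays valid for the whole loop. A digit 0 would drop the letter entirely here, which differs from the loop in the question; the compression rule should never produce it.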
The issue is probably that the matcher is only finding the pattern [a-z]\d, which matches a single letter followed by a digit, but it is not finding the last word "gu4ys" because it doesn't match that pattern.
To fix this, you can modify the regular expression to include an optional group that matches any remaining letters at the end of the text.
Try this regex and please let me know if it worked :)
"[a-z]\d|[a-z]+"

Java - Regex split decimal, minus, and math operation

I need to split the expression so that h[0] is the first number ("-12.0"), h[1] is the operation symbol (+) and h[2] is the second number (-15.3), but I don't know how to do it.
a=12.0+-15.3;
h = a.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
Could somebody help me?
You may use a regex to match * or / anywhere in the string, and - and + only when they come after a digit. In a math expression, a binary + or - follows a word character, so, basically, you may check for a word boundary on the left: [/*]|\b[-+].
Then just split and keep the matches:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static final Pattern regex = Pattern.compile("[/*]|\\b[-+]");
public static List<String> split(String s, Pattern pattern) {
    Matcher m = pattern.matcher(s);
    List<String> ret = new ArrayList<String>();
    int start = 0;
    while (m.find()) {
        ret.add(s.substring(start, m.start())); // operand before the operator
        ret.add(m.group());                     // the operator itself
        start = m.end();
    }
    if (start < s.length()) {                   // keep the trailing operand
        ret.add(s.substring(start));
    }
    return ret;
}
Usage example:
String s = "12.0+-15.3*-45.7/+67.9";
List<String> res = split(s, regex);
System.out.println(res);
// => [12.0, +, -15.3, *, -45.7, /, +67.9]
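If you would rather stay with String.split, as in the question, one possible pattern is sketched below. It is not taken from the answer above: it assumes every operand ends in a digit and that a sign directly after an operator (or at the very start) belongs to the following number. The class name ExpressionSplit and method tokens are illustrative:

```java
import java.util.Arrays;

final class ExpressionSplit {
    // First alternative: split between a digit and the operator that follows it.
    // Second alternative: split right after an operator that itself follows a digit.
    // A sign that comes directly after an operator therefore stays with its number.
    static String[] tokens(String expr) {
        return expr.split("(?<=\\d)(?=[-+*/])|(?<=\\d[-+*/])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("12.0+-15.3")));
        // prints [12.0, +, -15.3]
    }
}
```

Both alternatives are zero-width, so split removes no characters and the operator survives as its own element.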

How to get String between last two underscore

I have a string "abcde-abc-db-tada_x12.12_999ZZZ_121121.333"
The result I want should be 999ZZZ
I have tried using:
private static String getValue(String myString) {
Pattern p = Pattern.compile("_(\\d+)_1");
Matcher m = p.matcher(myString);
if (m.matches()) {
System.out.println(m.group(1)); // Should print 999ZZZ
}
else {
System.out.println("not found");
}
}
If you want to continue with a regex based approach, then use the following pattern:
.*_([^_]+)_.*
This will greedily consume up to and including the second-to-last underscore. Then it will consume and capture 999ZZZ.
Code sample:
String name = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
Pattern p = Pattern.compile(".*_([^_]+)_.*");
Matcher m = p.matcher(name);
if (m.matches()) {
System.out.println(m.group(1)); // Should print 999ZZZ
} else {
System.out.println("not found");
}
Using String.split?
String given = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String [] splitted = given.split("_");
String result = splitted[splitted.length-2];
System.out.println(result);
Apart from split you can use substring as well:
String s = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String ss = (s.substring(0,s.lastIndexOf("_"))).substring((s.substring(0,s.lastIndexOf("_"))).lastIndexOf("_")+1);
System.out.println(ss);
OR,
String s = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String arr[] = s.split("_");
System.out.println(arr[arr.length-2]);
To get the text between the last two underscore characters, you first need to find the indexes of the last two underscores, which is very easy using lastIndexOf:
String s = "abcde-abc-db-tada_x12.12_999ZZZ_121121.333";
String r = null;
int idx1 = s.lastIndexOf('_');
if (idx1 != -1) {
int idx2 = s.lastIndexOf('_', idx1 - 1);
if (idx2 != -1)
r = s.substring(idx2 + 1, idx1);
}
System.out.println(r); // prints: 999ZZZ
This is faster than any solution using regex, including use of split.
I misread the logic of the code in the question at first, and in the meantime some great regex answers appeared, so here is my attempt using only methods of the String class (it introduces a few variables for readability; it could of course be written more compactly):
String s = "abcde-abc-db-ta__dax12.12_999ZZZ_121121.333";
int indexOfLastUnderscore = s.lastIndexOf("_");
int indexOfOneBeforeLastUnderscore = s.lastIndexOf("_", indexOfLastUnderscore - 1);
if(indexOfLastUnderscore != -1 && indexOfOneBeforeLastUnderscore != -1) {
String sub = s.substring(indexOfOneBeforeLastUnderscore + 1, indexOfLastUnderscore);
System.out.println(sub);
}

Finding longest regex match in Java?

I have this:
import java.util.regex.*;
String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
String s = "hello world";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
System.out.println(substring);
}
The above only prints hello whereas I want it to print hello world.
One way to fix this is to re-order the groups in String regex = "(?<m2>(hello world))|(?<m1>(hello|universe))" but I don't have control over the regex I get in my case...
So what is the best way to find the longest match? An obvious way would be to check all possible substrings of s as mentioned here (Efficiently finding all overlapping matches for a regular expression) by length and pick the first but that is O(n^2). Can we do better?
Here is a way of doing it using matcher regions, but with a single loop over the string index:
public static String findLongestMatch(String regex, String s) {
Pattern pattern = Pattern.compile("(" + regex + ")$");
Matcher matcher = pattern.matcher(s);
String longest = null;
int longestLength = -1;
for (int i = s.length(); i > longestLength; i--) {
matcher.region(0, i);
if (matcher.find() && longestLength < matcher.end() - matcher.start()) {
longest = matcher.group();
longestLength = longest.length();
}
}
return longest;
}
I'm forcing the pattern to match until the region's end, and then I move the region's end from the rightmost string index towards the left. For each region's end tried, Java will match the leftmost starting substring that finishes at that region's end, i.e. the longest substring that ends at that place. Finally, it's just a matter of keeping track of the longest match found so far.
As a matter of optimization, and since I start from the longer regions towards the shorter ones, I stop the loop as soon as all regions that would come after are already shorter than the length of longest substring already found.
An advantage of this approach is that it can deal with arbitrary regular expressions and no specific pattern structure is required:
findLongestMatch("(?<m1>(hello|universe))|(?<m2>(hello world))", "hello world")
==> "hello world"
findLongestMatch("hello( universe)?", "hello world")
==> "hello"
findLongestMatch("hello( world)?", "hello world")
==> "hello world"
findLongestMatch("\\w+|\\d+", "12345 abc")
==> "12345"
If you are dealing with just this specific pattern structure:
There are one or more named groups at the top level, connected by |.
The regex of each group is wrapped in superfluous parentheses.
Inside those parentheses are one or more literals connected by |.
Literals never contain |, ( or ).
Then it is possible to write a solution by extracting the literals, sorting them by length (longest first) and returning the first one that occurs in the input:
private static final Pattern g = Pattern.compile("\\(\\?\\<[^>]+\\>\\(([^)]+)\\)\\)");
public static final String findLongestMatch(String s, Pattern p) {
Matcher m = g.matcher(p.pattern());
List<String> literals = new ArrayList<>();
while (m.find())
Collections.addAll(literals, m.group(1).split("\\|"));
Collections.sort(literals, new Comparator<String>() {
public int compare(String a, String b) {
return Integer.compare(b.length(), a.length());
}
});
for (Iterator<String> itr = literals.iterator(); itr.hasNext();) {
String literal = itr.next();
if (s.indexOf(literal) >= 0)
return literal;
}
return null;
}
Test:
System.out.println(findLongestMatch(
"hello world",
Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: hello world
System.out.println(findLongestMatch(
"hello universe",
Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: universe
Just add $ (end of string) before the alternation separator |.
The regex then checks whether the string ends at that point. If it does, that alternative matches; otherwise that part of the regex is skipped.
The code below gives what you want:
import java.util.regex.*;
public class RegTest{
public static void main(String[] arg){
String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))";
String s = "hello world";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
System.out.println(substring);
}
}
}
Likewise, the code below skips hello and hello world, and matches hello world there.
Note the use of $:
import java.util.regex.*;
public class RegTest{
public static void main(String[] arg){
String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))$|(?<m3>(hello world there))";
String s = "hello world there";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
System.out.println(substring);
}
}
}
If the structure of the regex is always the same, this should work:
String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
String s = "hello world";
//split the regex into the different groups
String[] allParts = regex.split("\\|\\(\\?\\<");
for (int i=1; i<allParts.length; i++) {
allParts[i] = "(?<" + allParts[i];
}
//find the longest string
int longestSize = -1;
String longestString = null;
for (int i=0; i<allParts.length; i++) {
Pattern pattern = Pattern.compile(allParts[i]);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
MatchResult matchResult = matcher.toMatchResult();
String substring = s.substring(matchResult.start(), matchResult.end());
if (substring.length() > longestSize) {
longestSize = substring.length();
longestString = substring;
}
}
}
System.out.println("Longest: " + longestString);

Java, split string by punctuation sign, process string, add punctuation signs back to string

I have a string like this:
Some text, with punctuation sign!
I am splitting it by punctuation signs, using str.split("regex"). Then I process each element (switch characters) in the received array, after splitting.
And I want to add all punctuation signs back to their places. So result should be like this:
Smoe txet, wtih pinctuatuon sgin!
What is the best approach to do that?
How about doing the whole thing in one tiny line?
str = str.replaceAll("(?<=\\b\\w)(.)(.)", "$2$1");
Some test code:
String str = "Some text, with punctuation sign!";
System.out.println(str.replaceAll("(?<=\\b\\w)(.)(.)", "$2$1"));
Output:
Smoe txet, wtih pnuctuation sgin!
Since you aren't adding or removing characters, you may as well just use String.toCharArray():
char[] cs = str.toCharArray();
for (int i = 0; i < cs.length; ) {
    // skip non-letters (punctuation, spaces)
    while (i < cs.length && !Character.isLetter(cs[i])) ++i;
    int start = i;
    // advance to the end of the current word
    while (i < cs.length && Character.isLetter(cs[i])) ++i;
    process(cs, start, i);
}
String result = new String(cs);
where process(char[], int startInclusive, int endExclusive) is a method which jumbles the letters in the array between the indexes.
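For illustration, here is a self-contained sketch of that loop with one possible process. The answer leaves process unspecified, so this version is an assumption: it swaps the second and third letters of each word, which is what the example in the question appears to do. The class name JumbleInPlace is also made up:

```java
final class JumbleInPlace {
    // Hypothetical process(): swaps the 2nd and 3rd letters of the word
    // stored in cs[start, end); shorter words are left alone.
    static void process(char[] cs, int start, int end) {
        if (end - start >= 3) {
            char tmp = cs[start + 1];
            cs[start + 1] = cs[start + 2];
            cs[start + 2] = tmp;
        }
    }

    static String jumble(String str) {
        char[] cs = str.toCharArray();
        int i = 0;
        while (i < cs.length) {
            while (i < cs.length && !Character.isLetter(cs[i])) i++; // skip punctuation and spaces
            int start = i;
            while (i < cs.length && Character.isLetter(cs[i])) i++;  // scan one word
            process(cs, start, i);                                   // punctuation is never touched
        }
        return new String(cs);
    }

    public static void main(String[] args) {
        System.out.println(jumble("Some text, with punctuation sign!"));
        // prints Smoe txet, wtih pnuctuation sgin!
    }
}
```

Since only letters inside a word are moved, every punctuation sign keeps its original index and nothing has to be re-inserted afterwards.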
I'd read through the string character by character.
If the character is punctuation append it to a StringBuilder
If the character is not punctuation keep reading characters until you reach a punctuation character, then process that word and append it to the StringBuilder.
Then skip to that next punctuation character.
This prints, rather than appends to a StringBuilder, but you get the idea:
String sentence = "This is a test, message!";
for (int i = 0; i < sentence.length(); i++) {
    if (Character.isLetter(sentence.charAt(i))) {
        String tmp = "" + sentence.charAt(i);
        // check the bound first so charAt(i + 1) cannot run past the end
        while (i + 1 < sentence.length() && Character.isLetter(sentence.charAt(i + 1))) {
            i++;
            tmp += sentence.charAt(i);
        }
        System.out.print(switchChars(tmp)); // switchChars: your word-processing method
    } else {
        System.out.print(sentence.charAt(i));
    }
}
System.out.println();
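The same walk can append to a StringBuilder instead of printing. The sketch below assumes a switchChars that swaps the second and third characters of a word; that helper is not defined in the answer, so its behavior here is a guess based on the question's example, and the class name JumbleWithBuilder is illustrative:

```java
final class JumbleWithBuilder {
    // Hypothetical switchChars(): swaps the 2nd and 3rd characters of a word.
    static String switchChars(String word) {
        if (word.length() < 3) {
            return word;
        }
        return "" + word.charAt(0) + word.charAt(2) + word.charAt(1) + word.substring(3);
    }

    static String jumble(String sentence) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < sentence.length()) {
            if (Character.isLetter(sentence.charAt(i))) {
                int start = i;
                // read the whole word, checking the bound before each charAt
                while (i < sentence.length() && Character.isLetter(sentence.charAt(i))) {
                    i++;
                }
                out.append(switchChars(sentence.substring(start, i)));
            } else {
                out.append(sentence.charAt(i)); // punctuation and spaces pass through
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(jumble("This is a test, message!"));
        // prints Tihs is a tset, msesage!
    }
}
```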
You can use:
String[] parts = str.split(",");
// processing parts
String str2 = String.join(",", parts);
