Pattern.matches() against a char array without cast to String in Java

Scenario
I need to check a regex pattern against a character array (char[]). I am not allowed to cast the character array to a String, because of security considerations. Java's Pattern.matches() method is designed to take a pattern and a String. Also, the regex pattern is passed to me from another source, and will change (is not constant).
This does not work:
// This pattern comes from another source, that I do not control. It may change.
String pattern = "^(.)\\1+$";
char[] exampleArray = new char[4];
exampleArray[0] = 'b';
exampleArray[1] = 'l';
exampleArray[2] = 'a';
exampleArray[3] = 'h';
// This is the call I want to make, but it does not compile: Pattern.matches() has no overload that accepts a char[].
boolean matches = Pattern.matches(pattern, exampleArray);
Thoughts
I attempted to deconstruct the regex pattern and examine the array for each part of the pattern, but the conditional logic required to interpret each part of the pattern thwarted me. For example: Suppose the pattern contains something like "(.){5,10}". Then I only need to check the char[] length. However, if it contains "^B(.){5,10}X", then I need to do something very different. It feels like there are too many possibilities to effectively deconstruct the regex pattern and account for each possibility (which is exactly why I've always just used Pattern.matches()).
Question
What would be the most efficient way of checking a regex pattern against a character array without casting the character array to a String, or creating a String?

Pattern.matches accepts a general CharSequence. You can use for example CharBuffer from java.nio instead of String.
boolean matches = Pattern.matches(pattern, CharBuffer.wrap(exampleArray));
CharBuffer.wrap will not create an extra copy of the password in memory, so of all the options it's the safest.
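As a minimal sketch of the CharBuffer approach, the example below wraps the array and matches it against the question's pattern. The class name CharArrayRegex and the helper matches are made up for illustration, and the final Arrays.fill calls are an assumption about how the caller wants to clear the data afterwards.

```java
import java.nio.CharBuffer;
import java.util.Arrays;
import java.util.regex.Pattern;

class CharArrayRegex {

    // Matches the whole array against the regex without building a String.
    static boolean matches(String regex, char[] chars) {
        // CharBuffer implements CharSequence; wrap() shares the array instead of copying it.
        return Pattern.matches(regex, CharBuffer.wrap(chars));
    }

    public static void main(String[] args) {
        String pattern = "^(.)\\1+$"; // one character repeated at least twice
        char[] repeated = {'a', 'a', 'a', 'a'};
        char[] mixed = {'b', 'l', 'a', 'h'};

        System.out.println(matches(pattern, repeated)); // true
        System.out.println(matches(pattern, mixed));    // false

        // Overwrite the sensitive data once it is no longer needed.
        Arrays.fill(repeated, '\0');
        Arrays.fill(mixed, '\0');
    }
}
```

Note that calling methods such as Matcher.group() would create String objects from the matched text, so this sketch sticks to a plain boolean match.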

If someone has access to the machine's memory then the problems can get far beyond the uncovering of passwords.
boolean matches = Pattern.matches(pattern, new String(exampleArray));

Related

Regex: Ignoring numbers

I am trying to write a regex that tries to match on a specific string, but ignores all numbers in the target string - So my regex could be 'MyDog', but it should match MyDog, as well as My11Dog and MyDog1 etc. I could write something like
M\d*y\d*D\d*o\d*g\d*
But that is pretty painful. Any ideas out there? I am using Java, and cannot change what is in the string, because I need to retrieve it as is.
A regular expression can do this, but why not let Java do the work by stripping the digits from a copy first? (I don't write Java myself.)
String s1 = "0My1D2og3";
// replaceAll returns a new String, so s1 itself is left as is.
String s2 = s1.replaceAll("\\d", "");
if (s2.equals("MyDog")) {
    // Do something
}
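If the regex route is preferred, the painful hand-written pattern can be generated instead of typed. The sketch below is one possible way to do that: it puts \d* before, between, and after the characters of the search term. The class and method names are hypothetical, and it assumes the search term contains no characters outside the Basic Multilingual Plane.

```java
import java.util.regex.Pattern;

class IgnoreDigits {

    // Builds a pattern that matches `literal` while allowing any run of digits
    // before, between, and after its characters.
    static Pattern digitTolerant(String literal) {
        StringBuilder regex = new StringBuilder("\\d*");
        for (char c : literal.toCharArray()) {
            // Pattern.quote keeps regex metacharacters in the literal harmless.
            regex.append(Pattern.quote(String.valueOf(c))).append("\\d*");
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = digitTolerant("MyDog");
        System.out.println(p.matcher("My11Dog").matches()); // true
        System.out.println(p.matcher("MyDog1").matches());  // true
        System.out.println(p.matcher("MyCat").matches());   // false
    }
}
```

The target string is never modified, so it can still be retrieved as is after the match.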

How can I use hash sets in java to determine if a string contains valid characters?

I'm writing a lexical analyzer and have never used hash sets. I want to take a string and make sure it's legal. I think I understand how to build the hash set with valid characters, but I'm not sure how to compare the string with the hash set to ensure it contains only valid characters. I can't find an example anywhere. Can someone point me to code that would do this?
HashSet has the function contains() for this, since it implements the Collection interface.
You cannot compare an entire string to a HashSet<Character>, but you can do it one character at a time:
HashSet<Character> valid = new HashSet<Character>();
valid.add('a');
valid.add('d');
valid.add('f');
boolean allOk = true;
for (char c : "fad".toCharArray()) {
if (!valid.contains(c)) {
allOk = false;
break;
}
}
System.out.println(allOk);
However, this is not the most efficient way of doing it. A better approach would be to construct a regex with the characters that you need, and call matches() on the string:
// Let's say x, y, and z are the valid characters
String regex = "[xyz]*";
if (myString.matches(regex)) {
System.out.println("All characters in the string are in 'x', 'y', and 'z'");
}
I think you are probably over-thinking this problem. (For instance, spending too much time thinking how to make the lexer "efficient" ...)
The conventional ways to test for valid / invalid characters in a lexer are:
use a big switch statement, or
perform a sequence of "character class" tests; e.g. using the result of Character.getType(char)
Or better still, use a lexer generator.
Using a HashSet is neither more efficient nor more readable than a switch. And the "character class" approach could be a lot more readable than both ... depending on your validation rules.
But if I haven't convinced you, see #blinkenlights' Answer :-)
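To make the switch and "character class" suggestions concrete, here is a rough sketch that combines them. The set of legal characters (letters, digits, underscore, a few operators, parentheses, and space) is purely an assumption for illustration, as are the class and method names; a real lexer would use its own rules.

```java
class CharValidator {

    // Hypothetical rule set: letters and digits via a character class test,
    // everything else via an explicit switch.
    static boolean isLegal(char c) {
        if (Character.isLetterOrDigit(c)) {
            return true;
        }
        switch (c) {
            case '_': case '+': case '-': case '*': case '/':
            case '(': case ')': case ' ':
                return true;
            default:
                return false;
        }
    }

    // Returns true only if every character in the input is legal.
    static boolean allLegal(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (!isLegal(input.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With these assumed rules, allLegal("x1 + (y_2 * 3)") returns true and allLegal("x # y") returns false.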

How to retrieve portion of number that's within parenthesis in Java?

For part of my Java assignment I'm required to select all records that have a certain area code. I have custom objects within an ArrayList, like ArrayList<Foo>.
Each object has a String phoneNumber variable. They are formatted like "(555) 555-5555"
My goal is to search through each custom object in the ArrayList<Foo> (call it listOfFoos) and place the objects with area code "616" in a temporaryListOfFoos ArrayList<Foo>.
I have looked into tokenizers, but was unable to get the syntax correct. I feel like what I need to do is similar to the post "Ignore parentheses with string tokenizer?", but since I'm only trying to retrieve the first 3 digits (and I don't care about the remaining 7), it didn't give me exactly what I was looking for.
What I did as a temporary work-around, was...
for (int i = 0; i<listOfFoos.size();i++){
if (listOfFoos.get(i).getPhoneNumber().contains("616")){
tempListOfFoos.add(listOfFoos.get(i));
}
}
This worked for our current dataset, however, if there was a 616 anywhere else in the phone numbers [like "(555) 616-5555"] it obviously wouldn't work properly.
If anyone could give me advice on how to retrieve only the first 3 digits, while ignoring the parentheses, I would greatly appreciate it.
You have two options:
Use value.startsWith("(616)") or,
Use regular expressions with the pattern "^\\(616\\).*" (written as a Java string literal).
The first option will be a lot quicker.
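A minimal sketch of the startsWith option applied to the question's loop might look like this. The nested Foo class is only a stand-in for the asker's own class, and the method name withAreaCode is made up; the sketch assumes every phone number is formatted like "(555) 555-5555".

```java
import java.util.ArrayList;
import java.util.List;

class AreaCodeFilter {

    // Stand-in for the question's Foo class (hypothetical).
    static class Foo {
        private final String phoneNumber;

        Foo(String phoneNumber) {
            this.phoneNumber = phoneNumber;
        }

        String getPhoneNumber() {
            return phoneNumber;
        }
    }

    // Returns the objects whose phone number begins with "(areaCode)".
    static List<Foo> withAreaCode(List<Foo> listOfFoos, String areaCode) {
        String prefix = "(" + areaCode + ")";
        List<Foo> tempListOfFoos = new ArrayList<>();
        for (Foo foo : listOfFoos) {
            if (foo.getPhoneNumber().startsWith(prefix)) {
                tempListOfFoos.add(foo);
            }
        }
        return tempListOfFoos;
    }
}
```

Because the check is anchored to the start of the string, a number such as "(555) 616-5555" is no longer picked up when searching for "616".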
areaCode = number.substring(number.indexOf('(') + 1, number.indexOf(')')).trim() should do the job for you, given the formatting of phone numbers you have.
Or if you don't have any extraneous spaces, just use areaCode = number.substring(1, 4).
I think what you need is a capturing group. Have a look at the Groups and capturing section in this document.
Once you have matched the input against a pattern (for example "\\((\\d+)\\) \\d+-\\d+"), you can get the number in the parentheses from the matcher (a java.util.regex.Matcher object) with matcher.group(1).
You could use a regular expression as shown below. The pattern will ensure the entire phone number conforms to your pattern ((XXX) XXX-XXXX) plus grabs the number within the parentheses.
int areaCodeToSearch = 555;
String pattern = String.format("\\((%d)\\) \\d{3}-\\d{4}", areaCodeToSearch);
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(phoneNumber);
if (m.matches()) {
String areaCode = m.group(1);
// ...
}
Whether you choose to use a regular expression versus a simple String lookup (as mentioned in other answers) will depend on how bothered you are about the format of the entire string.

Best way to create SEO friendly URI string

The method should allow only "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-" chars in URI strings.
What is the best way to make nice SEO URI string?
This is what the general consensus is:
Lowercase the string.
string = string.toLowerCase();
Normalize all characters and get rid of all diacritical marks (so that e.g. é, ö, à become e, o, a).
string = Normalizer.normalize(string, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
Replace all remaining non-alphanumeric characters by - and collapse when necessary.
string = string.replaceAll("[^\\p{Alnum}]+", "-");
So, summarized:
public static String toPrettyURL(String string) {
return Normalizer.normalize(string.toLowerCase(), Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
.replaceAll("[^\\p{Alnum}]+", "-");
}
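For reference, a self-contained variant of the three steps with its imports could look like the sketch below. Two details are additions not present in the steps above and should be treated as assumptions: Locale.ROOT is passed to toLowerCase so the result does not depend on the default locale, and a final replaceAll trims hyphens left at either end. The class name Slugs is made up.

```java
import java.text.Normalizer;
import java.util.Locale;

class Slugs {

    static String toPrettyURL(String input) {
        // Steps 1 and 2: lowercase, decompose accented characters, drop the combining marks.
        String ascii = Normalizer.normalize(input.toLowerCase(Locale.ROOT), Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 3: collapse every run of non-alphanumeric characters into a single hyphen.
        return ascii.replaceAll("[^\\p{Alnum}]+", "-")
                // Extra step (assumption): remove hyphens at the start and end.
                .replaceAll("^-+|-+$", "");
    }
}
```

With this variant, "Hello, World!" becomes "hello-world" rather than "hello-world-".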
The following regex will do the same thing as your algorithm. I'm not aware of libraries for doing this type of thing.
String s = input
.replaceAll(" ?- ?","-") // remove spaces around hyphens
.replaceAll("[ ']","-") // turn spaces and quotes into hyphens
.replaceAll("[^0-9a-zA-Z-]",""); // remove everything not in our allowed char set
These are commonly called "slugs" if you want to search for more information.
You may want to check out other answers such as How can I create a SEO friendly dash-delimited url from a string? and How to make Django slugify work properly with Unicode strings?
They cover C# and Python more than javascript but have some language-agnostic discussion about slug conventions and issues you may face when making them (such as uniqueness, unicode normalization problems, etc).

Pattern match numbers/operators

Hey, I've been trying to figure out why this regular expression isn't matching correctly.
List l_operators = Arrays.asList(Pattern.compile("(\\d+)").split(rtString.trim()));
The input string is "12+22+3"
The output I get is -- [,+,+]
There's a match at the beginning of the list which shouldn't be there? I really can't see it and I could use some insight. Thanks.
Well, technically, there is an empty string in front of the first delimiter (first sequence of digits). If you had, say a line of CSV, such as abc,def,ghi and another one ,jkl,mno you would clearly want to know that the first value in the second string was the empty string. Thus the behaviour is desirable in most cases.
For your particular case, you need to deal with it manually, or refine your regular expression somehow. Like this for instance:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(rtString);
if (m.find()) {
List l_operators = Arrays.asList(p.split(rtString.substring(m.end()).trim()));
// ...
}
Ideally, however, you should be using a parser for this type of string. You can't, for instance, deal with parentheses in expressions using just regular expressions.
That's the behavior of split in Java. You just have to take it (and deal with it) or use another library to split the string. I personally try to avoid Java's split.
An example of one alternative is to look at Splitter from Google Guava.
Try Guava's Splitter.
Splitter.onPattern("\\d+").omitEmptyStrings().split(rtString)
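If adding Guava is not an option, roughly the same effect can be had with the standard library by dropping the empty tokens after splitting. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class OperatorSplit {

    // Splits on runs of digits and keeps only the non-empty pieces (the operators).
    static List<String> operators(String expression) {
        List<String> result = new ArrayList<>();
        for (String token : Pattern.compile("\\d+").split(expression.trim())) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }
}
```

For the input "12+22+3" this yields a list containing the two "+" operators, without the leading empty string.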
