How to Identify String Literals from Arrays with Regex on Java? - java

I have an array of Strings named tokenArray. Its contents are the following:
[num1] [;] ["] [This] [is] [a] [\"] [string] [literal] [\"] [.] [?] ["]
Note: the unescaped and escaped double quotation marks are shown exactly as they appear.
Question:
How do I identify the values between the two double quotation marks in the array as a single string literal? I'm using string concatenation to collect temporary lexemes and save the result to a stack when a match is found. When I identified single-line comments earlier, the start and end matches were // and tHiS_iS_tHe_EnD_Of_NeWlInE. How do I apply the same idea with regex and the two double quotation marks shown above, inside the loop in the code below? TIA.
Background:
The samples I am finding all work on a single String declaration, whereas mine is an array. I can't quite grasp how it works with an array of strings.
BTW, I'm making a string analyzer which scans a block of code and outputs the lexemes of a particular language. I have already identified lexemes such as single-line and block comments, the delimiters, and some keywords of the language without regex. But I want to try regex for string literals, which I have not detected yet. Doing the detection with if and else statements was time-consuming and confusing, although I got it working in the end.
Below is the code I am using to identify single-line comments in my array.
The for loop is my entire loop for reading the array and pushing each newly detected lexeme onto a stack.
for (int ctr = 0; ctr < removedNullsStackSize.length; ctr++) {
    if (removedNullsStackSize[ctr].equals("//")) {
        do {
            tempString = tempString + " " + removedNullsStackSize[ctr];
            ctr++;
            // Overwrite the last element with the sentinel so the loop stops at the end of the array
            if (ctr >= removedNullsStackSize.length - 1) {
                removedNullsStackSize[ctr] = "tHiS_iS_tHe_EnD_Of_NeWlInE";
            }
        }
        // Compare string contents with equals(), not references with !=
        while (!removedNullsStackSize[ctr].equals("tHiS_iS_tHe_EnD_Of_NeWlInE"));
        myQCommentsTokenized.add(tempString);
        tempString = "";
    }
}
When the code above detects //, it concatenates the array elements that follow and keeps concatenating until it reaches the newline marker. Once the newline marker is detected, it saves tempString to the stack as the newly found lexeme.

My pattern is:
//Regex for identifying string literals
Pattern strRegex=Pattern.compile("\".*\"");
//Loop your array here to read code
//str is the temporary location of all the code you have
//In mine, I have it inside a text area so I just converted it to a String and start comparing there
//begins matching for string literals described by strRegex
Matcher m = strRegex.matcher(str) ;
Once the matcher has the code, the loop below yields the lexeme of each string literal found in it.
while (m.find()) {
String forReadStr=m.group();
//If the end of the token is a double quote, Do this
//in this loop, you can then declare anything for the lexeme you detected and do anything with it
if(forReadStr.endsWith("\"")){
System.out.println(m.group()+"\n\t -> \t This is a String Literal\n");
}
}
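One caveat with the pattern above: `.*` is greedy, so on a line with two literals it matches from the first quote to the last one, and it does not treat `\"` specially. A minimal sketch of a stricter pattern follows, assuming the code is available as one String; the class and method names (`StringLiteralFinder`, `findLiterals`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringLiteralFinder {
    // An opening quote, then any mix of ordinary characters (neither quote
    // nor backslash) and backslash escapes such as \", then the closing quote.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"");

    // Returns every string literal found in the source, quotes included.
    static List<String> findLiterals(String source) {
        List<String> literals = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(source);
        while (m.find()) {
            literals.add(m.group());
        }
        return literals;
    }

    public static void main(String[] args) {
        String code = "num1; \"This is a \\\"string literal\\\".?\" + \"second\"";
        for (String literal : findLiterals(code)) {
            System.out.println(literal + " -> string literal");
        }
    }
}
```

Because each match stops at the first unescaped quote, the two literals in the sample line come back as two separate lexemes.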

Related

How to add a space after certain characters using regex Java

I have a string consisting of 18 characters, e.g. 'abcdefghijklmnopqr'. I need to add a blank space after the 5th character, then after the 9th character, and after the 15th character, making it look like 'abcde fghi jklmno pqr'. Can I achieve this using a regular expression?
Regular expressions are not my cup of tea, so I need help from the regex gurus out here. Any help is appreciated.
Thanks in advance
Regex finds a match in a string and can't perform a replacement by itself. You could, however, use regex to find a certain matching substring and replace that, but you would still need a separate method for the replacement (making it a two-step algorithm).
Since you're not looking for a pattern in your string, but rather just the n-th char, regex wouldn't be of much use; it would make the solution unnecessarily complex.
Here are some ideas on how you could implement a solution:
Use an array of characters to avoid creating redundant strings: create a character array and copy characters from the string before the given position, put the character at the position, copy the rest of the characters from the String,... continue until you reach the end of the string. After that construct the final string from that array.
Use the substring() method: concatenate substring of the string before the position, new character, substring of the string after the position and before the next position,... and so on, until reaching the end of the original string.
Use a StringBuilder and its insert() method.
Note that:
The first idea listed might not be a suitable solution for very large strings. It needs an auxiliary array, which uses additional space.
The second idea creates redundant strings. Strings are immutable in Java, so every substring or concatenation creates a new String object. Creating temporary strings should be avoided.
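The StringBuilder idea could be sketched as follows. This is only an illustration: `SpaceInserter` and `insertSpaces` are hypothetical names, and the positions are assumed to be given in ascending order as 0-based indices into the original string.

```java
class SpaceInserter {
    // Inserts a space before each given index of the original string.
    // Positions are assumed to be sorted in ascending order.
    static String insertSpaces(String input, int... positions) {
        StringBuilder sb = new StringBuilder(input);
        // Work from the last position to the first so that earlier
        // indices are not shifted by the characters already inserted.
        for (int i = positions.length - 1; i >= 0; i--) {
            sb.insert(positions[i], ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "abcde fghi jklmno pqr"
        System.out.println(insertSpaces("abcdefghijklmnopqr", 5, 9, 15));
    }
}
```

Inserting back to front is the design choice that keeps the index arithmetic trivial; going front to back would require adding an offset for every space already inserted.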
Yes, you can use regex groups to achieve that. Something like this:
final Pattern pattern = Pattern.compile("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})");
final Matcher matcher = pattern.matcher("abcdefghijklmnopqr");
if (matcher.matches()) {
    // group(0) is the whole match; the capturing groups start at 1
    String first = matcher.group(1);
    String second = matcher.group(2);
    String third = matcher.group(3);
    String fourth = matcher.group(4);
    return first + " " + second + " " + third + " " + fourth;
} else {
    throw new SomeException();
}
Note that the pattern should be a constant; I used a local variable here to make it easier to read.
Compared to substrings, which would also achieve the desired result, a regex also lets you validate the format of your input data. In the example above you check that the input is an 18-character string composed only of lowercase letters.
With a more interesting example, say a mix of letters and digits, you could check with the regex that each group contains the correct type of data.
You can also do a simpler version where you just replace:
"abcdefghijklmnopqr".replaceAll("([a-z]{5})([a-z]{4})([a-z]{6})([a-z]{3})", "$1 $2 $3 $4")
But you lose the benefit of checking, because if the string doesn't match the format it is simply left unchanged, and this is less efficient than substrings.
Here is an example solution using a StringBuilder, which would be more efficient if you don't care about checking:
final Set<Integer> breaks = Set.of(5, 9, 15);
final String str = "abcdefghijklmnopqr";
final StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
if (breaks.contains(i)) {
stringBuilder.append(' ');
}
stringBuilder.append(str.charAt(i));
}
return stringBuilder.toString();

Java - Why does string split for empty string give me a non empty array?

I want to split a String by a space. When I use an empty string, I expect to get an array of zero strings. Instead, I get an array containing a single empty string. Why?
public static void main(String [] args){
String x = "";
String [] xs = x.split(" ");
System.out.println("strings :" + xs.length);//prints 1 instead of 0.
}
The single element string array entry is in fact empty string. This makes sense, because the split on " " fails, and hence you just get back the input with which you started. As a general approach, you may consider that if splitting returns you a single element, then the split did not match anything, leaving you with the starting input string.
An interesting puzzle indeed:
> "".split(" ")
String[1] { "" }
> " ".split(" ")
String[0] { }
The question is, when you split the empty string, why does the result contain the empty string, and when you split a space, why does the result not contain anything? It seems inconsistent, but all is explained in the documentation.
The String.split(String) method "works as if by invoking the two-argument split method with the given expression and a limit argument of zero", so let's read the docs for String.split(String, int). The case of the empty string is answered by this part:
If the expression does not match any part of the input then the resulting array has just one element, namely this string.
The empty string has no part matching a space, so the output is an array containing one element, the input string, exactly as the docs say should happen.
The case of the string " " is answered by these two parts:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
The whole input string " " matches the splitting pattern, and the match has positive width, so the split initially yields an empty string on each side of the match. Nothing non-empty follows either of them, so both count as trailing empty strings, and because the limit parameter n = 0 they are discarded. Hence neither empty string ends up in the resulting array, so it's empty.
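The cases above are easy to check directly. The sketch below only reports array lengths; `SplitDemo` and `parts` are names invented for the illustration, and the two extra inputs (" a" and a limit of -1) are added here to show where the empty strings come from.

```java
class SplitDemo {
    // Number of elements produced by splitting on a single space
    // with the given limit argument.
    static int parts(String input, int limit) {
        return input.split(" ", limit).length;
    }

    public static void main(String[] args) {
        System.out.println(parts("", 0));    // 1: no match, so the input itself is returned
        System.out.println(parts(" ", 0));   // 0: only empty strings, all discarded as trailing
        System.out.println(parts(" a", 0));  // 2: the leading empty string is kept
        System.out.println(parts(" ", -1));  // 2: a negative limit keeps trailing empty strings
    }
}
```

Passing a negative limit is the usual way to see what the split produced before trailing empty strings were removed.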
It appears that since the String exists and cannot be split (there are no spaces), the entire String is simply placed into the first array position, so the array has one element. If you were to instead try
String x = " ";
String [] xs = x.split(" ");
System.out.println("strings :" + xs.length);//prints 0
it will give you the zero you are expecting.
See also: Java String split removed empty values

Empty Strings within a non empty String [duplicate]

I'm confused by this code:
public class StringReplaceWithEmptyString
{
public static void main(String[] args)
{
String s1 = "asdfgh";
System.out.println(s1);
s1 = s1.replace("", "1");
System.out.println(s1);
}
}
And the output is:
asdfgh
1a1s1d1f1g1h1
So my first thought was that every character in a String has an empty String "" on both sides. But if that were the case, there should be two '1's after 'a' in the second line of output (one for the end of 'a' and a second for the start of 's').
Then I checked whether a String is represented as a char[] in these links: In Java, is a String an array of chars? and String representation in Java. The answer I got was YES.
So I tried to assign an empty character '' to a char variable, but it gives me a compiler error:
Invalid character constant
I get the same compiler error when I try it in a char[]:
char[] c = {'','a','','s'}; // CTE
So I'm confused about three things.
How an empty String is represented by char[] ?
Why I'm getting that output for the above code?
How the String s1 is represented in char[] when it is initialized first time?
Sorry if I'm wrong at any part of my question.
Just adding some more explanation to Tim Biegeleisen's answer.
As of Java 8, the code of the replace method in the java.lang.String class is
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
Here you can clearly see that the replacement is carried out by a regex Pattern and Matcher. As a pattern, "" produces a zero-length match, and such a match exists at every position in the string: before the first character, between each pair of characters, and after the last one.
So, behind the scenes, your code is executed as follows:
Pattern.compile("".toString(), Pattern.LITERAL).matcher("asdfgh").replaceAll(Matcher.quoteReplacement("1".toString()));
The output then becomes
1a1s1d1f1g1h1
Going with Andy Turner's great comment, your call to String#replace() is actually implemented using String#replaceAll(). As such, there is a regex replacement happening here. The matches occur before the first character, in between each character in the string, and after the last character.
^|a|s|d|f|g|h|$
^ this and every pipe matches to empty string ""
The match you are making is a zero length match. In Java's regex implementation used in String.replaceAll(), this behaves as the example above shows, namely matching each inter-character position and the positions before the first and after the last characters.
Here is a reference which discusses zero length matches in more detail: http://www.regexguru.com/2008/04/watch-out-for-zero-length-matches/
A zero-width or zero-length match is a regular expression match that does not match any characters. It matches only a position in the string. E.g. the regex \b matches between the 1 and , in 1,2.
This is because it does a regex match of the pattern/replacement you pass to the replace().
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence. The replacement proceeds from the beginning of the string to the end, for example, replacing "aa" with "b" in the string "aaa" will result in "ba" rather than "ab".
Parameters:
target - The sequence of char values to be replaced
replacement - The replacement sequence of char values
Returns: The resulting string
Throws: NullPointerException if target or replacement is null.
Since: 1.5
Please read more at the link below ... (Also browse through the source code).
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/String.java#String.replace%28java.lang.CharSequence%2Cjava.lang.CharSequence%29
A regex such as "" matches at every position in a string where an empty string can be found. In this case that is the position before the first character, the position after every character, and therefore also the end of the string.
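To make the positions concrete, the following sketch lists where the empty pattern matches. It is an illustration only: `EmptyMatchDemo` and `emptyMatchPositions` are invented names, and the pattern is compiled with Pattern.LITERAL to mirror the Java 8 source quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Start index of every zero-length match of the empty pattern.
    static List<Integer> emptyMatchPositions(String input) {
        Matcher m = Pattern.compile("", Pattern.LITERAL).matcher(input);
        List<Integer> positions = new ArrayList<>();
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // One match per position: [0, 1, 2, 3, 4, 5, 6]
        System.out.println(emptyMatchPositions("asdfgh"));
        // Each of those seven positions receives a "1": 1a1s1d1f1g1h1
        System.out.println("asdfgh".replace("", "1"));
    }
}
```

There is exactly one match per position, which is why only a single '1' appears between 'a' and 's' rather than two.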

reading text file java

I'm trying to read a text file(.txt) in java. I need to eventually put the text I extract word by word in a binary tree's nodes . If for example, I have the text: "Hi, I'm doing a test!", I would like to split it into "Hi" "I" "m" "doing" "a" "test", basically skipping all punctuation and empty spaces and considering a word to be a sequence of contiguous alphabet letters. I am so far able to extract the words and put them in an array for testing. However, if I have a completely empty line in my .txt file, the code will consider it as a word and return an empty space. Also, punctuation at the end of a line works but if there's a comma for example and then text, I will get an empty space as well ! Here is what I tried so far:
public static void main(String[] args) throws Exception
{
FileReader file = new FileReader("File.txt");
BufferedReader reader = new BufferedReader(file);
String text = "";
String line = reader.readLine();
while (line != null)
{
text += line;
line = reader.readLine();
}
System.out.println(text);
String textnospaces=text.replaceAll("\\s+", " ");
System.out.println(textnospaces);
String [] tokens = textnospaces.split("[\\W+]");
for(int i=0;i<=tokens.length-1;i++)
{
tokens[i]=tokens[i].toLowerCase();
System.out.println(tokens[i]);
}
}
Using the following text:
I can't, come see you. Today my friend is hard
s
I get the following output:
i
can
t
(extra space between "t" and "come")
come
see
you
(extra space again)
today
my
friend
is
hards
Any help would be appreciated ! Thanks
Use the trim() method of String. From the documentation at http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim%28%29:
"Returns a copy of the string, with leading and trailing whitespace omitted.
If this String object represents an empty character sequence, or the first and last characters of character sequence represented by this String object both have codes greater than '\u0020' (the space character), then a reference to this String object is returned.
Otherwise, if there is no character with a code greater than '\u0020' in the string, then a new String object representing an empty string is created and returned.
Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'. A new String object is created, representing the substring of this string that begins with the character at index k and ends with the character at index m; that is, the result of this.substring(k, m+1).
This method may be used to trim whitespace (as defined above) from the beginning and end of a string.
Returns:
A copy of this string with leading and trailing white space removed, or this string if it has no leading or trailing white space."
If you really are just looking for each contiguous sequence of letters, you can accomplish this quite simply with regex matching.
String patternString1 = "([a-zA-Z]+)";
String text = "I can't, come see you. Today my friend is hard";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("found: " + matcher.group(1));
}
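For the split-based code in the question, the empty tokens most likely come from the pattern "[\\W+]". Inside square brackets the + is just another literal character in the set, so the pattern matches one separator character at a time, and a comma followed by a space leaves an empty string between the two matches. Writing the quantifier outside the class, as in "\\W+", treats the whole run as one separator. The sketch below compares the two; `WordSplitDemo` and its method names are invented for the illustration.

```java
class WordSplitDemo {
    // The pattern from the question: a character class, one character per match.
    static String[] splitPerCharacter(String text) {
        return text.toLowerCase().split("[\\W+]");
    }

    // One or more non-word characters treated as a single separator.
    static String[] splitPerRun(String text) {
        return text.toLowerCase().split("\\W+");
    }

    public static void main(String[] args) {
        String text = "I can't, come see you.";
        System.out.println(splitPerCharacter(text).length); // 7, including one empty token
        System.out.println(splitPerRun(text).length);       // 6
    }
}
```

Two caveats: if the text starts with a separator, "\\W+" still produces one leading empty token, and the "hards" in the question's output comes from `text += line`, which joins lines without any separator; appending a space after each line would keep "hard" and "s" apart.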

How to detect if a string input has more than one consecutive space?

For a class I have to make a Morse code program using a binary tree. The user is supposed to enter Morse code and the program will decode it and print out the result. The binary tree only holds A-Z, and I only need to read dashes, dots, and spaces. One space marks the end of a letter. Two or more spaces in a row mark the end of a word.
How do you detect whether the input string has consecutive spaces? Right now I have it programmed so that it detects 2 spaces (and then prints out a space), but I don't know how to make it recognize 3+ spaces.
This is how I'm reading the input, by the way:
String input = showInputDialog( "Enter Code", null);
character = input.charAt(i);
And this is how I have it detecting a space: if (character == ' ').
Can anyone help?
Well, you could do something like this. If the resulting array has more than one item, you know there was at least one instance of 2+ spaces.
String[] foo = "a b  c   d".split(" {2,}"); // split on runs of two or more spaces
This splits into "a b", "c", and "d".
You'd probably need more regex checks than just that, though, if you need to count how many runs of each length occur (e.g. how many 2-space runs, how many 3-space runs, etc.).
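If the length of each run matters, one possible approach is to let a Matcher walk over the runs instead of splitting. This is a sketch under the assumption that a run means two or more consecutive space characters; `SpaceRunDemo` and `spaceRunLengths` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpaceRunDemo {
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    // Length of every run of two or more consecutive spaces, in order.
    static List<Integer> spaceRunLengths(String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = SPACE_RUN.matcher(input);
        while (m.find()) {
            lengths.add(m.end() - m.start());
        }
        return lengths;
    }

    public static void main(String[] args) {
        // One run of 2 spaces and one run of 3 spaces: [2, 3]
        System.out.println(spaceRunLengths("a b  c   d"));
    }
}
```

An empty result means the input contains no consecutive spaces at all.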
Note I have made an assumption that you are retrieving the full morse code message in one go and not one character at a time
Focusing on this point:
"If there is one space that is the end of the letter. If there is 2 or more spaces in a row that is the end of the word."
Personally, I'd use the split() method on the String class. This will split up a String into a String[] and then you can do some checks on the individual Strings in the array. Splitting on a space character like this will give you a couple of behavioural advantages:
Any strings that represent characters will have no trailing or leading spaces on them
Any sequences of multiple spaces will result in empty strings in the returned String[].
For example, calling split(" ") on the string "A B  C" (two spaces between B and C) would give you a String[] containing {"A", "B", "", "C"}
Using this, I would first check if the empty string appeared at all. If this was the case, it implies that there were at least 2 space characters next to each other in the input morse code message. Then you can just ignore any empty strings that occur after the first one and it will cater for any number of sequential empty strings.
Without wanting to complete your assignment for you, here is some sample code:
public String decode(final String morseCode) {
final StringBuilder decodedMessage = new StringBuilder();
final String[] splitMorseCode = morseCode.split(" ");
for (final String morseCharacter : splitMorseCode) {
if( "".equals(morseCharacter) ) {
/* We now know we had at least 2 spaces in sequence
* So we check to see if we already added a space to separate the
* resulting decoded words. If not, then we add one. */
if ( !decodedMessage.toString().endsWith(" ") ) {
decodedMessage.append(" ");
}
continue;
}
//Some code that decodes your morse code character.
}
return decodedMessage.toString();
}
I also wrote a quick test. In my example I made "--" convert to "M". Splitting the decodedMessage on the space character was a way of counting the individual words that had been decoded.
@Test
public void thatDecoderCanDecodeMultipleWordsSeparatedByMultipleSpaces() {
    // Letters are separated by one space, words by two spaces
    final String decodedMessage = this.decoder.decode("-- --  -- --  -- --  -- --  -- --  -- --  -- --");
    assertThat(decodedMessage.split(" ").length, is(7));
    assertThat(decodedMessage, is("MM MM MM MM MM MM MM"));
}
Of course, if this is still not making sense, then reading the APIs always helps
To detect if a String has more than one consecutive space:
if (str.matches(".* {2,}.*"))
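This will help:
public class StringTester {
    public static void main(String args[]) {
        String s = "Hello  world";
        int run = 0; // length of the current run of spaces
        boolean found = false;
        for (char c : s.toCharArray()) {
            // extend the run on a space, reset it on any other character
            run = (c == ' ') ? run + 1 : 0;
            if (run >= 2)
                found = true;
        }
        if (found)
            System.out.println("I got 2 or more consecutive spaces");
    }
}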
