reading text file java - java

I'm trying to read a text file (.txt) in Java. Eventually I need to put the text I extract, word by word, into the nodes of a binary tree. For example, given the text "Hi, I'm doing a test!", I would like to split it into "Hi" "I" "m" "doing" "a" "test", skipping all punctuation and whitespace and treating a word as a sequence of contiguous alphabetic letters. So far I can extract the words and put them in an array for testing. However, if my .txt file contains a completely empty line, the code treats it as a word and returns an empty string. Punctuation at the end of a line works, but if there is, for example, a comma followed by more text, I also get an empty string. Here is what I have tried so far:
public static void main(String[] args) throws Exception
{
FileReader file = new FileReader("File.txt");
BufferedReader reader = new BufferedReader(file);
String text = "";
String line = reader.readLine();
while (line != null)
{
text += line;
line = reader.readLine();
}
System.out.println(text);
String textnospaces=text.replaceAll("\\s+", " ");
System.out.println(textnospaces);
String [] tokens = textnospaces.split("[\\W+]");
for(int i=0;i<=tokens.length-1;i++)
{
tokens[i]=tokens[i].toLowerCase();
System.out.println(tokens[i]);
}
}
Using the following text:
I can't, come see you. Today my friend is hard
s
I get the following output:
i
can
t
(extra space between "t" and "come")
come
see
you
(extra space again)
today
my
friend
is
hards
Any help would be appreciated ! Thanks

Use the trim() method of String. From the documentation (http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim%28%29):
"Returns a copy of the string, with leading and trailing whitespace omitted.
If this String object represents an empty character sequence, or the first and last characters of character sequence represented by this String object both have codes greater than '\u0020' (the space character), then a reference to this String object is returned.
Otherwise, if there is no character with a code greater than '\u0020' in the string, then a new String object representing an empty string is created and returned.
Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'. A new String object is created, representing the substring of this string that begins with the character at index k and ends with the character at index m-that is, the result of this.substring(k, m+1).
This method may be used to trim whitespace (as defined above) from the beginning and end of a string.
Returns:
A copy of this string with leading and trailing white space removed, or this string if it has no leading or trailing white space."

If you really are just looking for each contiguous sequence of letters, you can accomplish this quite simply with regex matching (Pattern and Matcher are in java.util.regex).
String patternString1 = "([a-zA-Z]+)";
String text = "I can't, come see you. Today my friend is hard";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("found: " + matcher.group(1));
}
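As a rough sketch of how the matcher approach could feed the word-by-word processing described in the question, the snippet below collects lowercased words into a list. The class and method names (`WordExtractor`, `extractWords`) are made up for illustration. It assumes the text keeps its line separators (for example by appending "\n" after each `readLine()` result), so that the last word of one line is not glued to the first word of the next, as happened with "hards" in the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // A word is one or more contiguous ASCII letters, as in the answer above.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Hypothetical helper: returns every word in the text, lowercased.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group().toLowerCase());
        }
        return words;
    }

    public static void main(String[] args) {
        // The "\n" stands in for the line break between the two lines of the sample file.
        String text = "I can't, come see you. Today my friend is hard\ns";
        System.out.println(extractWords(text));
    }
}
```

Because only matches are collected, empty lines and runs of punctuation never produce empty tokens, which is the difference from the split-based attempt.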

Related

How to Identify String Literals from Arrays with Regex on Java?

If I have this array of Strings named tokenArray. Its contents are the following
[num1] [;] ["] [This] [is] [a] [\"] [string] [literal] [\"] [.] [?] ["]
Note: non escaped and escaped double quotation are as it is.
Question:
How do I identify the values between the two double quotation marks in the array as a single string literal? I'm using string concatenation to collect the lexemes found so far and push the result onto a stack when a match is found. When I identified single-line comments, the start and end markers were // and tHiS_iS_tHe_EnD_Of_NeWlInE. How do I apply the same idea, using a regex and the two double quotation marks shown above, inside the loop in the code below? TIA.
Background:
The samples I am finding all use a single String declaration, whereas mine is an array. I can't quite grasp how it works with an array of strings.
BTW, I'm making a string analyzer that scans a block of code and outputs the lexemes of a particular language. Without regex, I have already identified lexemes such as single-line and block comments, the delimiters, and some keywords of the language. But I want to try regex for string literals, which I have not detected yet. Doing the detection with if/else statements was time-consuming and confusing, but I got through it in the end.
Below is the code I am using to identifying single line comments in my array.
The for loop is my entire loop for reading my arrays and assigning newly detected lexeme to a stack.
for(int ctr=0;ctr<removedNullsStackSize.length;ctr++) {
if(removedNullsStackSize[ctr].equals("//")) {
do {
tempString = tempString + " " + removedNullsStackSize[ctr] ;
ctr++;
if(ctr>=removedNullsStackSize.length-1){
removedNullsStackSize[ctr]="tHiS_iS_tHe_EnD_Of_NeWlInE";
}
}
while(!removedNullsStackSize[ctr].equals("tHiS_iS_tHe_EnD_Of_NeWlInE")); // compare string contents, not references
myQCommentsTokenized.add(tempString);
tempString="";
}
}
What the code above does is concatenate the array elements that follow once it detects //, and it keeps concatenating until it reaches the end-of-line marker. When the marker is detected, it saves tempString to the stack as the newly found lexeme.
My Pattern is.
//Regex for identifying string literals
Pattern strRegex=Pattern.compile("\".*\"");
//Loop your array here to read code
//str is the temporary location of all the codes you have
//In mine, I have it inside a text area so I just typecasted it to string and start comparing there
//begins matching for the string literals described by strRegex
Matcher m = strRegex.matcher(str) ;
After reading the code, it will then have the lexemes for the string literal in the code that was read.
while (m.find()) {
String forReadStr=m.group();
//If the end of the token is a double quote, Do this
//in this loop, you can then declare anything for the lexeme you detected and do anything with it
if(forReadStr.endsWith("\"")){
System.out.println(m.group()+"\n\t -> \t This is a String Literal\n");
}
}
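One thing to watch with "\".*\"" is that .* is greedy, so on a line with several literals it matches from the first quote to the last one. The sketch below is one possible alternative, not the original poster's code: it treats a backslash plus the following character as a unit, so escaped quotes stay inside the literal. It assumes the token array is first joined back into one string (for example with String.join(" ", tokenArray)); the names `StringLiteralFinder` and `findLiterals` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringLiteralFinder {
    // Opening quote, then any number of "escaped character" or
    // "character that is neither a quote nor a backslash", then closing quote.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");

    // Hypothetical helper: returns every double-quoted literal found in the code.
    static List<String> findLiterals(String code) {
        List<String> literals = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(code);
        while (m.find()) {
            literals.add(m.group());
        }
        return literals;
    }

    public static void main(String[] args) {
        // Tokens as in the question, joined with single spaces.
        String[] tokenArray = {"num1", ";", "\"", "This", "is", "a", "\\\"",
                "string", "literal", "\\\"", ".", "?", "\""};
        System.out.println(findLiterals(String.join(" ", tokenArray)));
    }
}
```

Joining the tokens with a space changes the spacing inside the literal, so if the exact original text matters, running the pattern over the untokenized source is probably the safer choice.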

How to remove all characters before a specific character in Java?

I have a string whose value comes from an HTML form, so it arrives as part of a URL. I want to remove all the characters before a specific character, which is =, and I also want to remove that character itself. I only want to keep the value that comes after the =, because I need to read that value from the variable.
EDIT: I need to remove the = too, since I'm trying to get the characters/value in the string after it.
You can use .substring():
String s = "the text=text";
String s1 = s.substring(s.indexOf("=") + 1).trim();
s1 now contains everything after = in the original string.
.trim() returns a copy of the string with the whitespace before its first non-whitespace character (leading spaces) and after its last one (trailing spaces) removed. Strings are immutable, so trim() never modifies s1 itself; its result has to be assigned or used.
There are already many answers, but here is a regex example:
String test = "eo21jüdjüqw=realString";
test = test.replaceAll(".+=", "");
System.out.println(test);
// prints realString
Explanation:
. matches any character (except for line terminators)
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
= matches the character = literally (case sensitive)
Because .+ is greedy, everything up to and including the last = in the string is removed.
This is also a shady copy paste from https://regex101.com/ where you can try regex out.
You can split the string at the = sign into an array and take the second element, which holds everything after the =.
For example:
String CurrentString = "Fruit = they taste good";
String[] separated = CurrentString.split("=");
// separated[0] contains "Fruit " (with a trailing space)
// separated[1] contains " they taste good" (with a leading space); call trim() to drop it
separated[1] then contains everything after the = in the original string, provided the value itself contains no further = (otherwise use CurrentString.split("=", 2)).
I know this is asked about Java but this seems to also be the first search result for Kotlin so you should know that Kotlin has the String.substringAfter(delimiter: String, missingDelimiterValue: String = this) extension for this case.
Its implementation is:
val index = indexOf(delimiter)
return if (index == -1)
missingDelimiterValue
else
substring(index + delimiter.length, length)
Maybe locate the first occurrence of the character in the URL String. For Example:
String URL = "http://test.net/demo_form.asp?name1=stringTest";
int index = URL.indexOf("=");
Then, split the String based on an index
String Result = URL.substring(index+1); //index+1 to skip =
String Result now contains the value: stringTest
If you use the Apache Commons Lang3 library, you can also use the substringAfter method of the StringUtils utility class.
Official documentation is here.
Examples:
String value = StringUtils.substringAfter("key=value", "=");
// in this case where a space is in the value (e.g. read from a file instead of a query params)
String value = StringUtils.trimToEmpty(StringUtils.substringAfter("key = value", "=")); // = "value"
It handles the case where your value contains the '=' character, because it takes the first occurrence of the separator.
If your keys can also contain the '=' character it will not work (but neither will the other methods); in URL query parameters, such a character should be escaped anyway.
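The indexOf/substring answers above quietly return the whole string when there is no = at all, because indexOf returns -1 and substring(0) is the full string. A minimal sketch of a plain-Java helper that makes that case explicit is below; the names `AfterDelimiter` and `valueAfter`, and the choice to return an empty string when the delimiter is missing, are assumptions made for illustration.

```java
class AfterDelimiter {
    // Hypothetical helper: returns the trimmed text after the first occurrence
    // of the delimiter, or an empty string if the delimiter is absent.
    static String valueAfter(String s, char delimiter) {
        int index = s.indexOf(delimiter);
        if (index < 0) {
            return ""; // no delimiter: nothing comes "after" it
        }
        return s.substring(index + 1).trim();
    }

    public static void main(String[] args) {
        System.out.println(valueAfter("http://test.net/demo_form.asp?name1=stringTest", '='));
        System.out.println(valueAfter("Fruit = they taste good", '='));
    }
}
```

Because it splits at the first =, a value that itself contains = (such as "a=b=c") is kept intact as "b=c".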

Java match whole word in String

I have an ArrayList<String> which I iterate through to find the correct index given a String. Basically, given a String, the program should search through the list and find the index where the whole word matches. For example:
ArrayList<String> foo = new ArrayList<String>();
foo.add("AAAB_11232016.txt");
foo.add("BBB_12252016.txt");
foo.add("AAA_09212017.txt");
So if I give the String AAA, I should get back index 2 (the last one). So I can't use the contains() method as that would give me back index 0.
I tried with this code:
String str = "AAA";
String pattern = "\\b" + str + "\\b";
Pattern p = Pattern.compile(pattern);
for(int i = 0; i < foo.size(); i++) {
// Check each entry of list to find the correct value
Matcher match = p.matcher(foo.get(i));
if(match.find() == true) {
return i;
}
}
Unfortunately, this code never reaches the if statement inside the loop. I'm not sure what I'm doing wrong.
Note: This should also work if I searched for AAA_0921, the full name AAA_09212017.txt, or any part of the String that is unique to it.
Since an underscore is itself a word character, a word boundary (\b) does not match between a word character and an underscore, so you need
String pattern = "(?<=_|\\b)" + str + "(?=_|\\b)";
Here, (?<=_|\b) positive lookbehind requires a word boundary or an underscore to appear before the str, and the (?=_|\b) positive lookahead requires an underscore or a word boundary to appear right after the str.
See this regex demo.
If your word may have special chars inside, you might want to use a more straight-forward word boundary:
"(?<![^\\W_])" + Pattern.quote(str) + "(?![^\\W_])"
Here, the negative lookbehind (?<![^\\W_]) fails the match if there is a word character except an underscore ([^...] is a negated character class that matches any character other than the characters, ranges, etc. defined inside this class, thus, it matches all characters other than a non-word char \W and a _), and the (?![^\W_]) negative lookahead fails the match if there is a word char except the underscore after the str.
Note that the second example has a quoted search string, so that even AA.A_str.txt could be matched well with AA.A.
See another regex demo
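To show how the second pattern might be plugged into the loop from the question, here is a small sketch. The wrapper names (`WholeWordIndex`, `indexOfWholeWord`) are invented for illustration; the pattern itself is the one given above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WholeWordIndex {
    // Returns the index of the first entry that contains `word` with no letter
    // or digit directly before or after it (an underscore counts as a separator),
    // or -1 if no entry matches.
    static int indexOfWholeWord(List<String> entries, String word) {
        Pattern p = Pattern.compile("(?<![^\\W_])" + Pattern.quote(word) + "(?![^\\W_])");
        for (int i = 0; i < entries.size(); i++) {
            if (p.matcher(entries.get(i)).find()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> foo = Arrays.asList(
                "AAAB_11232016.txt", "BBB_12252016.txt", "AAA_09212017.txt");
        System.out.println(indexOfWholeWord(foo, "AAA"));              // 2
        System.out.println(indexOfWholeWord(foo, "AAA_09212017.txt")); // 2
    }
}
```

Note that a search string ending in the middle of a run of letters or digits, such as AAA_0921, is rejected by the trailing lookahead (the next character is a digit), so that case from the question would need a looser end condition.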

split a string in java into equal length substrings while maintaining word boundaries

How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
// Give the list the right capacity to start with. You could use an array
// instead if you wanted.
List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
for (int start = 0; start < text.length(); start += size) {
ret.add(text.substring(start, Math.min(text.length(), start + size)));
}
return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific I need the string splitting algorithm to take into account the word boundary provided by spaces and not solely rely on character length while splitting the string although that also needs to be taken into account but more like a max range of characters rather than a hardcoded length of characters.
If I understand your problem correctly, then this code should do what you need (but it assumes that maxLenght is equal to or greater than the length of the longest word):
String data = "Hello there, my name is not importnant right now."
+ " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
A short (or not so short) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we have to write it as "\\d", escaping the \ in the string literal as well.)
\G - an anchor representing the end of the previous match or, if there is no match yet (when the search has just started), the beginning of the string (the same as ^)
\s* - zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - let's split this into parts (at runtime maxLenght holds some numeric value such as 10, so the regex engine sees .{1,10}):
. represents any character (by default, any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it can now match those too; you may drop this flag if you want to split each line separately, since its start will be printed on a new line anyway)
{1,10} - a quantifier that lets the preceding element appear 1 to 10 times (by default it tries to match the maximal number of repetitions)
.{1,10} - so, based on the above, this simply means "1 to 10 of any characters"
( ) - parentheses create groups, structures that let us capture specific parts of the match (here the parentheses come after \\s* because we want to use only the part after the whitespace)
(?=\\s|$) - a look-ahead that makes sure the text matched by .{1,10} is followed by:
a whitespace character (\\s)
OR (written as |)
the end of the string ($).
So thanks to .{1,10} we can match up to 10 characters, and with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
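For comparison with the splitEqually method from the question, the same regex can be wrapped in a method that returns a list. This is only a sketch under the assumption stated in the answer (no word longer than the limit); the names `WordBoundarySplitter` and `splitAtWordBoundaries` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundarySplitter {
    // Splits text into chunks of at most maxLength characters, breaking only
    // where a chunk is followed by whitespace or the end of the string.
    // Assumes no single word is longer than maxLength.
    static List<String> splitAtWordBoundaries(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> parts = new ArrayList<>();
        while (m.find()) {
            parts.add(m.group(1)); // group 1 excludes the skipped leading whitespace
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitAtWordBoundaries("hello world", 7));
    }
}
```

The chunks come back without the separating space ("hello" rather than "hello "), which differs slightly from the output sketched in the question.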
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
StringBuilder justifiedText = new StringBuilder();
StringBuilder justifiedLine = new StringBuilder();
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++) {
justifiedLine.append(words[i]).append(" ");
if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
justifiedLine.deleteCharAt(justifiedLine.length() - 1);
justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
justifiedLine = new StringBuilder();
}
}
return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version which just stops processing when it finds supercalifragilisticexpialidocious).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

How to detect if a string input has more than one consecutive space?

For a class I have to make a Morse code program using a binary tree. The user is supposed to enter Morse code, and the program will decode it and print out the result. The binary tree only holds A-Z, and I only need to read dashes, dots, and spaces. One space marks the end of a letter; 2 or more spaces in a row mark the end of a word.
How do you detect whether the string input has consecutive spaces? Right now it detects 2 spaces (and then prints out a space), but I don't know how to make it recognize 3+ spaces.
This is how I'm reading the input btw:
String input = showInputDialog( "Enter Code", null);
character = input.charAt(i);
And this is how I have it detecting a space: if (character == ' ').
Can anyone help?
Well, you could do something like this: if the resulting array has more than one item, you had at least one run of 2+ spaces.
String[] foo = "a b  c   d".split(" {2,}");
This splits into "a b", "c", and "d".
You'd probably need more regex checks than just that, though, if you need to detect how many runs there are of each length (e.g. how many runs of 2 spaces, how many of 3 spaces, etc.).
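If the length of each run of spaces matters, one possible way to get it is to let a Matcher find every run and record its length. This is a sketch with invented names (`SpaceRuns`, `spaceRunLengths`), not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpaceRuns {
    // Hypothetical helper: returns the length of every run of consecutive
    // spaces in the input, in order of appearance.
    static List<Integer> spaceRunLengths(String s) {
        List<Integer> runs = new ArrayList<>();
        Matcher m = Pattern.compile(" +").matcher(s);
        while (m.find()) {
            runs.add(m.end() - m.start());
        }
        return runs;
    }

    public static void main(String[] args) {
        // One space, then two, then three.
        System.out.println(spaceRunLengths("a b  c   d"));
    }
}
```

Any entry of 2 or more in the result then marks a word boundary under the rules from the question.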
Note I have made an assumption that you are retrieving the full morse code message in one go and not one character at a time
Focusing on this point:
"If there is one space that is the end of the letter. If there is 2 or more spaces in a row that is the end of the word."
Personally, I'd use the split() method on the String class. This will split up a String into a String[] and then you can do some checks on the individual Strings in the array. Splitting on a space character like this will give you a couple of behavioural advantages:
Any strings that represent characters will have no trailing or leading spaces on them
Any sequences of multiple spaces will result in empty strings in the returned String[].
For example, calling split(" ") on the string "A B  C" (two spaces before the C) would give you a String[] containing {"A", "B", "", "C"}
Using this, I would first check if the empty string appeared at all. If this was the case, it implies that there were at least 2 space characters next to each other in the input morse code message. Then you can just ignore any empty strings that occur after the first one and it will cater for any number of sequential empty strings.
Without wanting to complete your assignment for you, here is some sample code:
public String decode(final String morseCode) {
final StringBuilder decodedMessage = new StringBuilder();
final String[] splitMorseCode = morseCode.split(" ");
for (final String morseCharacter : splitMorseCode) {
if( "".equals(morseCharacter) ) {
/* We now know we had at least 2 spaces in sequence
* So we check to see if we already added a space to separate the
* resulting decoded words. If not, then we add one. */
if ( !decodedMessage.toString().endsWith(" ") ) {
decodedMessage.append(" ");
}
continue;
}
//Some code that decodes your morse code character.
}
return decodedMessage.toString();
}
I also wrote a quick test. In my example I made "--" convert to "M". Splitting the decodedMessage on the space character was a way of counting the individual words that had been decoded.
@Test
public void thatDecoderCanDecodeMultipleWordsSeparatedByMultipleSpaces() {
// Letters are separated by one space, words by two or more spaces
final String decodedMessage = this.decoder.decode("-- --  -- --   -- --  -- --   -- --  -- --   -- --");
assertThat(decodedMessage.split(" ").length, is(7));
assertThat(decodedMessage, is("MM MM MM MM MM MM MM"));
}
Of course, if this is still not making sense, then reading the APIs always helps
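As an alternative sketch of the same idea, the input can be split in two stages: first on runs of two or more spaces to get the words, then on single spaces to get the letters. The lookup table here is a tiny hypothetical Map standing in for the binary tree from the assignment, and the names `MorseSketch` and `decode` are made up; unknown codes are rendered as '?'.

```java
import java.util.Map;

class MorseSketch {
    // Hypothetical three-entry table for illustration only; a real decoder
    // would cover A-Z (the assignment uses a binary tree for this lookup).
    static final Map<String, Character> TABLE = Map.of("...", 'S', "---", 'O', "--", 'M');

    static String decode(String morse) {
        String trimmed = morse.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        String[] words = trimmed.split(" {2,}"); // 2+ spaces end a word
        for (int w = 0; w < words.length; w++) {
            if (w > 0) {
                out.append(' ');
            }
            for (String letter : words[w].split(" ")) { // 1 space ends a letter
                Character c = TABLE.get(letter);
                out.append(c != null ? c : '?');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("... --- ...   -- --"));
    }
}
```

Splitting on " {2,}" treats two, three, or more spaces the same way, so no separate check for 3+ spaces is needed.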
To detect whether a String contains two or more consecutive spaces:
if (str.matches(".* {2,}.*"))
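This will help:
public class StringTester {
public static void main(String[] args) {
String s = "Hello  World"; // two consecutive spaces
int run = 0; // length of the current run of spaces
int longest = 0; // longest run of spaces seen so far
for (char c : s.toCharArray()) {
run = (c == ' ') ? run + 1 : 0; // a non-space character resets the run
longest = Math.max(longest, run);
}
if (longest >= 2)
System.out.println("I got 2 or more consecutive spaces");
}
}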
