reading from text file to string array - java

I can already search for a string in my text file, but I also want to sort the data in this ArrayList and implement an algorithm on it. Is it possible to read from a text file and store the values (Strings) from the file in a String[] array?
Also, is it possible to separate the Strings? So instead of my array having:
[Alice was beginning to get very tired of sitting by her sister on the, bank, and of having nothing to do:]
is it possible to get an array such as:
["Alice", "was", "beginning", "to", "get", ...]
public static void main(String[]args) throws IOException
{
Scanner scan = new Scanner(System.in);
String stringSearch = scan.nextLine();
BufferedReader reader = new BufferedReader(new FileReader("File1.txt"));
List<String> words = new ArrayList<String>();
String line;
while ((line = reader.readLine()) != null) {
words.add(line);
}
for(String sLine : words)
{
if (sLine.contains(stringSearch))
{
int index = words.indexOf(sLine);
System.out.println("Got a match at line " + index);
}
}
//Collections.sort(words);
//for (String str: words)
// System.out.println(str);
int size = words.size();
System.out.println("There are " + size + " Lines of text in this text file.");
reader.close();
System.out.println(words);
}

To split a line into an array of words, use this:
String[] words = sentence.split("[^\\w']+");
The regex [^\w']+ matches one or more characters that are neither a word character nor an apostrophe.
This will capture words with embedded apostrophes like "can't" and skip over all punctuation.
Edit:
A comment has raised the edge case of parsing a quoted word such as 'this' as this.
Here's the solution for that: first remove the wrapping quotes, then split:
String[] words = input.replaceAll("(^|[^\\w'])'([\\w']+)'([^\\w']|$)", "$1$2$3").split("[^\\w']+");
Here's some test code with edge and corner cases:
public static void main(String[] args) throws Exception {
String input = "'I', ie \"me\", can't extract 'can't' or 'can't'";
String[] words = input.replaceAll("(^|[^\\w'])'([\\w']+)'([^\\w']|$)", "$1$2$3").split("[^\\w']+");
System.out.println(Arrays.toString(words));
}
Output:
[I, ie, me, can't, extract, can't, or, can't]

Also is it possible to separate the Strings?
Yes. You can split a string on whitespace (and on some common punctuation) like this:
String[] strSplit;
String str = "This is test for split";
strSplit = str.split("[\\s,;!?\"]+");
See String API
Moreover, you can also read a text file word by word:
// try-with-resources closes the Scanner, and the loop is skipped if the file is missing
try (Scanner scan = new Scanner(new BufferedReader(new FileReader("Your File Path")))) {
    while (scan.hasNext()) {
        System.out.println(scan.next());
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
See Scanner API
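To tie the two answers back to the original question, here is a minimal sketch that reads lines, splits each one into words with the regex from the first answer, and returns them as a String[] that can then be sorted. The class name WordArrayDemo and the method readWords are made up for illustration, and a StringReader stands in for the FileReader on File1.txt so the example runs without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordArrayDemo {
    // Reads every line, splits it into words, and returns the words as an array.
    static String[] readWords(BufferedReader reader) throws IOException {
        List<String> words = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.split("[^\\w']+")) {
                // A line that starts with punctuation yields a leading empty string.
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real code this would be new FileReader("File1.txt").
        String text = "Alice was beginning to get very tired\nof sitting by her sister on the bank,";
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String[] words = readWords(reader);
            System.out.println(Arrays.toString(words));
            // Sort in place, ignoring case, so "Alice" is not ordered by its capital letter.
            Arrays.sort(words, String.CASE_INSENSITIVE_ORDER);
            System.out.println(Arrays.toString(words));
        }
    }
}
```

Collecting into a List first avoids having to know the number of words up front; the conversion to String[] happens once at the end.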

Related

How to break a file into tokens based on regex using Java

I have a file in the following format. Records are separated by newlines, but some records contain line feeds themselves, as shown below. I need to get each record and process it separately. The file could be a few MB in size.
<?aaaaa>
<?bbbb
bb>
<?cccccc>
I have the code:
FileInputStream fs = new FileInputStream(FILE_PATH_NAME);
Scanner scanner = new Scanner(fs);
scanner.useDelimiter(Pattern.compile("<\\?"));
while (scanner.hasNext()) {
String line = scanner.next();
System.out.println(line);
}
scanner.close();
But the result I got has the beginning <? removed:
aaaaa>
bbbb
bb>
cccccc>
I know the Scanner consumes any input that matches the delimiter pattern. All I can think of is to add the delimiter pattern back to each record manually.
Is there a way to NOT have the delimiter pattern removed?
Break on a newline only when preceded by a ">" char:
scanner.useDelimiter("(?<=>)\\R"); // Note you can pass a string directly
\R matches any line break sequence (\n, \r\n, and so on), so it is system independent
(?<=>) is a look behind that asserts (without consuming) that the previous char is a >
Plus it's cool because <=> looks like Darth Vader's TIE fighter.
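As a rough illustration of the lookbehind delimiter, the sketch below feeds the sample records to a Scanner from a String instead of a file. The class name RecordSplitDemo and the method records are hypothetical; the delimiter is the one from the answer and needs Java 8 or later for \R.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class RecordSplitDemo {
    // Splits the input into records, breaking only on line breaks that follow '>'.
    static List<String> records(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // The lookbehind only asserts the '>', so it stays part of the record.
            scanner.useDelimiter("(?<=>)\\R");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "<?aaaaa>\n<?bbbb\nbb>\n<?cccccc>\n";
        for (String record : records(input)) {
            // Show embedded line breaks as \n so each record prints on one line.
            System.out.println(record.replace("\n", "\\n"));
        }
    }
}
```

Because only the line break is consumed, both the leading <? and the trailing > survive in every record, including the one that spans two lines.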
I'm assuming you want to ignore the newline character '\n' everywhere.
I would read the whole file into a String and then remove all of the '\n's in the String. The part of the code this question is about looks like this:
String fileString = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
fileString = fileString.replace("\n", "");
Scanner scanner = new Scanner(fileString);
... //your code
Feel free to ask any further questions you might have!
Here is one way of doing it by using a StringBuilder:
public static void main(String[] args) throws FileNotFoundException {
Scanner in = new Scanner(new File("C:\\test.txt"));
StringBuilder builder = new StringBuilder();
String input = null;
while (in.hasNextLine() && null != (input = in.nextLine())) {
for (int x = 0; x < input.length(); x++) {
builder.append(input.charAt(x));
if (input.charAt(x) == '>') {
System.out.println(builder.toString());
builder = new StringBuilder();
}
}
}
in.close();
}
Input:
<?aaaaa>
<?bbbb
bb>
<?cccccc>
Output (nextLine() strips the line break, so the two halves of the second record are joined directly):
<?aaaaa>
<?bbbbbb>
<?cccccc>

Reading letters and avoiding numbers and symbols

I am given a .txt file which has a bunch of words, here is a sample of what it looks like:
Legs
m%cks
animals
s3nt!m4nts
I need to write code that reads this .txt file and puts the words without numbers or symbols into an array. So basically I have to put Legs and animals into an array, and just print out the other two words.
public class Readwords {
public static void main(String[] args) {
String[] array = new String[10];
}
}
How do I get the program to read letters only and ignore the numbers and symbols?
You can use a regex to find the numbers and symbols, and then replace them.
1) Read the whole .txt file into a string.
2) Use the replaceAll function to remove the unwanted characters.
String str = /* your text */;
str = str.replaceAll(/* your regex */, "");
You can try this:
List<String> array = new ArrayList<String>();
try (BufferedReader file = new BufferedReader(new FileReader("yourfile.txt"))) {
    String line;
    while ((line = file.readLine()) != null) {
        if (line.matches("[a-zA-Z]+")) {
            array.add(line); // letters only: keep the word
        } else {
            System.out.println(line); // contains digits or symbols: just print it
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
String[] result = array.toArray(new String[array.size()]);
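The filtering idea can be tried without a file. This is only a sketch: the class name LettersOnlyDemo and the method lettersOnly are invented for the example, and the sample words from the question are passed in as a list rather than read from disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LettersOnlyDemo {
    // Returns the entries made up entirely of ASCII letters; prints the rest.
    static String[] lettersOnly(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // matches() must match the whole string, so one digit or symbol rejects the word.
            if (line.matches("[a-zA-Z]+")) {
                kept.add(line);
            } else {
                System.out.println("Rejected: " + line);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("Legs", "m%cks", "animals", "s3nt!m4nts");
        System.out.println(Arrays.toString(lettersOnly(sample)));
    }
}
```

Note that this keeps or rejects whole words, which is what the question asks for; replaceAll would instead strip the unwanted characters and keep what is left of the word.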

Retrieving part of a string using a delimiter

Okay, so I am creating an application, but I'm not sure how to get certain parts of a string. I have read in a file like this:
*tp*|21394398437984|163600
*2*|AAA|1234567894561236|STOP|20140527|Success||Automated|DSPRN1234567
*2*|AAA|1234567894561237|STOP|20140527|Success||Automated|DPSRN1234568
*3*|2
I need to read the lines beginning with *2*, so I did:
s = new Scanner(new BufferedReader(new FileReader("example.dat")));
while (s.hasNext()) {
String str1 = s.nextLine ();
if(str1.startsWith("*2*")) {
System.out.print(str1);
}
}
This reads the whole line, and I'm fine with that. Now my issue is that I need to extract the 2nd field (numbers), the 4th (numbers), the 5th (Success) and the 7th (DPSRN).
I was thinking about using | as the delimiter, but I'm not sure where to go from there. Any help would be great.
You should use String.split("\\|"). The | has to be escaped because split takes a regex. It will give you an array, String[].
Try the following:
String test="*2*|AAA|1234567894561236|STOP|20140527|Success||Automated|DSPRN1234567";
String tok[]=test.split("\\|");
for(String s:tok){
System.out.println(s);
}
Output (the blank line is the empty field between the two consecutive | characters):
*2*
AAA
1234567894561236
STOP
20140527
Success

Automated
DSPRN1234567
What you require will be at tok[2], tok[4], tok[5] and tok[8].
Just split the line you found on the delimiter. That returns an array of String elements from which you can retrieve your values by index:
s = new Scanner(new BufferedReader(new FileReader("example.dat")));
String searchLine = "";
while (s.hasNextLine()) {
    searchLine = s.nextLine();
    if (searchLine.startsWith("*2*")) {
        break;
    }
}
String[] strs = searchLine.split("\\|"); // | is a regex metacharacter, so escape it
String secondArgument = strs[2];
String fourthArgument = strs[4];
String fifthArgument = strs[5];
String seventhArgument = strs[8]; // index 6 is the empty field between "||"
System.out.println(secondArgument);
System.out.println(fourthArgument);
System.out.println(fifthArgument);
System.out.println(seventhArgument);
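The escaping matters because split takes a regular expression. The sketch below, with a made-up class name PipeSplitDemo and helper fields, splits one of the sample lines and also shows what an unescaped "|" does. The -1 limit is an extra choice here so that trailing empty fields are kept as well.

```java
import java.util.Arrays;

class PipeSplitDemo {
    // Splits a record on literal '|' characters, keeping empty fields.
    static String[] fields(String line) {
        // "\\|" escapes the pipe; the -1 limit keeps trailing empty fields too.
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String line = "*2*|AAA|1234567894561236|STOP|20140527|Success||Automated|DSPRN1234567";
        String[] f = fields(line);
        System.out.println(f.length + " fields");
        System.out.println(f[2] + " " + f[4] + " " + f[5] + " " + f[8]);
        // Unescaped, "|" is an empty alternation that matches between every character,
        // so the string is cut into single characters instead of two fields.
        System.out.println(Arrays.toString("a|b".split("|")));
    }
}
```

Pattern.quote("|") is an alternative to writing the backslashes by hand when the delimiter comes from a variable.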

Identifying each word in a file

I am importing a large list of words and need to write code that will recognize each word in the file. I am using a delimiter to separate the words, but I am getting a (suppressed) warning stating that the values of lineNumber and delimiter are not used. What do I need to do to get the program to read this file and separate each word within it?
public class ASCIIPrime {
public final static String LOC = "C:\\english1.txt";
@SuppressWarnings("null")
public static void main(String[] args) throws IOException {
//import list of words
@SuppressWarnings("resource")
BufferedReader File = new BufferedReader(new FileReader(LOC));
//Create a temporary ArrayList to store data
ArrayList<String> temp = new ArrayList<String>();
//Find number of lines in txt file
String line;
while ((line = File.readLine()) != null)
{
temp.add(line);
}
//Identify each word in file
int lineNumber = 0;
lineNumber++;
String delimiter = "\t";
//assess each character in the word to determine the ascii value
int total = 0;
for (int i=0; i < ((String) line).length(); i++)
{
char c = ((String) line).charAt(i);
total += c;
}
System.out.println ("The total value of " + line + " is " + total);
}
}
This smells like homework, but alright.
Importing a large list of words and I need to create code that will recognize each word in the file. What do I need to do to get the program to read this file and to separate each word within that file?
You need to...
Read the file
Separate the words from what you've read in
... I don't know what you want to do with them after that. I'll just dump them into a big list.
The contents of my main method would be...
BufferedReader File = new BufferedReader(new FileReader(LOC));//LOC is defined as class variable
//Create an ArrayList to store the words
List<String> words = new ArrayList<String>();
String line;
String delimiter = "\t";
while ((line = File.readLine()) != null)//read the file
{
String[] wordsInLine = line.split(delimiter);//separate the words
//delimiter could be a regex here, gotta watch out for that
for(int i=0, isize = wordsInLine.length; i < isize; i++){
words.add(wordsInLine[i]);//put them in a list
}
}
You can use the split method of the String class:
String[] split(String regex)
This returns an array of strings that you can handle directly or transform into any other collection you might need.
I also suggest removing the @SuppressWarnings annotations unless you are sure of what you are doing. In most cases it is better to remove the cause of a warning than to suppress the warning.
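A compact sketch of the read-and-split approach follows. It assumes tab-separated words as in the question, uses a StringReader in place of the real file, and adds a small charTotal helper for the character sum the question computes. TabWordsDemo, readWords and charTotal are illustrative names, not part of any answer above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TabWordsDemo {
    // Collects every tab-separated word from the reader into one list.
    static List<String> readWords(BufferedReader reader) throws IOException {
        List<String> words = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.split("\t")) {
                if (!word.isEmpty()) { // skip blanks from empty lines or leading tabs
                    words.add(word);
                }
            }
        }
        return words;
    }

    // Sums the char values of a word (the "ASCII total" from the question).
    static int charTotal(String word) {
        int total = 0;
        for (int i = 0; i < word.length(); i++) {
            total += word.charAt(i);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real code this would wrap new FileReader(LOC).
        String data = "apple\tbanana\ncherry";
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            for (String word : readWords(reader)) {
                System.out.println("The total value of " + word + " is " + charTotal(word));
            }
        }
    }
}
```

Doing the per-word work inside the loop over the list, rather than after the read loop on the leftover line variable, avoids the null dereference in the original code.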
I used this great tutorial from thenewboston when I started off reading files: https://www.youtube.com/watch?v=3RNYUKxAgmw
This video seems perfect for you. It covers how to read words of data from a file. Just add each string to the ArrayList. Here's what your code should look like:
import java.io.*;
import java.util.*;
public class ReadFile {
static Scanner x;
static ArrayList<String> temp = new ArrayList<String>();
public static void main(String args[]){
openFile();
readFile();
closeFile();
}
public static void openFile(){
try{
x = new Scanner(new File("yourtextfile.txt"));
}catch(Exception e){
System.out.println(e);
}
}
public static void readFile(){
while(x.hasNext()){
temp.add(x.next());
}
}
public static void closeFile(){
x.close();
}
}
One nice thing about java.util.Scanner is that it automatically skips the whitespace between words, which makes it easy to identify them.

I can't seem to figure out how to get this to print out all the words including the duplicates

I am trying to get this to print out all the words that are on a text file in ascending order. When I run it, it prints out in ascending order, but it only prints one occurrence of the word. I want it to print out every occurrence of the word(duplicates wanted). I am not sure what I'm doing wrong.
Also I would like it to only print out the words and not the punctuation marks that are in the text file. I know I need to use the "split", just not sure how to properly use it. I've worked with it once before but can not remember how to apply it here.
This is the code I have so far:
public class DisplayingWords {
public static void main(String[] args) throws
FileNotFoundException, IOException
{
Scanner ci = new Scanner(System.in);
System.out.print("Please enter a text file to open: ");
String filename = ci.next();
System.out.println("");
File file = new File(filename);
BufferedReader br = new BufferedReader(new FileReader(file));
StringBuilder sb = new StringBuilder();
String str;
while((str = br.readLine())!= null)
{
/*
* This is where i seem to be having my problems.
* I have only ever used a split once before and can not
* remember how to properly use it.
* i am trying to get the print out to avoid printing out
* all the punctuation marks and have only the words
*/
// String[] str = str.split("[ \n\t\r.,;:!?(){}]");
str.split("[ \n\t\r.,;:!?(){}]");
sb.append(str);
sb.append(" ");
System.out.println(str);
}
ArrayList<String> text = new ArrayList<>();
StringTokenizer st = new StringTokenizer(sb.toString().toLowerCase());
while(st.hasMoreTokens())
{
String s = st.nextToken();
text.add(s);
}
System.out.println("\n" + "Words Printed out in Ascending "
+ "(alphabetical) order: " + "\n");
HashSet<String> set = new HashSet<>(text);
List<String> arrayList = new ArrayList<>(set);
Collections.sort(arrayList);
for (Object ob : arrayList)
System.out.println("\t" + ob.toString());
}
}
Your duplicates are probably being stripped out here:
HashSet<String> set = new HashSet<>(text);
A set does not contain duplicates, so I'd just sort your text ArrayList directly:
Collections.sort(text);
for (Object ob : text)
System.out.println("\t" + ob.toString());
The problem is here:
HashSet<String> set = new HashSet<>(text);
A Set doesn't contain duplicates.
You should use the following code instead:
//HashSet<String> set = new HashSet<>(text);
List<String> arrayList = new ArrayList<>(text);
Collections.sort(arrayList);
Also, for the split method I would suggest you use:
s.split("[\\s\\.,;:\\?!]+");
For example consider the code given below:
String s = "Abcdef;Ad; country hahahahah? ad! \n alsj;d;lajfa try.... wait, which wish work";
String sp[] = s.split("[\\s\\.,;:\\?!]+");
for (String sr : sp )
{
System.out.println(sr);
}
Its output is as follows:
Abcdef
Ad
country
hahahahah
ad
alsj
d
lajfa
try
wait
which
wish
work
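Combining both fixes, a minimal sketch might look like the following. It splits with the regex suggested above and sorts the list directly, so duplicates are kept. The class name SortedWordsDemo and the method sortedWords are hypothetical, and the text is passed in as a String rather than read from the file the user names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedWordsDemo {
    // Lower-cases the text, splits it into words, and sorts them with duplicates kept.
    static List<String> sortedWords(String text) {
        List<String> words = new ArrayList<>();
        for (String word : text.toLowerCase().split("[\\s.,;:?!]+")) {
            if (!word.isEmpty()) { // text starting with a delimiter yields a leading ""
                words.add(word);
            }
        }
        Collections.sort(words); // no Set involved, so repeated words survive
        return words;
    }

    public static void main(String[] args) {
        for (String word : sortedWords("The cat. The dog, the cat!")) {
            System.out.println("\t" + word);
        }
    }
}
```

Using the array returned by split is the key difference from the original code, which called split but discarded its result and appended the unsplit line.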
