How to skip certain input from a text file - java

I am trying to read a file that looks like the following (but with hundreds more lines):
123 000 words with spaces 123 123 123 words with spaces
123 000 and again words here 123 123 123 and words again
The 123, 000, and "words with spaces" values are different on each line; I am only using them as placeholders for what I need.
If I only need the 123's from each row, how can I ignore everything else?
Below is what I have tried:
File file = new File("txt file here");
try (Scanner in = new Scanner(file))
{
int count = 0;
while (in.hasNext())
{
int a = in.nextInt();
String trash1 = in.next();
String trash2 = in.next();
String trash3 = in.next();
int b = in.nextInt();
int c = in.nextInt();
int d = in.nextInt();
//This continues but I realize this will eventually throw an
//exception at some points in the text file because
//some rows will have more "words with spaces" than others
}
}
catch (FileNotFoundException fnf)
{
System.out.println(fnf.getMessage());
}
Is there a way to skip the "000's" and the "words with spaces" parts so that I only take in the "123's"? Or am I approaching this in a "bad" way? Thanks!

You can use regular expressions to strip the first part of the line.
String cleaned = in.nextLine().replaceFirst("^(\\d+\\s+)+([a-zA-Z]+\\s+)+", ""); // replaceFirst takes a regex; replace() would treat the pattern as literal text
^ means the pattern starts at the beginning of the text (the start of the line)
(\\d+\\s+)+ matches one or more groups of digits followed by whitespace.
([a-zA-Z]+\\s+)+ matches one or more groups of alphabetic characters followed by whitespace.
You may have to modify the pattern if there's punctuation or other characters. You can read more about regular expressions here if you're new to using them.
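As a rough sketch of the idea above (the class and method names are made up for illustration), the prefix can be stripped with replaceFirst, which interprets its first argument as a regular expression:

```java
class StripPrefixDemo {
    // Hypothetical helper: removes the leading run of numbers and the
    // run of alphabetic words that follows it.
    static String stripPrefix(String line) {
        return line.replaceFirst("^(\\d+\\s+)+([a-zA-Z]+\\s+)+", "");
    }

    public static void main(String[] args) {
        String line = "123 000 words with spaces 123 123 123 words with spaces";
        System.out.println(stripPrefix(line));
    }
}
```

Note that the first number on the line is removed together with the 000, so it has to be read separately if it is needed.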

Read the file line by line, split each line on spaces, and iterate over the resulting array of strings, acting only on the strings that match what you want:
int countsOf123s = 0;
while (in.hasNextLine())
{
String[] words = in.nextLine().split(" "); //or for any whitespace do \\s+
for(String singleWord : words)
{
if(singleWord.equals("123"))
{
//do something
countsOf123s++;
}
}
}
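If the numbers differ from line to line, matching a fixed string such as "123" is not enough. A minimal sketch, assuming every wanted value is a whitespace-separated token made only of digits (the helper name numericTokens is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

class NumericTokens {
    // Hypothetical helper: returns every whitespace-separated token that
    // consists only of digits, in the order it appears on the line.
    // Assumes the values fit in an int.
    static List<Integer> numericTokens(String line) {
        List<Integer> numbers = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.matches("\\d+")) {
                numbers.add(Integer.parseInt(token));
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String line = "123 000 and again words here 123 123 123 and words again";
        List<Integer> numbers = numericTokens(line);
        // Index 1 is the "000" column, which the question wants to ignore.
        numbers.remove(1);
        System.out.println(numbers);
    }
}
```

Because the words are simply skipped, it does not matter how many of them a row contains.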

Related

Taking an integer from a scanner token that may or may not include symbols

I have a function that takes a scanner of a text file as input, and I need to extract integer values from each line. These lines might not follow a rigid syntax.
I have tried to use skip() to ignore specific non-integers, but I fear I may be using it for something it's not capable of.
I've also tried turning the token into a string and using replaceAll(";", ""), but that quickly turns my code into a mess of if statements and String to int conversions. It gets bad quite fast considering I have a lot of different variables that need to be set here.
Is there a more elegant solution?
Here is my input file:
pop 25; // my code must accept this
pop 25 ; // and also this
house 3.2, 1; // some lines will set multiple values
house 3.2 , 1 ; // so I will need to ignore both commas and semicolons
Here is my code:
static int population = -1;
static double median = -1;
static double scatter = -1;
private static void readCommunity(Scanner sc) {
while (sc.hasNext()) {
String input = sc.next();
if ("pop".equals(input)) {
sc.skip(";*"); // my guess is this wouldn't work unless the
// token had a ';' BEFORE the integer
if (sc.hasNextInt()) {
population = sc.nextInt();
} else { /* throw an error; not important here */ }
sc.nextLine();
} else if ("house".equals(input)) {
sc.skip(",*");
if (sc.hasNextDouble()) {
median = sc.nextDouble();
sc.skip(";*");
if (sc.hasNextDouble()) {
scatter = sc.nextDouble();
} else { /* error */ }
} else { /* error */ }
sc.nextLine();
}
}
}
I think it's easier to read each data line of the file in full, split that line into the parts I need, and then validate the values that were read. For example:
private static void readCommunity(String dataFilePath) {
File file = new File(dataFilePath);
if (!file.exists()) {
System.err.println("File Not Found! (" + dataFilePath + ")");
return;
}
int lineCount = 0; // For counting file lines.
// 'Try With Resources' used here so as to auto-close reader.
try (Scanner sc = new Scanner(file)) {
while (sc.hasNextLine()) {
String fileInput = sc.nextLine().trim();
lineCount++; // Increment line counter.
// Skip blank lines (if any).
if (fileInput.isEmpty()) {
continue;
}
/* Remove comments from data line (if any). Your file
example shows comments at the end of each line. Yes,
I realize that your file most likely doesn't contain
these but it doesn't hurt to have this here in case
it does or if you want to have that option. Comments
can start with // or /*. Comments must be at the end
of a data line. This 'does not' support any Multi-line
comments. More code is needed for that. */
if (fileInput.contains("//") || fileInput.contains("/*")) {
fileInput = fileInput.substring(0, fileInput.contains("//")
? fileInput.indexOf("//") : fileInput.indexOf("/*"));
}
// Start parsing the data line into required parts...
// Start with semicolon portions
String[] lineMainParts = fileInput.split("\\s{0,};\\s{0,}");
/* Iterate through all the main elemental parts on a
data line (if there is more than one), for example:
pop 30; house 4.3, 1; pop 32; house 3.3, 2 */
for (int i = 0; i < lineMainParts.length; i++) {
// Is it a 'pop' attribute?
if (lineMainParts[i].toLowerCase().startsWith("pop")) {
//Yes it is... so validate, convert, and display the value.
String[] attributeParts = lineMainParts[i].split("\\s+");
if (attributeParts[1].matches("-?\\d+|\\+?\\d+")) { // validate string numerical value (Integer).
population = Integer.valueOf(attributeParts[1]); // convert to Integer
System.out.println("Population:\t" + population); // display...
}
else {
System.err.println("Invalid population value detected in file on line "
+ lineCount + "! (" + lineMainParts[i] + ")");
}
}
// Is it a 'house' attribute?
else if (lineMainParts[i].toLowerCase().startsWith("house")) {
/* Yes it is... so split all comma delimited attribute values
for 'house', validate each numerical value, convert each
numerical value, and display each attribute and their
respective values. */
String[] attributeParts = lineMainParts[i].split("\\s{0,},\\s{0,}|\\s+");
if (attributeParts[1].matches("-?\\d+(\\.\\d+)?")) { // validate median string numerical value (Double or Integer).
median = Double.valueOf(attributeParts[1]); // convert to Double.
System.out.println("Median: \t" + median); // display median...
}
else {
System.err.println("Invalid Median value detected in file on line "
+ lineCount + "! (" + lineMainParts[i] + ")");
}
if (attributeParts[2].matches("-?\\d+|\\+?\\d+")) { // validate scatter string numerical value (Integer).
scatter = Integer.valueOf(attributeParts[2]); // convert to Integer
System.out.println("Scatter: \t" + scatter); // display scatter...
}
else {
System.err.println("Invalid Scatter value detected in file on line "
+ lineCount + "! (" + lineMainParts[i] + ")");
}
}
else {
System.err.println("Unhandled Data Attribute detected in data file on line " + lineCount + "! ("
+ lineMainParts[i] + ")");
}
}
}
}
catch (FileNotFoundException ex) {
System.err.println(ex);
}
}
There are several Regular Expressions (RegEx) used in the code above. Here is what they mean in the order they are encountered in code:
"\\s{0,};\\s{0,}"
Used with the String#split() method for parsing a semicolon (;) delimited line. This regex pretty much covers the bases for when semicolon delimited string data needs to be split but the semicolon may be spaced in several different fashions within the string, for example:
"data;data ;data; data ; data; data ;data"
\\s{0,} 0 or more whitespaces before the semicolon.
; The literal semicolon delimiter itself.
\\s{0,} 0 or more whitespaces after the semicolon.
"\\s+"
Used with the String#split() method for parsing a whitespace-delimited line. This regex covers the case where the tokens may be separated by anything from one to several space or tab characters, for example:
"datadata"              Split to: [datadata] (Need at least 1 space)
"data data"             Split to: [data, data]
"data     data"         Split to: [data, data]
"data   data     data"  Split to: [data, data, data]
"-?\\d+|\\+?\\d+"
Used with the String#matches() method for string numerics validation. This regex is used to see if the tested string is indeed a string representation of a signed or unsigned integer numerical value (of any length). Used in the code above for numerical string validation before converting that numerical value to Integer. String representations can be:
-1 1 324 +2 342345 -65379 74 etc.
-? An optional leading hyphen (minus sign), indicating a negative value.
\\d+ One or more (+) digits from 0 to 9.
| Logical OR
\\+? An optional leading plus sign.
\\d+ One or more (+) digits from 0 to 9.
"\\s{0,},\\s{0,}|\\s+" (must be in this order)
Used with the String#split() method for parsing a comma (,) delimited line. This regex covers the case where the comma may be spaced in several different ways within the string, for example:
"my data,data"        Split to: [my, data, data]
"my data ,data"       Split to: [my, data, data]
"my data, data"       Split to: [my, data, data]
"my data , data"      Split to: [my, data, data]
"my data,     data"   Split to: [my, data, data]
"my data     ,data"   Split to: [my, data, data]
\\s{0,} 0 or more whitespaces before the comma.
, The literal comma delimiter itself.
\\s{0,} 0 or more whitespaces after the comma.
| Logical OR split on...
\\s+ One or more whitespace characters on their own.
In other words, split on any of: a comma alone; a comma followed by whitespace; whitespace followed by a comma; a comma with whitespace on both sides; or whitespace alone.
"-?\\d+(\\.\\d+)?"
Used with the String#matches() method for string numerics validation. This regex is used to see if the tested string is indeed a string representation of a signed or unsigned integer or double type numerical value (of any length). Used in the code above for numerical string validation before converting that numerical value to Double. String representations can be:
-1.34 1.34 324 2.54335 342345 -65379.7 74 etc.
-? An optional leading hyphen (minus sign), indicating a negative value.
\\d+ One or more (+) digits from 0 to 9. [The string would be considered an Integer up to this point.]
( Start of a group.
\\. A literal period (.) after the first set of digits.
\\d+ One or more (+) digits from 0 to 9 after the period.
) End of the group.
? The group may or may not be present, which makes it an optional group.
Hopefully, the above should be able to get you started.
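The split and validation expressions described above can be tried in isolation. The following is only a sketch (the class SplitRegexDemo and the helper houseParts are made-up names), and it assumes the statement contains some text before the semicolon:

```java
import java.util.Arrays;

class SplitRegexDemo {
    // Splits one "house" statement into [keyword, median, scatter],
    // tolerating optional spaces around the comma and the semicolon.
    static String[] houseParts(String statement) {
        // Keep only the text before the first semicolon.
        String withoutSemicolon = statement.trim().split("\\s{0,};\\s{0,}")[0];
        // Split on a comma with optional spaces around it, or on plain whitespace.
        return withoutSemicolon.split("\\s{0,},\\s{0,}|\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(houseParts("house 3.2, 1;")));
        System.out.println(Arrays.toString(houseParts("house 3.2 , 1 ;")));
        // Integer validation, then integer-or-decimal validation.
        System.out.println("-12".matches("-?\\d+|\\+?\\d+"));
        System.out.println("3.2".matches("-?\\d+(\\.\\d+)?"));
        System.out.println("3.".matches("-?\\d+(\\.\\d+)?"));
    }
}
```

Both sample statements should split into the same three parts, which is the point of putting the comma alternative before the plain-whitespace one.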
A regex would probably be a better choice instead of a nextInt or a nextDouble. You could fetch each decimal value using
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");
Matcher m = p.matcher(a);
while(m.find()) {
System.out.println(m.group());
}
The regex checks for all occurrences of a decimal or non-decimal number in the given string.
\\d+ - One or more occurrence of a digit
(\\.\\d+) - Followed by a decimal point and one or more digits
? - The expression in the parentheses is optional, so the numbers may or may not contain a decimal part.
This will print the below for the data you provided
25
25
3.2
1
3.2
1
EDIT:
The problem you have with commas and semi-colons while parsing the line can be avoided by fetching the entire line using nextLine() instead of next(). next() only fetches one token at a time from the input. Using nextLine and a regular expression, you can read individual numbers as below.
// Compile the pattern once, and keep the results outside the loop
// so they are still available after the file has been read.
Pattern p = Pattern.compile("\\d+(\\.\\d+)?");
int population = -1;
double median = -1;
double scatter = -1;
while (sc.hasNextLine()) {
    String input = sc.nextLine(); // fetches the entire line
    Matcher m = p.matcher(input);
    if (input.contains("pop")) {
        if (m.find()) {
            population = Integer.parseInt(m.group());
        }
    } else if (input.contains("house")) {
        // find() returns false when a number is missing, so check it
        // before calling group().
        if (m.find()) {
            median = Double.parseDouble(m.group());
        }
        if (m.find()) {
            scatter = Double.parseDouble(m.group());
        }
    }
}
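One caveat with searching a whole line is that digits inside a trailing comment would also match. A minimal sketch that cuts the line at "//" before matching (the class NumberExtractor and the helper numbersIn are hypothetical names, and only unsigned numbers are handled, as in the pattern above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberExtractor {
    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");

    // Hypothetical helper: returns the unsigned numbers found in a line,
    // ignoring anything after a "//" comment marker.
    static List<Double> numbersIn(String line) {
        int comment = line.indexOf("//");
        String data = comment >= 0 ? line.substring(0, comment) : line;
        List<Double> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(data);
        while (m.find()) {
            numbers.add(Double.parseDouble(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(numbersIn("pop 25 ; // and also this"));
        System.out.println(numbersIn("house 3.2 , 1 ; // version 2 of this line"));
    }
}
```

The caller can then check the size of the returned list to detect a line with too few or too many values.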

String Manipulation - Removing the characters of the second word from the first word

Enter two words: computer program
result: cute
In Java, every character of the second word the user enters should be deleted from the first word, leaving "cute".
I thought of using replaceAll but could not make it work.
String sentence;
Scanner input = new Scanner(System.in);
System.out.println("Enter 2 words: ");
sentence = input.nextLine();
String[] arrWrd = sentence.split(" ");
String scdWrd = arrWrd[1];
String fnl = arrWrd[0].replaceAll(scdWrd, "");
System.out.println(fnl);
.replaceAll takes a regex, so your code searches for the whole word "program" and replaces that, not its individual characters. Wrap scdWrd in square brackets to turn it into a character class, so that each of its characters is replaced:
String scdWrd = "[" + arrWrd[1] + "]";
Just to add to the elegant solution by #B.Mik, you should also handle these cases:
Multiple spaces are entered between the words.
The user enters a blank line or just one word (e.g. computer), which makes the original program fail with java.lang.ArrayIndexOutOfBoundsException.
The program given below addresses these points:
import java.util.Scanner;
public class LettersFromSecondReplacement {
public static void main(String[] args) {
Scanner in = new Scanner(System.in);
boolean valid;
String input;
String words[];
do {
valid = true;
System.out.print("Enter two words separated with space: ");
input = in.nextLine();
words = input.split("\\s+"); //Split on one or more spaces
if (words.length != 2) {
System.out.println("Error: wrong input. Try again");
valid = false;
}
} while (!valid);
for (String s : words[1].split("")) { //Split the 2nd word into strings of one character
words[0] = words[0].replace(s, ""); //replace() treats s literally, so regex metacharacters are safe
}
System.out.println(words[0]);
}
}
A sample run:
Enter two words separated with space:
Error: wrong input. Try again
Enter two words separated with space: computer
Error: wrong input. Try again
Enter two words separated with space: computer program
cute
Note that I have used a different algorithm (which you can replace with the one provided by #B.Mik) for replacement. Feel free to comment in case of any doubt/issue
Replace the line that calls replaceAll with:
String fnl = arrWrd[0];
// Remove each character of the second word literally (no regex involved)
for (char c : scdWrd.toCharArray()) {
    fnl = fnl.replace(String.valueOf(c), "");
}
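The same result can be reached without any replace call by keeping only the characters of the first word that do not occur in the second. This is just a sketch; the class RemoveLetters and the helper removeLetters are made-up names:

```java
class RemoveLetters {
    // Hypothetical helper: returns `first` without any character that
    // also occurs in `second`. No regex is involved, so characters such
    // as '.' or '[' need no special treatment.
    static String removeLetters(String first, String second) {
        StringBuilder kept = new StringBuilder();
        for (char c : first.toCharArray()) {
            if (second.indexOf(c) < 0) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeLetters("computer", "program")); // prints "cute"
    }
}
```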

regular expression for extracting some data from a text file

I have a text with sentences by this format:
sentence 1 This is a sentence.
t-extraction 1 This is a sentence
s-extraction 1 This_DT is_V a_DT sentence_N
sentence 2 ...
As you can see, the lines are separated by newlines, and the words sentence, t-extraction, and s-extraction repeat. The numbers are sentence numbers (1, 2, ...). The fields are separated by tabs, for example in the first line: sentence(TAb)1(TAb)This is a sentence.
or in the second line: t-extraction(TAb)1(TAb)This(TAb)is(TAb)a sentence.
I need to map some of this information into a SQL table, so I have to extract it.
I need the first and second lines of each group (without the word sentence in the first lines, and without t-extraction and the numbers in the second lines). Each tab-separated part will be mapped to a field in SQL (for example, 1 in one column, This is a sentence in one column, This (from the second lines) in one column, and likewise is and a sentence).
What is your suggestion? Thanks in advance.
You could use String.split().
The regex you could use is [^A-Za-z_]+ or [ \t]+
Using the split method on String is probably the key to this. The split command breaks a string into parts where the regex matches, returning an array of Strings of the parts between the matches.
You want to split on the tab character (written \t in a Java string). You also want to process three lines as a unit; the code below shows one way of doing this (it depends on the file being well formed).
Of course, you want to use a reader created from your file, not from a string.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;

public class Test {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes the reader automatically
        try (BufferedReader reader = new BufferedReader(new FileReader("/my/file.data"))) {
            String line;
            for (int i = 0; (line = reader.readLine()) != null; i++) {
                // Every line is tab-separated; its position within the
                // group of three decides which kind of line it is.
                String[] parts = line.split("\t");
                if (i % 3 == 0) {
                    System.out.printf("sentence ==> %s\n", Arrays.toString(parts));
                } else if (i % 3 == 1) {
                    System.out.printf("t-sentence ==> %s\n", Arrays.toString(parts));
                } else {
                    System.out.printf("s-sentence ==> %s\n", Arrays.toString(parts));
                }
            }
        }
    }
}
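Since the question asks for the fields without the leading label, a small helper can drop the first element after splitting. This is a sketch only; the class TabFields and the helper fieldsAfterLabel are hypothetical names:

```java
import java.util.Arrays;

class TabFields {
    // Hypothetical helper: splits a tab-separated line and drops the
    // leading label ("sentence", "t-extraction", ...).
    static String[] fieldsAfterLabel(String line) {
        String[] parts = line.split("\t");
        if (parts.length <= 1) {
            return new String[0]; // nothing after the label
        }
        return Arrays.copyOfRange(parts, 1, parts.length);
    }

    public static void main(String[] args) {
        String line = "t-extraction\t1\tThis\tis\ta sentence";
        System.out.println(Arrays.toString(fieldsAfterLabel(line)));
    }
}
```

Each element of the returned array then corresponds to one column of the table.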

single string list- alphabetizing

I'm trying to write code that uses a Scanner to read a list of words, all in one string, and then alphabetizes the words. What I'm getting instead is just the first word with its letters sorted. How can I fix this?
the code:
else if(answer.equals("new"))
{
System.out.println("Enter words, separated by commas and spaces.");
String input= scanner.next();
char[] words= input.toCharArray();
Arrays.sort(words);
String sorted= new String(words);
System.out.println(sorted);
}
Result: " ,ahy "
You're reading in a String via scanner.next() and then breaking that String up into characters. So, as you said, it's sorting the single-string by characters via input.toCharArray(). What you need to do is read in all of the words and add them to a String []. After all of the words have been added, use Arrays.sort(yourStringArray) to sort them. See comments for answers to your following questions.
You'll need to split your string into words instead of characters. One option is using String.split. Afterwards, you can join those words back into a single string:
System.out.println("Enter words, separated by commas and spaces.");
String input = scanner.nextLine();
// Split on any run of commas and/or whitespace so that "a, b" yields no empty strings
String[] words = input.trim().split("[,\\s]+");
Arrays.sort(words);
String sorted = String.join(" ", words);
System.out.println(sorted);
Note that by default, capital letters are sorted before lowercase. If that's a problem, see this question.
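If the capital-before-lowercase ordering is a problem, one option is to pass String.CASE_INSENSITIVE_ORDER as the comparator. A minimal sketch (the class SortWords and the helper sortWords are made-up names, and the input is assumed not to start with a comma):

```java
import java.util.Arrays;

class SortWords {
    // Hypothetical helper: splits on commas and/or whitespace, sorts the
    // words ignoring case, and joins them with single spaces.
    static String sortWords(String input) {
        String[] words = input.trim().split("[,\\s]+");
        Arrays.sort(words, String.CASE_INSENSITIVE_ORDER);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(sortWords("hey, Apple, banana"));
    }
}
```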

Scanner through a line with whitespace and comma

I am new to Java and looking for some help with Java's Scanner class. Below is the problem.
I have a text file with multiple lines, each containing multiple pairs of digits. Each pair is written as digit,digit, for example 3,3 6,4 7,9, and the pairs are separated from each other by whitespace. Below is an example from the text file.
1 2,3 3,2 4,5
2 1,3 4,2 6,13
3 1,2 4,2 5,5
What I want is to retrieve each digit separately so that I can create an array of linked lists out of them. Below is what I have achieved so far.
Scanner sc = new Scanner(new File("a.txt"));
Scanner lineSc;
String line;
Integer vertix = 0;
Integer length = 0;
sc.useDelimiter("\\n"); // For line feeds
while (sc.hasNextLine()) {
line = sc.nextLine();
lineSc = new Scanner(line);
lineSc.useDelimiter("\\s"); // For Whitespace
// What should i do here. How should i scan through considering the whitespace and comma
}
Thanks
Consider using a regular expression, and data that doesn't conform to your expectation will be easily identified and dealt with.
CharSequence inputStr = "2 1,3 4,2 6,13";
String patternStr = "(\\d+),(\\d+)"; // one pair: number, comma, number
// Compile and use regular expression
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
// Get all groups for this match
for (int i=0; i<=matcher.groupCount(); i++) {
String groupStr = matcher.group(i);
}
}
Group one and group two will correspond to the first and second digit in each pairing, respectively.
1. Use the nextLine() method of Scanner to get each entire line of text from the file.
2. Then use the BreakIterator class with its static method getCharacterInstance() to step through the individual characters. Note that a character iterator stops at every character, so separators such as commas and spaces still have to be skipped, and multi-digit numbers such as 13 have to be reassembled.
3. BreakIterator also gives you flexible methods to separate out sentences, words, etc.
For more details see this:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
Use the StringTokenizer class. http://docs.oracle.com/javase/1.4.2/docs/api/java/util/StringTokenizer.html
//this is in the while loop
//read each line
String line=sc.nextLine();
//create StringTokenizer, parsing with space and comma
StringTokenizer st1 = new StringTokenizer(line," ,");
Each digit is then read as a string when you call nextToken(). To read all the digits in the line:
while(st1.hasMoreTokens())
{
String temp=st1.nextToken();
//now if you want it as an integer
int digit=Integer.parseInt(temp);
//now you have the digit! insert it into the linkedlist or wherever you want
}
Hope this helps!
Use split(regex), which is simpler:
while (sc.hasNextLine()) {
    // Split on any run of spaces and/or commas
    final String[] tokens = sc.nextLine().trim().split("[ ,]+");
    for (String token : tokens) {
        if (token.isEmpty()) {
            continue; // blank line
        }
        int num = Integer.parseInt(token);
        // Do your job
    }
}
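To keep the pairs together rather than flattening the line into single numbers, the line can be split on whitespace first and each remaining token on the comma. This is a sketch under the assumption that every line is well formed (a leading number followed by number,number pairs); the class PairLineParser and the helper parsePairs are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

class PairLineParser {
    // Hypothetical helper: returns the pairs of a line such as
    // "2 1,3 4,2 6,13" as two-element int arrays, skipping the leading number.
    static List<int[]> parsePairs(String line) {
        String[] tokens = line.trim().split("\\s+");
        List<int[]> pairs = new ArrayList<>();
        // tokens[0] is the leading number; the pairs start at index 1.
        for (int i = 1; i < tokens.length; i++) {
            String[] halves = tokens[i].split(",");
            pairs.add(new int[] {
                Integer.parseInt(halves[0]),
                Integer.parseInt(halves[1])
            });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "2 1,3 4,2 6,13";
        int first = Integer.parseInt(line.trim().split("\\s+")[0]);
        for (int[] pair : parsePairs(line)) {
            System.out.println(first + " -> " + pair[0] + "," + pair[1]);
        }
    }
}
```

Each returned pair can then be appended to the linked list that belongs to the line's leading number.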
