regular expression for extracting some data from a text file - java

I have a text file with sentences in this format:
sentence 1 This is a sentence.
t-extraction 1 This is a sentence
s-extraction 1 This_DT is_V a_DT sentence_N
sentence 2 ...
As you can see, the lines are separated by line breaks, and the words sentence, t-extraction and s-extraction repeat for every sentence. The numbers are the sentence numbers (1, 2, ...). The parts of each line are separated by tabs, for example in the first line: sentence(Tab)1(Tab)This is a sentence.
or in the second line: t-extraction(Tab)1(Tab)This(Tab)is(Tab)a sentence.
I need to map some of this information to a SQL table, so I have to extract it.
I need the first and second lines of each group (without the word sentence in the first lines, and without t-extraction and the numbers in the second lines). Each tab-separated part will be mapped to a field in SQL (for example, 1 in one column, This is a sentence in one column, This (from the second line) in one column, and likewise is and a sentence).
What do you suggest? Thanks in advance.

You could use String.split().
The regex you could use is [^A-Za-z_]+ or [ \t]+

Using the split method on String is probably the key to this. split breaks a string into parts wherever the regex matches and returns an array of the Strings between the matches.
You want to match on tab (written \t as an escape sequence). You also want to process three lines as a unit; the code below shows one way of doing this (it does depend on the file being well formed).
Of course, you want to use a reader created from your file, not a string.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;

public class Test {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes the reader once the loop is done
        try (BufferedReader reader = new BufferedReader(new FileReader("/my/file.data"))) {
            String line;
            // Lines come in groups of three: sentence, t-extraction, s-extraction
            for (int i = 0; (line = reader.readLine()) != null; i++) {
                String[] parts = line.split("\t");
                if (i % 3 == 0) {
                    System.out.printf("sentence ==> %s\n", Arrays.toString(parts));
                } else if (i % 3 == 1) {
                    System.out.printf("t-sentence ==> %s\n", Arrays.toString(parts));
                } else {
                    System.out.printf("s-sentence ==> %s\n", Arrays.toString(parts));
                }
            }
        }
    }
}
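Since the goal is to load the parts into a SQL table, it may help to collect them into a small value object first. The following is only a sketch under assumptions: the ExtractionRow class and its field names are made up for illustration, and every line is assumed to have at least a label, a number and one text part, all separated by tabs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one parsed line; the names are illustrative only.
final class ExtractionRow {
    final String label;        // "sentence", "t-extraction" or "s-extraction"
    final int number;          // the sentence number
    final List<String> fields; // remaining tab-separated parts, one per SQL column

    ExtractionRow(String label, int number, List<String> fields) {
        this.label = label;
        this.number = number;
        this.fields = fields;
    }

    // Splits one line on tabs; the -1 limit keeps trailing empty parts.
    static ExtractionRow parse(String line) {
        String[] parts = line.split("\t", -1);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 tab-separated parts: " + line);
        }
        List<String> fields = new ArrayList<>(Arrays.asList(parts).subList(2, parts.length));
        return new ExtractionRow(parts[0], Integer.parseInt(parts[1].trim()), fields);
    }

    public static void main(String[] args) {
        ExtractionRow row = parse("t-extraction\t1\tThis\tis\ta sentence");
        // prints: t-extraction 1 [This, is, a sentence]
        System.out.println(row.label + " " + row.number + " " + row.fields);
    }
}
```

Each element of fields could then be bound to one parameter of a java.sql.PreparedStatement, which keeps the parsing separate from the database code.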

Related

How to append a sentence with prefix and suffix in java?

I am trying to read an input file that contains the following:
input.txt
Hello world. Welcome,
to the java.
I have to wrap each sentence with a prefix (BEGIN) and a suffix (END), and the output should look like the following:
output expected:
BEGIN_Hello world_END.BEGIN_ Welcome,
to the java_END.
Following is my input file reading function. I am reading an entire file and storing it in array list:
InputDetails.java
private List<String> readInput = new ArrayList<>();
public void readFile() throws IOException {
while((inputLine = input.readLine()) != null ) {
readInput.add(inputLine);
}
}
//Getter to return input file content
public List<String> getReadInput() {
return readInput;
}
And following is my code for appending the string with BEGIN and END:
public void process() {
InputDetails inputD = new InputDetails();
for(int i=0;i<inputD.getReadInput().size();i++) {
String sentence = inputD.getReadInput().get(i);
String splitSentence[] = sentence.split("\\.");
for(int j=0;j<splitSentence.length;j++) {
System.out.println(splitSentence[j]);
splitSentence[j] = "BEGIN_"+splitSentence[j]+"__END";
}
sentence = String.join(".",splitSentence);
inputD.writeToFile(sentence);
}
}
output getting:
BEGIN_SENTENCE__Hello world__END_SENTENCE.BEGIN_SENTENCE__Welcome
to the java.
Note: Sentences are separated by a "." (period) character. Each output sentence should be prefixed with BEGIN_ and suffixed with __END. The period is not considered part of the sentence. Words in the input file are delimited by one or more spaces. A sentence is complete only when it reaches a period (.), even if that means it ends on a later line (as in the input above). The positions of all special characters should be retained in the output. There can also be a space between a word and a period (.) or comma (,), for example: java . or Welcome ,
Can anyone help me fix this? Thanks
First, you'll need to join your string list input into a single string. Then, you can use the String.split() method to break up your input into parts delimited by the . character. You can then choose to either run a loop on that array or use the stream method (as shown below) to iterate over your sentences. On each part, simply append the required BEGIN_ and _END blocks to the sentence. You can use manual string concatenation using the + operator or use a string template with String.format() (as shown below). Finally, reintroduce the . delimiter used to break the input by joining the parts back into a single string.
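// Join with "\n" so the original line breaks survive; "" would glue the lines together.
String fullString = String.join("\n", getReadInput());
// Note: split() drops a trailing empty part, so a final period has to be appended again.
String result = Arrays.stream(fullString.split("\\."))
        .map(s -> String.format("BEGIN_%s_END", s))
        .collect(Collectors.joining("."));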
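An alternative that leaves every period exactly where it was is to wrap each run of non-period characters with a regex Matcher instead of splitting and re-joining. This is a rough sketch, not the asker's code: the SentenceWrapper class is a made-up name, the prefix and suffix are passed in because the question shows both _END and __END, and runs consisting only of whitespace (such as a trailing line break) are assumed not to count as sentences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class SentenceWrapper {
    // A "sentence" is any run of characters without a period (line breaks included).
    private static final Pattern SENTENCE = Pattern.compile("[^.]+");

    static String wrap(String text, String prefix, String suffix) {
        Matcher m = SENTENCE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String s = m.group();
            // Leave whitespace-only runs untouched (assumption: they are not sentences).
            String replacement = s.trim().isEmpty() ? s : prefix + s + suffix;
            // quoteReplacement keeps '$' and '\' in the text from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copies whatever follows the last match, e.g. the final period
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("Hello world. Welcome,\nto the java.", "BEGIN_", "_END"));
    }
}
```

Because the periods are never removed, there is nothing to re-insert afterwards, and a sentence that continues on the next line is handled as long as the lines were joined with a line break first.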

How to skip certain input from a text file

I am trying to take in a file that looks like the following (but with hundreds of more lines):
123 000 words with spaces 123 123 123 words with spaces
123 000 and again words here 123 123 123 and words again
The 123, 000, "words with spaces" stuff are different each line. I am just trying to show it as a placeholder for what I need.
If I only need to get the 123's of each row, how can I ignore the other stuff in there?
Below is what I have tried:
File file = new File("txt file here");
try (Scanner in = new Scanner(file))
{
int count = 0;
while (in.hasNext())
{
int a = in.nextInt();
String trash1 = in.next();
String trash2 = in.next();
String trash3 = in.next();
int b = in.nextInt();
int c = in.nextInt();
int d = in.nextInt();
//This continues but I realize this will eventually throw an
//exception at some points in the text file because
//some rows will have more "words with spaces" than others
}
}
catch (FileNotFoundException fnf)
{
System.out.println(fnf.getMessage());
}
Is there a way to skip the "000's" and the "words with spaces" stuff that way I only take in the "123's"? Or am I just approaching this in a "bad" way. Thanks!
You can use regular expressions to strip the first part of the line.
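// replaceFirst() treats its first argument as a regex; replace() would search for the literal text
String cleaned = in.nextLine().replaceFirst("^(\\d+\\s+)+([a-zA-Z]+\\s+)+", "");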
^ means the pattern starts at the beginning of the text (the start of the line)
(\\d+\\s+)+ matches one or more groups of digits followed by whitespace.
([a-zA-Z]+\\s+)+ matches one or more groups of alphabetic characters followed by whitespace.
You may have to modify the pattern if there's punctuation or other characters. If you're new to regular expressions, the java.util.regex.Pattern documentation describes the syntax.
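Capture groups can also pull the wanted numbers out of a row directly. The sketch below assumes the layout from the question (a wanted number, a number to skip, some words, then three wanted numbers) and that the words contain no digits; RowNumbers and wanted are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class RowNumbers {
    // Group 1: first number. The second number and the words (\D+) are skipped.
    // Groups 2-4: the three numbers that follow the words.
    private static final Pattern ROW =
            Pattern.compile("^(\\d+)\\s+\\d+\\D+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    static List<Integer> wanted(String line) {
        List<Integer> result = new ArrayList<>();
        Matcher m = ROW.matcher(line);
        if (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                result.add(Integer.parseInt(m.group(g)));
            }
        }
        return result; // empty when the line does not fit the assumed layout
    }

    public static void main(String[] args) {
        System.out.println(wanted("123 000 words with spaces 123 123 123 words with spaces"));
    }
}
```

An empty result signals a row that does not match, which is easier to handle than the exception thrown by Scanner.nextInt() on unexpected input.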
Read the file line by line, split each line on spaces, and iterate over the resulting array of strings, acting only on the strings that match what you want.
int countsOf123s = 0;
while (in.hasNextLine())
{
String[] words = in.nextLine().split(" "); //or for any whitespace do \\s+
for(String singleWord : words)
{
if(singleWord.equals("123"))
{
//do something
countsOf123s++;
}
}
}

Transferring each element in a text file into an array

I have written this method to read a file.txt and transfer its elements into an array list.
My problem is that I don't want to store a whole line as one string. I want each element on the line as its own string.
public ArrayList<String> readData() throws IOException {
FileReader pp=new FileReader(filename);
BufferedReader nn=new BufferedReader(pp);
ArrayList<String> data=new ArrayList<String>();
String line;
while((line=nn.readLine()) != null){
data.add(line);
}
xoxo.close();
return data;
}
Is it possible?
What about reading the lines, but splitting each line into the single words?
while ((line = nn.readLine()) != null) {
    for (String word : line.split(" ")) {
        data.add(word); // add each word, not the whole line
    }
}
The method split(" ") in this example splits the line at each space character " " and puts the single words into an array.
If the words in the file are separated by another character (a comma, for example), you can pass that to split() instead:
line.split(",");
If I may, here is a somewhat easier way to read a text file:
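// Assuming filename is a String path: new Scanner(String) would scan the text of the string itself, not the file
Scanner scanner = new Scanner(new File(filename));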
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
for (String word : line.split(" ")) {
data.add(word);
}
}
Well not easier but shorter :)
One last piece of advice: if you give your variables more readable names, like bufferedReader, instead of nn, pp and xoxo, you will have fewer problems when the code grows more complex later on.
Use the split function of String.
String line = "This is line";
String[] a = line.split("\\s"); // \\s is the regular expression for a whitespace character
// a[0] is "This"
// a[1] is "is"
// a[2] is "line"
If by 'element' you mean each word, note that simply changing
line = nn.readLine()
to
line = nn.read()
will not fix your problem: BufferedReader.read() returns a single character as an int (or -1 at the end of the stream), not a word, so the assignment to a String does not even compile. Keep readLine() and split each line into words instead. If by element you mean each character, you can break each word down further, for example with toCharArray().
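One way to collect every whitespace-separated word is to keep reading whole lines and split each one on runs of whitespace. The following is a small sketch under assumptions: WordReader and readWords are made-up names, and a StringReader stands in for the real file so that the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names are illustrative only.
final class WordReader {
    static List<String> readWords(BufferedReader reader) {
        List<String> words = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // \\s+ treats any run of spaces or tabs as a single delimiter
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) { // a blank line yields one empty string
                        words.add(word);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // For a real file, wrap a FileReader in the BufferedReader instead.
        try (BufferedReader reader = new BufferedReader(new StringReader("This is\n\n  a line\n"))) {
            System.out.println(readWords(reader)); // prints: [This, is, a, line]
        }
    }
}
```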

Removing escape character

I am facing a weird problem reading a file. When I read the file, all the data ends up on one line. To fix this, I added a line separator after each line while reading the file. That works fine; see the following code:
line = br.readLine();
while (line != null) {
String[] parts = line.split(" ");
word_count += parts.length;
line_count++;
fileRead+=line;
fileRead+=System.getProperty("line.separator","\n");
line = br.readLine();
}
Now the problem: when I read the data from the fileRead String and count the length of each word, I don't get the correct length for some strings.
Let's say the file contains
Hello, today is Sunday
Thanks
It gives me the correct length for Hello (5), today (5) and is (2), but Sunday comes out as 13. It joins Sunday with the next word, like Sunday/n/rThanks. I don't know how to get the lengths of the two individual strings.
Code for getting the lengths:
public void stringLenth(String[] parts) {
for(int i=0;i<parts.length;i++){
System.out.println("hello"+parts[i]+"lenth"+parts[i].trim().length());
parts[i] = parts[i].replaceAll("\\r|\\n", "");
if(parts[i].length() < minWordCount ){
minWordCount = parts[i].trim().length();
}
}
}
Any idea?
Use \\s+ instead of a single space character to split your line; \\s also matches the \r and \n characters that end up inside fileRead.
Instead of splitting, try a regex with a Matcher and use \\w+ as the pattern to find all words.
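The Matcher suggestion could look roughly like the sketch below. WordLengths and lengths are made-up names, and \\w+ is assumed to be an acceptable definition of a word here (letters, digits and underscores), so punctuation and line separators never become part of a word.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class WordLengths {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Maps each distinct word to its length, in order of first appearance.
    static Map<String, Integer> lengths(String text) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.put(m.group(), m.group().length());
        }
        return result;
    }

    public static void main(String[] args) {
        // The \r\n between Sunday and Thanks is skipped because it is not a word character.
        System.out.println(lengths("Hello, today is Sunday\r\nThanks"));
    }
}
```

With this approach there is no need to strip \r or \n afterwards, since the pattern only ever matches the word itself.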

Scanner through a line with whitespace and comma

I am new to Java and looking for some help with Java's Scanner class. Below is the problem.
I have a text file with multiple lines, and each line contains multiple pairs of numbers. Each pair is written as (digit,digit), for example 3,3 6,4 7,9. The pairs are separated from each other by whitespace. Below is an example from the text file.
1 2,3 3,2 4,5
2 1,3 4,2 6,13
3 1,2 4,2 5,5
What I want is to retrieve each number separately so that I can build an array of linked lists from them. Below is what I have achieved so far.
Scanner sc = new Scanner(new File("a.txt"));
Scanner lineSc;
String line;
Integer vertix = 0;
Integer length = 0;
sc.useDelimiter("\\n"); // For line feeds
while (sc.hasNextLine()) {
line = sc.nextLine();
lineSc = new Scanner(line);
lineSc.useDelimiter("\\s"); // For Whitespace
// What should i do here. How should i scan through considering the whitespace and comma
}
Thanks
Consider using a regular expression; data that doesn't conform to your expectations is then easy to identify and deal with.
CharSequence inputStr = "2 1,3 4,2 6,13";
// Two numbers separated by a comma; \\d+ allows multi-digit values such as 13
String patternStr = "(\\d+),(\\d+)";
// Compile and use regular expression
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    // Group 0 is the whole match, e.g. "1,3"; groups 1 and 2 are the two numbers
    for (int i = 0; i <= matcher.groupCount(); i++) {
        String groupStr = matcher.group(i);
    }
}
Group one and group two correspond to the first and second number in each pair, respectively. The leading number on each line is not part of a pair and has to be read separately.
1. Use the nextLine() method of Scanner to get each entire line of text from the file.
2. Then use the BreakIterator class, via its static method getCharacterInstance(), to get the individual characters; it will automatically handle commas, spaces, etc.
3. BreakIterator also gives you many flexible methods to separate out sentences, words, etc.
For more details, see:
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
Use the StringTokenizer class. http://docs.oracle.com/javase/1.4.2/docs/api/java/util/StringTokenizer.html
//this is in the while loop
//read each line
String line=sc.nextLine();
//create StringTokenizer, parsing with space and comma
StringTokenizer st1 = new StringTokenizer(line," ,");
Then each digit is read as a string when you call nextToken() like this, if you wanted all digits in the line
while(st1.hasMoreTokens())
{
String temp=st1.nextToken();
//now if you want it as an integer
int digit=Integer.parseInt(temp);
//now you have the digit! insert it into the linkedlist or wherever you want
}
Hope this helps!
Use split(regex), which is simpler:
while (sc.hasNextLine()) {
    // Split on a space or a comma
    final String[] tokens = sc.nextLine().split(" |,");
    for (String token : tokens) {
        int num = Integer.parseInt(token);
        // Do your job
    }
}
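To stay with Scanner, as the question asks, the whitespace and the comma can both be declared as delimiters so that nextInt() returns one number at a time. This is a sketch only: PairLineParser and numbers are made-up names, and each line is assumed to contain nothing but integers, spaces and commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper; names are illustrative only.
final class PairLineParser {
    // Returns every number on a line such as "1 2,3 3,2 4,5":
    // index 0 is the leading number, the rest are the pairs in order.
    static List<Integer> numbers(String line) {
        List<Integer> result = new ArrayList<>();
        try (Scanner lineSc = new Scanner(line)) {
            lineSc.useDelimiter("[\\s,]+"); // any run of whitespace and/or commas
            while (lineSc.hasNextInt()) {
                result.add(lineSc.nextInt());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(numbers("2 1,3 4,2 6,13")); // prints: [2, 1, 3, 4, 2, 6, 13]
    }
}
```

The first element can be treated as the vertex, and elements 1 and 2, 3 and 4, and so on form the pairs to add to that vertex's linked list.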
