I am trying to read an input file that contains the following:
input.txt
Hello world. Welcome,
to the java.
I have to wrap each sentence with a prefix (BEGIN) and a suffix (END), so the output should look like the following:
output expected:
BEGIN_Hello world_END.BEGIN_ Welcome,
to the java_END.
Following is my input file reading function. I read the entire file and store it in an ArrayList:
InputDetails.java
private List<String> readInput = new ArrayList<>();
public void readFile() throws IOException {
while((inputLine = input.readLine()) != null ) {
readInput.add(inputLine);
}
}
//Getter to return input file content
public List<String> getReadInput() {
return readInput;
}
And following is my code for appending the string with BEGIN and END:
public void process() {
InputDetails inputD = new InputDetails();
for(int i=0;i<inputD.getReadInput().size();i++) {
String sentence = inputD.getReadInput().get(i);
String splitSentence[] = sentence.split("\\.");
for(int j=0;j<splitSentence.length;j++) {
System.out.println(splitSentence[j]);
splitSentence[j] = "BEGIN_"+splitSentence[j]+"__END";
}
sentence = String.join(".",splitSentence);
inputD.writeToFile(sentence);
}
}
output getting:
BEGIN_SENTENCE__Hello world__END_SENTENCE.BEGIN_SENTENCE__Welcome
to the java.
Note: Each sentence is separated by a "." (period) character. The output sentence should be prefixed with BEGIN_ and suffixed with __END. The period character is not considered part of the sentence. Words in the input file are delimited by one or more spaces. A sentence is complete when it reaches a period (.), even if that means the sentence ends on a new line (as in the input I specified above). The positions of all special characters should be retained in the output. There can also be a space between a word and a period (.) or a comma (,), e.g. java . or Welcome ,
Can anyone help me fix this? Thanks
First, you'll need to join your string list input into a single string. Then, you can use the String.split() method to break up your input into parts delimited by the . character. You can then choose to either run a loop on that array or use the stream method (as shown below) to iterate over your sentences. On each part, simply append the required BEGIN_ and _END blocks to the sentence. You can use manual string concatenation using the + operator or use a string template with String.format() (as shown below). Finally, reintroduce the . delimiter used to break the input by joining the parts back into a single string.
// Join with "\n" so the original line breaks survive inside a sentence
String fullString = String.join("\n", getReadInput());
String result = Arrays.stream(fullString.split("\\."))
        .map(s -> String.format("BEGIN_%s_END", s))
        .collect(Collectors.joining("."));
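One detail worth checking with the split-and-join approach is the final period: String.split() drops trailing empty strings by default, so the period that ends the last sentence is not put back by the join. The following is a minimal sketch rather than the original poster's code; the class name SentenceWrapper and the method wrap are hypothetical. It assumes the lines are joined with a newline so that a sentence may continue on the next line, and it leaves any text after the final period unwrapped.

```java
import java.util.Arrays;
import java.util.List;

final class SentenceWrapper {

    // Hypothetical helper: wraps every period-terminated sentence in BEGIN_ ... _END.
    static String wrap(List<String> lines) {
        // Join with "\n" so a sentence may span several input lines.
        String full = String.join("\n", lines);
        // A negative limit keeps trailing empty strings, so the text after
        // the last period (often empty) is still present as the last part.
        String[] parts = full.split("\\.", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.length - 1; i++) {
            out.append("BEGIN_").append(parts[i]).append("_END.");
        }
        // Whatever follows the final period is not a complete sentence.
        out.append(parts[parts.length - 1]);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap(Arrays.asList("Hello world. Welcome,", "to the java.")));
    }
}
```

For the sample input from the question, this should reproduce the expected output, including the line break inside the second sentence and the closing period.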
I'm building an Android/Java program which reads from a text file and stores each sentence of the text file in an array list. Then it checks for the occurrence of a particular word in each sentence and prints out the sentence which contains the word.
This is the code that I have so far:
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.text4);
text = (TextView)findViewById(R.id.info2);
BufferedReader reader = null;
try {
reader = new BufferedReader(
new InputStreamReader(getAssets().open("input3.txt")));
String line;
List<String> sentences = new ArrayList<String>();
}
}
}
As you can see from the above code, the program looks for the word "Despite".
My text file consists of three sentences. This program works perfectly, outputting the specific sentence with the word "Despite", if my text file is arranged using the following structure (this structure has a line break after each sentence).
However, if the text file is arranged in the following structure (no line break after each sentence), the program will output all three sentences on the output screen.
I don't want to add a line break after each of my sentences in the text file for this program to work. How do I alter my code so it works for any type of text file regardless of its structure?
Your split() doesn't work, at all. First, your expression will only match this exact substring:
.?!\r\n\t
Extra tabs at the end are also included in the match.
You probably meant to use a character class, e.g. [0-9], but you forgot the brackets.
Since line is exactly one line of text from the file, why are you splitting on \r and \n? Also, why is a tab (\t) considered a sentence separator?
Next part that's wrong with the split(), is the fact that you're only ever taking the first value ([0]). If the split had worked, that would discard the second and third sentences.
Also, when looking for a word, make sure you don't match a longer word, e.g. if looking for is, don't match this, so you need to include word-boundary checks (\b).
To ensure that the matched token, e.g. the period, is included in the sentence, you need to use a zero-width positive lookbehind non-capturing group ((?<=X)).
Word matching should also be case-insensitive.
And finally, the code structure is wrong. It won't compile since you're missing an end-brace (}). This is made extra confusing because of the bad indentations.
Here is updated code:
try (BufferedReader reader = new BufferedReader(
new InputStreamReader(getAssets().open("input3.txt")))) {
List<String> sentences = new ArrayList<>();
for (String line; (line = reader.readLine()) != null; ) {
for (String sentence : line.split("(?<=[.?!\t])")) {
sentence = sentence.trim();
if (! sentence.isEmpty()) {
sentences.add(sentence);
}
}
}
Pattern word = Pattern.compile("\\bDESPITE\\b", Pattern.CASE_INSENSITIVE);
for (String sentence : sentences) {
if (word.matcher(sentence).find()) {
text.setText(sentence);
break; // No need to continue searching
}
}
} catch (IOException e) {
Toast.makeText(getApplicationContext(), "Error reading file!", Toast.LENGTH_LONG).show();
e.printStackTrace();
}
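The two regex ideas used here, splitting with a lookbehind and searching with word boundaries, can also be tried outside Android. The following is a minimal sketch in plain Java; the class name SentenceSearch, its method names, and the sample text are hypothetical and not part of the original program. Because it works on the whole text rather than line by line, it does not depend on where the line breaks fall.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SentenceSearch {

    // Splits after '.', '?' or '!' while keeping the punctuation in the sentence.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.split("(?<=[.?!])")) {
            String sentence = part.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Returns the sentences that contain the given whole word, ignoring case.
    static List<String> containing(String text, String word) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            if (pattern.matcher(sentence).find()) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample text, used only for illustration.
        String text = "This is fine. Despite the rain, we went out! Thistle grows.";
        System.out.println(containing(text, "despite"));
    }
}
```

Pattern.quote() is used so that a search word containing regex metacharacters is still matched literally.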
I'm trying to read a text file(.txt) in java. I need to eventually put the text I extract word by word in a binary tree's nodes . If for example, I have the text: "Hi, I'm doing a test!", I would like to split it into "Hi" "I" "m" "doing" "a" "test", basically skipping all punctuation and empty spaces and considering a word to be a sequence of contiguous alphabet letters. I am so far able to extract the words and put them in an array for testing. However, if I have a completely empty line in my .txt file, the code will consider it as a word and return an empty space. Also, punctuation at the end of a line works but if there's a comma for example and then text, I will get an empty space as well ! Here is what I tried so far:
public static void main(String[] args) throws Exception
{
FileReader file = new FileReader("File.txt");
BufferedReader reader = new BufferedReader(file);
String text = "";
String line = reader.readLine();
while (line != null)
{
text += line;
line = reader.readLine();
}
System.out.println(text);
String textnospaces=text.replaceAll("\\s+", " ");
System.out.println(textnospaces);
String [] tokens = textnospaces.split("[\\W+]");
for(int i=0;i<=tokens.length-1;i++)
{
tokens[i]=tokens[i].toLowerCase();
System.out.println(tokens[i]);
}
}
Using the following text:
I can't, come see you. Today my friend is hard
s
I get the following output:
i
can
t
(extra space between "t" and "come")
come
see
you
(extra space again)
today
my
friend
is
hards
Any help would be appreciated ! Thanks
Use the trim() method of String. From the documentation http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim%28%29:
"Returns a copy of the string, with leading and trailing whitespace omitted.
If this String object represents an empty character sequence, or the first and last characters of character sequence represented by this String object both have codes greater than '\u0020' (the space character), then a reference to this String object is returned.
Otherwise, if there is no character with a code greater than '\u0020' in the string, then a new String object representing an empty string is created and returned.
Otherwise, let k be the index of the first character in the string whose code is greater than '\u0020', and let m be the index of the last character in the string whose code is greater than '\u0020'. A new String object is created, representing the substring of this string that begins with the character at index k and ends with the character at index m-that is, the result of this.substring(k, m+1).
This method may be used to trim whitespace (as defined above) from the beginning and end of a string.
Returns:
A copy of this string with leading and trailing white space removed, or this string if it has no leading or trailing white space."
If you really are just looking for each contiguous sequence of letters, you can accomplish this quite simply with regex matching.
String patternString1 = "([a-zA-Z]+)";
String text = "I can't, come see you. Today my friend is hard";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("found: " + matcher.group(1));
}
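Matching the words, instead of splitting on what lies between them, never produces empty tokens, which is where the blank entries in the question come from. The following is a minimal sketch that wraps the matcher loop in a method and lower-cases each word; the class name WordExtractor and the method words are hypothetical. It assumes the lines of the file were joined with a separator such as "\n" (the question's text += line joins "hard" and "s" into "hards").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordExtractor {

    // A word is a run of one or more ASCII letters.
    private static final Pattern LETTERS = Pattern.compile("[a-zA-Z]+");

    // Returns every letter run in the text, lower-cased, in order of appearance.
    static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = LETTERS.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    public static void main(String[] args) {
        // The line break is kept, so "hard" and "s" stay separate words.
        System.out.println(words("I can't, come see you. Today my friend is hard\ns"));
    }
}
```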
I'm new to Java and to regex in particular.
I have a CSV file that looks something like:
col1,col2,clo3,col4
word1,date1,date2,port1,port2,....some amount of port
word2,date3,date4,
....
What I would like is to iterate over each line (I suppose I'll do it with simple for loop) and get all ports back.
I guess what I need is to fetch everything after the two dates and look for
,(\d+),? and the group that comes back
My questions are:
1) Can it be done with one expression? (meaning, without storing the result in a string and then apply another regex)
2) Can I maybe incorporate the iteration over the lines into the regex?
There are many ways to do it; I will show a few for educational purposes.
I put your input in a String just for the example, you will have to read it properly. I also store the results in a List and print them at the end:
public static void main(String[] args) {
String source = "col1,col2,clo3,col4" + System.lineSeparator() +
"word1,date1,date2,port1,port2,port3" + System.lineSeparator() +
"word2,date3,date4";
List<String> ports = new ArrayList<>();
// insert code blocks below
System.out.println(ports);
}
Using Scanner:
Scanner scanner = new Scanner(source);
scanner.useDelimiter("\\s|,");
while (scanner.hasNext()) {
String token = scanner.next();
if (token.startsWith("port"))
ports.add(token);
}
Using String.split:
String[] values = source.split("\\s|,");
for (String value : values) {
if (value.startsWith("port"))
ports.add(value);
}
Using Pattern-Matcher:
Matcher matcher = Pattern.compile("(port\\d+)").matcher(source);
while (matcher.find()) {
ports.add(matcher.group());
}
Output:
[port1, port2, port3]
If you know where the "ports" are located in the file, you can use that info to slightly increase performance by specifying the location and getting a substring.
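As a rough illustration of using the known position, the sketch below takes everything from the fourth column onward. It assumes, as in the question's sample, that each data line starts with one word column followed by two date columns; the class name PortColumns, the method portsOf, and the constant FIRST_PORT_COLUMN are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PortColumns {

    // Assumption: columns 0..2 hold the word and the two dates.
    private static final int FIRST_PORT_COLUMN = 3;

    // Returns the non-empty values from the fourth column to the end of the line.
    static List<String> portsOf(String line) {
        String[] columns = line.split(",");
        List<String> ports = new ArrayList<>();
        for (int i = FIRST_PORT_COLUMN; i < columns.length; i++) {
            if (!columns[i].isEmpty()) {
                ports.add(columns[i]);
            }
        }
        return ports;
    }

    public static void main(String[] args) {
        System.out.println(portsOf("word1,date1,date2,port1,port2,port3"));
        System.out.println(portsOf("word2,date3,date4,"));
    }
}
```

This avoids any pattern matching on the port values themselves, at the cost of depending on the column layout.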
Yes, it can be done in one line:
first remove all non-port terms (those containing a non-digit)
then split the result of step one on commas
Here's the magic line:
String[] ports = line.replaceAll("(^|(?<=,))[^,]*[^,\\d][^,]*(,|$)", "").split(",");
The regex says "any term that has a non-digit" where a "term" is a series of characters between start-of-input/comma and comma/end-of-input.
Conveniently, the split() method doesn't return trailing blank terms, so no need worry about any trailing commas left after the first replace.
In Java 8, you can also do it in one line, and things are much more straightforward:
List<String> ports = Arrays.stream(line.split(",")).filter(s -> s.matches("\\d+")).collect(Collectors.toList());
This streams the result of a split on commas, then filters out the elements that are not all-numeric, then collects the result.
Some test code:
String line = "foo,12-12-12,11111,2222,bar,3333";
String[] ports = line.replaceAll("(^|(?<=,))[^,]*[^,\\d][^,]*(,|$)", "").split(",");
System.out.println(Arrays.toString(ports));
Output:
[11111, 2222, 3333]
Same output in java 8 for:
String line = "foo,12-12-12,11111,2222,bar,3333,baz";
List<String> ports = Arrays.stream(line.split(",")).filter(s -> s.matches("\\d+")).collect(Collectors.toList());
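On the second question, the regex itself does not iterate over lines, but the same stream pipeline can be applied to all lines at once with flatMap. The following is a minimal sketch; the class name AllPorts and the method collect are hypothetical, and skipping the first line assumes the file starts with a header row as in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class AllPorts {

    // Collects every all-digit term from every data line.
    static List<String> collect(List<String> lines) {
        return lines.stream()
                .skip(1) // assumption: the first line is a header row
                .flatMap(line -> Arrays.stream(line.split(",")))
                .filter(term -> term.matches("\\d+"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented sample lines in the shape described in the question.
        List<String> lines = Arrays.asList(
                "col1,col2,col3,col4",
                "foo,12-12-12,13-13-13,11111,2222",
                "bar,14-14-14,15-15-15,");
        System.out.println(collect(lines));
    }
}
```

The lines could come from Files.readAllLines() or any other reader; that part is left out to keep the sketch focused on the filtering.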
I have made this method to take in a file.txt and transfer its elements into an array list.
My problem is, I dont want to transfer a whole line into one string. I want to take each element on the line as string.
public ArrayList<String> readData() throws IOException {
FileReader pp=new FileReader(filename);
BufferedReader nn=new BufferedReader(pp);
ArrayList<String> data=new ArrayList<String>();
String line;
while((line=nn.readLine()) != null){
data.add(line);
}
xoxo.close();
return data;
}
is it possible ?
What about reading the lines, but splitting each line into the single words?
while ((line = nn.readLine()) != null) {
for (String word : line.split(" ")) {
data.add(word);
}
}
The method split(" ") in this example will split the line on each whitespace " " and put the single words into an array.
In case the words in the file are separated by another character (like a comma for example) you can use that too in split():
line.split(",");
If I may, here is a somewhat easier way to read a text file:
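// Assuming filename is a String path: new Scanner(String) would scan the path text itself, not the file
Scanner scanner = new Scanner(new File(filename));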
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
for (String word : line.split(" ")) {
data.add(word);
}
}
Well not easier but shorter :)
And one last piece of advice: if you give your variables more readable names, like bufferedReader instead of nn, pp, or xoxo, you might have fewer problems when the code grows more complex later on.
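Putting the pieces together, a variant that splits on any run of whitespace (so repeated spaces and blank lines do not produce empty entries) could look like the sketch below. The class name WordReader and the method readWords are hypothetical; the method takes a Reader so that it can be fed from a file or from a string.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class WordReader {

    // Reads all lines from the source and returns the whitespace-separated words.
    // The caller remains responsible for closing the source.
    static List<String> readWords(Reader source) {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty()) // blank lines yield one empty token
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(readWords(new StringReader("one two\n\n three  four")));
    }
}
```

For a real file, the reader could be opened in a try-with-resources block, for example with Files.newBufferedReader(Paths.get(filename)), so that it is closed automatically.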
Use split function for String.
String line = "This is line";
String [] a = line.split("\\s");// \\s is regular expression for space
a[0] = This
a[1] = is
a[2] = line
If by 'element' you mean each word, note that simply changing
line = nn.readLine()
to
line = nn.read()
will not fix your problem: BufferedReader.read() returns a single character as an int, not a word, so assigning it to a String does not even compile. Keep readLine() and split each line into words with String.split(). If by element you mean each character, you can instead loop over the characters of each line, for example with line.toCharArray().
I have a text with sentences in this format:
sentence 1 This is a sentence.
t-extraction 1 This is a sentence
s-extraction 1 This_DT is_V a_DT sentence_N
sentence 2 ...
As you can see, the lines are separated by line breaks. The words sentence, t-extraction, and s-extraction are repeated. The numbers are sentence numbers (1, 2, ...). The phrases are separated by tabs, for example in the first line: sentence(Tab)1(Tab)This is a sentence.
or in the second line: t-extraction(Tab)1(Tab)This(Tab)is(Tab)a sentence.
I need to map some of this information into a SQL table, so I need to extract it.
I need the first and second lines of each group (without the word sentence in the first lines, and without t-extraction and the numbers in the second lines). Each tab-separated part will be mapped to a field in SQL (for example 1 in one column, This is a sentence in one column, This (from the second line) in one column, and likewise is and a sentence).
What is your suggestion? Thanks in advance.
You could use String.split().
The regex you could use is [^A-Za-z_]+ or [ \t]+
Using the split method on String is probably the key to this. The split command breaks a string into parts where the regex matches, returning an array of Strings of the parts between the matches.
You want to match on tab (written as \t in a Java string literal). You also want to process three lines as a unit; the code below shows one way of doing this (it does depend on the file being well formed).
Of course you want to use a reader created from your file not a string.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;

public class Test {
public static void main(String[] args) throws Exception {
BufferedReader reader = new BufferedReader(new FileReader("/my/file.data"));
String line = null;
for(int i = 0; (line = reader.readLine()) != null; i++){
if(i % 3 == 0){
String[] parts = line.split("\t");
System.out.printf("sentence ==> %s\n", Arrays.toString(parts));
} else if(i % 3 == 1){
String[] parts = line.split("\t");
System.out.printf("t-sentence ==> %s\n", Arrays.toString(parts));
} else {
String[] parts = line.split("\t");
System.out.printf("s-sentence ==> %s\n", Arrays.toString(parts));
}
}
}
}
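To get closer to rows that can be inserted into a table, the lines can also be grouped by their label rather than by their position in the file. The following is a minimal sketch under the assumptions stated in the question (tab-separated fields, the labels sentence, t-extraction, and s-extraction in the first field, the sentence number in the second); the class name ExtractionRows and the method rows are hypothetical, and the database step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExtractionRows {

    // Builds one row per sentence: number, sentence text, then the t-extraction parts.
    static List<List<String>> rows(List<String> lines) {
        List<List<String>> rows = new ArrayList<>();
        List<String> current = null;
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length < 3) {
                continue; // skip blank or malformed lines
            }
            if (parts[0].equals("sentence")) {
                // Start a new row with the sentence number and the sentence text.
                current = new ArrayList<>(Arrays.asList(parts[1], parts[2]));
                rows.add(current);
            } else if (parts[0].equals("t-extraction") && current != null) {
                // Append every part after the label and the number.
                current.addAll(Arrays.asList(parts).subList(2, parts.length));
            }
            // s-extraction lines are ignored in this sketch.
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(rows(Arrays.asList(
                "sentence\t1\tThis is a sentence.",
                "t-extraction\t1\tThis\tis\ta sentence",
                "s-extraction\t1\tThis_DT is_V a_DT sentence_N")));
    }
}
```

Each inner list then corresponds to one table row, with one element per column.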