Compare content of two text files and split words java

I know this question has been asked several times already, but I can't work out how to apply the answers to my code.
My goal is the following:
I have two files, griechenland_test.txt and outagain5.txt. I want to read both and then work out what percentage of outagain5.txt occurs in the other file.
outagain5.txt has input like this:
mit dem    542824
und die    517126
griechenland_test.txt is a normal Wikipedia article on that topic (plain text, without frequency counts).
1. Problem
- How can I split the input into bigrams? That is, every two words, each pair overlapping with the one before. So if I have the words A, B, C, D, I want to get AB, BC, CD.
I have this:
while ((sCurrentLine = in.readLine()) != null) {
// System.out.println(sCurrentLine);
arr = sCurrentLine.split(" ");
for (int i = 0; i < arr.length; i++) {
if (null == hash.get(arr[i])) {
hash.put(arr[i], 1);
} else {
int x = hash.get(arr[i]) + 1;
hash.put(arr[i], x);
}
}
}
Then I read the other file with this code. I only add the words, not the number: I split each line on 4 spaces, so the two words end up in h[0].
for (String line = br.readLine(); line != null; line = br.readLine()) {
String h[] = line.split("    ");
words.add(h[0]);
}
2. Problem
Now I compare each String x in hash with each String s in words. I added the else branch with System.out.println to see which words are not contained in outagain5.txt, but several of the words printed ARE contained in outagain5.txt. I don't understand why :D
So I think the comparison doesn't work properly, or maybe it will be solved once the first problem is fixed.
ArrayList<String> words = new ArrayList<String>();
ArrayList<String> neuS = new ArrayList<String>();
ArrayList<Long> neuZ = new ArrayList<Long>();
for (String x : hash.keySet()) {
summe = summe + hash.get(x);
long neu = hash.get(x);
for (String s : words) {
if (x.equals(s)) {
neuS.add(x);
neuZ.add(neu);
disc = disc + 1;
} else {
System.out.println(x);
break;
}
}
}
Hope I made my question clear, thanks a lot!!

public static List<String> ngrams(int n, String str) {
List<String> ngrams = new ArrayList<String>();
String[] words = str.split(" ");
for (int i = 0; i < words.length - n + 1; i++)
ngrams.add(concat(words, i, i+n));
return ngrams;
}
public static String concat(String[] words, int start, int end) {
StringBuilder sb = new StringBuilder();
for (int i = start; i < end; i++)
sb.append((i > start ? " " : "") + words[i]);
return sb.toString();
}
It is much easier to use the generic "n-gram" approach, so you can split every 2 or 3 words if you want. I use this exact code almost every time I need to split words in the (AB), (BC), (CD) format. I took the code from here: NGram Sequence.

If I recall correctly, String has a method split(regex, limit) that splits the string around matches of the pattern and lets you cap how many pieces it produces.
I am referencing this JavaDoc: https://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split(java.lang.String, int).
For comparing the two text files, I would recommend having your code read both of them, populate two separate arrays, and then compare the strings from one against the strings from the other. Hope this helps.
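Neither answer addresses the second problem directly. In the question's inner loop, the else branch prints x and breaks as soon as the first entry of words differs from x, so nearly every word is reported as missing. Below is a minimal sketch of one way around that, assuming a Set lookup is acceptable instead of the nested loop. The class and method names (BigramOverlap, bigrams, percentFound) are made up for illustration, and the sample strings stand in for the contents of the two files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BigramOverlap {

    // Overlapping bigrams: "A B C D" -> [A B, B C, C D].
    static List<String> bigrams(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    // Percentage of the reference bigrams that occur at least once in the text.
    static double percentFound(Set<String> reference, List<String> textBigrams) {
        if (reference.isEmpty()) {
            return 0.0;
        }
        Set<String> seen = new HashSet<>(textBigrams);
        int hits = 0;
        for (String bigram : reference) {
            if (seen.contains(bigram)) {
                hits++;
            }
        }
        return 100.0 * hits / reference.size();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the article text and the bigram list.
        List<String> article = bigrams("mit dem Schiff und die Insel");
        Set<String> reference = new HashSet<>(Arrays.asList("mit dem", "und die", "in der"));
        System.out.println(article);
        System.out.println(percentFound(reference, article));
    }
}
```

With real files, each line of the article would go through bigrams and each h[0] from the frequency file would be added to the reference set. A set lookup also avoids scanning the whole words list for every key.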

Related

How do I exclude capitalizing specific words in a String?

I'm new to programming, and I need to capitalise the user's input while leaving certain words in lower case.
For example, if the input is
THIS IS A TEST I get This Is A Test
However, I want to get This is a Test.
String s = in.nextLine();
StringBuilder sb = new StringBuilder(s.length());
String wordSplit[] = s.trim().toLowerCase().split("\\s");
String[] t = {"is","but","a"};
for(int i=0;i<wordSplit.length;i++){
if(wordSplit[i].equals(t))
sb.append(wordSplit[i]).append(" ");
else
sb.append(Character.toUpperCase(wordSplit[i].charAt(0))).append(wordSplit[i].substring(1)).append(" ");
}
System.out.println(sb);
}
This is the closest I have gotten so far but I seem to be unable to exclude capitalising the specific words.

The problem is that you are comparing each word to the entire array. Java allows this because equals accepts any Object, but a String is never equal to an array, so the condition is always false. You could instead loop over the words in the array and compare each one, but that is a bit lengthy in code, and also not very fast if the array of words gets bigger.
Instead, I'd suggest creating a Set from the array and checking whether it contains the word:
String[] t = {"is","but","a"};
Set<String> t_set = new HashSet<>(Arrays.asList(t));
...
if (t_set.contains(wordSplit[i])) {
...
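The fragment above elides the rest of the loop. A complete, runnable sketch of the Set-based check is shown here. It assumes the input is lower-cased first, as in the question, and the names TitleCase, LOWERCASE_WORDS and capitalize are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TitleCase {

    // Words that stay in lower case (taken from the question's array t).
    private static final Set<String> LOWERCASE_WORDS =
            new HashSet<>(Arrays.asList("is", "but", "a"));

    static String capitalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (String word : input.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input yields a single empty token
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (LOWERCASE_WORDS.contains(word)) {
                sb.append(word);
            } else {
                sb.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("THIS IS A TEST")); // prints "This is a Test"
    }
}
```

Appending the separator before each word, rather than after it, avoids the trailing space that the question's loop leaves at the end of the result.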

Your problem (as pointed out by @sleepToken) is that
if(wordSplit[i].equals(t))
checks whether the current word is equal to the array containing your keywords.
What you want instead is to check whether the array contains the given input word, like so:
if (Arrays.asList(t).contains(wordSplit[i].toLowerCase()))
Note that there is no case-insensitive contains() method, so it's important to convert the word in question to lower case before searching for it.

You're already doing the iteration once. Just do it again; iterate through every String in t for each String in wordSplit:
for (int i = 0; i < wordSplit.length; i++){
boolean found = false;
for (int j = 0; j < t.length; j++) {
if(wordSplit[i].equals(t[j])) {
found = true;
}
}
if (found) { /* do your stuff */ }
else { }
}

First, write a method that checks whether the array contains the word:
static boolean contains(String[] arr, String word) {
for (int i = 0; i < arr.length; i++) {
if (word.equals(arr[i])) {
return true;
}
}
return false;
}
Then change your condition from wordSplit[i].equals(t) to contains(t, wordSplit[i]).

Your code does not compare against each word to ignore; this line compares the word with the whole array: if(wordSplit[i].equals(t))
You can do something like this instead:
public class Sample {
public static void main(String[] args) {
String s = "THIS IS A TEST";
String[] ignore = {"is","but","a"};
List<String> toIgnoreList = Arrays.asList(ignore);
StringBuilder result = new StringBuilder();
for (String s1 : s.split(" ")) {
if(!toIgnoreList.contains(s1.toLowerCase())) {
result.append(s1.substring(0,1).toUpperCase())
.append(s1.substring(1).toLowerCase())
.append(" ");
} else {
result.append(s1.toLowerCase())
.append(" ");
}
}
System.out.println("Result: " + result);
}
}
Output is:
Result: This is a Test

To check for the words to exclude, the List.contains() method would be a better choice.
The expression below checks whether the exclude list contains the word and, if not, capitalises the first letter:
tlist.contains(x) ? x : (x = x.substring(0,1).toUpperCase() + x.substring(1))
The expression corresponds to:
if(tlist.contains(x)) { // ?
x = x; // do nothing
} else { // :
x = x.substring(0,1).toUpperCase() + x.substring(1);
}
or:
if(!tlist.contains(x)) {
x = x.substring(0,1).toUpperCase() + x.substring(1);
}
If you're allowed to use java 8:
String s = in.nextLine();
String wordSplit[] = s.trim().toLowerCase().split("\\s");
List<String> tlist = Arrays.asList("is","but","a");
String result = Stream.of(wordSplit).map(x ->
tlist.contains(x) ? x : (x = x.substring(0,1).toUpperCase() + x.substring(1)))
.collect(Collectors.joining(" "));
System.out.println(result);
Output:
This is a Test

Set int position to the line after a string is read

Edit: As some have asked, I will try to make it more clear. The user inserts a value, any value, into a text box. This is saved as the result int. The problem is finding the right line to insert the strings to for every choice the user might make.
I am trying to insert strings through a loop in a file and as it is right now, I'm using a static declaration of the location (line number) through an int. The problem is that if the number of iterations changes, the strings are not inserted in the right location.
In the code below, result represents the number of strings to be inserted, as written by the user in a text box.
for (int a = result; a >= 1; a--) {
Path path = Paths.get("ScalabilityModel.bbt");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
int position = 7;
String extraLine = "AttackNode" + a;
lines.add(position, extraLine);
Files.write(path, lines, StandardCharsets.UTF_8);
}
I would like to change "int position = 7" to something like position = "begin attack nodes" + 1, so that the string is inserted on the line below the line that contains the string I'm looking for.
What's the easiest way to do this?

Going by the comments on the question, I assume the user wants to add, for example, 2 lines when entering '2' in the input box.
Please tell me in a comment if I am missing something.
One way to do it:
public static void main(String[] args) throws IOException {
// Assuming the user input here
int result = 2;
for (int a = result; a >= 1; a--) {
Path path = Paths.get("ScalabilityModel.bbt");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
// Used CopyOnWriteArrayList to avoid ConcurrentModificationException
CopyOnWriteArrayList<String> myList = new CopyOnWriteArrayList<String>(lines);
// taking index to get the position of line when it matches the string
int index = 0;
for (String string : myList) {
index = index + 1;
if (string.equalsIgnoreCase("AttackNode")) {
myList.add(index, "AttackNode" + a);
}
}
Files.write(path, myList, StandardCharsets.UTF_8);
}
}

I moved the reading of the file to outside the loop and created a list of the lines to add. Since I wasn't sure what string you want to match with I added a variable searchString for this, so just replace it or assign the right value to it.
Path path = Paths.get("ScalabilityModel.bbt");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
String searchString = "abc";
List<String> newLines = new ArrayList<>();
for (int i = 0; i < result; i++) {
String extraLine = "AttackNode" + (result - i);
newLines.add(extraLine);
}
for (int i = 0; i < lines.size(); i++) {
if (lines.get(i).contains(searchString)) { // The check can be changed to equals, startsWith, etc., depending on the search pattern
if (i + 1 < lines.size()) {
lines.addAll(i + 1, newLines);
} else {
lines.addAll(newLines);
}
break;
}
}
// Write the modified lines back to the file
Files.write(path, lines, StandardCharsets.UTF_8);
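For reference, here is a self-contained sketch of the same idea that can be run as is. It works on a temporary file rather than the real ScalabilityModel.bbt, and the file contents, the class name InsertAfterMarker and the helper insertAfter are assumptions for illustration. The new lines are added in ascending order (AttackNode1 first), which is the order the question's original loop ends up producing; whether that is the order wanted is also an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InsertAfterMarker {

    // Returns a copy of lines with "AttackNode1".."AttackNode<count>" inserted
    // directly below the first line that contains marker.
    static List<String> insertAfter(List<String> lines, String marker, int count) {
        List<String> result = new ArrayList<>(lines);
        int markerIndex = -1;
        for (int i = 0; i < result.size(); i++) {
            if (result.get(i).contains(marker)) {
                markerIndex = i;
                break;
            }
        }
        if (markerIndex < 0) {
            return result; // marker not found: leave the content unchanged
        }
        List<String> extra = new ArrayList<>();
        for (int a = 1; a <= count; a++) {
            extra.add("AttackNode" + a);
        }
        result.addAll(markerIndex + 1, extra);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for ScalabilityModel.bbt.
        Path path = Files.createTempFile("ScalabilityModel", ".bbt");
        Files.write(path, Arrays.asList("header", "begin attack nodes", "footer"),
                StandardCharsets.UTF_8);

        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        Files.write(path, insertAfter(lines, "begin attack nodes", 2), StandardCharsets.UTF_8);

        for (String line : Files.readAllLines(path, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
        Files.delete(path);
    }
}
```

Reading and writing the file once, instead of once per inserted line, also means the marker position only has to be located a single time.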

Split string after every 2 words and store into list

I have a string of words as follows:
String words = "disaster kill people action scary seriously world murder loose world";
Now, I wish to split every 2 words and store them into a list so that it will produce something like:
[disaster kill, people action, scary seriously,...]
The problem with my code is that it splits whenever it encounters a space. How do I modify it so that it only splits at every 2nd space, preserving the space between the two words of each pair?
My code:
ArrayList<String> wordArrayList = new ArrayList<String>();
for(String word : joined.split(" ")) {
wordArrayList.add(word);
}
Thanks.

Use this regular expression: (?<!\\G\\S+)\\s.
PROOF:
String words = "disaster kill people action scary seriously world murder loose world";
String[] result = words.split("(?<!\\G\\S+)\\s");
System.out.printf("%s%n", Arrays.toString(result));
And the result:
[disaster kill, people action, scary seriously, world murder, loose world]

Your loop should leave you with an ArrayList<String> that has each word, right? All you need to do now is iterate through that list and combine the words in pairs.
ArrayList<String> finalList = new ArrayList<String>();
for (int i = 0; i < wordArrayList.size(); i += 2) {
if (i + 1 < wordArrayList.size()) {
finalList.add(wordArrayList.get(i) + " " + wordArrayList.get(i + 1));
}
}
This takes your split words and adds them to the list in space-separated pairs, matching your desired output. Note that with an odd number of words, the last word is dropped.

I was looking for a way to split a string after every 'n' words, so I modified the solution above.
private List<String> spiltParagraph(int splitAfterWords, String someLargeText) {
String[] para = someLargeText.split(" ");
List<String> data = new ArrayList<>();
for (int i = 0; i < para.length; i += splitAfterWords) {
if (i + (splitAfterWords - 1) < para.length) {
StringBuilder compiledString = new StringBuilder();
for (int f = i; f <= i + (splitAfterWords - 1); f++) {
compiledString.append(para[f] + " ");
}
// trim() removes the trailing space left by the inner loop
data.add(compiledString.toString().trim());
}
}
return data;
}
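The method above only emits full groups, so trailing words that do not fill a group are lost. A minimal sketch of a variant that keeps the remainder as a shorter final group, in case that behaviour is wanted (the names WordChunks and chunk are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordChunks {

    // Splits text into groups of size words; the last group may be shorter.
    static List<String> chunk(String text, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i += size) {
            int end = Math.min(i + size, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("disaster kill people action scary", 2));
        // prints [disaster kill, people action, scary]
    }
}
```

String.join places the separator only between words, so no trailing space has to be trimmed afterwards.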

I ran into this problem today, with the extra difficulty of having to write the solution in Scala. So I wrote a recursive solution that looks like this:
import scala.annotation.tailrec
val stringToSplit = "THIS IS A STRING THAT WE NEED TO SPLIT EVERY 2 WORDS"
@tailrec
def obtainCombinations(
value: String,
elements: List[String],
res: List[String]
): List[String] = {
if (elements.isEmpty)
res
else
obtainCombinations(elements.head, elements.tail, res :+ value + ' ' + elements.head)
}
obtainCombinations(
stringToSplit.split(' ').head,
stringToSplit.split(' ').toList.tail,
List.empty
)
The output will be:
res0: List[String] = List(THIS IS, IS A, A STRING, STRING THAT, THAT WE, WE NEED, NEED TO, TO SPLIT, SPLIT EVERY, EVERY 2, 2 WORDS)
Porting this to Java would be:
String stringToSplit = "THIS IS A STRING THAT WE NEED TO SPLIT EVERY 2 WORDS";
public ArrayList<String> obtainCombinations(String value, List<String> elements, ArrayList<String> res) {
if (elements.isEmpty()) {
return res;
} else {
res.add(value + " " + elements.get(0));
return obtainCombinations(elements.get(0), elements.subList(1, elements.size()), res);
}
}
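List<String> tokens = Arrays.asList(stringToSplit.split(" "));
// Pass the first word as value and only the remaining words as elements
ArrayList<String> result =
obtainCombinations(tokens.get(0),
tokens.subList(1, tokens.size()),
new ArrayList<>());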

Counting occurrences in a string array and deleting the repeats using java

I'm having trouble with some code. I have read the words from a text file into a String array and removed the periods and commas. Now I need to count the occurrences of each word, which I also managed to do. However, my output lists every word in the file, repeats included, with its count.
Like this:
the 2
birds 2
are 1
going 2
north 2
north 2
Here is my code:
public static String counter(String[] wordList)
{
//String[] noRepeatString = null ;
//int[] countArr = null ;
for (int i = 0; i < wordList.length; i++)
{
int count = 1;
for(int j = 0; j < wordList.length; j++)
{
if(i != j) //to avoid comparing itself
{
if (wordList[i].compareTo(wordList[j]) == 0)
{
count++;
//noRepeatString[i] = wordList[i];
//countArr[i] = count;
}
}
}
System.out.println (wordList[i] + " " + count);
}
return null;
}
I need to figure out how to 1) get the count values into an array and 2) delete the repetitions.
As the commented-out lines show, I tried to use a countArr[] and a noRepeatString[] for that, but I got a NullPointerException.
Any thoughts on this will be much appreciated :)

I would first convert the array into a list, because lists are easier to work with than arrays.
List<String> list = Arrays.asList(wordList);
Then create a copy of that list (you'll see why in a second):
ArrayList<String> listTwo = new ArrayList<String>(list);
Now remove all the duplicates from the second list:
HashSet<String> hs = new HashSet<String>();
hs.addAll(listTwo);
listTwo.clear();
listTwo.addAll(hs);
Then loop through the second list and get the frequency of each word in the first list. But first, create another ArrayList to store the results:
ArrayList<String> results = new ArrayList<String>();
for (String word : listTwo) {
int count = Collections.frequency(list, word);
String result = word + ": " + count;
results.add(result);
}
Finally, output the results list:
for (String freq : results) {
System.out.println(freq);
}
I have not tested this code (I can't do that right now). Please ask if there is a problem or it doesn't work. See these questions for reference:
How do I remove repeated elements from ArrayList?
One-liner to count number of occurrences of String in a String[] in Java?
How do I clone a generic List in Java?

The approach works fine once the syntax is right; the results loop should read:
ArrayList<String> results = new ArrayList<String>();
for(String word : listTwo){
int count = Collections.frequency(list, word);
String result = word +": "+ count;
results.add(result);
}
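A different technique from the two-list approach above is to count in a single pass with a Map, which removes the repeats and stores the counts at the same time. This is only a sketch, and the names WordCounter and count are made up for illustration. A LinkedHashMap is used on the assumption that the words should be reported in order of first appearance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCounter {

    // Maps each distinct word to its number of occurrences,
    // keeping the order in which the words first appear.
    static Map<String, Integer> count(String[] wordList) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : wordList) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, already stripped of punctuation.
        String[] words = {"the", "birds", "are", "going", "north", "the", "birds", "going", "north"};
        for (Map.Entry<String, Integer> entry : count(words).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

If plain arrays are needed afterwards, counts.keySet().toArray(new String[0]) gives the distinct words and counts.values() gives the matching counts in the same order.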

How to Check for Deleted Words Between 2 Sentences in Java

What's the best approach in Java for finding the words from sentence A that were deleted in sentence B? For example:
Sentence A: I want to delete unnecessary words on this simple sentence.
Sentence B: I want to delete words on this sentence.
Output: I want to delete (unnecessary) words on this (simple) sentence.
where the words inside the parentheses are the ones that were deleted from sentence A.

Assuming order doesn't matter: use commons-collections.
Use String.split() to split both sentences into arrays of words.
Use commons-collections' CollectionUtils.addAll to add each array into an empty Set.
Use commons-collections' CollectionUtils.subtract method to get A-B.

Assuming order and position matter, this looks like a variation of the Longest Common Subsequence problem, which has a dynamic programming solution.
Wikipedia has a great page on the topic; there's really too much for me to outline here:
http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
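As a rough illustration of that idea, here is a sketch of a word-level longest common subsequence that marks the words of sentence A missing from the subsequence. It is not taken from the linked page, and the names DeletedWordsLcs and markDeleted are made up for illustration. Words that appear only in sentence B are skipped silently, which is an assumption about the desired output.

```java
import java.util.ArrayList;
import java.util.List;

class DeletedWordsLcs {

    static String markDeleted(String a, String b) {
        String[] x = a.trim().split("\\s+");
        String[] y = b.trim().split("\\s+");

        // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..]
        int[][] lcs = new int[x.length + 1][y.length + 1];
        for (int i = x.length - 1; i >= 0; i--) {
            for (int j = y.length - 1; j >= 0; j--) {
                lcs[i][j] = x[i].equals(y[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both sentences, following the table to decide what to skip.
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < x.length) {
            if (j < y.length && x[i].equals(y[j])) {
                out.add(x[i]); // word kept in both sentences
                i++;
                j++;
            } else if (j < y.length && lcs[i][j + 1] > lcs[i + 1][j]) {
                j++; // word only in B: skip it
            } else {
                out.add("(" + x[i] + ")"); // word deleted from A
                i++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(markDeleted(
                "I want to delete unnecessary words on this simple sentence.",
                "I want to delete words on this sentence."));
        // prints "I want to delete (unnecessary) words on this (simple) sentence."
    }
}
```

The table costs time and memory proportional to the product of the two word counts, which is negligible for single sentences.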

Everyone else is using really heavy-weight algorithms for what is actually a very simple problem. It could be solved using longest common subsequence, but it's a very constrained version of that. It's not a full diff; it only includes deletes. No need for dynamic programming or anything like that. Here's a 20-line implementation:
private static String deletedWords(String s1, String s2) {
StringBuilder sb = new StringBuilder();
String[] words1 = s1.split("\\s+");
String[] words2 = s2.split("\\s+");
int i1, i2;
i1 = i2 = 0;
while (i1 < words1.length) {
if (i2 < words2.length && words1[i1].equals(words2[i2])) { // bounds check: words2 may run out first
sb.append(words1[i1]);
i2++;
} else {
sb.append("(" + words1[i1] + ")");
}
if (i1 < words1.length - 1) {
sb.append(" ");
}
i1++;
}
return sb.toString();
}
When the inputs are the ones in the question, the output matches exactly.
Granted, I understand that for some inputs there are multiple solutions. For example:
a b a
a
could be either a (b) (a) or (a) (b) a and maybe for some versions of this problem, one of these solutions is more likely to be the "actual" solution than the other, and for those you need some recursive or dynamic programming approach... but let's not make it too much more complicated than what Israel Sato originally asked for!

String a = "I want to delete unnecessary words on this simple sentence.";
String b = "I want to delete words on this sentence.";
String[] aWords = a.split(" ");
String[] bWords = b.split(" ");
List<String> missingWords = new ArrayList<String> ();
int x = 0;
for(int i = 0 ; i < aWords.length; i++) {
String aWord = aWords[i];
if(x < bWords.length) {
String bWord = bWords[x];
if(aWord.equals(bWord)) {
x++;
} else {
missingWords.add(aWord);
}
} else {
missingWords.add(aWord);
}
}

This also works for updated (replaced) words, which are enclosed in square brackets; deleted words are enclosed in parentheses.
import java.util.*;
class Sample{
public static void main(String[] args){
Scanner sc=new Scanner(System.in);
String str1 = sc.nextLine();
String str2 = sc.nextLine();
List<String> flist = Arrays.asList(str1.split("\\s+"));
List<String> slist = Arrays.asList(str2.split("\\s+"));
List<String> completedString = new ArrayList<String>();
String result="";
String updatedString = "";
String deletedString = "";
int i=0;
int startIndex=0;
int endIndex=0;
for(String word: slist){
if(flist.contains(word)){
endIndex = flist.indexOf(word);
if(!completedString.contains(word)){
if(deletedString.isEmpty()){
for(int j=startIndex;j<endIndex;j++){
deletedString+= flist.get(j)+" ";
}
}
}
startIndex=endIndex+1;
if(!deletedString.isEmpty()){
result += "("+deletedString.substring(0,deletedString.length()-1)+") ";
deletedString="";
}
if(!updatedString.isEmpty()){
result += "["+updatedString.substring(0,updatedString.length()-1)+"] ";
updatedString="";
}
result += word+" ";
completedString.add(word);
if(i==slist.size()-1){
endIndex = flist.size();
for(int j=startIndex;j<endIndex;j++){
deletedString+= flist.get(j)+" ";
}
startIndex = endIndex+1;
}
}
else{
if(i == 0){
boolean boundaryCheck = false;
for(int j=i+1;j<slist.size();j++){
if(flist.contains(slist.get(j))){
endIndex=flist.indexOf(slist.get(j));
boundaryCheck=true;
break;
}
}
if(!boundaryCheck){
endIndex = flist.size();
}
if(!completedString.contains(word)){
for(int j=startIndex;j<endIndex;j++){
deletedString+= flist.get(j)+" ";
}
}
startIndex = endIndex+1;
}else if(i == slist.size()-1){
endIndex = flist.size();
if(!completedString.contains(word)){
for(int j=startIndex;j<endIndex;j++){
deletedString+= flist.get(j)+" ";
}
}
startIndex = endIndex+1;
}
updatedString += word+" ";
completedString.add(word);
}
i++;
}
if(!deletedString.isEmpty()){
result += "("+deletedString.substring(0,deletedString.length()-1)+") ";
}
if(!updatedString.isEmpty()){
result += "["+updatedString.substring(0,updatedString.length()-1)+"] ";
}
System.out.println(result);
}
}

This is basically a diff tool. Take a look at diff and at the algorithm behind it, the longest common subsequence problem.
Here's a sample Java implementation:
http://introcs.cs.princeton.edu/java/96optimization/Diff.java.html
which compares lines. The only thing you need to do is split by word instead of by line, or alternatively put each word of both sentences on a separate line.
If you are on Linux, for example, you can see the results of the latter option with the diff program itself before you write any code. Try this:
$ echo "I want to delete unnecessary words on this simple sentence."|tr " " "\n" > 1
$ echo "I want to delete words on this sentence."|tr " " "\n" > 2
$ diff -uN 1 2
--- 1 2012-10-01 19:40:51.998853057 -0400
+++ 2 2012-10-01 19:40:51.998853057 -0400
@@ -2,9 +2,7 @@
want
to
delete
-unnecessary
words
on
this
-simple
sentence.
The lines with - in front were removed from sentence A (likewise, diff would show + in front of lines that are in sentence B but not in sentence A). Try it out to see if that fits your problem.
Hope this helps.
