Read & compare text files and print words in alphabetical order - java

First of all, I'm sorry if similar questions have been asked before, but I couldn't find a solution to what I was looking for. I have this small Java program which compares two text files (text1.txt & text2.txt) and prints all the words of text1.txt which don't exist in text2.txt. The code below does the job:
text1.txt : This is text file 1. some # random - text
text2.txt : this is text file 2.
import java.io.*;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.*;
public class Read {
public static void main(String[] args) {
Set<String> textFile1 = readFiles("text1.txt");
Set<String> textFile2 = readFiles("text2.txt");
for (String t : textFile1) {
if (!textFile2.contains(t)) {
System.out.println(t);
}}}
public static Set<String> readFiles(String filename)
{
Set<String> words = new HashSet<String>();
try {
for (String line : Files.readAllLines(new File(filename).toPath(), Charset.defaultCharset())) {
String[] split = line.split("\\s+");
for (String word : split) {
words.add(word.toLowerCase());
}}}
catch (IOException e) {
System.out.println(e);
}
return words;
}
}
(Prints each word on a new line)
Output: #, some, random, 1.
I'm trying to print all the words in alphabetical order. Also, if possible, it shouldn't print any special characters (#, - or numbers). I've been trying to figure it out but had no luck. I'd appreciate it if someone could help me out with this.
Also, I've taken the following line of code from the internet, and I'm not really familiar with it. Is there an easier way to write this line of code:
String line : Files.readAllLines(new File(filename).toPath(), Charset.defaultCharset()))
Edit: HashSet is a must for this piece of work. Sorry, I forgot to mention that.

As you are not allowed to use a TreeSet and are forced to use a HashSet, do it this way:
import java.io.*;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.*;
public class Read {
public static void main(String[] args) {
Set<String> textFile1 = readFiles("text1.txt");
Set<String> textFile2 = readFiles("text2.txt");
Set<String> difference = new HashSet<String>();
// collect strings by dropping out every string that's not only letters
// using the regex "[a-zA-Z]+"
for (String t : textFile1) {
if (!textFile2.contains(t) && t.matches("[a-zA-Z]+")) {
difference.add(t);
}
}
// sort
List<String> dList = new ArrayList<String>(difference);
Collections.sort(dList);
// show
for (String s : dList) {
System.out.println(s);
}
}
public static Set<String> readFiles(String filename)
{
Set<String> words = new HashSet<String>();
try {
for (String line : Files.readAllLines(new File(filename).toPath(), Charset.defaultCharset())) {
String[] split = line.split("\\s+");
for (String word : split) {
words.add(word.toLowerCase());
}}}
catch (IOException e) {
System.out.println(e);
}
return words;
}
}

Have you looked at any other Set implementations? I think if you use a SortedSet such as a TreeSet, instead of a HashSet, the words will automatically sort into alphabetical order.
Stack Overflow works better if you ask one question at a time.
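For readers who are free to pick the collection, the TreeSet suggestion above could look roughly like the sketch below. Note that the question's edit says HashSet is required, so this only illustrates the alternative. The class name `SortedDifference` and the method `difference` are made up for the example, and the word sets are hard-coded instead of read from files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class SortedDifference {
    // Returns the words of first that are absent from second, in alphabetical order.
    static SortedSet<String> difference(Set<String> first, Set<String> second) {
        SortedSet<String> result = new TreeSet<>(first); // a TreeSet keeps its elements sorted
        result.removeAll(second);
        return result;
    }

    public static void main(String[] args) {
        Set<String> text1 = new HashSet<>(Arrays.asList("this", "is", "text", "file", "some", "random"));
        Set<String> text2 = new HashSet<>(Arrays.asList("this", "is", "text", "file"));
        System.out.println(difference(text1, text2)); // [random, some]
    }
}
```

The inputs can stay HashSets; only the result needs to be a sorted set.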

From what I've read in the Java documentation, a HashSet doesn't guarantee any ordering of the elements in the set. However, if you use a SortedSet implementation such as TreeSet instead, the elements are kept in order. Strings sort alphabetically by their natural ordering, so you only need to supply a Comparator if you want a different order.
As for your other questions, for reading files in Java there is this guide from GeeksforGeeks that I find very user friendly, especially for beginners, and it shows a variety of ways to read a file.
Special characters may be a bit tricky; there is a guide here from a previous Stack Overflow answer that may be helpful though.
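One way to handle the special characters, sketched here under the assumption that only the ASCII letters a-z count as word characters, is to split each line on runs of non-letters instead of on whitespace. The class `WordCleaner` and the method `extractWords` are hypothetical names, not part of the question's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordCleaner {
    // Splits each line on runs of non-letter characters, so "1.", "#" and "-" never become words.
    static Set<String> extractWords(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty()) { // split yields an empty first piece when a line starts with a non-letter
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("This is text file 1. some # random - text");
        System.out.println(extractWords(lines).contains("random")); // true
        System.out.println(extractWords(lines).contains("#"));      // false
    }
}
```

To feed it a file on Java 8 or later, `Files.readAllLines(Paths.get(filename))` returns the `List<String>` this method expects and reads the file as UTF-8, which is also a shorter form of the line the question asks about.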

Related

What would be an efficient way to find the Common words in 3 Files in Java?

What I am planning to do is basically :
Read the first file word by word and store the words in a Set(SetA).
Read the second file and check if the first Set(SetA) contains the word, if it does then store it in the second set(SetB). Now SetB contains the common words in first and Second file.
Similarly we will read the third file and check if SetB contains the word and store the words in SetC.
If you have any suggestions, or see any problems in my approach, please let me know.
You can determine the intersection of two sets using retainAll:
public class App {
public static void main(String[] args) {
App app = new App();
app.run();
}
private void run() {
List<String> file1 = Arrays.asList("aap", "noot", "aap", "wim", "vuur", "noot", "wim");
List<String> file2 = Arrays.asList("aap", "noot", "mies", "aap", "zus", "jet", "aap", "wim", "vuur");
List<String> file3 = Arrays.asList("noot", "mies", "wim", "vuur");
System.out.println(getCommonWords(file1, file2, file3));
}
@SafeVarargs
private final Set<String> getCommonWords(List<String>... files) {
Set<String> result = new HashSet<>();
// possible optimization sort files by ascending size
Iterator<List<String>> it = Arrays.asList(files).iterator();
if (it.hasNext()) {
result.addAll(it.next());
}
while (it.hasNext()) {
Set<String> words = new HashSet<>(it.next());
result.retainAll(words);
}
return result;
}
}
Also check out this answer which shows the same solution I gave above, and also ways to do it with Java 8 Streams.
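For reference, a Streams variant might look like the sketch below. It is not taken from the linked answer; it is one possible formulation, assuming Java 8+ and exactly three inputs. The class name `CommonWordsStream` is invented, and the result is collected into a TreeSet only so that the printed order is predictable.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class CommonWordsStream {
    // Keeps the words of the first list that also occur in the second and third lists.
    static Set<String> commonWords(List<String> first, List<String> second, List<String> third) {
        Set<String> inSecond = new HashSet<>(second); // hash sets give constant-time lookups
        Set<String> inThird = new HashSet<>(third);
        return first.stream()
                .filter(inSecond::contains)
                .filter(inThird::contains)
                .collect(Collectors.toCollection(TreeSet::new)); // duplicates collapse, order is sorted
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("aap", "noot", "aap", "wim", "vuur", "noot", "wim");
        List<String> file2 = Arrays.asList("aap", "noot", "mies", "aap", "zus", "jet", "aap", "wim", "vuur");
        List<String> file3 = Arrays.asList("noot", "mies", "wim", "vuur");
        System.out.println(commonWords(file1, file2, file3)); // [noot, vuur, wim]
    }
}
```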
Welcome to Stack Overflow!
The approach seems sound. May I suggest using a regex to split the text into words, which may save you some coding time. One other concern is to store only unique words rather than every occurrence, which a Set already does for you.

Add an element into a TreeSet: information retrieval

I want to create an inverted index. That is, if terms occur in multiple documents, the result should look like this:
term1 = [doc1], term2 = [doc2, doc3, doc4] ...
This is my code:
public class TP3 {
private static String DIRNAME = "/home/amal/Téléchargements/lemonde";
private static String STOPWORDS_FILENAME = "/home/amal/Téléchargements/lemonde/frenchST.txt";
public static TreeMap<String, TreeSet<String>> getInvertedFile(File dir, Normalizer normalizer) throws IOException {
TreeMap<String, TreeSet<String>> st = new TreeMap<String, TreeSet<String>>();
ArrayList<String> wordsInFile;
ArrayList<String> words;
String wordLC;
if (dir.isDirectory()) {
String[] fileNames = dir.list();
Integer number;
for (String fileName : fileNames) {
System.err.println("Analyse du fichier " + fileName);
wordsInFile = new ArrayList<String>();
words = normalizer.normalize(new File(dir, fileName));
for (String word : words) {
wordLC = word.toLowerCase();
if (!wordsInFile.contains(word)) {
TreeSet<String> set = st.get(word);
set.add(fileName);
}
}
}
}
for (Map.Entry<String, TreeSet<String>> hit : st.entrySet()) {
System.out.println(hit.getKey() + "\t" + hit.getValue());
}
return st;
}
}
I get an error at
set.add(fileName);
I don't know what the problem is. Please help me.
Your main issue is that these two lines are not going to be good:
if (!wordsInFile.contains(word)) {
TreeSet<String> set = st.get(word);
You never put a set into st so set will be null. After this line you should probably have something like:
if(set == null)
{
set = new TreeSet<String>();
st.put(word, set);
}
That should fix your current problem.
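On Java 8 or later, the same null check can be folded into `Map.computeIfAbsent`. The sketch below is only an illustration: the question's `Normalizer` class is not shown, so the method takes already-tokenised words per document, and the class name `InvertedIndexSketch` and the sample documents are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Maps each lower-cased term to the sorted set of documents that contain it.
    static TreeMap<String, TreeSet<String>> index(Map<String, List<String>> wordsByDocument) {
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : wordsByDocument.entrySet()) {
            for (String word : doc.getValue()) {
                // computeIfAbsent creates the set the first time a term is seen,
                // so the value we call add() on is never null
                index.computeIfAbsent(word.toLowerCase(), k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new TreeMap<>();
        docs.put("doc1", Arrays.asList("Chat", "chien"));
        docs.put("doc2", Arrays.asList("chien", "oiseau"));
        System.out.println(index(docs)); // {chat=[doc1], chien=[doc1, doc2], oiseau=[doc2]}
    }
}
```

Because the values are sets, a term that appears several times in one document is still recorded only once for it.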
Hint for next time, this will be re-read by future users with the same problem and also represents YOU (Someone in the future will read this question when interviewing you for a job!)
Spend some time formatting it thinking of your readers. Prune out comments and correct indentation, don't just paste and run. Also post a little bit of the error stack trace--they are amazingly helpful! Had you posted it, it would have been a "NullPointerException", on that line there is really only one way to get an NPE and it would have saved us having to analyze your code.
PS: I edited your question so you could see the difference (and to keep it from being closed on you). The main problem with your formatting was the use of tabs. For programmers, tabs are, well, let's just say they only work in very controlled conditions. In this case it really helps to watch the preview pane (below your editing box) while you edit: scroll down before you submit to see what we will actually see.

How to take StringTokenizer result to ArrayList in Java?

I want to collect the StringTokenizer results in an ArrayList. With the following code, the first print statement, stok.nextToken(), prints the correct values. But the second print statement, for the ArrayList, fails with java.util.NoSuchElementException.
How do I get these results into an ArrayList?
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.StringTokenizer;
public class Test {
public static void main(String[] args) throws java.io.IOException {
ArrayList<String> myArray = new ArrayList<String>();
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter : ");
String s = br.readLine();
StringTokenizer stok = new StringTokenizer(s, "><");
while (stok.hasMoreTokens())
System.out.println(stok.nextToken());
// -------until now ok
myArray.add(stok.nextToken()); //------------???????????
System.out.println(myArray);
}
}
Quoting javadoc of StringTokenizer:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
"New code" meaning anything written for Java 1.4 or later, i.e. ancient times.
The while loop will extract all values from the tokenizer. When you then call nextToken() after already having extracted all the tokens, why are you surprised that you get an exception?
Especially given this quote from the javadoc of nextToken():
Throws NoSuchElementException if there are no more tokens in this tokenizer's string.
Did you perhaps mean to do this?
ArrayList<String> myArray = new ArrayList<>();
StringTokenizer stok = new StringTokenizer(s, "><");
while (stok.hasMoreTokens()) {
String token = stok.nextToken(); // get and save in variable so it can be used more than once
System.out.println(token); // print already extracted value
// more code here if needed
myArray.add(token); // use already extracted value
}
System.out.println(myArray); // prints list
ArrayList<String> myArray = new ArrayList<String>();
while (stok.hasMoreTokens()){
myArray.add(stok.nextToken());
}
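Don't call stok.nextToken() outside the while loop; by then the tokens are exhausted, which results in the exception. Printing the ArrayList with System.out.println(myArray) prints the whole list on one line, like [a, b]; to print one element per line, use a for loop:
for(String token : myArray){
System.out.println(token);
}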
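Since the Javadoc quoted above recommends `String.split` over StringTokenizer, here is a rough sketch of that route. It assumes, as in the question, that any run of `<` or `>` separates tokens; the class name `SplitTokens` and the sample input are made up.

```java
import java.util.ArrayList;
import java.util.List;

class SplitTokens {
    // Splits on runs of '<' or '>' and skips empty pieces, mirroring new StringTokenizer(s, "><").
    static List<String> tokens(String s) {
        List<String> result = new ArrayList<>();
        for (String piece : s.split("[<>]+")) {
            if (!piece.isEmpty()) { // a leading delimiter produces one empty first piece
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("<a><b><c>")); // [a, b, c]
    }
}
```

Unlike a tokenizer, the array returned by split can be traversed as often as needed, so the "already consumed" problem does not arise.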

Ignore numbers in a text file when scanning it in Java

I am doing an assignment in Java that requires us to read two different files. One has the top 1000 boy names, and the other contains the top 1000 girl names. We have to write a program that returns all of the names that are in both files. We have to read each boy and girl name as a String, ignoring the number of namings, and add it to a HashSet. When adding to a HashSet, the add method will return false if the name to be added already exists in the HashSet. So to find the common names, you just have to keep track of which names returned false when adding. My problem is that I can't figure out how to ignore the number of namings in each file. My HashSet contains both, and I just want the names.
Here is what I have so far.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
public class Names {
public static void main(String[] args) {
Set<String> boynames = new HashSet<String>();
Set<String> girlnames = new HashSet<String>();
boynames = loadBoynames();
System.out.println(girlnames);
}
private static Set<String> loadBoynames() {
HashSet<String> d = new HashSet<String>();
File names = new File("boynames.txt");
Scanner s = null;
try {
s = new Scanner(names);
} catch (FileNotFoundException e) {
System.out.println("Can't find boy names file.");
System.exit(1);
}
while(s.hasNext()){
String currentName = s.next();
d.add(currentName.toUpperCase());
}
return d;
}
}
My plan is to take the HashSet that I currently have and add the girl names to it, but before I do I need to not have the numbers in my HashSet.
I tried to skip numbers with this code, but it just spat out errors
while(s.hasNextLine()){
if (s.hasNextInt()){
number = s.nextInt();
}else{
String currentName = s.next();
d.add(currentName.toUpperCase());
}
}
Any help would be appreciated.
You could also use regex to replace all numbers (or more special chars if needed)
testStr = testStr.replaceAll("\\d","");
Try using the StreamTokenizer class (java.io) to read the file. It splits the file into tokens and also reports the type of each token (a word, a number as a double, end of line, or end of file), so you can easily identify the String tokens.
You can find the details here:
http://docs.oracle.com/javase/6/docs/api/java/io/StreamTokenizer.html
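The asker's Scanner attempt can also be made to work. One likely cause of the errors is that it loops on `hasNextLine()` while consuming single tokens (and `number` is never declared). The sketch below loops on `hasNext()` instead and throws the integer tokens away. The file layout (a name followed by a count) is an assumption, the sample data is invented, and `NameLoader` is a hypothetical class name.

```java
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

class NameLoader {
    // Reads whitespace-separated tokens and keeps only those that are not integers.
    static Set<String> loadNames(Scanner in) {
        Set<String> names = new HashSet<>();
        while (in.hasNext()) {      // hasNext(), not hasNextLine(): tokens are consumed one at a time
            if (in.hasNextInt()) {
                in.nextInt();       // discard the count
            } else {
                names.add(in.next().toUpperCase());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Invented sample in the assumed "name count" layout; a real run would pass new Scanner(file).
        Scanner in = new Scanner("Alex 10\nSam 7\n");
        System.out.println(loadNames(in).contains("ALEX")); // true
    }
}
```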

Store associative array of strings with length as keys

I have this input:
5
it
your
reality
real
our
The first line is the number of strings that follow. I should store them this way (pseudocode):
associative_array = [ 2 => ['it'], 3 => ['our'], 4 => ['real', 'your'], 7 => ['reality']]
As you can see, the keys of the associative array are the lengths of the strings stored in the inner arrays.
So how can I do this in Java? I come from the PHP world, so a comparison with PHP would be very welcome.
MultiMap<Integer, String> m = new MultiHashMap<Integer, String>();
for(String item : originalCollection) {
m.put(item.length(), item);
}
djechlin already posted a better version, but here's a complete standalone example using just JDK classes:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
public class Main {
public static void main(String[] args) throws Exception{
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
String firstLine = reader.readLine();
int numOfRowsToFollow = Integer.parseInt(firstLine);
Map<Integer,Set<String>> stringsByLength = new HashMap<>(numOfRowsToFollow); //worst-case size
for (int i=0; i<numOfRowsToFollow; i++) {
String line = reader.readLine();
int length = line.length();
Set<String> alreadyUnderThatLength = stringsByLength.get(length); //int boxed to Integer
if (alreadyUnderThatLength==null) {
alreadyUnderThatLength = new HashSet<>();
stringsByLength.put(length, alreadyUnderThatLength);
}
alreadyUnderThatLength.add(line);
}
System.out.println("results: "+stringsByLength);
}
}
its output looks like this:
3
bob
bart
brett
results: {4=[bart], 5=[brett], 3=[bob]}
Java doesn't have associative arrays. But it does have HashMaps, which mostly accomplish the same goal. In your case, you can have multiple values for any given key. So what you could do is make each entry in the HashMap an array or a collection of some kind. ArrayList is a likely choice. That is:
HashMap<Integer,ArrayList<String>> words=new HashMap<Integer,ArrayList<String>>();
I'm not going to go through the code to read your list from a file or whatever, that's a different question. But just to give you the idea of how the structure would work, suppose we could hard-code the list. We could do it something like this:
ArrayList<String> set=new ArrayList<String>();
set.add("it");
words.put(Integer.valueOf(2), set);
// start a fresh list: calling set.clear() here would also empty the list already stored under key 2
set=new ArrayList<String>();
set.add("your");
set.add("real");
words.put(Integer.valueOf(4), set);
Etc.
In practice, you probably would regularly be adding words to an existing set. I often do that like this:
void addWord(String word)
{
Integer key=Integer.valueOf(word.length());
ArrayList<String> set=words.get(key);
if (set==null)
{
set=new ArrayList<String>();
words.put(key,set);
}
// either way we now have a set
set.add(word);
}
Side note: I often see programmers end a block like this by putting "set" back into the HashMap, i.e. "words.put(key,set)" at the end. This is unnecessary: it's already there. When you get "set" from the HashMap, you're getting a reference, not a copy, so any updates you make are just "there", you don't have to put it back.
Disclaimer: This code is off the top of my head. No warranties expressed or implied. I haven't written any Java in a while so I may have syntax errors or wrong function names. :-)
As your key appears to be a small integer, you could use a list of lists. In this case, though, the simplest solution is a multimap-style Map of Sets, like:
Map<Integer, Set<String>> stringsByLength = new LinkedHashMap<>();
for(String s: strings) {
Integer len = s.length();
Set<String> set = stringsByLength.get(len); // look up by length, not by the string itself
if(set == null)
stringsByLength.put(len, set = new LinkedHashSet<>());
set.add(s);
}
private HashMap<Integer, List<String>> map = new HashMap<Integer, List<String>>();
void addStringToMap(String s) {
int length = s.length();
if (map.get(length) == null) {
map.put(length, new ArrayList<String>());
}
map.get(length).add(s);
}
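On Java 8 or later, the grouping in the answers above can also be written with `Collectors.groupingBy`. This is a sketch rather than a drop-in replacement: the class name `GroupByLength` is invented, and TreeMap and TreeSet are chosen here only so that keys and values come out sorted, matching the layout in the question's pseudocode.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

class GroupByLength {
    // Groups strings by their length; TreeMap and TreeSet keep keys and values sorted.
    static Map<Integer, TreeSet<String>> byLength(List<String> strings) {
        return strings.stream().collect(
                Collectors.groupingBy(String::length, TreeMap::new, Collectors.toCollection(TreeSet::new)));
    }

    public static void main(String[] args) {
        System.out.println(byLength(Arrays.asList("it", "your", "reality", "real", "our")));
        // {2=[it], 3=[our], 4=[real, your], 7=[reality]}
    }
}
```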
