I want to create an inverted index. That means if a term appears in multiple documents, the result will look like this:
term1 = [doc1], term2 = [doc2, doc3, doc4] ...
This is my code:
public class TP3 {

    private static String DIRNAME = "/home/amal/Téléchargements/lemonde";
    private static String STOPWORDS_FILENAME = "/home/amal/Téléchargements/lemonde/frenchST.txt";

    public static TreeMap<String, TreeSet<String>> getInvertedFile(File dir, Normalizer normalizer) throws IOException {
        TreeMap<String, TreeSet<String>> st = new TreeMap<String, TreeSet<String>>();
        ArrayList<String> wordsInFile;
        ArrayList<String> words;
        String wordLC;
        if (dir.isDirectory()) {
            String[] fileNames = dir.list();
            Integer number;
            for (String fileName : fileNames) {
                System.err.println("Analyse du fichier " + fileName);
                wordsInFile = new ArrayList<String>();
                words = normalizer.normalize(new File(dir, fileName));
                for (String word : words) {
                    wordLC = word.toLowerCase();
                    if (!wordsInFile.contains(word)) {
                        TreeSet<String> set = st.get(word);
                        set.add(fileName);
                    }
                }
            }
        }
        for (Map.Entry<String, TreeSet<String>> hit : st.entrySet()) {
            System.out.println(hit.getKey() + "\t" + hit.getValue());
        }
        return st;
    }
}
I get an error at
set.add(fileName);
and I don't know what the problem is. Please help me.
Your main issue is in these two lines:
if (!wordsInFile.contains(word)) {
    TreeSet<String> set = st.get(word);
You never put a set into st, so set will be null. After that line you should have something like:
if (set == null) {
    set = new TreeSet<String>();
    st.put(word, set);
}
That should fix your current problem.
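Put together, your inner loop would look something like this (a minimal sketch of just the fix above, leaving the rest of your method as it is):

for (String word : words) {
    wordLC = word.toLowerCase();
    if (!wordsInFile.contains(word)) {
        TreeSet<String> set = st.get(word);
        if (set == null) {
            // first time this term is seen: create its document set and register it
            set = new TreeSet<String>();
            st.put(word, set);
        }
        set.add(fileName); // record that the term occurs in this file
    }
}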
A hint for next time: this will be re-read by future users with the same problem, and it also represents YOU (someone in the future may read this question when interviewing you for a job!).
Spend some time formatting it with your readers in mind. Prune out stale comments and correct the indentation; don't just paste and run. Also post a little bit of the error stack trace--stack traces are amazingly helpful! Had you posted it, it would have shown a "NullPointerException"; on that line there is really only one way to get an NPE, and it would have saved us having to analyze your code.
PS: I edited your question so you could see the difference (and to keep it from being closed on you). The main problem with your formatting was the use of tabs... for programmers, tabs only work in very controlled conditions. In this case it really helps to watch the preview pane (below your editing box) while you edit--scroll down before you submit to see what we will actually see.
I'm facing a problem and don't quite know how to deal with it. I need to process a CSV file that can contain 100 or 100 thousand lines.
I need to do some validations before proceeding with the processing; one of them is to check whether each document has the same typeOfDoc throughout the file. Let me explain.
Content of the file:
document;typeOfDoc
25693872076;2
25693872076;2
...
25693872076;1
The validation consists of checking whether a document has different typeOfDoc values along the file, and if it does, flagging the file as invalid.
Initially I thought of two for-loops: iterate over the first occurrence of a document (which I assume is correct, because I don't know what I'm going to receive), and for that document iterate over the rest of the file to verify whether there is another occurrence of it; if the same document appears with a typeOfDoc different from the first occurrence, I store that on an object to show that the file has one document with two different types. But... you can imagine where this is going. That quadratic approach can't work with 100k lines, or even with 100.
What is a better way to do this?
Here is something that may help. This is how I open and process the file (try-catch, close(), and proper names were omitted):
List<String> lines = new BufferedReader(new FileReader(path)).lines().skip(1).collect(Collectors.toList());

for (String line : lines) {
    String[] arr = line.split(";");
    String document = arr[0];
    String typeOfDoc = arr[1];
    for (String line2 : lines) {
        String[] arr2 = line2.split(";");
        String document2 = arr2[0];
        String typeOfDoc2 = arr2[1];
        if (document.equals(document2) && !typeOfDoc.equals(typeOfDoc2)) {
            // ...create object to show that error on grid...
        }
    }
}
You can track each document's type in a HashMap and look for conflicting values, which makes this a single pass:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class App {

    public static void main(String[] args) throws IOException {
        String delimiter = ";";
        Map<String, String> map = new HashMap<>();
        // try-with-resources closes the stream when we are done with it
        try (Stream<String> lines = Files.lines(Paths.get("somefile.txt"))) {
            lines.forEach(line -> checkAndPutInMap(line, map, delimiter));
        }
    }

    private static void checkAndPutInMap(String line, Map<String, String> map, String delimiter) {
        String document = line.split(delimiter)[0];
        String typeOfDoc = line.split(delimiter)[1];
        if (map.containsKey(document) && !map.get(document).equals(typeOfDoc)) {
            // ...create object to show that error on grid...
        } else {
            map.put(document, typeOfDoc);
        }
    }
}
What I am planning to do is basically:
Read the first file word by word and store the words in a set (SetA).
Read the second file and check whether SetA contains each word; if it does, store the word in a second set (SetB). SetB then contains the words common to the first and second files.
Similarly, read the third file, check whether SetB contains each word, and store the matches in SetC.
If you have any suggestions or see any problems with my approach, please let me know.
You can determine the intersection of two sets using retainAll:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class App {

    public static void main(String[] args) {
        App app = new App();
        app.run();
    }

    private void run() {
        List<String> file1 = Arrays.asList("aap", "noot", "aap", "wim", "vuur", "noot", "wim");
        List<String> file2 = Arrays.asList("aap", "noot", "mies", "aap", "zus", "jet", "aap", "wim", "vuur");
        List<String> file3 = Arrays.asList("noot", "mies", "wim", "vuur");
        System.out.println(getCommonWords(file1, file2, file3));
    }

    @SafeVarargs
    private final Set<String> getCommonWords(List<String>... files) {
        Set<String> result = new HashSet<>();
        // possible optimization: sort files by ascending size
        Iterator<List<String>> it = Arrays.asList(files).iterator();
        if (it.hasNext()) {
            result.addAll(it.next());
        }
        while (it.hasNext()) {
            Set<String> words = new HashSet<>(it.next());
            result.retainAll(words);
        }
        return result;
    }
}
Also check out this answer, which shows the same solution I gave above, plus ways to do it with Java 8 Streams.
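If you are curious, a streams-based variant might look like the following (a minimal sketch, not taken from the linked answer; reducing over retainAll is just one way to phrase it):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class CommonWordsStreams {
    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
                Arrays.asList("aap", "noot", "aap", "wim", "vuur", "noot", "wim"),
                Arrays.asList("aap", "noot", "mies", "aap", "zus", "jet", "aap", "wim", "vuur"),
                Arrays.asList("noot", "mies", "wim", "vuur"));

        // Turn each file into a set of its unique words,
        // then reduce pairwise: retainAll keeps only the shared words.
        Optional<Set<String>> common = files.stream()
                .map(list -> (Set<String>) new HashSet<>(list))
                .reduce((a, b) -> { a.retainAll(b); return a; });

        System.out.println(common.orElse(new HashSet<>())); // e.g. [noot, vuur, wim] (iteration order not guaranteed)
    }
}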
Welcome to Stack Overflow!
The approach seems sound. May I suggest using a regex to possibly save you coding time. One other concern: make sure not to store every word, but only unique words in your set (a Set takes care of that automatically).
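For illustration, a minimal sketch of the regex idea (the \w+ pattern, the file name file1.txt, and reading the whole file into one string are my assumptions, not something from the question):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueWords {
    public static void main(String[] args) throws Exception {
        // \w+ matches runs of word characters; adjust the pattern if you need accents etc.
        Pattern wordPattern = Pattern.compile("\\w+");
        Set<String> uniqueWords = new HashSet<>();

        String content = new String(Files.readAllBytes(Paths.get("file1.txt")));
        Matcher m = wordPattern.matcher(content);
        while (m.find()) {
            uniqueWords.add(m.group().toLowerCase()); // the Set keeps each word only once
        }
        System.out.println(uniqueWords.size() + " unique words");
    }
}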
I'm trying to split each object in the ArrayList because a lot of lines contain a comma (","). Each object contains an item and a value (but not all objects contain a value):
Access control enable ,disabled
Access policy prototyping ,enabled
Access user group
Implicit roles access policy
World access policy ,disabled
This is my piece of code:
List<String> CEP = new ArrayList<String>();
List<String> CEV = new ArrayList<String>();

for (String str : CE) {
    for (String s : str.split(",")) {
        CEP.add(s.trim());
    }
}
"CE" is the main ArrayList.
"CEP" is the ArrayList that should contain only the "items"(what before comma).
"CEV" is the ArrayList that should contain all "values"
My piece of code only split it by comma to the same ArrayList and another problem is how to see an object doesn't have "value" and add blank to the values ArrayList.
Thanks guys.
Maybe using a CSV parser like super-csv is a good option. It will really pay off when your comma-separated list starts to become more diverse.
Univocity provides a benchmark of CSV parsers. It says that univocity-parsers is fast, which is no surprise. You could give it a try.
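As a rough sketch of what that could look like with super-csv (assuming version 2.x; the file name items.csv and the choice to treat a missing second column as blank are illustrative assumptions):

import java.io.FileReader;
import java.util.List;

import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

public class SplitWithSuperCsv {
    public static void main(String[] args) throws Exception {
        try (CsvListReader reader = new CsvListReader(
                new FileReader("items.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            List<String> row;
            while ((row = reader.read()) != null) { // read() returns null at end of file
                String item = row.get(0).trim();
                // rows without a value come back with a single column
                String value = row.size() > 1 && row.get(1) != null ? row.get(1).trim() : "";
                System.out.println(item + " -> " + value);
            }
        }
    }
}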
Using an array and checking its length allows you to handle the missing values:
for (String str : CE) {
    String[] a = str.split(",");
    CEP.add(a[0].trim());
    if (a.length > 1) {
        CEV.add(a[1].trim());
    } else {
        CEV.add(null); // just check that this is OK
    }
}
Just make sure that the value being added to CEV for missing values (null in the above code) is as required.
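For the sample input above, that loop yields CEP = [Access control enable, Access policy prototyping, Access user group, Implicit roles access policy, World access policy] and CEV = [disabled, enabled, null, null, disabled], so the two lists stay aligned by index.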
You could split and check the length of the array afterwards:
public static void main(String[] args) {
    List<String> ce = new ArrayList<String>();
    ce.add("Access control enable ,disabled");
    ce.add("Access policy prototyping ,enabled");
    ce.add("Access user group ");
    ce.add("Implicit roles access policy ");
    ce.add("World access policy ,disabled");

    Map<String, String> cepCev = new HashMap<String, String>();
    ce.forEach((String line) -> {
        String[] splitLine = line.split(",");
        if (splitLine.length > 1) {
            cepCev.put(splitLine[0].trim(), splitLine[1].trim());
        } else {
            cepCev.put(splitLine[0].trim(), "not set");
        }
    });

    cepCev.forEach((String key, String value) -> {
        System.out.println(key + ": " + value);
    });
}
You can also do it in a simple way with Java 8 streams:

List<String> CEP = new ArrayList<>();
List<String> CEV = new ArrayList<>();

CE.stream().map(line -> line.split(",")).forEach(arr -> {
    CEP.add(arr[0].trim());
    CEV.add(arr.length > 1 ? arr[1].trim() : ""); // blank when there is no value
});

System.out.println(CEP);
System.out.println(CEV);
I am a beginner in Java. Basically, I have loaded each text document and stored each individual word of the document in a HashMap. After that, I tried storing all the HashMaps in an ArrayList. Now I am stuck on how to retrieve all the words from the HashMaps in the ArrayList!
private static long numOfWords = 0;
private String userInputString;

private static long wordCount(String data) {
    long words = 0;
    int index = 0;
    boolean prevWhiteSpace = true;
    while (index < data.length()) {
        // Initialise the character variable that will be checked.
        char c = data.charAt(index++);
        // Determine whether it is a space.
        boolean currWhiteSpace = Character.isWhitespace(c);
        // If the previous character is a space and the current one is not, count a word.
        if (prevWhiteSpace && !currWhiteSpace) {
            words++;
        }
        // Remember whether the current character was a space.
        prevWhiteSpace = currWhiteSpace;
    }
    return words;
}

public static ArrayList StoreLoadedFiles() throws Exception {
    final File f1 = new File("C:/Users/Admin/Desktop/dataFiles/"); // directory to load files from
    String data = ""; // reset the words stored
    ArrayList<HashMap> hmArr = new ArrayList<HashMap>(); // list of hashmaps
    for (final File fileEntry : f1.listFiles()) {
        Scanner input = new Scanner(fileEntry); // load a file
        while (input.hasNext()) { // while there are still words in the document, keep loading them
            data += input.next();
            input.useDelimiter("\t"); // similar to a split function
        }
        String textWords = data.replaceAll("\\s+", " "); // collapse all whitespace
        HashMap<String, Integer> hm = new HashMap<String, Integer>(); // a fresh hashmap for each document
        String[] words = textWords.split(" "); // store individual words in a String array
        for (int j = 0; j < numOfWords; j++) {
            int wordAppearCount = 0;
            if (hm.containsKey(words[j].toLowerCase().replaceAll("\\W", ""))) { // strip non-word characters
                wordAppearCount = hm.get(words[j].toLowerCase().replaceAll("\\W", "")); // current count for the word
            }
            if (!words[j].toLowerCase().replaceAll("\\W", "").equals("")) {
                // Words stored in the hashmap are lower-cased with special characters removed.
                hm.put(words[j].toLowerCase().replaceAll("\\W", ""), ++wordAppearCount); // word and its updated count
            }
        }
        hmArr.add(hm); // store every hashmap in the ArrayList of hashmaps
    }
    return hmArr;
}

public static void LoadAllHashmapWords(ArrayList m) {
    for (int i = 0; i < m.size(); i++) {
        m.get(i); // stuck here!
    }
}
Firstly, your logic won't work correctly. In the StoreLoadedFiles() method you iterate through the words with for (int j = 0; j < numOfWords; j++) {. The numOfWords field is initialized to zero, so this loop won't execute at all. You should initialize it with the length of the words array.
Having said that, to retrieve the values from a list of hashmaps, you should iterate through the list, and for each hashmap take its entry set. A Map.Entry is basically a key-value pair stored in the hashmap. When you invoke the map.entrySet() method it returns a java.util.Set<Map.Entry<Key, Value>>. A set is returned because the keys are unique.
So a complete program will look like this:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.Scanner;

public class FileWordCounter {

    public static List<HashMap<String, Integer>> storeLoadedFiles() {
        final File directory = new File("C:/Users/Admin/Desktop/dataFiles/");
        List<HashMap<String, Integer>> listOfWordCountMap = new ArrayList<HashMap<String, Integer>>();
        Scanner input = null;
        StringBuilder data;
        try {
            for (final File fileEntry : directory.listFiles()) {
                input = new Scanner(fileEntry);
                input.useDelimiter("\t");
                data = new StringBuilder();
                while (input.hasNext()) {
                    data.append(input.next());
                }
                input.close();
                String wordsInFile = data.toString().replaceAll("\\s+", " ");
                HashMap<String, Integer> wordCountMap = new HashMap<String, Integer>();
                for (String word : wordsInFile.split(" ")) {
                    String strippedWord = word.toLowerCase().replaceAll("\\W", "");
                    int wordAppearCount = 0;
                    if (strippedWord.length() > 0) {
                        if (wordCountMap.containsKey(strippedWord)) {
                            wordAppearCount = wordCountMap.get(strippedWord);
                        }
                        wordCountMap.put(strippedWord, ++wordAppearCount);
                    }
                }
                listOfWordCountMap.add(wordCountMap);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } finally {
            if (input != null) {
                input.close();
            }
        }
        return listOfWordCountMap;
    }

    public static void loadAllHashmapWords(List<HashMap<String, Integer>> listOfWordCountMap) {
        for (HashMap<String, Integer> wordCountMap : listOfWordCountMap) {
            for (Entry<String, Integer> wordCountEntry : wordCountMap.entrySet()) {
                System.out.println(wordCountEntry.getKey() + " - " + wordCountEntry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        List<HashMap<String, Integer>> listOfWordCountMap = storeLoadedFiles();
        loadAllHashmapWords(listOfWordCountMap);
    }
}
Since you are a beginner in Java programming, I would like to point out a few best practices that you could start using from the beginning.
Closing resources: in your while loop reading from files you open a Scanner with Scanner input = new Scanner(fileEntry);, but you never close it. This causes resource leaks. Always use a try-catch-finally block and close resources in the finally block, or better, use try-with-resources (see the sketch after this list).
Avoid unnecessary redundant calls: if an operation yields the same result on every iteration of a loop, move it outside the loop. In your case, setting the scanner delimiter with input.useDelimiter("\t"); is essentially a one-time operation after the scanner is initialized, so you can move it outside the while loop.
Use StringBuilder instead of String: repeated string manipulation such as concatenation should be done with a StringBuilder (or StringBuffer when you need synchronization) instead of += or +. This is because String is immutable, meaning its value cannot be changed, so each concatenation creates a new String object, leaving a lot of unused instances in memory. A StringBuilder is mutable and its value can be changed in place.
Naming conventions: the usual Java convention for methods is camelCase starting with a lower-case letter, so the standard name would be storeLoadedFiles rather than StoreLoadedFiles. (This could be opinion based ;))
Give descriptive names: descriptive names help later code maintenance. It is better to use a name like wordCountMap rather than hm; anyone going through your code in the future will understand it better and faster. Again opinion based.
Use generics as much as possible: this avoids additional casting overhead.
Avoid repetition: similar to point 2, if an operation produces the same output and is used multiple times, move the result to a variable and reuse it. You were calling words[j].toLowerCase().replaceAll("\\W", "") multiple times; the result is always the same, but each call creates unnecessary instances. Move it to a String and use that String elsewhere.
Use for-each loops wherever possible: this relieves us from taking care of indexing.
These are just suggestions. I tried to include most of them in my code, but I won't say it's perfect. Since you are a beginner, if you start applying these best practices now, they will get ingrained in you. Happy coding.. :)
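As a small illustration of the first point, the reading loop could use try-with-resources, which closes the scanner automatically even when an exception is thrown (a minimal sketch of that single change, reusing the fileEntry variable from the program above):

StringBuilder data = new StringBuilder();
try (Scanner input = new Scanner(fileEntry)) { // closed automatically when the block exits
    input.useDelimiter("\t"); // set the delimiter once, before the read loop
    while (input.hasNext()) {
        data.append(input.next());
    }
}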
for (HashMap<String, Integer> map : m) {
    for (Entry<String, Integer> e : map.entrySet()) {
        // your code here
    }
}
Or, if you are using Java 8, you can play with lambdas:
m.stream().forEach((map) -> {
    map.entrySet().stream().forEach((e) -> {
        // your code here
    });
});
But before all that, you have to change the method signature to public static void LoadAllHashmapWords(List<HashMap<String, Integer>> m); otherwise you would have to use a cast.
P.S. Are you sure your extraction method works? I've tested it a bit and got a list of empty hashmaps every time.
I have some strings like "paddington road" and I need to extract the word "road" from each one. How can I do that?
The problem is that I need to process a list of streets and extract certain words like "road", "park", "street", "boulevard" and many others.
What is the best way to do that? The complexity is O(n*m), and considering that I process more than 5000 streets, performance matters.
I'm extracting the values from a Postgres db and putting them into a List, but I'm not sure that's the best way; maybe a hash table is faster to query?
I tried something like this:
// Parse selectedList
Iterator<String> it = streets.iterator();
Iterator<String> it_exception = exception.iterator();
int counter = streets.size();
while (it.hasNext()) {
    while (it_exception.hasNext()) {
        // remove substring it_exception.next() from it.next()
    }
}
What do you think?
You can try a Set:
Set<String> exceptions = new HashSet<String>(...);
for (String street : streets) {
    String[] words = street.split(" ");
    StringBuilder res = new StringBuilder();
    for (String word : words) {
        if (!exceptions.contains(word)) {
            res.append(word).append(" ");
        }
    }
    System.out.println(res);
}
I think complexity will be O(n), where n is the total number of words across all streets.
You need to get a new iterator over your list of keywords at each iteration of the outer loop; in your code, it_exception is exhausted after the first street. The easiest way is to use the for-each syntax:
for (String streetName : streets) {
    for (String keyword : keywords) {
        if (streetName.contains(keyword)) {
            // handle the match, and perhaps break to avoid searching for the other keywords
            break;
        }
    }
}
Don't pre-optimize. 5000 is nothing for a computer, and street names are short strings. And if you place the most frequent keywords (street rather than boulevard) at the beginning of the keyword list, you'll have fewer iterations.
List<String> streets = new ArrayList<String>();
streets.add("paddington road");
streets.add("paddington park");

for (String street : streets) {
    String[] abc = street.split(" ");
    if (abc.length > 1) { // guard against one-word street names
        String secondwrd = abc[1];
        System.out.println("secondwrd: " + secondwrd);
    }
}
You can keep secondwrd in a list or a StringBuilder, etc.