How can I extract specific terms from string lines in Java?

I have a serious problem extracting terms from each line of a file. To be more specific, I have a file with a .csv extension that is not really in CSV format (each whole line ends up in line[0]).
Here are a few example lines out of thousands (a plain split() doesn't work):
test.csv
"31451 CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1C[S#](=O)S[C##H]1CCCCC(=O)O "
"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "
"9048   CTD042032 23241  C3HO4O3S2 Berberine  [C##H]1CCCCC(=O)O "
I want to extract only "beta-lipoic acid", "saponin" and "Berberine", which are in the 5th position.
There are large gaps between the terms, which is why I call it the 5th position.
How can I extract the term in the 5th position of each line?
One more thing: the amount of whitespace between the six terms is not constant. It can be one, two, three, four or five characters, or more.
Because the amount of whitespace varies and a term can itself contain a space, I cannot simply use split().
For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".

Here is a solution to your problem using String.split() and indexOf():
import java.util.ArrayList;
public class StringSplit {
public static void main(String[] args) {
String[] seperatedStr = null;
int fourthStrIndex = 0;
String modifiedStr = null, finalStr = null;
ArrayList<String> strList = new ArrayList<String>();
strList.add("31451 CID005319044   15939353   C8H14O3S2 beta-lipoic acid C1C[S#](=O)S[C##H]1CCCCC(=O)O ");
strList.add("12232 COD05374044 23439353 C924O3S2 saponin CCCC(=O)O ");
strList.add("9048 CTD042032 23241 C3HO4O3S2 Berberine [C##H]1CCCCC(=O)O ");
for (String item: strList) {
seperatedStr = item.split("\\s+");
fourthStrIndex = item.indexOf(seperatedStr[3]) + seperatedStr[3].length();
modifiedStr = item.substring(fourthStrIndex, item.length());
finalStr = modifiedStr.substring(0, modifiedStr.indexOf(seperatedStr[seperatedStr.length - 1]));
System.out.println(finalStr.trim());
}
}
}
Output:
beta-lipoic acid
saponin
Berberine

Option 1: Use String.split() and split on two or more consecutive whitespace characters, as in the code below:
String s[] = str.split("\\s\\s+");
for (String string : s) {
System.out.println(string);
}
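Option 2: Implement your own split logic by walking through the characters. Sample code below (it is only meant to give the idea; I did not test it).
public static List<String> getData(String str) {
    List<String> list = new ArrayList<>();
    StringBuilder s = new StringBuilder();
    int count = 0; // consecutive spaces seen since the last non-space character
    for (char c : str.toCharArray()) {
        if (c == ' ') {
            count++;
            continue;
        }
        if (count > 1 && s.length() > 0) {
            // Two or more spaces end the current term.
            list.add(s.toString());
            s.setLength(0);
        } else if (count == 1 && s.length() > 0) {
            // A single space stays inside the term (e.g. "beta-lipoic acid").
            s.append(' ');
        }
        count = 0;
        s.append(c);
    }
    if (s.length() > 0) {
        list.add(s.toString()); // last term on the line
    }
    return list;
}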

This would be a relatively easy fix if it weren't for beta-lipoic acid...
Assuming that only spaces, tabs or other whitespace separate the terms, you could split on whitespace.
Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line);
// Your desired term should be at index 4 of the terms array
While this works for the majority of your terms, you would lose the "acid" in "beta-lipoic acid".
Another hacky solution is to check the 6th element of the array produced by the code above and see whether it consists only of English letters. If so, you can be reasonably confident that it belongs to the same term as the 5th element, and you can concatenate the two. This falls apart quickly if you have terms of three or more words. So something like
Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]+"); // Can add dashes and such as needed
if (possibleEnglishWord.matcher(terms[5]).matches()) {
    // return terms[4] + " " + terms[5], or something like that
}
Another thing you can try is to replace every run of whitespace with a single space, and then remove every token that is not made up solely of English letters and dashes:
line = whitespace.matcher(line).replaceAll(" ");
Pattern notEnglishWord = Pattern.compile("\\S*[^a-zA-Z\\s-]\\S*"); // any token containing something other than a letter or dash
line = notEnglishWord.matcher(line).replaceAll("").trim();
Then hopefully the only thing left is the term you're looking for.
Hopefully this helps, though I admit it's rather convoluted. One of the issues is that non-term fields may have only one space between them, which would fool Option 1 as presented by Hirak. If that weren't the case, that option should work.
By the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.
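The observation running through the answers above (the first four fields and the last field never contain whitespace, only the 5th term can) can also be expressed as a single regular expression. The following is only a sketch under that assumption, and it also assumes the lines are not wrapped in literal quote characters; the class name FifthTerm and the method fifthTerm are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FifthTerm {
    // Four whitespace-free fields, then the term (which may contain spaces),
    // then one last whitespace-free field. Compiled once, outside any loop.
    private static final Pattern LINE =
            Pattern.compile("^\\s*(?:\\S+\\s+){4}(.+?)\\s+\\S+\\s*$");

    // Returns the 5th term, or null if the line does not have the assumed shape.
    static String fifthTerm(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "31451 CID005319044   15939353   C8H14O3S2    beta-lipoic acid   C1C[S#](=O)S[C##H]1CCCCC(=O)O ",
            "12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O ",
            "9048   CTD042032 23241  C3HO4O3S2 Berberine  [C##H]1CCCCC(=O)O "
        };
        for (String line : lines) {
            System.out.println(fifthTerm(line)); // beta-lipoic acid, saponin, Berberine
        }
    }
}
```

Because the pattern anchors on both ends of the line, the number of spaces between fields does not matter, and multi-word terms of any length are kept whole. It would break if any other field could contain a space.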

Related

java string matching

In my project I take two values (read from two different Excel files) and check how similar they are. I tried the Pattern and Matcher classes, which work fine when both words are essentially the same (as in organisation and organisation/s). But my data also contains pairs such as employee and employment, where I need just "employ" as the common string between the two, and in that case Pattern and Matcher fail. I have been stuck on this for a week. I have about 700 rows in the first Excel file and about 9000 in the other. I read each cell value into the program with Java and store the two values in separate variables. Next, I tried four nested for loops to match word by word and character by character, to find only the characters that match between the two. I have pasted the for-loop implementation below. The four nested loops are driving me nuts. Any help in completing this would be greatly appreciated.
String str1 = "Cover for employees of the company";
String str2 = "Employment Agencies ";
String str,strfinal;
String[] count1 = str1.split("\\s+");
String[] count2 = str2.split("\\s+");
char[] count11 = str1.toCharArray();
char[] count22 = str2.toCharArray();
for(int i=0;i<count1.length;i++)
{
for(int j=0;j<count2.length;j++)
{
for(int m=0;m<count1[i].length();m++)
{
for(int n=0;n<count2[j].length();n++)
{
if(count11[m]==count22[n])
{
// please look at the logic that I am looking for to implement
}
}
}
}
}
Expected output: employ
One more idea that I am trying to implement (to make my program more efficient) is the following:
Compare cover with employment. The first character does not match, so move on to the next word in the second string. Once all words in the second string have been checked, go to the next word in the first string and compare it with all the words in the second string.
So this is what I am looking for right now. Any help will be greatly appreciated.
Thanks!
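One possible reading of the comparison described above is "find the longest prefix shared by any word of the first string and any word of the second, ignoring case". The sketch below implements only that reading; the class and method names are invented for illustration, and a match in the middle of a word would need a longest-common-substring approach instead.

```java
class CommonPrefix {
    // Longest prefix shared by two words, compared case-insensitively.
    static String commonPrefix(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit
                && Character.toLowerCase(a.charAt(i)) == Character.toLowerCase(b.charAt(i))) {
            i++; // stops at the first mismatch, so "cover" vs "employment" ends immediately
        }
        return a.substring(0, i).toLowerCase();
    }

    // Compares every word of s1 with every word of s2 and keeps the longest shared prefix.
    static String longestCommonPrefix(String s1, String s2) {
        String best = "";
        for (String w1 : s1.trim().split("\\s+")) {
            for (String w2 : s2.trim().split("\\s+")) {
                String prefix = commonPrefix(w1, w2);
                if (prefix.length() > best.length()) {
                    best = prefix;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String str1 = "Cover for employees of the company";
        String str2 = "Employment Agencies ";
        System.out.println(longestCommonPrefix(str1, str2)); // prints "employ"
    }
}
```

Two loops over the words plus one character loop are enough here, because both words are walked with the same index instead of one index per word.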

How do you pull data from a .FIC file in java?

I am writing a Scrabble word suggestion program, which I decided to do because I wanted to learn sets (don't worry, I at least got that part) and how to reference data not created within the program. I'm pretty new to Java (and programming in general), and I was wondering how to pull words from a word list .FIC file in order to check them against words generated from the letters entered.
To clarify, I have written a program that takes a series of letters and returns a set of every possible word created from those letters. For example:
input:
abc
would give a set containing the "words":
a, ab, ac, abc, acb, b, ba, bc, bac, bca, c, ca, cb, cab, cba
What I am really asking is how to check those against the words contained in the .FIC file.
The file is the "official crosswords" file from the Moby project word list, and I am still (very) shaky on parsing and other file-handling methods. I am continuing to research, so I don't have any prototype code for that yet.
Sorry if the question isn't entirely clear.
Edit: here is the method that makes the "words", to make the idea easier to understand. The part I don't understand is specifically how to pull a word (as a String) from the .FIC file.
private static Set<String> Words(String s)
{
Set<String> tempwords = new TreeSet<String>();
if (s.length() == 1)
{ // base case, last letter
tempwords.add(s);
// System.out.println(s); uncomment when debugging
}
else
{
//set up to add each letter in s
for (int i = 0; i < s.length(); i++)
{ //cut the i letter out of the string
String remaining = s.substring(0, i) + s.substring(i+1);
//recursion to add all combinations of letters onto the current letter/"word"
for (String permutation : Words(remaining))
{
// System.out.println(s.substring(i, i+1) + permutation); uncomment when debugging
//add the full length words
tempwords.add(s.substring(i, i+1) + permutation);
// System.out.println(permutation); uncomment when debugging
//add the not-full-length words
tempwords.add(permutation);
}
}
}
// System.out.println(tempwords); uncomment when debugging
return tempwords;
}
I don't know if it is the best solution, but I figured it out (hobbs, the line thing helped a lot, thank you). I found that this works:
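public static void main(String[] args) throws FileNotFoundException
{
    // Read the word list once up front. Scanner.reset() only resets the delimiter,
    // locale and radix; it does not rewind the file for a second pass.
    List<String> dictionary = new ArrayList<String>();
    Scanner s = new Scanner(new FileReader("C:/Users/Sean/workspace/Imbored/bin/113809of.fic"));
    while (s.hasNextLine()) {
        dictionary.add(s.nextLine());
    }
    s.close();
    while (true)
    {
        words.clear();
        finalwordset.clear(); // start each round with an empty result list
        String letters = enterLetters();
        words.addAll(Words(letters));
        for (String line : dictionary) {
            String finalword = checkWords(line, words);
            if (finalword != null) finalwordset.add(finalword);
        }
        System.out.println(finalwordset);
        System.out.println();
        System.out.println("_________________________________________________________________________");
    }
}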
A few things:
The checkWords method checks whether the current word from the file is in the generated list of "words".
The enterLetters method takes the letters entered by the user and returns them as a String.
The Words method returns a set of strings with all possible combinations of the characters in the given string, with each character used up to as many times as it appears in the string and no repeated "words" in the returned set.
finalwordset and words are ArrayLists of Strings defined as (static) fields of the class (I would put them in the main method, but I'm lazy and it doesn't matter for this case).
I am very sure there is a better, more efficient way to do this, but this at least works.
Finally: I decided to answer rather than delete because I didn't see this answered anywhere else. If it is, feel free to delete the question or link to the other answer; at this point it is here to help other people.
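A more direct variant of the same idea is to load the word list into a HashSet once and then intersect it with the generated candidates. The sketch below assumes the .FIC file is plain text with one word per line (which is how the code above reads it); the class name WordListFilter and its methods are hypothetical, and a StringReader stands in for the real file so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class WordListFilter {
    // Reads one word per line into a set (assumes a plain-text word list).
    static Set<String> loadWords(Reader source) throws IOException {
        Set<String> dictionary = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase();
                if (!word.isEmpty()) {
                    dictionary.add(word);
                }
            }
        }
        return dictionary;
    }

    // Keeps only the candidates that also appear in the dictionary.
    static Set<String> realWords(Set<String> candidates, Set<String> dictionary) {
        Set<String> result = new TreeSet<>(candidates);
        result.retainAll(dictionary);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real file, e.g. new FileReader("113809of.fic").
        Set<String> dictionary = loadWords(new StringReader("a\nab\ncab\nzebra\n"));
        Set<String> candidates = new TreeSet<>(Arrays.asList("a", "ab", "ac", "abc", "cab", "cba"));
        System.out.println(realWords(candidates, dictionary)); // prints [a, ab, cab]
    }
}
```

With the dictionary in a HashSet, each lookup is a constant-time hash check, so the file only has to be read once no matter how many letter sets are entered.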

Split paragraph into sentences with titles and numbers

I'm using the BreakIterator class in Java to break a paragraph into sentences. This is my code:
public Map<String, Double> breakSentence(String document) {
sentences = new HashMap<String, Double>();
BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
bi.setText(document);
Double tfIdf = 0.0;
int start = bi.first();
for(int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
String sentence = document.substring(start, end);
sentences.put(sentence, tfIdf);
}
return sentences;
}
The problem arises when the paragraph contains titles or numbers, for example:
"Prof. Roberts trying to solve a problem by writing a 1.200 lines of code."
What my code produces is:
sentences:
Prof
Roberts trying to solve a problem by writing a 1
200 lines of code
instead of one single sentence, because of the periods in titles and numbers.
Is there a way to fix this so that titles and numbers are handled correctly in Java?
Well, this is a bit of a tricky situation, and I've come up with a sticky solution, but it works nevertheless. I'm new to Java myself, so if a seasoned veteran wants to edit this or comment on it to make it more professional, by all means, please make me look better.
I basically added some control measures to what you already have, to check whether abbreviations like Dr., Prof., Mr., Mrs. etc. are present. If one is, the code skips that break and moves on to the next one (keeping the original start position), looking for the NEXT end (preferably one that doesn't come right after another Dr. or Mr. etc.).
I'm including my complete program so you can see it all:
import java.text.BreakIterator;
import java.util.*;
public class TestCode {
private static final String[] ABBREVIATIONS = {
"Dr." , "Prof." , "Mr." , "Mrs." , "Ms." , "Jr." , "Ph.D."
};
public static void main(String[] args) throws Exception {
String text = "Prof. Roberts and Dr. Andrews trying to solve a " +
"problem by writing a 1.200 lines of code. This will " +
"work if Mr. Java writes solid code.";
for (String s : breakSentence(text)) {
System.out.println(s);
}
}
public static List<String> breakSentence(String document) {
List<String> sentenceList = new ArrayList<String>();
BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
bi.setText(document);
int start = bi.first();
int end = bi.next();
int tempStart = start;
while (end != BreakIterator.DONE) {
String sentence = document.substring(start, end);
if (! hasAbbreviation(sentence)) {
sentence = document.substring(tempStart, end);
tempStart = end;
sentenceList.add(sentence);
}
start = end;
end = bi.next();
}
return sentenceList;
}
private static boolean hasAbbreviation(String sentence) {
if (sentence == null || sentence.isEmpty()) {
return false;
}
for (String w : ABBREVIATIONS) {
if (sentence.contains(w)) {
return true;
}
}
return false;
}
}
What this does is basically set up two starting points. The original starting point (the one you used) still does the same thing, but tempStart doesn't move unless the string looks ready to be made into a sentence. It takes the first chunk:
"Prof."
and checks whether the break was caused by a weird word (i.e. does the chunk contain Prof., Dr. or whatever else might have caused that break). If it does, tempStart doesn't move; it stays there and waits for the next chunk to come back. In my slightly more elaborate sentence, the next chunk also has a weird word messing up the breaks:
"Roberts and Dr."
It takes that chunk and, because it has a Dr. in it, continues on to the third chunk:
"Andrews trying to solve a problem by writing a 1.200 lines of code."
Once it reaches the third chunk, which contains no weird titles that might have caused a false break, it takes everything from tempStart (which is still at the beginning) to the current end, basically joining all three parts together.
Now it sets tempStart to the current end and continues.
Like I said, this may not be a glamorous way to get what you want, but nobody else volunteered and it works (shrug).
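A slightly stricter variant of the same merging idea is sketched below. It differs from the program above in two ways that are my own assumptions, not part of the original answer: a chunk is held back only when its last word is an abbreviation (so a real sentence that merely mentions "Dr." somewhere is not glued to the next one), and any text still held back at the end is flushed instead of dropped. The class name SentenceSplitter is made up.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    private static final List<String> ABBREVIATIONS =
            Arrays.asList("Dr.", "Prof.", "Mr.", "Mrs.", "Ms.", "Jr.");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int sentenceStart = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
            String soFar = text.substring(sentenceStart, end).trim();
            if (endsWithAbbreviation(soFar)) {
                continue; // probably a false break, keep collecting
            }
            if (!soFar.isEmpty()) {
                sentences.add(soFar);
            }
            sentenceStart = end;
        }
        // Flush text that was held back because it ended with an abbreviation.
        String rest = text.substring(sentenceStart).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    // True if the last space-separated word of s is a known abbreviation.
    private static boolean endsWithAbbreviation(String s) {
        String lastWord = s.substring(s.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord);
    }

    public static void main(String[] args) {
        String text = "Prof. Roberts and Dr. Andrews wrote code. Mr. Java agrees.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Like the original, this only knows the abbreviations in its list, so it is a heuristic rather than a general solution.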
It appears that Prof. Roberts only gets split if Roberts begins with a capital letter.
If Roberts begins with a lowercase r, it does not get split.
So... I guess that's how BreakIterator deals with periods.
I'm sure further reading of the documentation will explain how this behavior can be modified.

Determining if a given string of words has words greater than 5 letters long

So, I'm in need of help on my homework assignment. Here's the question:
Write a static method, getBigWords, that gets a String parameter and returns an array whose elements are the words in the parameter that contain more than 5 letters. (A word is defined as a contiguous sequence of letters.) So, given a String like "There are 87,000,000 people in Canada", getBigWords would return an array of two elements, "people" and "Canada".
What I have so far:
public static getBigWords(String sentence)
{
String[] a = new String;
String[] split = sentence.split("\\s");
for(int i = 0; i < split.length; i++)
{
if(split[i].length => 5)
{
a.add(split[i]);
}
}
return a;
}
I don't want an answer, just a means to guide me in the right direction. I'm a novice at programming, so it's difficult for me to figure out what exactly I'm doing wrong.
EDIT:
I've now modified my method to:
public static String[] getBigWords(String sentence)
{
ArrayList<String> result = new ArrayList<String>();
String[] split = sentence.split("\\s+");
for(int i = 0; i < split.length; i++)
{
if(split[i].length() > 5)
{
if(split[i].matches("[a-zA-Z]+"))
{
result.add(split[i]);
}
}
}
return result.toArray(new String[0]);
}
It prints out the results I want, but the online software I use to turn in the assignment, still says I'm doing something wrong. More specifically, it states:
Edith de Stance states:
⇒     You might want to use: +=
⇒     You might want to use: ==
⇒     You might want to use: +
I'm not really sure what that means.
The main problem is that you can't have an array that makes itself bigger as you add elements.
You have 2 options:
ArrayList (basically a variable-length array).
Make an array guaranteed to be bigger.
Also, some notes:
The definition of an array needs to look like:
int size = ...; // V- note the square brackets here
String[] a = new String[size];
Arrays don't have an add method, you need to keep track of the index yourself.
You're currently only splitting on spaces, so 87,000,000 will also match. You could validate the string manually to ensure it consists of only letters.
It's >=, not =>.
I believe the function needs to return an array:
public static String[] getBigWords(String sentence)
It actually needs to return something:
return result.toArray(new String[0]);
rather than
return null;
The "You might want to use" suggestions hint that you might have to process each word character by character.
First, try to print out all the elements in your split array. Remember, you only want to look at words. So check whether this is the case by printing each element of the split array inside your for loop. (I suspect you will get a false positive at the moment.)
Also, you need to revisit your books on arrays in Java. You cannot dynamically add elements to an array, so you will need a different data structure to be able to use an add() method. An ArrayList of Strings would help you here.
Split your string on whitespace; this returns an array. You can then check the length of each word by iterating over that array.
You can split the string this way: myString.split("\\s+");
Try this...
public static String[] getBigWords(String sentence)
{
java.util.ArrayList<String> result = new java.util.ArrayList<String>();
String[] split = sentence.split("\\s+");
for(int i = 0; i < split.length; i++)
{
if(split[i].length() > 5)
{
if(split[i].matches("[a-zA-Z]+"))
{
result.add(split[i]);
}
if (split[i].matches("[a-zA-Z]+,"))
{
String temp = "";
for(int j = 0; j < split[i].length(); j++)
{
if((split[i].charAt(j))!=((char)','))
{
temp += split[i].charAt(j);
//System.out.print(split[i].charAt(j) + "|");
}
}
result.add(temp);
}
}
}
return result.toArray(new String[0]);
}
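Since the assignment defines a word as a contiguous sequence of letters, another option is to search for letter runs directly instead of splitting on whitespace and then cleaning up punctuation. This is only a sketch of that alternative; the class name BigWords and the constant LETTER_RUN are made up, and it has not been checked against the grading software mentioned in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BigWords {
    // One or more consecutive ASCII letters, i.e. a "word" as the assignment defines it.
    private static final Pattern LETTER_RUN = Pattern.compile("[a-zA-Z]+");

    public static String[] getBigWords(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher matcher = LETTER_RUN.matcher(sentence);
        while (matcher.find()) {
            String word = matcher.group();
            if (word.length() > 5) { // "more than 5 letters"
                result.add(word);
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String word : getBigWords("There are 87,000,000 people in Canada")) {
            System.out.println(word); // prints "people", then "Canada"
        }
    }
}
```

Because digits and punctuation never match the pattern, "87,000,000" is skipped and a trailing comma or period cannot inflate the length of a word.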
What you have done is correct, but you can't use an add method on an array. You should assign with a[position] = split[i]; and if you want to ignore numbers, check for them first, for example by trying Float.parseFloat() and catching NumberFormatException.
Your logic is valid, but you have some syntax issues. If you are not using an IDE like Eclipse that shows you syntax errors, try commenting out lines to pinpoint which ones are syntactically incorrect. I also want to tell you that once an array is created, its length cannot change. Hopefully that sets you off in the right direction.
Apart from the syntax errors: the String array declaration should look like new String[n], and arrays have no add method, so you should write
a[i] = split[i];
You need to add another condition alongside the length condition to check that the given word consists only of letters. This can be done in two ways: the first is to use the Character.isLetter() method, and the second is to create a regular expression that checks that the string contains only letters. Search for regular expressions, and use a Matcher to match, as below:
Pattern pattern = Pattern.compile("[a-zA-Z]+");
Matcher matcher = pattern.matcher(split[i]);
The final point is to use another counter (say j = 0) for storing output values, and to increment this counter whenever you store a string in the array.
a[j++] = split[i];
I would use a string tokenizer (the StringTokenizer class in Java).
Iterate through each token and, if its length is more than 5 (or whatever you need), add it to the array you are returning.
You said no code, so... (This is like 5 lines of code)

Need help parsing strings in Java

I am reading in a csv file in Java and, depending on the format of the string on a given line, I have to do something different with it. The three different formats contained in the csv file are (using random numbers):
833
"79, 869"
"56-57, 568"
If it is just a single number (833), I want to add it to my ArrayList. If it is two numbers separated by a comma and surrounded by quotation marks ("79, 869"), I want to parse out the first of the two numbers (79) and add it to the ArrayList. If it is three numbers surrounded by quotation marks, where the first two are separated by a dash and the third follows a comma ("56-57, 568"), I want to parse out the third number (568) and add it to the ArrayList.
I am having trouble using str.contains() to determine if the string on a given line contains a dash or not. Can anyone offer me some help? Here is what I have so far:
private static void getFile(String filePath) throws java.io.IOException {
BufferedReader reader = new BufferedReader(new FileReader(filePath));
String str;
while ((str = reader.readLine()) != null) {
if(str.endsWith("\"")){
if (str.contains(charDash)){
System.out.println(str);
}
}
}
}
Thanks!
I recommend using the version of indexOf that actually takes a char rather than a string, since this method is much faster. (It is a simple loop, without a nested loop.)
I.e.
if (str.indexOf('-')!=-1) {
System.out.println(str);
}
(Note the single quotes, so this is a char, rather than a string.)
But then you have to split the line and parse the individual values. At present, you are testing if the whole line ends with a quote, which is probably not what you want.
The following code works for me (note: I wrote it with no optimization in mind - it's just for testing purposes):
public static void main(String args[]) {
ArrayList<String> numbers = GetNumbers();
}
private static ArrayList<String> GetNumbers() {
String str1 = "833";
String str2 = "79, 869";
String str3 = "56-57, 568";
ArrayList<String> lines = new ArrayList<String>();
lines.add(str1);
lines.add(str2);
lines.add(str3);
ArrayList<String> numbers = new ArrayList<String>();
for (Iterator<String> s = lines.iterator(); s.hasNext();) {
String thisString = s.next();
if (thisString.contains("-")) {
numbers.add(thisString.substring(thisString.indexOf(",") + 2));
} else if (thisString.contains(",")) {
numbers.add(thisString.substring(0, thisString.indexOf(",")));
} else {
numbers.add(thisString);
}
}
return numbers;
}
Output:
833
79
568
Although it gets a lot of hate these days, I still really like StringTokenizer for this kind of thing. You can set it up to return the delimiters as tokens and, at least to me, that makes the processing trivial without touching regexes.
You'd create it using ",- as your delimiters, then just kick it off in a loop.
StringTokenizer st = new StringTokenizer(line, "\",-", true);
Then you set up a loop:
while (st.hasMoreTokens()) {
    String token = st.nextToken();
Each case becomes its own little part of the loop:
    // Use punctuation to set flags that tell you how to interpret the numbers.
    // Strings must be compared with equals(), not ==.
    if (token.equals("\"")) {
        isQuoted = !isQuoted;
    } else if (token.equals(",")) {
        ...
    } else if (...) {
        ...
    } else { // The punctuation has been dealt with, must be a number group
        // Apply flags to determine how to parse this number.
    }
I realize that StringTokenizer is considered outdated now, but I'm not really sure why. Parsing regular expressions can't be faster, although split is a pretty sweet syntax, I have to admit.
I guess if you and everyone you work with are really comfortable with regular expressions, you could replace that with split and just iterate over the resulting array. I'm not sure how to get split to return the punctuation, though (probably that "+" thing from other answers), and I never trust that some character I'm passing to a regular expression won't do something utterly unexpected.
Will
if (str.indexOf(charDash.toString()) > -1){
System.out.println(str);
}
do the trick?
By the way, this should be marginally faster than contains(), because contains() is itself implemented on top of indexOf().
Will this work?
if(str.contains("-")) {
System.out.println(str);
}
I wonder if the charDash variable is not what you are expecting it to be.
I think three regexes would be your best bet - because with a match, you also get the bit you're interested in. I suck at regex, but something along the lines of:
.*\-.*, (.+)
.*, (.+)
and
(.+)
ought to do the trick (in order, because the final pattern matches anything including the first two).
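The three-patterns-in-order idea can be sketched roughly as follows. The patterns here are tightened to digits and tolerate the surrounding quotation marks, and the second one captures the first number of the pair because that is what the question asks for; those choices, like the names CsvNumberPicker and pick, are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvNumberPicker {
    // Most specific pattern first, because the later ones are more permissive.
    private static final Pattern RANGE_THEN_NUMBER =
            Pattern.compile("\"?\\d+-\\d+,\\s*(\\d+)\"?"); // "56-57, 568" -> 568
    private static final Pattern PAIR =
            Pattern.compile("\"?(\\d+),\\s*\\d+\"?");       // "79, 869"    -> 79
    private static final Pattern SINGLE =
            Pattern.compile("(\\d+)");                       // 833          -> 833

    // Returns the number of interest, or null if the line has none of the three formats.
    static String pick(String line) {
        String trimmed = line.trim();
        for (Pattern pattern : new Pattern[] {RANGE_THEN_NUMBER, PAIR, SINGLE}) {
            Matcher m = pattern.matcher(trimmed);
            if (m.matches()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(pick("833"));              // 833
        System.out.println(pick("\"79, 869\""));      // 79
        System.out.println(pick("\"56-57, 568\""));   // 568
    }
}
```

Each pattern both recognises the format and captures the wanted number, so no separate contains() or indexOf() check is needed.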
