Is there a fast way to search for a string in another string?
I have this kind of a file:
<br>
Comment EC00:
<br>
The EC00 is different from EC12 next week. The EC00 much wetter in the very end, which is not seen before.
<br>
<br>
<br>
Comment EC12:
<br>
The Ec12 of today is reliable. It starts cold, but temp are rising. From Sunday normal temp and wet, except for a strengthening high from SE in the very end.
<br>
I have deleted all the <br>'s and I will be searching for a string like "Comment EC12:" to retrieve what comes after:
The Ec12 of today is reliable. It starts cold, but temp are rising. From Sunday normal temp and wet, except for a strengthening high from SE in the very end.
Or maybe it would be a better idea to leave all the <br>'s in, so that I at least know where to stop reading the lines.
P.S. These comments might have multiple occurrences in the document.
EDIT:
I think that this solution would be OK for finding the occurrences, or at least a good place to start.
This is the latest version. It works very well for me, because I know which parts of the HTML are static and which are not. For those who would like to do something similar: you can rewrite the first two checks in the same way as the last one (using 'while' instead of 'if', going down the lines of the text file).
StringTokenizer parser = new StringTokenizer(weatherComments);
String commentLine = "";
String commentWord = "";
while (parser.hasMoreTokens()) {
    if (parser.nextToken().equals("Comment")) {
        String commentType = parser.nextToken();
        if (commentType.equals(forecastZone + ":")) {
            parser.nextToken(); // skip the first <br> after the header
            commentWord = parser.nextToken();
            while (!commentWord.equals("<br>")) {
                commentLine += commentWord + " ";
                commentWord = parser.nextToken();
            }
            commentLine += "\n";
            System.out.println(commentLine);
        }
    }
}
P.P.S.
Before downloading a lot of libraries to make your code look smaller or to understand things easier, think first how to solve it yourself
You can try to simply use indexOf():
String html = ...;
String search = "Comment EC12:";
int comment = html.indexOf(search);
if (comment != -1) {
int start = comment + search.length();
int end = start + ...;
String after = html.substring(start, end);
...
}
The problem is to find the end of the text. So it may be useful not to replace the <br> and split the HTML on the tags:
String html = ...;
String[] parts = html.split("\\p{Space}*<br>\\p{Space}*");
// Parts alternate between a header and its text; stop before a dangling header.
for (int i = 0; i + 1 < parts.length; i += 2) {
    String search = parts[i];
    String after = parts[i + 1];
    System.out.println(search + "\n\t" + after);
}
The example will print the following:
Comment EC00:
The EC00 is different from EC12 next week. The EC00 much wetter in the very end, which is not seen before.
Comment EC12:
The Ec12 of today is reliable. It starts cold, but temp are rising. From Sunday normal temp and wet, except for a strengthening high from SE in the very end.
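The sample file in the question has several <br> tags in a row and may contain the same comment type more than once, which breaks the strict header/text pairing above. A minimal sketch of one way to cope with that follows. It assumes every header has the form "Comment XXX:" and is followed by exactly one text block; the class name CommentExtractor and the method extract are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CommentExtractor {
    // Hypothetical helper: maps each comment type (e.g. "EC12") to the texts found under it.
    static Map<String, List<String>> extract(String html) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        // Treat a whole run of <br> tags (with surrounding whitespace) as one separator.
        String[] parts = html.trim().split("(?:\\s*<br>\\s*)+");
        for (int i = 0; i + 1 < parts.length; i++) {
            String header = parts[i].trim();
            if (header.startsWith("Comment ") && header.endsWith(":")) {
                String type = header.substring("Comment ".length(), header.length() - 1);
                result.computeIfAbsent(type, k -> new ArrayList<>()).add(parts[i + 1].trim());
                i++; // skip the text block that was just consumed
            }
        }
        return result;
    }
}
```

Calling extract(html).get("EC12") would then return every text block that followed a "Comment EC12:" header, in document order.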
First I would remove the blank lines and the <br> tags, and then I would implement an algorithm like BNDM for searching or, better, use a library like StringSearch. From the site: "High-performance pattern matching algorithms in Java" http://johannburkard.de/software/stringsearch/
Depending on what you want to achieve, this might be overkill, but I suggest you use finite state automaton string searching. You can have a look at an example at http://en.literateprograms.org/Finite_automaton_string_search_algorithm_%28Java%29.
I have extracted the text of the "web 2.0 wikipedia" article and split it into "sentences". After that, I want to create Strings that each contain 5 sentences.
When extracted, the text looks like below, in EditText
Below is my code
finalText = textField.getText().toString();
String[] textArrayWithFullStop = finalText.split("\\. ");
String colelctionOfFiveSentences = "";
List<String> textCollection = new ArrayList<String>();
for (int i = 0; i < textArrayWithFullStop.length; i++)
{
    colelctionOfFiveSentences = colelctionOfFiveSentences + textArrayWithFullStop[i];
    if ((i % 5 == 0))
    {
        textCollection.add(colelctionOfFiveSentences);
        colelctionOfFiveSentences = "";
    }
}
But when I use a Toast to display the text, here is what it gives:
Toast.makeText(Talk.this, textCollection.get(0), Toast.LENGTH_LONG).show();
As you can see, this is only one sentence! But I expected it to have 5 sentences!
The other thing is that the second string starts from somewhere else. Here is how I displayed it in a Toast:
Toast.makeText(Talk.this, textCollection.get(1), Toast.LENGTH_LONG).show();
This makes no sense to me! How can I properly split the text into sentences and create Strings containing 5 sentences each?
The problem is that for the first sentence, 0 % 5 = 0, so it is being added to the array list immediately. You should use another counter instead of mod.
finalText = textField.getText().toString();
String[] textArrayWithFullStop = finalText.split("\\. ");
String colelctionOfFiveSentences = "";
int sentenceAdded = 0;
List<String> textCollection = new ArrayList<String>();
for (int i = 0; i < textArrayWithFullStop.length; i++)
{
    colelctionOfFiveSentences += textArrayWithFullStop[i] + ". ";
    sentenceAdded++;
    if (sentenceAdded == 5)
    {
        textCollection.add(colelctionOfFiveSentences);
        colelctionOfFiveSentences = "";
        sentenceAdded = 0;
    }
}
Add ". " back to textArrayWithFullStop[i], because split() removes it:
colelctionOfFiveSentences = colelctionOfFiveSentences + textArrayWithFullStop[i]+". ";
I believe that if you modify the mod line to this:
if(i%5==4)
you will have what you need.
You probably realize this, but there are other reasons why someone might use a ". ", that doesn't actually end a sentence, for instance
I spoke to John and he said... "I went to the store.
Then I went to the Tennis courts.",
and I don't believe he was telling the truth because
1. Why would someone go to play tennis after going to the store and
2. John has no legs!
I had to ask, am I going to let him get away with these lies?
That's two sentences that don't end with a period and would mislead your code into thinking it's 5 sentences broken up at entirely the wrong places, so this approach is really fraught with problems. However, as an exercise in splitting strings, I guess it's as good as any other.
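As a starting point for the side problem (splitting into sentences), I would suggest this regexp, which also swallows a bracketed marker such as [1] directly after the period:
string.split("\\.(\\[[0-9\\[\\]]+\\])? ")
And for the main problem, maybe you could use Arrays.copyOfRange().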
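The Arrays.copyOfRange() idea can be sketched as follows. This is only an illustration under the thread's own simplification that a sentence ends at a period followed by whitespace; the class name SentenceGroups and the method group are hypothetical. The lookbehind keeps the period attached to each sentence, and a final group with fewer than `size` sentences is kept rather than dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SentenceGroups {
    // Hypothetical helper: joins consecutive sentences into groups of at most `size`.
    static List<String> group(String text, int size) {
        // Split on the whitespace that follows a period; the lookbehind keeps the period.
        String[] sentences = text.trim().split("(?<=\\.)\\s+");
        List<String> groups = new ArrayList<>();
        for (int from = 0; from < sentences.length; from += size) {
            int to = Math.min(from + size, sentences.length);
            groups.add(String.join(" ", Arrays.copyOfRange(sentences, from, to)));
        }
        return groups;
    }
}
```

For example, group("A. B. C. D. E. F. G.", 5) yields two strings: "A. B. C. D. E." and "F. G.".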
I have a situation in a free-text file, where between any pair of two string matches of my choice - e.g.
<hello> and </hello>
I want to replace the occurrence of a third string-match with a different string e.g. '=' with '&EQ;'
e.g.
hi=I want this equals sign to stay the same,but=<hello>
<I want="this one in the hello tag to be replaced"/>
</hello>,and=of course this one outside the tag to stay the same
becomes
hi=I want this equals sign to stay the same,but=<hello>
<I want&EQ;"this one in the hello tag to be replaced"/>
</hello>,and=of course this one outside the tag to stay the same
Basically this is because an XML body is being sent in a value-pair and it is royally screwing things up (I am sent this format by a venue and don't have control over it).
My immediate approach was to start with a BufferedReader and parse into a StringBuilder, going through line by line and using String.indexOf( ) to toggle whether we are inside the tags or not. But 20 minutes into this approach it occurred to me that it may be a bit brute-force and that there might be an existing solution to this kind of problem.
I know this approach will work eventually, but my question is: is there a better way, that is, one that is higher level and uses existing Java libraries or common frameworks (e.g. Apache Commons), which would make it less error-prone and more maintainable? In other words, is there a more intelligent way of solving this problem than the approach I am taking, which is effectively brute-force parsing?
If you want to escape XML, have a look at Apache Commons Lang StringEscapeUtils, specifically StringEscapeUtils.escapeXml; it should do what you need.
My great impenetrable solution is as follows, and it seems to work.
I do apologise that it's so hard to follow but it basically came down to this from factorising and re-factorising, many times over combining similar pieces of code.
It will replace all the occurrences of String 'replace' with String 'with' between the tokens 'openToken' and 'closeToken', and should be started with mode=false.
As with most things in life, there's probably a really clever succinct way to do this with RegEx
boolean mode = false;
StringBuilder output = new StringBuilder();
String line;
// reader is assumed to be a BufferedReader over the input
while ((line = reader.readLine()) != null) {
    mode = bodge("<hello>", "</hello>", "=", "&EQ;", line, output, mode);
}
private static boolean bodge(String openToken, String closeToken, String replace, String with,
                             String line, StringBuilder out, boolean mode) {
    // Inside the tags look for the closing token, outside them for the opening one.
    String comparator = mode ? closeToken : openToken;
    int index = line.indexOf(comparator);
    if (index == -1) {
        // Nothing interesting in the rest of this line: emit it and keep the current mode.
        // String.replace stands in for the original, undefined replacer() helper.
        out.append(mode ? line.replace(replace, with) : line);
        out.append("\r\n");
        return mode;
    }
    // Emit everything up to and including the token, then flip the mode for the remainder.
    int endOfToken = index + comparator.length();
    String outLine = line.substring(0, endOfToken);
    out.append(mode ? outLine.replace(replace, with) : outLine);
    return bodge(openToken, closeToken, replace, with, line.substring(endOfToken), out, !mode);
}
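For comparison, here is a rough sketch of the regex route mentioned above. It assumes the whole input fits in one String rather than being read line by line, and the class and method names (TagScopedReplace, replaceBetween) are made up for illustration. Unlike the line-based version, an opening token with no matching closing token is left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScopedReplace {
    // Hypothetical helper: replaces `target` with `replacement` only between open and close.
    static String replaceBetween(String text, String open, String close,
                                 String target, String replacement) {
        // Non-greedy match from an opening token to the nearest closing token;
        // DOTALL lets '.' cross line breaks.
        Pattern block = Pattern.compile(
                Pattern.quote(open) + ".*?" + Pattern.quote(close), Pattern.DOTALL);
        Matcher m = block.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixed = m.group().replace(target, replacement);
            // quoteReplacement stops '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the example from the question, replaceBetween(input, "<hello>", "</hello>", "=", "&EQ;") changes only the equals sign inside the hello block.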
I'm using the BreakIterator class in Java to break paragraph into sentences. This is my code :
public Map<String, Double> breakSentence(String document) {
    sentences = new HashMap<String, Double>();
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(document);
    Double tfIdf = 0.0;
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
        String sentence = document.substring(start, end);
        sentences.put(sentence, tfIdf);
    }
    return sentences;
}
The problem arises when the paragraph contains titles or numbers, for example:
"Prof. Roberts trying to solve a problem by writing a 1.200 lines of code."
What my code will produce is :
sentences :
Prof
Roberts trying to solve a problem by writing a 1
200 lines of code
Instead of 1 single sentence because of the period in titles and numbers.
Is there a way to fix this to handle titles and numbers with Java?
Well this is a bit of a tricky situation, and I've come up with a sticky solution, but it works nevertheless. I'm new to Java myself so if a seasoned veteran wants to edit this or comment on it and make it more professional by all means, please make me look better.
I basically added some control measures to what you already have to check and see if words exist like Dr. Prof. Mr. Mrs. etc. and if those words exist, it just skips over that break and moves to the next break (keeping the original start position) looking for the NEXT end (preferably one that doesn't end after another Dr. or Mr. etc.)
I'm including my complete program so you can see it all:
import java.text.BreakIterator;
import java.util.*;
public class TestCode {
    private static final String[] ABBREVIATIONS = {
        "Dr." , "Prof." , "Mr." , "Mrs." , "Ms." , "Jr." , "Ph.D."
    };
    public static void main(String[] args) throws Exception {
        String text = "Prof. Roberts and Dr. Andrews trying to solve a " +
                "problem by writing a 1.200 lines of code. This will " +
                "work if Mr. Java writes solid code.";
        for (String s : breakSentence(text)) {
            System.out.println(s);
        }
    }
    public static List<String> breakSentence(String document) {
        List<String> sentenceList = new ArrayList<String>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);
        int start = bi.first();
        int end = bi.next();
        int tempStart = start;
        while (end != BreakIterator.DONE) {
            String sentence = document.substring(start, end);
            if (! hasAbbreviation(sentence)) {
                sentence = document.substring(tempStart, end);
                tempStart = end;
                sentenceList.add(sentence);
            }
            start = end;
            end = bi.next();
        }
        return sentenceList;
    }
    private static boolean hasAbbreviation(String sentence) {
        if (sentence == null || sentence.isEmpty()) {
            return false;
        }
        for (String w : ABBREVIATIONS) {
            if (sentence.contains(w)) {
                return true;
            }
        }
        return false;
    }
}
What this does is basically set up two starting points. The original starting point (the one you used) still does the same thing, but tempStart doesn't move unless the string looks ready to be made into a sentence. It takes the first sentence:
"Prof."
and checks whether that break was caused by a weird word (i.e. does the chunk contain Prof., Dr. or similar that might have caused the break). If it does, then tempStart doesn't move; it stays there and waits for the next chunk to come back. In my slightly more elaborate sentence the next chunk also has a weird word messing up the breaks:
"Roberts and Dr."
It takes that chunk and because it has a Dr. in it it continues on to the third chunk of sentence:
"Andrews trying to solve a problem by writing a 1.200 lines of code."
Once it reaches the third chunk, which was broken off without any weird titles that may have caused a false break, it takes everything from tempStart (which is still at the beginning) to the current end, basically joining all three parts together.
Now it sets tempStart to the current 'end' and continues.
Like I said, this may not be a glamorous way to get what you want, but nobody else volunteered and it works *shrug*
It appears that Prof. Roberts only gets split if Roberts begins with a capital letter.
If Roberts begins with a lowercase r, it does not get split.
So... I guess that's how BreakIterator deals with periods.
I'm sure further reading of the documentation will explain how this behavior can be modified.
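To check this observation on your own JDK, a small sketch like the one below prints the raw pieces BreakIterator produces for a capitalised and a lowercase variant of the sentence. The class name BoundaryDemo and the method pieces are made up for illustration, and no output is shown here because the exact boundaries depend on the iterator's rules.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryDemo {
    // Hypothetical helper: returns the raw chunks between consecutive sentence boundaries.
    static List<String> pieces(String text) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Compare how the two variants are split.
        System.out.println(pieces("Prof. Roberts wrote the code."));
        System.out.println(pieces("Prof. roberts wrote the code."));
    }
}
```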
my first time posting!
The problem I'm having is I'm using XPath and Tag-Soup to parse a webpage and read in the data. As these are news articles sometimes they have links embedded in the content and these are what is messing with my program.
The XPath I'm using is storyPath = "//html:article//html:p//text()"; where the page has a structure of:
<article ...>
<p>Some text from the story.</p>
<p>More of the story, which proves what a great story this is!</p>
<p>More of the story without links!</p>
</article>
My code relating to the xpath evaluation is this:
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);
LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);
    String tmp = n.toString();
    tmp = tmp.replace("[#text:", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replaceAll("’", "'");
    tmp = tmp.replaceAll("‘", "'");
    tmp = tmp.replaceAll("–", "-");
    tmp = tmp.replaceAll("¬", "");
    tmp = tmp.trim();
    story.add(tmp);
}
this.setStory(story);
...
private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }
    this.story = tmp.trim();
}
The output this gives me is
Some text from the story.
More of the story, which proves
what a great story this is
!
More of the story without links!
Does anyone know a way for me to eliminate this error? Am I taking a wrong approach somewhere? (I understand I could well be with the setStory code, but I don't see another way.)
And without the tmp.replace() calls, all the results appear like [#text: what a great story this is] etc.
EDIT:
I am still having troubles, though possibly of a different kind.. what is killing me here is again a link, but the way the BBC have their website, the link is on a separate line, thus it still reads in with the same problem as described before (note that problem was fixed with the example given). The section of code on the BBC page is:
<p> Former Queens Park Rangers trainee Sterling, who
<a href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a>
had not started a senior match for the Reds before this season.
</p>
which appears in my output as:
Former Queens Park Rangers trainee Sterling, who
moved to the Merseyside club in February 2010 aged 15,
had not started a senior match for the Reds before this season.
For the problem with your edit where new lines in the html source code come out into your text document, you'll want to remove them before you print them. Instead of System.out.print(text.trim()); do System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));
First find the paragraphs: storyPath = "//html:article//html:p". Then, for each paragraph, get all of its text with another XPath query, concatenate the pieces without new lines, and put two new lines only at the end of the paragraph.
On another note, you shouldn't have to replaceAll("’", "'"). That is a sure sign that you are opening your file incorrectly. When you open your file you need to pass a Reader into tag soup. You should initialize the Reader like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"),"Cp1252")); Where you specify the correct character set for the file. A list of character sets is here: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html My guess is that it is Windows latin 1.
The [#text: thing is simply the toString() representation of a DOM Text node. The toString() method is intended to be used when you want a string representation of the node for debugging purposes. Instead of toString() use getTextContent() which returns the actual text.
If you don't want the link content to appear on separate lines then you could remove the //text() from your XPath and just take the textContent of the element nodes directly (getTextContent() for an element returns the concatenation of all the descendant text nodes)
String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);
LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);
    story.add(n.getTextContent().trim());
}
The fact that you are having to manually fix up things like "’" suggests your HTML is actually encoded in UTF-8 but you're reading it using a single-byte character set such as Windows1252. Rather than try and fix it post-hoc you should instead work out how to read the data in the correct encoding in the first place.
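Putting the getTextContent() advice together with the whitespace clean-up for links on their own source line, a self-contained sketch could look like this. It uses the JDK's own javax.xml.xpath instead of XPathAPI and TagSoup, and it assumes a small well-formed fragment without namespaces, so the path is "//article//p" rather than the prefixed one from the question. The class name ParagraphText is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ParagraphText {
    // Hypothetical helper: one cleaned-up string per <p> inside <article>.
    static List<String> paragraphs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//article//p", doc, XPathConstants.NODESET);
            List<String> story = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() concatenates all descendant text nodes, links included;
                // runs of spaces and line breaks are then collapsed to a single space.
                story.add(nodes.item(i).getTextContent().trim().replaceAll("\\s+", " "));
            }
            return story;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the fragment", e);
        }
    }
}
```

Joining the returned list with "\n\n" gives one block of text per paragraph, with link text kept inline.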
I'm moving some code from objective-c to java. The project is an XML/HTML Parser. In objective c I pretty much only use the scanUpToString("mystring"); method.
I looked at the Java Scanner class, but it breaks everything into tokens. I don't want that. I just want to be able to scan up to occurrences of substrings and keep track of the scanners current location in the overall string.
Any help would be great thanks!
EDIT
to be more specific. I don't want Scanner to tokenize.
String test = "<title balh> blah <title> blah>";
Scanner feedScanner = new Scanner(test);
String title = "<title";
String a = feedScanner.next(title);
String b = feedScanner.next(title);
In the above code I'd like feedScanner.next(title); to scan up to the end of the next occurrence of "<title"
What actually happens is that the first time feedScanner.next is called it works, since the default delimiter is whitespace; the second time it is called, however, it fails (for my purposes).
You can achieve this with the String class (java.lang.String).
First get the index of the first occurrence of your substring:
int firstOccurrence = string.indexOf(substring);
Then walk through the rest of the string, asking each time for the next occurrence at or after a given index:
int nextIndex = string.indexOf(substring, fromIndex);
If you want to save the values, wrap them as Integer objects and add them to an ArrayList.
This really is easier by just using String's methods directly:
String test = "<title balh> blah <title> blah>";
String target = "<title";
int index = 0;
index = test.indexOf( target, index ) + target.length();
// Index is now 6 (the space between "<title" and "balh")
index = test.indexOf( target, index ) + target.length();
// Index is now at the ">" in "<title> blah"
Depending on what you actually want to do besides walking through the string, different approaches might be better or worse. E.g. if you want to get the " balh> blah " string between the two "<title" occurrences, a Scanner is convenient:
String test = "<title balh> blah <title> blah>";
Scanner scan = new Scanner(test);
scan.useDelimiter("<title");
String stuff = scan.next(); // gets " balh> blah "
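If you want something closer to a scanner object that remembers its own location, the indexOf calls above can be wrapped in a tiny class. This is only a sketch: SubstringScanner and scanPast are made-up names, and the method moves the position to just past the match, which is what the question asks for, rather than stopping in front of it.

```java
class SubstringScanner {
    private final String source;
    private int position = 0;

    SubstringScanner(String source) {
        this.source = source;
    }

    // Current location in the overall string.
    int position() {
        return position;
    }

    // Returns the text between the current position and the next occurrence of target,
    // then moves the position just past that occurrence.
    // Returns null and leaves the position unchanged if target is not found.
    String scanPast(String target) {
        int hit = source.indexOf(target, position);
        if (hit < 0) {
            return null;
        }
        String skipped = source.substring(position, hit);
        position = hit + target.length();
        return skipped;
    }
}
```

On the test string from the question, two calls to scanPast("<title") leave position() at 6 and then at 24, matching the indexes in the snippet above.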
Maybe String.split is something for you?
String s = "The almighty String is mystring is your String is our mystring-object - isn't it?";
String[] parts = s.split("mystring");
Result:
Array("The almighty String is ", " is your String is our ", "-object - isn't it?")
You know that "mystring" must have been between each pair of neighbouring parts. I'm not sure about the start and end, so maybe you need some s.startsWith("mystring") / s.endsWith("mystring") checks as well.