parse and capture the numbers that are at the end of strings - java

I am building a program to go through a log file that has entries like this:
en halo%20reach%20noble%20actual%20in%20theater 1 659
en Wazir_Khan_Mosque 2 77859
en Waziristan_War 3 285976
en Wazirpur_Upazila 1 364
I want to output the numbers that appear at the end of each string (i.e. 659, 77859, 285976, 364). As you can see, the numbers have differing numbers of digits.
How can I grab the last numbers from these strings?

One possible solution is to split the String according to whitespaces:
String[] splitted = myStr.split("\\s+");
Then take the last element:
splitted[splitted.length - 1];
If you want the int value, you should use Integer#parseInt.
Another solution is to use lastIndexOf and substring:

int pos = line.lastIndexOf(' ');
int value = Integer.parseInt(line.substring(pos + 1));
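The lastIndexOf approach can be wrapped so that it tolerates trailing whitespace and lines that do not end in a number. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the fields are separated by plain spaces, as in the sample log.

```java
import java.util.OptionalLong;

final class TrailingNumber {
    /** Returns the number after the last space, or empty if that tail is not a number. */
    static OptionalLong lastNumber(String line) {
        String trimmed = line.trim();              // tolerate trailing whitespace
        int pos = trimmed.lastIndexOf(' ');        // -1 when the line has no space
        String tail = trimmed.substring(pos + 1);  // whole line when pos is -1
        try {
            return OptionalLong.of(Long.parseLong(tail));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();           // tail was not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(lastNumber("en Waziristan_War 3 285976"));
    }
}
```

Returning an OptionalLong instead of throwing keeps one malformed line from aborting the whole log run.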

If you are reading each line and assigning to a string like this
String line = "en halo%20reach%20noble%20actual%20in%20theater 1 659";
then doing this should give you the last number
String words[] = line.split("\\s");
System.out.println(words[words.length - 1]);

I usually don't recommend regular expressions, as they are so often abused here on Stack Overflow (especially when it comes to XML/HTML), but this is the perfect case to learn how to use them!
Splitting on whitespace will work, but it isn't as robust as this approach, which continues to work if the whitespace varies and lets you capture all the other data in one operation, which you will probably want eventually:
^en\s+(.*)\s+(\d+)\s+(\d+)$
Then to use it:
final Pattern p = Pattern.compile("^en\\s+(.*)\\s+(\\d+)\\s+(\\d+)$");
final Matcher m = p.matcher("en Wazirpur_Upazila 1 364");
if (m.matches()) { // groups are only available after a successful match
    final String g1 = m.group(1); // Wazirpur_Upazila
    final String g2 = m.group(2); // 1
    final String g3 = m.group(3); // 364
}
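If the same pattern is applied to every line of the log, it can be compiled once and reused. A rough sketch of that idea, using the pattern from this answer; the class name LogLineParser and the choice to return null for non-matching lines are my own assumptions, not something the question prescribes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLineParser {
    // Same pattern as in the answer, compiled once and reused for every line.
    private static final Pattern LINE =
            Pattern.compile("^en\\s+(.*)\\s+(\\d+)\\s+(\\d+)$");

    /** Returns {title, first number, last number}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // e.g. a line that does not start with "en"
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("en Wazir_Khan_Mosque 2 77859");
        System.out.println(parts[2]);
    }
}
```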

Try this; it may not be very good:
public static void main(String[] args) throws FileNotFoundException, IOException {
BufferedReader br = new BufferedReader(new FileReader("log.txt"));
try {
String line = br.readLine();
List<String> stringList = new ArrayList<>();
while(line!=null) {
String[] strsplit = line.split(" ");
line = br.readLine();
for(int i=3;i<strsplit.length;i+=4) {
stringList.add(strsplit[i]);
}
}
System.out.println(stringList);
} finally {
br.close();
}
}
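A variant of the same loop that uses try-with-resources and always takes the last field instead of index 3 might look like the sketch below. It reads from any Reader so it can be tried without a log.txt on disk; LastFieldReader is a hypothetical name, and wrapping the IOException in an UncheckedIOException is just one possible choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class LastFieldReader {
    /** Collects the last whitespace-separated field of every non-blank line. */
    static List<String> lastFields(Reader source) {
        List<String> result = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = trimmed.split("\\s+");
                result.add(parts[parts.length - 1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "en Wazir_Khan_Mosque 2 77859\nen Waziristan_War 3 285976\n";
        System.out.println(lastFields(new StringReader(sample)));
    }
}
```

For a real file, a FileReader (or Files.newBufferedReader) can be passed in place of the StringReader.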

Related

Extract last number after decimal

I am getting a piece of JSON text from a url connection and saving it to a string currently as such:
...//setting up url and connection
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String str = in.readLine();
When I print str, I correctly find the data {"build":{"version_component":"1.0.111"}}
Now I want to extract the 111 from str, but I am having some trouble.
I tried
String afterLastDot = inputLine.substring(inputLine.lastIndexOf(".") + 1);
but I end up with 111"}}
I need a solution that is generic so that if I have String str = {"build":{"version_component":"1.0.111111111"}}; the solution still works and extracts 111111111 (ie, I don't want to hard code extract the last three digits after the decimal point)
If you cannot use a JSON parser, then you can use this regex-based extraction:
String lastNum = str.replaceAll("^.*\\.(\\d+).*", "$1");
The greedy ^.* matches everything up to the last dot that is followed by digits; those digits (one or more) are captured in group #1, which is used as the replacement.
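One thing to keep in mind with replaceAll is that it returns the input unchanged when the pattern does not match, so a missing version silently yields the whole string. A Matcher-based sketch makes that case explicit. The class name and the slightly different pattern (a dot, digits, then the closing quote of the JSON value) are my own illustration rather than part of the answer above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VersionSuffix {
    // A dot, one or more digits, then the closing quote of the JSON string value.
    private static final Pattern LAST_COMPONENT = Pattern.compile("\\.(\\d+)\"");

    /** Returns the digits after the last dot of the quoted version, if any. */
    static Optional<String> lastComponent(String json) {
        Matcher m = LAST_COMPONENT.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String str = "{\"build\":{\"version_component\":\"1.0.111\"}}";
        System.out.println(lastComponent(str).orElse("no version found"));
    }
}
```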
Find the start and the end indexes of the String you need and substring(start, end) :
// String str = "{"build":{"version_component":"1.0.111"}};" cannot compile without escaping
String str = "{\"build\":{\"version_component\":\"1.0.111\"}}";
int start = str.lastIndexOf(".")+1;
int end = str.lastIndexOf("\"");
String substring = str.substring(start,end);
Just use a JSON API:
JSONObject obj = new JSONObject(str);
String versionComponent= obj.getJSONObject("build").getString("version_component");
Then just split and take the last element
versionComponent.split("\\.")[2];
You can try the following code (note that it hard-codes three digits after the last dot):
...
int index = inputLine.lastIndexOf(".")+1 ;
String afterLastDot = inputLine.substring(index, index+3);
With regular expressions (regex), you can solve your problem like this (note that the pattern searches for the literal 111):
Pattern pattern = Pattern.compile("111") ;
Matcher matcher = pattern.matcher(str) ;
while(matcher.find()){
System.out.println(matcher.start()+" "+matcher.end());
System.out.println(str.substring(matcher.start(), matcher.end()));
}

Getting the last word of a line passed to a mapper in hadoop

If I have a dataset with lines like this 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245 and I am running a map reduce job with hadoop, how can I get the last element in each line?
I have tried all the obvious answers, such as String lastWord = test.substring(test.lastIndexOf(" ")+1); but this gives me the - character. I have tried splitting it based on a space, and getting the last element, but the last character is still a -.
Can I not expect that the data will be delivered to me line by line? In other words, can I not expect a file in the form a b c d \n e f g h\n to be delivered line by line?
And does anyone have any tips on how to get the last word in this line?
This is a snippet from my map function, where I try to get the data:
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String test = value.toString();
StringTokenizer tokenizer = new StringTokenizer(test);
//String lastWord = test.substring(test.lastIndexOf(" ")+1); <--first try
//String [] array = test.split(" ");//<--second try
//one.set(Integer.valueOf(array[8]));
int i = 0;
String candidate = null;
while (tokenizer.hasMoreTokens()) {
candidate = tokenizer.nextToken();
if (i == 3) {
//this works to get the date field
String wholeDate = candidate;
String[] dateArray = wholeDate.split(":");
String date = dateArray[0].substring(1); // get rid of '['
String hour = dateArray[1];
word.set(date + " " + hour);
} else if (i == 7) {
// <-- third try
String replySizeString = candidate;
one.set(Integer.valueOf(replySizeString));
}
i++;
}
}
Instead of using a StringTokenizer you could just use the String[] String.split(String regex) method to return an array of Strings for each line. Then, assuming that each line of your data has the same number of fields, separated by spaces, you can just look at that array element.
String line = value.toString();
String[] lineArray = line.split(" ");
String lastWord = lineArray[9];
Or if you know that you always want the last token you could see how long the array is and then just grab the last element.
String lastWord = lineArray[lineArray.length - 1];
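One possible explanation for seeing - as the last element, which the question does not confirm, is that some lines really do end in -: access logs in this format commonly write - when no reply size was recorded. Under that assumption, a defensive sketch (plain Java, no Hadoop types; ReplySize is a made-up name) could parse the last field and report it as absent when it is not numeric:

```java
import java.util.OptionalInt;

final class ReplySize {
    /** Returns the last field as an int, or empty if it is missing or not numeric. */
    static OptionalInt replySize(String line) {
        // trim() plus \\s+ copes with trailing blanks and repeated spaces
        String[] fields = line.trim().split("\\s+");
        String last = fields[fields.length - 1];
        try {
            return OptionalInt.of(Integer.parseInt(last));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // e.g. "-" or an empty line
        }
    }

    public static void main(String[] args) {
        String line = "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "
                + "\"GET /history/apollo/ HTTP/1.0\" 200 6245";
        System.out.println(replySize(line));
    }
}
```

Inside the mapper, an empty result could simply be skipped or counted as 0, depending on what the job is meant to compute.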

Cut ':' && " " from a String with a tokenizer

Right now I am a little bit confused. I want to manipulate this string with a tokenizer:
Bob:23456:12345 Carl:09876:54321
However, I use a Tokenizer, but when I try:
String signature1 = tok.nextToken(":");
tok.nextToken(" ")
I get:
12345 Carl
However, I want to store the first int and the second int in variables.
Any ideas?
You have two different patterns, so maybe you should handle them separately.
First you should split the space-separated values using the String method split(" "), which returns a String[].
Then tokenize each String on the colon.
I believe this will work.
Code:
String input = "Bob:23456:12345 Carl:09876:54321";
String[] words = input.split(" ");
for (String word : words) {
String[] token = word.split(":");
String name = token[0];
int value0 = Integer.parseInt(token[1]);
int value1 = Integer.parseInt(token[2]);
}
Following code should do:
String input = "Bob:23456:12345 Carl:09876:54321";
StringTokenizer st = new StringTokenizer(input, ": ");
while(st.hasMoreTokens())
{
String name = st.nextToken();
String val1 = st.nextToken();
String val2 = st.nextToken();
}
Seeing as you have multiple patterns, you cannot handle them with only one tokenizer.
You need to first split it based on whitespace, then split based on the colon.
Something like this should help:
String[] s = "Bob:23456:12345 Carl:09876:54321".split(" ");
System.out.println(Arrays.toString(s ));
String[] so = s[0].split(":", 2);
System.out.println(Arrays.toString(so));
And you'd get this:
[Bob:23456:12345, Carl:09876:54321]
[Bob, 23456:12345]
If you must use a tokenizer, then I think you need to use it twice:
String str = "Bob:23456:12345 Carl:09876:54321";
StringTokenizer spaceTokenizer = new StringTokenizer(str, " ");
while (spaceTokenizer.hasMoreTokens()) {
StringTokenizer colonTokenizer = new StringTokenizer(spaceTokenizer.nextToken(), ":");
colonTokenizer.nextToken(); // to ignore Bob and Carl
while (colonTokenizer.hasMoreTokens()) {
System.out.println(colonTokenizer.nextToken());
}
}
outputs
23456
12345
09876
54321
Personally though I would not use tokenizer here and use Claudio's answer which splits the strings.
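For completeness, the split-based approach can be sketched as a small method that keeps both values per name. The names ColonRecords and parse are illustrative only, and the sketch assumes every record has exactly the form name:int:int. Note that Integer.parseInt drops the leading zero of 09876, so keep the values as Strings if that zero matters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ColonRecords {
    /** Maps each name to its two int values, in input order. */
    static Map<String, int[]> parse(String input) {
        Map<String, int[]> result = new LinkedHashMap<>();
        // first split on whitespace, then split each record on the colon
        for (String record : input.trim().split("\\s+")) {
            String[] parts = record.split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected name:int:int but got: " + record);
            }
            int first = Integer.parseInt(parts[1]);
            int second = Integer.parseInt(parts[2]);
            result.put(parts[0], new int[] { first, second });
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> records = parse("Bob:23456:12345 Carl:09876:54321");
        System.out.println(records.get("Bob")[0] + " " + records.get("Bob")[1]);
    }
}
```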

using tokenizer to read a line

public void GrabData() throws IOException
{
try {
BufferedReader br = new BufferedReader(new FileReader("data/500.txt"));
String line = "";
int lineCounter = 0;
int TokenCounter = 1;
arrayList = new ArrayList < String > ();
while ((line = br.readLine()) != null) {
//lineCounter++;
StringTokenizer tk = new StringTokenizer(line, ",");
System.out.println(line);
while (tk.hasMoreTokens()) {
arrayList.add(tk.nextToken());
System.out.println("check");
TokenCounter++;
if (TokenCounter > 12) {
er = new DataRecord(arrayList);
DR.add(er);
arrayList.clear();
System.out.println("check2");
TokenCounter = 1;
}
}
}
} catch (FileNotFoundException ex) {
Logger.getLogger(Driver.class.getName()).log(Level.SEVERE, null, ex);
}
}
Hello, I am using a tokenizer to read the contents of a line and store them in an ArrayList. The GrabData method above does that job.
The only problem is that the company name (which is the third column in every line) is in quotes and has a comma in it. I have included one line as an example. The tokenizer depends on the comma to separate the line into different tokens, but the company name throws it off, I guess. If it weren't for the comma in the company column, everything would work as normal.
Example:-
Essie,Vaill,"Litronic , Industries",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com
Any ideas?
First of all StringTokenizer is considered to be legacy code. From Java doc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Using the split() method you get an array of strings. While iterating through the array you can check if the current string starts with a quote and if that's the case check if the next one ends with a quote. If you meet these 2 conditions then you know you didn't split where you wanted and you can merge these 2 together, process it like you want and continue iterating through the array normally after that. In that pass you will probably do i+=2 instead of your regular i++ and it should go unnoticed.
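The merge step described above could look roughly like the following sketch. It generalizes slightly by merging as many pieces as needed until the closing quote is found, and it strips the surrounding quotes. It is not a full CSV parser (escaped quotes inside a field are not handled), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedFieldMerger {
    /** Splits on commas, then re-joins pieces that belong to one quoted field. */
    static List<String> splitAndMerge(String line) {
        String[] pieces = line.split(",", -1); // -1 keeps trailing empty fields
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < pieces.length; i++) {
            String field = pieces[i];
            if (field.startsWith("\"")) {
                // keep appending pieces (and the comma that was lost) until the quote closes
                while (!isClosed(field) && i + 1 < pieces.length) {
                    field = field + "," + pieces[++i];
                }
                if (isClosed(field)) {
                    field = field.substring(1, field.length() - 1); // drop the quotes
                }
            }
            fields.add(field);
        }
        return fields;
    }

    private static boolean isClosed(String field) {
        return field.length() > 1 && field.endsWith("\"");
    }

    public static void main(String[] args) {
        System.out.println(splitAndMerge("Essie,Vaill,\"Litronic , Industries\",AK"));
    }
}
```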
You can accomplish this using Regular Expressions. The following code:
String s = "asd,asdasd,asd, \"asdasdasd,asdasdasd\", asdasd, asd";
System.out.println(s);
s = s.replaceAll("(?<=\")([^\"]+?),([^\"]+?)(?=\")", "$1 $2");
s = s.replaceAll("\"", "");
System.out.println(s);
yields
asd,asdasd,asd, "asdasdasd,asdasdasd", asdasd, asd
asd,asdasd,asd, asdasdasd asdasdasd, asdasd, asd
which, from my understanding, is the preprocessing you require for your tokenizer-code to work. Hope this helps.
While StringTokenizer might not natively handle this for you, a couple lines of code will do it... probably not the most efficient, but should get the idea across...
while(tk.hasMoreTokens()) {
String token = tk.nextToken();
/* If the item is encapsulated in quotes, loop through all tokens to
* find closing quote
*/
if( token.startsWith("\"") ){
while( tk.hasMoreTokens() && ! token.endsWith("\"") ) {
// append our token with the next one. Don't forget to retain commas!
token += "," + tk.nextToken();
}
if( !token.endsWith("\"") ) {
// open quote found but no close quote. Error out.
throw new BadFormatException("Incomplete string:" + token);
}
// remove leading and trailing quotes
token = token.substring(1, token.length()-1);
}
}
As you can see in the class description, the use of StringTokenizer is discouraged by Oracle.
Instead of using a tokenizer, I would use the String split() method, which takes a regular expression as its argument and can significantly reduce your code.
String str = "Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr,Anchorage,Anchorage,AK,99515,907-345-0962,907-345-1215,essie@vaill.com,http://www.essievaill.com";
String[] strs = str.split("(?<! ),(?! )");
List<String> list = new ArrayList<String>(strs.length);
for(int i = 0; i < strs.length; i++) list.add(strs[i]);
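Just pay attention to your regex: this one only splits on commas that have no space on either side, so it assumes that a comma inside a quoted field always has a space next to it. The quotes themselves stay in the resulting string.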
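If relying on spaces around the comma feels too fragile, a different regex (swapped in here, not the one from this answer) splits only on commas that are followed by an even number of quote characters, which means the comma is outside any quoted field. A sketch, with a hypothetical class name, assuming the quotes in each line are balanced and not escaped:

```java
import java.util.Arrays;
import java.util.List;

final class QuoteAwareSplit {
    // Split on a comma only if an even number of quotes follows it,
    // i.e. the comma itself is not inside a quoted field.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    /** Returns the fields of the line; quoted fields keep their quotes. */
    static List<String> split(String line) {
        return Arrays.asList(line.split(COMMA_OUTSIDE_QUOTES, -1));
    }

    public static void main(String[] args) {
        System.out.println(split("Essie,Vaill,\"Litronic , Industries\",14225 Hancock Dr"));
    }
}
```

The lookahead rescans the rest of the line for every comma, so this is convenient for short records rather than fast for very long ones.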

Filter words from string

I want to filter a string.
Basically when someone types a message, I want certain words to be filtered out, like this:
User types: hey guys lol omg -omg mkdj*Omg*ndid
I want the filter to run and:
Output: hey guys lol - mkdjndid
And I need the filtered words to be loaded from an ArrayList that contains several words to filter out. Now at the moment I am doing if(message.contains(omg)) but that doesn't work if someone types zomg or -omg or similar.
Use replaceAll with a regex built from the bad word:
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
This passes your test case:
public static void main( String[] args ) {
List<String> badWords = Arrays.asList( "omg", "black", "white" );
String message = "hey guys lol omg -omg mkdj*Omg*ndid";
for ( String badWord : badWords ) {
message = message.replaceAll("(?i)\\b[^\\w -]*" + badWord + "[^\\w -]*\\b", "");
}
System.out.println( message );
}
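Because badWord is concatenated straight into the regex, a word containing regex metacharacters (a dot, for instance) would be interpreted as a pattern. Wrapping it in Pattern.quote avoids that. The sketch below keeps the regex from this answer otherwise unchanged; BadWordFilter and filter are illustrative names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class BadWordFilter {
    /** Removes every listed word (case-insensitively) using the regex from the answer. */
    static String filter(String message, List<String> badWords) {
        for (String badWord : badWords) {
            // Pattern.quote makes the word a literal, so "o.g" no longer matches "omg"
            String regex = "(?i)\\b[^\\w -]*" + Pattern.quote(badWord) + "[^\\w -]*\\b";
            message = message.replaceAll(regex, "");
        }
        return message;
    }

    public static void main(String[] args) {
        List<String> badWords = Arrays.asList("omg", "black", "white");
        System.out.println(filter("hey guys lol omg -omg mkdj*Omg*ndid", badWords));
    }
}
```

Removing a standalone word leaves the spaces on both sides of it in place, so the result may contain a double space that you might want to collapse afterwards.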
try:
input.replaceAll("(\\*?)[oO][mM][gG](\\*?)", "").split(" ")
Dave gave you the answer already, but I will emphasize the point here. You will face a problem if you implement your algorithm with a simple for-loop that just replaces every occurrence of the filtered word. As an example, if you filter the word ass in the word 'classic' and replace it with 'butt', the resulting word will be 'clbuttic', which doesn't make any sense. Thus, I would suggest using a word list, like the ones stored in Linux under the /usr/share/dict/ directory, to check whether the word is valid or needs filtering.
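A lighter-weight way to avoid the 'clbuttic' effect, without a dictionary, is to replace only whole words by anchoring the pattern with \b on both sides. This is a different technique from the word-list check suggested above, sketched here under the assumption that matching whole words is acceptable; it will deliberately not catch variants such as zomg.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordReplace {
    /** Replaces the word only where it stands alone, ignoring case. */
    static String replaceWord(String text, String word, String replacement) {
        String regex = "(?i)\\b" + Pattern.quote(word) + "\\b";
        // quoteReplacement keeps $ and \ in the replacement from being interpreted
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("a classic ass", "ass", "butt"));
    }
}
```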
I don't quite get what you are trying to do.
I ran into this same problem and solved it in the following way:
1) Have a google spreadsheet with all words that I want to filter out
2) Directly download the google spreadsheet into my code with the loadConfigs method (see below)
3) Replace all l33tsp33k characters with their respective alphabet letter
4) Replace all special characters but letters from the sentence
5) Run an algorithm that checks all the possible combinations of words within a string against the list efficiently, note that this part is key - you don't want to loop over your ENTIRE list every time to see if your word is in the list. In my case, I found every combination within the string input and checked it against a hashmap (O(1) runtime). This way the runtime grows relatively to the string input, not the list input.
6) Check if the word is not used in combination with a good word (e.g. bass contains *ss). This is also loaded through the spreadsheet
7) In our case we are also posting the filtered words to Slack, but you can remove that line obviously.
We are using this in our own games and it's working like a charm. Hope you guys enjoy.
https://pimdewitte.me/2016/05/28/filtering-combinations-of-bad-words-out-of-string-inputs/
public static HashMap<String, String[]> words = new HashMap<String, String[]>();
public static void loadConfigs() {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL("https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv").openConnection().getInputStream()));
String line = "";
int counter = 0;
while((line = reader.readLine()) != null) {
counter++;
String[] content = null;
try {
content = line.split(",");
if(content.length == 0) {
continue;
}
String word = content[0];
String[] ignore_in_combination_with_words = new String[]{};
if(content.length > 1) {
ignore_in_combination_with_words = content[1].split("_");
}
words.put(word.replaceAll(" ", ""), ignore_in_combination_with_words);
} catch(Exception e) {
e.printStackTrace();
}
}
System.out.println("Loaded " + counter + " words to filter out");
} catch (IOException e) {
e.printStackTrace();
}
}
/**
* Iterates over a String input and checks whether a cuss word was found in a list, then checks if the word should be ignored (e.g. bass contains the word *ss).
* @param input
* @return
*/
public static ArrayList<String> badWordsFound(String input) {
if(input == null) {
return new ArrayList<>();
}
// remove leetspeak
input = input.replaceAll("1","i");
input = input.replaceAll("!","i");
input = input.replaceAll("3","e");
input = input.replaceAll("4","a");
input = input.replaceAll("@","a");
input = input.replaceAll("5","s");
input = input.replaceAll("7","t");
input = input.replaceAll("0","o");
ArrayList<String> badWords = new ArrayList<>();
input = input.toLowerCase().replaceAll("[^a-zA-Z]", "");
for(int i = 0; i < input.length(); i++) {
for(int fromIOffset = 1; fromIOffset < (input.length()+1 - i); fromIOffset++) {
String wordToCheck = input.substring(i, i + fromIOffset);
if(words.containsKey(wordToCheck)) {
// for example, if you want to say the word bass, that should be possible.
String[] ignoreCheck = words.get(wordToCheck);
boolean ignore = false;
for(int s = 0; s < ignoreCheck.length; s++ ) {
if(input.contains(ignoreCheck[s])) {
ignore = true;
break;
}
}
if(!ignore) {
badWords.add(wordToCheck);
}
}
}
}
for(String s: badWords) {
Server.getSlackManager().queue(s + " qualified as a bad word in a username");
}
return badWords;
}
