I'm trying to parse an HTML tag. So far I have the text, which can be any of the following:
"Guide Price £50,000"
or
"£50,000"
or even
"£50,000 - £55,000"
In the third case to make things simpler all I need is the first price listed.
My question is: how can I convert these numbers into an int or double, preferably an int as the numbers are quite large? Would NumberFormat do this, or would I need a regular expression, especially if some text trails the tag block?
Here is what I have so far:
String priceNumber = url.select("span.price").text(); // using the jsoup library
priceNumber = priceNumber.replaceAll("[^\\d.]", "");
This removes everything that is not a digit or a dot, I think.
If the text contains two numbers, how do I get the first?
Use a regex with Matcher.find to search for occurrences, then remove the commas and try to parse. Here's the decimal case:
String input = "£50,000 - £55,000";
Pattern regex = Pattern.compile("\\d[\\d,\\.]+");
Matcher finder = regex.matcher(input);
if( finder.find() ) { // or while() if you want to process each
try {
double value = Double.parseDouble(finder.group(0).replaceAll(",", ""));
// do something with value
} catch (NumberFormatException e ) {
// handle unparseable
}
}
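Since the question prefers an int, here is a minimal sketch of the same find-then-strip-commas idea that returns the first price as an int. The class name PriceParser and the method firstPrice are made up for illustration, and the pattern assumes prices are whole numbers with optional comma grouping.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class PriceParser {
    // A digit followed by any run of digits and grouping commas.
    private static final Pattern PRICE = Pattern.compile("\\d[\\d,]*");

    // Returns the first number found in the text, or empty if there is none
    // or it does not fit in an int.
    static OptionalInt firstPrice(String text) {
        Matcher m = PRICE.matcher(text);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group().replace(",", "")));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

Because find() stops at the first match, "£50,000 - £55,000" yields 50000 and the second price is ignored.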
You can convert a String that contains only a number to an int or double with Integer.parseInt(yourString) or Double.parseDouble(yourString) respectively.
In your first and second cases, once the non-digit characters are stripped, this gives you 50000.
In the third case you need to split the string in two first and then repeat the trick on the first part.
Your title is a bit misleading, as you are not asking how to convert from pounds to, let's say, euros.
Use a regex to remove the unimportant characters and then parse the result as a double. You can then truncate to int if you only care about whole-pound values.
NumberFormat format = NumberFormat.getInstance();
double value = format.parse(priceNumber.replaceAll("[^\\d]*([\\d,]*).*", "$1")).doubleValue();
The first part of the replace pattern, [^\\d]*, matches and throws away leading characters; the second part, ([\\d,]*), saves the next series of digits and commas; and the third part, .*, throws away the rest of the input.
Then the whole input is replaced with the contents of the first saved match (the second part of the replace pattern).
Then you use the NumberFormat class to parse the number (you could use Double.parseDouble() if it weren't for the comma)
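A self-contained sketch of the steps above might look like this. Two things here are my own assumptions rather than part of the answer: the names LeadingNumber and firstNumber, and the explicit Locale.UK, which I added so the comma is treated as a grouping separator regardless of the machine's default locale.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper, named for illustration only.
class LeadingNumber {
    static double firstNumber(String text) {
        // Keep only the first run of digits and commas; drop everything else.
        String digits = text.replaceAll("[^\\d]*([\\d,]*).*", "$1");
        // Assumption: UK-style formatting, where ',' groups thousands.
        NumberFormat format = NumberFormat.getInstance(Locale.UK);
        try {
            return format.parse(digits).doubleValue();
        } catch (ParseException e) {
            // Reached when the text contains no digits at all.
            throw new IllegalArgumentException("No number found in: " + text, e);
        }
    }
}
```

Casting the result with (int) then gives the truncated whole-number value.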
This will work I think!
String string = "This is £50,000 pounds, this is £5.00 pounds.";
String newString = string;
while (string.contains("£")) {
if (string.indexOf("£") != -1) {
// it contains £
string = string.substring(string.indexOf("£"));
newString = string.substring(0, string.indexOf(" "));
string = string.replaceFirst(newString, "");
newString = newString.replaceAll("£", "");
newString = newString.replaceAll(",", "");
double money = Double.parseDouble(newString);
System.out.println(money);
}
}
You can try this out (for all the cases):
String priceNumber = "£500001 wcjnwknv122333- £55,000";
String regex = "£(\\d+,?\\d+)\\D?";
Pattern p =Pattern.compile(regex);
Matcher m = p.matcher(priceNumber);
if(m.find()){
System.out.println(m.group(1));
}
Try the regex below:
((\$|£)\d+\s|(\$|£)\d+-(\$|£)\d+\s)
Related
I have a String like this as shown below. From below string I need to extract number 123 and it can be at any position as shown below but there will be only one number in a string and it will always be in the same format _number_
text_data_123
text_data_123_abc_count
text_data_123_abc_pqr_count
text_tery_qwer_data_123
text_tery_qwer_data_123_count
text_tery_qwer_data_123_abc_pqr_count
Below is the code:
String value = "text_data_123_abc_count";
// this below code will not work as index 2 is not a number in some of the above example
int textId = Integer.parseInt(value.split("_")[2]);
What is the best way to do this?
With a little guava magic:
String value = "text_data_123_abc_count";
Integer id = Ints.tryParse(CharMatcher.inRange('0', '9').retainFrom(value));
See also the CharMatcher documentation.
\\d+
this regex with find should do it for you.
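Use lookbehind and lookahead assertions. The (?=_|$) part also accepts a number at the very end of the string, as in text_data_123:
Matcher m = Pattern.compile("(?<=_)\\d+(?=_|$)").matcher(s);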
while(m.find())
{
System.out.println(m.group());
}
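To go from the matched text to the int the question asks for, a sketch along these lines could work. IdExtractor and extractId are illustrative names, and the lookahead here uses (?=_|$) on the assumption that the number may also be the last segment, as in text_data_123.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class IdExtractor {
    // Digits preceded by '_' and followed by '_' or the end of the input.
    private static final Pattern ID = Pattern.compile("(?<=_)\\d+(?=_|$)");

    static int extractId(String value) {
        Matcher m = ID.matcher(value);
        if (!m.find()) {
            throw new IllegalArgumentException("No _number_ segment in: " + value);
        }
        return Integer.parseInt(m.group());
    }
}
```

Unlike split("_")[2], this does not depend on where the number sits in the string.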
You can use replaceAll to remove all non-digits to leave only one number (since you say there will be only 1 number in the input string):
String s = "text_data_123_abc_count".replaceAll("[^0-9]", "");
Instead of [^0-9] you can use \D (which also means non-digit):
String s = "text_data_123_abc_count".replaceAll("\\D", "");
Given current requirements and restrictions, the replaceAll solution seems the most convenient (no need to use Matcher directly).
You can split the string into parts and compare each part with its upper-case form. A part made only of digits is unchanged by toUpperCase(), so if the two are equal you can parse it as a number and save it:
public class Main {
public static void main(String[] args) {
String txt = "text_tery_qwer_data_123_abc_pqr_count";
String[] words = txt.split("_");
int num = 0;
for (String t : words) {
if (t.equals(t.toUpperCase())) // compare contents, not references
num = Integer.parseInt(t);
}
System.out.println(num);
}
}
It's basically about getting the string value between two characters. SO has many questions related to this, like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I found it quite confusing when dealing with multiple dots in the string and getting the value between two particular dots.
I have got the package name as :
au.com.newline.myact
I need to get the value between "com." and the next "dot(.)". In this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I also tried using substring and indexOf, but couldn't get the intended result. Because package names in Android vary in the number of dots and characters, I cannot use a fixed index. Please suggest any ideas.
As you probably know (based on the .* part in your regex), the dot . is a special character in regular expressions that represents any character (except line separators). So to make a dot match only a literal dot you need to escape it. To do so you can place \ before it, or place it inside a character class: [.].
Also, to get only the part matched by the parentheses (.*) you need to select it with the proper group index, which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only the one element before the next . then you need to change the greedy behaviour of the * quantifier in .* to reluctant by adding ? after it, like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// ^
Another approach is to accept only non-dot characters instead of .*. They can be represented by a negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
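To see the difference between the three patterns, here is a small sketch that runs each one against a longer package name. QuantifierDemo and firstGroup are names I made up for the comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class QuantifierDemo {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "au.com.newline.myact.modelact";
        System.out.println(firstGroup("com[.](.*)[.]", input));     // greedy
        System.out.println(firstGroup("com[.](.*?)[.]", input));    // reluctant
        System.out.println(firstGroup("com[.]([^.]*)[.]", input));  // negated class
    }
}
```

With this input the greedy version captures newline.myact (it runs up to the last dot), while the reluctant and negated-class versions both capture newline.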
If you don't want to use regex you can simply use indexOf method to locate positions of com. and next . after it. Then you can simply substring what you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String result = beforeTask.substring(start, end);
System.out.println(result);
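The snippet above assumes that "com." is present and that another dot follows it. A slightly more defensive sketch, with the hypothetical names PackageSegment and segmentAfter, could handle the missing cases like this:

```java
// Hypothetical helper, named for illustration only.
class PackageSegment {
    // Returns the text between the first occurrence of prefix and the next '.',
    // the rest of the string if no '.' follows, or "" if prefix is absent.
    static String segmentAfter(String packageName, String prefix) {
        int at = packageName.indexOf(prefix);
        if (at < 0) {
            return "";
        }
        int start = at + prefix.length();
        int end = packageName.indexOf('.', start);
        return end < 0 ? packageName.substring(start)
                       : packageName.substring(start, end);
    }
}
```

Note that a plain indexOf("com.") would also find the com. inside a segment such as telecom, so passing ".com." as the prefix may be safer when the package does not start with com.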
You can use reflection to get the name of any class. For example:
If I have a class Runner in com.some.package, I can run
Runner.class.getName() // string is "com.some.package.Runner"
to get the full name of the class, which includes the package name.
To get the part after 'com' you can use Runner.class.getName().split("\\.") and then iterate over the returned array with a boolean flag.
All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
String[] parts = packageName.split("\\.");
int i = 0;
for (String part : parts) {
if (part.equals("com")) {
break;
}
++i; // count the parts seen before "com"
}
String result = parts[i+1]; // assumes "com" is present and is not the last part
private String getStringAfterComDot(String packageName) {
String strArr[] = packageName.split("\\.");
for(int i=0; i<strArr.length-1; i++){ // stop before the last element so strArr[i+1] is always valid
if(strArr[i].equals("com"))
return strArr[i+1];
}
return "";
}
I have done heaps of projects dealing with website scraping, and I usually just create my own functions/utilities to get the job done. Regex might be overkill sometimes if you just want to extract a substring from a given string like the one you have. Below is the function I normally use for this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length()+1);
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");
I have a string which looks like following:
Turns 13,000,000 years old
Now I want to convert the digits to English words. I have a function ready for that, but I am having trouble detecting the original number (13,000,000 in this case) because it is separated by commas.
Currently I am using the following regex to detect a number in a string:
stats = stats.replace((".*\\d.*"), (NumberToWords.start(Integer.valueOf(notification_data_greet))));
But the above does not seem to work. Any suggestions?
You need to extract the number using a regex which allows for the commas. The most robust one I can think of right now is
\d{1,3}(,?\d{3})*
which matches any unsigned integer both with correctly placed commas and without commas (and odd combinations thereof, like 100,000000).
Then replace every , in the match with the empty String and you can parse as usual:
Pattern p = Pattern.compile("\\d{1,3}(,?\\d{3})*"); // You can store this as static final
Matcher m = p.matcher(input);
while (m.find()) { // Go through all matches
String num = m.group().replace(",", "");
int n = Integer.parseInt(num);
// Do stuff with the number n
}
Working example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) throws InterruptedException {
String input = "1,300,000,000";
Pattern p = Pattern.compile("\\d{1,3}(,?\\d{3})*"); // You can store this as static final
Matcher m = p.matcher(input);
while (m.find()) { // Go through all matches
String num = m.group().replace(",", "");
System.out.println(num);
int n = Integer.parseInt(num);
System.out.println(n);
}
}
}
Gives output
1300000000
1300000000
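To put the number back into the sentence as words, one option is Matcher.appendReplacement. The sketch below takes the converter as a parameter because I don't know the signature of the asker's NumberToWords.start; NumberReplacer, replaceNumbers and toWords are illustrative names, and I parse to long as an assumption so that values above Integer.MAX_VALUE don't fail.

```java
import java.util.function.LongFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class NumberReplacer {
    private static final Pattern NUMBER = Pattern.compile("\\d{1,3}(,?\\d{3})*");

    // Replaces every comma-grouped number in text with toWords(number).
    static String replaceNumbers(String text, LongFunction<String> toWords) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            long n = Long.parseLong(m.group().replace(",", ""));
            // quoteReplacement keeps '$' and '\' in the words from being
            // treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(toWords.apply(n)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, replaceNumbers("Turns 13,000,000 years old", n -> "<" + n + ">") returns "Turns <13000000> years old"; passing the real converter in place of the lambda would give the worded version.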
Try this regex:
[0-9][0-9]?[0-9]?(,?[0-9][0-9][0-9])*
This matches numbers that are separated by a comma at each thousand. So it will match
10,000,000
but not
10,1,1,1
You can do it with the help of DecimalFormat instead of a regular expression
DecimalFormat format = (DecimalFormat) DecimalFormat.getInstance();
System.out.println(format.parse("10,000,000"));
Try the regex below to match comma-separated numbers:
\d{1,3}(,\d{3})+
Make the last part optional to also match numbers which aren't separated by commas:
\d{1,3}(,\d{3})*
I'm attempting to make the following replacement in java
@Test
public void testReplace(){
String str = "1JU3C_2.27.CBT";
String find = "(\\d*)\\.(\\d*)";
String replace = "$1,$2";
String modified = str.replaceAll(find, replace);
System.out.println(modified);
assertEquals("1JU3C_2,27.CBT", modified); //fails
}
However, both full stops seem to be getting replaced. I want to replace only the decimal point between digits (i.e. the expected output is 1JU3C_2,27.CBT).
Use (\\d+)\\.(\\d+) instead of (\\d*)\\.(\\d*).
Your regex asks to replace zero or more digits followed by a dot, followed by zero or more digits. So . in .CBT is matched as it has a dot with zero digits on both sides.
1JU3C_2.27.CBT has two dots with zero or more digits on both sides.
If you want to convert a string like 5.67.8 to 5,67,8, use lazy matching: (\\d+?)\\.(\\d+?).
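The three variants can be compared side by side with a short sketch. DecimalCommaDemo and convert are names chosen only for this illustration.

```java
// Hypothetical helper, named for illustration only.
class DecimalCommaDemo {
    // Applies the given find pattern with the "$1,$2" replacement.
    static String convert(String find, String input) {
        return input.replaceAll(find, "$1,$2");
    }

    public static void main(String[] args) {
        // \d* lets the lone dot before CBT match with empty groups.
        System.out.println(convert("(\\d*)\\.(\\d*)", "1JU3C_2.27.CBT"));
        // \d+ requires at least one digit on each side of the dot.
        System.out.println(convert("(\\d+)\\.(\\d+)", "1JU3C_2.27.CBT"));
        // Greedy \d+ consumes 67, leaving no digit before the second dot.
        System.out.println(convert("(\\d+)\\.(\\d+)", "5.67.8"));
        // Lazy \d+? takes one digit at a time, so both dots are replaced.
        System.out.println(convert("(\\d+?)\\.(\\d+?)", "5.67.8"));
    }
}
```

The four lines printed are 1JU3C_2,27,CBT, then 1JU3C_2,27.CBT, then 5,67.8, then 5,67,8.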
* stands for zero or more times; try replacing it with +
Instead do this:
public void testReplace()
{
String str = "1JU3C_2.27.CBT";
String modified = str.replaceFirst("[.]", ",");
System.out.println(modified);
assertEquals("1JU3C_2,27.CBT", modified);
}
What regular expression can extract the digit sequence from an input string that contains backslashes and other non-digit characters? For example, from
"12\34a56ss7890"
I need to get
1234567890
If we assume you have this in a String. You could do something like:
string = string.replaceAll("\\D", "");
This will remove all non-digit characters from your String.
str.replaceAll("[^\\d]", "");
Footnote: I'm not a Java developer, but the regex itself should be correct.
Sorry for adding another answer, but this won't fit into a comment.
I think this is because of the \34. If I call System.out.print("12\34a56ss7890"); I get the output 12a56ss7890 (with an unprintable character after the 2). This is because Java reads \34 in a string literal as an octal escape for a single character. If the text comes from a file, you can strip the backslashes by first calling this method on your InputStream:
private InputStreamReader replaceBackSlashes() throws Exception {
FileInputStream fis = new FileInputStream(new File("PATH TO A FILE"));
Scanner in = new Scanner(fis, "UTF-8");
ByteArrayOutputStream out = new ByteArrayOutputStream();
while (in.hasNextLine()) {
String nextLine = in.nextLine().replace("\\", ""); // "\\" is a single literal backslash
out.write(nextLine.getBytes());
out.write("\n".getBytes());
}
return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
}
BTW: Sorry for my Edit, but there was a little Mistake in the Code.
After calling this method, convert your InputStream to a String and then call this on the String:
string = string.replaceAll("\\D", "");
This should hopefully work now :)
String num;
String str =" 12\34a56ss7890";
str= str.replace("\34", "34");
String regex = "[\\d]+";
Matcher matcher = Pattern.compile( regex ).matcher( str);
while (matcher.find( ))
{
num = matcher.group();
System.out.print(num);
}
replace \34 by 34 and match the rest using regular expression.
Use a regular expression.
String num;
String str =" 12\34a56ss7890";
str= str.replace("\34", "34");
String regex = "[\\d]+";//match only digits.
Matcher matcher = Pattern.compile( regex ).matcher( str);
while (matcher.find( ))
{
num = matcher.group();
System.out.print(num);
}
The following example:
String a ="1\2sas";
String b ="1\\2sas";
System.out.println(a.replaceAll("[a-zA-Z\\\\]",""));
System.out.println(b.replaceAll("[a-zA-Z\\\\]",""));
gives output:
1X
12
where X is not a X but a little rectangle - a symbol which is shown when the text showing control does not know how to draw it, a so called non printable character.
It is because in String a the "\2" part obviously tries to be interpreted as a single escaped sign "\u0002"- similar to "\n" "\t" - you can see this in debugger (i tried it using NetBeans)
Since the first argument of a replaceAll method is passed to [Pattern.compile](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll(java.lang.String, java.lang.String)) it needs to be escaped twice as opposed to String literal (like b).
So if the String "12\34a56ss7890" looks like this on screen you have printed it out like this:
System.out.println("12\\34a56ss7890");
which is solved in the second example.
However, if the literal is given as "12\34a56ss7890" then I think you can't handle it with a single regexp, because a backslash followed by an octal digit is interpreted as an octal escape (\0 to \7 become \u0000 to \u0007), so the best I can think of is a very ugly solution:
str.replaceAll("\u0000","0").replaceAll("\u0001","1") ... .replaceAll("\u0007","7").replaceAll("[^\\d]", "")
The first eight replacements (\u0000-\u0007) might be rewritten as a for loop to make it look more elegant.
+1 for an EXCELLENT question :)
EDIT:
Actually, if a backslash is followed by more than one digit, they are all interpreted as a single character: up to three octal digits after a backslash are consumed, and a fourth digit is treated as an ordinary digit.
Therefore, my solution is not generally correct, but could be extended to be. I would recommend Robin's solution below as it is far more efficient.
The sequence \34 in the literal "12\34a56ss7890" is an octal escape for a single character, so you could use:
str.replaceAll("\034", "34").replaceAll("\\D", "")
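To check this, here is a small sketch with the illustrative names OctalEscapeDemo and digitsOnly. It only covers the specific \34 case from the question. Other octal escapes would need their own replacement, and a genuine character 28 in the input would also be turned into 34.

```java
// Hypothetical helper, named for illustration only.
class OctalEscapeDemo {
    static String digitsOnly(String s) {
        // "\034" is the single character with octal code 34 (decimal 28).
        // Put the digits 3 and 4 back, then drop every non-digit.
        return s.replace("\034", "34").replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        String literal = "12\34a56ss7890";     // \34 becomes one character
        System.out.println(literal.length());  // 12, not 14
        System.out.println(digitsOnly(literal));
    }
}
```

If the input really contains a backslash (for example because it was read from a file, or written as "12\\34a56ss7890" in source), the first replace does nothing and replaceAll("\\D", "") alone produces 1234567890.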