How can I remove the subdomain part of a URL

I am trying to remove the subdomain and leave only the domain name followed by the extension.
It is difficult to find the subdomain because I do not know how many dots to expect in a URL: some URLs end in .com and some in .co.uk, for example.
How can I remove the subdomain safely, so that foo.bar.com becomes bar.com and foo.bar.co.uk becomes bar.co.uk?
if(!rawUrl.startsWith("http://")&&!rawUrl.startsWith("https://")){
rawUrl = "http://"+rawUrl;
}
String url = new java.net.URL(rawUrl).getHost();
String urlWithoutSub = ???

What you need is a Public Suffix List, such as the one available at https://publicsuffix.org/. Basically, there is no algorithm that can tell you which suffixes are public, so you need a list. And you had better use one that is public and well-maintained.
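To illustrate how such a list is used, here is a minimal sketch of the lookup. It assumes a tiny hand-picked suffix set (SAMPLE_SUFFIXES) in place of the real list, the class and method names are made up for this example, and the wildcard and exception rules of the real Public Suffix List are ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class RegistrableDomainSketch {

    // Assumption: a tiny sample for illustration only. A real program would
    // load the full list published at publicsuffix.org instead.
    private static final Set<String> SAMPLE_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk"));

    /** Returns the longest known suffix plus one more label, or the host unchanged. */
    static String registrableDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Try candidate suffixes from longest to shortest, always leaving
        // at least one label in front of the suffix.
        for (int i = 1; i < labels.length; i++) {
            String candidate =
                    String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SAMPLE_SUFFIXES.contains(candidate)) {
                return labels[i - 1] + "." + candidate;
            }
        }
        // No known suffix matched, so leave the host as it is.
        return host;
    }
}
```

With this sample set, registrableDomain("foo.bar.com") gives bar.com and registrableDomain("foo.bar.co.uk") gives bar.co.uk, because the longer suffix co.uk is tried before uk.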

I just stumbled upon this question and decided to write the following function.
Example Input -> Output:
http://example.com -> http://example.com
http://www.example.com -> http://example.com
ftp://www.a.example.com -> ftp://example.com
SFTP://www.a.example.com -> SFTP://example.com
http://www.a.b.example.com -> http://example.com
http://www.a.c.d.example.com -> http://example.com
http://example.com/ -> http://example.com/
https://example.com/aaa -> https://example.com/aaa
http://www.example.com/aa/bb../d -> http://example.com/aa/bb../d
FILE://www.a.example.com/ddd/dd/../ff -> FILE://example.com/ddd/dd/../ff
HTTPS://www.a.b.example.com/index.html?param=value -> HTTPS://example.com/index.html?param=value
http://www.a.c.d.example.com/#yeah../..! -> http://example.com/#yeah../..!
The same goes for second-level domains:
http://some.thing.co.uk/?ke -> http://thing.co.uk/?ke
something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk -> something.co.uk
https://www.something.co.uk -> https://something.co.uk
Code:
public static String removeSubdomains(String url, ArrayList<String> secondLevelDomains) {
// We need our URL in three parts, protocol - domain - path
String protocol= getProtocol(url);
url = url.substring(protocol.length());
String urlDomain=url;
String path="";
if(urlDomain.contains("/")) {
int slashPos = urlDomain.indexOf("/");
path=urlDomain.substring(slashPos);
urlDomain=urlDomain.substring(0, slashPos);
}
// Done, now let us count the dots.
// Strng is a helper class that is not shown in this answer
int dotCount = Strng.countOccurrences(urlDomain, ".");
// example.com <-- nothing to cut
if(dotCount==1){
return protocol+url;
}
int dotOffset=2; // subdomain.example.com <-- default case, we want to remove everything before the 2nd last dot
// however, somebody had the glorious idea, to have second level domains, such as co.uk
for (String secondLevelDomain : secondLevelDomains) {
// we need to check if our domain ends with a second level domain
// example: something.co.uk we don't want to cut away "something", since it isn't a subdomain, but the actual domain
// the leading dot stops a host such as foo.bco.uk from matching co.uk
if(urlDomain.endsWith("." + secondLevelDomain)) {
// we increase the dot offset with the amount of dots in the second level domain (co.uk = +1)
dotOffset += Strng.countOccurrences(secondLevelDomain, ".");
break;
}
}
// if we have something.co.uk, we have an offset of 3, but only 2 dots, hence nothing to remove
if(dotOffset>dotCount) {
return protocol+urlDomain+path;
}
// if we have sub.something.co.uk, we have an offset of 3 and 3 dots, so we remove "sub"
int pos = Strng.nthLastIndexOf(dotOffset, ".", urlDomain)+1;
urlDomain = urlDomain.substring(pos);
return protocol+urlDomain+path;
}
public static String getProtocol(String url) {
String containsProtocolPattern = "^([a-zA-Z]*:\\/\\/)|^(\\/\\/)";
Pattern pattern = Pattern.compile(containsProtocolPattern);
Matcher m = pattern.matcher(url);
if (m.find()) {
return m.group();
}
return "";
}
public static ArrayList<String> getPublicSuffixList(boolean loadFromPublicSuffixOrg) {
    ArrayList<String> secondLevelDomains = new ArrayList<String>();
    if (!loadFromPublicSuffixOrg) {
        // Small built-in fallback list, used instead of downloading the full list
        java.util.Collections.addAll(secondLevelDomains,
            "co.uk", "ac.uk", "gov.uk", "ltd.uk",
            "co.at", "or.at", "ac.at", "gv.at",
            "fed.us", "isa.us", "nsn.us", "dni.us",
            "ac.ru", "com.ru", "edu.ru", "gov.ru", "int.ru", "mil.ru", "net.ru", "org.ru", "pp.ru",
            "com.au", "net.au", "org.au", "edu.au", "gov.au");
        return secondLevelDomains;
    }
    try {
        // URLHelpers is a helper class that is not shown in this answer
        String a = URLHelpers.getHTTP("https://publicsuffix.org/list/public_suffix_list.dat", false, true);
        Scanner scanner = new Scanner(a);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            // skip comments, wildcard rules and single-label suffixes
            if (!line.startsWith("//") && !line.startsWith("*") && line.contains(".")) {
                secondLevelDomains.add(line);
            }
        }
        scanner.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return secondLevelDomains;
}
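The Strng helper used in removeSubdomains is not part of the answer. The following is a minimal sketch of what its two methods might look like, assuming the argument order seen in the calls above. The class name StrngSketch and the exact semantics (non-overlapping matches, -1 when there are too few occurrences) are assumptions, not the author's implementation.

```java
final class StrngSketch {

    /** Counts non-overlapping occurrences of needle in haystack. */
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    /**
     * Returns the index of the n-th occurrence of needle counted from the end
     * of haystack, or -1 if n is not positive or there are fewer than n occurrences.
     */
    static int nthLastIndexOf(int n, String needle, String haystack) {
        if (n <= 0 || needle.isEmpty()) {
            return -1;
        }
        int pos = haystack.length();
        for (int i = 0; i < n; i++) {
            // Search backwards, starting just before the previous match.
            pos = haystack.lastIndexOf(needle, pos - needle.length());
            if (pos == -1) {
                return -1;
            }
        }
        return pos;
    }
}
```

For sub.something.co.uk and an offset of 3, nthLastIndexOf returns the index of the dot after "sub", so adding 1 and calling substring leaves something.co.uk, which matches the comments in the answer.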

Related

Extracting Operation(...); and sub Operation from String using REGEX

I have an issue with a regex in Java for Android.
I would like to retrieve the first operation (and each sub-operation), as in the following samples:
"OPERATION(ASYNC_OPERATION,_RFID_ITEM_SERIAL);"
"OPERATION(CONCAT,~1261,01,OPERATION(ASYNC_OPERATION,_RFID_ITEM_ID);,21,OPERATION(ASYNC_OPERATION,_RFID_ITEM_SERIAL););"
As you can see, each operation can have sub-operations, and that is where I am running into problems.
Currently I am using this regex: ^\s*(OPERATION\s*\(\s*)(.*)(\);)
but the index of ");" returned is always the last one, and when there are two sub-operations inside a "main" operation, this is wrong.
private static Pattern operationPattern=Pattern.compile("^\\s*(OPERATION\\s*\\(\\s*)(.*)(\\);)",Pattern.CASE_INSENSITIVE);
public Operation(String text){
parseOperationText(text);
}
private void parseOperationText(String text){
String strText = text.replace("#,", "§");
Matcher matcher=operationPattern.matcher(strText);
if(matcher.find()) {
//This is an OPERATION
subOperations=new ArrayList<>();
String strChain = matcher.group(2);//Should only contain the text between "OPERATION(" and ");"
int commaIdx = strChain.indexOf(",");
if (commaIdx == -1) {
//Operation without parameter
operationType = strChain;
} else {
//Operation with parameters
operationType = strChain.substring(0, commaIdx);
strChain = strChain.substring(commaIdx + 1);
while (strChain.length()>0) {
matcher = operationPattern.matcher(strChain);
if (matcher.find()) {
String subOpText=matcher.group(0);
strChain=StringUtils.stripStart(strChain.substring(matcher.end())," ");
if(strChain.startsWith(",")){
strChain=strChain.substring(1);
}
subOperations.add(new Operation(subOpText));
}
else{
commaIdx = strChain.indexOf(",");
if(commaIdx==-1)
{
subOperations.add(new Operation(strChain));
strChain="";
}
else{
subOperations.add(new Operation(strChain.substring(0,commaIdx)));
strChain=strChain.substring(commaIdx+1);
}
}
}
}
}
else {
//Not an operation
//...
}
}
It works for sample 1, but for sample 2, after finding the "main" operation (CONCAT in the sample), the second match returns this:
OPERATION(ASYNC_OPERATION,_RFID_ITEM_ID);,21,OPERATION(ASYNC_OPERATION,_RFID_ITEM_SERIAL);
What I would like to retrieve is this:
"CONCAT,~1261,01,OPERATION(ASYNC_OPERATION,_RFID_ITEM_ID);,21,OPERATION(ASYNC_OPERATION,_RFID_ITEM_SERIAL);"
"ASYNC_OPERATION,_RFID_ITEM_ID"
"ASYNC_OPERATION,_RFID_ITEM_SERIAL"

You could use this
"(?s)(?=OPERATION\\s*\\()(?:(?=.*?OPERATION\\s*\\((?!.*?\\1)(.*\\)(?!.*\\2).*))(?=.*?\\)(?!.*?\\2)(.*)).)+?.*?(?=\\1)(?:(?!OPERATION\\s*\\().)*(?=\\2$)"
to find the balanced OPERATION( ) string in group 0.
https://regex101.com/r/EsaDtC/1
Then use this
(?s)^OPERATION\((.*?)\)$
on that last matched string to get the inner contents of the
operation, which is in group 1.
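Because nesting depth is awkward to express in a Java regex, another option is to drop the regex and count parentheses by hand. The following is only a sketch of that idea, not the approach taken in the answers here. The names OperationScanSketch, innerContent and splitTopLevel are hypothetical, the prefix is assumed to be the literal upper-case text "OPERATION(" with no whitespace before the parenthesis, and the "#," escape from the question is not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class OperationScanSketch {

    private static final String PREFIX = "OPERATION(";

    /** Returns the text between "OPERATION(" and its matching ")", or null if there is none. */
    static String innerContent(String text) {
        String trimmed = text.trim();
        if (!trimmed.startsWith(PREFIX)) {
            return null;
        }
        int depth = 0;
        // Start on the opening parenthesis of the prefix so that depth becomes 1.
        for (int i = PREFIX.length() - 1; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return trimmed.substring(PREFIX.length(), i);
            }
        }
        return null; // unbalanced parentheses
    }

    /** Splits on commas that are not nested inside parentheses. */
    static List<String> splitTopLevel(String content) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(content.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(content.substring(start));
        return parts;
    }
}
```

Calling innerContent on the second sample yields the CONCAT,... string from the question, splitTopLevel breaks it into six parameters, and calling innerContent again on each parameter that starts with OPERATION( yields the two ASYNC_OPERATION strings.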

In the end I am using two different regexes:
//The first regex captures the main operation's content (group 2):
\s*(OPERATION\s*\(\s*)(.*)(\);)
//The second regex captures the next full sub "OPERATION(...);" (group 0):
^(?:\s*(OPERATION\s*\(\s*))(.*)(?:\)\s*\;\s*)(?=\,)|^(?:\s*(OPERATION\s*\(\s*))(.*)(?:\)\s*\;\s*)$
Then I can use the first regex to detect whether this is an operation (matcher.find()) and capture its content in group 2. For each parameter (separated by commas) I check with the second regex whether it is a sub-operation. If it is, I recursively call the same function, which uses the first regex again, and so on.
private static Pattern operationPattern=Pattern.compile("^\\s*(OPERATION\\s*\\(\\s*)(.*)(\\);)",Pattern.CASE_INSENSITIVE);
private static Pattern subOperationPattern=Pattern.compile("^(?:\\s*(OPERATION\\s*\\(\\s*))(.*)(?:\\)\\s*\\;\\s*)(?=\\,)|^(?:\\s*(OPERATION\\s*\\(\\s*))(.*)(?:\\)\\s*\\;\\s*)$",Pattern.CASE_INSENSITIVE);
private void parseOperationText(String strText ){
Matcher matcher=operationPattern.matcher(strText);
if(matcher.find()) {
//This is an OPERATION
subOperations=new ArrayList<>();
String strChain = matcher.group(2);
int commaIdx = strChain.indexOf(",");
if (commaIdx == -1) {
//Operation without parameter
operationType = strChain;
} else {
//Operation with parameters
operationType = strChain.substring(0, commaIdx);
strChain = strChain.substring(commaIdx + 1);
while (strChain.length()>0) {
matcher = subOperationPattern.matcher(strChain);
if (matcher.find()) {
String subOpText=matcher.group(0);
strChain=StringUtils.stripStart(strChain.substring(matcher.end())," ");
if(strChain.startsWith(",")){
strChain=strChain.substring(1);
}
subOperations.add(new Operation(subOpText));
}
else{
commaIdx = strChain.indexOf(",");
if(commaIdx==-1)
{
subOperations.add(new Operation(strChain));
strChain="";
}
else{
subOperations.add(new Operation(strChain.substring(0,commaIdx)));
strChain=strChain.substring(commaIdx+1);
}
}
}
}
}
else {
//Fixed value: we store the value as is
fieldValue = strText;
operationType = OperationType.NONE;
}
}
public Operation(String text){
parseOperationText(text);
}

(Swift) How to remove characters more than once using IndexOf

I'm making an iOS app that parses JSON data from a Google spreadsheet. One of the issues with Google's JSON data is that it includes unnecessary data that has to be removed. I'm new to iOS programming.
/*O_o*/google.visualization.Query.setResponse({"version":"0.6","reqId":"0","status":"ok","sig":"1400846503","table":{JSON DATA I NEED}});
I have done this in Java on Android using this code:
int start = result.indexOf("{", result.indexOf("{") + 1);
int end = result.lastIndexOf("}");
String jsonResponse = result.substring(start, end);
My Swift code:
var something = "My google JSON Data"
let Start = String(something).characters.indexOf("{")!;
let substring1: String = something.substringFromIndex(Start);
something = substring1;
let End = String(something).characters.indexOf(")")!.distanceTo(something.endIndex);
let index3 = something.endIndex.advancedBy(-End);
let substring4: String = something.substringToIndex(index3)
What I'm asking is: how do I get the index of the second "{"?

You should use NSJSONSerialization, but if you want to do it your way:
extension String {
func indexOf(target: String) -> Int {
if let range = self.rangeOfString(target) {
return self.startIndex.distanceTo(range.startIndex)
} else {
return -1
}
}
func indexOf(target: String, startIndex: Int) -> Int {
let startRange = self.startIndex.advancedBy(startIndex)
if let range = self.rangeOfString(target, options: .LiteralSearch, range: startRange..<self.endIndex) {
return self.startIndex.distanceTo(range.startIndex)
} else {
return -1
}
}
}
// index of the second "{", i.e. where the JSON you need starts
let start = myString.indexOf("{", startIndex: myString.indexOf("{") + 1)

Lucene multi word tokens with delimiter

I am just starting with Lucene, so this is probably a beginner's question. We are trying to implement a semantic search on digital books and already have a concept generator, so for example the concepts I generate for a new article could be:
|Green Beans | Spring Onions | Cooking |
I am using Lucene to create an index on the books/articles using only the extracted concepts (stored in a temporary document for that purpose). Now the standard analyzer is creating single word tokens: Green, Beans, Spring, Onions, Cooking, which of course is not the same.
My question: is there an analyzer that is able to detect delimiters around tokens (|| in our example), or an analyzer that is able to detect multi-word constructs?
I'm afraid we'll have to create our own analyzer, but I don't quite know where to start for that one.

Creating an analyzer is pretty easy. An analyzer is just a tokenizer optionally followed by token filters. In your case, you'd have to create your own tokenizer. Fortunately, you have a convenient base class for this: CharTokenizer.
You implement the isTokenChar method so that it returns false for the | character and true for any other character. Every run of characters between two | delimiters is then treated as a single token.
Once you have the tokenizer, the analyzer should be straightforward, just look at the source code of any existing analyzer and do likewise.
Oh, and if you can have spaces between your | chars, just add a TrimFilter to the analyzer.
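To make the tokenizing rule concrete without pulling in Lucene, here is a library-free sketch that imitates it. It is not Lucene code: PipeTokenSketch and tokenize are hypothetical names, isTokenChar only mirrors the predicate described above, and the trimming stands in for what a TrimFilter would do. Dropping tokens that are blank after trimming is an extra choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class PipeTokenSketch {

    // Mirrors the rule described above: every character except '|' belongs to a token.
    static boolean isTokenChar(int c) {
        return c != '|';
    }

    /** Groups consecutive token characters and trims each resulting token. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else {
                flush(tokens, current);
            }
        }
        flush(tokens, current);
        return tokens;
    }

    // Emits the pending token (trimmed) unless it is blank, then resets the buffer.
    private static void flush(List<String> tokens, StringBuilder current) {
        String token = current.toString().trim();
        if (!token.isEmpty()) {
            tokens.add(token);
        }
        current.setLength(0);
    }
}
```

For the input from the question, "|Green Beans | Spring Onions | Cooking |", this yields the three tokens Green Beans, Spring Onions and Cooking.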

I came across this question because I am doing something with my Lucene mechanisms which creates data structures to do with sequencing, in effect "hijacking" the Lucene classes. Otherwise I can't imagine why people would want knowledge of the separators ("delimiters") between tokens, but as it was quite tricky I thought I'd put it here for the benefit of anyone who might need to.
You have to rewrite your own versions of StandardTokenizer and StandardTokenizerImpl. These are both final classes so you can't extend them.
SeparatorDeliveringTokeniserImpl (tweaked from source of StandardTokenizerImpl):
3 new fields:
private int startSepPos = 0;
private int endSepPos = 0;
private String originalBufferAsString;
Tweak these methods:
public final void getText(CharTermAttribute t) {
t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos - zzStartRead);
if( originalBufferAsString == null ){
originalBufferAsString = new String( zzBuffer, 0, zzBuffer.length );
}
// startSepPos == -1 is a "flag condition": it means that this token is the last one and it won't be followed by a sep
if( startSepPos != -1 ){
// if the flag is NOT set, record the start pos of the next sep...
startSepPos = zzMarkedPos;
}
}
public final void yyreset(java.io.Reader reader) {
zzReader = reader;
zzAtBOL = true;
zzAtEOF = false;
zzEOFDone = false;
zzEndRead = zzStartRead = 0;
zzCurrentPos = zzMarkedPos = 0;
zzFinalHighSurrogate = 0;
yyline = yychar = yycolumn = 0;
zzLexicalState = YYINITIAL;
if (zzBuffer.length > ZZ_BUFFERSIZE)
zzBuffer = new char[ZZ_BUFFERSIZE];
// reset fields responsible for delivering separator...
originalBufferAsString = null;
startSepPos = 0;
endSepPos = 0;
}
(inside getNextToken:)
if ((zzAttributes & 1) == 1) {
zzAction = zzState;
zzMarkedPosL = zzCurrentPosL;
if ((zzAttributes & 8) == 8) {
// every occurrence of a separator char leads here...
endSepPos = zzCurrentPosL;
break zzForAction;
}
}
And make a new method:
String getPrecedingSeparator() {
String sep = null;
if( originalBufferAsString == null ){
sep = new String( zzBuffer, 0, endSepPos );
}
else if( startSepPos == -1 || endSepPos <= startSepPos ){
sep = "";
}
else {
sep = originalBufferAsString.substring( startSepPos, endSepPos );
}
if( zzMarkedPos < startSepPos ){
// ... then this is a sign that the next token will be the last one... and will NOT have a trailing separator
// so set a "flag condition" for next time this method is called
startSepPos = -1;
}
return sep;
}
SeparatorDeliveringTokeniser (tweaked from source of StandardTokenizer):
Add this:
private String separator;
String getSeparator(){
// normally this delivers a preceding separator... but after incrementToken returns false, if there is a trailing
// separator, it then delivers that...
return separator;
}
(inside incrementToken:)
while(true) {
int tokenType = scanner.getNextToken();
// added NB this gives you the separator which PRECEDES the token
// which you are about to get from scanner.getText( ... )
separator = scanner.getPrecedingSeparator();
if (tokenType == SeparatorDeliveringTokeniserImpl.YYEOF) {
// NB at this point sep is equal to the trailing separator...
return false;
}
...
Usage:
In my FilteringTokenFilter subclass, called TokenAndSeparatorExamineFilter, the methods accept and end look like this:
@Override
public boolean accept() throws IOException {
String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
// a preceding separator can only be an empty String if we are currently
// dealing with the first token and if the sequence starts with a token
if (!sep.isEmpty()) {
// ... do something with the preceding separator
}
// then get the token...
String token = getTerm();
// ... do something with the token
// my filter does no filtering! Every token is accepted...:
return true;
}
@Override
public void end() throws IOException {
// deals with trailing separator at the end of a sequence of tokens and separators (if there is one, i.e. if it doesn't end with a token)
String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
// NB will be an empty String if there is no trailing separator
if (!sep.isEmpty()) {
// ... do something with this trailing separator
}
}

Fetch all the hyperlinks from a webpage and recursively doing that in java

1. Fetch all contents from a webpage.
2. Fetch the hyperlinks from the webpage.
3. Repeat steps 1 and 2 for each fetched hyperlink.
4. Repeat the process until 200 hyperlinks are registered or there are no more hyperlinks to fetch.
I wrote a sample program, but due to my poor understanding of recursion it ended up in an infinite loop.
Please suggest how to fix the code so that it meets these expectations.
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Content
{
private static final String HTML_A_HREF_TAG_PATTERN =
"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
Pattern pattern;
public Content ()
{
pattern = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
}
private void fetchContentFromURL(String strLink) {
String content = null;
URLConnection connection = null;
try {
connection = new URL(strLink).openConnection();
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter("\\Z");
content = scanner.next();
}catch ( Exception ex ) {
ex.printStackTrace();
return;
}
fetchURL(content);
}
private void fetchURL ( String content )
{
Matcher matcher = pattern.matcher( content );
while(matcher.find()) {
String group = matcher.group();
if(group.toLowerCase().contains( "http" ) || group.toLowerCase().contains( "https" )) {
group = group.substring( group.indexOf( "=" )+1 );
group = group.replaceAll( "'", "" );
group = group.replaceAll( "\"", "" );
System.out.println("lINK "+group);
fetchContentFromURL(group);
}
}
System.out.println("DONE");
}
/**
* @param args
*/
public static void main ( String[] args )
{
new Content().fetchContentFromURL( "http://www.google.co.in" );
}
}
I am open to any other solution as well, but I want to stick to the core Java API only, with no third-party libraries.

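One possible option here is to remember all visited links to avoid cyclic paths. Here's how to achieve it with an additional Set that stores the already visited links:
import java.net.URL;
import java.net.URLConnection;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Content {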
private static final String HTML_A_HREF_TAG_PATTERN =
"\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";
private Pattern pattern;
private Set<String> visitedUrls = new HashSet<String>();
public Content() {
pattern = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
}
private void fetchContentFromURL(String strLink) {
String content = null;
URLConnection connection = null;
try {
connection = new URL(strLink).openConnection();
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter("\\Z");
if (scanner.hasNext()) {
content = scanner.next();
visitedUrls.add(strLink);
fetchURL(content);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
private void fetchURL(String content) {
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
String group = matcher.group();
if (group.toLowerCase().contains("http") || group.toLowerCase().contains("https")) {
group = group.substring(group.indexOf("=") + 1);
group = group.replaceAll("'", "");
group = group.replaceAll("\"", "");
System.out.println("lINK " + group);
if (!visitedUrls.contains(group) && visitedUrls.size() < 200) {
fetchContentFromURL(group);
}
}
}
System.out.println("DONE");
}
/**
* @param args
*/
public static void main(String[] args) {
new Content().fetchContentFromURL("http://www.google.co.in");
}
}
I also fixed some other issues in the fetching logic; now it works as expected.

Inside the fetchContentFromURL method you should record which URL you are currently fetching, and skip it if it has already been fetched. Otherwise two pages A and B that link to each other will cause your code to keep fetching forever.

In addition to JK1's answer: to achieve target 4 of your question, you might want to maintain the count of hyperlinks as an instance variable. Rough pseudocode might look like this (you can adjust the exact count; alternatively, you can use the size of the HashSet to find out how many hyperlinks your program has parsed so far):
if (!visitedUrls.contains(group) && noOfHyperlinksVisited++ < 200) {
fetchContentFromURL(group);
}
However, I was not sure whether you want a total of 200 hyperlinks or want to traverse to a depth of 200 links from the starting page. If it is the latter, you might wish to explore breadth-first search, which will let you know when you have reached your target depth.
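As a rough sketch of the breadth-first idea combined with the visited set, the following uses a queue instead of recursion and caps the total number of registered URLs. It is an illustration under assumptions, not a drop-in replacement: BfsCrawlSketch and crawl are hypothetical names, and linksOf stands in for "download the page and extract its hyperlinks", so the example makes no network calls. Tracking depth as well would require storing a depth value next to each queued URL.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

final class BfsCrawlSketch {

    /**
     * Visits pages breadth-first starting at startUrl, never registering a URL
     * twice, and stops once maxPages URLs have been registered.
     */
    static Set<String> crawl(String startUrl, int maxPages,
                             Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>(); // keeps discovery order
        Queue<String> queue = new ArrayDeque<>();
        if (maxPages > 0) {
            visited.add(startUrl);
            queue.add(startUrl);
        }
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String link : linksOf.apply(current)) {
                if (visited.size() >= maxPages) {
                    return visited; // limit reached
                }
                // add() returns false for URLs seen before, which breaks cycles
                if (visited.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited; // no more hyperlinks to fetch
    }
}
```

Because every URL enters the queue at most once, two pages that link to each other cannot cause an endless loop, and the loop ends either at the limit or when the queue runs empty.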

implementing Public Suffix extraction using java

I need to extract the top domain of a URL, and I found http://publicsuffix.org/index.html.
The Java implementation is in http://guava-libraries.googlecode.com, but I could not find
any example of extracting the domain name.
For example,
example.google.com
should return google.com
and bing.bing.bing.com
should return bing.com.
Can anyone show me, with an example, how to implement this using this library?

It looks to me like InternetDomainName.topPrivateDomain() does exactly what you want. Guava maintains a list of public suffixes (based on Mozilla's list at publicsuffix.org) that it uses to determine what the public suffix part of the host is... the top private domain is the public suffix plus its first child.
Here's a quick example:
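import com.google.common.collect.ImmutableList;
import com.google.common.net.InternetDomainName;
import java.net.URI;
import java.net.URISyntaxException;
public class Test {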
public static void main(String[] args) throws URISyntaxException {
ImmutableList<String> urls = ImmutableList.of(
"http://example.google.com", "http://google.com",
"http://bing.bing.bing.com", "http://www.amazon.co.jp/");
for (String url : urls) {
System.out.println(url + " -> " + getTopPrivateDomain(url));
}
}
private static String getTopPrivateDomain(String url) throws URISyntaxException {
String host = new URI(url).getHost();
InternetDomainName domainName = InternetDomainName.from(host);
// toString() returns the domain name; older Guava versions also offered name()
return domainName.topPrivateDomain().toString();
}
}
Running this code prints:
http://example.google.com -> google.com
http://google.com -> google.com
http://bing.bing.bing.com -> bing.com
http://www.amazon.co.jp/ -> amazon.co.jp

I recently implemented a Public Suffix List API:
PublicSuffixList suffixList = new PublicSuffixListFactory().build();
assertEquals(
"google.com", suffixList.getRegistrableDomain("example.google.com"));
assertEquals(
"bing.com", suffixList.getRegistrableDomain("bing.bing.bing.com"));
assertEquals(
"amazon.co.jp", suffixList.getRegistrableDomain("www.amazon.co.jp"));

EDIT: Sorry, I was a little too fast. I didn't think of co.jp, co.uk, and so on. You will need to get a list of possible TLDs from somewhere. You could also take a look at http://commons.apache.org/validator/ to validate a TLD.
I think something like this should work, but maybe there is a standard Java function for it.
String url = "http://www.foobar.com/someFolder/index.html";
if (url.contains("://")) {
url = url.split("://")[1];
}
if (url.contains("/")) {
url = url.split("/")[0];
}
// You need to get your TLDs from somewhere...
List<String> magicListofTLD = getTLDsFromSomewhere();
int positionOfTLD = -1;
String usedTLD = null;
for (String tld : magicListofTLD) {
// match the TLD only at the end of the host, preceded by a dot
if (url.endsWith("." + tld)) {
positionOfTLD = url.length() - tld.length() - 1;
usedTLD = tld;
break;
}
}
if (positionOfTLD > 0) {
url = url.substring(0, positionOfTLD);
} else {
return;
}
String[] strings = url.split("\\.");
String foo = strings[strings.length - 1] + "." + usedTLD;
System.out.println(foo);
