Lucene multi-word tokens with delimiter - Java

I am just starting with Lucene, so this is probably a beginner's question. We are trying to implement a semantic search on digital books and already have a concept generator; for example, the concepts generated for a new article could be:
|Green Beans | Spring Onions | Cooking |
I am using Lucene to create an index on the books/articles using only the extracted concepts (stored in a temporary document for that purpose). The standard analyzer creates single-word tokens: Green, Beans, Spring, Onions, Cooking, which of course is not the same thing.
My question: is there an analyzer that can detect delimiters around tokens (the | characters in our example), or an analyzer that can detect multi-word constructs?
I'm afraid we'll have to create our own analyzer, but I don't quite know where to start.

Creating an analyzer is pretty easy. An analyzer is just a tokenizer optionally followed by token filters. In your case, you'd have to create your own tokenizer. Fortunately, you have a convenient base class for this: CharTokenizer.
You implement the isTokenChar method and make sure it returns false on the | character and true on any other character. Everything else will be considered part of a token.
Once you have the tokenizer, the analyzer should be straightforward, just look at the source code of any existing analyzer and do likewise.
Oh, and if you can have spaces between your | chars, just add a TrimFilter to the analyzer.
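For what it's worth, here is a minimal sketch of that approach, written against a Lucene 8.x-style API (package names move around between Lucene versions, so treat the imports as an outline rather than drop-in code):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.TrimFilter;
import org.apache.lucene.analysis.util.CharTokenizer;

public final class PipeDelimitedAnalyzer extends Analyzer {

    // Splits only on '|'; every other character, including spaces,
    // is part of a token, so "Green Beans" survives as one token.
    static final class PipeTokenizer extends CharTokenizer {
        @Override
        protected boolean isTokenChar(int c) {
            return c != '|';
        }
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new PipeTokenizer();
        TokenStream result = new TrimFilter(source); // strips the spaces around the pipes
        return new TokenStreamComponents(source, result);
    }
}

Indexing "|Green Beans | Spring Onions | Cooking |" with this analyzer should yield the tokens "Green Beans", "Spring Onions" and "Cooking".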

I came across this question because I am doing something unusual with my Lucene mechanisms: building data structures to do with sequencing, in effect "hijacking" the Lucene classes. Otherwise I can't imagine why people would want knowledge of the separators ("delimiters") between tokens, but as it was quite tricky I thought I'd put it here for the benefit of anyone who might need it.
You have to rewrite your own versions of StandardTokenizer and StandardTokenizerImpl. These are both final classes so you can't extend them.
SeparatorDeliveringTokeniserImpl (tweaked from source of StandardTokenizerImpl):
3 new fields:
private int startSepPos = 0;
private int endSepPos = 0;
private String originalBufferAsString;
Tweak these methods:
public final void getText(CharTermAttribute t) {
    t.copyBuffer(zzBuffer, zzStartRead, zzMarkedPos - zzStartRead);
    if (originalBufferAsString == null) {
        originalBufferAsString = new String(zzBuffer, 0, zzBuffer.length);
    }
    // startSepPos == -1 is a "flag condition": it means that this token is the last one and it won't be followed by a sep
    if (startSepPos != -1) {
        // if the flag is NOT set, record the start pos of the next sep...
        startSepPos = zzMarkedPos;
    }
}
public final void yyreset(java.io.Reader reader) {
    zzReader = reader;
    zzAtBOL = true;
    zzAtEOF = false;
    zzEOFDone = false;
    zzEndRead = zzStartRead = 0;
    zzCurrentPos = zzMarkedPos = 0;
    zzFinalHighSurrogate = 0;
    yyline = yychar = yycolumn = 0;
    zzLexicalState = YYINITIAL;
    if (zzBuffer.length > ZZ_BUFFERSIZE) {
        zzBuffer = new char[ZZ_BUFFERSIZE];
    }
    // reset fields responsible for delivering the separator...
    originalBufferAsString = null;
    startSepPos = 0;
    endSepPos = 0;
}
(inside getNextToken:)
if ((zzAttributes & 1) == 1) {
    zzAction = zzState;
    zzMarkedPosL = zzCurrentPosL;
    if ((zzAttributes & 8) == 8) {
        // every occurrence of a separator char leads here...
        endSepPos = zzCurrentPosL;
        break zzForAction;
    }
}
And make a new method:
String getPrecedingSeparator() {
    String sep = null;
    if (originalBufferAsString == null) {
        sep = new String(zzBuffer, 0, endSepPos);
    }
    else if (startSepPos == -1 || endSepPos <= startSepPos) {
        sep = "";
    }
    else {
        sep = originalBufferAsString.substring(startSepPos, endSepPos);
    }
    if (zzMarkedPos < startSepPos) {
        // ... then this is a sign that the next token will be the last one and will NOT have a trailing separator,
        // so set a "flag condition" for the next time this method is called
        startSepPos = -1;
    }
    return sep;
}
SeparatorDeliveringTokeniser (tweaked from source of StandardTokenizer):
Add this:
private String separator;

String getSeparator() {
    // normally this delivers a preceding separator... but after incrementToken returns false, if there is a trailing
    // separator, it then delivers that...
    return separator;
}
(inside incrementToken:)
while (true) {
    int tokenType = scanner.getNextToken();
    // added: NB this gives you the separator which PRECEDES the token
    // which you are about to get from scanner.getText( ... )
    separator = scanner.getPrecedingSeparator();
    if (tokenType == SeparatorDeliveringTokeniserImpl.YYEOF) {
        // NB at this point sep is equal to the trailing separator...
        return false;
    }
    ...
Usage:
In my FilteringTokenFilter subclass, called TokenAndSeparatorExamineFilter, the methods accept and end look like this:
@Override
public boolean accept() throws IOException {
    String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
    // a preceding separator can only be an empty String if we are currently
    // dealing with the first token and if the sequence starts with a token
    if (!sep.isEmpty()) {
        // ... do something with the preceding separator
    }
    // then get the token...
    String token = getTerm();
    // ... do something with the token
    // my filter does no filtering! Every token is accepted...:
    return true;
}
@Override
public void end() throws IOException {
    // deals with the trailing separator at the end of a sequence of tokens and separators
    // (if there is one, i.e. if the sequence doesn't end with a token)
    String sep = ((SeparatorDeliveringTokeniser) input).getSeparator();
    // NB will be an empty String if there is no trailing separator
    if (!sep.isEmpty()) {
        // ... do something with this trailing separator
    }
}

Related

The operator && is undefined for the argument type(s) String, boolean in Java?

I'm a junior Java developer who has been entrusted with a Java tool, and I have the following problem:
The tool takes two CSV files with specific fields as input, and generates two CSV files as output (the first and the second output). Both output files have the same fields; the first output is based on some conditions and the second on others. The two output files contain different reconciliations for some of the data, and some records in the file share the same ID.
Example:
record1 = ID10-name One,
record2 = ID10-Blue Two,
record3 = ID10-name Three
One of the conditions is as follows:
if (line.getName().toLowerCase().contains("Blue".toLowerCase())
        || line.getName().equalsIgnoreCase("Orange")) {
    return true;
}
The method that implements this returns a boolean, and all the logic of the tool is based on it. The tool processes the file line by line:
Iterator<BaseElaborazione> itElab = result.iterator();
while (itElab.hasNext()) {
    BaseElaborazione riga = itElab.next();
In the SECOND output file I find a line/record whose name begins with Blue. The tool rightly takes that line and inserts it into the second output file, because every record whose name (getName) contains Blue or Orange goes there.
Instead, I should group together all the lines with the same ID, even if only one of them has a name containing Blue.
Currently the tool does this:
FIRST FILE OUTPUT
record1 = ID10-name One
record3 = ID10-name Three
SECOND FILE OUTPUT
record2 = ID10-Blue Two
The expected output is:
FIRST FILE OUTPUT
nothing, because the group of records with this ID contains a name with Blue
SECOND FILE OUTPUT
record1 = ID10-name One
record2 = ID10-Blue Two
record3 = ID10-name Three
I think it should be something like this, but it doesn't work:
if (line.getID() && line.getCollector().toLowerCase().contains("Blue".toLowerCase())
        || line.getName().equalsIgnoreCase("black")) {
    return true;
}
How can I group the lines with the same ID in Java, and apply the exclusion to the output?
CODE
Output
private void creaCSVOutput() throws IOException, CsvDataTypeMismatchException, CsvRequiredFieldEmptyException, ParseException {
    Writer writerOutput = new FileWriter(pathOutput);
    Writer writerEsclusi = new FileWriter(pathOutputEsclusi);
    StatefulBeanToCsv<BaseElaborazione> beanToCsv = new StatefulBeanToCsvBuilder<BaseElaborazione>(writerOutput)
            .withSeparator(';').withQuotechar('"').build();
    StatefulBeanToCsv<BaseElaborazione> beanToCsvEsclusi = new StatefulBeanToCsvBuilder<BaseElaborazione>(writerEsclusi)
            .withSeparator(';').withQuotechar('"').build();
    beanToCsv.write(CsvHelper.genHeaderBeanBase());
    beanToCsvEsclusi.write(CsvHelper.genHeaderBeanBase());
    Iterator<BaseElaborazione> itElab = result.iterator();
    while (itElab.hasNext()) {
        BaseElaborazione riga = itElab.next();
        // ... some setters, ifs and conditions etc. ...
        esclusi.add(riga);
        itElab.remove();
    }
    for (BaseElaborazione riga : result) {
        if (riga.getNota() == null || riga.getNota().isEmpty()) {
            riga.setNota(mapNota.get(cuvNota.get(riga.getCuv())));
        }
        beanToCsv.write(riga);
    }
    for (BaseElaborazione riga : esclusi) {
        if (riga.getNota() == null || riga.getNota().isEmpty()) {
            riga.setNota(mapNota.get(cuvNota.get(riga.getCuv())));
        }
        beanToCsvEsclusi.write(riga);
    }
    writerOutput.close();
    writerEsclusi.close();
}
The method for the esclusi (the second output):
private boolean checkPerimetroJunk(BaseElaborazione riga) {
    if (riga.getMercato().toLowerCase().contains("Energia Libero".toLowerCase())) {
        if (riga.getStrategia().toLowerCase().startsWith("STRATEGIA FO".toLowerCase())
                || riga.getStrategia().toLowerCase().contains("CREDITI CEDUTI".toLowerCase())
                || riga.getAttivita().equalsIgnoreCase("Proposta di Recupero Stragiudiziale FO")
                || riga.getAttivita().toLowerCase().contains("Cessione".toLowerCase())
                || riga.getLegalenome().equalsIgnoreCase("Euroservice junk STR FO")
                || riga.getLegalenome().equalsIgnoreCase("Euroservice_FO")) {
            onlyCUV = true;
        }
        else if (Collections.frequency(storedIds, riga.getCuv()) >= 1) {
            onlyCUV = true;
        }
        return onlyCUV;
    }
    else if (riga.getMercato().equals("MAGGIOR TUTELA")) {
        if (riga.getCollector().toLowerCase().contains("Cessione".toLowerCase())
                || riga.getCollector().equalsIgnoreCase("Euroservice_Fo")
                || riga.getAttivitaCrabb().toLowerCase().contains("*FO".toLowerCase())
                || riga.getaNomeCluster().equalsIgnoreCase("Full Outsourcing")) {
            onlyCUV = true;
        }
        else if (Collections.frequency(storedIds, riga.getCuv()) >= 1) {
            onlyCUV = true;
        }
        return onlyCUV;
    }
    return false;
}
(Here riga = line; Cessione etc. are just example names, like Blue above.)
The MAGGIOR TUTELA part is working, but the LIBERO part is not, and I don't know why.
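A minimal sketch of the two-pass grouping idea (getId() and getName() are assumed accessors on BaseElaborazione, analogous to getCuv() and the name used in the conditions above): first collect every ID whose group contains a Blue/Orange name, then route whole ID groups to the second output.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Pass 1: collect the IDs of all groups containing a "Blue"/"Orange" name.
// Pass 2: a line goes to the second output if its whole ID group was flagged.
private static List<BaseElaborazione> selectExcludedGroups(List<BaseElaborazione> result) {
    Set<String> excludedIds = new HashSet<>();
    for (BaseElaborazione riga : result) {
        String name = riga.getName() == null ? "" : riga.getName().toLowerCase();
        if (name.contains("blue") || name.equals("orange")) {
            excludedIds.add(riga.getId());
        }
    }
    List<BaseElaborazione> esclusi = new ArrayList<>();
    for (BaseElaborazione riga : result) {
        if (excludedIds.contains(riga.getId())) {
            esclusi.add(riga); // e.g. record1, record2 and record3 all go for ID10
        }
    }
    return esclusi;
}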

A Very Strange StringIndexOutOfBoundsException

Before asking this question I spent around half an hour on Google, but since I didn't find a solution I thought I should ask here.
Basically, I'm using a Java Reader to read a text file, converting each line of information into an object I called Nation (with a constructor, of course) and building an array of all those objects.
The problem is that a single line in my text file is 75 characters long, but I get an error telling me that the length is only 68! So here's the part of the code where I read the information from the file:
static int lireRemplir(String nomFichier, Nation[] nations) throws IOException {
    boolean existeFichier = true;
    int n = 0;
    FileReader fr = null;
    try {
        fr = new FileReader(nomFichier);
    }
    catch (java.io.FileNotFoundException erreur) {
        System.out.println("Problem opening the file " + nomFichier);
        existeFichier = false;
    }
    if (existeFichier) {
        BufferedReader entree = new BufferedReader(fr);
        boolean finFichier = false;
        while (!finFichier) {
            String uneLigne = entree.readLine();
            if (uneLigne == null) {
                finFichier = true;
            }
            else {
                nations[n] = new Nation(uneLigne.charAt(0), uneLigne.substring(55, 63),
                        uneLigne.substring(64, 74), uneLigne.substring(1, 15), uneLigne.substring(36, 54));
                n++;
            }
        }
        entree.close();
    }
    return n;
}
The error I get is:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
begin 64, end 74, length 68
Since I'm new here I tried to post an image of my text file but couldn't, so I'll just hand-write an example:
2ETATS-UNIS WASHINGTON 9629047 291289535
4CHINE PEKIN 9596960 1273111290
3JAPON KYOTO 377835 12761000
There is a lot of space between the words; it's laid out like a table!
If I change the 74 to 68 I get a result when I print my array, but some of the information is missing.
Here's my constructor:
public Nation(char codeContinent, String superficie, String population, String nom, String capitale) {
    this.codeContinent = codeContinent;
    this.superficie = superficie;
    this.population = population;
    this.nom = nom;
    this.capitale = capitale;
}
I hope you can help me with this! If you need to know more about my code, let me know. Thank you very much.
To avoid runtime exceptions, you need to be careful with your code. When you are dealing with indexes into a String or an array, check that the length of the String is greater than or equal to the maximum index you use. Enclose the code that is throwing the exception within:
if (uneLigne.length() > 74) {
    nations[n] = new Nation(uneLigne.charAt(0), uneLigne.substring(55, 63),
            uneLigne.substring(64, 74), uneLigne.substring(1, 15), uneLigne.substring(36, 54));
} else {
    // your logic to handle lines shorter than 74 characters
}
This will ensure your code does not break even if a line is shorter than expected.
______________________________________________________________________________
Another approach
Adding my comment as an answer:
The other way would be to use the split() method of the String class (or the StringTokenizer class) to get the tokens, if the line is delimited with a space or some other character. That way you don't need to break the string apart with substring(), where you have to worry about lengths and possible runtime exceptions.
Check the code snippet below using split(); for each line you read from the file, you would do something like this:
Nation nation = null;
String uneLigne = "2ETATS-UNIS WASHINGTON 9629047 291289535";
String[] strArray = uneLigne.split("\\s+"); // split on runs of whitespace, since the columns are space-padded
if (strArray.length > 4) {
    nation = new Nation(getContinentCodeFromFirstElement(strArray[0]), strArray[1],
            strArray[2], strArray[3], strArray[4]);
}
//getContinentCodeFromFirstElement(strArray[0]) is your private method to pick the code from your first token/element.
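For completeness, a hypothetical version of that helper, assuming the continent code is simply the first character of the fused first token (the '2' in "2ETATS-UNIS"):

private static char getContinentCodeFromFirstElement(String firstToken) {
    // the continent code is the leading digit of the first column
    return firstToken.charAt(0);
}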
The simplest way to solve your problem is to replace the 74 with uneLigne.length().
Here's the new code:
nations[n] = new Nation(uneLigne.charAt(0), uneLigne.substring(55, 63),
        uneLigne.substring(64, uneLigne.length()), uneLigne.substring(1, 15), uneLigne.substring(36, 54));
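More generally, you could clamp the bounds once in a small helper (a sketch of my own, not from the code above) so that short lines can never throw:

// Like s.substring(begin, end), but clamped to the actual length,
// so it never throws StringIndexOutOfBoundsException.
static String safeSub(String s, int begin, int end) {
    int to = Math.min(end, s.length());
    return (begin >= to) ? "" : s.substring(begin, to);
}

With that, uneLigne.substring(64, 74) becomes safeSub(uneLigne, 64, 74) and simply yields a shorter field on a short line.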

EditText validation not working as expected

I have developed an Android application that does front-end validation on an EditText field which should accept only three letters followed by four digits. It was tested in the staging environment and the front-end validation works perfectly (we don't have back-end validation). But some time later, when we checked our live database, we found some records for that field containing only digits. It seems the validation somehow does not take effect on some devices, and we received data with only digits. Is that possible, and what could be the reason we received invalid data?
// Check that the id has a valid format like "ABC1234".
String alphaLen = getResources().getString(R.string.rokaIdAlphaLen);
String numLen = getResources().getString(R.string.rokaIdNumericLen);
if (rokaId.length() > 0 && !Validate.validateRokaId(rokaId, alphaLen, numLen)) {
    etRokaid.setError(getString(R.string.error_incorrect_format));
    focusView = etRokaid;
    cancel = true;
}
public static boolean validateRokaId(String params, String alphaLen, String numLen) {
    boolean success = false;
    int alphaLength = Integer.parseInt(alphaLen.trim());
    int numericLength = Integer.parseInt(numLen.trim());
    if (params.length() == alphaLength + numericLength) {
        if (params.substring(0, alphaLength).matches("[a-zA-Z]*")
                && params.substring(alphaLength, alphaLength + numericLength).matches("[0-9]*")) {
            success = true;
        }
    }
    return success;
}
First of all, set the EditText property android:digits in the XML file for extra safety, so the user cannot insert other special characters even before your validation runs:
android:digits="ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
Now, for your format of 3 letters and 4 digits, we create a regex. You can create your own regex and test it online; I came up with this one:
[A-Z]{3}\d{4}
public final static Pattern NAME_PATTERN = Pattern.compile("^[A-Z]{3}[0-9]{4}$");
Now just match this pattern.
if (NAME_PATTERN.matcher(edtText.getText().toString().trim()).matches()) {
    // logic if the pattern matches
} else {
    // logic if the pattern does not match
}
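If you also want the field re-validated while the user types, here is a sketch (using android.text.TextWatcher and android.text.Editable, with the etRokaid field and error string assumed from the question):

etRokaid.addTextChangedListener(new TextWatcher() {
    @Override public void beforeTextChanged(CharSequence s, int start, int count, int after) { }
    @Override public void onTextChanged(CharSequence s, int start, int before, int count) { }
    @Override
    public void afterTextChanged(Editable s) {
        // flag invalid input immediately instead of only on submit
        if (!NAME_PATTERN.matcher(s.toString().trim()).matches()) {
            etRokaid.setError(getString(R.string.error_incorrect_format));
        }
    }
});

That said, any client-side check can fail or be bypassed on odd devices and keyboards, so the only reliable fix for the bad rows you are seeing is to validate on the back end as well.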

Iterating over tokens in HIDDEN channel

I am currently working on creating an IDE for MobTalkerScript (MTS), a custom, very Lua-like scripting language, which provides me with an ANTLR4 lexer. Since the language's grammar file puts comments into the HIDDEN channel, I need to tell the lexer to actually read from the HIDDEN channel. This is how I tried to do that:
Mts3Lexer lexer = new Mts3Lexer(new ANTLRInputStream("<replace this with the input>"));
lexer.setTokenFactory(new CommonTokenFactory(false));
lexer.setChannel(Token.HIDDEN_CHANNEL);
Token token = lexer.emit();
int type = token.getType();
do {
    switch (type) {
        case Mts3Lexer.LINE_COMMENT:
        case Mts3Lexer.COMMENT:
            System.out.println("token " + token.getText() + " is a comment");
        default:
            System.out.println("token " + token.getText() + " is not a comment");
    }
} while ((token = lexer.nextToken()) != null && (type = token.getType()) != Token.EOF);
Now, if I use this code on the following input, nothing but token ... is not a comment gets printed to the console.
function foo()
-- this should be a single-line comment
something = "blah"
--[[ this should
be a multi-line
comment ]]--
end
The tokens containing the comments never show up, though. So I searched for the source of this problem and found the following method in the ANTLR4 Lexer class:
/** Return a token from this source; i.e., match a token on the char
 *  stream.
 */
@Override
public Token nextToken() {
    if (_input == null) {
        throw new IllegalStateException("nextToken requires a non-null input stream.");
    }
    // Mark start location in char stream so unbuffered streams are
    // guaranteed at least have text of current token
    int tokenStartMarker = _input.mark();
    try {
        outer:
        while (true) {
            if (_hitEOF) {
                emitEOF();
                return _token;
            }
            _token = null;
            _channel = Token.DEFAULT_CHANNEL;
            _tokenStartCharIndex = _input.index();
            _tokenStartCharPositionInLine = getInterpreter().getCharPositionInLine();
            _tokenStartLine = getInterpreter().getLine();
            _text = null;
            do {
                _type = Token.INVALID_TYPE;
                // System.out.println("nextToken line "+tokenStartLine+" at "+((char)input.LA(1))+
                //                    " in mode "+mode+
                //                    " at index "+input.index());
                int ttype;
                try {
                    ttype = getInterpreter().match(_input, _mode);
                }
                catch (LexerNoViableAltException e) {
                    notifyListeners(e);  // report error
                    recover(e);
                    ttype = SKIP;
                }
                if (_input.LA(1) == IntStream.EOF) {
                    _hitEOF = true;
                }
                if (_type == Token.INVALID_TYPE) _type = ttype;
                if (_type == SKIP) {
                    continue outer;
                }
            } while (_type == MORE);
            if (_token == null) emit();
            return _token;
        }
    }
    finally {
        // make sure we release marker after match or
        // unbuffered char stream will keep buffering
        _input.release(tokenStartMarker);
    }
}
The line that caught my eye was the following.
_channel = Token.DEFAULT_CHANNEL;
I don't know much about ANTLR, but apparently this line keeps the lexer in the DEFAULT_CHANNEL channel.
Is the way I tried to read from the HIDDEN_CHANNEL channel right or can't I use nextToken() with the hidden channel?
I found out why the lexer didn't give me any tokens containing the comments: I had missed that the grammar file skips comments instead of putting them into the hidden channel. I contacted the author, the grammar file was changed, and now it works.
Note to self: pay more attention to what you read.
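For anyone else who hits this: the difference is whether a comment rule in the grammar ends in -> skip (the token is discarded entirely and nextToken() never returns it) or -> channel(HIDDEN) (the token is kept but hidden from the parser). A hypothetical MTS-style rule after the change:

LINE_COMMENT : '--' ~[\r\n]* -> channel(HIDDEN) ;

Once the grammar emits the tokens, a plain nextToken() loop sees tokens on every channel (channels only matter to the token stream feeding the parser), so the reading side can be as simple as this sketch:

Mts3Lexer lexer = new Mts3Lexer(new ANTLRInputStream(input));
for (Token token = lexer.nextToken(); token.getType() != Token.EOF; token = lexer.nextToken()) {
    if (token.getChannel() == Token.HIDDEN_CHANNEL) {
        System.out.println("comment: " + token.getText());
    }
}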
For Go (golang) this snippet works for me:
import (
    "github.com/antlr/antlr4/runtime/Go/antlr"
)

type antlrparser interface {
    GetParser() antlr.Parser
}

func fullText(prc antlr.ParserRuleContext) string {
    p := prc.(antlrparser).GetParser()
    ts := p.GetTokenStream()
    tx := ts.GetTextFromTokens(prc.GetStart(), prc.GetStop())
    return tx
}
just pass your ctx.GetSomething() into fullText. Of course, as shown above, whitespace has to go to the hidden channel in the *.g4 file:
WS: [ \t\r\n] -> channel(HIDDEN);

How can I remove the subdomain part of a URL

I am trying to remove the subdomain and leave only the domain name followed by the extension.
It is difficult to find the subdomain because I do not know how many dots to expect in a URL: some URLs end in .com, some in .co.uk, for example.
How can I remove the subdomain safely, so that foo.bar.com becomes bar.com and foo.bar.co.uk becomes bar.co.uk?
if (!rawUrl.startsWith("http://") && !rawUrl.startsWith("https://")) {
    rawUrl = "http://" + rawUrl;
}
String url = new java.net.URL(rawUrl).getHost();
String urlWithoutSub = ???
What you need is a Public Suffix List, such as the one available at https://publicsuffix.org/. Basically, there is no algorithm that can tell you which suffixes are public, so you need a list, and you had better use one that is public and well maintained.
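If you don't want to fetch and parse that list yourself, Guava ships a bundled copy of it; a minimal sketch (the host is extracted with java.net.URL as in the question):

import com.google.common.net.InternetDomainName;

String host = new java.net.URL("http://foo.bar.co.uk/x").getHost();
// topPrivateDomain() is the registrable domain directly under a public suffix
String domain = InternetDomainName.from(host).topPrivateDomain().toString();
// domain is now "bar.co.uk"; for "foo.bar.com" it would be "bar.com"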
I just stumbled upon this question and decided to write the following function.
Example input -> output:
http://example.com -> http://example.com
http://www.example.com -> http://example.com
ftp://www.a.example.com -> ftp://example.com
SFTP://www.a.example.com -> SFTP://example.com
http://www.a.b.example.com -> http://example.com
http://www.a.c.d.example.com -> http://example.com
http://example.com/ -> http://example.com/
https://example.com/aaa -> https://example.com/aaa
http://www.example.com/aa/bb../d -> http://example.com/aa/bb../d
FILE://www.a.example.com/ddd/dd/../ff -> FILE://example.com/ddd/dd/../ff
HTTPS://www.a.b.example.com/index.html?param=value -> HTTPS://example.com/index.html?param=value
http://www.a.c.d.example.com/#yeah../..! -> http://example.com/#yeah../..!
The same goes for second-level domains:
http://some.thing.co.uk/?ke -> http://thing.co.uk/?ke
something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk/?ke -> something.co.uk/?ke
www.something.co.uk -> something.co.uk
https://www.something.co.uk -> https://something.co.uk
Code:
public static String removeSubdomains(String url, ArrayList<String> secondLevelDomains) {
    // We need our URL in three parts: protocol - domain - path
    String protocol = getProtocol(url);
    url = url.substring(protocol.length());
    String urlDomain = url;
    String path = "";
    if (urlDomain.contains("/")) {
        int slashPos = urlDomain.indexOf("/");
        path = urlDomain.substring(slashPos);
        urlDomain = urlDomain.substring(0, slashPos);
    }
    // Done, now let us count the dots...
    int dotCount = Strng.countOccurrences(urlDomain, ".");
    // example.com <-- nothing to cut
    if (dotCount == 1) {
        return protocol + url;
    }
    int dotOffset = 2; // subdomain.example.com <-- default case, we want to remove everything before the 2nd-last dot
    // however, somebody had the glorious idea to have second-level domains, such as co.uk
    for (String secondLevelDomain : secondLevelDomains) {
        // we need to check if our domain ends with a second-level domain
        // example: something.co.uk - we don't want to cut away "something", since it isn't a subdomain but the actual domain
        if (urlDomain.endsWith(secondLevelDomain)) {
            // we increase the dot offset by the number of dots in the second-level domain (co.uk = +1)
            dotOffset += Strng.countOccurrences(secondLevelDomain, ".");
            break;
        }
    }
    // if we have something.co.uk, we have an offset of 3 but only 2 dots, hence nothing to remove
    if (dotOffset > dotCount) {
        return protocol + urlDomain + path;
    }
    // if we have sub.something.co.uk, we have an offset of 3 and 3 dots, so we remove "sub"
    int pos = Strng.nthLastIndexOf(dotOffset, ".", urlDomain) + 1;
    urlDomain = urlDomain.substring(pos);
    return protocol + urlDomain + path;
}
public static String getProtocol(String url) {
    String containsProtocolPattern = "^([a-zA-Z]*:\\/\\/)|^(\\/\\/)";
    Pattern pattern = Pattern.compile(containsProtocolPattern);
    Matcher m = pattern.matcher(url);
    if (m.find()) {
        return m.group();
    }
    return "";
}
public static ArrayList<String> getPublicSuffixList(boolean loadFromPublicSuffixOrg) {
    ArrayList<String> secondLevelDomains = new ArrayList<String>();
    if (!loadFromPublicSuffixOrg) {
        // small built-in fallback list, used when we don't load from publicsuffix.org
        secondLevelDomains.add("co.uk");
        secondLevelDomains.add("co.at");
        secondLevelDomains.add("or.at");
        secondLevelDomains.add("ac.at");
        secondLevelDomains.add("gv.at");
        secondLevelDomains.add("ac.uk");
        secondLevelDomains.add("gov.uk");
        secondLevelDomains.add("ltd.uk");
        secondLevelDomains.add("fed.us");
        secondLevelDomains.add("isa.us");
        secondLevelDomains.add("nsn.us");
        secondLevelDomains.add("dni.us");
        secondLevelDomains.add("ac.ru");
        secondLevelDomains.add("com.ru");
        secondLevelDomains.add("edu.ru");
        secondLevelDomains.add("gov.ru");
        secondLevelDomains.add("int.ru");
        secondLevelDomains.add("mil.ru");
        secondLevelDomains.add("net.ru");
        secondLevelDomains.add("org.ru");
        secondLevelDomains.add("pp.ru");
        secondLevelDomains.add("com.au");
        secondLevelDomains.add("net.au");
        secondLevelDomains.add("org.au");
        secondLevelDomains.add("edu.au");
        secondLevelDomains.add("gov.au");
        return secondLevelDomains;
    }
    try {
        String a = URLHelpers.getHTTP("https://publicsuffix.org/list/public_suffix_list.dat", false, true);
        Scanner scanner = new Scanner(a);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (!line.startsWith("//") && !line.startsWith("*") && line.contains(".")) {
                secondLevelDomains.add(line);
            }
        }
        scanner.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return secondLevelDomains;
}
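Hypothetical usage of the above (Strng and URLHelpers are the answer's own utility classes, not shown here):

ArrayList<String> secondLevelDomains = getPublicSuffixList(false); // use the built-in offline list
System.out.println(removeSubdomains("http://foo.bar.co.uk/x", secondLevelDomains));
// prints: http://bar.co.uk/x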
