Search if a column value in database exists in a file - java

I have a column in a database that lists all available vendors. After processing, the system outputs a text file that contains one vendor name that is also present in the DB column. Is there a way to find which vendor from the DB list is present in the text file?
For example, the column values could be:
Walmart
Target
MoreTex
Electra
The text file will contain:
"MoreTex, a textile company invoice number #384722 5119 09/22/14 Rome limited name dept terms card payment eggplant blah blah blah. total 329 tax Moretex, textile company Address visit www.moretextile.com"
I now have to find whether the above text contains any of the vendors in the DB. In the example above it matches "moretex".
Should I write something custom, or will Lucene or sphinxsearch help here? The vendor list can grow to 100,000+ entries and performance matters.
Thanks

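You can use org.apache.commons.io.FileUtils to read the whole file into a string in one call and then check each vendor name against it. Here's the code:
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;

File file = new File("file url");
// UTF-8 is an assumption; use the encoding your system writes
String content = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
// Replace with your column values from the DB
String[] dbColumn = {"Walmart", "Target", "MoreTex", "Electra"};
String colValue = null;
boolean flag = false;
for (String read : dbColumn) {
    // contains() is case-sensitive
    if (content.contains(read)) {
        colValue = read;
        flag = true;
        break; // the file holds only one vendor name
    }
}
if (flag) {
    System.out.println(colValue + " exists in file");
}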
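With 100,000+ vendors, calling contains() once per vendor rescans the text for every name. A minimal sketch of an alternative, assuming vendor names are single words as in the question's example: put the lowercased names in a HashSet and tokenize the file text once, so each word costs one hash lookup. The class name VendorMatcher and the method findVendors are made up for illustration; multi-word vendor names would need a different approach (for example a trie or a search index such as Lucene).

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class VendorMatcher {

    // Returns the vendor names (lowercased) that occur as whole words in the text.
    static Set<String> findVendors(List<String> vendors, String text) {
        Set<String> known = new HashSet<>();
        for (String vendor : vendors) {
            known.add(vendor.toLowerCase(Locale.ROOT));
        }
        Set<String> found = new LinkedHashSet<>();
        // Split on every run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (known.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }
}
```

Because matching is on whole words, "www.moretextile.com" does not count as a hit for "MoreTex", while both "MoreTex" and "Moretex" do.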

Related

Scrape information from Web Pages with Java?

I'm trying to extract data from a webpage. For example, let's say I wish to fetch information from chess.org.
I know the player's ID is 25022, which means I can request
http://www.chess.org.il/Players/Player.aspx?Id=25022
In that page I can see that this player's fide ID = 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109
And from that I can see that stdRating=1602.
How can I get the "stdRating" output from a given "localID" input in Java?
(localID, fideID and stdRating are just helper names that I use to clarify the question)
You could try the univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.
To get the standard rating for example you can use this code:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
url.getRequest().setUrlParameter("EVENT", 2821109);
HtmlElement doc = HtmlParser.parseTree(url);
String rating = doc.query()
.match("small").withText("std.")
.match("br").getFollowingText()
.getValue();
System.out.println(rating);
}
Which produces the value 1602.
But getting data by querying individual nodes and trying to stitch all pieces together is not exactly easy.
I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details which are available in the table of the second page. It took me less than 1h to get this done:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
url.getRequest().setUrlParameter("PLAYER_ID", 25022);
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings player = entities.configureEntity("player");
player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();
HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
playerCard.setNesting(Nesting.REPLACE_JOIN);
HtmlEntitySettings ratings = playerCard.addEntity("ratings");
configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");
Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
HtmlParserResult playerData = results.get("player");
String[] playerFields = playerData.getHeaders();
for(HtmlRecord playerRecord : playerData.iterateRecords()){
for(int i = 0; i < playerFields.length; i++){
System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) +"; ");
}
System.out.println();
HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
for(HtmlRecord ratingRecord : ratingData.iterateRecords()){
System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
}
}
}
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
Group group = ratings.newGroup()
.startAt("table").match("b").withExactText(startingHeader)
.endAt("b").withExactText(endingHeader);
group.addField("rank_type", rankType);
group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
The output will be:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
Hope it helps
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.
As @Alex R pointed out, you'll need a Web Scraping library for this.
The one he recommended, JSoup, is quite robust and is pretty commonly used for this task in Java, at least in my experience.
You'd first need to construct a document that fetches your page, eg:
int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
From this Document object you can fetch a lot of information, for example the FIDE ID you requested. Unfortunately, the web page you linked isn't very simple to scrape, and you'll basically need to go through every link on the page to find the relevant one, for example:
Elements fidelinks = doc.select("a[href*=fide.com]");
This Elements object should give you a list of all links whose href contains the text fide.com, but you probably only want the first one, e.g.:
Element fideurl = doc.selectFirst("a[href*=fide.com]");
From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!
You can get the ID alone by calling the text() method on your Element object, but you can also get the link itself by calling attr("href") on it.
The css selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which will get you the std score specifically, at least with standard css, so this should work with jsoup as well.

Processing large number of records from a file in Java

I have a million records in a CSV file with 3 columns: id, firstName, lastName. I have to process this file in Java and validate that id is unique and firstName is not null. Where id is not unique and/or firstName is null, I have to write those records to an output file with a fourth column giving the reason ("id not unique"/"firstName is NULL"). Performance should be good. Please suggest the most effective way.
You can use a collection (ArrayList) to store all the IDs as you loop, checking whether each one already exists. If it does, write it to a file.
The code should be like this:
if(!idList.contains(id)){
idList.add(id);
}else{
writer.write(id);
}
The above code should work in a loop for all the records being read from the CSV file.
You can use the OpenCSV jar for the purpose you have specified. It's under the Apache 2.0 licence.
You can download the jar from
http://www.java2s.com/Code/Jar/o/Downloadopencsv22jar.htm
Below is the code for it:
Reader reader = Files.newBufferedReader(Paths.get(INPUT_SAMPLE_CSV_FILE_PATH));
CSVReader csvReader = new CSVReader(reader);
// The output file needs a writer
Writer writer = Files.newBufferedWriter(Paths.get(OUTPUT_SAMPLE_CSV_FILE_PATH));
CSVWriter csvWriter = new CSVWriter(writer);
List<String[]> list = csvReader.readAll();
for (String[] row : list) {
    // assuming the first column is the id
    String id = row[0];
    // assuming the second column is the first name
    String name = row[1];
    // assuming the third column is the last name
    String lastName = row[2];
    // put your pattern here
    if (id == null || !id.matches("pattern") || name == null || !name.matches("pattern")) {
        String[] outPutData = new String[]{id, name, lastName, "Invalid Entry"};
        csvWriter.writeNext(outPutData);
    }
}
// Close both so the output is flushed to disk
csvReader.close();
csvWriter.close();
Let me know if this works or if you need further help or clarification.
If you want an algorithm that performs well, you should not use ArrayList.contains(element), which has O(n) complexity. Instead I suggest a HashSet, since HashSet.contains(element) has O(1) complexity on average. In short, with an ArrayList you would perform on the order of 1,000,000^2 operations, while with a HashSet you would perform about 1,000,000.
In pseudo-code (so as not to give away the full answer and to let you work it out on your own) I would do this:
File outputFile
String[] columns
HashSet<String> ids
for(line in file):
    columns = line.split(',')
    if(ids.contains(columns.id)):
        outputFile.append(columns.id + " is not unique")
        continue
    ids.add(columns.id)
    if(columns.name == null):
        outputFile.append("first name is null!")
        continue
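For comparison, a minimal Java sketch of the same HashSet idea. It assumes plain comma-separated rows with no quoted commas and no header row; the class name CsvValidator and the method findInvalid are hypothetical. It relies on Set.add returning false when the element is already present, so the lookup and the insert are a single call.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CsvValidator {

    // Returns the rejected rows, each with the reason appended as a fourth column.
    static List<String> findInvalid(List<String> lines) {
        Set<String> seenIds = new HashSet<>();
        List<String> rejected = new ArrayList<>();
        for (String line : lines) {
            // The -1 limit keeps trailing empty columns such as "3,Ann,".
            String[] cols = line.split(",", -1);
            String id = cols[0];
            String firstName = cols.length > 1 ? cols[1].trim() : "";
            List<String> reasons = new ArrayList<>();
            // Set.add returns false when the id was already in the set.
            if (!seenIds.add(id)) {
                reasons.add("id not unique");
            }
            if (firstName.isEmpty()) {
                reasons.add("firstName is NULL");
            }
            if (!reasons.isEmpty()) {
                rejected.add(line + "," + String.join("/", reasons));
            }
        }
        return rejected;
    }
}
```

For a file with a million rows the same loop body can be fed line by line from a BufferedReader instead of a List, so that only the set of ids is held in memory.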

using java select city and country from a given string that contains full address

I am writing code in which I want the user to provide a string of unknown length. Given that string, I want to extract the city and country present in it.
If anybody has a better idea, please share.
Given your requirement, you have to define all the possible cities or countries up front, for example String[] countries = {"America", "England", "China", "Myanmar"};. Then loop over the array and, for each entry, scan the user's line from index 0, moving the character position forward by one each time (in an inner loop) and checking whether the entry matches the substring that starts there. The program's complexity will keep growing with your requirements; I think it will rise to O(n*n), which is not good for performance.
In my view, you should ask the user for the input step by step (Enter City: _ then Enter Country: _); that makes the string much easier to handle. Good luck!
In the question you never specified the format of the input string, but assuming the format is "city, country", this works:
String s = "the name of a city, the name of a country";
String city = s.substring(0, s.indexOf(", "));
String country = s.substring(s.indexOf(", ") + 2);
System.out.println("City = " + city);
System.out.println("Country = " + country);
Well, your question is very interesting. The program you are writing depends on your own logic, and I don't think there is a jar file that solves it for you, so it is better to build the solution by hand. Think about how a dictionary program works: the words are usually stored in a text file, and at run time the program loads them into an array or some other collection. In the same way you can load your own file at run time into a 2D array or a collection (a HashMap is the usual choice). Suppose the file you want to read looks like this:
Example:
Agra,India
London,England
Yangon,Myanmar
Tokyo,Japan
etc...
String line;
// dataFile is the path to your file
FileInputStream fstream = new FileInputStream(dataFile);
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
HashMap<String, String> divideCityCountry = new HashMap<String, String>();
// readLine() returns null at the end of the file
while ((line = br.readLine()) != null) {
    String[] lineSplit = line.split(","); // use ',' as the delimiter
    divideCityCountry.put(lineSplit[0], lineSplit[1]);
}
br.close();
Now you can look up a city and get its country from the divideCityCountry HashMap.
Hope this helps. Good luck!
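Once the map is loaded, finding the city inside a free-form address can be sketched as a single pass over the words of the input. This is only an illustration under two assumptions: the map keys are stored in lower case, and city names are single words (something like "New York" would need extra handling). The names CityFinder and findCityAndCountry are made up for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class CityFinder {

    // Returns "City, Country" for the first word of the address that is a known city.
    static Optional<String> findCityAndCountry(Map<String, String> cityToCountry, String address) {
        // Split on every run of non-letter characters (digits, commas, spaces).
        for (String token : address.split("[^\\p{L}]+")) {
            String country = cityToCountry.get(token.toLowerCase(Locale.ROOT));
            if (country != null) {
                return Optional.of(token + ", " + country);
            }
        }
        return Optional.empty();
    }
}
```

Each word costs one hash lookup, so the running time grows with the length of the address rather than with the number of known cities.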

How to read data from a CSV that contains more separators than expected?

I use CsvJdbc to read data from a CSV. I get the CSV from a web service request, so it is not loaded from a file. I set these properties:
Properties props = new java.util.Properties();
props.put("separator", ";"); // separator is a semicolon
props.put("fileExtension", ".txt"); // file extension is .txt
props.put("charset", "UTF-8"); // UTF-8
My sample1.txt contains this data:
code;description
c01;d01
c02;d02
My sample2.txt contains this data:
code;description
c01;d01
c02;d0;;;;;2
Removing the headers from the CSV is optional for me, but changing the semicolon separator is not an option.
EDIT: My query for the resultSet: SELECT * FROM myCSV
I want to read the code column in sample1.txt and sample2.txt with:
resultSet.getString(1)
and read the full description column, including its many semicolons (d0;;;;;2). Is this possible with the CsvJdbc driver, or do I need to change drivers?
Thank you for any advice!
This is a problem that occurs when you have messy, invalid input, which you need to try to interpret, that's being read by a too-high-level package that only handles clean input. A similar example is trying to read arbitrary HTML with an XML parser - close, but no cigar.
You can guess where I'm going: you need to pre-process your input.
The preprocessing may be very easy if you can make some assumptions about the data - for example, if there are guaranteed to be no quoted semi-colons in the first column.
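Under exactly that assumption (the code column never contains a semicolon), the pre-processing can be as small as splitting each line at the first semicolon only. A minimal sketch; the class name SemicolonRow and the method parse are hypothetical, and String.split(regex, limit) is the standard JDK method:

```java
final class SemicolonRow {

    // Splits a line into {code, description} at the first semicolon only.
    static String[] parse(String line) {
        // A limit of 2 yields at most two parts, so any later
        // semicolons stay inside the description.
        String[] parts = line.split(";", 2);
        String code = parts[0];
        String description = parts.length > 1 ? parts[1] : "";
        return new String[] {code, description};
    }
}
```

For the row c02;d0;;;;;2 this gives the code c02 and the description d0;;;;;2 with all of its semicolons intact.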
You could try supercsv. We have implemented such a solution in our project. More on this can be found in http://supercsv.sourceforge.net/
and
Using CsvBeanReader to read a CSV file with a variable number of columns
In the end this problem was solved without the CsvJdbc or SuperCSV driver. These drivers work fine; they let you query data from a CSV file and offer many features. In my case I don't need to query data from the CSV. Unfortunately, the description column sometimes contains one or more semicolons, which is also my separator.
First I checked the code in the answer of @Maher Abuthraa and modified it to:
private String createDescriptionFromResult(ResultSet resultSet, int columnCount) throws SQLException {
if (columnCount > 2) {
StringBuilder data_list = new StringBuilder();
for (int ii = 2; ii <= columnCount; ii++) {
data_list.append(resultSet.getString(ii));
if (ii != columnCount)
data_list.append(";");
}
// data_list has all data from all index you are looking for ..
return data_list.toString();
} else {
// use standard way
return resultSet.getString(2);
}
}
The loop starts from 2 because column 1 is the code and only the description column contains extra semicolons. The CsvJdbc driver splits columns at the separator ;, so these semicolons disappear from the column data. I therefore add the semicolons back to the description explicitly, except after the last column, because that is not relevant in my case.
This code works fine, but it did not solve my whole problem. When I declared two columns in the CSV header, I got an error on any row containing more than two semicolons. So I tried either ignoring the headers or adding many column names (or simply ;) to the header. In SuperCSV the ignore-headers option works fine.
My colleague's opinion was: you don't need a CSV driver, because what you are trying to load is not really CSV if the separator is sometimes part of the data.
I think my colleague is right, so I loaded the CSV data with the following code:
InputStream in = null;
try {
in = new ByteArrayInputStream(csvData);
List lines = IOUtils.readLines(in, "UTF-8");
Iterator it = lines.iterator();
String line = "";
while (it.hasNext()) {
line = (String) it.next();
String description = null;
String code = null;
String[] columns = line.split(";");
if (columns.length >= 2) {
code = columns[0];
String[] dest = new String[columns.length - 1];
System.arraycopy(columns, 1, dest, 0, columns.length - 1);
description = org.apache.commons.lang.StringUtils.join(dest, ";");
(...)
OK, my solution is to read all the fields if there are more than 2 columns, like:
int ccc = meta.getColumnCount();
if (ccc > 2) {
    ArrayList<String> data_list = new ArrayList<String>();
    // JDBC columns are 1-based, so include column ccc itself
    for (int ii = 1; ii <= ccc; ii++) {
        data_list.add(resultSet.getString(ii));
    }
    // data_list has all data from all the indexes you are looking for
} else {
    // use the standard way
    resultSet.getString(1);
}
If the table is defined to have as many columns as there could be semi-colons in the source, ignoring the initial column definitions, then the excess semi-colons would be consumed by the database driver automatically.
The most likely reason for them to appear in the final column is because the parser returns the balance of the row to the terminator in the field.
Simply increasing the number of columns in the table to match the maximum possible in the input will avoid the need for custom parsing in the program. Try:
code;description;dummy1;dummy2;dummy3;dummy4;dummy5
c01;d01
c02;d0;;;;;2
Then, the additional ';' delimiters will be consumed by the parser correctly.

Reading a property file and saving to an object

I have a property file called person.properties. I need to add several person entries to it.
A person entry will have a Name, Age, Telephone. There will be many Person entries in this Property file.
ID : 1
Name: joe
Age: 30
Telephone: 444444
ID : 2
Name: Anne
Age: 20
Telephone: 575757
ID : 3
Name: Matt
Age: 17
Telephone : 7878787
ID : 4
Name: Chris
Age: 21
Telephone : 6767676
I need to read the property file and save each record in a Person object.
Person p = new Person();
p.setId(ADD THE FIRST VALUE OF ID FROM THE PROPERTY FILE);
p.setName(ADD THE FIRST VALUE OF NAME FROM THE PROPERTY FILE);
and likewise for the other fields, then save the objects in an array.
I think I will not be able to read the person.properties file above and save it to Person objects as I require, because the same keys are repeated in the property file. How can I achieve this?
You don't have to use the Properties methods for this; you can simply read the file as a text file and parse it manually:
Scanner s = new Scanner(new File("propertyfile.properties"));
while (s.hasNextLine()) {
String id = s.nextLine().split(":")[1].trim();
String name = s.nextLine().split(":")[1].trim();
String age = s.nextLine().split(":")[1].trim();
String phone = s.nextLine().split(":")[1].trim();
}
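To go from the parsed strings to the objects and the collection the question asks for, the same idea can be sketched as follows. It assumes every record is exactly four lines in the order ID, Name, Age, Telephone, as in the sample file; the nested Person class is a minimal stand-in for the asker's own class, and PersonFileParser is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

final class PersonFileParser {

    // Minimal stand-in for the asker's Person class.
    static final class Person {
        final int id;
        final String name;
        final int age;
        final String telephone;

        Person(int id, String name, int age, String telephone) {
            this.id = id;
            this.name = name;
            this.age = age;
            this.telephone = telephone;
        }
    }

    // Text after the first colon, trimmed, so "Telephone : 7878787" gives "7878787".
    private static String value(String line) {
        return line.substring(line.indexOf(':') + 1).trim();
    }

    static List<Person> parse(List<String> lines) {
        List<Person> people = new ArrayList<>();
        // Assumes each record is exactly four lines: ID, Name, Age, Telephone.
        for (int i = 0; i + 3 < lines.size(); i += 4) {
            people.add(new Person(
                    Integer.parseInt(value(lines.get(i))),
                    value(lines.get(i + 1)),
                    Integer.parseInt(value(lines.get(i + 2))),
                    value(lines.get(i + 3))));
        }
        return people;
    }
}
```

The lines themselves could come from Files.readAllLines(Paths.get("person.properties")). Taking everything after the first colon handles both the "Name: joe" and the "ID : 1" spacing in the sample file.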
The file format you describe is not really a properties file. Just read it yourself, using something like
public File openFile(String URI); // write this yourself
public void readFile(File names) throws IOException {
    try (BufferedReader br = new BufferedReader(new FileReader(names))) {
        String next;
        while ((next = br.readLine()) != null) {
            // split on ":" alone, since the spacing around it varies in the file
            String[] split = next.split(":");
            // handle each case, etc.
        }
    }
}
Modification of file
If you want to modify the key and write it back to the same position, you should use a database. Here are two free ones: MySQL and SQLite. It's possible to edit the file in that way, but it's much easier to just do it with a database, that's what it's designed for.
What you are doing is not really the purpose of property files in Java, I think. Nevertheless, here is how to handle property files:
Properties prop = new Properties();
try {
//load a properties file
prop.load(new FileInputStream("file.properties"));
//get the property value and print it out
System.out.println(prop.getProperty("name"));
System.out.println(prop.getProperty("age"));
System.out.println(prop.getProperty("telephone"));
} catch (IOException ex) {
ex.printStackTrace();
}
Does this help, or what do you actually want to do?
I think a database-style approach would be better for what you are doing.
