I have a program that extracts certain elements (article author names) from many articles on the PubMed site. While the program works correctly on my PC (Windows), when I run it on Unix it returns an empty list. I suspect this is because the syntax should be somewhat different on a Unix system, but the JSoup documentation does not mention anything about this. Does anyone know anything about this? My code is something like this:
Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get();
System.out.println("connected");
Elements authors = doc.select("div.auths >*");
System.out.println("number of elements is " + authors.size());
The final System.out.println always reports that the size is 0, so nothing further can run.
Thanks in advance
Complete Example:
protected static void searchLink(HashMap<String, HashSet<String>> authorsMap, HashMap<String, HashSet<String>> reverseAuthorsMap,
String fileLine
) throws IOException, ParseException, InterruptedException
{
JSONParser parser = new JSONParser();
JSONObject jsonObj = (JSONObject) parser.parse(fileLine.substring(0, fileLine.length() - 1 ));
String pmidString = (String)jsonObj.get("pmid");
System.out.println(pmidString);
Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get();
System.out.println("connected");
Elements authors = doc.select("div.auths >*");
System.out.println("found the element");
HashSet<String> authorsList = new HashSet<>();
System.out.println("authors list hashSet created");
System.out.println("number of elements is " + authors.size());
for (int i =0; i < authors.size(); i++)
{
// add the current name to the names list
authorsList.add(authors.get(i).text());
// pmidList variable
HashSet<String> pmidList;
System.out.println("stage 1");
// if the author name is new, then create the list, add the current pmid and put it in the map
if(!authorsMap.containsKey(authors.get(i).text()))
{
pmidList = new HashSet<>();
pmidList.add(pmidString);
System.out.println("made it to searchLink");
authorsMap.put(authors.get(i).text(), pmidList);
}
// if the author name has been found before, get the list of articles and add the current
else
{
System.out.println("Author exists in map");
pmidList = authorsMap.get(authors.get(i).text());
pmidList.add(pmidString);
authorsMap.put(authors.get(i).text(), pmidList);
//authorsMap.put((String) authorName, null);
}
// finally, add the pmid-authorsList to the map
reverseAuthorsMap.put(pmidString, authorsList);
System.out.println("reverseauthors populated");
}
}
I have a thread pool, and each thread uses this method to populate two maps. The fileLine argument is a single line that I parse as JSON, keeping the "pmid" field. Using this string I access the URL of the article and parse the HTML for the names of the authors. The rest should work (it does work on my PC), but because authors.size() is always 0, the for loop directly below the "number of elements" System.out.println never executes.
I've tried an example doing exactly what you're trying:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test {
public static void main (String[] args) throws IOException {
String docId = "24312906";
if (args.length > 0) {
docId = args[0];
}
String url = "http://www.ncbi.nlm.nih.gov/pubmed/" + docId;
Document doc = Jsoup.connect(url).timeout(60000).userAgent("Mozilla/25.0").get();
Elements authors = doc.select("div.auths >*");
System.out.println("os.name=" + System.getProperty("os.name"));
System.out.println("os.arch=" + System.getProperty("os.arch"));
// System.out.println("doc=" + doc);
System.out.println("authors=" + authors);
System.out.println("authors.length=" + authors.size());
for (Element a : authors) {
System.out.println(" author: " + a);
}
}
}
My OS is Linux:
# uname -a
Linux graphene 3.11.0-13-generic #20-Ubuntu SMP Wed Oct 23 07:38:26 UTC 2013 x86_64 x86_64 x86_64
GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Running that program produces:
os.name=Linux
os.arch=amd64
authors=Liu W
Chen D
authors.length=2
author: Liu W
author: Chen D
Which seems to work. Perhaps the issue is with fileLine? Can you print out the value of 'url':
System.out.println("url='" + "http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString + "'");
Since you're not getting an exception from your code, I suspect you're getting a document, just not one your code is anticipating. Printing out the document so you can see what you've gotten back will probably help as well.
Related
I am trying to parse this CSV file, but when I go to print it I get "Input length = 1" as the output. Here is my code. Can anyone provide an explanation as to why this is happening?
try {
List<String> lines = Files.readAllLines(Paths.get("src\\exam1_tweets.csv"));
for(String line : lines) {
line = line.replace("\"", "");
System.out.println(line);
}
}catch (Exception e) {
System.out.println(e.getMessage());
}
You want this change:
List<String> lines = Files.readAllLines(Paths.get("src\\exam1_tweets.csv"),
StandardCharsets.ISO_8859_1);
It was an encoding issue; please read this.
To see the full cause of errors, you should use e.printStackTrace() in the catch block.
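To see why printStackTrace() matters here, a minimal stdlib-only sketch (the temp file and its bytes are invented for illustration): a byte that is valid ISO-8859-1 but not valid UTF-8 makes Files.readAllLines throw a MalformedInputException whose getMessage() is just the cryptic "Input length = 1", while the stack trace names the real exception class and where it came from.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        // Write a byte that is valid ISO-8859-1 but not valid UTF-8 (é = 0xE9).
        Path file = Files.createTempFile("demo", ".csv");
        Files.write(file, new byte[] { 'c', 'a', 'f', (byte) 0xE9 });

        try {
            // Files.readAllLines defaults to UTF-8, so decoding fails here.
            List<String> lines = Files.readAllLines(file);
            System.out.println(lines);
        } catch (Exception e) {
            // getMessage() alone is cryptic: "Input length = 1"
            System.out.println(e.getMessage());
            // printStackTrace() reveals java.nio.charset.MalformedInputException
            // thrown from inside Files.readAllLines.
            e.printStackTrace();
        }

        // Reading with the correct charset succeeds.
        List<String> ok = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
        System.out.println(ok.get(0)); // café
    }
}
```

Reading the same file with StandardCharsets.ISO_8859_1 succeeds, which is exactly the fix shown above.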
Java code:
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
public class Sof {
public static final String USER_DIR = "user.dir";
public static void main(String[] args) {
try {
List<String> lines = Files.readAllLines(
Paths.get(System.getProperty(USER_DIR) + File.separator + "src" + File.separator + "main" + File.separator + "resources" + File.separator + "exam1_tweets.csv"),
StandardCharsets.ISO_8859_1);
for (String line : lines) {
line = line.replace("\"", "");
System.out.println(line);
}
} catch (Exception e) {
System.out.println("ERROR " + e.getMessage());
}
}
}
console:
Handle,Tweet,Favs,RTs,Latitude,Longitude
BillSchulhoff,Wind 3.2 mph NNE. Barometer 30.20 in, Rising slowly. Temperature 49.3 °F. Rain today 0.00 in. Humidity 32%,,,40.76027778,-72.95472221999999
danipolis,Pausa pro café antes de embarcar no próximo vôo. #trippolisontheroad #danipolisviaja Pause for? https://....,,,32.89834949,-97.03919589
KJacobs27,Good. Morning. #morning #Saturday #diner #VT #breakfast #nucorpsofcadetsring #ring #college? https://...,,,44.199476,-72.50417299999999
stncurtis,#gratefuldead recordstoredayus ?????? # TOMS MUSIC TRADE https://...,,,39.901474,-76.60681700000001
wi_borzo,Egg in a muffin!!! (# Rocket Baby Bakery - #rocketbabybaker in Wauwatosa, WI) https://...,,,43.06084924,-87.99830888
KirstinMerrell,#lyricwaters should've gave the neighbor a buzz. Iv got ice cream and moms baked goodies ??,,,36.0419128,-75.68186176
Jkosches86,On the way to CT! (# Mamaroneck, NY in Mamaroneck, NY) https://.../6rpe6MXDkB,,,40.95034402,-73.74092102
tmj_pa_retail,We're #hiring! Read about our latest #job opening here: Retail Sales Consultant [CWA MOB] Bryn Mawr PA - https://.../bBwxSPsL4f #Retail,,,40.0230237,-75.31517719999999
Vonfandango,Me... # Montgomery Scrap Corporation https://.../kpt7zM4xbL,,,39.10335,-77.13652 ....
I've made a CSV parser/writer that is easy to use thanks to its builder pattern.
It parses a CSV file and gives you a list of beans.
Here is the source code:
https://github.com/i7paradise/CsvUtils-Java8/
I've included a main class, Demo.java, to show how it works.
Let's say your file contains this:
Item name;price
"coffe ""Lavazza""";13,99
"tea";0,00
"milk
in three
lines";25,50
riz;130,45
Total;158
and you want to parse it and store it in a collection of
class Item {
String name;
double price;
public Item(String name, double p) {
// ...
}
//...
}
you can parse it like this:
List<Item> items = CsvUtils.reader(Item.class)
//content of file; or you can use content(String) to give the content of csv as a String
.content(new File("path-to-file.csv"))
// false to not include the first line; because we don't want to parse the header
.includeFirstLine(false)
// false to not include the last line; because we don't want to parse the footer
.includeLastLine(false)
// mapper to create the Item instance from the given line, line is ArrayList<String> that returns null if index not found
.mapper(line -> new Item(
line.get(0),
Item.parsePrice(line.get(1)))
)
// finally we call read() to parse the file (or the content)
.read();
I am using JNDI in Java to perform DNS lookups in my application to resolve A records - running under Java 8 on Windows 7. However, I am having trouble resolving records unless I specify the complete host entry including domain name.
Java appears to be ignoring the DNS search list which is configured on the PC. I don't have a problem including the domain name, if that is what Java requires, but I can't find a public method to obtain the domains in the search list.
The following SSCCE uses the internal class sun.net.dns.ResolverConfiguration to obtain the DNS search list, but I shouldn't use it, as it is an internal proprietary API and may be removed in a future release.
import java.util.*;
import javax.naming.*;
import javax.naming.directory.*;
public class SSCCE {
public static void main(String[] args) {
String[] hostsToLookup = new String[] { "testhost", "testhost.mydomain.com" };
try {
System.out.println("DNS Search List:");
for (Object o: sun.net.dns.ResolverConfiguration.open().searchlist()) {
System.out.println(" " + o);
}
Properties p = new Properties();
p.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
InitialDirContext idc = new InitialDirContext(p);
for (String h : hostsToLookup) {
System.out.println("Host: " + h);
try {
Attributes attrs = idc.getAttributes(h, new String[] { "A" });
Attribute attr = attrs.get("A");
if (attr != null) {
for (int i = 0; i < attr.size(); i++) {
System.out.println(" " + attr.get(i));
}
}
}
catch (NameNotFoundException e) {
System.out.println(" undefined");
}
}
}
catch (Exception e) {
e.printStackTrace();
}
}
}
When I run this using just the host part it doesn't resolve, but when I manually add the domain from the search list then it does:
DNS Search List:
mydomain.com
Host: testhost
undefined
Host: testhost.mydomain.com
192.0.2.1
Is it possible to either make Java honour the DNS search list using JNDI or is there a public method to obtain the DNS search list?
I'm trying out LDA with Spark 1.3.1 in Java and got this error:
Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"
My .txt file looks like this:
put weight find difficult pull ups push ups now
blindness diseases everything eyes work perfectly except ability take light use light form images
role model kid
Dear recall saddest memory childhood
This is the code:
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;
public class JavaLDA {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("LDA Example");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load and parse the data
String path = "/tutorial/input/askreddit20150801.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
new Function<String, Vector>() {
public Vector call(String s) {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++)
values[i] = Double.parseDouble(sarray[i]);
return Vectors.dense(values);
}
}
);
// Index documents with unique IDs
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
return doc_id.swap();
}
}
));
corpus.cache();
// Cluster the documents into three topics using LDA
LDAModel ldaModel = new LDA().setK(100).run(corpus);
// Output topics. Each is a distribution over words (matching word count vectors)
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 100; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}
System.out.println();
}
ldaModel.save(sc.sc(), "myLDAModel");
}
}
Does anyone know why this happened? I'm trying Spark's LDA for the first time. Thanks.
values[i] = Double.parseDouble(sarray[i]);
Why are you trying to convert each word of your text file into a Double?
That's the answer to your issue:
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#parseDouble%28java.lang.String%29
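A quick, self-contained way to see the failure mode (the token "put" is just the first word of the sample file):

```java
public class ParseDemo {
    public static void main(String[] args) {
        // Numeric tokens parse fine.
        System.out.println(Double.parseDouble("3.14")); // 3.14

        try {
            // Any word from the corpus triggers the Spark task failure:
            Double.parseDouble("put");
        } catch (NumberFormatException e) {
            // For input string: "put"
            System.out.println(e.getMessage());
        }
    }
}
```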
Your code is expecting the input file to be a bunch of lines of whitespace-separated text that looks like numbers. Assuming your text is words instead:
Get a list of every word that appears in your corpus:
JavaRDD<String> words =
data.flatMap((FlatMapFunction<String, String>) s -> {
s = s.replaceAll("[^a-zA-Z ]", "");
s = s.toLowerCase();
return Arrays.asList(s.split(" "));
});
Make a map giving each word an integer associated with it:
Map<String, Long> vocab = words.zipWithIndex().collectAsMap();
Then instead of your parsedData doing what it's doing up there, make it look up each word, find the associated number, go to that location in an array, and add 1 to the count for that word.
JavaRDD<Vector> tokens = data.map(
(Function<String, Vector>) s -> {
String[] vals = s.split("\\s");
double[] idx = new double[vocab.size() + 1];
for (String val : vals) {
idx[vocab.get(val).intValue()] += 1.0;
}
return Vectors.dense(idx);
}
);
This results in an RDD of vectors, where each vector is vocab.size() long, and each spot in the vector is the count of how many times that vocab word appeared in the line.
I modified this code slightly from what I'm currently using and didn't test it, so there could be errors in it. Good luck!
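Stripped of Spark, the counting step above can be sketched with plain collections; the vocab map and sample words here are hypothetical stand-ins for the collected word map:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CountVector {
    // Build a bag-of-words count vector for one line, given a word -> index map.
    public static double[] countVector(String line, Map<String, Long> vocab) {
        double[] counts = new double[vocab.size() + 1];
        for (String word : line.split("\\s+")) {
            Long idx = vocab.get(word);
            if (idx != null) {          // skip words missing from the vocab
                counts[idx.intValue()] += 1.0;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> vocab = new HashMap<>();
        vocab.put("role", 0L);
        vocab.put("model", 1L);
        vocab.put("kid", 2L);
        double[] v = countVector("role model role kid", vocab);
        System.out.println(Arrays.toString(v)); // [2.0, 1.0, 1.0, 0.0]
    }
}
```

Each vector is vocab.size() + 1 long, and each slot holds how many times that vocab word appeared in the line, matching the Spark version above.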
I am a novice to Java; however, I cannot seem to figure this one out. I have a CSV file in the following format:
String1,String2
String1,String2
String1,String2
String1,String2
Each line is a pair. The 2nd line is a new record, as is the 3rd. In the real world the CSV file will change in size; sometimes it will have 3 records, or 4, or even 10.
My issue is: how do I read the values into an array and dynamically adjust its size? I would imagine we would first have to parse through the CSV file, get the number of records/elements, create the array based on that size, then go through the CSV again and store the values in it.
I'm just not sure how to accomplish this.
Any help would be appreciated.
You can use an ArrayList instead of an array. An ArrayList is a dynamic array. For example:
Scanner scan = new Scanner(new File("yourfile"));
ArrayList<String[]> records = new ArrayList<String[]>();
while(scan.hasNextLine())
{
String[] record = scan.nextLine().split(",");
records.add(record);
}
//now records has your records.
//here is a way to loop through the records (process)
for(String[] temp : records)
{
for(String temp1 : temp)
{
System.out.print(temp1 + " ");
}
System.out.print("\n");
}
Just replace "yourfile" with the absolute path to your file.
Here is a more traditional for loop for processing the data, if you don't like the first example:
for(int i = 0; i < records.size(); i++)
{
for(int j = 0; j < records.get(i).length; j++)
{
System.out.print(records.get(i)[j] + " ");
}
System.out.print("\n");
}
Both for loops are doing the same thing though.
You can read the CSV into a 2-dimensional array in just 2 lines with the open-source library uniVocity-parsers.
Refer to the following code as an example:
public static void main(String[] args) throws FileNotFoundException {
/**
* ---------------------------------------
* Read CSV rows into 2-dimensional array
* ---------------------------------------
*/
// 1st, creates a CSV parser with the configs
CsvParser parser = new CsvParser(new CsvParserSettings());
// 2nd, parses all rows from the CSV file into a 2-dimensional array
List<String[]> resolvedData = parser.parseAll(new FileReader("/examples/example.csv"));
// 3rd, process the 2-dimensional array with business logic
// ......
}
tl;dr
Use the Java Collections rather than arrays, specifically a List or Set, to auto-expand as you add items.
Define a class to hold your data read from CSV, instantiating an object for each row read.
Use the Apache Commons CSV library to help with the chore of reading/writing CSV files.
Class to hold data
Define a class to hold the data of each row being read from your CSV. Let's use Person class with a given name and surname, to be more concrete than the example in your Question.
In Java 16 and later, more briefly define the class as a record.
record Person ( String givenName , String surname ) {}
In older Java, define a conventional class.
package work.basil.example;
public class Person {
public String givenName, surname;
public Person ( String givenName , String surname ) {
this.givenName = givenName;
this.surname = surname;
}
@Override
public String toString ( ) {
return "Person{ " +
"givenName='" + givenName + '\'' +
" | surname='" + surname + '\'' +
" }";
}
}
Collections, not arrays
Using the Java Collections is generally better than using mere arrays. The collections are more flexible and more powerful. See Oracle Tutorial.
Here we will use the List interface to collect each Person object instantiated from data read in from the CSV file. We use the concrete ArrayList implementation of List which uses arrays in the background. The important part here, related to your Question, is that you can add objects to a List without worrying about resizing. The List implementation is responsible for any needed resizing.
If you happen to know the approximate size of your list to be populated, you can supply an optional initial capacity as a hint when creating the List.
Apache Commons CSV
The Apache Commons CSV library does a nice job of reading and writing several variants of CSV and Tab-delimited formats.
Example app
Here is an example app, in a single PersonIo.java file. The Io is short for input-output.
Example data.
GivenName,Surname
Alice,Albert
Bob,Babin
Charlie,Comtois
Darlene,Deschamps
Source code.
package work.basil.example;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
public class PersonIo {
public static void main ( String[] args ) {
PersonIo app = new PersonIo();
app.doIt();
}
private void doIt ( ) {
Path path = Paths.get( "/Users/basilbourque/people.csv" );
List < Person > people = this.read( path );
System.out.println( "People: \n" + people );
}
private List < Person > read ( final Path path ) {
Objects.requireNonNull( path );
if ( Files.notExists( path ) ) {
System.out.println( "ERROR - no file found for path: " + path + ". Message # de1f0be7-901f-4b57-85ae-3eecac66c8f6." );
}
List < Person > people = List.of(); // Default to empty list.
try {
// Hold data read from file.
int initialCapacity = ( int ) Files.lines( path ).count();
people = new ArrayList <>( initialCapacity );
// Read CSV file.
BufferedReader reader = Files.newBufferedReader( path );
Iterable < CSVRecord > records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse( reader );
for ( CSVRecord record : records ) {
// GivenName,Surname
// Alice,Albert
// Bob,Babin
// Charlie,Comtois
// Darlene,Deschamps
String givenName = record.get( "GivenName" );
String surname = record.get( "Surname" );
// Use read data to instantiate.
Person p = new Person( givenName , surname );
// Collect
people.add( p );
}
} catch ( IOException e ) {
e.printStackTrace();
}
return people;
}
}
When run.
People:
[Person{ givenName='Alice' | surname='Albert' }, Person{ givenName='Bob' | surname='Babin' }, Person{ givenName='Charlie' | surname='Comtois' }, Person{ givenName='Darlene' | surname='Deschamps' }]
Suppose I have a simple program which takes argument input in one of the following forms
do1 inputLocation outputLocation
do2 inputLocation outputLocation
do3 [30 or 60 or 90] inputLocation outputLocation
do4 [P D or C] inputLocation outputLocation
do5 [G H I] inputLocation outputLocation
I also have 5 functions with the same names in the program that I need to call. So far I thought of doing it this way (in 'semi-pseudocode'):
static void main(String[] args)
{
if (args.length == 3)
processTriple(args);
else if (args.length == 4)
processQuadruple(args);
else
throw new UnsupportedOperationException("dasdhklasdha");
}
where the process functions look like this:
processTriple(String args[])
{
String operation = args[0];
Location input = getInput(args[1]);
Location output = getInput(args[2]);
if (operation.equals("do1"))
do1(input,output);
if (operation.equals("do2"))
do2(input,output);
... etc
}
The way I'm doing it doesn't seem very extensible. If a function's arguments change, or new functions are added it seems like this would be a pain to maintain.
What's the "best" way of going about something like this?
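For comparison, a stdlib-only sketch of a table-driven dispatch (all names here are hypothetical stand-ins for the real do1..do5): each operation registers itself in a map keyed by its name, so adding a function is one line and the dispatch code never changes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class Dispatcher {
    private static final Map<String, Consumer<String[]>> OPS = new HashMap<>();

    static {
        // args[0] is the operation name; each lambda decides how to read the rest.
        OPS.put("do1", a -> do1(a[1], a[2]));
        OPS.put("do2", a -> do2(a[1], a[2]));
        OPS.put("do3", a -> do3(Integer.parseInt(a[1]), a[2], a[3]));
        // A new operation is one extra line here; main never changes.
    }

    public static void dispatch(String[] args) {
        Consumer<String[]> op = OPS.get(args[0]);
        if (op == null)
            throw new UnsupportedOperationException("unknown operation: " + args[0]);
        op.accept(args);
    }

    // Hypothetical stand-ins for the real operations:
    static void do1(String in, String out) { System.out.println("do1 " + in + " " + out); }
    static void do2(String in, String out) { System.out.println("do2 " + in + " " + out); }
    static void do3(int n, String in, String out) { System.out.println("do3 " + n + " " + in + " " + out); }

    public static void main(String[] args) {
        dispatch(args);
    }
}
```

This keeps the arity differences local to each lambda, so a changed signature touches only one registration line.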
At this point I would use commons-cli or jargs. Unless you are trying to do something really special with arguments, I would say focus on the real business of your app and don't deal with the mess of parsing application arguments.
Use a command line parsing library.
I've used JOpt Simple in the past with great results. It lets you abstract away the command-line arg mess and keep a really clean, updatable list of arguments. An added benefit is that it will generate the help output that standard command-line utilities have.
Here's a quick example:
private void runWithArgs (String[] args) {
OptionParser parser = getOptionParser ();
OptionSet options = null;
try {
options = parser.parse (args);
}
catch (OptionException e) {
log.error ("Sorry the command line option(s): " + e.options () +
" is/are not recognized. -h for help.");
return;
}
if (options.has ("h")) {
try {
log.info ("Help: ");
parser.printHelpOn (System.out);
}
catch (IOException e) {
log.error ("Trying to print the help screen." + e.toString ());
}
return;
}
if (options.has ("i")) {
defaultHost = (String) options.valueOf ("i");
}
if (options.has ("p")) {
defaultPort = (Integer) options.valueOf ("p");
}
if (options.has ("q")) {
String queryString = (String) options.valueOf ("q");
log.info ("Performing Query: " + queryString);
performQuery (queryString, defaultHost, defaultPort);
return;
}
}
You can use Cédric Beust's JCommander library
Because life is too short to parse command line parameters
I even creatively violate the original intent of the library to parse NMEA 0183 sentences like $GPRTE as follows:
import java.util.List;
import com.beust.jcommander.Parameter;
import com.beust.jcommander.internal.Lists;
public class GPRTE {
@Parameter
public List<String> parameters = Lists.newArrayList();
@Parameter(names = "-GPRTE", arity = 4, description = "GPRTE")
public List<String> gprte;
}
Code snippet that processes NMEA 0183 sentence $GPRTE from $GPRTE,1,1,c,*37 into -GPRTE 1 1 c *37 to comply with JCommander parsing syntax:
/**
* <b>RTE</b> - route message<p>
* Processes each <b>RTE</b> message received from the serial port in following format:<p>$GPRTE,d1,d2,d3,d4<p>Example: $GPRTE,1,1,c,*37
* @param sequences result of {@link Utils#process(String)} method
* @see <a href="http://www.gpsinformation.org/dale/nmea.htm#RTE">http://www.gpsinformation.org/dale/nmea.htm#RTE</a><p>*/
public static void processGPRTE(final String command){
final String NMEA_SENTENCE = "GPRTE";
final String PARAM = "\u0001";
final String DOLLAR = "\u0004";
final String COMMA = "\u0005";
String parsedString = command;
if (parsedString.contains("$"+NMEA_SENTENCE)){
parsedString = parsedString.replaceAll("\\$", DOLLAR+PARAM);
parsedString = parsedString.replaceAll(",", COMMA);
System.out.println("GPRTE: " + parsedString);
String[] splits = parsedString.split(DOLLAR);
for(String info: splits){
if (info.contains(PARAM+NMEA_SENTENCE)) {
info = info.replaceFirst(PARAM, "-");
System.out.println("GPRTE info: " + info);
String[] args = info.split(COMMA);
GPRTE cmd = new GPRTE();
new JCommander(cmd, processEmptyString(args));
List<String> message = cmd.gprte;
String data1 = SerialOutils.unescape(message.get(0));
System.out.println("GPRTE: data1 = " + data1);
String data2 = SerialOutils.unescape(message.get(1));
System.out.println("GPRTE: data2 = " + data2);
String data3 = SerialOutils.unescape(message.get(2));
System.out.println("GPRTE: data3 = " + data3);
String data4 = SerialOutils.unescape(message.get(3));
System.out.println("GPRTE: data4 = " + data4);
System.out.println("");
}
}
}
}
I've used args4j with successful results before as well.
Just another option.