I am using Java 8 with Apache OpenNLP. I have a service that extracts all the nouns from a paragraph. This works as expected on my localhost server, and I also had it running on an OpenShift server with no problems. However, it does use a lot of memory. I need to deploy my application to an AWS Elastic Beanstalk Tomcat server.
One solution would be to upgrade the AWS Elastic Beanstalk Tomcat server from a t1.micro to another instance type. But I am on a small budget and want to avoid the extra fees if possible.
When I run the app and it tries to do the word chunking, it fails with the following error:
dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space] with root cause
java.lang.OutOfMemoryError: Java heap space
at opennlp.tools.ml.model.AbstractModelReader.getParameters(AbstractModelReader.java:148)
at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:75)
at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:59)
at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:328)
at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:256)
at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:179)
at opennlp.tools.parser.ParserModel.<init>(ParserModel.java:180)
at com.jobs.spring.service.lang.LanguageChunkerServiceImpl.init(LanguageChunkerServiceImpl.java:35)
at com.jobs.spring.service.lang.LanguageChunkerServiceImpl.getNouns(LanguageChunkerServiceImpl.java:46)
Question
Is there a way to either:
Reduce the amount of memory used when extracting the nouns from a paragraph,
Use a different API than Apache OpenNLP that won't use as much memory, or
Configure the AWS Elastic Beanstalk Tomcat server to cope with the demands?
Code Sample:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.util.InvalidFormatException;
@Component("languageChunkerService")
@Transactional
public class LanguageChunkerServiceImpl implements LanguageChunkerService {
private Set<String> nouns = null;
private InputStream modelInParse = null;
private ParserModel model = null;
private Parser parser = null;
public void init() throws InvalidFormatException, IOException {
ClassLoader classLoader = getClass().getClassLoader();
File file = new File(classLoader.getResource("en-parser-chunking.bin").getFile());
modelInParse = new FileInputStream(file.getAbsolutePath());
// load chunking model
model = new ParserModel(modelInParse); // line 35
// create parse tree
parser = ParserFactory.create(model);
}
@Override
public Set<String> getNouns(String sentenceToExtract) {
Set<String> extractedNouns = new HashSet<String>();
nouns = new HashSet<>();
try {
if (parser == null) {
init();
}
Parse topParses[] = ParserTool.parseLine(sentenceToExtract, parser, 1);
// call subroutine to extract noun phrases
for (Parse p : topParses) {
getNounPhrases(p);
}
// print noun phrases
for (String s : nouns) {
String word = s.replaceAll("[^a-zA-Z ]", "").toLowerCase();// .split("\\s+");
//System.out.println(word);
extractedNouns.add(word);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (modelInParse != null) {
try {
modelInParse.close();
} catch (IOException e) {
}
}
}
return extractedNouns;
}
// recursively loop through tree, extracting noun phrases
private void getNounPhrases(Parse p) {
if (p.getType().equals("NN")) { // NN = noun (singular or mass)
// System.out.println(p.getCoveredText()+" "+p.getType());
nouns.add(p.getCoveredText());
}
for (Parse child : p.getChildren())
getNounPhrases(child);
}
}
UPDATE
Tomcat8 config:
First of all, you should try to optimize your code. Start by precompiling the regex with Pattern.compile instead of calling String.replaceAll each time, since replaceAll recompiles the pattern and allocates new strings in memory on every call (https://eyalsch.wordpress.com/2009/05/21/regex/).
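A minimal sketch of that tip, applied to the cleanup line in getNouns() (the class and field names below are illustrative, not from the original code):
import java.util.regex.Pattern;

class NounCleanup {
    // Compiled once and reused, instead of calling s.replaceAll("[^a-zA-Z ]", "") per word.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z ]");

    static String clean(String s) {
        return NON_LETTERS.matcher(s).replaceAll("").toLowerCase();
    }
}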
Second, you should not store the parsed sentences in an array. Third hint: try to allocate the memory for your array using a ByteBuffer. Another hint, which may affect you the most: use a BufferedReader to read your chunked file (out of memory error, java heap space).
After this you should already see lower memory usage. If those tips didn't help, please provide a memory dump/allocation graph.
One more tip: a HashSet takes 5.5x more memory than an unordered List (Performance and Memory allocation comparison between List and Set).
Related
I need to copy all of the contents of a stream of VectorSchemaRoots into a single object:
Stream<VectorSchemaRoot> data = fetchStream();
VectorSchemaRoot finalResult = VectorSchemaRoot.create(schema, allocator);
VectorLoader loader = new VectorLoader(finalResult);
data.forEach(current -> {
VectorUnloader unloader = new VectorUnloader(current);
ArrowRecordBatch batch = unloader.getRecordBatch();
loader.load(batch);
current.close();
});
However, I am getting the following error:
java.lang.IllegalStateException: Memory was leaked by query. Memory was leaked.
Also getting this further down the stack trace:
Could not load buffers for field date: Timestamp(MILLISECOND, null) not null. error message: A buffer can only be associated between two allocators that share the same root
I use the same allocator for everything; does anyone know why I am getting this issue?
The "leak" is probably just a side effect of the exception, because the code as written is not exception-safe. Use try-with-resources to manage the ArrowRecordBatch instead of manually calling close():
try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
loader.load(batch);
}
(though, depending on what load does, this may not be enough).
I can't say much else about why you're getting the exception without seeing more code and the full stack trace.
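Applied to the loop from the question, a hedged sketch might look like this (it assumes data and loader are the stream and VectorLoader from your snippet, and that load copies the data into finalResult):
data.forEach(current -> {
    // Also close the source root once its data has been transferred.
    try (VectorSchemaRoot root = current) {
        VectorUnloader unloader = new VectorUnloader(root);
        try (ArrowRecordBatch batch = unloader.getRecordBatch()) {
            loader.load(batch);
        }
    }
});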
Could you try something like this:
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import java.util.Arrays;
import java.util.Collections;
import java.util.stream.Stream;
public class StackOverFlowSolved {
public static void main(String[] args) {
try(BufferAllocator allocator = new RootAllocator()){
// load data
IntVector ageColumn = new IntVector("age", allocator);
ageColumn.allocateNew();
ageColumn.set(0, 1);
ageColumn.set(1, 2);
ageColumn.set(2, 3);
ageColumn.setValueCount(3);
Stream<VectorSchemaRoot> streamOfVSR = Collections.singletonList(VectorSchemaRoot.of(ageColumn)).stream();
// transfer data
streamOfVSR.forEach(current -> {
Field ageLoad = new Field("age",
FieldType.nullable(new ArrowType.Int(32, true)), null);
Schema schema = new Schema(Arrays.asList(ageLoad));
try (VectorSchemaRoot root = VectorSchemaRoot.create(schema,
allocator.newChildAllocator("loaddata", 0, Integer.MAX_VALUE))) {
VectorUnloader unload = new VectorUnloader(current);
try (ArrowRecordBatch recordBatch = unload.getRecordBatch()) {
VectorLoader loader = new VectorLoader(root);
loader.load(recordBatch);
}
System.out.println(root.contentToTSVString());
}
current.close();
});
}
}
}
age
1
2
3
I've read through a few similar questions on SO and GCP docs - but did not get a definitive answer...
Is there a way to batch insert data from my Java service into BigQuery directly, without using intermediary files, PubSub, or other Google services?
The key here is the "batch" mode: I do not want to use streaming API as it costs a lot.
I know there are other ways to do batch inserts using Dataflow, Google Cloud Storage, etc. - I am not interested in those, I need to do batch inserts programmatically for my use case.
I was hoping to use the REST batch API but it looks like it is deprecated now: https://cloud.google.com/bigquery/batch
Alternatives that are pointed to by the docs are:
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll REST request - but it looks like it works in streaming mode, inserting one row at a time (and would cost a lot)
a Java client library: https://developers.google.com/api-client-library/java/google-api-java-client/dev-guide
After following through the links and references I ended up finding this specific API method promising: https://googleapis.dev/java/google-api-client/latest/index.html?com/google/api/client/googleapis/batch/BatchRequest.html
with the following usage pattern:
Create a BatchRequest object from this Google API client instance.
Sample usage:
client.batch(httpRequestInitializer)
.queue(...)
.queue(...)
.execute();
Is this API using batch mode rather than streaming, and is it the right way to go?
Thank you!
The "batch" version of writing data is called a "load job" in the Java client library. The bigquery.writer method creates an object which can be used to write data bytes as a batch load job. Set the format options based on the type of file you'd like to serialize to.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobId;
import com.google.cloud.bigquery.JobStatistics.LoadStatistics;
import com.google.cloud.bigquery.TableDataWriteChannel;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.WriteChannelConfiguration;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
public class LoadLocalFile {
public static void main(String[] args) throws IOException, InterruptedException {
String datasetName = "MY_DATASET_NAME";
String tableName = "MY_TABLE_NAME";
Path csvPath = FileSystems.getDefault().getPath(".", "my-data.csv");
loadLocalFile(datasetName, tableName, csvPath, FormatOptions.csv());
}
public static void loadLocalFile(
String datasetName, String tableName, Path csvPath, FormatOptions formatOptions)
throws IOException, InterruptedException {
try {
// Initialize client that will be used to send requests. This client only needs to be created
// once, and can be reused for multiple requests.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of(datasetName, tableName);
WriteChannelConfiguration writeChannelConfiguration =
WriteChannelConfiguration.newBuilder(tableId).setFormatOptions(formatOptions).build();
// The location and JobName must be specified; other fields can be auto-detected.
String jobName = "jobId_" + UUID.randomUUID().toString();
JobId jobId = JobId.newBuilder().setLocation("us").setJob(jobName).build();
// Imports a local file into a table.
try (TableDataWriteChannel writer = bigquery.writer(jobId, writeChannelConfiguration);
OutputStream stream = Channels.newOutputStream(writer)) {
// This example writes CSV data from a local file,
// but bytes can also be written in batch from memory.
// In addition to CSV, other formats such as
// Newline-Delimited JSON (https://jsonlines.org/) are
// supported.
Files.copy(csvPath, stream);
}
// Get the Job created by the TableDataWriteChannel and wait for it to complete.
Job job = bigquery.getJob(jobId);
Job completedJob = job.waitFor();
if (completedJob == null) {
System.out.println("Job not executed since it no longer exists.");
return;
} else if (completedJob.getStatus().getError() != null) {
System.out.println(
"BigQuery was unable to load local file to the table due to an error: \n"
+ job.getStatus().getError());
return;
}
// Get output status
LoadStatistics stats = job.getStatistics();
System.out.printf("Successfully loaded %d rows. \n", stats.getOutputRows());
} catch (BigQueryException e) {
System.out.println("Local file not loaded. \n" + e.toString());
}
}
}
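As the comments in the sample note, the bytes do not have to come from a file. Here is a minimal hedged variation of the writer block above, assuming writeChannelConfiguration was built with FormatOptions.json() and using placeholder JSON data:
// Writing newline-delimited JSON from memory through the same TableDataWriteChannel.
try (TableDataWriteChannel writer = bigquery.writer(jobId, writeChannelConfiguration);
     OutputStream stream = Channels.newOutputStream(writer)) {
    String ndjson = "{\"name\":\"alice\",\"age\":30}\n"
                  + "{\"name\":\"bob\",\"age\":25}\n";
    stream.write(ndjson.getBytes(java.nio.charset.StandardCharsets.UTF_8));
}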
Resources:
https://cloud.google.com/bigquery/docs/batch-loading-data#loading_data_from_local_files
https://cloud.google.com/bigquery/docs/samples/bigquery-load-from-file
system test which writes JSON from memory
I'm using Java to convert JSON to Avro and store the results in GCS using Google Dataflow.
The Avro schema is created on runtime using SchemaBuilder.
One of the fields I define in the schema is an optional LONG field, it is defined like this:
SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.record(mainName).fields();
Schema concreteType = SchemaBuilder.nullable().longType();
fields.name("key1").type(concreteType).noDefault();
Now when I create a GenericRecord using the schema above and "key1" is not set, putting the resulting GenericRecord into the context of my DoFn (context.output(res);) gives me the following error:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: org.apache.avro.UnresolvedUnionException: Not in union ["long","null"]: 256
I also tried doing the same thing with withDefault(0L) and got the same result.
What am I missing?
Thanks
It works fine for me when I try it as below. You can print the schema to help compare, and you can also try removing the nullable() for the long type.
fields.name("key1").type().nullable().longType().longDefault(0);
Here is the complete code I used to test:
import org.apache.avro.AvroRuntimeException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaBuilder.FieldAssembler;
import org.apache.avro.SchemaBuilder.RecordBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import java.io.File;
import java.io.IOException;
public class GenericRecordExample {
public static void main(String[] args) {
FieldAssembler<Schema> fields;
RecordBuilder<Schema> record = SchemaBuilder.record("Customer");
fields = record.namespace("com.example").fields();
fields = fields.name("first_name").type().nullable().stringType().noDefault();
fields = fields.name("last_name").type().nullable().stringType().noDefault();
fields = fields.name("account_number").type().nullable().longType().longDefault(0);
Schema schema = fields.endRecord();
System.out.println(schema.toString());
// we build our first customer
GenericRecordBuilder customerBuilder = new GenericRecordBuilder(schema);
customerBuilder.set("first_name", "John");
customerBuilder.set("last_name", "Doe");
customerBuilder.set("account_number", 999333444111L);
Record myCustomer = customerBuilder.build();
System.out.println(myCustomer);
// writing to a file
final DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
dataFileWriter.create(myCustomer.getSchema(), new File("customer-generic.avro"));
dataFileWriter.append(myCustomer);
System.out.println("Written customer-generic.avro");
} catch (IOException e) {
System.out.println("Couldn't write file");
e.printStackTrace();
}
// reading from a file
final File file = new File("customer-generic.avro");
final DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
GenericRecord customerRead;
try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader)){
customerRead = dataFileReader.next();
System.out.println("Successfully read avro file");
System.out.println(customerRead.toString());
// get the data from the generic record
System.out.println("First name: " + customerRead.get("first_name"));
// read a non existent field
System.out.println("Non existent field: " + customerRead.get("not_here"));
}
catch(IOException e) {
e.printStackTrace();
}
}
}
If I understand your question correctly, you're trying to accept JSON strings and save them in a Cloud Storage bucket, using Avro as your coder for the data as it moves through Dataflow. There's nothing immediately obvious from your code that looks wrong to me. I have done this, including saving the data to Cloud Storage and to BigQuery.
You might consider a simpler, and probably less error-prone, approach: define a Java class for your data and use Avro annotations on it so the coder works properly. Here's an example:
import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
@DefaultCoder(AvroCoder.class)
public class Data {
public long nonNullableValue;
@Nullable public Long nullableValue; // boxed type, so the field can actually hold null
}
Then, use this type in your DoFn implementations like you likely already are. Beam should be able to move the data between workers properly using Avro, even when the fields marked @Nullable are null.
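For completeness, here is a minimal hedged sketch of a DoFn that produces this type (the input type, the values, and the class name are placeholders):
import org.apache.beam.sdk.transforms.DoFn;

class ToDataFn extends DoFn<String, Data> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Data d = new Data();
        d.nonNullableValue = 256L;
        d.nullableValue = null; // fine, because the field is annotated @Nullable
        c.output(d);            // AvroCoder handles serialization between workers
    }
}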
I use the following code to create a graph with Neo4j Graph Database:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.lucene.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserters;
public class Neo4jMassiveInsertion implements Insertion {
private BatchInserter inserter = null;
private BatchInserterIndexProvider indexProvider = null;
private BatchInserterIndex nodes = null;
private static enum RelTypes implements RelationshipType {
SIMILAR
}
public static void main(String args[]) {
Neo4jMassiveInsertion test = new Neo4jMassiveInsertion();
test.startup("data/neo4j");
test.createGraph("data/enronEdges.txt");
test.shutdown();
}
/**
* Start neo4j database and configure for massive insertion
* @param neo4jDBDir
*/
public void startup(String neo4jDBDir) {
System.out.println("The Neo4j database is now starting . . . .");
Map<String, String> config = new HashMap<String, String>();
inserter = BatchInserters.inserter(neo4jDBDir, config);
indexProvider = new LuceneBatchInserterIndexProvider(inserter);
nodes = indexProvider.nodeIndex("nodes", MapUtil.stringMap("type", "exact"));
}
public void shutdown() {
System.out.println("The Neo4j database is now shuting down . . . .");
if(inserter != null) {
indexProvider.shutdown();
inserter.shutdown();
indexProvider = null;
inserter = null;
}
}
public void createGraph(String datasetDir) {
System.out.println("Creating the Neo4j database . . . .");
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetDir)));
String line;
int lineCounter = 1;
Map<String, Object> properties;
IndexHits<Long> cache;
long srcNode, dstNode;
while((line = reader.readLine()) != null) {
if(lineCounter > 4) {
String[] parts = line.split("\t");
cache = nodes.get("nodeId", parts[0]);
if(cache.hasNext()) {
srcNode = cache.next();
}
else {
properties = MapUtil.map("nodeId", parts[0]);
srcNode = inserter.createNode(properties);
nodes.add(srcNode, properties);
nodes.flush();
}
cache = nodes.get("nodeId", parts[1]);
if(cache.hasNext()) {
dstNode = cache.next();
}
else {
properties = MapUtil.map("nodeId", parts[1]);
dstNode = inserter.createNode(properties);
nodes.add(dstNode, properties);
nodes.flush();
}
inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);
}
lineCounter++;
}
reader.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
}
Compared with other graph database technologies (Titan, OrientDB) it takes too much time, so maybe I am doing something wrong. Is there a way to speed up the procedure?
I use Neo4j 1.9.5 and my machine has a 2.3 GHz CPU (i5), 4GB RAM and a 320GB disk, running Mac OS X Mavericks (10.9). My heap size is 2GB.
Usually I can import about 1M nodes and 200k relationships per second on my macbook.
Flush & Search
Please don't flush & search on every insert, that totally kills performance.
Keep your nodeIds in a HashMap from your data to node-id, and only write to lucene during the import.
(If you care about memory usage you can also go with something like gnu-trove)
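A minimal sketch of that id cache, adapted from the createGraph loop in the question (getOrCreate is an illustrative name):
private final Map<String, Long> nodeCache = new HashMap<>();

private long getOrCreate(String nodeId) {
    Long id = nodeCache.get(nodeId);
    if (id == null) {
        Map<String, Object> properties = MapUtil.map("nodeId", nodeId);
        id = inserter.createNode(properties);
        nodes.add(id, properties); // index write, but no flush() and no lookup per insert
        nodeCache.put(nodeId, id);
    }
    return id;
}

// In the read loop:
// long srcNode = getOrCreate(parts[0]);
// long dstNode = getOrCreate(parts[1]);
// inserter.createRelationship(srcNode, dstNode, RelTypes.SIMILAR, null);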
RAM
You also use too little RAM (I usually use heaps between 4 and 60GB depending on the data set size) and you don't have any config set.
Memory Mapping
Please check a sensible config like the following; depending on your data volume I'd raise these numbers:
cache_type=none
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1000M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
Heap
And make sure to give it enough heap; try increasing your heap to at least 3GB. Your disk might also not be the fastest. Also make sure to have a recent JDK; 1.7.._b25 had a memory allocation issue (it allocated only a tiny bit of memory for the memory mapping).
I have a list of links containing links to HTML and XML pages. How can I extract the XML links from the list, in Java?
Thanks
You could use a list of common filename extensions to divine the type of data stored at a given URL, but that often won't be very reliable, particularly with Web 2.0 sites (just look at the URL of this SO question itself). In addition, a link to a PHP script (.php) or other dynamic content site could return either HTML or XML. Or it could return something else entirely, such as a JPG file.
There are a lot of simple heuristics you can use for detecting HTML vs. XML, simply by looking at the beginning of the file. For example, you could look for the <!DOCTYPE ...> declaration, check for the <?xml ...?> directive, and check to see if the file contains a root <html> tag. Of course, these should all be case-insensitive checks.
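As a rough illustration of those header checks (the method name and the amount of data sniffed are my own choices):
// Simplistic sniffing of the first chunk of a document to guess XML vs. HTML.
private static boolean looksLikeXml(String head) {
    String h = head.trim().toLowerCase();
    if (h.startsWith("<?xml")) {
        return true;                          // XML declaration
    }
    if (h.contains("<!doctype html") || h.contains("<html")) {
        return false;                         // clearly HTML
    }
    return h.startsWith("<");                 // other markup: guess XML, otherwise not XML
}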
You can also try to identify the type of file based on its MIME type (for example, text/html or text/xml). Unfortunately, many servers return incorrect or invalid MIME types, so you often have to read the beginning of the file anyway to divine its content, as you can see in my first two inadequate versions of a getMimeType() method below. The third attempt worked better, but the third-party MimeMagic library still provided disappointing results. Nevertheless, you could use the additional heuristics that I mentioned earlier to either replace or improve the getMimeType() method.
package com.example.mimetype;
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.FileNameMap;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.sf.jmimemagic.Magic;
import net.sf.jmimemagic.MagicException;
import net.sf.jmimemagic.MagicMatchNotFoundException;
import net.sf.jmimemagic.MagicParseException;
public class MimeUtils {
// After calling this method, you can retrieve a list of URLs for each mimetype.
public static Map<String, List<String>> sortLinksByMimeType(List<String> links) {
Map<String, List<String>> mapMimeTypesToLinks = new HashMap<String, List<String>>();
for (String url : links) {
try {
String mimetype = getMimeType(url);
System.out.println(url + " has mimetype " + mimetype);
// If this mimetype hasn't already been initialized, initialize it.
if (! mapMimeTypesToLinks.containsKey(mimetype)) {
mapMimeTypesToLinks.put(mimetype, new ArrayList<String>());
}
List<String> lst = mapMimeTypesToLinks.get(mimetype);
lst.add(url);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
return mapMimeTypesToLinks;
}
public static String getMimeType(String url) throws MalformedURLException, IOException, MagicParseException, MagicMatchNotFoundException, MagicException {
// first attempt at determining MIME type--returned null for all URLs that I tried
// FileNameMap filenameMap = URLConnection.getFileNameMap();
// return filenameMap.getContentTypeFor(url);
// second attempt at determining MIME type--worked better, but still returned null for many URLs
// URLConnection c = new URL(url).openConnection();
// InputStream in = c.getInputStream();
// String mimetype = URLConnection.guessContentTypeFromStream(in);
// in.close();
// return mimetype;
URLConnection c = new URL(url).openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
byte[] content = new byte[100];
in.read(content);
in.close();
return Magic.getMagicMatch(content, false).getMimeType();
}
public static void main(String[] args) {
List<String> links = new ArrayList<String>();
links.add("http://stackoverflow.com/questions/10082568/how-to-differentiate-xml-from-html-links-in-java");
links.add("http://stackoverflow.com");
links.add("http://stackoverflow.com/feeds");
links.add("http://amazon.com");
links.add("http://google.com");
sortLinksByMimeType(links);
}
}
I'm not certain whether your links are some sort of Link object, but as long as you can access the value as a string, something like this should work:
List<String> xmlLinks = new ArrayList<String>();
for (String link : list) {
if (link.endsWith(".xml") || link.contains(".xml")) {
xmlLinks.add(link);
}
}